Convert all README files to Markdown
A mindless task for a lazy weekend: convert all the README and README.txt files to Markdown. Paired with the slow conversion of all of our man pages to Markdown, this gives a uniform language to the Open MPI docs.

This commit moved a bunch of copyright headers out of the top-level README.txt file, so I updated the relevant copyright header years in the top-level LICENSE file to match what was removed from README.txt.

Additionally, this commit did (very) little to update the actual content of the README files. A very small number of updates were made for topics that I found blatantly obvious while Markdown-izing the content, but in general, I did not update content during this commit. For example, there's still quite a bit of text about ORTE that was not meaningfully updated.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Co-authored-by: Josh Hursey <jhursey@us.ibm.com>
This commit is contained in:
parent 686c2142e2
commit c960d292ec
HACKING (deleted, 272 lines)
@@ -1,272 +0,0 @@
Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
                        University Research and Technology
                        Corporation.  All rights reserved.
Copyright (c) 2004-2005 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
                        University of Stuttgart.  All rights reserved.
Copyright (c) 2004-2005 The Regents of the University of California.
                        All rights reserved.
Copyright (c) 2008-2020 Cisco Systems, Inc.  All rights reserved.
Copyright (c) 2013      Intel, Inc.  All rights reserved.
$COPYRIGHT$

Additional copyrights may follow

$HEADER$

Overview
========

This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).


Developer Builds: Compiler Pickyness by Default
===============================================

If you are building Open MPI from a Git clone (i.e., there is a ".git"
directory in your build tree), the default build includes extra
compiler pickyness, which will result in more compiler warnings than
in non-developer builds.  Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.

Developers can disable this picky-by-default behavior by using the
--disable-picky configure option.  Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
".git" is in your source tree, but not in your build tree).

Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if ".git" was found
in your build tree.  This is no longer true.  You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:

    --enable-debug
    --enable-mem-debug
    --enable-mem-profile

NOTE: These options are really only relevant to those who are
developing Open MPI itself.  They are not generally helpful for
debugging general MPI applications.


Use of GNU Autoconf, Automake, and Libtool (and m4)
===================================================

You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree).  If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.

If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements).  The specific versions
required depend on if you are using the Git master branch or a release
branch (and which release branch you are using).  The specific
versions can be found here:

    https://www.open-mpi.org/source/building.php

You can check what versions of the autotools you have installed with
the following:

    shell$ m4 --version
    shell$ autoconf --version
    shell$ automake --version
    shell$ libtoolize --version

Required version levels for all the OMPI releases can be found here:

    https://www.open-mpi.org/source/building.php

To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools.  There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example).  You *WILL* have problems if you
do not use recent versions of the GNU tools.

If you need newer versions, you are *strongly* encouraged to heed the
following advice:

NOTE: On MacOS/X, the default "libtool" program is different than the
      GNU libtool.  You must download and install the GNU version
      (e.g., via MacPorts, Homebrew, or some other mechanism).

1. Unless your OS distribution has easy-to-use binary installations,
   the sources can be can be downloaded from:

       ftp://ftp.gnu.org/gnu/autoconf/
       ftp://ftp.gnu.org/gnu/automake/
       ftp://ftp.gnu.org/gnu/libtool/
       and if you need it:
       ftp://ftp.gnu.org/gnu/m4/

   NOTE: It is certainly easiest to download/build/install all four of
   these tools together.  But note that Open MPI has no specific m4
   requirements; it is only listed here because Autoconf requires
   minimum versions of GNU m4.  Hence, you may or may not *need* to
   actually install a new version of GNU m4.  That being said, if you
   are confused or don't know, just install the latest GNU m4 with the
   rest of the GNU Autotools and everything will work out fine.

2. Build and install the tools in the following order:

   2a. m4
   2b. Autoconf
   2c. Automake
   2d. Libtool

3. You MUST install the last three tools (Autoconf, Automake, Libtool)
   into the same prefix directory.  These three tools are somewhat
   inter-related, and if they're going to be used together, they MUST
   share a common installation prefix.

   You can install m4 anywhere as long as it can be found in the path;
   it may be convenient to install it in the same prefix as the other
   three.  Or you can use any recent-enough m4 that is in your path.

   3a. It is *strongly* encouraged that you do not install your new
       versions over the OS-installed versions.  This could cause
       other things on your system to break.  Instead, install into
       $HOME/local, or /usr/local, or wherever else you tend to
       install "local" kinds of software.
   3b. In doing so, be sure to prefix your $path with the directory
       where they are installed.  For example, if you install into
       $HOME/local, you may want to edit your shell startup file
       (.bashrc, .cshrc, .tcshrc, etc.) to have something like:

           # For bash/sh:
           export PATH=$HOME/local/bin:$PATH
           # For csh/tcsh:
           set path = ($HOME/local/bin $path)

   3c. Ensure to set your $path *BEFORE* you configure/build/install
       the four packages.

4. All four packages require two simple commands to build and
   install (where PREFIX is the prefix discussed in 3, above).

       shell$ cd <m4 directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <autoconf directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <automake directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <libtool directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

   m4, Autoconf and Automake build and install very quickly; Libtool will
   take a minute or two.

5. You can now run OMPI's top-level "autogen.pl" script.  This script
   will invoke the GNU Autoconf, Automake, and Libtool commands in the
   proper order and setup to run OMPI's top-level "configure" script.

   Running autogen.pl may take a few minutes, depending on your
   system.  It's not very exciting to watch.  :-)

   If you have a multi-processor system, enabling the multi-threaded
   behavior in Automake 1.11 (or newer) can result in autogen.pl
   running faster.  Do this by setting the AUTOMAKE_JOBS environment
   variable to the number of processors (threads) that you want it to
   use before invoking autogen.pl.  For example (you can again put
   this in your shell startup files):

       # For bash/sh:
       export AUTOMAKE_JOBS=4
       # For csh/tcsh:
       set AUTOMAKE_JOBS 4

   5a. You generally need to run autogen.pl whenever the top-level
       file "configure.ac" changes, or any files in the config/ or
       <project>/config/ directories change (these directories are
       where a lot of "include" files for OMPI's configure script
       live).

   5b. You do *NOT* need to re-run autogen.pl if you modify a
       Makefile.am.


Use of Flex
===========

Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs).  Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.

Note that no testing has been performed to see what the minimum
version of Flex is required by Open MPI.  We suggest that you use
v2.5.35 at the earliest.

*** NOTE: Windows developer builds of Open MPI *require* Flex version
2.5.35.  Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
flex version is on Windows; we suggest that you use 2.5.35 at the
earliest.  It is for this reason that the
contrib/dist/make_dist_tarball script checks for a Windows-friendly
version of flex before continuing.

For now, Open MPI will allow developer builds with Flex 2.5.4.  This
is primarily motivated by the fact that RedHat/Centos 5 ships with
Flex 2.5.4.  It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.

Note that the flex-generated code generates some compiler warnings on
some platforms, but the warnings do not seem to be consistent or
uniform on all platforms, compilers, and flex versions.  As such, we
have done little to try to remove those warnings.

If you do not have Flex installed, it can be downloaded from the
following URL:

    https://github.com/westes/flex


Use of Pandoc
=============

Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree).  If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.

The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).

You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree.  If configure cannot find Pandoc >=v1.12, it will abort.

If you need to install Pandoc, check your operating system-provided
packages (to include MacOS Homebrew and MacPorts).  The Pandoc project
itself also offers binaries for their releases:

    https://pandoc.org/
HACKING.md (new file, 258 lines)
@@ -0,0 +1,258 @@
# Open MPI Hacking / Developer's Guide

## Overview

This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).


## Developer Builds: Compiler Pickyness by Default

If you are building Open MPI from a Git clone (i.e., there is a `.git`
directory in your build tree), the default build includes extra
compiler pickyness, which will result in more compiler warnings than
in non-developer builds.  Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.

Developers can disable this picky-by-default behavior by using the
`--disable-picky` configure option.  Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
`.git` is in your source tree, but not in your build tree).

Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if `.git` was found
in your build tree.  This is no longer true.  You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:

* `--enable-debug`
* `--enable-mem-debug`
* `--enable-mem-profile`

***NOTE:*** These options are really only relevant to those who are
developing Open MPI itself.  They are not generally helpful for
debugging general MPI applications.
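
For example, a developer-tree build that re-enables this debugging
support might be configured roughly like this (a sketch only; the
prefix and `-j` value are placeholders):

```sh
# Hypothetical developer build with the debugging features enabled
shell$ ./configure --prefix=$HOME/local/ompi-debug \
            --enable-debug --enable-mem-debug
shell$ make -j 8 all install
```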

## Use of GNU Autoconf, Automake, and Libtool (and m4)

You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree).  If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.

If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements).  The specific versions
required depend on if you are using the Git master branch or a release
branch (and which release branch you are using).  [The specific
versions can be found
here](https://www.open-mpi.org/source/building.php).

You can check what versions of the autotools you have installed with
the following:

```
shell$ m4 --version
shell$ autoconf --version
shell$ automake --version
shell$ libtoolize --version
```

[Required version levels for all the OMPI releases can be found
here](https://www.open-mpi.org/source/building.php).

To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools.  There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example).  You *WILL* have problems if you
do not use recent versions of the GNU tools.

***NOTE:*** On MacOS/X, the default `libtool` program is different
than the GNU libtool.  You must download and install the GNU version
(e.g., via MacPorts, Homebrew, or some other mechanism).

If you need newer versions, you are *strongly* encouraged to heed the
following advice:

1. Unless your OS distribution has easy-to-use binary installations,
   the sources can be downloaded from:
   * https://ftp.gnu.org/gnu/autoconf/
   * https://ftp.gnu.org/gnu/automake/
   * https://ftp.gnu.org/gnu/libtool/
   * And if you need it: https://ftp.gnu.org/gnu/m4/

   ***NOTE:*** It is certainly easiest to download/build/install all
   four of these tools together.  But note that Open MPI has no
   specific m4 requirements; it is only listed here because Autoconf
   requires minimum versions of GNU m4.  Hence, you may or may not
   *need* to actually install a new version of GNU m4.  That being
   said, if you are confused or don't know, just install the latest
   GNU m4 with the rest of the GNU Autotools and everything will work
   out fine.

1. Build and install the tools in the following order:
   1. m4
   1. Autoconf
   1. Automake
   1. Libtool

1. You MUST install the last three tools (Autoconf, Automake, Libtool)
   into the same prefix directory.  These three tools are somewhat
   inter-related, and if they're going to be used together, they MUST
   share a common installation prefix.

   You can install m4 anywhere as long as it can be found in the path;
   it may be convenient to install it in the same prefix as the other
   three.  Or you can use any recent-enough m4 that is in your path.

   1. It is *strongly* encouraged that you do not install your new
      versions over the OS-installed versions.  This could cause
      other things on your system to break.  Instead, install into
      `$HOME/local`, or `/usr/local`, or wherever else you tend to
      install "local" kinds of software.
   1. In doing so, be sure to prefix your $path with the directory
      where they are installed.  For example, if you install into
      `$HOME/local`, you may want to edit your shell startup file
      (`.bashrc`, `.cshrc`, `.tcshrc`, etc.) to have something like:

      ```sh
      # For bash/sh:
      export PATH=$HOME/local/bin:$PATH
      # For csh/tcsh:
      set path = ($HOME/local/bin $path)
      ```

   1. Ensure to set your `$PATH` *BEFORE* you configure/build/install
      the four packages.

1. All four packages require two simple commands to build and
   install (where PREFIX is the prefix discussed in 3, above).

   ```
   shell$ cd <m4 directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <autoconf directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <automake directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <libtool directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   m4, Autoconf and Automake build and install very quickly; Libtool
   will take a minute or two.

1. You can now run OMPI's top-level `autogen.pl` script.  This script
   will invoke the GNU Autoconf, Automake, and Libtool commands in the
   proper order and setup to run OMPI's top-level `configure` script.

   Running `autogen.pl` may take a few minutes, depending on your
   system.  It's not very exciting to watch.  :smile:

   If you have a multi-processor system, enabling the multi-threaded
   behavior in Automake 1.11 (or newer) can result in `autogen.pl`
   running faster.  Do this by setting the `AUTOMAKE_JOBS` environment
   variable to the number of processors (threads) that you want it to
   use before invoking `autogen.pl`.  For example (you can again put
   this in your shell startup files):

   ```sh
   # For bash/sh:
   export AUTOMAKE_JOBS=4
   # For csh/tcsh:
   set AUTOMAKE_JOBS 4
   ```

1. You generally need to run `autogen.pl` whenever the top-level file
   `configure.ac` changes, or any files in the `config/` or
   `<project>/config/` directories change (these directories are
   where a lot of "include" files for Open MPI's `configure` script
   live).

1. You do *NOT* need to re-run `autogen.pl` if you modify a
   `Makefile.am`.
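
Putting the steps above together, a typical developer-tree build looks
roughly like this (a sketch; the clone path and prefix are
placeholders):

```sh
shell$ cd /path/to/ompi-git-clone
shell$ ./autogen.pl
shell$ ./configure --prefix=$HOME/local/ompi
shell$ make -j 8 all install
```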

## Use of Flex

Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs).  Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.

Note that no testing has been performed to determine the minimum
version of Flex required by Open MPI.  We suggest that you use
v2.5.35 at the earliest.
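
You can check which version of Flex you have installed with the same
pattern as the Autotools version checks above:

```
shell$ flex --version
```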

***NOTE:*** Windows developer builds of Open MPI *require* Flex version
2.5.35.  Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
flex version is on Windows; we suggest that you use 2.5.35 at the
earliest.  It is for this reason that the
`contrib/dist/make_dist_tarball` script checks for a Windows-friendly
version of Flex before continuing.

For now, Open MPI will allow developer builds with Flex 2.5.4.  This
is primarily motivated by the fact that RedHat/Centos 5 ships with
Flex 2.5.4.  It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.

Note that the `flex`-generated code generates some compiler warnings
on some platforms, but the warnings do not seem to be consistent or
uniform on all platforms, compilers, and flex versions.  As such, we
have done little to try to remove those warnings.

If you do not have Flex installed, see [the Flex Github
repository](https://github.com/westes/flex).

## Use of Pandoc

Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree).  If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.

The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).

You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree.  If configure cannot find Pandoc >=v1.12, it will abort.
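
A quick way to see what `configure` will find (assuming Pandoc is
already in your `$PATH`):

```
shell$ pandoc --version
```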

If you need to install Pandoc, check your operating system-provided
packages (to include MacOS Homebrew and MacPorts).  [The Pandoc
project web site](https://pandoc.org/) itself also offers binaries for
their releases.
LICENSE (11 changed lines)
@@ -15,9 +15,9 @@ Copyright (c) 2004-2010 High Performance Computing Center Stuttgart,
 University of Stuttgart.  All rights reserved.
 Copyright (c) 2004-2008 The Regents of the University of California.
 All rights reserved.
-Copyright (c) 2006-2017 Los Alamos National Security, LLC.  All rights
+Copyright (c) 2006-2018 Los Alamos National Security, LLC.  All rights
 reserved.
-Copyright (c) 2006-2017 Cisco Systems, Inc.  All rights reserved.
+Copyright (c) 2006-2020 Cisco Systems, Inc.  All rights reserved.
 Copyright (c) 2006-2010 Voltaire, Inc. All rights reserved.
 Copyright (c) 2006-2017 Sandia National Laboratories. All rights reserved.
 Copyright (c) 2006-2010 Sun Microsystems, Inc.  All rights reserved.
@@ -25,7 +25,7 @@ Copyright (c) 2006-2010 Sun Microsystems, Inc.  All rights reserved.
 Copyright (c) 2006-2017 The University of Houston. All rights reserved.
 Copyright (c) 2006-2009 Myricom, Inc.  All rights reserved.
 Copyright (c) 2007-2017 UT-Battelle, LLC.  All rights reserved.
-Copyright (c) 2007-2017 IBM Corporation.  All rights reserved.
+Copyright (c) 2007-2020 IBM Corporation.  All rights reserved.
 Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing
 Centre, Federal Republic of Germany
 Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
@@ -45,7 +45,7 @@ Copyright (c) 2016 ARM, Inc.  All rights reserved.
 Copyright (c) 2010-2011 Alex Brick <bricka@ccs.neu.edu>.  All rights reserved.
 Copyright (c) 2012 The University of Wisconsin-La Crosse. All rights
 reserved.
-Copyright (c) 2013-2016 Intel, Inc. All rights reserved.
+Copyright (c) 2013-2020 Intel, Inc. All rights reserved.
 Copyright (c) 2011-2017 NVIDIA Corporation.  All rights reserved.
 Copyright (c) 2016 Broadcom Limited.  All rights reserved.
 Copyright (c) 2011-2017 Fujitsu Limited.  All rights reserved.
@@ -56,7 +56,8 @@ Copyright (c) 2013-2017 Research Organization for Information Science (RIST).
 Copyright (c) 2017-2020 Amazon.com, Inc. or its affiliates.  All Rights
 reserved.
 Copyright (c) 2018      DataDirect Networks. All rights reserved.
-Copyright (c) 2018-2019 Triad National Security, LLC. All rights reserved.
+Copyright (c) 2018-2020 Triad National Security, LLC. All rights reserved.
+Copyright (c) 2020      Google, LLC. All rights reserved.

 $COPYRIGHT$
Makefile.am
@@ -24,7 +24,7 @@

 SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_SUBDIRS) test
 DIST_SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_DIST_SUBDIRS) test
-EXTRA_DIST = README INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.txt AUTHORS
+EXTRA_DIST = README.md INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.md AUTHORS

 include examples/Makefile.include
README (deleted, 2243 lines)
The diff for this file is not shown because it is too large.
README.JAVA.md (new file, 281 lines)
@@ -0,0 +1,281 @@
# Open MPI Java Bindings

## Important note

JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
CONTINUED DEVELOPER SUPPORT.

## Overview

This version of Open MPI provides support for Java-based
MPI applications.

The rest of this document provides step-by-step instructions on
building OMPI with Java bindings, and compiling and running Java-based
MPI applications.  Also, part of the functionality is explained with
examples.  Further details about the design, implementation and usage
of Java bindings in Open MPI can be found in [1].  The bindings follow
a JNI approach, that is, we do not provide a pure Java implementation
of MPI primitives, but a thin layer on top of the C
implementation.  This is the same approach as in mpiJava [2]; in fact,
mpiJava was taken as a starting point for Open MPI Java bindings, but
they were later totally rewritten.

1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
   implementation of Java bindings in Open MPI". Parallel Comput.
   59: 1-20 (2016).
2. M. Baker et al. "mpiJava: An object-oriented Java interface to
   MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
   pp. 748-762, Springer (1999).

## Building Java Bindings

If this software was obtained as a developer-level checkout as opposed
to a tarball, you will need to start your build by running
`./autogen.pl`.  This will also require that you have a fairly recent
version of GNU Autotools on your system - see the HACKING.md file for
details.

Java support requires that Open MPI be built at least with shared libraries
(i.e., `--enable-shared`) - any additional options are fine and will not
conflict.  Note that this is the default for Open MPI, so you don't
have to explicitly add the option.  The Java bindings will build only
if `--enable-mpi-java` is specified, and a JDK is found in a typical
system default location.

If the JDK is not in a place where we automatically find it, you can
specify the location.  For example, this is required on the Mac
platform as the JDK headers are located in a non-typical location.  Two
options are available for this purpose:

1. `--with-jdk-bindir=<foo>`: the location of `javac` and `javah`
1. `--with-jdk-headers=<bar>`: the directory containing `jni.h`

For simplicity, typical configurations are provided in platform files
under `contrib/platform/hadoop`.  These will meet the needs of most
users, or at least provide a starting point for your own custom
configuration.

In summary, therefore, you can configure the system using the
following Java-related options:

```
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform> ...
```

or

```
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo> --with-jdk-headers=<bar> ...
```

or simply

```
$ ./configure --enable-mpi-java ...
```

if JDK is in a "standard" place that we automatically find.

## Running Java Applications

For convenience, the `mpijavac` wrapper compiler has been provided for
compiling Java-based MPI applications.  It ensures that all required MPI
libraries and class paths are defined.  You can see the actual command
line using the `--showme` option, if you are interested.
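
For example, a hypothetical `Hello.java` MPI program could be compiled
like this (a sketch; the file name is a placeholder):

```
$ mpijavac Hello.java
$ mpijavac --showme Hello.java    # show the underlying javac command line
```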

Once your application has been compiled, you can run it with the
standard `mpirun` command line:

```
$ mpirun <options> java <your-java-options> <my-app>
```

For convenience, `mpirun` has been updated to detect the `java` command
and ensure that the required MPI libraries and class paths are defined
to support execution.  You therefore do _NOT_ need to specify the Java
library path to the MPI installation, nor the MPI classpath.  Any class
path definitions required for your application should be specified
either on the command line or via the `CLASSPATH` environment
variable.  Note that the local directory will be added to the class
path if nothing is specified.

As always, the `java` executable, all required libraries, and your
application classes must be available on all nodes.

## Basic usage of Java bindings

There is an MPI package that contains all classes of the MPI Java
bindings: `Comm`, `Datatype`, `Request`, etc.  These classes have a
direct correspondence with classes defined by the MPI standard.  MPI
primitives are just methods included in these classes.  The convention
used for naming Java methods and classes is the usual camel-case
convention, e.g., the equivalent of `MPI_File_set_info(fh,info)` is
`fh.setInfo(info)`, where `fh` is an object of the class `File`.

Apart from classes, the MPI package contains predefined public
attributes under a convenience class `MPI`.  Examples are the
predefined communicator `MPI.COMM_WORLD` or predefined datatypes such
as `MPI.DOUBLE`.  Also, MPI initialization and finalization are methods
of the `MPI` class and must be invoked by all MPI Java
applications.  The following example illustrates these concepts:

```java
import mpi.*;

class ComputePi {

    public static void main(String args[]) throws MPIException {

        MPI.Init(args);

        int rank = MPI.COMM_WORLD.getRank(),
            size = MPI.COMM_WORLD.getSize(),
            nint = 100; // Intervals.
        double h = 1.0/(double)nint, sum = 0.0;

        for(int i=rank+1; i<=nint; i+=size) {
            double x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x * x));
        }

        double sBuf[] = { h * sum },
               rBuf[] = new double[1];

        MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);

        if(rank == 0) System.out.println("PI: " + rBuf[0]);
        MPI.Finalize();
    }
}
```

## Exception handling

Java bindings in Open MPI support exception handling.  By default, errors
are fatal, but this behavior can be changed.  The Java API will throw
exceptions if the MPI.ERRORS_RETURN error handler is set:

```java
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
```

If you add this statement to your program, it will show the line
where it breaks, instead of just crashing in case of an error.
Error-handling code can be separated from main application code by
means of try-catch blocks, for instance:

```java
try
{
    File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
}
catch(MPIException ex)
{
    System.err.println("Error Message: "+ ex.getMessage());
    System.err.println("  Error Class: "+ ex.getErrorClass());
    ex.printStackTrace();
    System.exit(-1);
}
```

## How to specify buffers

In MPI primitives that require a buffer (either send or receive) the
Java API admits a Java array.  Since Java arrays can be relocated by
the Java runtime environment, the MPI Java bindings need to make a
copy of the contents of the array to a temporary buffer, then pass the
pointer to this buffer to the underlying C implementation.  From the
practical point of view, this implies an overhead associated to all
buffers that are represented by Java arrays.  The overhead is small
for small buffers but increases for large arrays.

There is a pool of temporary buffers with a default capacity of 64K.
If a temporary buffer of 64K or less is needed, then the buffer will
be obtained from the pool.  But if the buffer is larger, then it will
be necessary to allocate the buffer and free it later.

The default capacity of pool buffers can be modified with an Open MPI
MCA parameter:

```
shell$ mpirun --mca mpi_java_eager size ...
```

Where `size` is the number of bytes, or kilobytes if it ends with 'k',
or megabytes if it ends with 'm'.
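
For instance, an illustrative run that raises the pool buffer size to
128 kilobytes for a hypothetical `MyApp` class might look like:

```
shell$ mpirun --mca mpi_java_eager 128k -np 4 java MyApp
```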

An alternative is to use "direct buffers" provided by standard classes
available in the Java SDK such as `ByteBuffer`.  For convenience we
provide a few static methods `new[Type]Buffer` in the `MPI` class to
create direct buffers for a number of basic datatypes.  Elements of the
direct buffer can be accessed with methods `put()` and `get()`, and
the number of elements in the buffer can be obtained with the method
`capacity()`.  This example illustrates its use:

```java
int myself = MPI.COMM_WORLD.getRank();
int tasks = MPI.COMM_WORLD.getSize();

IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
          out = MPI.newIntBuffer(MAXLEN);

for(int i = 0; i < MAXLEN; i++)
    out.put(i, myself); // fill the buffer with the rank

Request request = MPI.COMM_WORLD.iAllGather(
                  out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
request.waitFor();
request.free();

for(int i = 0; i < tasks; i++)
{
    for(int k = 0; k < MAXLEN; k++)
    {
        if(in.get(k + i * MAXLEN) != i)
            throw new AssertionError("Unexpected value");
    }
}
```

Direct buffers are available for: `BYTE`, `CHAR`, `SHORT`, `INT`,
`LONG`, `FLOAT`, and `DOUBLE`.  There is no direct buffer for booleans.

Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays.  In some
cases arrays will be a better choice.  You can easily convert a
buffer into an array and vice versa.

All non-blocking methods must use direct buffers and only
blocking methods can choose between arrays and direct buffers.

The above example also illustrates that it is necessary to call
the `free()` method on objects whose class implements the `Freeable`
interface.  Otherwise a memory leak is produced.

## Specifying offsets in buffers

In a C program, it is common to specify an offset in an array with
`&array[i]` or `array+i`, for instance to send data starting from
a given position in the array.  The equivalent form in the Java bindings
is to `slice()` the buffer to start at an offset.  Making a `slice()`
on a buffer is only necessary when the offset is not zero.  Slices
work for both arrays and direct buffers.

```java
import static mpi.MPI.slice;
// ...
int numbers[] = new int[SIZE];
// ...
MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);
```

## Questions? Problems?

If you have any problems, or find any bugs, please feel free to report
them to [Open MPI user's mailing
list](https://www.open-mpi.org/community/lists/ompi.php).
README.JAVA.txt (deleted, 275 lines)
@@ -1,275 +0,0 @@
***************************************************************************
IMPORTANT NOTE

JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
CONTINUED DEVELOPER SUPPORT.

***************************************************************************

This version of Open MPI provides support for Java-based
MPI applications.

The rest of this document provides step-by-step instructions on
building OMPI with Java bindings, and compiling and running
Java-based MPI applications.  Also, part of the functionality is
explained with examples.  Further details about the design,
implementation and usage of Java bindings in Open MPI can be found
in [1].  The bindings follow a JNI approach, that is, we do not
provide a pure Java implementation of MPI primitives, but a thin
layer on top of the C implementation.  This is the same approach
as in mpiJava [2]; in fact, mpiJava was taken as a starting point
for Open MPI Java bindings, but they were later totally rewritten.

[1] O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
    implementation of Java bindings in Open MPI". Parallel Comput.
    59: 1-20 (2016).

[2] M. Baker et al. "mpiJava: An object-oriented Java interface to
    MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
    pp. 748-762, Springer (1999).

============================================================================

Building Java Bindings

If this software was obtained as a developer-level
checkout as opposed to a tarball, you will need to start your build by
running ./autogen.pl.  This will also require that you have a fairly
recent version of autotools on your system - see the HACKING file for
details.

Java support requires that Open MPI be built at least with shared libraries
(i.e., --enable-shared) - any additional options are fine and will not
conflict.  Note that this is the default for Open MPI, so you don't
have to explicitly add the option.  The Java bindings will build only
if --enable-mpi-java is specified, and a JDK is found in a typical
system default location.

If the JDK is not in a place where we automatically find it, you can
specify the location.  For example, this is required on the Mac
platform as the JDK headers are located in a non-typical location.  Two
options are available for this purpose:

    --with-jdk-bindir=<foo>  - the location of javac and javah
    --with-jdk-headers=<bar> - the directory containing jni.h

For simplicity, typical configurations are provided in platform files
under contrib/platform/hadoop.  These will meet the needs of most
users, or at least provide a starting point for your own custom
configuration.

In summary, therefore, you can configure the system using the
following Java-related options:

    $ ./configure --with-platform=contrib/platform/hadoop/<your-platform>
      ...

or

    $ ./configure --enable-mpi-java --with-jdk-bindir=<foo>
      --with-jdk-headers=<bar> ...

or simply

    $ ./configure --enable-mpi-java ...

if JDK is in a "standard" place that we automatically find.

----------------------------------------------------------------------------

Running Java Applications

For convenience, the "mpijavac" wrapper compiler has been provided for
compiling Java-based MPI applications.  It ensures that all required MPI
libraries and class paths are defined.  You can see the actual command
line using the --showme option, if you are interested.

Once your application has been compiled, you can run it with the
standard "mpirun" command line:

    $ mpirun <options> java <your-java-options> <my-app>

For convenience, mpirun has been updated to detect the "java" command
and ensure that the required MPI libraries and class paths are defined
to support execution.  You therefore do NOT need to specify the Java
library path to the MPI installation, nor the MPI classpath.  Any class
path definitions required for your application should be specified
either on the command line or via the CLASSPATH environmental
variable.  Note that the local directory will be added to the class
path if nothing is specified.

As always, the "java" executable, all required libraries, and your
application classes must be available on all nodes.

----------------------------------------------------------------------------

Basic usage of Java bindings

There is an MPI package that contains all classes of the MPI Java
bindings: Comm, Datatype, Request, etc.  These classes have a direct
correspondence with classes defined by the MPI standard.  MPI primitives
are just methods included in these classes.  The convention used for
naming Java methods and classes is the usual camel-case convention,
e.g., the equivalent of MPI_File_set_info(fh,info) is fh.setInfo(info),
where fh is an object of the class File.

Apart from classes, the MPI package contains predefined public attributes
under a convenience class MPI.  Examples are the predefined communicator
MPI.COMM_WORLD or predefined datatypes such as MPI.DOUBLE.  Also, MPI
initialization and finalization are methods of the MPI class and must
be invoked by all MPI Java applications.  The following example illustrates
these concepts:

    import mpi.*;

    class ComputePi {

        public static void main(String args[]) throws MPIException {

            MPI.Init(args);

            int rank = MPI.COMM_WORLD.getRank(),
                size = MPI.COMM_WORLD.getSize(),
                nint = 100; // Intervals.
            double h = 1.0/(double)nint, sum = 0.0;

            for(int i=rank+1; i<=nint; i+=size) {
                double x = h * ((double)i - 0.5);
                sum += (4.0 / (1.0 + x * x));
            }

            double sBuf[] = { h * sum },
                   rBuf[] = new double[1];

            MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);

            if(rank == 0) System.out.println("PI: " + rBuf[0]);
            MPI.Finalize();
        }
    }

----------------------------------------------------------------------------

Exception handling

Java bindings in Open MPI support exception handling.  By default, errors
are fatal, but this behavior can be changed.  The Java API will throw
exceptions if the MPI.ERRORS_RETURN error handler is set:

    MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);

If you add this statement to your program, it will show the line
where it breaks, instead of just crashing in case of an error.
Error-handling code can be separated from main application code by
means of try-catch blocks, for instance:

    try
    {
        File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
    }
    catch(MPIException ex)
    {
        System.err.println("Error Message: "+ ex.getMessage());
        System.err.println("  Error Class: "+ ex.getErrorClass());
        ex.printStackTrace();
        System.exit(-1);
    }


----------------------------------------------------------------------------

How to specify buffers

In MPI primitives that require a buffer (either send or receive) the
Java API admits a Java array.  Since Java arrays can be relocated by
the Java runtime environment, the MPI Java bindings need to make a
copy of the contents of the array to a temporary buffer, then pass the
pointer to this buffer to the underlying C implementation.  From the
practical point of view, this implies an overhead associated to all
buffers that are represented by Java arrays.  The overhead is small
for small buffers but increases for large arrays.

There is a pool of temporary buffers with a default capacity of 64K.
If a temporary buffer of 64K or less is needed, then the buffer will
be obtained from the pool.  But if the buffer is larger, then it will
be necessary to allocate the buffer and free it later.

The default capacity of pool buffers can be modified with an 'mca'
parameter:

    mpirun --mca mpi_java_eager size ...

Where 'size' is the number of bytes, or kilobytes if it ends with 'k',
or megabytes if it ends with 'm'.

An alternative is to use "direct buffers" provided by standard
classes available in the Java SDK such as ByteBuffer.  For convenience
we provide a few static methods "new[Type]Buffer" in the MPI class
to create direct buffers for a number of basic datatypes.  Elements
of the direct buffer can be accessed with methods put() and get(),
and the number of elements in the buffer can be obtained with the
method capacity().  This example illustrates its use:

    int myself = MPI.COMM_WORLD.getRank();
    int tasks = MPI.COMM_WORLD.getSize();

    IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
              out = MPI.newIntBuffer(MAXLEN);

    for(int i = 0; i < MAXLEN; i++)
        out.put(i, myself); // fill the buffer with the rank

    Request request = MPI.COMM_WORLD.iAllGather(
                      out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
    request.waitFor();
    request.free();

    for(int i = 0; i < tasks; i++)
    {
        for(int k = 0; k < MAXLEN; k++)
        {
            if(in.get(k + i * MAXLEN) != i)
                throw new AssertionError("Unexpected value");
        }
    }

Direct buffers are available for: BYTE, CHAR, SHORT, INT, LONG,
FLOAT, and DOUBLE.  There is no direct buffer for booleans.

Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays.  In some
cases arrays will be a better choice.  You can easily convert a
buffer into an array and vice versa.

All non-blocking methods must use direct buffers and only
blocking methods can choose between arrays and direct buffers.

The above example also illustrates that it is necessary to call
the free() method on objects whose class implements the Freeable
interface.  Otherwise a memory leak is produced.

----------------------------------------------------------------------------

Specifying offsets in buffers

In a C program, it is common to specify an offset in a array with
"&array[i]" or "array+i", for instance to send data starting from
a given position in the array.  The equivalent form in the Java bindings
is to "slice()" the buffer to start at an offset.  Making a "slice()"
on a buffer is only necessary, when the offset is not zero.  Slices
work for both arrays and direct buffers.

    import static mpi.MPI.slice;
    ...
    int numbers[] = new int[SIZE];
    ...
    MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);

----------------------------------------------------------------------------

If you have any problems, or find any bugs, please feel free to report
them to Open MPI user's mailing list (see
https://www.open-mpi.org/community/lists/ompi.php).
README.md (new file, 2191 lines)
The diff for this file is not shown because it is too large.
@@ -64,7 +64,7 @@ EXTRA_DIST = \
 platform/lanl/cray_xc_cle5.2/optimized-common \
 platform/lanl/cray_xc_cle5.2/optimized-lustre \
 platform/lanl/cray_xc_cle5.2/optimized-lustre.conf \
-platform/lanl/toss/README \
+platform/lanl/toss/README.md \
 platform/lanl/toss/common \
 platform/lanl/toss/common-optimized \
 platform/lanl/toss/cray-lustre-optimized \
@@ -1,121 +1,108 @@
+# Description

-2 Feb 2011
-
-Description
-===========

-This sample "tcp2" BTL component is a simple example of how to build
+This sample `tcp2` BTL component is a simple example of how to build
 an Open MPI MCA component from outside of the Open MPI source tree.
 This is a valuable technique for 3rd parties who want to provide their
 own components for Open MPI, but do not want to be in the mainstream
 distribution (i.e., their code is not part of the main Open MPI code
 base).

 NOTE: We do recommend that 3rd party developers investigate using a
 DVCS such as Mercurial or Git to keep up with Open MPI
 development.  Using a DVCS allows you to host your component in
 your own copy of the Open MPI source tree, and yet still keep up
 with development changes, stable releases, etc.

 Previous colloquial knowledge held that building a component from
 outside of the Open MPI source tree required configuring Open MPI
---with-devel-headers, and then building and installing it.  This
-configure switch installs all of OMPI's internal .h files under
-$prefix/include/openmpi, and therefore allows 3rd party code to be
+`--with-devel-headers`, and then building and installing it.  This
+configure switch installs all of OMPI's internal `.h` files under
+`$prefix/include/openmpi`, and therefore allows 3rd party code to be
 compiled outside of the Open MPI tree.

 This method definitely works, but is annoying:

-* You have to ask users to use this special configure switch.
-* Not all users install from source; many get binary packages (e.g.,
-  RPMs).
+* You have to ask users to use this special configure switch.
+* Not all users install from source; many get binary packages (e.g.,
+  RPMs).

 This example package shows two ways to build an Open MPI MCA component
 from outside the Open MPI source tree:

-1. Using the above --with-devel-headers technique
-2. Compiling against the Open MPI source tree itself (vs. the
-   installation tree)
+1. Using the above `--with-devel-headers` technique
+2. Compiling against the Open MPI source tree itself (vs. the
+   installation tree)

 The user still has to have a source tree, but at least they don't have
-to be required to use --with-devel-headers (which most users don't) --
+to be required to use `--with-devel-headers` (which most users don't) --
 they can likely build off the source tree that they already used.

-Example project contents
-========================
+# Example project contents

-The "tcp2" component is a direct copy of the TCP BTL as of January
+The `tcp2` component is a direct copy of the TCP BTL as of January
 2011 -- it has just been renamed so that it can be built separately
 and installed alongside the real TCP BTL component.

 Most of the mojo for both methods is handled in the example
-components' configure.ac, but the same techniques are applicable
+components' `configure.ac`, but the same techniques are applicable
 outside of the GNU Auto toolchain.

-This sample "tcp2" component has an autogen.sh script that requires
+This sample `tcp2` component has an `autogen.sh` script that requires
 the normal Autoconf, Automake, and Libtool.  It also adds the
 following two configure switches:

---with-openmpi-install=DIR
+1. `--with-openmpi-install=DIR`:
+   If provided, `DIR` is an Open MPI installation tree that was
+   installed `--with-devel-headers`.

-   If provided, DIR is an Open MPI installation tree that was
-   installed --with-devel-headers.

-   This switch uses the installed mpicc --showme:<foo> functionality
-   to extract the relevant CPPFLAGS, LDFLAGS, and LIBS.

---with-openmpi-source=DIR

-   If provided, DIR is the source of a configured and built Open MPI
+   This switch uses the installed `mpicc --showme:<foo>` functionality
+   to extract the relevant `CPPFLAGS`, `LDFLAGS`, and `LIBS`.
+1. `--with-openmpi-source=DIR`:
+   If provided, `DIR` is the source of a configured and built Open MPI
    source tree (corresponding to the version expected by the example
    component).  The source tree is not required to have been
-   configured --with-devel-headers.
+   configured `--with-devel-headers`.

-   This switch uses the source tree's config.status script to extract
-   the relevant CPPFLAGS and CFLAGS.
+   This switch uses the source tree's `config.status` script to
+   extract the relevant `CPPFLAGS` and `CFLAGS`.

 Either one of these two switches must be provided, or appropriate
-CPPFLAGS, CFLAGS, LDFLAGS, and/or LIBS must be provided such that
-valid Open MPI header and library files can be found and compiled /
-linked against, respectively.
+`CPPFLAGS`, `CFLAGS`, `LDFLAGS`, and/or `LIBS` must be provided such
+that valid Open MPI header and library files can be found and compiled
+/ linked against, respectively.
|
||||
|
||||
Example use
|
||||
===========
|
||||
# Example use
|
||||
|
||||
First, download, build, and install Open MPI:
|
||||
|
||||
-----
|
||||
```
|
||||
$ cd $HOME
|
||||
$ wget \
|
||||
https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
|
||||
[lots of output]
|
||||
$ wget https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
|
||||
[...lots of output...]
|
||||
$ tar jxf openmpi-X.Y.Z.tar.bz2
|
||||
$ cd openmpi-X.Y.Z
|
||||
$ ./configure --prefix=/opt/openmpi ...
|
||||
[lots of output]
|
||||
[...lots of output...]
|
||||
$ make -j 4 install
|
||||
[lots of output]
|
||||
[...lots of output...]
|
||||
$ /opt/openmpi/bin/ompi_info | grep btl
|
||||
MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
|
||||
MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
|
||||
MCA btl: tcp (MCA vA.B, API vM.N, Component vX.Y.Z)
|
||||
[where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
|
||||
$
|
||||
-----
|
||||
```
|
||||
|
||||
Notice the installed BTLs from ompi_info.
|
||||
Notice the installed BTLs from `ompi_info`.
|
||||
|
||||
Now cd into this example project and build it, pointing it to the
|
||||
Now `cd` into this example project and build it, pointing it to the
|
||||
source directory of the Open MPI that you just built. Note that we
|
||||
use the same --prefix as when installing Open MPI (so that the built
|
||||
use the same `--prefix` as when installing Open MPI (so that the built
|
||||
component will be installed into the Right place):
|
||||
|
||||
-----
|
||||
```
|
||||
$ cd /path/to/this/sample
|
||||
$ ./autogen.sh
|
||||
$ ./configure --prefix=/opt/openmpi --with-openmpi-source=$HOME/openmpi-X.Y.Z
|
||||
[lots of output]
|
||||
[...lots of output...]
|
||||
$ make -j 4 install
|
||||
[lots of output]
|
||||
[...lots of output...]
|
||||
$ /opt/openmpi/bin/ompi_info | grep btl
|
||||
MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
|
||||
MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
|
||||
@ -123,12 +110,11 @@ $ /opt/openmpi/bin/ompi_info | grep btl
|
||||
MCA btl: tcp2 (MCA vA.B, API vM.N, Component vX.Y.Z)
|
||||
[where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
|
||||
$
|
||||
-----
|
||||
```
|
||||
|
||||
Notice that the "tcp2" BTL is now installed.
|
||||
Notice that the `tcp2` BTL is now installed.
|
||||
|
||||
Random notes
|
||||
============
|
||||
# Random notes
|
||||
|
||||
The component in this project is just an example; I whipped it up in
|
||||
the span of several hours. Your component may be a bit more complex
|
||||
@ -139,17 +125,15 @@ what you need.
|
||||
Changes required to the component to make it build in a standalone
|
||||
mode:
|
||||
|
||||
1. Write your own configure script. This component is just a sample.
|
||||
You basically need to build against an OMPI install that was
|
||||
installed --with-devel-headers or a built OMPI source tree. See
|
||||
./configure --help for details.
|
||||
|
||||
2. I also provided a bogus btl_tcp2_config.h (generated by configure).
|
||||
This file is not included anywhere, but it does provide protection
|
||||
against re-defined PACKAGE_* macros when running configure, which
|
||||
is quite annoying.
|
||||
|
||||
3. Modify Makefile.am to only build DSOs. I.e., you can optionally
|
||||
1. Write your own `configure` script. This component is just a
|
||||
sample. You basically need to build against an OMPI install that
|
||||
was installed `--with-devel-headers` or a built OMPI source tree.
|
||||
See `./configure --help` for details.
|
||||
1. I also provided a bogus `btl_tcp2_config.h` (generated by
|
||||
`configure`). This file is not included anywhere, but it does
|
||||
provide protection against re-defined `PACKAGE_*` macros when
|
||||
running `configure`, which is quite annoying.
|
||||
1. Modify `Makefile.am` to only build DSOs. I.e., you can optionally
|
||||
take the static option out since the component can *only* build in
|
||||
DSO mode when building standalone. That being said, it doesn't
|
||||
hurt to leave the static builds in -- this would (hypothetically)
|
contrib/dist/linux/README (vendored, deleted, 105 lines)
@ -1,105 +0,0 @@
|
||||
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
|
||||
University Research and Technology
|
||||
Corporation. All rights reserved.
|
||||
Copyright (c) 2004-2006 The University of Tennessee and The University
|
||||
of Tennessee Research Foundation. All rights
|
||||
reserved.
|
||||
Copyright (c) 2004-2006 High Performance Computing Center Stuttgart,
|
||||
University of Stuttgart. All rights reserved.
|
||||
Copyright (c) 2004-2006 The Regents of the University of California.
|
||||
All rights reserved.
|
||||
Copyright (c) 2006-2016 Cisco Systems, Inc. All rights reserved.
|
||||
$COPYRIGHT$
|
||||
|
||||
Additional copyrights may follow
|
||||
|
||||
$HEADER$
|
||||
|
||||
===========================================================================
|
||||
|
||||
Note that you probably want to download the latest release of the SRPM
|
||||
for any given Open MPI version. The SRPM release number is the
|
||||
version after the dash in the SRPM filename. For example,
|
||||
"openmpi-1.6.3-2.src.rpm" is the 2nd release of the SRPM for Open MPI
|
||||
v1.6.3. Subsequent releases of SRPMs typically contain bug fixes for
|
||||
the RPM packaging, but not Open MPI itself.
|
||||
|
||||
The buildrpm.sh script takes a single mandatory argument -- a filename
|
||||
pointing to an Open MPI tarball (may be either .gz or .bz2). It will
|
||||
create one or more RPMs from this tarball:
|
||||
|
||||
1. Source RPM
|
||||
2. "All in one" RPM, where all of Open MPI is put into a single RPM.
|
||||
3. "Multiple" RPM, where Open MPI is split into several sub-package
|
||||
RPMs:
|
||||
- openmpi-runtime
|
||||
- openmpi-devel
|
||||
- openmpi-docs
|
||||
|
||||
The folowing arguments could be used to affect script behaviour.
|
||||
Please, do NOT set the same settings with parameters and config vars.
|
||||
|
||||
-b
|
||||
If you specify this option, only the all-in-one binary RPM will
|
||||
be built. By default, only the source RPM (SRPM) is built. Other
|
||||
parameters that affect the all-in-one binary RPM will be ignored
|
||||
unless this option is specified.
|
||||
|
||||
-n name
|
||||
This option will change the name of the produced RPM to the "name".
|
||||
It is useful to use with "-o" and "-m" options if you want to have
|
||||
multiple Open MPI versions installed simultaneously in the same
|
||||
enviroment. Requires use of option "-b".
|
||||
|
||||
-o
|
||||
With this option the install path of the binary RPM will be changed
|
||||
to /opt/_NAME_/_VERSION_. Requires use of option "-b".
|
||||
|
||||
-m
|
||||
This option causes the RPM to also install modulefiles
|
||||
to the location specified in the specfile. Requires use of option "-b".
|
||||
|
||||
-i
|
||||
Also build a debuginfo RPM. By default, the debuginfo RPM is not built.
|
||||
Requires use of option "-b".
|
||||
|
||||
-f lf_location
|
||||
Include support for Libfabric. "lf_location" is Libfabric install
|
||||
path. Requires use of option "-b".
|
||||
|
||||
-t tm_location
|
||||
Include support for Torque/PBS Pro. "tm_location" is path of the
|
||||
Torque/PBS Pro header files. Requires use of option "-b".
|
||||
|
||||
-d
|
||||
Build with debugging support. By default,
|
||||
the RPM is built without debugging support.
|
||||
|
||||
-c parameter
|
||||
Add custom configure parameter.
|
||||
|
||||
-r parameter
|
||||
Add custom RPM build parameter.
|
||||
|
||||
-s
|
||||
If specified, the script will try to unpack the openmpi.spec
|
||||
file from the tarball specified on the command line. By default,
|
||||
the script will look for the specfile in the current directory.
|
||||
|
||||
-R directory
|
||||
Specifies the top level RPM build direcotry.
|
||||
|
||||
-h
|
||||
Prints script usage information.
|
||||
|
||||
|
||||
Target architecture is currently hard-coded in the beginning
|
||||
of the buildrpm.sh script.
|
||||
|
||||
Alternatively, you can build directly from the openmpi.spec spec file
|
||||
or SRPM directly. Many options can be passed to the building process
|
||||
via rpmbuild's --define option (there are older versions of rpmbuild
|
||||
that do not seem to handle --define'd values properly in all cases,
|
||||
but we generally don't care about those old versions of rpmbuild...).
|
||||
The available options are described in the comments in the beginning
|
||||
of the spec file in this directory.
|
contrib/dist/linux/README.md (vendored, new file, 88 lines)
@ -0,0 +1,88 @@
|
||||
# Open MPI Linux distribution helpers
|
||||
|
||||
Note that you probably want to download the latest release of the SRPM
|
||||
for any given Open MPI version. The SRPM release number is the
|
||||
version after the dash in the SRPM filename. For example,
|
||||
`openmpi-1.6.3-2.src.rpm` is the 2nd release of the SRPM for Open MPI
|
||||
v1.6.3. Subsequent releases of SRPMs typically contain bug fixes for
|
||||
the RPM packaging, but not Open MPI itself.
|
||||
|
||||
The `buildrpm.sh` script takes a single mandatory argument -- a
|
||||
filename pointing to an Open MPI tarball (may be either `.gz` or
|
||||
`.bz2`). It will create one or more RPMs from this tarball:
|
||||
|
||||
1. Source RPM
|
||||
1. "All in one" RPM, where all of Open MPI is put into a single RPM.
|
||||
1. "Multiple" RPM, where Open MPI is split into several sub-package
|
||||
RPMs:
|
||||
* `openmpi-runtime`
|
||||
* `openmpi-devel`
|
||||
* `openmpi-docs`
|
||||
|
||||
The following arguments can be used to affect the script's behavior.
Please do NOT set the same settings via both parameters and config vars.
|
||||
|
||||
* `-b`:
|
||||
If you specify this option, only the all-in-one binary RPM will
|
||||
be built. By default, only the source RPM (SRPM) is built. Other
|
||||
parameters that affect the all-in-one binary RPM will be ignored
|
||||
unless this option is specified.
|
||||
|
||||
* `-n name`:
|
||||
This option changes the name of the produced RPM to "name".  It is
useful together with the `-o` and `-m` options if you want to have
multiple Open MPI versions installed simultaneously in the same
environment.  Requires use of option `-b`.
|
||||
|
||||
* `-o`:
|
||||
With this option the install path of the binary RPM will be changed
|
||||
to `/opt/_NAME_/_VERSION_`. Requires use of option `-b`.
|
||||
|
||||
* `-m`:
|
||||
This option causes the RPM to also install modulefiles
|
||||
to the location specified in the specfile. Requires use of option `-b`.
|
||||
|
||||
* `-i`:
|
||||
Also build a debuginfo RPM. By default, the debuginfo RPM is not built.
|
||||
Requires use of option `-b`.
|
||||
|
||||
* `-f lf_location`:
|
||||
Include support for Libfabric.  `lf_location` is the Libfabric
install path.  Requires use of option `-b`.
|
||||
|
||||
* `-t tm_location`:
|
||||
Include support for Torque/PBS Pro.  `tm_location` is the path of the
Torque/PBS Pro header files.  Requires use of option `-b`.
|
||||
|
||||
* `-d`:
|
||||
Build with debugging support. By default,
|
||||
the RPM is built without debugging support.
|
||||
|
||||
* `-c parameter`:
|
||||
Add custom configure parameter.
|
||||
|
||||
* `-r parameter`:
|
||||
Add custom RPM build parameter.
|
||||
|
||||
* `-s`:
|
||||
If specified, the script will try to unpack the openmpi.spec
|
||||
file from the tarball specified on the command line. By default,
|
||||
the script will look for the specfile in the current directory.
|
||||
|
||||
* `-R directory`:
|
||||
Specifies the top-level RPM build directory.
|
||||
|
||||
* `-h`:
|
||||
Prints script usage information.
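
For illustration, a minimal sketch of invoking the script (the
`openmpi-X.Y.Z.tar.bz2` filename is a placeholder; flags are shown
before the tarball argument, which is the common convention, but check
`-h` for the exact ordering your copy of the script expects):

```
# Build only the source RPM (the default behavior):
shell$ ./buildrpm.sh openmpi-X.Y.Z.tar.bz2

# Build the all-in-one binary RPM instead:
shell$ ./buildrpm.sh -b openmpi-X.Y.Z.tar.bz2
```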
|
||||
|
||||
|
||||
The target architecture is currently hard-coded at the beginning
of the `buildrpm.sh` script.
|
||||
|
||||
Alternatively, you can build directly from the `openmpi.spec` spec
file or the SRPM.  Many options can be passed to the building
|
||||
process via `rpmbuild`'s `--define` option (there are older versions
|
||||
of `rpmbuild` that do not seem to handle `--define`'d values properly
|
||||
in all cases, but we generally don't care about those old versions of
|
||||
`rpmbuild`...). The available options are described in the comments
|
||||
at the beginning of the spec file in this directory.
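
As a rough sketch of that route, a binary rebuild from the SRPM with
one spec macro overridden might look like the following.  The
`install_in_opt` macro name is only an illustration of the `--define`
mechanism; the macro names actually honored are the ones documented in
the spec file comments:

```
# NOTE: "install_in_opt" is an illustrative macro name -- consult the
# comments in openmpi.spec for the macros it really supports.
shell$ rpmbuild --rebuild --define 'install_in_opt 1' openmpi-X.Y.Z-1.src.rpm
```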
|
@ -61,7 +61,7 @@ created.
|
||||
- copy of toss3-hfi-optimized.conf with the following changes:
|
||||
- change: comment "Add the interface for out-of-band communication and set
|
||||
it up" to "Set up the interface for out-of-band communication"
|
||||
- remove: oob_tcp_if_exclude = ib0
|
||||
- remove: oob_tcp_if_exclude = ib0
|
||||
- remove: btl (let Open MPI figure out what best to use for ethernet-
|
||||
connected hardware)
|
||||
- remove: btl_openib_want_fork_support (no infiniband)
|
@ -33,7 +33,7 @@
|
||||
# Automake).
|
||||
|
||||
EXTRA_DIST += \
|
||||
examples/README \
|
||||
examples/README.md \
|
||||
examples/Makefile \
|
||||
examples/hello_c.c \
|
||||
examples/hello_mpifh.f \
|
||||
|
@ -1,67 +0,0 @@
|
||||
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
|
||||
University Research and Technology
|
||||
Corporation. All rights reserved.
|
||||
Copyright (c) 2006-2012 Cisco Systems, Inc. All rights reserved.
|
||||
Copyright (c) 2007-2009 Sun Microsystems, Inc. All rights reserved.
|
||||
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
|
||||
Copyright (c) 2013 Mellanox Technologies, Inc. All rights reserved.
|
||||
|
||||
$COPYRIGHT$
|
||||
|
||||
The files in this directory are sample MPI applications provided both
|
||||
as a trivial primer to MPI as well as simple tests to ensure that your
|
||||
Open MPI installation is working properly.
|
||||
|
||||
If you are looking for a comprehensive MPI tutorial, these samples are
|
||||
not enough. Excellent MPI tutorials are available here:
|
||||
|
||||
http://www.citutor.org/login.php
|
||||
|
||||
Get a free account and login; you can then browse to the list of
|
||||
available courses. Look for the ones with "MPI" in the title.
|
||||
|
||||
There are two MPI examples in this directory, each using one of six
|
||||
different MPI interfaces:
|
||||
|
||||
- Hello world
|
||||
C: hello_c.c
|
||||
C++: hello_cxx.cc
|
||||
Fortran mpif.h: hello_mpifh.f
|
||||
Fortran use mpi: hello_usempi.f90
|
||||
Fortran use mpi_f08: hello_usempif08.f90
|
||||
Java: Hello.java
|
||||
C shmem.h: hello_oshmem_c.c
|
||||
Fortran shmem.fh: hello_oshmemfh.f90
|
||||
|
||||
- Send a trivial message around in a ring
|
||||
C: ring_c.c
|
||||
C++: ring_cxx.cc
|
||||
Fortran mpif.h: ring_mpifh.f
|
||||
Fortran use mpi: ring_usempi.f90
|
||||
Fortran use mpi_f08: ring_usempif08.f90
|
||||
Java: Ring.java
|
||||
C shmem.h: ring_oshmem_c.c
|
||||
Fortran shmem.fh: ring_oshmemfh.f90
|
||||
|
||||
Additionally, there's one further example application, but this one
|
||||
only uses the MPI C bindings:
|
||||
|
||||
- Test the connectivity between all processes
|
||||
C: connectivity_c.c
|
||||
|
||||
The Makefile in this directory will build as many of the examples as
|
||||
you have language support (e.g., if you do not have the Fortran "use
|
||||
mpi" bindings compiled as part of Open MPI, the those examples will be
|
||||
skipped).
|
||||
|
||||
The Makefile assumes that the wrapper compilers mpicc, mpic++, and
|
||||
mpifort are in your path.
|
||||
|
||||
Although the Makefile is tailored for Open MPI (e.g., it checks the
|
||||
"ompi_info" command to see if you have support for C++, mpif.h, use
|
||||
mpi, and use mpi_f08 F90), all of the example programs are pure MPI,
|
||||
and therefore not specific to Open MPI. Hence, you can use a
|
||||
different MPI implementation to compile and run these programs if you
|
||||
wish.
|
||||
|
||||
Make today an Open MPI day!
|
examples/README.md (new file, 66 lines)
@ -0,0 +1,66 @@
|
||||
The files in this directory are sample MPI applications provided both
|
||||
as a trivial primer to MPI as well as simple tests to ensure that your
|
||||
Open MPI installation is working properly.
|
||||
|
||||
If you are looking for a comprehensive MPI tutorial, these samples are
|
||||
not enough. [Excellent MPI tutorials are available
|
||||
here](http://www.citutor.org/login.php).
|
||||
|
||||
Get a free account and login; you can then browse to the list of
|
||||
available courses. Look for the ones with "MPI" in the title.
|
||||
|
||||
There are two MPI examples in this directory, each using one of six
|
||||
different MPI interfaces:
|
||||
|
||||
## Hello world
|
||||
|
||||
The MPI version of the canonical "hello world" program:
|
||||
|
||||
* C: `hello_c.c`
|
||||
* C++: `hello_cxx.cc`
|
||||
* Fortran mpif.h: `hello_mpifh.f`
|
||||
* Fortran use mpi: `hello_usempi.f90`
|
||||
* Fortran use mpi_f08: `hello_usempif08.f90`
|
||||
* Java: `Hello.java`
|
||||
* C shmem.h: `hello_oshmem_c.c`
|
||||
* Fortran shmem.fh: `hello_oshmemfh.f90`
|
||||
|
||||
## Ring
|
||||
|
||||
Send a trivial message around in a ring:
|
||||
|
||||
* C: `ring_c.c`
|
||||
* C++: `ring_cxx.cc`
|
||||
* Fortran mpif.h: `ring_mpifh.f`
|
||||
* Fortran use mpi: `ring_usempi.f90`
|
||||
* Fortran use mpi_f08: `ring_usempif08.f90`
|
||||
* Java: `Ring.java`
|
||||
* C shmem.h: `ring_oshmem_c.c`
|
||||
* Fortran shmem.fh: `ring_oshmemfh.f90`
|
||||
|
||||
## Connectivity Test
|
||||
|
||||
Additionally, there's one further example application, but this one
|
||||
only uses the MPI C bindings to test the connectivity between all
|
||||
processes:
|
||||
|
||||
* C: `connectivity_c.c`
|
||||
|
||||
## Makefile
|
||||
|
||||
The `Makefile` in this directory will build as many of the examples
as you have language support for (e.g., if you do not have the
Fortran `use mpi` bindings compiled as part of Open MPI, then those
examples will be skipped).
|
||||
|
||||
The `Makefile` assumes that the wrapper compilers `mpicc`, `mpic++`, and
|
||||
`mpifort` are in your path.
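
For example, building and running the C "hello world" program by hand
(with any MPI installation whose wrapper compilers are in your `PATH`)
looks roughly like this:

```
shell$ mpicc hello_c.c -o hello_c
shell$ mpirun -np 4 ./hello_c
```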
|
||||
|
||||
Although the `Makefile` is tailored for Open MPI (e.g., it checks the
|
||||
`ompi_info` command to see if you have support for `mpif.h`, the `mpi`
|
||||
module, and the `use mpi_f08` module), all of the example programs are
|
||||
pure MPI, and therefore not specific to Open MPI. Hence, you can use
|
||||
a different MPI implementation to compile and run these programs if
|
||||
you wish.
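
If you are unsure which bindings your Open MPI installation provides,
one quick (Open MPI-specific) check is to grep the `ompi_info` output;
the exact wording of that output varies between versions:

```
shell$ ompi_info | grep -i fortran
```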
|
||||
|
||||
Make today an Open MPI day!
|
ompi/contrib/README.md (new file, 19 lines)
@ -0,0 +1,19 @@
|
||||
This is the OMPI contrib system. It is (far) less functional and
|
||||
flexible than the OMPI MCA framework/component system.
|
||||
|
||||
Each contrib package must have a `configure.m4`. It may optionally also
|
||||
have an `autogen.subdirs` file.
|
||||
|
||||
If it has a `configure.m4` file, it must specify its own relevant
|
||||
files to `AC_CONFIG_FILES` to create during `AC_OUTPUT` -- just like
|
||||
MCA components (at a minimum, usually its own `Makefile`). The
|
||||
`configure.m4` file will be slurped up into the main `configure`
|
||||
script, just like other MCA components. Note that there is currently
|
||||
no "no configure" option for contrib packages -- you *must* have a
|
||||
`configure.m4` (even if all it does is call `$1`).  Feel free to fix
this situation if you want -- it probably won't be too difficult
|
||||
to extend `autogen.pl` to support this scenario, similar to how it is
|
||||
done for MCA components. :smile:
|
||||
|
||||
If it has an `autogen.subdirs` file, then it needs to be a
|
||||
subdirectory that is autogen-able.
|
@ -1,19 +0,0 @@
|
||||
This is the OMPI contrib system. It is (far) less functional and
|
||||
flexible than the OMPI MCA framework/component system.
|
||||
|
||||
Each contrib package must have a configure.m4. It may optionally also
|
||||
have an autogen.subdirs file.
|
||||
|
||||
If it has a configure.m4 file, it must specify its own relevant files
|
||||
to AC_CONFIG_FILES to create during AC_OUTPUT -- just like MCA
|
||||
components (at a minimum, usually its own Makefile). The configure.m4
|
||||
file will be slurped up into the main configure script, just like
|
||||
other MCA components. Note that there is currently no "no configure"
|
||||
option for contrib packages -- you *must* have a configure.m4 (even if
|
||||
all it does it call $1). Feel free to fix this situation if you want
|
||||
-- it probably won't not be too difficult to extend autogen.pl to
|
||||
support this scenario, similar to how it is done for MCA components.
|
||||
:-)
|
||||
|
||||
If it has an autogen.subdirs file, then it needs to be a subdirectory
|
||||
that is autogen-able.
|
@ -13,7 +13,7 @@
|
||||
# $HEADER$
|
||||
#
|
||||
|
||||
EXTRA_DIST = profile2mat.pl aggregate_profile.pl
|
||||
EXTRA_DIST = profile2mat.pl aggregate_profile.pl README.md
|
||||
|
||||
sources = common_monitoring.c common_monitoring_coll.c
|
||||
headers = common_monitoring.h common_monitoring_coll.h
|
||||
|
@ -1,181 +0,0 @@
|
||||
|
||||
Copyright (c) 2013-2015 The University of Tennessee and The University
|
||||
of Tennessee Research Foundation. All rights
|
||||
reserved.
|
||||
Copyright (c) 2013-2015 Inria. All rights reserved.
|
||||
$COPYRIGHT$
|
||||
|
||||
Additional copyrights may follow
|
||||
|
||||
$HEADER$
|
||||
|
||||
===========================================================================
|
||||
|
||||
Low level communication monitoring interface in Open MPI
|
||||
|
||||
Introduction
|
||||
------------
|
||||
This interface traces and monitors all messages sent by MPI before they go to the
|
||||
communication channels. At that levels all communication are point-to-point communications:
|
||||
collectives are already decomposed in send and receive calls.
|
||||
|
||||
The monitoring is stored internally by each process and output on stderr at the end of the
|
||||
application (during MPI_Finalize()).
|
||||
|
||||
|
||||
Enabling the monitoring
|
||||
-----------------------
|
||||
To enable the monitoring add --mca pml_monitoring_enable x to the mpirun command line.
|
||||
If x = 1 it monitors internal and external tags indifferently and aggregate everything.
|
||||
If x = 2 it monitors internal tags and external tags separately.
|
||||
If x = 0 the monitoring is disabled.
|
||||
Other value of x are not supported.
|
||||
|
||||
Internal tags are tags < 0. They are used to tag send and receive coming from
|
||||
collective operations or from protocol communications
|
||||
|
||||
External tags are tags >=0. They are used by the application in point-to-point communication.
|
||||
|
||||
Therefore, distinguishing external and internal tags help to distinguish between point-to-point
|
||||
and other communication (mainly collectives).
|
||||
|
||||
Output format
|
||||
-------------
|
||||
The output of the monitoring looks like (with --mca pml_monitoring_enable 2):
|
||||
I 0 1 108 bytes 27 msgs sent
|
||||
E 0 1 1012 bytes 30 msgs sent
|
||||
E 0 2 23052 bytes 61 msgs sent
|
||||
I 1 2 104 bytes 26 msgs sent
|
||||
I 1 3 208 bytes 52 msgs sent
|
||||
E 1 0 860 bytes 24 msgs sent
|
||||
E 1 3 2552 bytes 56 msgs sent
|
||||
I 2 3 104 bytes 26 msgs sent
|
||||
E 2 0 22804 bytes 49 msgs sent
|
||||
E 2 3 860 bytes 24 msgs sent
|
||||
I 3 0 104 bytes 26 msgs sent
|
||||
I 3 1 204 bytes 51 msgs sent
|
||||
E 3 1 2304 bytes 44 msgs sent
|
||||
E 3 2 860 bytes 24 msgs sent
|
||||
|
||||
Where:
|
||||
- the first column distinguishes internal (I) and external (E) tags.
|
||||
- the second column is the sender rank
|
||||
- the third column is the receiver rank
|
||||
- the fourth column is the number of bytes sent
|
||||
- the last column is the number of messages.
|
||||
|
||||
In this example process 0 as sent 27 messages to process 1 using point-to-point call
|
||||
for 108 bytes and 30 messages with collectives and protocol related communication
|
||||
for 1012 bytes to process 1.
|
||||
|
||||
If the monitoring was called with --mca pml_monitoring_enable 1 everything is aggregated
|
||||
under the internal tags. With te above example, you have:
|
||||
I 0 1 1120 bytes 57 msgs sent
|
||||
I 0 2 23052 bytes 61 msgs sent
|
||||
I 1 0 860 bytes 24 msgs sent
|
||||
I 1 2 104 bytes 26 msgs sent
|
||||
I 1 3 2760 bytes 108 msgs sent
|
||||
I 2 0 22804 bytes 49 msgs sent
|
||||
I 2 3 964 bytes 50 msgs sent
|
||||
I 3 0 104 bytes 26 msgs sent
|
||||
I 3 1 2508 bytes 95 msgs sent
|
||||
I 3 2 860 bytes 24 msgs sent
|
||||
|
||||
Monitoring phases
|
||||
-----------------
|
||||
If one wants to monitor phases of the application, it is possible to flush the monitoring
|
||||
at the application level. In this case all the monitoring since the last flush is stored
|
||||
by every process in a file.
|
||||
|
||||
An example of how to flush such monitoring is given in test/monitoring/monitoring_test.c
|
||||
|
||||
Moreover, all the different flushed phased are aggregated at runtime and output at the end
|
||||
of the application as described above.
|
||||
|
||||
Example
|
||||
-------
|
||||
A working example is given in test/monitoring/monitoring_test.c
|
||||
It features, MPI_COMM_WORLD monitoring , sub-communicator monitoring, collective and
|
||||
point-to-point communication monitoring and phases monitoring
|
||||
|
||||
To compile:
|
||||
> make monitoring_test
|
||||
|
||||
Helper scripts
|
||||
--------------
|
||||
Two perl scripts are provided in test/monitoring
|
||||
- aggregate_profile.pl is for aggregating monitoring phases of different processes
|
||||
This script aggregates the profiles generated by the flush_monitoring function.
|
||||
The files need to be in in given format: name_<phase_id>_<process_id>
|
||||
They are then aggregated by phases.
|
||||
If one needs the profile of all the phases he can concatenate the different files,
|
||||
or use the output of the monitoring system done at MPI_Finalize
|
||||
in the example it should be call as:
|
||||
./aggregate_profile.pl prof/phase to generate
|
||||
prof/phase_1.prof
|
||||
prof/phase_2.prof
|
||||
|
||||
- profile2mat.pl is for transforming a the monitoring output into a communication matrix.
|
||||
Take a profile file and aggregates all the recorded communicator into matrices.
|
||||
It generated a matrices for the number of messages, (msg),
|
||||
for the total bytes transmitted (size) and
|
||||
the average number of bytes per messages (avg)
|
||||
|
||||
The output matrix is symmetric
|
||||
|
||||
Do not forget to enable the execution right to these scripts.
|
||||
|
||||
For instance, the provided examples store phases output in ./prof
|
||||
|
||||
If you type:
|
||||
> mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
|
||||
you should have the following output
|
||||
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
|
||||
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
|
||||
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
|
||||
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
|
||||
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
|
||||
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
|
||||
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
|
||||
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
|
||||
I 2 3 104 bytes 26 msgs sent
|
||||
E 2 0 22804 bytes 49 msgs sent
|
||||
E 2 3 860 bytes 24 msgs sent
|
||||
I 3 0 104 bytes 26 msgs sent
|
||||
I 3 1 204 bytes 51 msgs sent
|
||||
E 3 1 2304 bytes 44 msgs sent
|
||||
E 3 2 860 bytes 24 msgs sent
|
||||
I 0 1 108 bytes 27 msgs sent
|
||||
E 0 1 1012 bytes 30 msgs sent
|
||||
E 0 2 23052 bytes 61 msgs sent
|
||||
I 1 2 104 bytes 26 msgs sent
|
||||
I 1 3 208 bytes 52 msgs sent
|
||||
E 1 0 860 bytes 24 msgs sent
|
||||
E 1 3 2552 bytes 56 msgs sent
|
||||
|
||||
you can parse the phases with:
|
||||
> /aggregate_profile.pl prof/phase
|
||||
Building prof/phase_1.prof
|
||||
Building prof/phase_2.prof
|
||||
|
||||
And you can build the different communication matrices of phase 1 with:
|
||||
> ./profile2mat.pl prof/phase_1.prof
|
||||
prof/phase_1.prof -> all
|
||||
prof/phase_1_size_all.mat
|
||||
prof/phase_1_msg_all.mat
|
||||
prof/phase_1_avg_all.mat
|
||||
|
||||
prof/phase_1.prof -> external
|
||||
prof/phase_1_size_external.mat
|
||||
prof/phase_1_msg_external.mat
|
||||
prof/phase_1_avg_external.mat
|
||||
|
||||
prof/phase_1.prof -> internal
|
||||
prof/phase_1_size_internal.mat
|
||||
prof/phase_1_msg_internal.mat
|
||||
prof/phase_1_avg_internal.mat
|
||||
|
||||
Credit
|
||||
------
|
||||
Designed by George Bosilca <bosilca@icl.utk.edu> and
|
||||
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>
|
ompi/mca/common/monitoring/README.md (new file, 209 lines)
@ -0,0 +1,209 @@
|
||||
# Open MPI common monitoring module
|
||||
|
||||
Copyright (c) 2013-2015 The University of Tennessee and The University
|
||||
of Tennessee Research Foundation. All rights
|
||||
reserved.
|
||||
Copyright (c) 2013-2015 Inria. All rights reserved.
|
||||
|
||||
Low level communication monitoring interface in Open MPI
|
||||
|
||||
## Introduction
|
||||
|
||||
This interface traces and monitors all messages sent by MPI before
they go to the communication channels.  At that level, all
communications are point-to-point communications: collectives have
already been decomposed into send and receive calls.
|
||||
|
||||
The monitoring is stored internally by each process and output on
|
||||
stderr at the end of the application (during `MPI_Finalize()`).
|
||||
|
||||
|
||||
## Enabling the monitoring
|
||||
|
||||
To enable the monitoring add `--mca pml_monitoring_enable x` to the
|
||||
`mpirun` command line:
|
||||
|
||||
* If x = 1, it monitors internal and external tags indifferently and aggregates everything.
* If x = 2, it monitors internal tags and external tags separately.
* If x = 0, the monitoring is disabled.
* Other values of x are not supported.
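
For example, to monitor internal and external tags separately while
running an application on 4 processes (`./my_app` stands in for your
application):

```
shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./my_app
```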
|
||||
|
||||
Internal tags are tags < 0.  They are used to tag sends and receives
coming from collective operations or from protocol communications.
|
||||
|
||||
External tags are tags >=0. They are used by the application in
|
||||
point-to-point communication.
|
||||
|
||||
Therefore, distinguishing external and internal tags helps to
|
||||
distinguish between point-to-point and other communication (mainly
|
||||
collectives).
|
||||
|
||||
## Output format
|
||||
|
||||
The output of the monitoring looks like (with `--mca
|
||||
pml_monitoring_enable 2`):
|
||||
|
||||
```
|
||||
I 0 1 108 bytes 27 msgs sent
|
||||
E 0 1 1012 bytes 30 msgs sent
|
||||
E 0 2 23052 bytes 61 msgs sent
|
||||
I 1 2 104 bytes 26 msgs sent
|
||||
I 1 3 208 bytes 52 msgs sent
|
||||
E 1 0 860 bytes 24 msgs sent
|
||||
E 1 3 2552 bytes 56 msgs sent
|
||||
I 2 3 104 bytes 26 msgs sent
|
||||
E 2 0 22804 bytes 49 msgs sent
|
||||
E 2 3 860 bytes 24 msgs sent
|
||||
I 3 0 104 bytes 26 msgs sent
|
||||
I 3 1 204 bytes 51 msgs sent
|
||||
E 3 1 2304 bytes 44 msgs sent
|
||||
E 3 2 860 bytes 24 msgs sent
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
1. the first column distinguishes internal (I) and external (E) tags.
|
||||
1. the second column is the sender rank
|
||||
1. the third column is the receiver rank
|
||||
1. the fourth column is the number of bytes sent
|
||||
1. the last column is the number of messages.
|
||||
|
||||
In this example, process 0 has sent 27 messages to process 1 using
point-to-point calls for 108 bytes, and 30 messages with collectives and
protocol-related communication for 1,012 bytes to process 1.
|
||||
|
||||
If the monitoring was called with `--mca pml_monitoring_enable 1`,
everything is aggregated under the internal tags.  With the above
example, you have:
|
||||
|
||||
```
|
||||
I 0 1 1120 bytes 57 msgs sent
|
||||
I 0 2 23052 bytes 61 msgs sent
|
||||
I 1 0 860 bytes 24 msgs sent
|
||||
I 1 2 104 bytes 26 msgs sent
|
||||
I 1 3 2760 bytes 108 msgs sent
|
||||
I 2 0 22804 bytes 49 msgs sent
|
||||
I 2 3 964 bytes 50 msgs sent
|
||||
I 3 0 104 bytes 26 msgs sent
|
||||
I 3 1 2508 bytes 95 msgs sent
|
||||
I 3 2 860 bytes 24 msgs sent
|
||||
```
|
||||
|
||||
## Monitoring phases
|
||||
|
||||
If one wants to monitor phases of the application, it is possible to
|
||||
flush the monitoring at the application level. In this case all the
|
||||
monitoring since the last flush is stored by every process in a file.
|
||||
|
||||
An example of how to flush such monitoring is given in
|
||||
`test/monitoring/monitoring_test.c`.
|
||||
|
||||
Moreover, all the different flushed phases are aggregated at runtime
|
||||
and output at the end of the application as described above.
|
||||
|
||||
## Example
|
||||
|
||||
A working example is given in `test/monitoring/monitoring_test.c`.  It
features `MPI_COMM_WORLD` monitoring, sub-communicator monitoring,
collective and point-to-point communication monitoring, and phase
monitoring.
|
||||
|
||||
To compile:
|
||||
|
||||
```
|
||||
shell$ make monitoring_test
|
||||
```
|
||||
|
||||
## Helper scripts
|
||||
|
||||
Two perl scripts are provided in test/monitoring:
|
||||
|
||||
1. `aggregate_profile.pl` aggregates the monitoring phases of
   different processes.  This script aggregates the profiles generated
   by the `flush_monitoring` function.

   The files need to be in the given format: `name_<phase_id>_<process_id>`.
   They are then aggregated by phases.
   If one needs the profile of all the phases, one can concatenate the
   different files, or use the output of the monitoring system done at
   `MPI_Finalize`.  In the example, it should be called as:
|
||||
```
|
||||
./aggregate_profile.pl prof/phase to generate
|
||||
prof/phase_1.prof
|
||||
prof/phase_2.prof
|
||||
```
|
||||
|
||||
1. `profile2mat.pl` transforms the monitoring output into a
   communication matrix.  It takes a profile file and aggregates all the
   recorded communicators into matrices.  It generates a matrix for
   the number of messages (msg), for the total bytes transmitted
   (size), and for the average number of bytes per message (avg).
|
||||
|
||||
The output matrix is symmetric.
|
||||
|
||||
For instance, the provided examples store phases output in `./prof`:
|
||||
|
||||
```
|
||||
shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
|
||||
```
|
||||
|
||||
Running it should provide the following output:
|
||||
|
||||
```
|
||||
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
|
||||
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
|
||||
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
|
||||
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
|
||||
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
|
||||
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
|
||||
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
|
||||
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
|
||||
I 2 3 104 bytes 26 msgs sent
|
||||
E 2 0 22804 bytes 49 msgs sent
|
||||
E 2 3 860 bytes 24 msgs sent
|
||||
I 3 0 104 bytes 26 msgs sent
|
||||
I 3 1 204 bytes 51 msgs sent
|
||||
E 3 1 2304 bytes 44 msgs sent
|
||||
E 3 2 860 bytes 24 msgs sent
|
||||
I 0 1 108 bytes 27 msgs sent
|
||||
E 0 1 1012 bytes 30 msgs sent
|
||||
E 0 2 23052 bytes 61 msgs sent
|
||||
I 1 2 104 bytes 26 msgs sent
|
||||
I 1 3 208 bytes 52 msgs sent
|
||||
E 1 0 860 bytes 24 msgs sent
|
||||
E 1 3 2552 bytes 56 msgs sent
|
||||
```
|
||||
|
||||
You can then parse the phases with:
|
||||
|
||||
```
|
||||
shell$ ./aggregate_profile.pl prof/phase
|
||||
Building prof/phase_1.prof
|
||||
Building prof/phase_2.prof
|
||||
```
|
||||
|
||||
And you can build the different communication matrices of phase 1
|
||||
with:
|
||||
|
||||
```
|
||||
shell$ ./profile2mat.pl prof/phase_1.prof
|
||||
prof/phase_1.prof -> all
|
||||
prof/phase_1_size_all.mat
|
||||
prof/phase_1_msg_all.mat
|
||||
prof/phase_1_avg_all.mat
|
||||
|
||||
prof/phase_1.prof -> external
|
||||
prof/phase_1_size_external.mat
|
||||
prof/phase_1_msg_external.mat
|
||||
prof/phase_1_avg_external.mat
|
||||
|
||||
prof/phase_1.prof -> internal
|
||||
prof/phase_1_size_internal.mat
|
||||
prof/phase_1_msg_internal.mat
|
||||
prof/phase_1_avg_internal.mat
|
||||
```
|
||||
|
||||
## Authors
|
||||
|
||||
Designed by George Bosilca <bosilca@icl.utk.edu> and
|
||||
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>
|
@ -1,340 +0,0 @@
|
||||
OFI MTL:
|
||||
--------
|
||||
The OFI MTL supports Libfabric (a.k.a. Open Fabrics Interfaces OFI,
|
||||
https://ofiwg.github.io/libfabric/) tagged APIs (fi_tagged(3)). At
|
||||
initialization time, the MTL queries libfabric for providers supporting tag matching
|
||||
(fi_getinfo(3)). Libfabric will return a list of providers that satisfy the requested
|
||||
capabilities, having the most performant one at the top of the list.
|
||||
The user may modify the OFI provider selection with mca parameters
|
||||
mtl_ofi_provider_include or mtl_ofi_provider_exclude.
|
||||
|
||||
PROGRESS:
|
||||
---------
|
||||
The MTL registers a progress function to opal_progress. There is currently
|
||||
no support for asynchronous progress. The progress function reads multiple events
|
||||
from the OFI provider Completion Queue (CQ) per iteration (defaults to 100, can be
|
||||
modified with the mca mtl_ofi_progress_event_cnt) and iterates until the
|
||||
completion queue is drained.
|
||||
|
||||
COMPLETIONS:
|
||||
------------
|
||||
Each operation uses a request type ompi_mtl_ofi_request_t which includes a reference
|
||||
to an operation specific completion callback, an MPI request, and a context. The
|
||||
context (fi_context) is used to map completion events with MPI_requests when reading the
|
||||
CQ.
|
||||
|
||||
OFI TAG:
|
||||
--------
|
||||
MPI needs to send 96 bits of information per message (32 bits communicator id,
|
||||
32 bits source rank, 32 bits MPI tag) but OFI only offers 64 bits tags. In
|
||||
addition, the OFI MTL uses 2 bits of the OFI tag for the synchronous send protocol.
|
||||
Therefore, there are only 62 bits available in the OFI tag for message usage. The
|
||||
OFI MTL offers the mtl_ofi_tag_mode mca parameter with 4 modes to address this:
|
||||
|
||||
"auto" (Default):
|
||||
After the OFI provider is selected, a runtime check is performed to assess
|
||||
FI_REMOTE_CQ_DATA and FI_DIRECTED_RECV support (see fi_tagged(3), fi_msg(2)
|
||||
and fi_getinfo(3)). If supported, "ofi_tag_full" is used. If not supported,
|
||||
fall back to "ofi_tag_1".
|
||||
|
||||
"ofi_tag_1":
|
||||
For providers that do not support FI_REMOTE_CQ_DATA, the OFI MTL will
|
||||
trim the fields (Communicator ID, Source Rank, MPI tag) to make them fit the 62
|
||||
bits available bit in the OFI tag. There are two options available with different
|
||||
number of bits for the Communicator ID and MPI tag fields. This tag distribution
|
||||
offers: 12 bits for Communicator ID (max Communicator ID 4,095) subject to
|
||||
provider reserved bits (see mem_tag_format below), 18 bits for Source Rank (max
|
||||
Source Rank 262,143), 32 bits for MPI tag (max MPI tag is INT_MAX).
|
||||
|
||||
"ofi_tag_2":
|
||||
Same as 2 "ofi_tag_1" but offering a different OFI tag distribution for
|
||||
applications that may require a greater number of supported Communicators at the
|
||||
expense of fewer MPI tag bits. This tag distribution offers: 24 bits for
|
||||
Communicator ID (max Communicator ED 16,777,215. See mem_tag_format below), 18
|
||||
bits for Source Rank (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
|
||||
524,287).
|
||||
|
||||
"ofi_tag_full":
|
||||
For executions that cannot accept trimming source rank or MPI tag, this mode sends
|
||||
source rank for each message in the CQ DATA. The Source Rank is made available at
|
||||
the remote process CQ (FI_CQ_FORMAT_TAGGED is used, see fi_cq(3)) at the completion
|
||||
of the matching receive operation. Since the minimum size for FI_REMOTE_CQ_DATA
|
||||
is 32 bits, the Source Rank fits with no limitations. The OFI tag is used for the
|
||||
Communicator id (28 bits, max Communicator ID 268,435,455. See mem_tag_format below),
|
||||
and the MPI tag (max MPI tag is INT_MAX). If this mode is selected by the user
|
||||
and FI_REMOTE_CQ_DATA or FI_DIRECTED_RECV are not supported, the execution will abort.
|
||||
|
||||
mem_tag_format (fi_endpoint(3))
|
||||
Some providers can reserve the higher order bits from the OFI tag for internal purposes.
|
||||
This is signaled in mem_tag_format (see fi_endpoint(3)) by setting higher order bits
|
||||
to zero. In such cases, the OFI MTL will reduce the number of communicator ids supported
|
||||
by reducing the bits available for the communicator ID field in the OFI tag.
|
||||
|
||||
SCALABLE ENDPOINTS:
|
||||
-------------------
|
||||
OFI MTL supports OFI Scalable Endpoints (SEP) feature as a means to improve
|
||||
multi-threaded application throughput and message rate. Currently the feature
|
||||
is designed to utilize multiple TX/RX contexts exposed by the OFI provider in
|
||||
conjunction with a multi-communicator MPI application model. Therefore, new OFI
|
||||
contexts are created as and when communicators are duplicated in a lazy fashion
|
||||
instead of creating them all at once during init time and this approach also
|
||||
favours only creating as many contexts as needed.
|
||||
|
||||
1. Multi-communicator model:
|
||||
With this approach, the MPI application is requried to first duplicate
|
||||
the communicators it wants to use with MPI operations (ideally creating
|
||||
as many communicators as the number of threads it wants to use to call
|
||||
into MPI). The duplicated communicators are then used by the
|
||||
corresponding threads to perform MPI operations. A possible usage
|
||||
scenario could be in an MPI + OMP application as follows
|
||||
(example limited to 2 ranks):
|
||||
|
||||
MPI_Comm dup_comm[n];
|
||||
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
|
||||
for (i = 0; i < n; i++) {
|
||||
MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
|
||||
}
|
||||
if (rank == 0) {
|
||||
#pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
|
||||
for (i = 0; i < n ; i++) {
|
||||
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
|
||||
1, MSG_TAG, dup_comm[i]);
|
||||
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
|
||||
1, MSG_TAG, dup_comm[i], &status);
|
||||
}
|
||||
} else if (rank == 1) {
|
||||
#pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
|
||||
for (i = 0; i < n ; i++) {
|
||||
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
|
||||
0, MSG_TAG, dup_comm[i], &status);
|
||||
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
|
||||
0, MSG_TAG, dup_comm[i]);
|
||||
}
|
||||
}
|
||||
|
||||
2. MCA variables:
|
||||
To utilize the feature, the following MCA variables need to be set:
|
||||
mtl_ofi_enable_sep:
|
||||
This MCA variable needs to be set to enable the use of Scalable Endpoints (SEP)
|
||||
feature in the OFI MTL. The underlying provider is also checked to ensure the
|
||||
feature is supported. If the provider chosen does not support it, user needs
|
||||
to either set this variable to 0 or select a different provider which supports
|
||||
the feature.
|
||||
For single-threaded applications one OFI context is sufficient, so OFI SEPs
|
||||
may not add benefit.
|
||||
Note that mtl_ofi_thread_grouping (see below) needs to be enabled to use the
|
||||
different OFI SEP contexts. Otherwise, only one context (ctxt 0) will be used.
|
||||
|
||||
Default: 0
|
||||
|
||||
Command-line syntax:
|
||||
"-mca mtl_ofi_enable_sep 1"
|
||||
|
||||
mtl_ofi_thread_grouping:
|
||||
Turn Thread Grouping feature on. This is needed to use the Multi-communicator
|
||||
model explained above. This means that the OFI MTL will use the communicator
|
||||
ID to decide the SEP contexts to be used by the thread. In this way, each
|
||||
thread will have direct access to different OFI resources. If disabled,
|
||||
only context 0 will be used.
|
||||
Requires mtl_ofi_enable_sep to be set to 1.
|
||||
|
||||
Default: 0
|
||||
|
||||
It is not recommended to set the MCA variable for:
|
||||
- Multi-threaded MPI applications not following multi-communicator approach.
|
||||
- Applications that have multiple threads using a single communicator as
|
||||
it may degrade performance.
|
||||
|
||||
Command-line syntax:
|
||||
"-mca mtl_ofi_thread_grouping 1"
|
||||
|
||||
mtl_ofi_num_ctxts:
|
||||
This MCA variable allows user to set the number of OFI SEP contexts the
|
||||
application expects to use. For multi-threaded applications using Thread
|
||||
Grouping feature, this number should be set to the number of user threads
|
||||
that will call into MPI. This variable will only have effect if
|
||||
mtl_ofi_enable_sep is set to 1.
|
||||
|
||||
Default: 1
|
||||
|
||||
Command-line syntax:
|
||||
"-mca mtl_ofi_num_ctxts N" [ N: number of OFI contexts required by
|
||||
application ]
|
||||
|
||||
3. Notes on performance:
|
||||
- OFI MTL will create as many TX/RX contexts as set by MCA mtl_ofi_num_ctxts.
|
||||
The number of contexts that can be created is also limited by the underlying
|
||||
provider as each provider may have different thresholds. Once the threshold
|
||||
is exceeded, contexts are used in a round-robin fashion which leads to
|
||||
resource sharing among threads. Therefore locks are required to guard
|
||||
against race conditions. For performance, it is recommended to have
|
||||
|
||||
Number of threads = Number of communicators = Number of contexts
|
||||
|
||||
For example, when using PSM2 provider, the number of contexts is dictated
|
||||
by the Intel Omni-Path HFI1 driver module.
|
||||
|
||||
- OPAL layer allows for multiple threads to enter progress simultaneously. To
|
||||
enable this feature, user needs to set MCA variable
|
||||
"max_thread_in_progress". When using Thread Grouping feature, it is
|
||||
recommended to set this MCA parameter to the number of threads expected to
|
||||
call into MPI as it provides performance benefits.
|
||||
|
||||
Command-line syntax:
|
||||
"-mca opal_max_thread_in_progress N" [ N: number of threads expected to
|
||||
make MPI calls ]
|
||||
Default: 1
|
||||
|
||||
- For applications using a single thread with multiple communicators and MCA
|
||||
variable "mtl_ofi_thread_grouping" set to 1, the MTL will use multiple
|
||||
contexts, but the benefits may be negligible as only one thread is driving
|
||||
progress.
|
||||
|
||||
SPECIALIZED FUNCTIONS:
|
||||
-------------------
|
||||
To improve performance when calling message passing APIs in the OFI mtl
|
||||
specialized functions are generated at compile time that eliminate all the
|
||||
if conditionals that can be determined at init and don't need to be
|
||||
queried again during the critical path. These functions are generated by
|
||||
perl scripts during make which generate functions and symbols for every
|
||||
combination of flags for each function.
|
||||
|
||||
1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
|
||||
To add a new flag to an existing specialized function for handling cases
|
||||
where different OFI providers may or may not support a particular feature,
|
||||
then you must follow these steps:
|
||||
1) Update the "_generic" function in mtl_ofi.h with the new flag and
|
||||
the if conditionals to read the new value.
|
||||
2) Update the *.pm file corresponding to the function with the new flag in:
|
||||
gen_funcs(), gen_*_function(), & gen_*_sym_init()
|
||||
3) Update mtl_ofi_opt.h with:
|
||||
The new flag as #define NEW_FLAG_TYPES #NUMBER_OF_STATES
|
||||
example: #define OFI_CQ_DATA 2 (only has TRUE/FALSE states)
|
||||
Update the function's types with:
|
||||
#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]
|
||||
|
||||
2. ADDING A NEW FUNCTION FOR SPECIALIZATION:
|
||||
To add a new function to be specialized you must
|
||||
follow these steps:
|
||||
1) Create a new mtl_ofi_"function_name"_opt.pm based off opt_common/mtl_ofi_opt.pm.template
|
||||
2) Add new .pm file to generated_source_modules in Makefile.am
|
||||
3) Add .c file to generated_sources in Makefile.am named the same as the corresponding .pm file
|
||||
4) Update existing or create function in mtl_ofi.h to _generic with new flags.
|
||||
5) Update mtl_ofi_opt.h with:
|
||||
a) New function types: #define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES]
|
||||
b) Add new function to the struct ompi_mtl_ofi_symtable:
|
||||
struct ompi_mtl_ofi_symtable {
|
||||
...
|
||||
int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
|
||||
}
|
||||
c) Add new symbol table init function definition:
|
||||
void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
|
||||
6) Add calls to init the new function in the symbol table and assign the function
|
||||
pointer to be used based off the flags in mtl_ofi_component.c:
|
||||
ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);
|
||||
ompi_mtl_ofi.base.mtl_FUNCTION =
|
||||
ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];
|
||||
|
||||
3. EXAMPLE SPECIALIZED FILE:
|
||||
The code below is an example of what is generated by the specialization
|
||||
scripts for use in the OFI mtl. This code specializes the blocking
|
||||
send functionality based on FI_REMOTE_CQ_DATA & OFI Scalable Endpoint support
|
||||
provided by an OFI Provider. Only one function and symbol is used during
|
||||
runtime based on if FI_REMOTE_CQ_DATA is supported and/or if OFI Scalable
|
||||
Endpoint support is enabled.
|
||||
/*
|
||||
* Copyright (c) 2013-2018 Intel, Inc. All rights reserved
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "mtl_ofi.h"
|
||||
|
||||
__opal_attribute_always_inline__ static inline int
|
||||
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
|
||||
struct ompi_communicator_t *comm,
|
||||
int dest,
|
||||
int tag,
|
||||
struct opal_convertor_t *convertor,
|
||||
mca_pml_base_send_mode_t mode)
|
||||
{
|
||||
const bool OFI_CQ_DATA = false;
|
||||
const bool OFI_SCEP_EPS = false;
|
||||
|
||||
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
|
||||
convertor, mode,
|
||||
OFI_CQ_DATA, OFI_SCEP_EPS);
|
||||
}
|
||||
|
||||
__opal_attribute_always_inline__ static inline int
|
||||
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
|
||||
struct ompi_communicator_t *comm,
|
||||
int dest,
|
||||
int tag,
|
||||
struct opal_convertor_t *convertor,
|
||||
mca_pml_base_send_mode_t mode)
|
||||
{
|
||||
const bool OFI_CQ_DATA = false;
|
||||
const bool OFI_SCEP_EPS = true;
|
||||
|
||||
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
|
||||
convertor, mode,
|
||||
OFI_CQ_DATA, OFI_SCEP_EPS);
|
||||
}
|
||||
|
||||
__opal_attribute_always_inline__ static inline int
|
||||
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
|
||||
struct ompi_communicator_t *comm,
|
||||
int dest,
|
||||
int tag,
|
||||
struct opal_convertor_t *convertor,
|
||||
mca_pml_base_send_mode_t mode)
|
||||
{
|
||||
const bool OFI_CQ_DATA = true;
|
||||
const bool OFI_SCEP_EPS = false;
|
||||
|
||||
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
|
||||
convertor, mode,
|
||||
OFI_CQ_DATA, OFI_SCEP_EPS);
|
||||
}
|
||||
|
||||
__opal_attribute_always_inline__ static inline int
|
||||
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
|
||||
struct ompi_communicator_t *comm,
|
||||
int dest,
|
||||
int tag,
|
||||
struct opal_convertor_t *convertor,
|
||||
mca_pml_base_send_mode_t mode)
|
||||
{
|
||||
const bool OFI_CQ_DATA = true;
|
||||
const bool OFI_SCEP_EPS = true;
|
||||
|
||||
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
|
||||
convertor, mode,
|
||||
OFI_CQ_DATA, OFI_SCEP_EPS);
|
||||
}
|
||||
|
||||
void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
|
||||
{
|
||||
|
||||
sym_table->ompi_mtl_ofi_send[false][false]
|
||||
= ompi_mtl_ofi_send_false_false;
|
||||
|
||||
|
||||
sym_table->ompi_mtl_ofi_send[false][true]
|
||||
= ompi_mtl_ofi_send_false_true;
|
||||
|
||||
|
||||
sym_table->ompi_mtl_ofi_send[true][false]
|
||||
= ompi_mtl_ofi_send_true_false;
|
||||
|
||||
|
||||
sym_table->ompi_mtl_ofi_send[true][true]
|
||||
= ompi_mtl_ofi_send_true_true;
|
||||
|
||||
}
|
||||
###
|
ompi/mca/mtl/ofi/README.md (new file, 368 lines)
@ -0,0 +1,368 @@
|
||||
# Open MPI OFI MTL
|
||||
|
||||
The OFI MTL supports Libfabric (a.k.a., [Open Fabrics Interfaces
|
||||
OFI](https://ofiwg.github.io/libfabric/)) tagged APIs
|
||||
(`fi_tagged(3)`). At initialization time, the MTL queries libfabric
|
||||
for providers supporting tag matching (`fi_getinfo(3)`). Libfabric
|
||||
will return a list of providers that satisfy the requested
|
||||
capabilities, having the most performant one at the top of the list.
|
||||
The user may modify the OFI provider selection with the MCA parameters
|
||||
`mtl_ofi_provider_include` or `mtl_ofi_provider_exclude`.
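
For example, a command line that forces the OFI MTL and restricts the
selection to a single provider might look like the following sketch
(the provider name `psm2` and `./a.out` are placeholders; use whatever
provider exists on your system):

```
shell$ mpirun --mca pml cm --mca mtl ofi \
    --mca mtl_ofi_provider_include psm2 -np 2 ./a.out
```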
|
||||
|
||||
## PROGRESS
|
||||
|
||||
The MTL registers a progress function to `opal_progress`. There is
|
||||
currently no support for asynchronous progress. The progress function
|
||||
reads multiple events from the OFI provider Completion Queue (CQ) per
|
||||
iteration (defaults to 100; this can be modified with the MCA parameter
`mtl_ofi_progress_event_cnt`) and iterates until the completion queue is
|
||||
drained.
|
||||
|
||||
## COMPLETIONS
|
||||
|
||||
Each operation uses a request type `ompi_mtl_ofi_request_t` which
|
||||
includes a reference to an operation specific completion callback, an
|
||||
MPI request, and a context. The context (`fi_context`) is used to map
|
||||
completion events with `MPI_requests` when reading the CQ.
|
||||
|
||||
## OFI TAG
|
||||
|
||||
MPI needs to send 96 bits of information per message (32 bits
communicator ID, 32 bits source rank, 32 bits MPI tag), but OFI only
offers 64-bit tags.  In addition, the OFI MTL uses 2 bits of the OFI
tag for the synchronous send protocol.  Therefore, there are only 62
bits available in the OFI tag for message usage.  The OFI MTL offers
the `mtl_ofi_tag_mode` MCA parameter with 4 modes to address this (an
example of selecting a mode is shown after this list):
|
||||
|
||||
* `auto` (Default):
|
||||
After the OFI provider is selected, a runtime check is performed to
|
||||
assess `FI_REMOTE_CQ_DATA` and `FI_DIRECTED_RECV` support (see
|
||||
`fi_tagged(3)`, `fi_msg(2)` and `fi_getinfo(3)`). If supported,
|
||||
`ofi_tag_full` is used. If not supported, fall back to `ofi_tag_1`.
|
||||
|
||||
* `ofi_tag_1`:
|
||||
For providers that do not support `FI_REMOTE_CQ_DATA`, the OFI MTL
|
||||
will trim the fields (Communicator ID, Source Rank, MPI tag) to make
|
||||
them fit within the 62 available bits in the OFI tag.  There are two
|
||||
options available with different number of bits for the Communicator
|
||||
ID and MPI tag fields. This tag distribution offers: 12 bits for
|
||||
Communicator ID (max Communicator ID 4,095) subject to provider
|
||||
reserved bits (see `mem_tag_format` below), 18 bits for Source Rank
|
||||
(max Source Rank 262,143), 32 bits for MPI tag (max MPI tag is
|
||||
`INT_MAX`).
|
||||
|
||||
* `ofi_tag_2`:
|
||||
Same as `ofi_tag_1`, but offering a different OFI tag distribution
|
||||
for applications that may require a greater number of supported
|
||||
Communicators at the expense of fewer MPI tag bits. This tag
|
||||
distribution offers: 24 bits for Communicator ID (max Communicator
|
||||
ED 16,777,215. See mem_tag_format below), 18 bits for Source Rank
|
||||
(max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
|
||||
524,287).
|
||||
|
||||
* `ofi_tag_full`:
|
||||
For executions that cannot accept trimming source rank or MPI tag,
|
||||
this mode sends source rank for each message in the CQ DATA. The
|
||||
Source Rank is made available at the remote process CQ
|
||||
(`FI_CQ_FORMAT_TAGGED` is used, see `fi_cq(3)`) at the completion of
|
||||
the matching receive operation. Since the minimum size for
|
||||
`FI_REMOTE_CQ_DATA` is 32 bits, the Source Rank fits with no
|
||||
limitations. The OFI tag is used for the Communicator id (28 bits,
|
||||
max Communicator ID 268,435,455. See `mem_tag_format` below), and
|
||||
the MPI tag (max MPI tag is `INT_MAX`). If this mode is selected by
|
||||
the user and `FI_REMOTE_CQ_DATA` or `FI_DIRECTED_RECV` are not
|
||||
supported, the execution will abort.
|
||||
|
||||
* `mem_tag_format` (`fi_endpoint(3)`)
|
||||
Some providers can reserve the higher order bits from the OFI tag
|
||||
for internal purposes. This is signaled in `mem_tag_format` (see
|
||||
`fi_endpoint(3)`) by setting higher order bits to zero. In such
|
||||
cases, the OFI MTL will reduce the number of communicator ids
|
||||
supported by reducing the bits available for the communicator ID
|
||||
field in the OFI tag.
|
||||
|
||||
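For illustration, the `ofi_tag_1` split described above could be
packed into a 64-bit OFI tag roughly as follows; this is a sketch
only, and the actual field order, protocol bits, and macro names used
by the MTL may differ:

```c
#include <stdint.h>

/* Illustrative packing for the ofi_tag_1 layout described above:
 * 2 protocol bits | 12 bits communicator ID | 18 bits source rank | 32 bits MPI tag.
 * The real MTL macros and field order may differ. */
#define SKETCH_CID_BITS   12
#define SKETCH_RANK_BITS  18
#define SKETCH_TAG_BITS   32

static inline uint64_t sketch_pack_tag(uint64_t proto, uint64_t cid,
                                       uint64_t rank, uint64_t mpi_tag)
{
    return (proto << (SKETCH_CID_BITS + SKETCH_RANK_BITS + SKETCH_TAG_BITS)) |
           (cid   << (SKETCH_RANK_BITS + SKETCH_TAG_BITS)) |
           (rank  << SKETCH_TAG_BITS) |
           (mpi_tag & 0xFFFFFFFFULL);
}

static inline uint64_t sketch_unpack_rank(uint64_t ofi_tag)
{
    return (ofi_tag >> SKETCH_TAG_BITS) & ((1ULL << SKETCH_RANK_BITS) - 1);
}
```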
## SCALABLE ENDPOINTS

The OFI MTL supports the OFI Scalable Endpoints (SEP) feature as a
means to improve multi-threaded application throughput and message
rate. Currently the feature is designed to utilize multiple TX/RX
contexts exposed by the OFI provider in conjunction with a
multi-communicator MPI application model. New OFI contexts are
therefore created lazily, as communicators are duplicated, instead of
all at once at init time; this approach also ensures that only as
many contexts as needed are created.

1. Multi-communicator model:

   With this approach, the MPI application is required to first
   duplicate the communicators it wants to use with MPI operations
   (ideally creating as many communicators as the number of threads it
   wants to use to call into MPI). The duplicated communicators are
   then used by the corresponding threads to perform MPI operations.
   A possible usage scenario could be an MPI + OpenMP application as
   follows (example limited to 2 ranks):

   ```c
   MPI_Comm dup_comm[n];
   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
   for (i = 0; i < n; i++) {
       MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
   }
   if (rank == 0) {
   #pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
       for (i = 0; i < n ; i++) {
           MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                    1, MSG_TAG, dup_comm[i]);
           MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                    1, MSG_TAG, dup_comm[i], &status);
       }
   } else if (rank == 1) {
   #pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
       for (i = 0; i < n ; i++) {
           MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                    0, MSG_TAG, dup_comm[i], &status);
           MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                    0, MSG_TAG, dup_comm[i]);
       }
   }
   ```
2. MCA variables:

   To utilize the feature, the following MCA variables need to be set:

   * `mtl_ofi_enable_sep`:
     This MCA variable needs to be set to enable the use of the
     Scalable Endpoints (SEP) feature in the OFI MTL. The underlying
     provider is also checked to ensure the feature is supported. If
     the chosen provider does not support it, the user needs to either
     set this variable to 0 or select a different provider which
     supports the feature. For single-threaded applications one OFI
     context is sufficient, so OFI SEPs may not add benefit. Note that
     `mtl_ofi_thread_grouping` (see below) needs to be enabled to use
     the different OFI SEP contexts. Otherwise, only one context
     (context 0) will be used.

     Default: 0

     Command-line syntax: `--mca mtl_ofi_enable_sep 1`

   * `mtl_ofi_thread_grouping`:
     Turns the Thread Grouping feature on. This is needed to use the
     multi-communicator model explained above. This means that the OFI
     MTL will use the communicator ID to decide which SEP contexts are
     to be used by the thread. In this way, each thread will have
     direct access to different OFI resources. If disabled, only
     context 0 will be used. Requires `mtl_ofi_enable_sep` to be set
     to 1.

     Default: 0

     It is not recommended to set this MCA variable for:

     * Multi-threaded MPI applications not following the
       multi-communicator approach.
     * Applications that have multiple threads using a single
       communicator, as it may degrade performance.

     Command-line syntax: `--mca mtl_ofi_thread_grouping 1`

   * `mtl_ofi_num_ctxts`:
     This MCA variable allows the user to set the number of OFI SEP
     contexts the application expects to use. For multi-threaded
     applications using the Thread Grouping feature, this number
     should be set to the number of user threads that will call into
     MPI. This variable will only have an effect if
     `mtl_ofi_enable_sep` is set to 1.

     Default: 1

     Command-line syntax: `--mca mtl_ofi_num_ctxts N` (`N`: number of
     OFI contexts required by the application)

3. Notes on performance:

   * The OFI MTL will create as many TX/RX contexts as set by the MCA
     parameter `mtl_ofi_num_ctxts`. The number of contexts that can be
     created is also limited by the underlying provider, as each
     provider may have different thresholds. Once the threshold is
     exceeded, contexts are used in a round-robin fashion, which leads
     to resource sharing among threads. Therefore locks are required
     to guard against race conditions. For performance, it is
     recommended to have

         Number of threads = Number of communicators = Number of contexts

     For example, when using the PSM2 provider, the number of contexts
     is dictated by the Intel Omni-Path HFI1 driver module.

   * The OPAL layer allows multiple threads to enter progress
     simultaneously. To enable this feature, the user needs to set the
     MCA variable `max_thread_in_progress`. When using the Thread
     Grouping feature, it is recommended to set this MCA parameter to
     the number of threads expected to call into MPI, as it provides
     performance benefits.

     Default: 1

     Command-line syntax: `--mca opal_max_thread_in_progress N` (`N`:
     number of threads expected to make MPI calls)

   * For applications using a single thread with multiple
     communicators and the MCA variable `mtl_ofi_thread_grouping` set
     to 1, the MTL will use multiple contexts, but the benefits may be
     negligible as only one thread is driving progress.
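Since these are ordinary MCA variables, their effective values can
also be inspected at run time through the MPI tool information
interface (`MPI_T`), which Open MPI uses to expose MCA parameters. A
minimal sketch (error handling omitted; assumes the named variable
exists in your build and holds an integer):

```c
#include <stdio.h>
#include <mpi.h>

/* Sketch: read the current value of an MCA variable (e.g.
 * mtl_ofi_enable_sep) through the MPI_T control-variable interface. */
static void print_cvar(const char *name)
{
    int idx, count, val, provided;
    MPI_T_cvar_handle handle;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    if (MPI_SUCCESS == MPI_T_cvar_get_index(name, &idx)) {
        MPI_T_cvar_handle_alloc(idx, NULL, &handle, &count);
        MPI_T_cvar_read(handle, &val);      /* assumes an integer variable */
        printf("%s = %d\n", name, val);
        MPI_T_cvar_handle_free(&handle);
    }
    MPI_T_finalize();
}
```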
## SPECIALIZED FUNCTIONS

To improve performance when calling message passing APIs in the OFI
MTL, specialized functions are generated at compile time that
eliminate all the if-conditionals that can be determined at init time
and do not need to be queried again during the critical path. These
functions are generated by Perl scripts during `make`, which generate
functions and symbols for every combination of flags for each
function.

1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:

   To add a new flag to an existing specialized function for handling
   cases where different OFI providers may or may not support a
   particular feature, follow these steps:

   1. Update the `_generic` function in `mtl_ofi.h` with the new flag
      and the if-conditionals to read the new value.
   1. Update the `*.pm` file corresponding to the function with the
      new flag in: `gen_funcs()`, `gen_*_function()`, &
      `gen_*_sym_init()`
   1. Update `mtl_ofi_opt.h` with:
      * The new flag as `#define NEW_FLAG_TYPES #NUMBER_OF_STATES`.
        Example: `#define OFI_CQ_DATA 2` (only has TRUE/FALSE states)
      * Update the function's types with:
        `#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]`

1. ADDING A NEW FUNCTION FOR SPECIALIZATION:

   To add a new function to be specialized, follow these steps:

   1. Create a new `mtl_ofi_<function_name>_opt.pm` based off
      `opt_common/mtl_ofi_opt.pm.template`
   1. Add the new `.pm` file to `generated_source_modules` in
      `Makefile.am`
   1. Add a `.c` file to `generated_sources` in `Makefile.am` named
      the same as the corresponding `.pm` file
   1. Update the existing function (or create a new one) in
      `mtl_ofi.h` as `_generic` with the new flags.
   1. Update `mtl_ofi_opt.h` with:
      1. The new function types:
         `#define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES]`
      1. Add the new function to the `struct ompi_mtl_ofi_symtable`:
         ```c
         struct ompi_mtl_ofi_symtable {
             ...
             int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
         }
         ```
      1. Add the new symbol table init function definition:
         ```c
         void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
         ```
   1. Add calls to init the new function in the symbol table and
      assign the function pointer to be used based off the flags in
      `mtl_ofi_component.c`:
      * `ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);`
      * `ompi_mtl_ofi.base.mtl_FUNCTION = ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];`
## EXAMPLE SPECIALIZED FILE

The code below is an example of what is generated by the
specialization scripts for use in the OFI MTL. This code specializes
the blocking send functionality based on `FI_REMOTE_CQ_DATA` & OFI
Scalable Endpoint support provided by an OFI provider. Only one
function and symbol is used during runtime, based on whether
`FI_REMOTE_CQ_DATA` is supported and/or OFI Scalable Endpoint support
is enabled.

```c
/*
 * Copyright (c) 2013-2018 Intel, Inc. All rights reserved
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "mtl_ofi.h"

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
                              struct ompi_communicator_t *comm,
                              int dest,
                              int tag,
                              struct opal_convertor_t *convertor,
                              mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
                            struct ompi_communicator_t *comm,
                            int dest,
                            int tag,
                            struct opal_convertor_t *convertor,
                            mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
    sym_table->ompi_mtl_ofi_send[false][false]
        = ompi_mtl_ofi_send_false_false;

    sym_table->ompi_mtl_ofi_send[false][true]
        = ompi_mtl_ofi_send_false_true;

    sym_table->ompi_mtl_ofi_send[true][false]
        = ompi_mtl_ofi_send_true_false;

    sym_table->ompi_mtl_ofi_send[true][true]
        = ompi_mtl_ofi_send_true_true;
}
```
@ -1,5 +1,3 @@
|
||||
Copyright 2009 Cisco Systems, Inc. All rights reserved.
|
||||
|
||||
This is a simple example op component meant to be a template /
|
||||
springboard for people to write their own op components. There are
|
||||
many different ways to write components and modules; this is but one
|
||||
@ -13,28 +11,26 @@ same end effect. Feel free to customize / simplify / strip out what
|
||||
you don't need from this example.
|
||||
|
||||
This example component supports a ficticious set of hardware that
|
||||
provides acceleation for the MPI_MAX and MPI_BXOR MPI_Ops. The
|
||||
provides acceleration for the `MPI_MAX` and `MPI_BXOR` `MPI_Ops`. The
|
||||
ficticious hardware has multiple versions, too: some versions only
|
||||
support single precision floating point types for MAX and single
|
||||
precision integer types for BXOR, whereas later versions support both
|
||||
single and double precision floating point types for MAX and both
|
||||
single and double precision integer types for BXOR. Hence, this
|
||||
example walks through setting up particular MPI_Op function pointers
|
||||
based on:
|
||||
support single precision floating point types for `MAX` and single
|
||||
precision integer types for `BXOR`, whereas later versions support
|
||||
both single and double precision floating point types for `MAX` and
|
||||
both single and double precision integer types for `BXOR`. Hence,
|
||||
this example walks through setting up particular `MPI_Op` function
|
||||
pointers based on:
|
||||
|
||||
a) hardware availability (e.g., does the node where this MPI process
|
||||
1. hardware availability (e.g., does the node where this MPI process
|
||||
is running have the relevant hardware/resources?)
|
||||
|
||||
b) MPI_Op (e.g., in this example, only MPI_MAX and MPI_BXOR are
|
||||
1. `MPI_Op` (e.g., in this example, only `MPI_MAX` and `MPI_BXOR` are
|
||||
supported)
|
||||
|
||||
c) datatype (e.g., single/double precision floating point for MAX and
|
||||
single/double precision integer for BXOR)
|
||||
1. datatype (e.g., single/double precision floating point for `MAX`
|
||||
and single/double precision integer for `BXOR`)
|
||||
|
||||
Additionally, there are other considerations that should be factored
|
||||
in at run time. Hardware accelerators are great, but they do induce
|
||||
overhead -- for example, some accelerator hardware require registered
|
||||
memory. So even if a particular MPI_Op and datatype are supported, it
|
||||
memory. So even if a particular `MPI_Op` and datatype are supported, it
|
||||
may not be worthwhile to use the hardware unless the amount of data to
|
||||
be processed is "big enough" (meaning that the cost of the
|
||||
registration and/or copy-in/copy-out is ameliorated) or the memory to
|
||||
@ -47,57 +43,65 @@ failover strategy is well-supported by the op framework; during the
|
||||
query process, a component can "stack" itself similar to how POSIX
|
||||
signal handlers can be stacked. Specifically, op components can cache
|
||||
other implementations of operation functions for use in the case of
|
||||
failover. The MAX and BXOR module implementations show one way of
|
||||
failover. The `MAX` and `BXOR` module implementations show one way of
|
||||
using this method.
|
||||
|
||||
Here's a listing of the files in the example component and what they
|
||||
do:
|
||||
|
||||
- configure.m4: Tests that get slurped into OMPI's top-level configure
|
||||
script to determine whether this component will be built or not.
|
||||
- Makefile.am: Automake makefile that builds this component.
|
||||
- op_example_component.c: The main "component" source file.
|
||||
- op_example_module.c: The main "module" source file.
|
||||
- op_example.h: information that is shared between the .c files.
|
||||
- .ompi_ignore: the presence of this file causes OMPI's autogen.pl to
|
||||
skip this component in the configure/build/install process (see
|
||||
- `configure.m4`: Tests that get slurped into OMPI's top-level
|
||||
`configure` script to determine whether this component will be built
|
||||
or not.
|
||||
- `Makefile.am`: Automake makefile that builds this component.
|
||||
- `op_example_component.c`: The main "component" source file.
|
||||
- `op_example_module.c`: The main "module" source file.
|
||||
- `op_example.h`: information that is shared between the `.c` files.
|
||||
- `.ompi_ignore`: the presence of this file causes OMPI's `autogen.pl`
|
||||
to skip this component in the configure/build/install process (see
|
||||
below).
|
||||
|
||||
To use this example as a template for your component (assume your new
|
||||
component is named "foo"):
|
||||
component is named `foo`):
|
||||
|
||||
```
|
||||
shell$ cd (top_ompi_dir)/ompi/mca/op
|
||||
shell$ cp -r example foo
|
||||
shell$ cd foo
|
||||
```
|
||||
|
||||
Remove the .ompi_ignore file (which makes the component "visible" to
|
||||
all developers) *OR* add an .ompi_unignore file with one username per
|
||||
line (as reported by `whoami`). OMPI's autogen.pl will skip any
|
||||
component with a .ompi_ignore file *unless* there is also an
|
||||
Remove the `.ompi_ignore` file (which makes the component "visible" to
|
||||
all developers) *OR* add an `.ompi_unignore` file with one username per
|
||||
line (as reported by `whoami`). OMPI's `autogen.pl` will skip any
|
||||
component with a `.ompi_ignore` file *unless* there is also an
|
||||
.ompi_unignore file containing your user ID in it. This is a handy
|
||||
mechanism to have a component in the tree but have it not built / used
|
||||
by most other developers:
|
||||
|
||||
```
|
||||
shell$ rm .ompi_ignore
|
||||
*OR*
|
||||
shell$ whoami > .ompi_unignore
|
||||
```
|
||||
|
||||
Now rename any file that contains "example" in the filename to have
|
||||
"foo", instead. For example:
|
||||
Now rename any file that contains `example` in the filename to have
|
||||
`foo`, instead. For example:
|
||||
|
||||
```
|
||||
shell$ mv op_example_component.c op_foo_component.c
|
||||
#...etc.
|
||||
```
|
||||
|
||||
Now edit all the files and s/example/foo/gi. Specifically, replace
|
||||
all instances of "example" with "foo" in all function names, type
|
||||
names, header #defines, strings, and global variables.
|
||||
Now edit all the files and `s/example/foo/gi`. Specifically, replace
|
||||
all instances of `example` with `foo` in all function names, type
|
||||
names, header `#defines`, strings, and global variables.
|
||||
|
||||
Now your component should be fully functional (although entirely
|
||||
renamed as "foo" instead of "example"). You can go to the top-level
|
||||
OMPI directory and run "autogen.pl" (which will find your component
|
||||
and att it to the configure/build process) and then "configure ..."
|
||||
and "make ..." as normal.
|
||||
renamed as `foo` instead of `example`). You can go to the top-level
|
||||
OMPI directory and run `autogen.pl` (which will find your component
|
||||
and add it to the configure/build process) and then `configure ...`
|
||||
and `make ...` as normal.
|
||||
|
||||
```
|
||||
shell$ cd (top_ompi_dir)
|
||||
shell$ ./autogen.pl
|
||||
# ...lots of output...
|
||||
@ -107,19 +111,21 @@ shell$ make -j 4 all
|
||||
# ...lots of output...
|
||||
shell$ make install
|
||||
# ...lots of output...
|
||||
```
|
||||
|
||||
After you have installed Open MPI, running "ompi_info" should show
|
||||
your "foo" component in the output.
|
||||
After you have installed Open MPI, running `ompi_info` should show
|
||||
your `foo` component in the output.
|
||||
|
||||
```
|
||||
shell$ ompi_info | grep op:
|
||||
MCA op: example (MCA v2.0, API v1.0, Component v1.4)
|
||||
MCA op: foo (MCA v2.0, API v1.0, Component v1.4)
|
||||
shell$
|
||||
```
|
||||
|
||||
If you do not see your foo component, check the above steps, and check
|
||||
the output of autogen.pl, configure, and make to ensure that "foo" was
|
||||
found, configured, and built successfully.
|
||||
|
||||
Once ompi_info sees your component, start editing the "foo" component
|
||||
files in a meaningful way.
|
||||
If you do not see your `foo` component, check the above steps, and
|
||||
check the output of `autogen.pl`, `configure`, and `make` to ensure
|
||||
that `foo` was found, configured, and built successfully.
|
||||
|
||||
Once `ompi_info` sees your component, start editing the `foo`
|
||||
component files in a meaningful way.
|
@ -10,3 +10,5 @@
#

SUBDIRS = java c

EXTRA_DIST = README.md
@ -1,26 +1,27 @@
|
||||
***************************************************************************
|
||||
# Open MPI Java bindings
|
||||
|
||||
Note about the Open MPI Java bindings
|
||||
|
||||
The Java bindings in this directory are not part of the MPI specification,
|
||||
as noted in the README.JAVA.txt file in the root directory. That file also
|
||||
contains some information regarding the installation and use of the Java
|
||||
bindings. Further details can be found in the paper [1].
|
||||
The Java bindings in this directory are not part of the MPI
|
||||
specification, as noted in the README.JAVA.md file in the root
|
||||
directory. That file also contains some information regarding the
|
||||
installation and use of the Java bindings. Further details can be
|
||||
found in the paper [1].
|
||||
|
||||
We originally took the code from the mpiJava project [2] as starting point
|
||||
for our developments, but we have pretty much rewritten 100% of it. The
|
||||
original copyrights and license terms of mpiJava are listed below.
|
||||
|
||||
[1] O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
|
||||
implementation of Java bindings in Open MPI". Parallel Comput.
|
||||
59: 1-20 (2016).
|
||||
1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
|
||||
implementation of Java bindings in Open MPI". Parallel Comput.
|
||||
59: 1-20 (2016).
|
||||
1. M. Baker et al. "mpiJava: An object-oriented Java interface to
|
||||
MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
|
||||
pp. 748-762, Springer (1999).
|
||||
|
||||
[2] M. Baker et al. "mpiJava: An object-oriented Java interface to
|
||||
MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
|
||||
pp. 748-762, Springer (1999).
|
||||
|
||||
***************************************************************************
|
||||
## Original citation
|
||||
|
||||
```
|
||||
mpiJava - A Java Interface to MPI
|
||||
---------------------------------
|
||||
Copyright 2003
|
||||
@ -39,6 +40,7 @@ original copyrights and license terms of mpiJava are listed below.
|
||||
(Bugfixes/Additions, CMake based configure/build)
|
||||
Blasius Czink
|
||||
HLRS, University of Stuttgart
|
||||
```
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
@ -1,4 +1,5 @@
|
||||
Symbol conventions for Open MPI extensions
|
||||
# Symbol conventions for Open MPI extensions
|
||||
|
||||
Last updated: January 2015
|
||||
|
||||
This README provides some rule-of-thumb guidance for how to name
|
||||
@ -15,26 +16,22 @@ Generally speaking, there are usually three kinds of extensions:
|
||||
3. Functionality that is strongly expected to be in an upcoming
|
||||
version of the MPI specification.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
## Case 1
|
||||
|
||||
Case 1
|
||||
|
||||
The OMPI_Paffinity_str() extension is a good example of this type: it
|
||||
is solely intended to be for Open MPI. It will likely never be pushed
|
||||
to other MPI implementations, and it will likely never be pushed to
|
||||
the MPI Forum.
|
||||
The `OMPI_Paffinity_str()` extension is a good example of this type:
|
||||
it is solely intended to be for Open MPI. It will likely never be
|
||||
pushed to other MPI implementations, and it will likely never be
|
||||
pushed to the MPI Forum.
|
||||
|
||||
It's Open MPI-specific functionality, through and through.
|
||||
|
||||
Public symbols of this type of functionality should be named with an
|
||||
"OMPI_" prefix to emphasize its Open MPI-specific nature. To be
|
||||
clear: the "OMPI_" prefix clearly identifies parts of user code that
|
||||
`OMPI_` prefix to emphasize its Open MPI-specific nature. To be
|
||||
clear: the `OMPI_` prefix clearly identifies parts of user code that
|
||||
are relying on Open MPI (and likely need to be surrounded with #if
|
||||
OPEN_MPI blocks, etc.).
|
||||
`OPEN_MPI` blocks, etc.).
|
||||
|
||||
----------------------------------------------------------------------
|
||||
|
||||
Case 2
|
||||
## Case 2
|
||||
|
||||
The MPI extensions mechanism in Open MPI was designed to help MPI
|
||||
Forum members prototype new functionality that is intended for the
|
||||
@ -43,23 +40,21 @@ functionality is not only to be included in the MPI spec, but possibly
|
||||
also be included in another MPI implementation.
|
||||
|
||||
As such, it seems reasonable to prefix public symbols in this type of
|
||||
functionality with "MPIX_". This commonly-used prefix allows the same
|
||||
functionality with `MPIX_`. This commonly-used prefix allows the same
|
||||
symbols to be available in multiple MPI implementations, and therefore
|
||||
allows user code to easily check for it. E.g., user apps can check
|
||||
for the presence of MPIX_Foo to know if both Open MPI and Other MPI
|
||||
support the proposed MPIX_Foo functionality.
|
||||
for the presence of `MPIX_Foo` to know if both Open MPI and Other MPI
|
||||
support the proposed `MPIX_Foo` functionality.
|
||||
|
||||
Of course, when using the MPIX_ namespace, there is the possibility of
|
||||
symbol name collisions. E.g., what if Open MPI has an MPIX_Foo and
|
||||
Other MPI has a *different* MPIX_Foo?
|
||||
Of course, when using the `MPIX_` namespace, there is the possibility of
|
||||
symbol name collisions. E.g., what if Open MPI has an `MPIX_Foo` and
|
||||
Other MPI has a *different* `MPIX_Foo`?
|
||||
|
||||
While we technically can't prevent such collisions from happening, we
|
||||
encourage extension authors to avoid such symbol clashes whenever
|
||||
possible.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
|
||||
Case 3
|
||||
## Case 3
|
||||
|
||||
It is well-known that the MPI specification (intentionally) takes a
|
||||
long time to publish. MPI implementers can typically know, with a
|
||||
@ -72,13 +67,13 @@ functionality early (i.e., before the actual publication of the
|
||||
corresponding MPI specification document).
|
||||
|
||||
Case in point: the non-blocking collective operations that were
|
||||
included in MPI-3.0 (e.g., MPI_Ibarrier). It was known for a year or
|
||||
two before MPI-3.0 was published that these functions would be
|
||||
included in MPI-3.0 (e.g., `MPI_Ibarrier()`). It was known for a year
|
||||
or two before MPI-3.0 was published that these functions would be
|
||||
included in MPI-3.0.
|
||||
|
||||
There is a continual debate among the developer community: when
|
||||
implementing such functionality, should the symbols be in the MPIX_
|
||||
namespace or in the MPI_ namespace? On one hand, the symbols are not
|
||||
namespace or in the `MPI_` namespace? On one hand, the symbols are not
|
||||
yet officially standardized -- *they could change* before publication.
|
||||
On the other hand, developers who participate in the Forum typically
|
||||
have a good sense for whether symbols are going to change before
|
||||
@ -89,35 +84,31 @@ before the MPI specification is published. ...and so on.
|
||||
After much debate: for functionality that has a high degree of
|
||||
confidence that it will be included in an upcoming spec (e.g., it has
|
||||
passed at least one vote in the MPI Forum), our conclusion is that it
|
||||
is OK to use the MPI_ namespace.
|
||||
is OK to use the `MPI_` namespace.
|
||||
|
||||
Case in point: Open MPI released non-blocking collectives with the
|
||||
MPI_ prefix (not the MPIX_ prefix) before the MPI-3.0 specification
|
||||
officially standardized these functions.
|
||||
`MPI_` prefix (not the `MPIX_` prefix) before the MPI-3.0
|
||||
specification officially standardized these functions.
|
||||
|
||||
The rationale was threefold:
|
||||
|
||||
1. Let users use the functionality as soon as possible.
|
||||
|
||||
2. If OMPI initially creates MPIX_Foo, but eventually renames it to
|
||||
MPI_Foo when the MPI specification is published, then users will
|
||||
1. If OMPI initially creates `MPIX_Foo`, but eventually renames it to
|
||||
`MPI_Foo` when the MPI specification is published, then users will
|
||||
have to modify their codes to match. This is an artificial change
|
||||
inserted just to be "pure" to the MPI spec (i.e., it's a "lawyer's
|
||||
answer"). But since the MPIX_Foo -> MPI_Foo change is inevitable,
|
||||
it just ends up annoying users.
|
||||
|
||||
3. Once OMPI introduces MPIX_ symbols, if we want to *not* annoy
|
||||
answer"). But since the `MPIX_Foo` -> `MPI_Foo` change is
|
||||
inevitable, it just ends up annoying users.
|
||||
1. Once OMPI introduces `MPIX_` symbols, if we want to *not* annoy
|
||||
users, we'll likely have weak symbols / aliased versions of both
|
||||
MPIX_Foo and MPI_Foo once the Foo functionality is included in a
|
||||
published MPI specification. However, when can we delete the
|
||||
MPIX_Foo symbol? It becomes a continuing annoyance of backwards
|
||||
`MPIX_Foo` and `MPI_Foo` once the Foo functionality is included in
|
||||
a published MPI specification. However, when can we delete the
|
||||
`MPIX_Foo` symbol? It becomes a continuing annoyance of backwards
|
||||
compatibility that we have to keep carrying forward.
|
||||
|
||||
For all these reasons, we believe that it's better to put
|
||||
expected-upcoming official MPI functionality in the MPI_ namespace,
|
||||
not the MPIX_ namespace.
|
||||
|
||||
----------------------------------------------------------------------
|
||||
expected-upcoming official MPI functionality in the `MPI_` namespace,
|
||||
not the `MPIX_` namespace.
|
||||
|
||||
All that being said, these are rules of thumb. They are not an
|
||||
official mandate. There may well be cases where there are reasons to
|
@ -2,7 +2,7 @@
# Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation.  All rights reserved.
# Copyright (c) 2010-2012 Cisco Systems, Inc.  All rights reserved.
# Copyright (c) 2010-2020 Cisco Systems, Inc.  All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@ -20,4 +20,4 @@

SUBDIRS = c

EXTRA_DIST = README.txt
EXTRA_DIST = README.md
30 ompi/mpiext/affinity/README.md (new file)
@ -0,0 +1,30 @@
# Open MPI extension: Affinity

## Copyrights

```
Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
```

## Authors

* Jeff Squyres, 19 April 2010, and 16 April 2012
* Terry Dontje, 18 November 2010

## Description

This extension provides a single new function, `OMPI_Affinity_str()`,
that takes a format value and then provides 3 prettyprint strings as
output:

* `fmt_type`: is an enum that tells `OMPI_Affinity_str()` whether to
  use a resource description string or layout string format for the
  `ompi_bound` and `currently_bound` output strings.
* `ompi_bound`: describes what sockets/cores Open MPI bound this
  process to (or indicates that Open MPI did not bind this process).
* `currently_bound`: describes what sockets/cores this process is
  currently bound to (or indicates that it is unbound).
* `exists`: describes what processors are available in the current host.

See `OMPI_Affinity_str(3)` for more details.
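A usage sketch follows. The constant and type names are taken from
the `OMPI_Affinity_str(3)` man page and should be checked against your
installation:

```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>

int main(int argc, char *argv[])
{
    char ompi_bound[OMPI_AFFINITY_STRING_MAX];
    char current_binding[OMPI_AFFINITY_STRING_MAX];
    char exists[OMPI_AFFINITY_STRING_MAX];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask for the human-readable "resource string" format */
    OMPI_Affinity_str(OMPI_AFFINITY_RSRC_STRING_FMT,
                      ompi_bound, current_binding, exists);
    printf("rank %d: bound by Open MPI to: %s\n", rank, ompi_bound);
    printf("rank %d: currently bound to:   %s\n", rank, current_binding);
    printf("rank %d: host processors:      %s\n", rank, exists);

    MPI_Finalize();
    return 0;
}
```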
@ -1,29 +0,0 @@
|
||||
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
|
||||
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
|
||||
|
||||
$COPYRIGHT$
|
||||
|
||||
Jeff Squyres
|
||||
19 April 2010, and
|
||||
16 April 2012
|
||||
|
||||
Terry Dontje
|
||||
18 November 2010
|
||||
|
||||
This extension provides a single new function, OMPI_Affinity_str(),
|
||||
that takes a format value and then provides 3 prettyprint strings as
|
||||
output:
|
||||
|
||||
fmt_type: is an enum that tells OMPI_Affinity_str() whether to use a
|
||||
resource description string or layout string format for ompi_bound and
|
||||
currently_bound output strings.
|
||||
|
||||
ompi_bound: describes what sockets/cores Open MPI bound this process
|
||||
to (or indicates that Open MPI did not bind this process).
|
||||
|
||||
currently_bound: describes what sockets/cores this process is
|
||||
currently bound to (or indicates that it is unbound).
|
||||
|
||||
exists: describes what processors are available in the current host.
|
||||
|
||||
See OMPI_Affinity_str(3) for more details.
|
@ -21,4 +21,4 @@

SUBDIRS = c

EXTRA_DIST = README.txt
EXTRA_DIST = README.md
11 ompi/mpiext/cuda/README.md (new file)
@ -0,0 +1,11 @@
# Open MPI extension: Cuda

Copyright (c) 2015 NVIDIA, Inc. All rights reserved.

Author: Rolf vandeVaart

This extension provides a macro for a compile-time check of CUDA-aware
support. It also provides a function for a runtime check of CUDA-aware
support.

See `MPIX_Query_cuda_support(3)` for more details.
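For reference, a small check program combining the compile-time macro
and the run-time query (behavior depends on how your Open MPI was
configured):

```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* defines MPIX_CUDA_AWARE_SUPPORT when the extension is built */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile time: this Open MPI was built with CUDA-aware support\n");
#else
    printf("Compile time: no CUDA-aware support\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("Run time: CUDA-aware support is %s\n",
           MPIX_Query_cuda_support() ? "available" : "not available");
#endif

    MPI_Finalize();
    return 0;
}
```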
@ -1,11 +0,0 @@
|
||||
# Copyright (c) 2015 NVIDIA, Inc. All rights reserved.
|
||||
|
||||
$COPYRIGHT$
|
||||
|
||||
Rolf vandeVaart
|
||||
|
||||
|
||||
This extension provides a macro for compile time check of CUDA aware support.
|
||||
It also provides a function for runtime check of CUDA aware support.
|
||||
|
||||
See MPIX_Query_cuda_support(3) for more details.
|
@ -1,5 +1,5 @@
#
# Copyright (c) 2012 Cisco Systems, Inc.  All rights reserved.
# Copyright (c) 2020 Cisco Systems, Inc.  All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@ -17,4 +17,4 @@

SUBDIRS = c mpif-h use-mpi use-mpi-f08

EXTRA_DIST = README.txt
EXTRA_DIST = README.md
148 ompi/mpiext/example/README.md (new file)
@ -0,0 +1,148 @@
# Open MPI extension: Example

## Overview

This example MPI extension shows how to make an MPI extension for Open
MPI.

An MPI extension provides new top-level APIs in Open MPI that are
available to user-level applications (vs. adding new code/APIs that is
wholly internal to Open MPI). MPI extensions are generally used to
prototype new MPI APIs, or to provide Open MPI-specific APIs to
applications. This example MPI extension provides a new top-level MPI
API named `OMPI_Progress` that is callable in both C and Fortran.

MPI extensions are similar to Open MPI components, but due to
complex ordering requirements for the Fortran-based MPI bindings,
their build order is a little different.

Note that MPI has 4 different sets of bindings (C, Fortran `mpif.h`,
the Fortran `mpi` module, and the Fortran `mpi_f08` module), and Open
MPI extensions allow adding API calls to all 4 of them. Prototypes
for the user-accessible functions/subroutines/constants are included
in the following publicly-available mechanisms:

* C: `mpi-ext.h`
* Fortran mpif.h: `mpif-ext.h`
* Fortran "use mpi": `use mpi_ext`
* Fortran "use mpi_f08": `use mpi_f08_ext`

This example extension defines a new top-level API named
`OMPI_Progress()` in all four binding types, and provides test programs
to call this API in each of the four binding types. Code (and
comments) is worth 1,000 words -- see the code in this example
extension to understand how it works and how the build system builds
and inserts each piece into the publicly-available mechanisms (e.g.,
`mpi-ext.h` and the `mpi_f08_ext` module).

## Comparison to General Open MPI MCA Components

Here are the ways that MPI extensions are similar to Open MPI
components:

1. Extensions have a top-level `configure.m4` with a well-known m4
   macro that is run during Open MPI's configure that determines
   whether the component wants to build or not.

   Note, however, that unlike components, extensions *must* have a
   `configure.m4`. No other method of configuration is supported.

1. Extensions must adhere to normal Automake-based targets. We
   strongly suggest that you use `Makefile.am`'s and have the
   extension's `configure.m4` `AC_CONFIG_FILE` each `Makefile.am` in
   the extension. Using other build systems may work, but they are
   untested and unsupported.

1. Extensions create specifically-named libtool convenience archives
   (i.e., `*.la` files) that the build system slurps into higher-level
   libraries.

Unlike components, however, extensions:

1. Have a bit more rigid directory and file naming scheme.
1. Have up to four different, specifically-named subdirectories (one
   for each MPI binding type).
1. Also install some specifically-named header files (for C and the
   Fortran `mpif.h` bindings).

Similar to components, an MPI extension's name is determined by its
directory name: `ompi/mpiext/EXTENSION_NAME`

## Extension requirements

### Required: C API

Under this top-level directory, the extension *must* have a directory
named `c` (for the C bindings) that:

1. contains a file named `mpiext_EXTENSION_NAME_c.h` (a hypothetical
   sketch of such a header follows this list)
1. installs `mpiext_EXTENSION_NAME_c.h` to
   `$includedir/openmpi/mpiext/EXTENSION_NAME/c`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_c.la`
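As an illustration (the extension name `foo` and the symbol `OMPI_Foo`
are entirely hypothetical), such a header usually just declares the
new top-level API:

```c
/* Hypothetical ompi/mpiext/foo/c/mpiext_foo_c.h -- the names here are
 * illustrative only; only the file naming rules above are prescribed. */
#ifndef MPIEXT_FOO_C_H
#define MPIEXT_FOO_C_H

/* OMPI_DECLSPEC and MPI_Comm come from mpi.h, which includes
 * mpi-ext.h (and thus this header) for the application. */
OMPI_DECLSPEC int OMPI_Foo(MPI_Comm comm, int value);

#endif /* MPIEXT_FOO_C_H */
```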
### Optional: `mpif.h` bindings

Optionally, the extension may have a directory named `mpif-h` (for the
Fortran `mpif.h` bindings) that:

1. contains a file named `mpiext_EXTENSION_NAME_mpifh.h`
1. installs `mpiext_EXTENSION_NAME_mpifh.h` to
   `$includedir/openmpi/mpiext/EXTENSION_NAME/mpif-h`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_mpifh.la`

### Optional: `mpi` module bindings

Optionally, the extension may have a directory named `use-mpi` (for the
Fortran `mpi` module) that:

1. contains a file named `mpiext_EXTENSION_NAME_usempi.h`

***NOTE:*** The MPI extension system does NOT support building an
additional library in the `use-mpi` extension directory. It is
assumed that the `use-mpi` bindings will use the same back-end symbols
as the `mpif.h` bindings, and that the only output product of the
`use-mpi` directory is a file to be included in the `mpi-ext` module
(i.e., strong Fortran prototypes for the functions/global variables in
this extension).

### Optional: `mpi_f08` module bindings

Optionally, the extension may have a directory named `use-mpi-f08` (for
the Fortran `mpi_f08` module) that:

1. contains a file named `mpiext_EXTENSION_NAME_usempif08.h`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_usempif08.la`

See the comments in all the header and source files in this tree to
see what each file is for and what should be in each.

## Notes

Note that the build order of MPI extensions is a bit strange. The
directories in an MPI extension are NOT traversed top-down in
sequential order. Instead, due to ordering requirements when building
the Fortran module-based interfaces, each subdirectory in an extension
is traversed individually at different times in the overall Open MPI
build.

As such, `ompi/mpiext/EXTENSION_NAME/Makefile.am` is not traversed
during a normal top-level `make all` target. This `Makefile.am`
exists for two reasons, however:

1. For the convenience of the developer, so that you can issue normal
   `make` commands at the top of your extension tree (e.g., `make all`
   will still build all bindings in an extension).
1. During a top-level `make dist`, extension directories *are*
   traversed top-down in sequence order. Having a top-level
   `Makefile.am` in an extension allows `EXTRA_DIST`ing of files, such
   as this `README.md` file.

These are the reasons for this strange ordering, but suffice it to say
that `make dist` doesn't have the same ordering requirements as `make
all`, and it is therefore easier to have a "normal", Automake-usual
top-down sequential directory traversal.

Enjoy!
@ -1,138 +0,0 @@
|
||||
Copyright (C) 2012 Cisco Systems, Inc. All rights reserved.
|
||||
|
||||
$COPYRIGHT$
|
||||
|
||||
This example MPI extension shows how to make an MPI extension for Open
|
||||
MPI.
|
||||
|
||||
An MPI extension provides new top-level APIs in Open MPI that are
|
||||
available to user-level applications (vs. adding new code/APIs that is
|
||||
wholly internal to Open MPI). MPI extensions are generally used to
|
||||
prototype new MPI APIs, or provide Open MPI-specific APIs to
|
||||
applications. This example MPI extension provides a new top-level MPI
|
||||
API named "OMPI_Progress" that is callable in both C and Fortran.
|
||||
|
||||
MPI extensions are similar to Open MPI components, but due to
|
||||
complex ordering requirements for the Fortran-based MPI bindings,
|
||||
their build order is a little different.
|
||||
|
||||
Note that MPI has 4 different sets of bindings (C, Fortran mpif.h,
|
||||
Fortran "use mpi", and Fortran "use mpi_f08"), and Open MPI extensions
|
||||
allow adding API calls to all 4 of them. Prototypes for the
|
||||
user-accessible functions/subroutines/constants are included in the
|
||||
following publicly-available mechanisms:
|
||||
|
||||
- C: mpi-ext.h
|
||||
- Fortran mpif.h: mpif-ext.h
|
||||
- Fortran "use mpi": use mpi_ext
|
||||
- Fortran "use mpi_f08": use mpi_f08_ext
|
||||
|
||||
This example extension defines a new top-level API named
|
||||
"OMPI_Progress" in all four binding types, and provides test programs
|
||||
to call this API in each of the four binding types. Code (and
|
||||
comments) is worth 1,000 words -- see the code in this example
|
||||
extension to understand how it works and how the build system builds
|
||||
and inserts each piece into the publicly-available mechansisms (e.g.,
|
||||
mpi-ext.h and the mpi_f08_ext module).
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Here's the ways that MPI extensions are similar to Open MPI
|
||||
components:
|
||||
|
||||
- Extensions have a top-level configure.m4 with a well-known m4 macro
|
||||
that is run during Open MPI's configure that determines whether the
|
||||
component wants to build or not.
|
||||
|
||||
Note, however, that unlike components, extensions *must* have a
|
||||
configure.m4. No other method of configuration is supported.
|
||||
|
||||
- Extensions must adhere to normal Automake-based targets. We
|
||||
strongly suggest that you use Makefile.am's and have the extension's
|
||||
configure.m4 AC_CONFIG_FILE each Makefile.am in the extension.
|
||||
Using other build systems may work, but are untested and
|
||||
unsupported.
|
||||
|
||||
- Extensions create specifically-named libtool convenience archives
|
||||
(i.e., *.la files) that the build system slurps into higher-level
|
||||
libraries.
|
||||
|
||||
Unlike components, however, extensions:
|
||||
|
||||
- Have a bit more rigid directory and file naming scheme.
|
||||
|
||||
- Have up to four different, specifically-named subdirectories (one
|
||||
for each MPI binding type).
|
||||
|
||||
- Also install some specifically-named header files (for C and the
|
||||
Fortran mpif.h bindings).
|
||||
|
||||
Similar to components, an MPI extension's name is determined by its
|
||||
directory name: ompi/mpiext/<extension name>
|
||||
|
||||
Under this top-level directory, the extension *must* have a directory
|
||||
named "c" (for the C bindings) that:
|
||||
|
||||
- contains a file named mpiext_<ext_name>_c.h
|
||||
- installs mpiext_<ext_name>_c.h to
|
||||
$includedir/openmpi/mpiext/<ext_name>/c
|
||||
- builds a Libtool convenience library named libmpiext_<ext_name>_c.la
|
||||
|
||||
Optionally, the extension may have a director named "mpif-h" (for the
|
||||
Fortran mpif.h bindings) that:
|
||||
|
||||
- contains a file named mpiext_<ext_name>_mpifh.h
|
||||
- installs mpiext_<ext_name>_mpih.h to
|
||||
$includedir/openmpi/mpiext/<ext_name>/mpif-h
|
||||
- builds a Libtool convenience library named libmpiext_<ext_name>_mpifh.la
|
||||
|
||||
Optionally, the extension may have a director named "use-mpi" (for the
|
||||
Fortran "use mpi" bindings) that:
|
||||
|
||||
- contains a file named mpiext_<ext_name>_usempi.h
|
||||
|
||||
NOTE: The MPI extension system does NOT support building an additional
|
||||
library in the use-mpi extension directory. It is assumed that the
|
||||
use-mpi bindings will use the same back-end symbols as the mpif.h
|
||||
bindings, and that the only output product of the use-mpi directory is
|
||||
a file to be included in the mpi-ext module (i.e., strong Fortran
|
||||
prototypes for the functions/global variables in this extension).
|
||||
|
||||
Optionally, the extension may have a director named "use-mpi-f08" (for
|
||||
the Fortran mpi_f08 bindings) that:
|
||||
|
||||
- contains a file named mpiext_<ext_name>_usempif08.h
|
||||
- builds a Libtool convenience library named
|
||||
libmpiext_<ext_name>_usempif08.la
|
||||
|
||||
See the comments in all the header and source files in this tree to
|
||||
see what each file is for and what should be in each.
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Note that the build order of MPI extensions is a bit strange. The
|
||||
directories in a MPI extensions are NOT traversed top-down in
|
||||
sequential order. Instead, due to ordering requirements when building
|
||||
the Fortran module-based interfaces, each subdirectory in extensions
|
||||
are traversed individually at different times in the overall Open MPI
|
||||
build.
|
||||
|
||||
As such, ompi/mpiext/<ext_name>/Makefile.am is not traversed during a
|
||||
normal top-level "make all" target. This Makefile.am exists for two
|
||||
reasons, however:
|
||||
|
||||
1. For the conveneince of the developer, so that you can issue normal
|
||||
"make" commands at the top of your extension tree (e.g., "make all"
|
||||
will still build all bindings in an extension).
|
||||
|
||||
2. During a top-level "make dist", extension directories *are*
|
||||
traversed top-down in sequence order. Having a top-level Makefile.am
|
||||
in an extension allows EXTRA_DISTing of files, such as this README
|
||||
file.
|
||||
|
||||
This are reasons for this strange ordering, but suffice it to say that
|
||||
"make dist" doesn't have the same ordering requiements as "make all",
|
||||
and is therefore easier to have a "normal" Automake-usual top-down
|
||||
sequential directory traversal.
|
||||
|
||||
Enjoy!
|
@ -8,3 +8,5 @@
#

SUBDIRS = c mpif-h use-mpi use-mpi-f08

EXTRA_DIST = README.md
14 ompi/mpiext/pcollreq/README.md (new file)
@ -0,0 +1,14 @@
# Open MPI extension: pcollreq

Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.

This extension provides the feature of persistent collective
communication operations and persistent neighborhood collective
communication operations, which is planned to be included in the next
MPI Standard after MPI-3.1 (as of Nov. 2018).

See `MPIX_Barrier_init(3)` for more details.

The code will be moved to the `ompi/mpi` directory and the `MPIX_`
prefix will be switched to the `MPI_` prefix once the MPI Standard
which includes this feature is published.
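A usage sketch of the persistent-collective pattern this extension
enables (error checking omitted; assumes the extension is available
via `mpi-ext.h`):

```c
#include <mpi.h>
#include <mpi-ext.h>

/* Sketch: set up a persistent barrier once, then start/complete it
 * many times.  Error checking is omitted for brevity. */
void persistent_barrier_loop(int iterations)
{
    MPI_Request req;
    int i;

    MPIX_Barrier_init(MPI_COMM_WORLD, MPI_INFO_NULL, &req);
    for (i = 0; i < iterations; i++) {
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);
}
```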
@ -1,14 +0,0 @@
|
||||
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
|
||||
|
||||
$COPYRIGHT$
|
||||
|
||||
This extension provides the feature of persistent collective communication
|
||||
operations and persistent neighborhood collective communication operations,
|
||||
which is planned to be included in the next MPI Standard after MPI-3.1 as
|
||||
of Nov. 2018.
|
||||
|
||||
See MPIX_Barrier_init(3) for more details.
|
||||
|
||||
The code will be moved to the ompi/mpi directory and the MPIX_ prefix will
|
||||
be switch to the MPI_ prefix once the MPI Standard which includes this
|
||||
feature is published.
|
@ -8,3 +8,5 @@
#

SUBDIRS = c mpif-h use-mpi use-mpi-f08

EXTRA_DIST = README.md
35 ompi/mpiext/shortfloat/README.md (new file)
@ -0,0 +1,35 @@
# Open MPI extension: shortfloat

Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.

This extension provides additional MPI datatypes `MPIX_SHORT_FLOAT`,
`MPIX_C_SHORT_FLOAT_COMPLEX`, and `MPIX_CXX_SHORT_FLOAT_COMPLEX`,
which were proposed (with the `MPI_` prefix) in June 2017 for
inclusion in the MPI 4.0 standard. As of February 2019, the proposal
has not yet been accepted.

See https://github.com/mpi-forum/mpi-issues/issues/65 for more details.

Each MPI datatype corresponds to the C/C++ type `short float`, the C
type `short float _Complex`, and the C++ type `std::complex<short
float>`, respectively.

In addition, this extension provides a datatype `MPIX_C_FLOAT16` for
the C type `_Float16`, which is defined in ISO/IEC JTC 1/SC 22/WG 14
N1945 (ISO/IEC TS 18661-3:2015). This name and meaning are the same
as those of MPICH. See https://github.com/pmodels/mpich/pull/3455.

This extension is enabled only if the C compiler supports `short float`
or `_Float16`, or if the `--enable-alt-short-float=TYPE` option is
passed to the Open MPI `configure` script.

NOTE: The Clang 6.0.x and 7.0.x compilers support the `_Float16` type
(via software emulation), but require an additional linker flag to
function properly. If you wish to enable Clang 6.0.x or 7.0.x's
software emulation of `_Float16`, use the following CLI options to the
Open MPI `configure` script:

```
./configure \
    LDFLAGS=--rtlib=compiler-rt \
    --with-wrapper-ldflags=--rtlib=compiler-rt ...
```
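A point-to-point usage sketch (assumes the extension was enabled at
configure time and that your compiler supports `_Float16`):

```c
#include <mpi.h>
#include <mpi-ext.h>

/* Sketch: exchange a _Float16 value between ranks 0 and 1 using the
 * MPIX_C_FLOAT16 datatype.  Assumes the shortfloat extension is enabled. */
void exchange_half(int rank)
{
    _Float16 value = (_Float16)1.5f;

    if (0 == rank) {
        MPI_Send(&value, 1, MPIX_C_FLOAT16, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        MPI_Recv(&value, 1, MPIX_C_FLOAT16, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}
```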
@ -1,35 +0,0 @@
|
||||
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
|
||||
|
||||
$COPYRIGHT$
|
||||
|
||||
This extension provides additional MPI datatypes MPIX_SHORT_FLOAT,
|
||||
MPIX_C_SHORT_FLOAT_COMPLEX, and MPIX_CXX_SHORT_FLOAT_COMPLEX, which
|
||||
are proposed with the MPI_ prefix in June 2017 for proposal in the
|
||||
MPI 4.0 standard. As of February 2019, it is not accepted yet.
|
||||
|
||||
https://github.com/mpi-forum/mpi-issues/issues/65
|
||||
|
||||
Each MPI datatype corresponds to the C/C++ type 'short float', the C type
|
||||
'short float _Complex', and the C++ type 'std::complex<short float>',
|
||||
respectively.
|
||||
|
||||
In addition, this extension provides a datatype MPIX_C_FLOAT16 for
|
||||
the C type _Float16, which is defined in ISO/IEC JTC 1/SC 22/WG 14
|
||||
N1945 (ISO/IEC TS 18661-3:2015). This name and meaning are same as
|
||||
that of MPICH.
|
||||
|
||||
https://github.com/pmodels/mpich/pull/3455
|
||||
|
||||
This extension is enabled only if the C compiler supports 'short float'
|
||||
or '_Float16', or the '--enable-alt-short-float=TYPE' option is passed
|
||||
to the configure script.
|
||||
|
||||
NOTE: The Clang 6.0.x and 7.0.x compilers support the "_Float16" type
|
||||
(via software emulation), but require an additional linker flag to
|
||||
function properly. If you wish to enable Clang 6.0.x or 7.0.x's
|
||||
software emulation of _Float16, use the following CLI options to Open
|
||||
MPI configure script:
|
||||
|
||||
./configure \
|
||||
LDFLAGS=--rtlib=compiler-rt \
|
||||
--with-wrapper-ldflags=--rtlib=compiler-rt ...
|
@ -1,110 +0,0 @@
|
||||
========================================
|
||||
Design notes on BTL/OFI
|
||||
========================================
|
||||
|
||||
This is the RDMA only btl based on OFI Libfabric. The goal is to enable RDMA
|
||||
with multiple vendor hardware through one interface. Most of the operations are
|
||||
managed by upper layer (osc/rdma). This BTL is mostly doing the low level work.
|
||||
|
||||
Tested providers: sockets,psm2,ugni
|
||||
|
||||
========================================
|
||||
|
||||
Component
|
||||
|
||||
This BTL is requesting libfabric version 1.5 API and will not support older versions.
|
||||
|
||||
The required capabilities of this BTL is FI_ATOMIC and FI_RMA with the endpoint type
|
||||
of FI_EP_RDM only. This BTL does NOT support libfabric provider that requires local
|
||||
memory registration (FI_MR_LOCAL).
|
||||
|
||||
BTL/OFI will initialize a module with ONLY the first compatible info returned from OFI.
|
||||
This means it will rely on OFI provider to do load balancing. The support for multiple
|
||||
device might be added later.
|
||||
|
||||
The BTL creates only one endpoint and one CQ.
|
||||
|
||||
========================================
|
||||
|
||||
Memory Registration
|
||||
|
||||
Open MPI has a system in place to exchange remote address and always use the remote
|
||||
virtual address to refer to a piece of memory. However, some libfabric providers might
|
||||
not support the use of virtual address and instead will use zero-based offset addressing.
|
||||
|
||||
FI_MR_VIRT_ADDR is the flag that determine this behavior. mca_btl_ofi_reg_mem() handles
|
||||
this by storing the base address in registration handle in case of the provider does not
|
||||
support FI_MR_VIRT_ADDR. This base address will be used to calculate the offset later in
|
||||
RDMA/Atomic operations.
|
||||
|
||||
The BTL will try to use the address of registration handle as the key. However, if the
|
||||
provider supports FI_MR_PROV_KEY, it will use provider provided key. Simply does not care.
|
||||
|
||||
The BTL does not register local operand or compare. This is why this BTL does not support
|
||||
FI_MR_LOCAL and will allocate every buffer before registering. This means FI_MR_ALLOCATED
|
||||
is supported. So to be explicit.
|
||||
|
||||
Supported MR mode bits (will work with or without):
|
||||
enum:
|
||||
- FI_MR_BASIC
|
||||
- FI_MR_SCALABLE
|
||||
|
||||
mode bits:
|
||||
- FI_MR_VIRT_ADDR
|
||||
- FI_MR_ALLOCATED
|
||||
- FI_MR_PROV_KEY
|
||||
|
||||
The BTL does NOT support (will not work with):
|
||||
- FI_MR_LOCAL
|
||||
- FI_MR_MMU_NOTIFY
|
||||
- FI_MR_RMA_EVENT
|
||||
- FI_MR_ENDPOINT
|
||||
|
||||
Just a reminder, in libfabric API 1.5...
|
||||
FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)
|
||||
|
||||
========================================
|
||||
|
||||
Completions
|
||||
|
||||
Every operation in this BTL is asynchronous. The completion handling will occur in
|
||||
mca_btl_ofi_component_progress() where we read the CQ with the completion context and
|
||||
execute the callback functions. The completions are local. No remote completion event is
|
||||
generated as local completion already guarantee global completion.
|
||||
|
||||
The BTL keep tracks of number of outstanding operations and provide flush interface.
|
||||
|
||||
========================================
|
||||
|
||||
Sockets Provider
|
||||
|
||||
Sockets provider is the proof of concept provider for libfabric. It is supposed to support
|
||||
all the OFI API with emulations. This provider is considered very slow and bound to raise
|
||||
problems that we might not see from other faster providers.
|
||||
|
||||
Known Problems:
|
||||
- sockets provider uses progress thread and can cause segfault in finalize as we free
|
||||
the resources while progress thread is still using it. sleep(1) was put in
|
||||
mca_btl_ofi_componenet_close() for this reason.
|
||||
- sockets provider deadlock in two-sided mode. Might be something about buffered recv.
|
||||
(August 2018).
|
||||
|
||||
========================================
|
||||
|
||||
Scalable Endpoint
|
||||
|
||||
This BTL will try to use scalable endpoint to create communication context. This will increase
|
||||
multithreaded performance for some application. The default number of context created is 1 and
|
||||
can be tuned VIA MCA parameter "btl_ofi_num_contexts_per_module". It is advised that the number
|
||||
of context should be equal to number of physical core for optimal performance.
|
||||
|
||||
User can disable scalable endpoint by MCA parameter "btl_ofi_disable_sep".
|
||||
With scalable endpoint disbled, the BTL will alias OFI endpoint to both tx and rx context.
|
||||
|
||||
========================================
|
||||
|
||||
Two sided communication
|
||||
|
||||
Two sided communication is added later on to BTL OFI to enable non tag-matching provider
|
||||
to be able to use in Open MPI with this BTL. However, the support is only for "functional"
|
||||
and has not been optimized for performance at this point. (August 2018)
|
113  opal/mca/btl/ofi/README.md  (normal file)
@ -0,0 +1,113 @@

# Design notes on BTL/OFI

This is the RDMA-only BTL based on OFI Libfabric. The goal is to
enable RDMA with multiple vendors' hardware through one interface.
Most of the operations are managed by the upper layer (osc/rdma); this
BTL mostly does the low-level work.

Tested providers: sockets, psm2, ugni

## Component

This BTL requests the libfabric version 1.5 API and will not support
older versions.

The required capabilities of this BTL are `FI_ATOMIC` and `FI_RMA`
with the endpoint type `FI_EP_RDM` only. This BTL does NOT support
libfabric providers that require local memory registration
(`FI_MR_LOCAL`).

BTL/OFI will initialize a module with ONLY the first compatible info
returned from OFI. This means it relies on the OFI provider to do load
balancing. Support for multiple devices might be added later.

The BTL creates only one endpoint and one CQ.

## Memory Registration

Open MPI has a system in place to exchange remote addresses and always
uses the remote virtual address to refer to a piece of memory. However,
some libfabric providers might not support the use of virtual addresses
and instead will use zero-based offset addressing.

`FI_MR_VIRT_ADDR` is the flag that determines this
behavior. `mca_btl_ofi_reg_mem()` handles this by storing the base
address in the registration handle in case the provider does not
support `FI_MR_VIRT_ADDR`. This base address will be used to calculate
the offset later in RDMA/Atomic operations, as sketched below.

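The following is a minimal sketch of that calculation (illustrative
only; the structure and function names here are hypothetical, not the
actual BTL/OFI data structures):

```c
#include <stdint.h>
#include <rdma/fabric.h>   /* for FI_MR_VIRT_ADDR */

/* Hypothetical handle: the key advertised to the peer plus the base
 * address saved at registration time. */
struct example_reg_handle {
    uint64_t rkey;
    uint64_t base_addr;
};

/* Turn the peer's virtual address into the address/offset that is
 * actually passed to fi_write()/fi_read()/fi_atomic(). */
static uint64_t target_address(const struct example_reg_handle *handle,
                               uint64_t remote_vaddr, uint64_t mr_mode)
{
    if (mr_mode & FI_MR_VIRT_ADDR) {
        /* Provider understands virtual addresses: use them directly. */
        return remote_vaddr;
    }
    /* Zero-based offset addressing: subtract the saved base address. */
    return remote_vaddr - handle->base_addr;
}
```
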
The BTL will try to use the address of the registration handle as the
key. However, if the provider supports `FI_MR_PROV_KEY`, it will use
the provider-provided key instead; the BTL simply does not care which.

The BTL does not register the local operand or compare buffers. This
is why this BTL does not support `FI_MR_LOCAL`, and it allocates every
buffer before registering; this means `FI_MR_ALLOCATED` is supported.

Supported MR mode bits (will work with or without):

* enum:
    * `FI_MR_BASIC`
    * `FI_MR_SCALABLE`
* mode bits:
    * `FI_MR_VIRT_ADDR`
    * `FI_MR_ALLOCATED`
    * `FI_MR_PROV_KEY`

The BTL does NOT support (will not work with):

* `FI_MR_LOCAL`
* `FI_MR_MMU_NOTIFY`
* `FI_MR_RMA_EVENT`
* `FI_MR_ENDPOINT`

Just a reminder: in the libfabric 1.5 API,
`FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)`

## Completions

Every operation in this BTL is asynchronous. The completion handling
occurs in `mca_btl_ofi_component_progress()`, where we read the CQ
with the completion context and execute the callback functions. The
completions are local; no remote completion event is generated, as
local completion already guarantees global completion.

The BTL keeps track of the number of outstanding operations and
provides a flush interface.

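As an illustration of that pattern (not the actual
`mca_btl_ofi_component_progress()` code), a progress function built on
the libfabric CQ API looks roughly like this:

```c
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Hypothetical completion context carrying the callback to run. */
struct completion_ctx {
    void (*callback)(struct completion_ctx *ctx);
};

static int progress_cq(struct fid_cq *cq)
{
    struct fi_cq_entry wc[16];
    ssize_t n = fi_cq_read(cq, wc, 16);

    if (-FI_EAGAIN == n) {
        return 0;            /* nothing has completed yet */
    }
    if (n < 0) {
        return (int) n;      /* error; real code would call fi_cq_readerr() */
    }
    for (ssize_t i = 0; i < n; i++) {
        struct completion_ctx *ctx = wc[i].op_context;
        ctx->callback(ctx);  /* local completion == global completion here */
    }
    return (int) n;
}
```
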
## Sockets Provider

The sockets provider is the proof-of-concept provider for libfabric.
It is supposed to support all of the OFI API through emulation. This
provider is considered very slow and bound to raise problems that we
might not see with other, faster providers.

Known problems:

* The sockets provider uses a progress thread and can cause a segfault
  in finalize as we free the resources while the progress thread is
  still using them. A `sleep(1)` was put in
  `mca_btl_ofi_component_close()` for this reason.
* The sockets provider deadlocks in two-sided mode. Might be something
  about buffered recv. (August 2018)

## Scalable Endpoint

This BTL will try to use a scalable endpoint to create communication
contexts. This increases multithreaded performance for some
applications. The default number of contexts created is 1 and can be
tuned via the MCA parameter `btl_ofi_num_contexts_per_module`. It is
advised that the number of contexts equal the number of physical cores
for optimal performance.

Users can disable the scalable endpoint with the MCA parameter
`btl_ofi_disable_sep`. With the scalable endpoint disabled, the BTL
will alias the OFI endpoint to both the tx and rx contexts.

## Two-sided communication

Two-sided communication was added to BTL/OFI later, so that
non-tag-matching providers can be used in Open MPI with this BTL.
However, the support is only "functional" and has not been optimized
for performance at this point. (August 2018)

@ -1,113 +0,0 @@
|
||||
Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
|
||||
August 21, 2013
|
||||
|
||||
SMCUDA DESIGN DOCUMENT
|
||||
This document describes the design and use of the smcuda BTL.
|
||||
|
||||
BACKGROUND
|
||||
The smcuda btl is a copy of the sm btl but with some additional features.
|
||||
The main extra feature is the ability to make use of the CUDA IPC APIs to
|
||||
quickly move GPU buffers from one GPU to another. Without this support,
|
||||
the GPU buffers would all be moved into and then out of host memory.
|
||||
|
||||
GENERAL DESIGN
|
||||
|
||||
The general design makes use of the large message RDMA RGET support in the
|
||||
OB1 PML. However, there are some interesting choices to make use of it.
|
||||
First, we disable any large message RDMA support in the BTL for host
|
||||
messages. This is done because we need to use the mca_btl_smcuda_get() for
|
||||
the GPU buffers. This is also done because the upper layers expect there
|
||||
to be a single mpool but we need one for the GPU memory and one for the
|
||||
host memory. Since the advantages of using RDMA with host memory is
|
||||
unclear, we disabled it. This means no KNEM or CMA support built in to the
|
||||
smcuda BTL.
|
||||
|
||||
Also note that we give the smcuda BTL a higher rank than the sm BTL. This
|
||||
means it will always be selected even if we are doing host only data
|
||||
transfers. The smcuda BTL is not built if it is not requested via the
|
||||
--with-cuda flag to the configure line.
|
||||
|
||||
Secondly, the smcuda does not make use of the traditional method of
|
||||
enabling RDMA operations. The traditional method checks for the existence
|
||||
of an RDMA btl hanging off the endpoint. The smcuda works in conjunction
|
||||
with the OB1 PML and uses flags that it sends in the BML layer.
|
||||
|
||||
OTHER CONSIDERATIONS
|
||||
CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA
|
||||
nodes, CUDA IPC may only work between GPUs that are not connected
|
||||
over the IOH. In addition, we want to check for CUDA IPC support lazily,
|
||||
when the first GPU access occurs, rather than during MPI_Init() time.
|
||||
This complicates the design.
|
||||
|
||||
INITIALIZATION
|
||||
When the smcuda BTL initializes, it starts with no support for CUDA IPC.
|
||||
Upon the first access of a GPU buffer, the smcuda checks which GPU device
|
||||
it has and sends that to the remote side using a smcuda specific control
|
||||
message. The other rank receives the message, and checks to see if there
|
||||
is CUDA IPC support between the two GPUs via a call to
|
||||
cuDeviceCanAccessPeer(). If it is true, then the smcuda BTL piggy backs on
|
||||
the PML error handler callback to make a call into the PML and let it know
|
||||
to enable CUDA IPC. We created a new flag so that the error handler does
|
||||
the right thing. Large message RDMA is enabled by setting a flag in the
|
||||
bml->btl_flags field. Control returns to the smcuda BTL where a reply
|
||||
message is sent so the sending side can set its flag.
|
||||
|
||||
At that point, the PML layer starts using the large message RDMA support
|
||||
in the smcuda BTL. This is done in some special CUDA code in the PML layer.
|
||||
|
||||
ESTABLISHING CUDA IPC SUPPORT
|
||||
A check has been added into both the send and sendi path in the smcuda btl
|
||||
that checks to see if it should send a request for CUDA IPC setup message.
|
||||
|
||||
/* Initiate setting up CUDA IPC support. */
|
||||
if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
|
||||
mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
|
||||
}
|
||||
|
||||
The first check is to see if the CUDA environment has been initialized. If
|
||||
not, then presumably we are not sending any GPU buffers yet and there is
|
||||
nothing to be done. If we are initialized, then check the status of the
|
||||
CUDA IPC endpoint. If it is in the IPC_INIT stage, then call the function
|
||||
to send of a control message to the endpoint.
|
||||
|
||||
On the receiving side, we first check to see if we are initialized. If
|
||||
not, then send a message back to the sender saying we are not initialized.
|
||||
This will cause the sender to reset its state to IPC_INIT so it can try
|
||||
again on the next send.
|
||||
|
||||
I considered putting the receiving side into a new state like IPC_NOTREADY,
|
||||
and then when it switches to ready, to then sending the ACK to the sender.
|
||||
The problem with this is that we would need to do these checks during the
|
||||
progress loop which adds some extra overhead as we would have to check all
|
||||
endpoints to see if they were ready.
|
||||
|
||||
Note that any rank can initiate the setup of CUDA IPC. It is triggered by
|
||||
whichever side does a send or sendi call of a GPU buffer.
|
||||
|
||||
I have the sender attempt 5 times to set up the connection. After that, we
|
||||
give up. Note that I do not expect many scenarios where the sender has to
|
||||
resend. It could happen in a race condition where one rank has initialized
|
||||
its CUDA environment but the other side has not.
|
||||
|
||||
There are several states the connections can go through.
|
||||
|
||||
IPC_INIT - nothing has happened
|
||||
IPC_SENT - message has been sent to other side
|
||||
IPC_ACKING - Received request and figuring out what to send back
|
||||
IPC_ACKED - IPC ACK sent
|
||||
IPC_OK - IPC ACK received back
|
||||
IPC_BAD - Something went wrong, so marking as no IPC support
|
||||
|
||||
NOTE ABOUT CUDA IPC AND MEMORY POOLS
|
||||
The CUDA IPC support works in the following way. A sender makes a call to
|
||||
cuIpcGetMemHandle() and gets a memory handle for its local memory. The
|
||||
sender then sends that handle to receiving side. The receiver calls
|
||||
cuIpcOpenMemHandle() using that handle and gets back an address to the
|
||||
remote memory. The receiver then calls cuMemcpyAsync() to initiate a
|
||||
remote read of the GPU data.
|
||||
|
||||
The receiver maintains a cache of remote memory that it has handles open on.
|
||||
This is because a call to cuIpcOpenMemHandle() can be very expensive (90usec) so
|
||||
we want to avoid it when we can. The cache of remote memory is kept in a memory
|
||||
pool that is associated with each endpoint. Note that we do not cache the local
|
||||
memory handles because getting them is very cheap and there is no need.
|
126  opal/mca/btl/smcuda/README.md  (normal file)
@ -0,0 +1,126 @@

# Open MPI SMCUDA design document

Copyright (c) 2013 NVIDIA Corporation.  All rights reserved.
August 21, 2013

This document describes the design and use of the `smcuda` BTL.

## BACKGROUND

The `smcuda` BTL is a copy of the `sm` BTL, but with some additional
features. The main extra feature is the ability to make use of the
CUDA IPC APIs to quickly move GPU buffers from one GPU to another.
Without this support, the GPU buffers would all be moved into and then
out of host memory.

## GENERAL DESIGN

The general design makes use of the large message RDMA RGET support in
the OB1 PML. However, there are some interesting choices to make use
of it. First, we disable any large message RDMA support in the BTL
for host messages. This is done because we need to use
`mca_btl_smcuda_get()` for the GPU buffers. It is also done because
the upper layers expect there to be a single mpool, but we need one for
the GPU memory and one for the host memory. Since the advantage of
using RDMA with host memory is unclear, we disabled it. This means no
KNEM or CMA support is built in to the `smcuda` BTL.

Also note that we give the `smcuda` BTL a higher rank than the `sm`
BTL. This means it will always be selected even if we are doing
host-only data transfers. The `smcuda` BTL is not built if it is not
requested via the `--with-cuda` flag on the configure line.

Secondly, the `smcuda` BTL does not use the traditional method of
enabling RDMA operations. The traditional method checks for the
existence of an RDMA BTL hanging off the endpoint. The `smcuda` BTL
works in conjunction with the OB1 PML and uses flags that it sends in
the BML layer.

## OTHER CONSIDERATIONS

CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected
over the IOH. In addition, we want to check for CUDA IPC support
lazily, when the first GPU access occurs, rather than at `MPI_Init()`
time. This complicates the design.

## INITIALIZATION

When the `smcuda` BTL initializes, it starts with no support for CUDA
IPC. Upon the first access of a GPU buffer, the `smcuda` BTL checks
which GPU device it has and sends that to the remote side using an
`smcuda`-specific control message. The other rank receives the
message and checks to see if there is CUDA IPC support between the two
GPUs via a call to `cuDeviceCanAccessPeer()`. If there is, then the
`smcuda` BTL piggybacks on the PML error handler callback to make a
call into the PML and let it know to enable CUDA IPC. We created a new
flag so that the error handler does the right thing. Large message
RDMA is enabled by setting a flag in the `bml->btl_flags` field.
Control then returns to the `smcuda` BTL, where a reply message is sent
so the sending side can set its flag.

At that point, the PML layer starts using the large message RDMA
support in the `smcuda` BTL. This is done in some special CUDA code
in the PML layer.

## ESTABLISHING CUDA IPC SUPPORT

A check has been added into both the `send` and `sendi` paths in the
`smcuda` BTL that checks to see if it should send a request for a CUDA
IPC setup message.

```c
/* Initiate setting up CUDA IPC support. */
if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
    mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
}
```

The first check is to see if the CUDA environment has been
initialized. If not, then presumably we are not sending any GPU
buffers yet and there is nothing to be done. If we are initialized,
then check the status of the CUDA IPC endpoint. If it is in the
IPC_INIT stage, then call the function to send a control message to
the endpoint.

On the receiving side, we first check to see if we are initialized.
If not, then send a message back to the sender saying we are not
initialized. This will cause the sender to reset its state to
IPC_INIT so it can try again on the next send.

I considered putting the receiving side into a new state like
IPC_NOTREADY, and then, when it switches to ready, sending the ACK to
the sender. The problem with this is that we would need to do these
checks during the progress loop, which adds some extra overhead, as we
would have to check all endpoints to see if they were ready.

Note that any rank can initiate the setup of CUDA IPC. It is
triggered by whichever side does a send or sendi call of a GPU buffer.

I have the sender attempt 5 times to set up the connection. After
that, we give up. Note that I do not expect many scenarios where the
sender has to resend. It could happen in a race condition where one
rank has initialized its CUDA environment but the other side has not.

There are several states the connections can go through:

1. IPC_INIT - nothing has happened
1. IPC_SENT - message has been sent to other side
1. IPC_ACKING - received request and figuring out what to send back
1. IPC_ACKED - IPC ACK sent
1. IPC_OK - IPC ACK received back
1. IPC_BAD - something went wrong, so marking as no IPC support

## NOTE ABOUT CUDA IPC AND MEMORY POOLS

The CUDA IPC support works in the following way. A sender makes a
call to `cuIpcGetMemHandle()` and gets a memory handle for its local
memory. The sender then sends that handle to the receiving side. The
receiver calls `cuIpcOpenMemHandle()` using that handle and gets back
an address to the remote memory. The receiver then calls
`cuMemcpyAsync()` to initiate a remote read of the GPU data. The
sketch below illustrates this flow.

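A condensed, stand-alone sketch of that flow using the CUDA driver API
(illustrative only; the helper names are made up, and the control
message that carries the handle between processes is omitted):

```c
#include <cuda.h>

/* Sender side: produce an IPC handle for a device allocation. */
CUipcMemHandle export_gpu_buffer(CUdeviceptr local_buf)
{
    CUipcMemHandle handle;
    /* The handle is small and can be shipped in a control message. */
    cuIpcGetMemHandle(&handle, local_buf);
    return handle;
}

/* Receiver side: map the peer's buffer and read from it. */
int pull_gpu_buffer(CUipcMemHandle handle, CUdeviceptr local_dst,
                    size_t bytes, CUstream stream)
{
    CUdeviceptr remote_buf;

    /* Expensive (~90 usec): this is why smcuda caches opened handles. */
    if (cuIpcOpenMemHandle(&remote_buf, handle,
                           CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS) != CUDA_SUCCESS) {
        return -1;
    }
    /* Asynchronous device-to-device copy of the peer's data. */
    cuMemcpyAsync(local_dst, remote_buf, bytes, stream);
    return 0;
}
```
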
The receiver maintains a cache of remote memory that it has handles
open on. This is because a call to `cuIpcOpenMemHandle()` can be very
expensive (90 usec), so we want to avoid it when we can. The cache of
remote memory is kept in a memory pool that is associated with each
endpoint. Note that we do not cache the local memory handles because
getting them is very cheap and there is no need.

@ -27,7 +27,7 @@
 AM_CPPFLAGS = $(opal_ofi_CPPFLAGS) -DOMPI_LIBMPI_NAME=\"$(OMPI_LIBMPI_NAME)\"

-EXTRA_DIST = README.txt README.test
+EXTRA_DIST = README.md README.test

 dist_opaldata_DATA = \
        help-mpi-btl-usnic.txt

330  opal/mca/btl/usnic/README.md  (normal file)
@ -0,0 +1,330 @@

# Design notes on usnic BTL

## nomenclature

* fragment - something the PML asks us to send or put, any size
* segment - something we can put on the wire in a single packet
* chunk - a piece of a fragment that fits into one segment

A segment can contain either an entire fragment or a chunk of a
fragment.

Each segment and fragment has an associated descriptor.

Each segment data structure has a block of registered memory
associated with it which matches the MTU for that segment.

* ACK - ACKs get special small segments with only enough memory for an ACK
* non-ACK segments always have a parent fragment

* fragments are either large (> MTU) or small (<= MTU)
* a small fragment has a segment descriptor embedded within it since it
  always needs exactly one
* a large fragment has no permanently associated segments, but
  allocates them as needed

## channels

A channel is a queue pair with an associated completion queue; each
channel has its own MTU and r/w queue entry counts.

There are 2 channels, command and data:

* the command queue is generally for higher priority fragments
* the data queue is for standard data traffic
* the command queue should possibly be called the "priority" queue

The command queue is shorter and has a smaller MTU than the data
queue. This makes the command queue a lot faster than the data queue,
so we hijack it for sending very small fragments (<= tiny_mtu,
currently 768 bytes).

The command queue is used for ACKs and tiny fragments; the data queue
is used for everything else.

PML fragments marked priority should perhaps use the command queue.

## sending

Normally, all send requests are simply enqueued and then actually
posted to the NIC by the routine
`opal_btl_usnic_module_progress_sends()`. "Fastpath" tiny sends are
the exception.

Each module maintains a queue of endpoints that are ready to send.
An endpoint is ready to send if all of the following are met:

1. the endpoint has fragments to send
1. the endpoint has send credits
1. the endpoint's send window is "open" (not full of un-ACKed segments)

Each module also maintains a list of segments that need to be
retransmitted. Note that the list of pending retransmissions is
per-module, not per-endpoint.

Send progression first posts any pending retransmissions, always using
the data channel. (The reason is that if we start getting heavy
congestion and there are lots of retransmits, it becomes more
important than ever to prioritize ACKs; clogging the command channel
with retransmitted data makes things worse, not better.)

Next, progression loops sending segments to the endpoint at the top of
the `endpoints_with_sends` queue. When an endpoint exhausts its send
credits, fills its send window, or runs out of segments to send, it
removes itself from the `endpoints_with_sends` list. Any pending ACKs
will be picked up and piggy-backed on these sends.

Finally, any endpoints that still need ACKs and whose timer has
expired will be sent explicit ACK packets.

## fragment sending

The middle part of the progression loop handles both small
(single-segment) and large (multi-segment) sends.

For small fragments, the verbs descriptor within the embedded segment
is updated with the length, the BTL header is updated, then we call
`opal_btl_usnic_endpoint_send_segment()` to send the segment. After
posting, we make a PML callback if needed.

For large fragments, a little more is needed. Segments from a large
fragment have a slightly larger BTL header which contains a fragment
ID, an offset, and a size. The fragment ID is allocated when the
first chunk of the fragment is sent. A segment gets allocated, the
next blob of data is copied into this segment, and the segment is
posted. If the last chunk of the fragment has been sent, perform a
callback if needed, then remove the fragment from the endpoint send
queue.

## `opal_btl_usnic_endpoint_send_segment()`

This is common posting code for large or small segments. It assigns a
sequence number to a segment, checks for an ACK to piggy-back,
posts the segment to the NIC, and then starts the retransmit timer
by checking the segment into the hotel. Send credits are consumed
here.

## send dataflow

PML control messages with no user data are sent via:

* `desc = usnic_alloc(size)`
* `usnic_send(desc)`

User messages less than the eager limit and the 1st part of larger
messages are sent via:

* `desc = usnic_prepare_src(convertor, size)`
* `usnic_send(desc)`

Larger messages:

* `desc = usnic_prepare_src(convertor, size)`
* `usnic_put(desc)`

`usnic_alloc()` currently asserts the length is "small", and allocates
and fills in a small fragment. The src pointer will point to the start
of the associated registered memory plus the size of the BTL header,
and the PML will put its data there.

`usnic_prepare_src()` allocates either a large or small fragment based
on size. The fragment descriptor is filled in to have 2 SG entries,
the 1st pointing to the place where the PML should construct its
header. If the data convertor says the data is contiguous, the 2nd SG
entry points to the user buffer, else it is null and sf_convertor is
filled in with the address of the convertor.

### `usnic_send()`

If the fragment being sent is small enough, has contiguous data, and
"very few" command queue send WQEs have been consumed, `usnic_send()`
does a fastpath send. This means it posts the segment immediately to
the NIC with the INLINE flag set.

If all of the conditions for a fastpath send are not met, and this is
a small fragment, the user data is copied into the associated
registered memory at this time and the SG list in the descriptor is
collapsed to one entry.

After the checks above are done, the fragment is enqueued to be sent
via `opal_btl_usnic_endpoint_enqueue_frag()`.

### `usnic_put()`

Do a fast version of what happens in `prepare_src()` (we can take
shortcuts because we know it will always be a contiguous buffer / no
convertor needed). The PML gives us the destination address, which we
save on the fragment (this is the sentinel value that the underlying
engine uses to know that this is a PUT and not a SEND), and the
fragment is enqueued for processing.

### `opal_btl_usnic_endpoint_enqueue_frag()`

This appends the fragment to the "to be sent" list of the endpoint and
conditionally adds the endpoint to the list of endpoints with data to
send via `opal_btl_usnic_check_rts()`.

## receive dataflow

BTL packets have one of 3 types in the header: frag, chunk, or ack.

* A frag packet is a full PML fragment.
* A chunk packet is a piece of a fragment that needs to be reassembled.
* An ack packet is header only, with a sequence number being ACKed.

* Both frag and chunk packets go through some of the same processing.
* Both may carry piggy-backed ACKs which may need to be processed.
* Both have sequence numbers which must be processed and may result in
  dropping the packet and/or queueing an ACK to the sender.

Frag packets may be either regular PML fragments or PUT segments. If
the "put_addr" field of the BTL header is set, this is a PUT and the
data is copied directly to the user buffer. If this field is NULL,
the segment is passed up to the PML. The PML is expected to do
everything it needs with this packet in the callback, including
copying data out if needed. Once the callback is complete, the
receive buffer is recycled.

Chunk packets are parts of a larger fragment. If an active fragment
receive for the matching fragment ID cannot be found, a new fragment
info descriptor is allocated. If this is not a PUT (`put_addr ==
NULL`), we `malloc()` data to reassemble the fragment into. Each
subsequent chunk is copied either into this reassembly buffer or
directly into user memory. When the last chunk of a fragment arrives,
a PML callback is made for non-PUTs, then the fragment info descriptor
is released.

## fast receive optimization

In order to optimize the latency of small packets, the component
progress routine implements a fast path for receives. If the first
completion is a receive on the priority queue, then it is handled by a
routine called `opal_btl_usnic_recv_fast()`, which does nothing but
validate that the packet is OK to be received (sequence number OK and
not a DUP) and then deliver it to the PML. This packet is recorded in
the channel structure, and all bookkeeping for the packet is deferred
until the next time `component_progress` is called.

This fast path cannot be taken every time we pass through
`component_progress` because there will be other completions that need
processing, and the receive bookkeeping for one fast receive must be
complete before allowing another fast receive to occur, as only one
recv segment can be saved for deferred processing at a time. This is
handled by maintaining a variable in `opal_btl_usnic_recv_fast()`
called `fastpath_ok`, which is set to false every time the fastpath is
taken. A call into the regular progress routine will set this flag
back to true, as sketched below.

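A stripped-down sketch of that gating logic (hypothetical names, not
the actual usnic code):

```c
#include <stdbool.h>

static bool fastpath_ok = true;   /* only one deferred receive at a time */

/* Called for the first completion if it is a priority-queue receive. */
static bool try_fast_receive(void)
{
    if (!fastpath_ok) {
        return false;             /* previous fast receive not yet retired */
    }
    fastpath_ok = false;          /* its bookkeeping is now pending */
    /* ... validate the sequence number and deliver the packet to the PML,
     * recording the segment for later bookkeeping ... */
    return true;
}

/* The regular progress routine finishes the deferred bookkeeping. */
static void regular_progress(void)
{
    /* ... repost the saved receive segment, update counters, etc. ... */
    fastpath_ok = true;           /* re-arm the fast path */
}
```
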
## reliability

* every packet has a sequence #
* each endpoint has a "send window", currently 4096 entries
* once a segment is sent, it is saved in the window array until an ACK
  is received
* ACKs acknowledge all packets <= the specified sequence # (see the
  sketch at the end of this section)
* the receiver only ACKs a sequence # when all packets up to that
  sequence have arrived

* each packet has a default retransmit timer of 100ms
* a packet will be scheduled for retransmission if its timer expires

Once a segment is sent, it always has its retransmit timer started.
This is accomplished by `opal_hotel_checkin()`.
Any time a segment is posted to the NIC for retransmit, it is checked
out of the hotel (timer stopped).
So, a send segment is always in one of 4 states:

* on the free list, unallocated
* on the endpoint to-send list, in the case of a segment associated
  with a small fragment
* posted to the NIC and in the hotel awaiting an ACK
* on the module re-send list awaiting retransmission

Receiver:

* if a packet with seq >= the expected seq is received, schedule an
  ACK of the largest in-order sequence received, if not already
  scheduled; the default time is 50us
* if a packet with seq < the expected seq arrives, we send an ACK
  immediately, as this indicates a lost ACK

Sender:

* a duplicate ACK triggers an immediate retransmit if one is not
  pending for that segment

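For illustration, cumulative ACK handling on the sender amounts to the
following sketch (hypothetical types; the real code also releases send
credits and checks segments out of the retransmit hotel):

```c
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SIZE 4096              /* matches the send window size above */

struct send_window {
    void    *segments[WINDOW_SIZE];   /* segment kept here until ACKed */
    uint64_t lowest_unacked;          /* oldest sequence still outstanding */
};

/* An ACK for acked_seq retires every saved segment with seq <= acked_seq. */
static void handle_cumulative_ack(struct send_window *win, uint64_t acked_seq)
{
    while (win->lowest_unacked <= acked_seq) {
        size_t slot = (size_t)(win->lowest_unacked % WINDOW_SIZE);
        /* Real code would stop the segment's retransmit timer here
         * before releasing it. */
        win->segments[slot] = NULL;
        win->lowest_unacked++;
    }
}
```
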
## Reordering induced by two queues and piggy-backing

ACKs can be reordered:

* not an issue at all; old ACKs are simply ignored

Sends can be reordered:

* a small send can jump far ahead of large sends
* a large send followed by lots of small sends could trigger many
  retransmissions of the large sends. The smalls would have to be
  paced pretty precisely to keep the command queue empty enough and
  also beat out the large sends. Send credits limit how many larges
  can be queued on the sender, but there could be many on the receiver.

## RDMA emulation

We emulate the RDMA PUT because it's more efficient than a regular
send: it allows the receiver to copy directly to the target buffer
(vs. making an intermediate copy out of the bounce buffer).

It would actually be better to morph this PUT into a GET -- GET would
be slightly more efficient. In short, when the target requests the
actual RDMA data, with PUT, the request has to go up to the PML, which
will then invoke PUT on the source's BTL module. With GET, the target
issues the GET, and the source BTL module can reply without needing to
go up the stack to the PML.

Once we start supporting RDMA in hardware:

* we need to provide `module.btl_register_mem` and
  `module.btl_deregister_mem` functions (see openib for an example)
* we need to put something meaningful in
  `btl_usnic_frag.h:mca_btl_base_registration_handle_t`
* we need to set `module.btl_registration_handle_size` to
  `sizeof(struct mca_btl_base_registration_handle_t)`
* `module.btl_put` / `module.btl_get` will receive the
  `mca_btl_base_registration_handle_t` from the peer as a cookie

Also, `module.btl_put` / `module.btl_get` do not need to make
descriptors (this was an optimization added in BTL 3.0). They are now
called with enough information to do whatever they need to do.
`module.btl_put` still makes a descriptor and submits it to the usnic
sending engine so as to utilize a common infrastructure for send and
put.

But it doesn't necessarily have to be that way -- we could optimize
out the use of the descriptors. We have not investigated how easy or
hard that would be.

## libfabric abstractions

* `fi_fabric`: corresponds to a VIC PF
* `fi_domain`: corresponds to a VIC VF
* `fi_endpoint`: resources inside the VIC VF (basically a QP)

## `MPI_THREAD_MULTIPLE` support

In order to make the usnic BTL thread-safe, mutex locks are used to
protect the critical path, i.e., libfabric routines, bookkeeping, etc.

The lock in question is `btl_usnic_lock`. It is a RECURSIVE lock,
meaning that the same thread can take the lock again even if it
already holds it. This allows the callback function to post another
segment right away if we know that the current segment completed
inline (so we can call send within send without deadlocking).

These two functions take care of hotel check-in/check-out, and we have
to protect that part, so we take the mutex lock before we enter them:

* `opal_btl_usnic_check_rts()`
* `opal_btl_usnic_handle_ack()`

The calls to libfabric routines in the following functions also have
to be protected:

* `opal_btl_usnic_endpoint_send_segment()` (`fi_send`)
* `opal_btl_usnic_recv_call()` (`fi_recvmsg`)

The cclient connectivity checking (`opal_btl_usnic_connectivity_ping`)
has to be protected as well. This happens only in the beginning, but
the cclient communicates with the cagent through `opal_fd_read/write()`,
and if two or more clients do `opal_fd_write()` at the same time, the
data might be corrupted.

With this scheme, many functions in btl/usnic that call the listed
functions are protected by the `OPAL_THREAD_LOCK` macro, which will
only be active if the user invokes `MPI_Init_thread()` with
`MPI_THREAD_MULTIPLE` support. A minimal illustration of the
recursive-lock idea follows.

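The sketch below shows only the recursive-lock behavior described
above (Open MPI uses its own OPAL lock type, not raw pthreads, and the
real send path is much more involved):

```c
#include <pthread.h>

static pthread_mutex_t usnic_lock;

static void usnic_lock_init(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* RECURSIVE: the owning thread may re-acquire a lock it already holds. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&usnic_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

static void send_segment(void)
{
    pthread_mutex_lock(&usnic_lock);
    /* ... post the segment; if it completes inline, the completion callback
     * may call send_segment() again and simply re-enter the same lock ... */
    pthread_mutex_unlock(&usnic_lock);
}
```
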
@ -1,383 +0,0 @@
|
||||
Design notes on usnic BTL
|
||||
|
||||
======================================
|
||||
nomenclature
|
||||
|
||||
fragment - something the PML asks us to send or put, any size
|
||||
segment - something we can put on the wire in a single packet
|
||||
chunk - a piece of a fragment that fits into one segment
|
||||
|
||||
a segment can contain either an entire fragment or a chunk of a fragment
|
||||
|
||||
each segment and fragment has associated descriptor.
|
||||
|
||||
Each segment data structure has a block of registered memory associated with
|
||||
it which matches MTU for that segment
|
||||
ACK - acks get special small segments with only enough memory for an ACK
|
||||
non-ACK segments always have a parent fragment
|
||||
|
||||
fragments are either large (> MTU) or small (<= MTU)
|
||||
a small fragment has a segment descriptor embedded within it since it
|
||||
always needs exactly one.
|
||||
|
||||
a large fragment has no permanently associated segments, but allocates them
|
||||
as needed.
|
||||
|
||||
======================================
|
||||
channels
|
||||
|
||||
a channel is a queue pair with an associated completion queue
|
||||
each channel has its own MTU and r/w queue entry counts
|
||||
|
||||
There are 2 channels, command and data
|
||||
command queue is generally for higher priority fragments
|
||||
data queue is for standard data traffic
|
||||
command queue should possibly be called "priority" queue
|
||||
|
||||
command queue is shorter and has a smaller MTU that the data queue
|
||||
this makes the command queue a lot faster than the data queue, so we
|
||||
hijack it for sending very small fragments (<= tiny_mtu, currently 768 bytes)
|
||||
|
||||
command queue is used for ACKs and tiny fragments
|
||||
data queue is used for everything else
|
||||
|
||||
PML fragments marked priority should perhaps use command queue
|
||||
|
||||
======================================
|
||||
sending
|
||||
|
||||
Normally, all send requests are simply enqueued and then actually posted
|
||||
to the NIC by the routine opal_btl_usnic_module_progress_sends().
|
||||
"fastpath" tiny sends are the exception.
|
||||
|
||||
Each module maintains a queue of endpoints that are ready to send.
|
||||
An endpoint is ready to send if all of the following are met:
|
||||
- the endpoint has fragments to send
|
||||
- the endpoint has send credits
|
||||
- the endpoint's send window is "open" (not full of un-ACKed segments)
|
||||
|
||||
Each module also maintains a list of segments that need to be retransmitted.
|
||||
Note that the list of pending retrans is per-module, not per-endpoint.
|
||||
|
||||
send progression first posts any pending retransmissions, always using the
|
||||
data channel. (reason is that if we start getting heavy congestion and
|
||||
there are lots of retransmits, it becomes more important than ever to
|
||||
prioritize ACKs, clogging command channel with retrans data makes things worse,
|
||||
not better)
|
||||
|
||||
Next, progression loops sending segments to the endpoint at the top of
|
||||
the "endpoints_with_sends" queue. When an endpoint exhausts its send
|
||||
credits or fills its send window or runs out of segments to send, it removes
|
||||
itself from the endpoint_with_sends list. Any pending ACKs will be
|
||||
picked up and piggy-backed on these sends.
|
||||
|
||||
Finally, any endpoints that still need ACKs whose timer has expired will
|
||||
be sent explicit ACK packets.
|
||||
|
||||
[double-click fragment sending]
|
||||
The middle part of the progression loop handles both small (single-segment)
|
||||
and large (multi-segment) sends.
|
||||
|
||||
For small fragments, the verbs descriptor within the embedded segment is
|
||||
updated with length, BTL header is updated, then we call
|
||||
opal_btl_usnic_endpoint_send_segment() to send the segment.
|
||||
After posting, we make a PML callback if needed.
|
||||
|
||||
For large fragments, a little more is needed. segments froma large
|
||||
fragment have a slightly larger BTL header which contains a fragment ID,
|
||||
and offset, and a size. The fragment ID is allocated when the first chunk
|
||||
the fragment is sent. A segment gets allocated, next blob of data is
|
||||
copied into this segment, segment is posted. If last chunk of fragment
|
||||
sent, perform callback if needed, then remove fragment from endpoint
|
||||
send queue.
|
||||
|
||||
[double-click opal_btl_usnic_endpoint_send_segment()]
|
||||
|
||||
This is common posting code for large or small segments. It assigns a
|
||||
sequence number to a segment, checks for an ACK to piggy-back,
|
||||
posts the segment to the NIC, and then starts the retransmit timer
|
||||
by checking the segment into hotel. Send credits are consumed here.
|
||||
|
||||
|
||||
======================================
|
||||
send dataflow
|
||||
|
||||
PML control messages with no user data are sent via:
|
||||
desc = usnic_alloc(size)
|
||||
usnic_send(desc)
|
||||
|
||||
user messages less than eager limit and 1st part of larger
|
||||
messages are sent via:
|
||||
desc = usnic_prepare_src(convertor, size)
|
||||
usnic_send(desc)
|
||||
|
||||
larger msgs
|
||||
desc = usnic_prepare_src(convertor, size)
|
||||
usnic_put(desc)
|
||||
|
||||
|
||||
usnic_alloc() currently asserts the length is "small", allocates and
|
||||
fills in a small fragment. src pointer will point to start of
|
||||
associated registered mem + sizeof BTL header, and PML will put its
|
||||
data there.
|
||||
|
||||
usnic_prepare_src() allocated either a large or small fragment based on size
|
||||
The fragment descriptor is filled in to have 2 SG entries, 1st pointing to
|
||||
place where PML should construct its header. If the data convertor says
|
||||
data is contiguous, 2nd SG entry points to user buffer, else it is null and
|
||||
sf_convertor is filled in with address of convertor.
|
||||
|
||||
usnic_send()
|
||||
If the fragment being sent is small enough, has contiguous data, and
|
||||
"very few" command queue send WQEs have been consumed, usnic_send() does
|
||||
a fastpath send. This means it posts the segment immediately to the NIC
|
||||
with INLINE flag set.
|
||||
|
||||
If all of the conditions for fastpath send are not met, and this is a small
|
||||
fragment, the user data is copied into the associated registered memory at this
|
||||
time and the SG list in the descriptor is collapsed to one entry.
|
||||
|
||||
After the checks above are done, the fragment is enqueued to be sent
|
||||
via opal_btl_usnic_endpoint_enqueue_frag()
|
||||
|
||||
usnic_put()
|
||||
Do a fast version of what happens in prepare_src() (can take shortcuts
|
||||
because we know it will always be a contiguous buffer / no convertor
|
||||
needed). PML gives us the destination address, which we save on the
|
||||
fragment (which is the sentinel value that the underlying engine uses
|
||||
to know that this is a PUT and not a SEND), and the fragment is
|
||||
enqueued for processing.
|
||||
|
||||
opal_btl_usnic_endpoint_enqueue_frag()
|
||||
This appends the fragment to the "to be sent" list of the endpoint and
|
||||
conditionally adds the endpoint to the list of endpoints with data to send
|
||||
via opal_btl_usnic_check_rts()
|
||||
|
||||
======================================
|
||||
receive dataflow
|
||||
|
||||
BTL packets has one of 3 types in header: frag, chunk, or ack.
|
||||
|
||||
A frag packet is a full PML fragment.
|
||||
A chunk packet is a piece of a fragment that needs to be reassembled.
|
||||
An ack packet is header only with a sequence number being ACKed.
|
||||
|
||||
Both frag and chunk packets go through some of the same processing.
|
||||
Both may carry piggy-backed ACKs which may need to be processed.
|
||||
Both have sequence numbers which must be processed and may result in
|
||||
dropping the packet and/or queueing an ACK to the sender.
|
||||
|
||||
frag packets may be either regular PML fragments or PUT segments.
|
||||
If the "put_addr" field of the BTL header is set, this is a PUT and
|
||||
the data is copied directly to the user buffer. If this field is NULL,
|
||||
the segment is passed up to the PML. The PML is expected to do everything
|
||||
it needs with this packet in the callback, including copying data out if
|
||||
needed. Once the callback is complete, the receive buffer is recycled.
|
||||
|
||||
chunk packets are parts of a larger fragment. If an active fragment receive
|
||||
for the matching fragment ID cannot be found, and new fragment info
|
||||
descriptor is allocated. If this is not a PUT (put_addr == NULL), we
|
||||
malloc() data to reassemble the fragment into. Each subsequent chunk
|
||||
is copied either into this reassembly buffer or directly into user memory.
|
||||
When the last chunk of a fragment arrives, a PML callback is made for non-PUTs,
|
||||
then the fragment info descriptor is released.
|
||||
|
||||
======================================
|
||||
fast receive optimization
|
||||
|
||||
In order to optimize latency of small packets, the component progress routine
|
||||
implements a fast path for receives. If the first completion is a receive on
|
||||
the priority queue, then it is handled by a routine called
|
||||
opal_btl_usnic_recv_fast() which does nothing but validates that the packet
|
||||
is OK to be received (sequence number OK and not a DUP) and then delivers it
|
||||
to the PML. This packet is recorded in the channel structure, and all
|
||||
bookeeping for the packet is deferred until the next time component_progress
|
||||
is called again.
|
||||
|
||||
This fast path cannot be taken every time we pass through component_progress
|
||||
because there will be other completions that need processing, and the receive
|
||||
bookeeping for one fast receive must be complete before allowing another fast
|
||||
receive to occur, as only one recv segment can be saved for deferred
|
||||
processing at a time. This is handled by maintaining a variable in
|
||||
opal_btl_usnic_recv_fast() called fastpath_ok which is set to false every time
|
||||
the fastpath is taken. A call into the regular progress routine will set this
|
||||
flag back to true.
|
||||
|
||||
======================================
|
||||
reliability:
|
||||
|
||||
every packet has sequence #
|
||||
each endpoint has a "send window" , currently 4096 entries.
|
||||
once a segment is sent, it is saved in window array until ACK is received
|
||||
ACKs acknowledge all packets <= specified sequence #
|
||||
rcvr only ACKs a sequence # when all packets up to that sequence have arrived
|
||||
|
||||
each pkt has dflt retrans timer of 100ms
|
||||
packet will be scheduled for retrans if timer expires
|
||||
|
||||
Once a segment is sent, it always has its retransmit timer started.
|
||||
This is accomplished by opal_hotel_checkin()
|
||||
Any time a segment is posted to the NIC for retransmit, it is checked out
|
||||
of the hotel (timer stopped).
|
||||
So, a send segment is always in one of 4 states:
|
||||
- on free list, unallocated
|
||||
- on endpoint to-send list in the case of segment associated with small fragment
|
||||
- posted to NIC and in hotel awaiting ACK
|
||||
- on module re-send list awaiting retransmission
|
||||
|
||||
rcvr:
|
||||
- if a pkt with seq >= expected seq is received, schedule ack of largest
|
||||
in-order sequence received if not already scheduled. dflt time is 50us
|
||||
- if a packet with seq < expected seq arrives, we send an ACK immediately,
|
||||
as this indicates a lost ACK
|
||||
|
||||
sender:
|
||||
duplicate ACK triggers immediate retrans if one is not pending for that segment
|
||||
|
||||
======================================
|
||||
Reordering induced by two queues and piggy-backing:
|
||||
|
||||
ACKs can be reordered-
|
||||
not an issue at all, old ACKs are simply ignored
|
||||
|
||||
Sends can be reordered-
|
||||
(small send can jump far ahead of large sends)
|
||||
large send followed by lots of small sends could trigger many retrans
|
||||
of the large sends. smalls would have to be paced pretty precisely to
|
||||
keep command queue empty enough and also beat out the large sends.
|
||||
send credits limit how many larges can be queued on the sender, but there
|
||||
could be many on the receiver
|
||||
|
||||
|
||||
======================================
|
||||
RDMA emulation
|
||||
|
||||
We emulate the RDMA PUT because it's more efficient than regular send:
|
||||
it allows the receive to copy directly to the target buffer
|
||||
(vs. making an intermediate copy out of the bounce buffer).
|
||||
|
||||
It would actually be better to morph this PUT into a GET -- GET would
|
||||
be slightly more efficient. In short, when the target requests the
|
||||
actual RDMA data, with PUT, the request has to go up to the PML, which
|
||||
will then invoke PUT on the source's BTL module. With GET, the target
|
||||
issues the GET, and the source BTL module can reply without needing to
|
||||
go up the stack to the PML.
|
||||
|
||||
Once we start supporting RDMA in hardware:
|
||||
|
||||
- we need to provide module.btl_register_mem and
|
||||
module.btl_deregister_mem functions (see openib for an example)
|
||||
- we need to put something meaningful in
|
||||
btl_usnic_frag.h:mca_btl_base_registration_handle_t.
|
||||
- we need to set module.btl_registration_handle_size to sizeof(struct
|
||||
mca_btl_base_registration_handle_t).
|
||||
- module.btl_put / module.btl_get will receive the
|
||||
mca_btl_base_registration_handle_t from the peer as a cookie.
|
||||
|
||||
Also, module.btl_put / module.btl_get do not need to make descriptors
|
||||
(this was an optimization added in BTL 3.0). They are now called with
|
||||
enough information to do whatever they need to do. module.btl_put
|
||||
still makes a descriptor and submits it to the usnic sending engine so
|
||||
as to utilize a common infrastructure for send and put.
|
||||
|
||||
But it doesn't necessarily have to be that way -- we could optimize
|
||||
out the use of the descriptors. Have not investigated how easy/hard
|
||||
that would be.
|
||||
|
||||
======================================
|
||||
|
||||
November 2014 / SC 2014
|
||||
Update February 2015
|
||||
|
||||
The usnic BTL code has been unified across master and the v1.8
|
||||
branches.
|
||||
|
||||
NOTE: As of May 2018, this is no longer true. This was generally
|
||||
only necessary back when the BTLs were moved from the OMPI layer to
|
||||
the OPAL layer. Now that the BTLs have been down in OPAL for
|
||||
several years, this tomfoolery is no longer necessary. This note
|
||||
is kept for historical purposes, just in case someone needs to go
|
||||
back and look at the v1.8 series.
|
||||
|
||||
That is, you can copy the code from v1.8:ompi/mca/btl/usnic/* to
|
||||
master:opal/mca/btl/usnic*, and then only have to make 3 changes in
|
||||
the resulting code in master:
|
||||
|
||||
1. Edit Makefile.am: s/ompi/opal/gi
|
||||
2. Edit configure.m4: s/ompi/opal/gi
|
||||
--> EXCEPT for:
|
||||
- opal_common_libfabric_* (which will eventually be removed,
|
||||
when the embedded libfabric goes away)
|
||||
- OPAL_BTL_USNIC_FI_EXT_USNIC_H (which will eventually be
|
||||
removed, when the embedded libfabric goes away)
|
||||
- OPAL_VAR_SCOPE_*
|
||||
3. Edit Makefile.am: change -DBTL_IN_OPAL=0 to -DBTL_IN_OPAL=1
|
||||
|
||||
*** Note: the BTL_IN_OPAL preprocessor macro is set in Makefile.am
|
||||
rather that in btl_usnic_compat.h to avoid all kinds of include
|
||||
file dependency issues (i.e., btl_usnic_compat.h would need to be
|
||||
included first, but it requires some data structures to be
|
||||
defined, which means it either can't be first or we have to
|
||||
declare various structs first... just put BTL_IN_OPAL in
|
||||
Makefile.am and be happy).
|
||||
|
||||
*** Note 2: CARE MUST BE TAKEN WHEN COPYING THE OTHER DIRECTION! It
|
||||
is *not* as simple as simple s/opal/ompi/gi in configure.m4 and
|
||||
Makefile.am. It certainly can be done, but there's a few strings
|
||||
that need to stay "opal" or "OPAL" (e.g., OPAL_HAVE_FOO).
|
||||
Hence, the string replace will likely need to be done via manual
|
||||
inspection.
|
||||
|
||||
Things still to do:
|
||||
|
||||
- VF/PF sanity checks in component.c:check_usnic_config() uses
|
||||
usnic-specific fi_provider info. The exact mechanism might change
|
||||
as provider-specific info is still being discussed upstream.
|
||||
|
||||
- component.c:usnic_handle_cq_error is using a USD_* constant from
|
||||
usnic_direct. Need to get that value through libfabric somehow.
|
||||
|
||||
======================================
|
||||
|
||||
libfabric abstractions:
|
||||
|
||||
fi_fabric: corresponds to a VIC PF
|
||||
fi_domain: corresponds to a VIC VF
|
||||
fi_endpoint: resources inside the VIC VF (basically a QP)
|
||||
|
||||
======================================
|
||||
|
||||
MPI_THREAD_MULTIPLE support
|
||||
|
||||
In order to make usnic btl thread-safe, the mutex locks are issued
|
||||
to protect the critical path. ie; libfabric routines, book keeping, etc.
|
||||
|
||||
The said lock is btl_usnic_lock. It is a RECURSIVE lock, meaning that
|
||||
the same thread can take the lock again even if it already has the lock to
|
||||
allow the callback function to post another segment right away if we know
|
||||
that the current segment is completed inline. (So we can call send in send
|
||||
without deadlocking)
|
||||
|
||||
These two functions taking care of hotel checkin/checkout and we
|
||||
have to protect that part. So we take the mutex lock before we enter the
|
||||
function.
|
||||
|
||||
- opal_btl_usnic_check_rts()
|
||||
- opal_btl_usnic_handle_ack()
|
||||
|
||||
We also have to protect the call to libfabric routines
|
||||
|
||||
- opal_btl_usnic_endpoint_send_segment() (fi_send)
|
||||
- opal_btl_usnic_recv_call() (fi_recvmsg)
|
||||
|
||||
have to be protected as well.
|
||||
|
||||
Also cclient connection checking (opal_btl_usnic_connectivity_ping) has to be
|
||||
protected. This happens only in the beginning but cclient communicate with cagent
|
||||
through opal_fd_read/write() and if two or more clients do opal_fd_write() at the
|
||||
same time, the data might be corrupt.
|
||||
|
||||
With this concept, many functions in btl/usnic that make calls to the
|
||||
listed functions are protected by OPAL_THREAD_LOCK macro which will only
|
||||
be active if the user specify MPI_Init_thread() with MPI_THREAD_MULTIPLE
|
||||
support.
@ -1,50 +0,0 @@
# Copyright (c) 2013 Mellanox Technologies, Inc.
# All rights reserved
# $COPYRIGHT$

MEMHEAP Infrastructure documentation
------------------------------------

The MEMHEAP infrastructure is responsible for managing the symmetric
heap.  The framework currently has the following components: buddy and
ptmalloc.  The buddy component uses a buddy allocator to manage memory
allocations on the symmetric heap; ptmalloc is an adaptation of
ptmalloc3.

Additional components may be added easily to the framework by defining
the component's and the module's base and extended structures, and
their functionalities.

The buddy allocator has the following data structures:
1. Base component - of type struct mca_memheap_base_component_2_0_0_t
2. Base module - of type struct mca_memheap_base_module_t
3. Buddy component - of type struct mca_memheap_base_component_2_0_0_t
4. Buddy module - of type struct mca_memheap_buddy_module_t extending
   the base module (struct mca_memheap_base_module_t)

Each data structure includes the following fields:
1. Base component - memheap_version, memheap_data and memheap_init
2. Base module - holds pointers to the base component and to the
   functions: alloc, free and finalize
3. Buddy component - is a base component.
4. Buddy module - extends the base module and holds additional data on
   the component's priority, buddy allocator, maximal order of the
   symmetric heap, symmetric heap, pointer to the symmetric heap, and a
   hashtable maintaining the size of each allocated address.

If the user decides to implement additional components, the MEMHEAP
infrastructure chooses the component with the maximal priority.
Handling the component opening is done under the base directory, in
three stages:
1. Open all available components.  Implemented by memheap_base_open.c
   and called from shmem_init.
2. Select the maximal-priority component.  This procedure involves the
   initialization of all components and then their finalization, except
   for the chosen component.  It is implemented by memheap_base_select.c
   and called from shmem_init.
3. Close the maximal-priority active component.  Implemented by
   memheap_base_close.c and called from shmem finalize.


Buddy Component/Module
----------------------

Responsible for handling all activities of the symmetric heap.
The supported activities are:
- buddy_init (initialization)
- buddy_alloc (allocates a variable on the symmetric heap)
- buddy_free (frees a variable previously allocated on the symmetric heap)
- buddy_finalize (finalization).

Data members of the buddy module:
- priority: the module's priority.
- buddy allocator: bits, num_free, lock and the maximal order (log2 of
  the maximal size) of a variable on the symmetric heap.  The buddy
  allocator gives the offset in the symmetric heap where a variable
  should be allocated.
- symmetric_heap: a range of reserved addresses (equal in all executing
  PEs) dedicated to "shared memory" allocation.
- symmetric_heap_hashtable: holds the size of an allocated variable on
  the symmetric heap; used to free an allocated variable on the
  symmetric heap.
71
oshmem/mca/memheap/README.md
@ -0,0 +1,71 @@
# MEMHEAP infrastructure documentation

Copyright (c) 2013 Mellanox Technologies, Inc.
All rights reserved

The MEMHEAP infrastructure is responsible for managing the symmetric
heap.  The framework currently has the following components: buddy and
ptmalloc.  The buddy component uses a buddy allocator to manage memory
allocations on the symmetric heap; ptmalloc is an adaptation of
ptmalloc3.

Additional components may be added easily to the framework by defining
the component's and the module's base and extended structures, and
their functionalities.

The buddy allocator has the following data structures:

1. Base component - of type struct mca_memheap_base_component_2_0_0_t
2. Base module - of type struct mca_memheap_base_module_t
3. Buddy component - of type struct mca_memheap_base_component_2_0_0_t
4. Buddy module - of type struct mca_memheap_buddy_module_t extending
   the base module (struct mca_memheap_base_module_t)

Each data structure includes the following fields (a rough sketch of
how they fit together follows the list):

1. Base component - memheap_version, memheap_data and memheap_init
2. Base module - holds pointers to the base component and to the
   functions: alloc, free and finalize
3. Buddy component - is a base component.
4. Buddy module - extends the base module and holds additional data on
   the component's priority, buddy allocator, maximal order of the
   symmetric heap, symmetric heap, pointer to the symmetric heap, and a
   hashtable maintaining the size of each allocated address.
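
The sketch below is illustrative only (the real declarations live in
the memheap framework headers and differ in detail); it shows roughly
how the component and module structures relate:

```
/* Illustrative sketch only; not the real declarations.  It mirrors the
 * field names listed above, but the actual types and signatures differ. */
#include <stddef.h>

typedef struct mca_memheap_base_module_t mca_memheap_base_module_t;

typedef struct mca_memheap_base_component_2_0_0_t {
    int memheap_version;   /* stand-in for the MCA version information */
    int memheap_data;      /* stand-in for the component metadata      */
    /* memheap_init: creates and returns the component's module,
     * reporting its priority through the out parameter (assumption). */
    mca_memheap_base_module_t *(*memheap_init)(int *priority);
} mca_memheap_base_component_2_0_0_t;

struct mca_memheap_base_module_t {
    /* pointer back to the owning component */
    mca_memheap_base_component_2_0_0_t *memheap_component;
    /* symmetric-heap operations */
    void *(*memheap_alloc)(size_t size);
    void  (*memheap_free)(void *ptr);
    void  (*memheap_finalize)(void);
};
```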

If the user decides to implement additional components, the MEMHEAP
infrastructure chooses the component with the maximal priority.
Handling the component opening is done under the base directory, in
three stages:

1. Open all available components.  Implemented by memheap_base_open.c
   and called from shmem_init.
2. Select the maximal-priority component (see the sketch after this
   list).  This procedure involves the initialization of all components
   and then their finalization, except for the chosen component.  It is
   implemented by memheap_base_select.c and called from shmem_init.
3. Close the maximal-priority active component.  Implemented by
   memheap_base_close.c and called from shmem finalize.
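
As a rough sketch (reusing the illustrative structures from the
previous sketch; the loop and names are assumptions, not the real
memheap_base_select.c code), the selection stage amounts to:

```
/* Rough sketch of stage 2 (selection); illustrative only. */
static mca_memheap_base_module_t *
memheap_select_sketch(mca_memheap_base_component_2_0_0_t **comps, int n)
{
    mca_memheap_base_module_t *best = NULL;
    int best_priority = -1;

    for (int i = 0; i < n; ++i) {
        int priority = 0;
        /* Initialize every opened component; each reports a priority. */
        mca_memheap_base_module_t *m = comps[i]->memheap_init(&priority);
        if (NULL == m) {
            continue;
        }
        if (priority > best_priority) {
            /* A better component: finalize the previous winner. */
            if (NULL != best) {
                best->memheap_finalize();
            }
            best = m;
            best_priority = priority;
        } else {
            /* Not selected: finalize it right away. */
            m->memheap_finalize();
        }
    }
    return best;   /* the maximal-priority module stays active */
}
```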

## Buddy Component/Module

Responsible for handling all activities of the symmetric heap.
The supported activities are:

1. buddy_init (initialization)
1. buddy_alloc (allocates a variable on the symmetric heap)
1. buddy_free (frees a variable previously allocated on the symmetric heap)
1. buddy_finalize (finalization).

Data members of the buddy module:

1. priority: the module's priority.
1. buddy allocator: bits, num_free, lock and the maximal order (log2
   of the maximal size) of a variable on the symmetric heap.  The
   buddy allocator gives the offset in the symmetric heap where a
   variable should be allocated.
1. symmetric_heap: a range of reserved addresses (equal in all
   executing PEs) dedicated to "shared memory" allocation.
1. symmetric_heap_hashtable: holds the size of an allocated variable
   on the symmetric heap; used to free an allocated variable on the
   symmetric heap.
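
To make the allocator bookkeeping concrete, here is a conceptual
sketch (not the real buddy module) of how a request size maps to a
buddy order and how the size is remembered so buddy_free() can release
the block later:

```
#include <stddef.h>

/* Conceptual sketch only; the real module derives its maximum order
 * from the configured symmetric-heap size. */
#define SKETCH_MIN_ORDER 3   /* smallest block: 2^3 = 8 bytes (assumption) */

static unsigned buddy_order_for_size(size_t size)
{
    unsigned order = SKETCH_MIN_ORDER;
    while (((size_t)1 << order) < size) {
        ++order;
    }
    return order;  /* log2 of the power-of-two block that will be handed out */
}

/* buddy_alloc(size) then roughly does:
 *   offset = find_free_block_of_order(allocator, buddy_order_for_size(size));
 *   addr   = symmetric_heap_base + offset;
 *   store (addr -> size) in symmetric_heap_hashtable;  // used by buddy_free()
 *   return addr;
 */
```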
@ -1,7 +0,0 @@
The functions in this directory are all intended to test registry
operations against a persistent seed.  Thus, they perform a system
init/finalize.  The functions in the directory above this one should be
used to test basic registry operations within the replica - they will
isolate the replica so as to avoid the communications issues and the
init/finalize problems in other subsystems that may cause problems
here.

To run these tests, you first need to start a persistent daemon.  This
can be done using the command:

orted --seed --scope public --persistent

The daemon will "daemonize" itself and establish the registry (as well
as other central services) replica, and then return a system prompt.
You can then run any of these functions.  If desired, you can use gdb
and/or debug options on the persistent orted to watch/debug replica
operations as well.
20
test/runtime/README.md
@ -0,0 +1,20 @@
The functions in this directory are all intended to test registry
operations against a persistent seed.  Thus, they perform a system
init/finalize.  The functions in the directory above this one should be
used to test basic registry operations within the replica - they will
isolate the replica so as to avoid the communications issues and the
init/finalize problems in other subsystems that may cause problems
here.

To run these tests, you first need to start a persistent daemon.  This
can be done using the command:

```
orted --seed --scope public --persistent
```

The daemon will "daemonize" itself and establish the registry (as well
as other central services) replica, and then return a system prompt.
You can then run any of these functions.  If desired, you can use gdb
and/or debug options on the persistent orted to watch/debug replica
operations as well.