diff --git a/HACKING b/HACKING deleted file mode 100644 index 35710f95a0..0000000000 --- a/HACKING +++ /dev/null @@ -1,272 +0,0 @@ -Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana - University Research and Technology - Corporation. All rights reserved. -Copyright (c) 2004-2005 The University of Tennessee and The University - of Tennessee Research Foundation. All rights - reserved. -Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, - University of Stuttgart. All rights reserved. -Copyright (c) 2004-2005 The Regents of the University of California. - All rights reserved. -Copyright (c) 2008-2020 Cisco Systems, Inc. All rights reserved. -Copyright (c) 2013 Intel, Inc. All rights reserved. -$COPYRIGHT$ - -Additional copyrights may follow - -$HEADER$ - -Overview -======== - -This file is here for those who are building/exploring OMPI in its -source code form, most likely through a developer's tree (i.e., a -Git clone). - - -Developer Builds: Compiler Pickyness by Default -=============================================== - -If you are building Open MPI from a Git clone (i.e., there is a ".git" -directory in your build tree), the default build includes extra -compiler pickyness, which will result in more compiler warnings than -in non-developer builds. Getting these extra compiler warnings is -helpful to Open MPI developers in making the code base as clean as -possible. - -Developers can disable this picky-by-default behavior by using the ---disable-picky configure option. Also note that extra-picky compiles -do *not* happen automatically when you do a VPATH build (e.g., if -".git" is in your source tree, but not in your build tree). - -Prior versions of Open MPI would automatically activate a lot of -(performance-reducing) debugging code by default if ".git" was found -in your build tree. This is no longer true. You can manually enable -these (performance-reducing) debugging features in the Open MPI code -base with these configure options: - - --enable-debug - --enable-mem-debug - --enable-mem-profile - -NOTE: These options are really only relevant to those who are -developing Open MPI itself. They are not generally helpful for -debugging general MPI applications. - - -Use of GNU Autoconf, Automake, and Libtool (and m4) -=================================================== - -You need to read/care about this section *ONLY* if you are building -from a developer's tree (i.e., a Git clone of the Open MPI source -tree). If you have an Open MPI distribution tarball, the contents of -this section are optional -- you can (and probably should) skip -reading this section. - -If you are building Open MPI from a developer's tree, you must first -install fairly recent versions of the GNU tools Autoconf, Automake, -and Libtool (and possibly GNU m4, because recent versions of Autoconf -have specific GNU m4 version requirements). The specific versions -required depend on if you are using the Git master branch or a release -branch (and which release branch you are using). 
The specific -versions can be found here: - - https://www.open-mpi.org/source/building.php - -You can check what versions of the autotools you have installed with -the following: - -shell$ m4 --version -shell$ autoconf --version -shell$ automake --version -shell$ libtoolize --version - -Required version levels for all the OMPI releases can be found here: - -https://www.open-mpi.org/source/building.php - -To strengthen the above point: the core Open MPI developers typically -use very, very recent versions of the GNU tools. There are known bugs -in older versions of the GNU tools that Open MPI no longer compensates -for (it seemed senseless to indefinitely support patches for ancient -versions of Autoconf, for example). You *WILL* have problems if you -do not use recent versions of the GNU tools. - -If you need newer versions, you are *strongly* encouraged to heed the -following advice: - -NOTE: On MacOS/X, the default "libtool" program is different than the - GNU libtool. You must download and install the GNU version - (e.g., via MacPorts, Homebrew, or some other mechanism). - -1. Unless your OS distribution has easy-to-use binary installations, - the sources can be can be downloaded from: - - ftp://ftp.gnu.org/gnu/autoconf/ - ftp://ftp.gnu.org/gnu/automake/ - ftp://ftp.gnu.org/gnu/libtool/ - and if you need it: - ftp://ftp.gnu.org/gnu/m4/ - - NOTE: It is certainly easiest to download/build/install all four of - these tools together. But note that Open MPI has no specific m4 - requirements; it is only listed here because Autoconf requires - minimum versions of GNU m4. Hence, you may or may not *need* to - actually install a new version of GNU m4. That being said, if you - are confused or don't know, just install the latest GNU m4 with the - rest of the GNU Autotools and everything will work out fine. - -2. Build and install the tools in the following order: - - 2a. m4 - 2b. Autoconf - 2c. Automake - 2d. Libtool - -3. You MUST install the last three tools (Autoconf, Automake, Libtool) - into the same prefix directory. These three tools are somewhat - inter-related, and if they're going to be used together, they MUST - share a common installation prefix. - - You can install m4 anywhere as long as it can be found in the path; - it may be convenient to install it in the same prefix as the other - three. Or you can use any recent-enough m4 that is in your path. - - 3a. It is *strongly* encouraged that you do not install your new - versions over the OS-installed versions. This could cause - other things on your system to break. Instead, install into - $HOME/local, or /usr/local, or wherever else you tend to - install "local" kinds of software. - 3b. In doing so, be sure to prefix your $path with the directory - where they are installed. For example, if you install into - $HOME/local, you may want to edit your shell startup file - (.bashrc, .cshrc, .tcshrc, etc.) to have something like: - - # For bash/sh: - export PATH=$HOME/local/bin:$PATH - # For csh/tcsh: - set path = ($HOME/local/bin $path) - - 3c. Ensure to set your $path *BEFORE* you configure/build/install - the four packages. - -4. All four packages require two simple commands to build and - install (where PREFIX is the prefix discussed in 3, above). - - shell$ cd - shell$ ./configure --prefix=PREFIX - shell$ make; make install - - --> If you are using the csh or tcsh shells, be sure to run the - "rehash" command after you install each package. 
- - shell$ cd - shell$ ./configure --prefix=PREFIX - shell$ make; make install - - --> If you are using the csh or tcsh shells, be sure to run the - "rehash" command after you install each package. - - shell$ cd - shell$ ./configure --prefix=PREFIX - shell$ make; make install - - --> If you are using the csh or tcsh shells, be sure to run the - "rehash" command after you install each package. - - shell$ cd - shell$ ./configure --prefix=PREFIX - shell$ make; make install - - --> If you are using the csh or tcsh shells, be sure to run the - "rehash" command after you install each package. - - m4, Autoconf and Automake build and install very quickly; Libtool will - take a minute or two. - -5. You can now run OMPI's top-level "autogen.pl" script. This script - will invoke the GNU Autoconf, Automake, and Libtool commands in the - proper order and setup to run OMPI's top-level "configure" script. - - Running autogen.pl may take a few minutes, depending on your - system. It's not very exciting to watch. :-) - - If you have a multi-processor system, enabling the multi-threaded - behavior in Automake 1.11 (or newer) can result in autogen.pl - running faster. Do this by setting the AUTOMAKE_JOBS environment - variable to the number of processors (threads) that you want it to - use before invoking autogen.pl. For example (you can again put - this in your shell startup files): - - # For bash/sh: - export AUTOMAKE_JOBS=4 - # For csh/tcsh: - set AUTOMAKE_JOBS 4 - - 5a. You generally need to run autogen.pl whenever the top-level - file "configure.ac" changes, or any files in the config/ or - /config/ directories change (these directories are - where a lot of "include" files for OMPI's configure script - live). - - 5b. You do *NOT* need to re-run autogen.pl if you modify a - Makefile.am. - -Use of Flex -=========== - -Flex is used during the compilation of a developer's checkout (it is -not used to build official distribution tarballs). Other flavors of -lex are *not* supported: given the choice of making parsing code -portable between all flavors of lex and doing more interesting work on -Open MPI, we greatly prefer the latter. - -Note that no testing has been performed to see what the minimum -version of Flex is required by Open MPI. We suggest that you use -v2.5.35 at the earliest. - -*** NOTE: Windows developer builds of Open MPI *require* Flex version -2.5.35. Specifically, we know that v2.5.35 works and 2.5.4a does not. -We have not tested to figure out exactly what the minimum required -flex version is on Windows; we suggest that you use 2.5.35 at the -earliest. It is for this reason that the -contrib/dist/make_dist_tarball script checks for a Windows-friendly -version of flex before continuing. - -For now, Open MPI will allow developer builds with Flex 2.5.4. This -is primarily motivated by the fact that RedHat/Centos 5 ships with -Flex 2.5.4. It is likely that someday Open MPI developer builds will -require Flex version >=2.5.35. - -Note that the flex-generated code generates some compiler warnings on -some platforms, but the warnings do not seem to be consistent or -uniform on all platforms, compilers, and flex versions. As such, we -have done little to try to remove those warnings. 
- -If you do not have Flex installed, it can be downloaded from the -following URL: - - https://github.com/westes/flex - -Use of Pandoc -============= - -Similar to prior sections, you need to read/care about this section -*ONLY* if you are building from a developer's tree (i.e., a Git clone -of the Open MPI source tree). If you have an Open MPI distribution -tarball, the contents of this section are optional -- you can (and -probably should) skip reading this section. - -The Pandoc tool is used to generate Open MPI's man pages. -Specifically: Open MPI's man pages are written in Markdown; Pandoc is -the tool that converts that Markdown to nroff (i.e., the format of man -pages). - -You must have Pandoc >=v1.12 when building Open MPI from a developer's -tree. If configure cannot find Pandoc >=v1.12, it will abort. - -If you need to install Pandoc, check your operating system-provided -packages (to include MacOS Homebrew and MacPorts). The Pandoc project -itself also offers binaries for their releases: - - https://pandoc.org/ diff --git a/HACKING.md b/HACKING.md new file mode 100644 index 0000000000..fe045ecf0e --- /dev/null +++ b/HACKING.md @@ -0,0 +1,258 @@ +# Open MPI Hacking / Developer's Guide + +## Overview + +This file is here for those who are building/exploring OMPI in its +source code form, most likely through a developer's tree (i.e., a +Git clone). + + +## Developer Builds: Compiler Pickyness by Default + +If you are building Open MPI from a Git clone (i.e., there is a `.git` +directory in your build tree), the default build includes extra +compiler pickyness, which will result in more compiler warnings than +in non-developer builds. Getting these extra compiler warnings is +helpful to Open MPI developers in making the code base as clean as +possible. + +Developers can disable this picky-by-default behavior by using the +`--disable-picky` configure option. Also note that extra-picky compiles +do *not* happen automatically when you do a VPATH build (e.g., if +`.git` is in your source tree, but not in your build tree). + +Prior versions of Open MPI would automatically activate a lot of +(performance-reducing) debugging code by default if `.git` was found +in your build tree. This is no longer true. You can manually enable +these (performance-reducing) debugging features in the Open MPI code +base with these configure options: + +* `--enable-debug` +* `--enable-mem-debug` +* `--enable-mem-profile` + +***NOTE:*** These options are really only relevant to those who are +developing Open MPI itself. They are not generally helpful for +debugging general MPI applications. + + +## Use of GNU Autoconf, Automake, and Libtool (and m4) + +You need to read/care about this section *ONLY* if you are building +from a developer's tree (i.e., a Git clone of the Open MPI source +tree). If you have an Open MPI distribution tarball, the contents of +this section are optional -- you can (and probably should) skip +reading this section. + +If you are building Open MPI from a developer's tree, you must first +install fairly recent versions of the GNU tools Autoconf, Automake, +and Libtool (and possibly GNU m4, because recent versions of Autoconf +have specific GNU m4 version requirements). The specific versions +required depend on if you are using the Git master branch or a release +branch (and which release branch you are using). [The specific +versions can be found +here](https://www.open-mpi.org/source/building.php). 
+ +You can check what versions of the autotools you have installed with +the following: + +``` +shell$ m4 --version +shell$ autoconf --version +shell$ automake --version +shell$ libtoolize --version +``` + +[Required version levels for all the OMPI releases can be found +here](https://www.open-mpi.org/source/building.php). + +To strengthen the above point: the core Open MPI developers typically +use very, very recent versions of the GNU tools. There are known bugs +in older versions of the GNU tools that Open MPI no longer compensates +for (it seemed senseless to indefinitely support patches for ancient +versions of Autoconf, for example). You *WILL* have problems if you +do not use recent versions of the GNU tools. + +***NOTE:*** On MacOS/X, the default `libtool` program is different +than the GNU libtool. You must download and install the GNU version +(e.g., via MacPorts, Homebrew, or some other mechanism). + +If you need newer versions, you are *strongly* encouraged to heed the +following advice: + +1. Unless your OS distribution has easy-to-use binary installations, + the sources can be can be downloaded from: + * https://ftp.gnu.org/gnu/autoconf/ + * https://ftp.gnu.org/gnu/automake/ + * https://ftp.gnu.org/gnu/libtool/ + * And if you need it: https://ftp.gnu.org/gnu/m4/ + + ***NOTE:*** It is certainly easiest to download/build/install all + four of these tools together. But note that Open MPI has no + specific m4 requirements; it is only listed here because Autoconf + requires minimum versions of GNU m4. Hence, you may or may not + *need* to actually install a new version of GNU m4. That being + said, if you are confused or don't know, just install the latest + GNU m4 with the rest of the GNU Autotools and everything will work + out fine. + +1. Build and install the tools in the following order: + 1. m4 + 1. Autoconf + 1. Automake + 1. Libtool + +1. You MUST install the last three tools (Autoconf, Automake, Libtool) + into the same prefix directory. These three tools are somewhat + inter-related, and if they're going to be used together, they MUST + share a common installation prefix. + + You can install m4 anywhere as long as it can be found in the path; + it may be convenient to install it in the same prefix as the other + three. Or you can use any recent-enough m4 that is in your path. + + 1. It is *strongly* encouraged that you do not install your new + versions over the OS-installed versions. This could cause + other things on your system to break. Instead, install into + `$HOME/local`, or `/usr/local`, or wherever else you tend to + install "local" kinds of software. + 1. In doing so, be sure to prefix your $path with the directory + where they are installed. For example, if you install into + `$HOME/local`, you may want to edit your shell startup file + (`.bashrc`, `.cshrc`, `.tcshrc`, etc.) to have something like: + + ```sh + # For bash/sh: + export PATH=$HOME/local/bin:$PATH + # For csh/tcsh: + set path = ($HOME/local/bin $path) + ``` + + 1. Ensure to set your `$PATH` *BEFORE* you configure/build/install + the four packages. + +1. All four packages require two simple commands to build and + install (where PREFIX is the prefix discussed in 3, above). + + ``` + shell$ cd + shell$ ./configure --prefix=PREFIX + shell$ make; make install + ``` + + ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to + run the `rehash` command after you install each package. 
+ + ``` + shell$ cd + shell$ ./configure --prefix=PREFIX + shell$ make; make install + ``` + + ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to + run the `rehash` command after you install each package. + + ``` + shell$ cd + shell$ ./configure --prefix=PREFIX + shell$ make; make install + ``` + + ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to + run the `rehash` command after you install each package. + + ``` + shell$ cd + shell$ ./configure --prefix=PREFIX + shell$ make; make install + ``` + + ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to + run the `rehash` command after you install each package. + + m4, Autoconf and Automake build and install very quickly; Libtool + will take a minute or two. + +1. You can now run OMPI's top-level `autogen.pl` script. This script + will invoke the GNU Autoconf, Automake, and Libtool commands in the + proper order and setup to run OMPI's top-level `configure` script. + + Running `autogen.pl` may take a few minutes, depending on your + system. It's not very exciting to watch. :smile: + + If you have a multi-processor system, enabling the multi-threaded + behavior in Automake 1.11 (or newer) can result in `autogen.pl` + running faster. Do this by setting the `AUTOMAKE_JOBS` environment + variable to the number of processors (threads) that you want it to + use before invoking `autogen`.pl. For example (you can again put + this in your shell startup files): + + ```sh + # For bash/sh: + export AUTOMAKE_JOBS=4 + # For csh/tcsh: + set AUTOMAKE_JOBS 4 + ``` + + 1. You generally need to run autogen.pl whenever the top-level file + `configure.ac` changes, or any files in the `config/` or + `/config/` directories change (these directories are + where a lot of "include" files for Open MPI's `configure` script + live). + + 1. You do *NOT* need to re-run `autogen.pl` if you modify a + `Makefile.am`. + +## Use of Flex + +Flex is used during the compilation of a developer's checkout (it is +not used to build official distribution tarballs). Other flavors of +lex are *not* supported: given the choice of making parsing code +portable between all flavors of lex and doing more interesting work on +Open MPI, we greatly prefer the latter. + +Note that no testing has been performed to see what the minimum +version of Flex is required by Open MPI. We suggest that you use +v2.5.35 at the earliest. + +***NOTE:*** Windows developer builds of Open MPI *require* Flex version +2.5.35. Specifically, we know that v2.5.35 works and 2.5.4a does not. +We have not tested to figure out exactly what the minimum required +flex version is on Windows; we suggest that you use 2.5.35 at the +earliest. It is for this reason that the +`contrib/dist/make_dist_tarball` script checks for a Windows-friendly +version of Flex before continuing. + +For now, Open MPI will allow developer builds with Flex 2.5.4. This +is primarily motivated by the fact that RedHat/Centos 5 ships with +Flex 2.5.4. It is likely that someday Open MPI developer builds will +require Flex version >=2.5.35. + +Note that the `flex`-generated code generates some compiler warnings +on some platforms, but the warnings do not seem to be consistent or +uniform on all platforms, compilers, and flex versions. As such, we +have done little to try to remove those warnings. + +If you do not have Flex installed, see [the Flex Github +repository](https://github.com/westes/flex). 
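If you want to sanity-check which Flex a developer build will pick up,
a quick version query (a minimal sketch; the exact version string
format varies between Flex releases) is:

```
# Show the Flex version found first in $PATH; compare it against the
# suggested minimum (v2.5.35) discussed above.
shell$ flex --version
```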
+ +## Use of Pandoc + +Similar to prior sections, you need to read/care about this section +*ONLY* if you are building from a developer's tree (i.e., a Git clone +of the Open MPI source tree). If you have an Open MPI distribution +tarball, the contents of this section are optional -- you can (and +probably should) skip reading this section. + +The Pandoc tool is used to generate Open MPI's man pages. +Specifically: Open MPI's man pages are written in Markdown; Pandoc is +the tool that converts that Markdown to nroff (i.e., the format of man +pages). + +You must have Pandoc >=v1.12 when building Open MPI from a developer's +tree. If configure cannot find Pandoc >=v1.12, it will abort. + +If you need to install Pandoc, check your operating system-provided +packages (to include MacOS Homebrew and MacPorts). [The Pandoc +project web site](https://pandoc.org/) itself also offers binaries for +their releases. diff --git a/LICENSE b/LICENSE index 6dacb6877e..906630dcc6 100644 --- a/LICENSE +++ b/LICENSE @@ -15,9 +15,9 @@ Copyright (c) 2004-2010 High Performance Computing Center Stuttgart, University of Stuttgart. All rights reserved. Copyright (c) 2004-2008 The Regents of the University of California. All rights reserved. -Copyright (c) 2006-2017 Los Alamos National Security, LLC. All rights +Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights reserved. -Copyright (c) 2006-2017 Cisco Systems, Inc. All rights reserved. +Copyright (c) 2006-2020 Cisco Systems, Inc. All rights reserved. Copyright (c) 2006-2010 Voltaire, Inc. All rights reserved. Copyright (c) 2006-2017 Sandia National Laboratories. All rights reserved. Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved. @@ -25,7 +25,7 @@ Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved. Copyright (c) 2006-2017 The University of Houston. All rights reserved. Copyright (c) 2006-2009 Myricom, Inc. All rights reserved. Copyright (c) 2007-2017 UT-Battelle, LLC. All rights reserved. -Copyright (c) 2007-2017 IBM Corporation. All rights reserved. +Copyright (c) 2007-2020 IBM Corporation. All rights reserved. Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing Centre, Federal Republic of Germany Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany @@ -45,7 +45,7 @@ Copyright (c) 2016 ARM, Inc. All rights reserved. Copyright (c) 2010-2011 Alex Brick . All rights reserved. Copyright (c) 2012 The University of Wisconsin-La Crosse. All rights reserved. -Copyright (c) 2013-2016 Intel, Inc. All rights reserved. +Copyright (c) 2013-2020 Intel, Inc. All rights reserved. Copyright (c) 2011-2017 NVIDIA Corporation. All rights reserved. Copyright (c) 2016 Broadcom Limited. All rights reserved. Copyright (c) 2011-2017 Fujitsu Limited. All rights reserved. @@ -56,7 +56,8 @@ Copyright (c) 2013-2017 Research Organization for Information Science (RIST). Copyright (c) 2017-2020 Amazon.com, Inc. or its affiliates. All Rights reserved. Copyright (c) 2018 DataDirect Networks. All rights reserved. -Copyright (c) 2018-2019 Triad National Security, LLC. All rights reserved. +Copyright (c) 2018-2020 Triad National Security, LLC. All rights reserved. +Copyright (c) 2020 Google, LLC. All rights reserved. 
$COPYRIGHT$ diff --git a/Makefile.am b/Makefile.am index 3062d5adc1..63613685ff 100644 --- a/Makefile.am +++ b/Makefile.am @@ -24,7 +24,7 @@ SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_SUBDIRS) test DIST_SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_DIST_SUBDIRS) test -EXTRA_DIST = README INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.txt AUTHORS +EXTRA_DIST = README.md INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.md AUTHORS include examples/Makefile.include diff --git a/README b/README deleted file mode 100644 index 00369190bd..0000000000 --- a/README +++ /dev/null @@ -1,2243 +0,0 @@ -Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana - University Research and Technology - Corporation. All rights reserved. -Copyright (c) 2004-2007 The University of Tennessee and The University - of Tennessee Research Foundation. All rights - reserved. -Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, - University of Stuttgart. All rights reserved. -Copyright (c) 2004-2007 The Regents of the University of California. - All rights reserved. -Copyright (c) 2006-2020 Cisco Systems, Inc. All rights reserved. -Copyright (c) 2006-2011 Mellanox Technologies. All rights reserved. -Copyright (c) 2006-2012 Oracle and/or its affiliates. All rights reserved. -Copyright (c) 2007 Myricom, Inc. All rights reserved. -Copyright (c) 2008-2018 IBM Corporation. All rights reserved. -Copyright (c) 2010 Oak Ridge National Labs. All rights reserved. -Copyright (c) 2011 University of Houston. All rights reserved. -Copyright (c) 2013-2020 Intel, Inc. All rights reserved. -Copyright (c) 2015 NVIDIA Corporation. All rights reserved. -Copyright (c) 2017-2018 Los Alamos National Security, LLC. All rights - reserved. -Copyright (c) 2017 Research Organization for Information Science - and Technology (RIST). All rights reserved. -Copyright (c) 2020 Google, LLC. All rights reserved. -Copyright (c) 2019-2020 Triad National Security, LLC. All rights - reserved. - -$COPYRIGHT$ - -Additional copyrights may follow - -$HEADER$ - -=========================================================================== - -When submitting questions and problems, be sure to include as much -extra information as possible. This web page details all the -information that we request in order to provide assistance: - - https://www.open-mpi.org/community/help/ - -The best way to report bugs, send comments, or ask questions is to -sign up on the user's and/or developer's mailing list (for user-level -and developer-level questions; when in doubt, send to the user's -list): - - users@lists.open-mpi.org - devel@lists.open-mpi.org - -Because of spam, only subscribers are allowed to post to these lists -(ensure that you subscribe with and post from exactly the same e-mail -address -- joe@example.com is considered different than -joe@mycomputer.example.com!). Visit these pages to subscribe to the -lists: - - https://lists.open-mpi.org/mailman/listinfo/users - https://lists.open-mpi.org/mailman/listinfo/devel - -Thanks for your time. 
- -=========================================================================== - -Much, much more information is also available in the Open MPI FAQ: - - https://www.open-mpi.org/faq/ - -=========================================================================== - -Quick start ------------ - -In many cases, Open MPI can be built and installed by simply -indicating the installation directory on the command line: - -$ tar xf openmpi-.tar.bz2 -$ cd openmpi- -$ ./configure --prefix= |& tee config.out -...lots of output... -$ make -j 8 |& tee make.out -...lots of output... -$ make install |& tee install.out -...lots of output... - -Note that there are many, many configuration options to the -"./configure" step. Some of them may be needed for your particular -environmnet; see below for desciptions of the options available. - -If your installation prefix path is not writable by a regular user, -you may need to use sudo or su to run the "make install" step. For -example: - -$ sudo make install |& tee install.out -[sudo] password for jsquyres: -...lots of output... - -Finally, note that VPATH builds are fully supported. For example: - -$ tar xf openmpi-.tar.bz2 -$ cd openmpi- -$ mkdir build -$ cd build -$ ../configure --prefix= |& tee config.out -...etc. - -The rest of this README file contains: - -- General release notes about Open MPI, including information about - platform, compiler, and run-time support, MPI and OpenSHMEM - functionality, network support, and MPI extensions. -- Detailed information on building and installing Open MPI. -- Open MPI version and library numbering policies, including how those - are related to backwards compatibility guarantees. -- Information on how to both query and validate your Open MPI - installation. -- Description of Open MPI extensions. -- Examples showing how to compile and run Open MPI applications. -- Summary information on the various plugin frameworks inside Open - MPI and OpenSHMEM. - -=========================================================================== - -The following abbreviated list of release notes applies to this code -base as of this writing (April 2020): - -General notes -------------- - -- Open MPI now includes two public software layers: MPI and OpenSHMEM. - Throughout this document, references to Open MPI implicitly include - both of these layers. When distinction between these two layers is - necessary, we will reference them as the "MPI" and "OpenSHMEM" - layers respectively. - -- OpenSHMEM is a collaborative effort between academia, industry, and - the U.S. Government to create a specification for a standardized API - for parallel programming in the Partitioned Global Address Space - (PGAS). For more information about the OpenSHMEM project, including - access to the current OpenSHMEM specification, please visit: - - http://openshmem.org/ - - This OpenSHMEM implementation will only work in Linux environments - with a restricted set of supported networks. - -- Open MPI includes support for a wide variety of supplemental - hardware and software package. When configuring Open MPI, you may - need to supply additional flags to the "configure" script in order - to tell Open MPI where the header files, libraries, and any other - required files are located. As such, running "configure" by itself - may not include support for all the devices (etc.) that you expect, - especially if their support headers / libraries are installed in - non-standard locations. 
Network interconnects are an easy example - to discuss -- Libfabric and OpenFabrics networks, for example, both - have supplemental headers and libraries that must be found before - Open MPI can build support for them. You must specify where these - files are with the appropriate options to configure. See the - listing of configure command-line switches, below, for more details. - -- The majority of Open MPI's documentation is here in this file, the - included man pages, and on the web site FAQ - (https://www.open-mpi.org/). - -- Note that Open MPI documentation uses the word "component" - frequently; the word "plugin" is probably more familiar to most - users. As such, end users can probably completely substitute the - word "plugin" wherever you see "component" in our documentation. - For what it's worth, we use the word "component" for historical - reasons, mainly because it is part of our acronyms and internal API - function calls. - -- The run-time systems that are currently supported are: - - rsh / ssh - - PBS Pro, Torque - - Platform LSF (tested with v9.1.1 and later) - - SLURM - - Cray XE, XC, and XK - - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine - -- Systems that have been tested are: - - Linux (various flavors/distros), 64 bit (x86, ppc, aarch64), - with gcc (>=4.8.x+), clang (>=3.6.0), Absoft (fortran), Intel, - and Portland (*) - - macOS (10.14-10.15), 64 bit (x86_64) with XCode compilers - - (*) Be sure to read the Compiler Notes, below. - -- Other systems have been lightly (but not fully) tested: - - Linux (various flavors/distros), 32 bit, with gcc - - Cygwin 32 & 64 bit with gcc - - ARMv6, ARMv7 - - Other 64 bit platforms. - - OpenBSD. Requires configure options --enable-mca-no-build=patcher - and --disable-dlopen with this release. - - Problems have been reported when building Open MPI on FreeBSD 11.1 - using the clang-4.0 system compiler. A workaround is to build - Open MPI using the GNU compiler. - -- Open MPI has taken some steps towards Reproducible Builds - (https://reproducible-builds.org/). Specifically, Open MPI's - "configure" and "make" process, by default, records the build date - and some system-specific information such as the hostname where Open - MPI was built and the username who built it. If you desire a - Reproducible Build, set the $SOURCE_DATE_EPOCH, $USER and $HOSTNAME - environment variables before invoking "configure" and "make", and - Open MPI will use those values instead of invoking "whoami" and/or - "hostname", respectively. See - https://reproducible-builds.org/docs/source-date-epoch/ for - information on the expected format and content of the - $SOURCE_DATE_EPOCH variable. - -Platform Notes --------------- - -- N/A - -Compiler Notes --------------- - -- Open MPI requires a C99-capable compiler to build. - -- On platforms other than x86-64, ARM, and PPC, Open MPI requires a - compiler that either supports C11 atomics or the GCC "__atomic" - atomics (e.g., GCC >= v4.7.2). - -- Mixing compilers from different vendors when building Open MPI - (e.g., using the C/C++ compiler from one vendor and the Fortran - compiler from a different vendor) has been successfully employed by - some Open MPI users (discussed on the Open MPI user's mailing list), - but such configurations are not tested and not documented. For - example, such configurations may require additional compiler / - linker flags to make Open MPI build properly. 
- - A not-uncommon case for this is when building on MacOS with the - system-default GCC compiler (i.e., /usr/bin/gcc), but a 3rd party - gfortran (e.g., provided by Homebrew, in /usr/local/bin/gfortran). - Since these compilers are provided by different organizations, they - have different default search paths. For example, if Homebrew has - also installed a local copy of Libevent (a 3rd party package that - Open MPI requires), the MacOS-default gcc linker will find it - without any additional command line flags, but the Homebrew-provided - gfortran linker will not. In this case, it may be necessary to - provide the following on the configure command line: - - $ ./configure FCFLAGS=-L/usr/local/lib ... - - This -L flag will then be passed to the Fortran linker when creating - Open MPI's Fortran libraries, and it will therefore be able to find - the installed Libevent. - -- In general, the latest versions of compilers of a given vendor's - series have the least bugs. We have seen cases where Vendor XYZ's - compiler version A.B fails to compile Open MPI, but version A.C - (where C>B) works just fine. If you run into a compile failure, you - might want to double check that you have the latest bug fixes and - patches for your compiler. - -- Users have reported issues with older versions of the Fortran PGI - compiler suite when using Open MPI's (non-default) --enable-debug - configure option. Per the above advice of using the most recent - version of a compiler series, the Open MPI team recommends using the - latest version of the PGI suite, and/or not using the --enable-debug - configure option. If it helps, here's what we have found with some - (not comprehensive) testing of various versions of the PGI compiler - suite: - - pgi-8 : NO known good version with --enable-debug - pgi-9 : 9.0-4 known GOOD - pgi-10: 10.0-0 known GOOD - pgi-11: NO known good version with --enable-debug - pgi-12: 12.10 known BAD with -m32, but known GOOD without -m32 - (and 12.8 and 12.9 both known BAD with --enable-debug) - pgi-13: 13.9 known BAD with -m32, 13.10 known GOOD without -m32 - pgi-15: 15.10 known BAD with -m32 - -- Similarly, there is a known Fortran PGI compiler issue with long - source directory path names that was resolved in 9.0-4 (9.0-3 is - known to be broken in this regard). - -- Open MPI does not support the PGI compiler suite on OS X or MacOS. - See issues below for more details: - https://github.com/open-mpi/ompi/issues/2604 - https://github.com/open-mpi/ompi/issues/2605 - -- OpenSHMEM Fortran bindings do not support the `no underscore` Fortran - symbol convention. IBM's xlf compilers build in that mode by default. - As such, IBM's xlf compilers cannot build/link the OpenSHMEM Fortran - bindings by default. A workaround is to pass FC="xlf -qextname" at - configure time to force a trailing underscore. See the issue below - for more details: - https://github.com/open-mpi/ompi/issues/3612 - -- MPI applications that use the mpi_f08 module on PowerPC platforms - (tested ppc64le) will likely experience runtime failures if: - - they are using a GNU linker (ld) version after v2.25.1 and before v2.28, - -and- - - they compiled with PGI (tested 17.5) or XL (tested v15.1.5) compilers. - This was noticed on Ubuntu 16.04 which uses the 2.26.1 version of ld by - default. However, this issue impacts any OS using a version of ld noted - above. This GNU linker regression will be fixed in version 2.28. 
- Below is a link to the GNU bug on this issue: - https://sourceware.org/bugzilla/show_bug.cgi?id=21306 - The XL compiler will include a fix for this issue in a future release. - -- On NetBSD-6 (at least AMD64 and i386), and possibly on OpenBSD, - libtool misidentifies properties of f95/g95, leading to obscure - compile-time failures if used to build Open MPI. You can work - around this issue by ensuring that libtool will not use f95/g95 - (e.g., by specifying FC=, or otherwise ensuring - a different Fortran compiler will be found earlier in the path than - f95/g95), or by disabling the Fortran MPI bindings with - --disable-mpi-fortran. - -- On OpenBSD/i386, if you configure with - --enable-mca-no-build=patcher, you will also need to add - --disable-dlopen. Otherwise, odd crashes can occur - nondeterministically. - -- Absoft 11.5.2 plus a service pack from September 2012 (which Absoft - says is available upon request), or a version later than 11.5.2 - (e.g., 11.5.3), is required to compile the Fortran mpi_f08 - module. - -- Open MPI does not support the Sparc v8 CPU target. However, - as of Solaris Studio 12.1, and later compilers, one should not - specify -xarch=v8plus or -xarch=v9. The use of the options - -m32 and -m64 for producing 32 and 64 bit targets, respectively, - are now preferred by the Solaris Studio compilers. GCC may - require either "-m32" or "-mcpu=v9 -m32", depending on GCC version. - -- If one tries to build OMPI on Ubuntu with Solaris Studio using the C++ - compiler and the -m32 option, you might see a warning: - - CC: Warning: failed to detect system linker version, falling back to - custom linker usage - - And the build will fail. One can overcome this error by either - setting LD_LIBRARY_PATH to the location of the 32 bit libraries (most - likely /lib32), or giving LDFLAGS="-L/lib32 -R/lib32" to the configure - command. Officially, Solaris Studio is not supported on Ubuntu Linux - distributions, so additional problems might be incurred. - -- Open MPI does not support the gccfss compiler (GCC For SPARC - Systems; a now-defunct compiler project from Sun). - -- At least some versions of the Intel 8.1 compiler seg fault while - compiling certain Open MPI source code files. As such, it is not - supported. - -- It has been reported that the Intel 9.1 and 10.0 compilers fail to - compile Open MPI on IA64 platforms. As of 12 Sep 2012, there is - very little (if any) testing performed on IA64 platforms (with any - compiler). Support is "best effort" for these platforms, but it is - doubtful that any effort will be expended to fix the Intel 9.1 / - 10.0 compiler issuers on this platform. - -- Early versions of the Intel 12.1 Linux compiler suite on x86_64 seem - to have a bug that prevents Open MPI from working. Symptoms - including immediate segv of the wrapper compilers (e.g., mpicc) and - MPI applications. As of 1 Feb 2012, if you upgrade to the latest - version of the Intel 12.1 Linux compiler suite, the problem will go - away. - -- The Portland Group compilers prior to version 7.0 require the - "-Msignextend" compiler flag to extend the sign bit when converting - from a shorter to longer integer. This is is different than other - compilers (such as GNU). When compiling Open MPI with the Portland - compiler suite, the following flags should be passed to Open MPI's - configure script: - - shell$ ./configure CFLAGS=-Msignextend CXXFLAGS=-Msignextend \ - --with-wrapper-cflags=-Msignextend \ - --with-wrapper-cxxflags=-Msignextend ... 
- - This will both compile Open MPI with the proper compile flags and - also automatically add "-Msignextend" when the C and C++ MPI wrapper - compilers are used to compile user MPI applications. - -- It has been reported that Pathscale 5.0.5 and 6.0.527 compilers - give an internal compiler error when trying to Open MPI. - -- As of July 2017, the Pathscale compiler suite apparently has no - further commercial support, and it does not look like there will be - further releases. Any issues discovered regarding building / - running Open MPI with the Pathscale compiler suite therefore may not - be able to be resolved. - -- Using the Absoft compiler to build the MPI Fortran bindings on Suse - 9.3 is known to fail due to a Libtool compatibility issue. - -- MPI Fortran API support has been completely overhauled since the - Open MPI v1.5/v1.6 series. - - ******************************************************************** - ******************************************************************** - *** There is now only a single Fortran MPI wrapper compiler and a - *** single Fortran OpenSHMEM wrapper compiler: mpifort and oshfort, - *** respectively. mpif77 and mpif90 still exist, but they are - *** symbolic links to mpifort. - ******************************************************************** - *** Similarly, Open MPI's configure script only recognizes the FC - *** and FCFLAGS environment variables (to specify the Fortran - *** compiler and compiler flags, respectively). The F77 and FFLAGS - *** environment variables are IGNORED. - ******************************************************************** - ******************************************************************** - - As a direct result, it is STRONGLY recommended that you specify a - Fortran compiler that uses file suffixes to determine Fortran code - layout (e.g., free form vs. fixed). For example, with some versions - of the IBM XLF compiler, it is preferable to use FC=xlf instead of - FC=xlf90, because xlf will automatically determine the difference - between free form and fixed Fortran source code. - - However, many Fortran compilers allow specifying additional - command-line arguments to indicate which Fortran dialect to use. - For example, if FC=xlf90, you may need to use "mpifort --qfixed ..." - to compile fixed format Fortran source files. - - You can use either ompi_info or oshmem_info to see with which Fortran - compiler Open MPI was configured and compiled. - - There are up to three sets of Fortran MPI bindings that may be - provided (depending on your Fortran compiler): - - - mpif.h: This is the first MPI Fortran interface that was defined - in MPI-1. It is a file that is included in Fortran source code. - Open MPI's mpif.h does not declare any MPI subroutines; they are - all implicit. - - - mpi module: The mpi module file was added in MPI-2. It provides - strong compile-time parameter type checking for MPI subroutines. - - - mpi_f08 module: The mpi_f08 module was added in MPI-3. It - provides many advantages over the mpif.h file and mpi module. For - example, MPI handles have distinct types (vs. all being integers). - See the MPI-3 document for more details. - - *** The mpi_f08 module is STRONGLY recommended for all new MPI - Fortran subroutines and applications. Note that the mpi_f08 - module can be used in conjunction with the other two Fortran - MPI bindings in the same application (only one binding can be - used per subroutine/function, however). 
Full interoperability - between mpif.h/mpi module and mpi_f08 module MPI handle types - is provided, allowing mpi_f08 to be used in new subroutines in - legacy MPI applications. - - Per the OpenSHMEM specification, there is only one Fortran OpenSHMEM - binding provided: - - - shmem.fh: All Fortran OpenSHMEM programs **should** include - 'shmem.fh', and Fortran OpenSHMEM programs that use constants - defined by OpenSHMEM **MUST** include 'shmem.fh'. - - The following notes apply to the above-listed Fortran bindings: - - - All Fortran compilers support the mpif.h/shmem.fh-based bindings, - with one exception: the MPI_SIZEOF interfaces will only be present - when Open MPI is built with a Fortran compiler that supports the - INTERFACE keyword and ISO_FORTRAN_ENV. Most notably, this - excludes the GNU Fortran compiler suite before version 4.9. - - - The level of support provided by the mpi module is based on your - Fortran compiler. - - If Open MPI is built with a non-GNU Fortran compiler, or if Open - MPI is built with the GNU Fortran compiler >= v4.9, all MPI - subroutines will be prototyped in the mpi module. All calls to - MPI subroutines will therefore have their parameter types checked - at compile time. - - If Open MPI is built with an old gfortran (i.e., < v4.9), a - limited "mpi" module will be built. Due to the limitations of - these compilers, and per guidance from the MPI-3 specification, - all MPI subroutines with "choice" buffers are specifically *not* - included in the "mpi" module, and their parameters will not be - checked at compile time. Specifically, all MPI subroutines with - no "choice" buffers are prototyped and will receive strong - parameter type checking at run-time (e.g., MPI_INIT, - MPI_COMM_RANK, etc.). - - Similar to the mpif.h interface, MPI_SIZEOF is only supported on - Fortran compilers that support INTERFACE and ISO_FORTRAN_ENV. - - - The mpi_f08 module has been tested with the Intel Fortran compiler - and gfortran >= 4.9. Other modern Fortran compilers likely also - work. - - Many older Fortran compilers do not provide enough modern Fortran - features to support the mpi_f08 module. For example, gfortran < - v4.9 does provide enough support for the mpi_f08 module. - - You can examine the output of the following command to see all - the Fortran features that are/are not enabled in your Open MPI - installation: - - shell$ ompi_info | grep -i fort - - -General Run-Time Support Notes ------------------------------- - -- The Open MPI installation must be in your PATH on all nodes (and - potentially LD_LIBRARY_PATH or DYLD_LIBRARY_PATH, if libmpi/libshmem - is a shared library), unless using the --prefix or - --enable-mpirun-prefix-by-default functionality (see below). - -- Open MPI's run-time behavior can be customized via MPI Component - Architecture (MCA) parameters (see below for more information on how - to get/set MCA parameter values). Some MCA parameters can be set in - a way that renders Open MPI inoperable (see notes about MCA - parameters later in this file). In particular, some parameters have - required options that must be included. - - - If specified, the "btl" parameter must include the "self" - component, or Open MPI will not be able to deliver messages to the - same rank as the sender. For example: "mpirun --mca btl tcp,self - ..." - - If specified, the "btl_tcp_if_exclude" parameter must include the - loopback device ("lo" on many Linux platforms), or Open MPI will - not be able to route MPI messages using the TCP BTL. 
For example: - "mpirun --mca btl_tcp_if_exclude lo,eth1 ..." - -- Running on nodes with different endian and/or different datatype - sizes within a single parallel job is supported in this release. - However, Open MPI does not resize data when datatypes differ in size - (for example, sending a 4 byte MPI_DOUBLE and receiving an 8 byte - MPI_DOUBLE will fail). - - -MPI Functionality and Features ------------------------------- - -- All MPI-3 functionality is supported. - -- Note that starting with Open MPI v4.0.0, prototypes for several - legacy MPI-1 symbols that were deleted in the MPI-3.0 specification - (which was published in 2012) are no longer available by default in - mpi.h. Specifically, several MPI-1 symbols were deprecated in the - 1996 publishing of the MPI-2.0 specification. These deprecated - symbols were eventually removed from the MPI-3.0 specification in - 2012. - - The symbols that now no longer appear by default in Open MPI's mpi.h - are: - - - MPI_Address (replaced by MPI_Get_address) - - MPI_Errhandler_create (replaced by MPI_Comm_create_errhandler) - - MPI_Errhandler_get (replaced by MPI_Comm_get_errhandler) - - MPI_Errhandler_set (replaced by MPI_Comm_set_errhandler) - - MPI_Type_extent (replaced by MPI_Type_get_extent) - - MPI_Type_hindexed (replaced by MPI_Type_create_hindexed) - - MPI_Type_hvector (replaced by MPI_Type_create_hvector) - - MPI_Type_lb (replaced by MPI_Type_get_extent) - - MPI_Type_struct (replaced by MPI_Type_create_struct) - - MPI_Type_ub (replaced by MPI_Type_get_extent) - - MPI_LB (replaced by MPI_Type_create_resized) - - MPI_UB (replaced by MPI_Type_create_resized) - - MPI_COMBINER_HINDEXED_INTEGER - - MPI_COMBINER_HVECTOR_INTEGER - - MPI_COMBINER_STRUCT_INTEGER - - MPI_Handler_function (replaced by MPI_Comm_errhandler_function) - - Although these symbols are no longer prototyped in mpi.h, they - are still present in the MPI library in Open MPI v4.0.x. This - enables legacy MPI applications to link and run successfully with - Open MPI v4.0.x, even though they will fail to compile. - - *** Future releases of Open MPI beyond the v4.0.x series may - remove these symbols altogether. - - *** The Open MPI team STRONGLY encourages all MPI application - developers to stop using these constructs that were first - deprecated over 20 years ago, and finally removed from the MPI - specification in MPI-3.0 (in 2012). - - *** The Open MPI FAQ (https://www.open-mpi.org/faq/) contains - examples of how to update legacy MPI applications using these - deleted symbols to use the "new" symbols. - - All that being said, if you are unable to immediately update your - application to stop using these legacy MPI-1 symbols, you can - re-enable them in mpi.h by configuring Open MPI with the - --enable-mpi1-compatibility flag. - -- Rank reordering support is available using the TreeMatch library. It - is activated for the graph and dist_graph communicator topologies. - -- When using MPI deprecated functions, some compilers will emit - warnings. For example: - - shell$ cat deprecated_example.c - #include - void foo(void) { - MPI_Datatype type; - MPI_Type_struct(1, NULL, NULL, NULL, &type); - } - shell$ mpicc -c deprecated_example.c - deprecated_example.c: In function 'foo': - deprecated_example.c:4: warning: 'MPI_Type_struct' is deprecated (declared at /opt/openmpi/include/mpi.h:1522) - shell$ - -- MPI_THREAD_MULTIPLE is supported with some exceptions. 
- - The following PMLs support MPI_THREAD_MULTIPLE: - - cm (see list (1) of supported MTLs, below) - - ob1 (see list (2) of supported BTLs, below) - - ucx - - (1) The cm PML and the following MTLs support MPI_THREAD_MULTIPLE: - - ofi (Libfabric) - - portals4 - - (2) The ob1 PML and the following BTLs support MPI_THREAD_MULTIPLE: - - self - - sm - - smcuda - - tcp - - ugni - - usnic - - Currently, MPI File operations are not thread safe even if MPI is - initialized for MPI_THREAD_MULTIPLE support. - -- MPI_REAL16 and MPI_COMPLEX32 are only supported on platforms where a - portable C datatype can be found that matches the Fortran type - REAL*16, both in size and bit representation. - -- The "libompitrace" library is bundled in Open MPI and is installed - by default (it can be disabled via the --disable-libompitrace - flag). This library provides a simplistic tracing of select MPI - function calls via the MPI profiling interface. Linking it in to - your application via (e.g., via -lompitrace) will automatically - output to stderr when some MPI functions are invoked: - - shell$ cd examples/ - shell$ mpicc hello_c.c -o hello_c -lompitrace - shell$ mpirun -np 1 hello_c - MPI_INIT: argc 1 - Hello, world, I am 0 of 1 - MPI_BARRIER[0]: comm MPI_COMM_WORLD - MPI_FINALIZE[0] - shell$ - - Keep in mind that the output from the trace library is going to - stderr, so it may output in a slightly different order than the - stdout from your application. - - This library is being offered as a "proof of concept" / convenience - from Open MPI. If there is interest, it is trivially easy to extend - it to printf for other MPI functions. Pull requests on github.com - would be greatly appreciated. - -OpenSHMEM Functionality and Features ------------------------------------- - -- All OpenSHMEM-1.3 functionality is supported. - - -MPI Collectives ---------------- - -- The "cuda" coll component provides CUDA-aware support for the - reduction type collectives with GPU buffers. This component is only - compiled into the library when the library has been configured with - CUDA-aware support. It intercepts calls to the reduction - collectives, copies the data to staging buffers if GPU buffers, then - calls underlying collectives to do the work. - -OpenSHMEM Collectives ---------------------- - -- The "fca" scoll component: the Mellanox Fabric Collective - Accelerator (FCA) is a solution for offloading collective operations - from the MPI process onto Mellanox QDR InfiniBand switch CPUs and - HCAs. - -- The "basic" scoll component: Reference implementation of all - OpenSHMEM collective operations. - - -Network Support ---------------- - -- There are several main MPI network models available: "ob1", "cm", - and "ucx". "ob1" uses BTL ("Byte Transfer Layer") - components for each supported network. "cm" uses MTL ("Matching - Transport Layer") components for each supported network. "ucx" uses - the OpenUCX transport. 
- - - "ob1" supports a variety of networks that can be used in - combination with each other: - - - OpenFabrics: InfiniBand, iWARP, and RoCE - - Loopback (send-to-self) - - Shared memory - - TCP - - SMCUDA - - Cisco usNIC - - uGNI (Cray Gemini, Aries) - - shared memory (XPMEM, Linux CMA, Linux KNEM, and - copy-in/copy-out shared memory) - - - "cm" supports a smaller number of networks (and they cannot be - used together), but may provide better overall MPI performance: - - - Intel Omni-Path PSM2 (version 11.2.173 or later) - - Intel True Scale PSM (QLogic InfiniPath) - - OpenFabrics Interfaces ("libfabric" tag matching) - - Portals 4 - - - UCX is the Unified Communication X (UCX) communication library - (https://www.openucx.org/). This is an open-source project - developed in collaboration between industry, laboratories, and - academia to create an open-source production grade communication - framework for data centric and high-performance applications. The - UCX library can be downloaded from repositories (e.g., - Fedora/RedHat yum repositories). The UCX library is also part of - Mellanox OFED and Mellanox HPC-X binary distributions. - - UCX currently supports: - - - OpenFabrics Verbs (including InfiniBand and RoCE) - - Cray's uGNI - - TCP - - Shared memory - - NVIDIA CUDA drivers - - While users can manually select any of the above transports at run - time, Open MPI will select a default transport as follows: - - 1. If InfiniBand devices are available, use the UCX PML. - - 2. If PSM, PSM2, or other tag-matching-supporting Libfabric - transport devices are available (e.g., Cray uGNI), use the "cm" - PML and a single appropriate corresponding "mtl" module. - - 3. Otherwise, use the ob1 PML and one or more appropriate "btl" - modules. - - Users can override Open MPI's default selection algorithms and force - the use of a specific transport if desired by setting the "pml" MCA - parameter (and potentially the "btl" and/or "mtl" MCA parameters) at - run-time: - - shell$ mpirun --mca pml ob1 --mca btl [comma-delimted-BTLs] ... - or - shell$ mpirun --mca pml cm --mca mtl [MTL] ... - or - shell$ mpirun --mca pml ucx ... - - There is a known issue when using UCX with very old Mellanox Infiniband - HCAs, in particular HCAs preceding the introduction of the ConnectX - product line, which can result in Open MPI crashing in MPI_Finalize. - This issue will be addressed by UCX release 1.9.0 and newer. - -- The main OpenSHMEM network model is "ucx"; it interfaces directly - with UCX. - -- In prior versions of Open MPI, InfiniBand and RoCE support was - provided through the openib BTL and ob1 PML plugins. Starting with - Open MPI 4.0.0, InfiniBand support through the openib plugin is both - deprecated and superseded by the ucx PML component. The openib BTL - was removed in Open MPI v5.0.0. - - While the openib BTL depended on libibverbs, the UCX PML depends on - the UCX library. - - Once installed, Open MPI can be built with UCX support by adding - --with-ucx to the Open MPI configure command. Once Open MPI is - configured to use UCX, the runtime will automatically select the UCX - PML if one of the supported networks is detected (e.g., InfiniBand). - It's possible to force using UCX in the mpirun or oshrun command - lines by specifying any or all of the following mca parameters: - "--mca pml ucx" for MPI point-to-point operations, "--mca spml ucx" - for OpenSHMEM support, and "--mca osc ucx" for MPI RMA (one-sided) - operations. 
-
-- The usnic BTL provides support for Cisco's usNIC device ("userspace
-  NIC") on Cisco UCS servers with the Virtualized Interface Card
-  (VIC).  Although the usNIC is accessed via the OpenFabrics Libfabric
-  API stack, this BTL is specific to Cisco usNIC devices.
-
-- uGNI is a Cray library for communicating over the Gemini and Aries
-  interconnects.
-
-- The OpenFabrics Enterprise Distribution (OFED) software package v1.0
-  will not work properly with Open MPI v1.2 (and later) due to how its
-  Mellanox InfiniBand plugin driver is created.  The problem is fixed
-  with OFED v1.1 (and later).
-
-- Better memory management support is available for OFED-based
-  transports using the "ummunotify" Linux kernel module.  OFED memory
-  managers are necessary for better bandwidth when re-using the same
-  buffers for large messages (e.g., benchmarks and some applications).
-
-  Unfortunately, the ummunotify module was not accepted by the Linux
-  kernel community (and is still not distributed by OFED).  But it
-  still remains the best memory management solution for MPI
-  applications that use the OFED network transports.  If Open MPI is
-  able to find the ummunotify header file, it will build support for
-  ummunotify and include it by default.  If MPI processes then find
-  the ummunotify kernel module loaded and active, then their memory
-  managers (which have been shown to be problematic in some cases)
-  will be disabled and ummunotify will be used.  Otherwise, the same
-  memory managers from prior versions of Open MPI will be used.  The
-  ummunotify Linux kernel module can be downloaded from:
-
-    https://lwn.net/Articles/343351/
-
-- The use of fork() with OpenFabrics-based networks (i.e., the UCX
-  PML) is only partially supported, and only on Linux kernels >=
-  v2.6.15 with libibverbs v1.1 or later (first released as part of
-  OFED v1.2), per restrictions imposed by the OFED network stack.
-
-- Linux "knem" support is used when the "sm" (shared memory) BTL is
-  compiled with knem support (see the --with-knem configure option)
-  and the knem Linux module is loaded in the running kernel.  If the
-  knem Linux kernel module is not loaded, the knem support is (by
-  default) silently deactivated during Open MPI jobs.
-
-  See https://knem.gforge.inria.fr/ for details on Knem.
-
-- Linux Cross-Memory Attach (CMA) or XPMEM is used by the "sm" shared
-  memory BTL when the CMA or XPMEM libraries are installed,
-  respectively.  Linux CMA and XPMEM are similar (but different)
-  mechanisms for Open MPI to utilize single-copy semantics for shared
-  memory.
-
-Open MPI Extensions
--------------------
-
-- An MPI "extensions" framework is included in Open MPI, but is not
-  enabled by default.  See the "Open MPI API Extensions" section below
-  for more information on compiling and using MPI extensions.
-
-- The following extensions are included in this version of Open MPI:
-
-  - pcollreq: Provides routines for persistent collective communication
-    operations and persistent neighborhood collective communication
-    operations, which are planned to be included in the next MPI
-    Standard after MPI-3.1 as of Nov. 2018.  The function names are
-    prefixed with MPIX_ instead of MPI_, like MPIX_Barrier_init,
-    because they are not standardized yet.  Future versions of Open MPI
-    will switch to the MPI_ prefix once the MPI Standard which includes
-    this feature is published.  See their man page for more details.
- - shortfloat: Provides MPI datatypes MPIX_C_FLOAT16, MPIX_SHORT_FLOAT, - MPIX_SHORT_FLOAT, and MPIX_CXX_SHORT_FLOAT_COMPLEX if corresponding - language types are available. See ompi/mpiext/shortfloat/README.txt - for details. - - affinity: Provides the OMPI_Affinity_str() a string indicating the - resources which a process is bound. For more details, see its man - page. - - cuda: When the library is compiled with CUDA-aware support, it - provides two things. First, a macro - MPIX_CUDA_AWARE_SUPPORT. Secondly, the function - MPIX_Query_cuda_support that can be used to query for support. - - example: A non-functional extension; its only purpose is to - provide an example for how to create other extensions. - -=========================================================================== - -Building Open MPI ------------------ - -If you have checked out a DEVELOPER'S COPY of Open MPI (i.e., you -cloned from Git), you really need to read the HACKING file before -attempting to build Open MPI. Really. - -If you have downloaded a tarball, then things are much simpler. -Open MPI uses a traditional configure script paired with "make" to -build. Typical installs can be of the pattern: - -shell$ ./configure [...options...] -shell$ make [-j N] all install - (use an integer value of N for parallel builds) - -There are many available configure options (see "./configure --help" -for a full list); a summary of the more commonly used ones is included -below. - -NOTE: if you are building Open MPI on a network filesystem, the - machine you on which you are building *must* be - time-synchronized with the file server. Specifically: Open - MPI's build system *requires* accurate filesystem timestamps. - If your "make" output includes warning about timestamps in the - future or runs GNU Automake, Autoconf, and/or Libtool, this is - *not normal*, and you may have an invalid build. Ensure that - the time on your build machine is synchronized with the time on - your file server, or build on a local filesystem. Then remove - the Open MPI source directory and start over (e.g., by - re-extracting the Open MPI tarball). - -Note that for many of Open MPI's --with- options, Open MPI will, -by default, search for header files and/or libraries for . If -the relevant files are found, Open MPI will built support for ; -if they are not found, Open MPI will skip building support for . -However, if you specify --with- on the configure command line and -Open MPI is unable to find relevant support for , configure will -assume that it was unable to provide a feature that was specifically -requested and will abort so that a human can resolve out the issue. - -Additionally, if a search directory is specified in the form ---with-=, Open MPI will: - -1. Search for 's header files in /include. -2. Search for 's library files: - 2a. If --with--libdir= was specified, search in - . - 2b. Otherwise, search in /lib, and if they are not found - there, search again in /lib64. -3. If both the relevant header files and libraries are found: - 3a. Open MPI will build support for . - 3b. If the root path where the libraries are found is neither - "/usr" nor "/usr/local", Open MPI will compile itself with - RPATH flags pointing to the directory where 's libraries - are located. Open MPI does not RPATH /usr/lib[64] and - /usr/local/lib[64] because many systems already search these - directories for run-time libraries by default; adding RPATH for - them could have unintended consequences for the search path - ordering. 
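-
-As an illustration of the above behavior, a configure invocation along
-the following lines (the installation paths shown here are
-hypothetical) both points Open MPI at a specific UCX installation and
-causes configure to abort if usable UCX support cannot be found there:
-
-  shell$ ./configure --with-ucx=/opt/ucx \
-              --with-ucx-libdir=/opt/ucx/lib ...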
- -INSTALLATION OPTIONS - ---prefix= - Install Open MPI into the base directory named . Hence, - Open MPI will place its executables in /bin, its header - files in /include, its libraries in /lib, etc. - ---disable-shared - By default, Open MPI and OpenSHMEM build shared libraries, and all - components are built as dynamic shared objects (DSOs). This switch - disables this default; it is really only useful when used with - --enable-static. Specifically, this option does *not* imply - --enable-static; enabling static libraries and disabling shared - libraries are two independent options. - ---enable-static - Build MPI and OpenSHMEM as static libraries, and statically link in - all components. Note that this option does *not* imply - --disable-shared; enabling static libraries and disabling shared - libraries are two independent options. - - Be sure to read the description of --without-memory-manager, below; - it may have some effect on --enable-static. - ---disable-wrapper-rpath - By default, the wrapper compilers (e.g., mpicc) will enable "rpath" - support in generated executables on systems that support it. That - is, they will include a file reference to the location of Open MPI's - libraries in the application executable itself. This means that - the user does not have to set LD_LIBRARY_PATH to find Open MPI's - libraries (e.g., if they are installed in a location that the - run-time linker does not search by default). - - On systems that utilize the GNU ld linker, recent enough versions - will actually utilize "runpath" functionality, not "rpath". There - is an important difference between the two: - - "rpath": the location of the Open MPI libraries is hard-coded into - the MPI/OpenSHMEM application and cannot be overridden at - run-time. - "runpath": the location of the Open MPI libraries is hard-coded into - the MPI/OpenSHMEM application, but can be overridden at run-time - by setting the LD_LIBRARY_PATH environment variable. - - For example, consider that you install Open MPI vA.B.0 and - compile/link your MPI/OpenSHMEM application against it. Later, you - install Open MPI vA.B.1 to a different installation prefix (e.g., - /opt/openmpi/A.B.1 vs. /opt/openmpi/A.B.0), and you leave the old - installation intact. - - In the rpath case, your MPI application will always use the - libraries from your A.B.0 installation. In the runpath case, you - can set the LD_LIBRARY_PATH environment variable to point to the - A.B.1 installation, and then your MPI application will use those - libraries. - - Note that in both cases, however, if you remove the original A.B.0 - installation and set LD_LIBRARY_PATH to point to the A.B.1 - installation, your application will use the A.B.1 libraries. - - This rpath/runpath behavior can be disabled via - --disable-wrapper-rpath. - - If you would like to keep the rpath option, but not enable runpath - a different configure option is avalabile - --disable-wrapper-runpath. - ---enable-dlopen - Build all of Open MPI's components as standalone Dynamic Shared - Objects (DSO's) that are loaded at run-time (this is the default). - The opposite of this option, --disable-dlopen, causes two things: - - 1. All of Open MPI's components will be built as part of Open MPI's - normal libraries (e.g., libmpi). - 2. Open MPI will not attempt to open any DSO's at run-time. - - Note that this option does *not* imply that OMPI's libraries will be - built as static objects (e.g., libmpi.a). 
It only specifies the - location of OMPI's components: standalone DSOs or folded into the - Open MPI libraries. You can control whether Open MPI's libraries - are build as static or dynamic via --enable|disable-static and - --enable|disable-shared. - ---disable-show-load-errors-by-default - Set the default value of the mca_base_component_show_load_errors MCA - variable: the --enable form of this option sets the MCA variable to - true, the --disable form sets the MCA variable to false. The MCA - mca_base_component_show_load_errors variable can still be overridden - at run time via the usual MCA-variable-setting mechanisms; this - configure option simply sets the default value. - - The --disable form of this option is intended for Open MPI packagers - who tend to enable support for many different types of networks and - systems in their packages. For example, consider a packager who - includes support for both the FOO and BAR networks in their Open MPI - package, both of which require support libraries (libFOO.so and - libBAR.so). If an end user only has BAR hardware, they likely only - have libBAR.so available on their systems -- not libFOO.so. - Disabling load errors by default will prevent the user from seeing - potentially confusing warnings about the FOO components failing to - load because libFOO.so is not available on their systems. - - Conversely, system administrators tend to build an Open MPI that is - targeted at their specific environment, and contains few (if any) - components that are not needed. In such cases, they might want - their users to be warned that the FOO network components failed to - load (e.g., if libFOO.so was mistakenly unavailable), because Open - MPI may otherwise silently failover to a slower network path for MPI - traffic. - ---with-platform=FILE - Load configure options for the build from FILE. Options on the - command line that are not in FILE are also used. Options on the - command line and in FILE are replaced by what is in FILE. - ---with-libmpi-name=STRING - Replace libmpi.* and libmpi_FOO.* (where FOO is one of the fortran - supporting libraries installed in lib) with libSTRING.* and - libSTRING_FOO.*. This is provided as a convenience mechanism for - third-party packagers of Open MPI that might want to rename these - libraries for their own purposes. This option is *not* intended for - typical users of Open MPI. - ---enable-mca-no-build=LIST - Comma-separated list of - pairs that will not be - built. For example, "--enable-mca-no-build=btl-portals,oob-ud" will - disable building the portals BTL and the ud OOB component. - -NETWORKING SUPPORT / OPTIONS - ---with-fca= - Specify the directory where the Mellanox FCA library and - header files are located. - - FCA is the support library for Mellanox switches and HCAs. - ---with-hcoll= - Specify the directory where the Mellanox hcoll library and header - files are located. This option is generally only necessary if the - hcoll headers and libraries are not in default compiler/linker - search paths. - - hcoll is the support library for MPI collective operation offload on - Mellanox ConnectX-3 HCAs (and later). - ---with-knem= - Specify the directory where the knem libraries and header files are - located. This option is generally only necessary if the knem headers - and libraries are not in default compiler/linker search paths. 
- - knem is a Linux kernel module that allows direct process-to-process - memory copies (optionally using hardware offload), potentially - increasing bandwidth for large messages sent between messages on the - same server. See https://knem.gforge.inria.fr/ for - details. - ---with-libfabric= - Specify the directory where the OpenFabrics Interfaces libfabric - library and header files are located. This option is generally only - necessary if the libfabric headers and libraries are not in default - compiler/linker search paths. - - Libfabric is the support library for OpenFabrics Interfaces-based - network adapters, such as Cisco usNIC, Intel True Scale PSM, Cray - uGNI, etc. - ---with-libfabric-libdir= - Look in directory for the libfabric libraries. By default, Open MPI - will look in /lib and /lib64, which covers most cases. This option is only - needed for special configurations. - ---with-portals4= - Specify the directory where the Portals4 libraries and header files - are located. This option is generally only necessary if the Portals4 - headers and libraries are not in default compiler/linker search - paths. - - Portals is a low-level network API for high-performance networking - on high-performance computing systems developed by Sandia National - Laboratories, Intel Corporation, and the University of New Mexico. - The Portals 4 Reference Implementation is a complete implementation - of Portals 4, with transport over InfiniBand verbs and UDP. - ---with-portals4-libdir= - Location of libraries to link with for Portals4 support. - ---with-portals4-max-md-size=SIZE ---with-portals4-max-va-size=SIZE - Set configuration values for Portals 4 - ---with-psm= - Specify the directory where the QLogic InfiniPath / Intel True Scale - PSM library and header files are located. This option is generally - only necessary if the PSM headers and libraries are not in default - compiler/linker search paths. - - PSM is the support library for QLogic InfiniPath and Intel TrueScale - network adapters. - ---with-psm-libdir= - Look in directory for the PSM libraries. By default, Open MPI will - look in /lib and /lib64, which covers - most cases. This option is only needed for special configurations. - ---with-psm2= - Specify the directory where the Intel Omni-Path PSM2 library and - header files are located. This option is generally only necessary - if the PSM2 headers and libraries are not in default compiler/linker - search paths. - - PSM is the support library for Intel Omni-Path network adapters. - ---with-psm2-libdir= - Look in directory for the PSM2 libraries. By default, Open MPI will - look in /lib and /lib64, which - covers most cases. This option is only needed for special - configurations. - ---with-ucx= - Specify the directory where the UCX libraries and header files are - located. This option is generally only necessary if the UCX headers - and libraries are not in default compiler/linker search paths. - ---with-ucx-libdir= - Look in directory for the UCX libraries. By default, Open MPI will - look in /lib and /lib64, which covers - most cases. This option is only needed for special configurations. - ---with-usnic - Abort configure if Cisco usNIC support cannot be built. - - -RUN-TIME SYSTEM SUPPORT - ---enable-mpirun-prefix-by-default - This option forces the "mpirun" command to always behave as if - "--prefix $prefix" was present on the command line (where $prefix is - the value given to the --prefix option to configure). 
This prevents - most rsh/ssh-based users from needing to modify their shell startup - files to set the PATH and/or LD_LIBRARY_PATH for Open MPI on remote - nodes. Note, however, that such users may still desire to set PATH - -- perhaps even in their shell startup files -- so that executables - such as mpicc and mpirun can be found without needing to type long - path names. --enable-orterun-prefix-by-default is a synonym for - this option. - ---enable-orte-static-ports - Enable orte static ports for tcp oob (default: enabled). - ---with-alps - Force the building of for the Cray Alps run-time environment. If - Alps support cannot be found, configure will abort. - ---with-lsf= - Specify the directory where the LSF libraries and header files are - located. This option is generally only necessary if the LSF headers - and libraries are not in default compiler/linker search paths. - - LSF is a resource manager system, frequently used as a batch - scheduler in HPC systems. - ---with-lsf-libdir= - Look in directory for the LSF libraries. By default, Open MPI will - look in /lib and /lib64, which covers - most cases. This option is only needed for special configurations. - ---with-pmi - Build PMI support (by default on non-Cray XE/XC systems, it is not - built). On Cray XE/XC systems, the location of pmi is detected - automatically as part of the configure process. For non-Cray - systems, if the pmi2.h header is found in addition to pmi.h, then - support for PMI2 will be built. - ---with-slurm - Force the building of SLURM scheduler support. - ---with-sge - Specify to build support for the Oracle Grid Engine (OGE) resource - manager and/or the Open Grid Engine. OGE support is disabled by - default; this option must be specified to build OMPI's OGE support. - - The Oracle Grid Engine (OGE) and open Grid Engine packages are - resource manager systems, frequently used as a batch scheduler in - HPC systems. - ---with-tm= - Specify the directory where the TM libraries and header files are - located. This option is generally only necessary if the TM headers - and libraries are not in default compiler/linker search paths. - - TM is the support library for the Torque and PBS Pro resource - manager systems, both of which are frequently used as a batch - scheduler in HPC systems. - -MISCELLANEOUS SUPPORT LIBRARIES - ---with-libevent(=value) - This option specifies where to find the libevent support headers and - library. The following VALUEs are permitted: - - internal: Use Open MPI's internal copy of libevent. - external: Use an external libevent installation (rely on default - compiler and linker paths to find it) - : Same as "internal". - : Specify the location of a specific libevent - installation to use - - By default (or if --with-libevent is specified with no VALUE), Open - MPI will build and use the copy of libevent that it has in its - source tree. However, if the VALUE is "external", Open MPI will - look for the relevant libevent header file and library in default - compiler / linker locations. Or, VALUE can be a directory tree - where the libevent header file and library can be found. This - option allows operating systems to include Open MPI and use their - default libevent installation instead of Open MPI's bundled libevent. - - libevent is a support library that provides event-based processing, - timers, and signal handlers. Open MPI requires libevent to build; - passing --without-libevent will cause configure to abort. - ---with-libevent-libdir= - Look in directory for the libevent libraries. 
This option is only - usable when building Open MPI against an external libevent - installation. Just like other --with-FOO-libdir configure options, - this option is only needed for special configurations. - ---with-hwloc(=value) - hwloc is a support library that provides processor and memory - affinity information for NUMA platforms. It is required by Open - MPI. Therefore, specifying --with-hwloc=no (or --without-hwloc) is - disallowed. - - By default (i.e., if --with-hwloc is not specified, or if - --with-hwloc is specified without a value), Open MPI will first try - to find/use an hwloc installation on the current system. If Open - MPI cannot find one, it will fall back to build and use the internal - copy of hwloc included in the Open MPI source tree. - - Alternatively, the --with-hwloc option can be used to specify where - to find the hwloc support headers and library. The following values - are permitted: - - internal: Only use Open MPI's internal copy of hwloc. - external: Only use an external hwloc installation (rely on - default compiler and linker paths to find it). - : Only use the specific hwloc installation found in - the specified directory. - ---with-hwloc-libdir= - Look in directory for the hwloc libraries. This option is only - usable when building Open MPI against an external hwloc - installation. Just like other --with-FOO-libdir configure options, - this option is only needed for special configurations. - ---disable-hwloc-pci - Disable building hwloc's PCI device-sensing capabilities. On some - platforms (e.g., SusE 10 SP1, x86-64), the libpci support library is - broken. Open MPI's configure script should usually detect when - libpci is not usable due to such brokenness and turn off PCI - support, but there may be cases when configure mistakenly enables - PCI support in the presence of a broken libpci. These cases may - result in "make" failing with warnings about relocation symbols in - libpci. The --disable-hwloc-pci switch can be used to force Open - MPI to not build hwloc's PCI device-sensing capabilities in these - cases. - - Similarly, if Open MPI incorrectly decides that libpci is broken, - you can force Open MPI to build hwloc's PCI device-sensing - capabilities by using --enable-hwloc-pci. - - hwloc can discover PCI devices and locality, which can be useful for - Open MPI in assigning message passing resources to MPI processes. - ---with-libltdl= - Specify the directory where the GNU Libtool libltdl libraries and - header files are located. This option is generally only necessary - if the libltdl headers and libraries are not in default - compiler/linker search paths. - - Note that this option is ignored if --disable-dlopen is specified. - ---disable-libompitrace - Disable building the simple "libompitrace" library (see note above - about libompitrace) - ---with-valgrind(=) - Directory where the valgrind software is installed. If Open MPI - finds Valgrind's header files, it will include additional support - for Valgrind's memory-checking debugger. - - Specifically, it will eliminate a lot of false positives from - running Valgrind on MPI applications. There is a minor performance - penalty for enabling this option. - -MPI FUNCTIONALITY - ---with-mpi-param-check(=value) - Whether or not to check MPI function parameters for errors at - runtime. 
The following values are permitted: - - always: MPI function parameters are always checked for errors - never: MPI function parameters are never checked for errors - runtime: Whether MPI function parameters are checked depends on - the value of the MCA parameter mpi_param_check (default: - yes). - yes: Synonym for "always" (same as --with-mpi-param-check). - no: Synonym for "never" (same as --without-mpi-param-check). - - If --with-mpi-param is not specified, "runtime" is the default. - ---disable-mpi-thread-multiple - Disable the MPI thread level MPI_THREAD_MULTIPLE (it is enabled by - default). - ---enable-mpi-java - Enable building of an EXPERIMENTAL Java MPI interface (disabled by - default). You may also need to specify --with-jdk-dir, - --with-jdk-bindir, and/or --with-jdk-headers. See README.JAVA.txt - for details. - - Note that this Java interface is INCOMPLETE (meaning: it does not - support all MPI functionality) and LIKELY TO CHANGE. The Open MPI - developers would very much like to hear your feedback about this - interface. See README.JAVA.txt for more details. - ---enable-mpi-fortran(=value) - By default, Open MPI will attempt to build all 3 Fortran bindings: - mpif.h, the "mpi" module, and the "mpi_f08" module. The following - values are permitted: - - all: Synonym for "yes". - yes: Attempt to build all 3 Fortran bindings; skip - any binding that cannot be built (same as - --enable-mpi-fortran). - mpifh: Build mpif.h support. - usempi: Build mpif.h and "mpi" module support. - usempif08: Build mpif.h, "mpi" module, and "mpi_f08" - module support. - none: Synonym for "no". - no: Do not build any MPI Fortran support (same as - --disable-mpi-fortran). This is mutually exclusive - with building the OpenSHMEM Fortran interface. - ---enable-mpi-ext(=) - Enable Open MPI's non-portable API extensions. If no is - specified, all of the extensions are enabled. - - See "Open MPI API Extensions", below, for more details. - ---disable-mpi-io - Disable built-in support for MPI-2 I/O, likely because an - externally-provided MPI I/O package will be used. Default is to use - the internal framework system that uses the ompio component and a - specially modified version of ROMIO that fits inside the romio - component - ---disable-io-romio - Disable the ROMIO MPI-IO component - ---with-io-romio-flags=flags - Pass flags to the ROMIO distribution configuration script. This - option is usually only necessary to pass - parallel-filesystem-specific preprocessor/compiler/linker flags back - to the ROMIO system. - ---disable-io-ompio - Disable the ompio MPI-IO component - ---enable-sparse-groups - Enable the usage of sparse groups. This would save memory - significantly especially if you are creating large - communicators. (Disabled by default) - -OPENSHMEM FUNCTIONALITY - ---disable-oshmem - Disable building the OpenSHMEM implementation (by default, it is - enabled). - ---disable-oshmem-fortran - Disable building only the Fortran OpenSHMEM bindings. Please see - the "Compiler Notes" section herein which contains further - details on known issues with various Fortran compilers. - -MISCELLANEOUS FUNCTIONALITY - ---without-memory-manager - Disable building Open MPI's memory manager. Open MPI's memory - manager is usually built on Linux based platforms, and is generally - only used for optimizations with some OpenFabrics-based networks (it - is not *necessary* for OpenFabrics networks, but some performance - loss may be observed without it). 
- - However, it may be necessary to disable the memory manager in order - to build Open MPI statically. - ---with-ft=TYPE - Specify the type of fault tolerance to enable. Options: LAM - (LAM/MPI-like), cr (Checkpoint/Restart). Fault tolerance support is - disabled unless this option is specified. - ---enable-peruse - Enable the PERUSE MPI data analysis interface. - ---enable-heterogeneous - Enable support for running on heterogeneous clusters (e.g., machines - with different endian representations). Heterogeneous support is - disabled by default because it imposes a minor performance penalty. - - *** THIS FUNCTIONALITY IS CURRENTLY BROKEN - DO NOT USE *** - ---with-wrapper-cflags= ---with-wrapper-cxxflags= ---with-wrapper-fflags= ---with-wrapper-fcflags= ---with-wrapper-ldflags= ---with-wrapper-libs= - Add the specified flags to the default flags that are used in Open - MPI's "wrapper" compilers (e.g., mpicc -- see below for more - information about Open MPI's wrapper compilers). By default, Open - MPI's wrapper compilers use the same compilers used to build Open - MPI and specify a minimum set of additional flags that are necessary - to compile/link MPI applications. These configure options give - system administrators the ability to embed additional flags in - OMPI's wrapper compilers (which is a local policy decision). The - meanings of the different flags are: - - : Flags passed by the mpicc wrapper to the C compiler - : Flags passed by the mpic++ wrapper to the C++ compiler - : Flags passed by the mpifort wrapper to the Fortran compiler - : Flags passed by all the wrappers to the linker - : Flags passed by all the wrappers to the linker - - There are other ways to configure Open MPI's wrapper compiler - behavior; see the Open MPI FAQ for more information. - -There are many other options available -- see "./configure --help". - -Changing the compilers that Open MPI uses to build itself uses the -standard Autoconf mechanism of setting special environment variables -either before invoking configure or on the configure command line. -The following environment variables are recognized by configure: - -CC - C compiler to use -CFLAGS - Compile flags to pass to the C compiler -CPPFLAGS - Preprocessor flags to pass to the C compiler - -CXX - C++ compiler to use -CXXFLAGS - Compile flags to pass to the C++ compiler -CXXCPPFLAGS - Preprocessor flags to pass to the C++ compiler - -FC - Fortran compiler to use -FCFLAGS - Compile flags to pass to the Fortran compiler - -LDFLAGS - Linker flags to pass to all compilers -LIBS - Libraries to pass to all compilers (it is rarely - necessary for users to need to specify additional LIBS) - -PKG_CONFIG - Path to the pkg-config utility - -For example: - - shell$ ./configure CC=mycc CXX=myc++ FC=myfortran ... - -*** NOTE: We generally suggest using the above command line form for - setting different compilers (vs. setting environment variables and - then invoking "./configure"). The above form will save all - variables and values in the config.log file, which makes - post-mortem analysis easier if problems occur. - -Note that if you intend to compile Open MPI with a "make" other than -the default one in your PATH, then you must either set the $MAKE -environment variable before invoking Open MPI's configure script, or -pass "MAKE=your_make_prog" to configure. For example: - - shell$ ./configure MAKE=/path/to/my/make ... - -This could be the case, for instance, if you have a shell alias for -"make", or you always type "gmake" out of habit. 
Failure to tell -configure which non-default "make" you will use to compile Open MPI -can result in undefined behavior (meaning: don't do that). - -Note that you may also want to ensure that the value of -LD_LIBRARY_PATH is set appropriately (or not at all) for your build -(or whatever environment variable is relevant for your operating -system). For example, some users have been tripped up by setting to -use a non-default Fortran compiler via FC, but then failing to set -LD_LIBRARY_PATH to include the directory containing that non-default -Fortran compiler's support libraries. This causes Open MPI's -configure script to fail when it tries to compile / link / run simple -Fortran programs. - -It is required that the compilers specified be compile and link -compatible, meaning that object files created by one compiler must be -able to be linked with object files from the other compilers and -produce correctly functioning executables. - -Open MPI supports all the "make" targets that are provided by GNU -Automake, such as: - -all - build the entire Open MPI package -install - install Open MPI -uninstall - remove all traces of Open MPI from the $prefix -clean - clean out the build tree - -Once Open MPI has been built and installed, it is safe to run "make -clean" and/or remove the entire build tree. - -VPATH and parallel builds are fully supported. - -Generally speaking, the only thing that users need to do to use Open -MPI is ensure that /bin is in their PATH and /lib is -in their LD_LIBRARY_PATH. Users may need to ensure to set the PATH -and LD_LIBRARY_PATH in their shell setup files (e.g., .bashrc, .cshrc) -so that non-interactive rsh/ssh-based logins will be able to find the -Open MPI executables. - -=========================================================================== - -Open MPI Version Numbers and Binary Compatibility -------------------------------------------------- - -Open MPI has two sets of version numbers that are likely of interest -to end users / system administrator: - - * Software version number - * Shared library version numbers - -Both are predicated on Open MPI's definition of "backwards -compatibility." - -NOTE: The version numbering conventions were changed with the release - of v1.10.0. Most notably, Open MPI no longer uses an "odd/even" - release schedule to indicate feature development vs. stable - releases. See the README in releases prior to v1.10.0 for more - information (e.g., - https://github.com/open-mpi/ompi/blob/v1.8/README#L1392-L1475). - -Backwards Compatibility ------------------------ - -Open MPI version Y is backwards compatible with Open MPI version X -(where Y>X) if users can: - - * Compile an MPI/OpenSHMEM application with version X, mpirun/oshrun - it with version Y, and get the same user-observable behavior. - * Invoke ompi_info with the same CLI options in versions X and Y and - get the same user-observable behavior. - -Note that this definition encompasses several things: - - * Application Binary Interface (ABI) - * MPI / OpenSHMEM run time system - * mpirun / oshrun command line options - * MCA parameter names / values / meanings - -However, this definition only applies when the same version of Open -MPI is used with all instances of the runtime and MPI / OpenSHMEM -processes in a single MPI job. If the versions are not exactly the -same everywhere, Open MPI is not guaranteed to work properly in any -scenario. 
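-
-One simple sanity check (the host names below are placeholders) is to
-launch ompi_info on every node involved in a job and confirm that all
-nodes report the same version (if your ompi_info does not accept the
-"--version" option, running it with no arguments also prints the
-version near the top of its output):
-
-  shell$ mpirun --host node1,node2 -np 2 ompi_info --version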
- -Backwards compatibility tends to work best when user applications are -dynamically linked to one version of the Open MPI / OSHMEM libraries, -and can be updated at run time to link to a new version of the Open -MPI / OSHMEM libraries. - -For example, if an MPI / OSHMEM application links statically against -the libraries from Open MPI vX, then attempting to launch that -application with mpirun / oshrun from Open MPI vY is not guaranteed to -work (because it is mixing vX and vY of Open MPI in a single job). - -Similarly, if using a container technology that internally bundles all -the libraries from Open MPI vX, attempting to launch that container -with mpirun / oshrun from Open MPI vY is not guaranteed to work. - -Software Version Number ------------------------ - -Official Open MPI releases use the common "A.B.C" version identifier -format. Each of the three numbers has a specific meaning: - - * Major: The major number is the first integer in the version string - Changes in the major number typically indicate a significant - change in the code base and/or end-user functionality, and also - indicate a break from backwards compatibility. Specifically: Open - MPI releases with different major version numbers are not - backwards compatibale with each other. - - CAVEAT: This rule does not extend to versions prior to v1.10.0. - Specifically: v1.10.x is not guaranteed to be backwards - compatible with other v1.x releases. - - * Minor: The minor number is the second integer in the version - string. Changes in the minor number indicate a user-observable - change in the code base and/or end-user functionality. Backwards - compatibility will still be preserved with prior releases that - have the same major version number (e.g., v2.5.3 is backwards - compatible with v2.3.1). - - * Release: The release number is the third integer in the version - string. Changes in the release number typically indicate a bug - fix in the code base and/or end-user functionality. For example, - if there is a release that only contains bug fixes and no other - user-observable changes or new features, only the third integer - will be increased (e.g., from v4.3.0 to v4.3.1). - -The "A.B.C" version number may optionally be followed by a Quantifier: - - * Quantifier: Open MPI version numbers sometimes have an arbitrary - string affixed to the end of the version number. Common strings - include: - - o aX: Indicates an alpha release. X is an integer indicating the - number of the alpha release (e.g., v1.10.3a5 indicates the 5th - alpha release of version 1.10.3). - o bX: Indicates a beta release. X is an integer indicating the - number of the beta release (e.g., v1.10.3b3 indicates the 3rd - beta release of version 1.10.3). - o rcX: Indicates a release candidate. X is an integer indicating - the number of the release candidate (e.g., v1.10.3rc4 indicates - the 4th release candidate of version 1.10.3). - -Nightly development snapshot tarballs use a different version number -scheme; they contain three distinct values: - - * The git branch name from which the tarball was created. - * The date/timestamp, in YYYYMMDDHHMM format. - * The hash of the git commit from which the tarball was created. - -For example, a snapshot tarball filename of -"openmpi-v2.x-201703070235-e4798fb.tar.gz" indicates that this tarball -was created from the v2.x branch, on March 7, 2017, at 2:35am GMT, -from git hash e4798fb. 
- -Shared Library Version Number ------------------------------ - -The GNU Libtool official documentation details how the versioning -scheme works. The quick version is that the shared library versions -are a triple of integers: (current,revision,age), or "c:r:a". This -triple is not related to the Open MPI software version number. There -are six simple rules for updating the values (taken almost verbatim -from the Libtool docs): - - 1. Start with version information of "0:0:0" for each shared library. - - 2. Update the version information only immediately before a public - release of your software. More frequent updates are unnecessary, - and only guarantee that the current interface number gets larger - faster. - - 3. If the library source code has changed at all since the last - update, then increment revision ("c:r:a" becomes "c:r+1:a"). - - 4. If any interfaces have been added, removed, or changed since the - last update, increment current, and set revision to 0. - - 5. If any interfaces have been added since the last public release, - then increment age. - - 6. If any interfaces have been removed since the last public release, - then set age to 0. - -Here's how we apply those rules specifically to Open MPI: - - 1. The above rules do not apply to MCA components (a.k.a. "plugins"); - MCA component .so versions stay unspecified. - - 2. The above rules apply exactly as written to the following - libraries starting with Open MPI version v1.5 (prior to v1.5, - libopen-pal and libopen-rte were still at 0:0:0 for reasons - discussed in bug ticket #2092 - https://svn.open-mpi.org/trac/ompi/ticket/2092): - - * libopen-rte - * libopen-pal - * libmca_common_* - - 3. The following libraries use a slightly modified version of the - above rules: rules 4, 5, and 6 only apply to the official MPI and - OpenSHMEM interfaces (functions, global variables). The rationale - for this decision is that the vast majority of our users only care - about the official/public MPI/OpenSHMEM interfaces; we therefore - want the .so version number to reflect only changes to the - official MPI/OpenSHMEM APIs. Put simply: non-MPI/OpenSHMEM API / - internal changes to the MPI-application-facing libraries are - irrelevant to pure MPI/OpenSHMEM applications. - - * libmpi - * libmpi_mpifh - * libmpi_usempi_tkr - * libmpi_usempi_ignore_tkr - * libmpi_usempif08 - * libmpi_cxx - * libmpi_java - * liboshmem - -=========================================================================== - -Checking Your Open MPI Installation ------------------------------------ - -The "ompi_info" command can be used to check the status of your Open -MPI installation (located in /bin/ompi_info). Running it with -no arguments provides a summary of information about your Open MPI -installation. - -Note that the ompi_info command is extremely helpful in determining -which components are installed as well as listing all the run-time -settable parameters that are available in each component (as well as -their default values). - -The following options may be helpful: - ---all Show a *lot* of information about your Open MPI - installation. ---parsable Display all the information in an easily - grep/cut/awk/sed-able format. ---param - A of "all" and a of "all" will - show all parameters to all components. Otherwise, the - parameters of all the components in a specific framework, - or just the parameters of a specific component can be - displayed by using an appropriate and/or - name. 
---level - By default, ompi_info only shows "Level 1" MCA parameters - -- parameters that can affect whether MPI processes can - run successfully or not (e.g., determining which network - interfaces to use). The --level option will display all - MCA parameters from level 1 to (the max - value is 9). Use "ompi_info --param - --level 9" to see *all* MCA parameters for a - given component. See "The Modular Component Architecture - (MCA)" section, below, for a fuller explanation. - -Changing the values of these parameters is explained in the "The -Modular Component Architecture (MCA)" section, below. - -When verifying a new Open MPI installation, we recommend running six -tests: - -1. Use "mpirun" to launch a non-MPI program (e.g., hostname or uptime) - across multiple nodes. - -2. Use "mpirun" to launch a trivial MPI program that does no MPI - communication (e.g., the hello_c program in the examples/ directory - in the Open MPI distribution). - -3. Use "mpirun" to launch a trivial MPI program that sends and - receives a few MPI messages (e.g., the ring_c program in the - examples/ directory in the Open MPI distribution). - -4. Use "oshrun" to launch a non-OpenSHMEM program across multiple - nodes. - -5. Use "oshrun" to launch a trivial MPI program that does no OpenSHMEM - communication (e.g., hello_shmem.c program in the examples/ - directory in the Open MPI distribution.) - -6. Use "oshrun" to launch a trivial OpenSHMEM program that puts and - gets a few messages. (e.g., the ring_shmem.c in the examples/ - directory in the Open MPI distribution.) - -If you can run all six of these tests successfully, that is a good -indication that Open MPI built and installed properly. - -=========================================================================== - -Open MPI API Extensions ------------------------ - -Open MPI contains a framework for extending the MPI API that is -available to applications. Each extension is usually a standalone set -of functionality that is distinct from other extensions (similar to -how Open MPI's plugins are usually unrelated to each other). These -extensions provide new functions and/or constants that are available -to MPI applications. - -WARNING: These extensions are neither standard nor portable to other -MPI implementations! - -Compiling the extensions ------------------------- - -Open MPI extensions are all enabled by default; they can be disabled -via the --disable-mpi-ext command line switch. - -Since extensions are meant to be used by advanced users only, this -file does not document which extensions are available or what they -do. Look in the ompi/mpiext/ directory to see the extensions; each -subdirectory of that directory contains an extension. Each has a -README file that describes what it does. - -Using the extensions --------------------- - -To reinforce the fact that these extensions are non-standard, you must -include a separate header file after to obtain the function -prototypes, constant declarations, etc. 
For example:
-
------
-#include <mpi.h>
-#if defined(OPEN_MPI) && OPEN_MPI
-#include <mpi-ext.h>
-#endif
-
-int main() {
-    MPI_Init(NULL, NULL);
-
-#if defined(OPEN_MPI) && OPEN_MPI
-    {
-        char ompi_bound[OMPI_AFFINITY_STRING_MAX];
-        char current_binding[OMPI_AFFINITY_STRING_MAX];
-        char exists[OMPI_AFFINITY_STRING_MAX];
-        /* Fill the three buffers with human-readable binding information */
-        OMPI_Affinity_str(OMPI_AFFINITY_LAYOUT_FMT, ompi_bound,
-                          current_binding, exists);
-    }
-#endif
-    MPI_Finalize();
-    return 0;
-}
------
-
-Notice that the Open MPI-specific code is surrounded by the #if
-statement to ensure that it is only ever compiled by Open MPI.
-
-The Open MPI wrapper compilers (mpicc and friends) should
-automatically insert all relevant compiler and linker flags necessary
-to use the extensions.  No special flags or steps should be necessary
-compared to "normal" MPI applications.
-
-===========================================================================
-
-Compiling Open MPI Applications
--------------------------------
-
-Open MPI provides "wrapper" compilers that should be used for
-compiling MPI and OpenSHMEM applications:
-
-C:       mpicc, oshcc
-C++:     mpiCC, oshCC (or mpic++ if your filesystem is case-insensitive)
-Fortran: mpifort, oshfort
-
-For example:
-
-  shell$ mpicc hello_world_mpi.c -o hello_world_mpi -g
-  shell$
-
-For OpenSHMEM applications:
-
-  shell$ oshcc hello_shmem.c -o hello_shmem -g
-  shell$
-
-All the wrapper compilers do is add a variety of compiler and linker
-flags to the command line and then invoke a back-end compiler.  To be
-specific: the wrapper compilers do not parse source code at all; they
-are solely command-line manipulators, and have nothing to do with the
-actual compilation or linking of programs.  The end result is an MPI
-executable that is properly linked to all the relevant libraries.
-
-Customizing the behavior of the wrapper compilers is possible (e.g.,
-changing the compiler [not recommended] or specifying additional
-compiler/linker flags); see the Open MPI FAQ for more information.
-
-Alternatively, Open MPI also installs pkg-config(1) configuration
-files under $libdir/pkgconfig.  If pkg-config is configured to find
-these files, then compiling / linking Open MPI programs can be
-performed like this:
-
-  shell$ gcc hello_world_mpi.c -o hello_world_mpi -g \
-              `pkg-config ompi-c --cflags --libs`
-  shell$
-
-Open MPI supplies multiple pkg-config(1) configuration files; one for
-each different wrapper compiler (language):
-
-------------------------------------------------------------------------
-ompi       Synonym for "ompi-c"; Open MPI applications using the C
-           MPI bindings
-ompi-c     Open MPI applications using the C MPI bindings
-ompi-cxx   Open MPI applications using the C MPI bindings
-ompi-fort  Open MPI applications using the Fortran MPI bindings
-------------------------------------------------------------------------
-
-The following pkg-config(1) configuration files *may* be installed,
-depending on which command line options were specified to Open MPI's
-configure script.  They are not necessary for MPI applications, but
-may be used by applications that use Open MPI's lower layer support
-libraries.
-
-orte:    Open MPI Run-Time Environment applications
-opal:    Open Portable Access Layer applications
-
-===========================================================================
-
-Running Open MPI Applications
------------------------------
-
-Open MPI supports both mpirun and mpiexec (they are exactly
-equivalent) to launch MPI applications.
For example: - - shell$ mpirun -np 2 hello_world_mpi - or - shell$ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi - -are equivalent. - -The rsh launcher (which defaults to using ssh) accepts a --hostfile -parameter (the option "--machinefile" is equivalent); you can specify a ---hostfile parameter indicating a standard mpirun-style hostfile (one -hostname per line): - - shell$ mpirun --hostfile my_hostfile -np 2 hello_world_mpi - -If you intend to run more than one process on a node, the hostfile can -use the "slots" attribute. If "slots" is not specified, a count of 1 -is assumed. For example, using the following hostfile: - ---------------------------------------------------------------------------- -node1.example.com -node2.example.com -node3.example.com slots=2 -node4.example.com slots=4 ---------------------------------------------------------------------------- - - shell$ mpirun --hostfile my_hostfile -np 8 hello_world_mpi - -will launch MPI_COMM_WORLD rank 0 on node1, rank 1 on node2, ranks 2 -and 3 on node3, and ranks 4 through 7 on node4. - -Other starters, such as the resource manager / batch scheduling -environments, do not require hostfiles (and will ignore the hostfile -if it is supplied). They will also launch as many processes as slots -have been allocated by the scheduler if no "-np" argument has been -provided. For example, running a SLURM job with 8 processors: - - shell$ salloc -n 8 mpirun a.out - -The above command will reserve 8 processors and run 1 copy of mpirun, -which will, in turn, launch 8 copies of a.out in a single -MPI_COMM_WORLD on the processors that were allocated by SLURM. - -Note that the values of component parameters can be changed on the -mpirun / mpiexec command line. This is explained in the section -below, "The Modular Component Architecture (MCA)". - -Open MPI supports oshrun to launch OpenSHMEM applications. For -example: - - shell$ oshrun -np 2 hello_world_oshmem - -OpenSHMEM applications may also be launched directly by resource -managers such as SLURM. For example, when OMPI is configured ---with-pmi and --with-slurm, one may launch OpenSHMEM applications via -srun: - - shell$ srun -N 2 hello_world_oshmem - -=========================================================================== - -The Modular Component Architecture (MCA) - -The MCA is the backbone of Open MPI -- most services and functionality -are implemented through MCA components. 
Here is a list of all the -component frameworks in Open MPI: - ---------------------------------------------------------------------------- - -MPI component frameworks: -------------------------- - -bml - BTL management layer -coll - MPI collective algorithms -fbtl - file byte transfer layer: abstraction for individual - read/write operations for OMPIO -fcoll - collective read and write operations for MPI I/O -fs - file system functions for MPI I/O -io - MPI I/O -mtl - Matching transport layer, used for MPI point-to-point - messages on some types of networks -op - Back end computations for intrinsic MPI_Op operators -osc - MPI one-sided communications -pml - MPI point-to-point management layer -rte - Run-time environment operations -sharedfp - shared file pointer operations for MPI I/O -topo - MPI topology routines -vprotocol - Protocols for the "v" PML - -OpenSHMEM component frameworks: -------------------------- - -atomic - OpenSHMEM atomic operations -memheap - OpenSHMEM memory allocators that support the - PGAS memory model -scoll - OpenSHMEM collective operations -spml - OpenSHMEM "pml-like" layer: supports one-sided, - point-to-point operations -sshmem - OpenSHMEM shared memory backing facility - - -Back-end run-time environment (RTE) component frameworks: ---------------------------------------------------------- - -dfs - Distributed file system -errmgr - RTE error manager -ess - RTE environment-specific services -filem - Remote file management -grpcomm - RTE group communications -iof - I/O forwarding -notifier - System-level notification support -odls - OpenRTE daemon local launch subsystem -oob - Out of band messaging -plm - Process lifecycle management -ras - Resource allocation system -rmaps - Resource mapping system -rml - RTE message layer -routed - Routing table for the RML -rtc - Run-time control framework -schizo - OpenRTE personality framework -state - RTE state machine - -Miscellaneous frameworks: -------------------------- - -allocator - Memory allocator -backtrace - Debugging call stack backtrace support -btl - Point-to-point Byte Transfer Layer -dl - Dynamic loading library interface -event - Event library (libevent) versioning support -hwloc - Hardware locality (hwloc) versioning support -if - OS IP interface support -installdirs - Installation directory relocation services -memchecker - Run-time memory checking -memcpy - Memory copy support -memory - Memory management hooks -mpool - Memory pooling -patcher - Symbol patcher hooks -pmix - Process management interface (exascale) -pstat - Process status -rcache - Memory registration cache -sec - Security framework -shmem - Shared memory support (NOT related to OpenSHMEM) -timer - High-resolution timers - ---------------------------------------------------------------------------- - -Each framework typically has one or more components that are used at -run-time. For example, the btl framework is used by the MPI layer to -send bytes across different types underlying networks. The tcp btl, -for example, sends messages across TCP-based networks; the UCX PML -sends messages across OpenFabrics-based networks. - -Each component typically has some tunable parameters that can be -changed at run-time. Use the ompi_info command to check a component -to see what its tunable parameters are. For example: - - shell$ ompi_info --param btl tcp - -shows some of the parameters (and default values) for the tcp btl -component (use --level to show *all* the parameters; see below). 
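-
-The same pattern works for any of the frameworks listed above; a
-component name of "all" is also accepted, so, for example, the
-parameters of every installed collective component can be shown with:
-
-  shell$ ompi_info --param coll all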
- -Note that ompi_info only shows a small number a component's MCA -parameters by default. Each MCA parameter has a "level" value from 1 -to 9, corresponding to the MPI-3 MPI_T tool interface levels. In Open -MPI, we have interpreted these nine levels as three groups of three: - - 1. End user / basic - 2. End user / detailed - 3. End user / all - - 4. Application tuner / basic - 5. Application tuner / detailed - 6. Application tuner / all - - 7. MPI/OpenSHMEM developer / basic - 8. MPI/OpenSHMEM developer / detailed - 9. MPI/OpenSHMEM developer / all - -Here's how the three sub-groups are defined: - - 1. End user: Generally, these are parameters that are required for - correctness, meaning that someone may need to set these just to - get their MPI/OpenSHMEM application to run correctly. - 2. Application tuner: Generally, these are parameters that can be - used to tweak MPI application performance. - 3. MPI/OpenSHMEM developer: Parameters that either don't fit in the - other two, or are specifically intended for debugging / - development of Open MPI itself. - -Each sub-group is broken down into three classifications: - - 1. Basic: For parameters that everyone in this category will want to - see. - 2. Detailed: Parameters that are useful, but you probably won't need - to change them often. - 3. All: All other parameters -- probably including some fairly - esoteric parameters. - -To see *all* available parameters for a given component, specify that -ompi_info should use level 9: - - shell$ ompi_info --param btl tcp --level 9 - -These values can be overridden at run-time in several ways. At -run-time, the following locations are examined (in order) for new -values of parameters: - -1. /etc/openmpi-mca-params.conf - - This file is intended to set any system-wide default MCA parameter - values -- it will apply, by default, to all users who use this Open - MPI installation. The default file that is installed contains many - comments explaining its format. - -2. $HOME/.openmpi/mca-params.conf - - If this file exists, it should be in the same format as - /etc/openmpi-mca-params.conf. It is intended to provide - per-user default parameter values. - -3. environment variables of the form OMPI_MCA_ set equal to a - - - Where is the name of the parameter. For example, set the - variable named OMPI_MCA_btl_tcp_frag_size to the value 65536 - (Bourne-style shells): - - shell$ OMPI_MCA_btl_tcp_frag_size=65536 - shell$ export OMPI_MCA_btl_tcp_frag_size - -4. the mpirun/oshrun command line: --mca - - Where is the name of the parameter. For example: - - shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi - -These locations are checked in order. For example, a parameter value -passed on the mpirun command line will override an environment -variable; an environment variable will override the system-wide -defaults. - -Each component typically activates itself when relevant. For example, -the usNIC component will detect that usNIC devices are present and -will automatically be used for MPI communications. The SLURM -component will automatically detect when running inside a SLURM job -and activate itself. And so on. - -Components can be manually activated or deactivated if necessary, of -course. The most common components that are manually activated, -deactivated, or tuned are the "BTL" components -- components that are -used for MPI point-to-point communications on many types common -networks. 
-
-For example, to use *only* the TCP and "self" (process loopback)
-components for MPI communications, specify them in a
-comma-delimited list to the "btl" MCA parameter:
-
-  shell$ mpirun --mca btl tcp,self hello_world_mpi
-
-To add shared memory support, add "sm" to the comma-delimited list
-(list order does not matter):
-
-  shell$ mpirun --mca btl tcp,sm,self hello_world_mpi
-
-(there used to be a "vader" BTL for shared memory support; it was
-renamed to "sm" in Open MPI v5.0.0, but the alias "vader" still works
-as well)
-
-To deactivate a specific component, prepend the comma-delimited
-list with a "^" to negate it:
-
-  shell$ mpirun --mca btl ^tcp hello_world_mpi
-
-The above command will use any BTL component other than the tcp
-component.
-
-===========================================================================
-
-Common Questions
-----------------
-
-Many common questions about building and using Open MPI are answered
-on the FAQ:
-
-  https://www.open-mpi.org/faq/
-
-===========================================================================
-
-Got more questions?
--------------------
-
-Found a bug? Got a question? Want to make a suggestion? Want to
-contribute to Open MPI? Please let us know!
-
-When submitting questions and problems, be sure to include as much
-extra information as possible. This web page details all the
-information that we request in order to provide assistance:
-
-  https://www.open-mpi.org/community/help/
-
-User-level questions and comments should generally be sent to the
-user's mailing list (users@lists.open-mpi.org). Because of spam, only
-subscribers are allowed to post to this list (ensure that you
-subscribe with and post from *exactly* the same e-mail address --
-joe@example.com is considered different than
-joe@mycomputer.example.com!). Visit this page to subscribe to the
-user's list:
-
-  https://lists.open-mpi.org/mailman/listinfo/users
-
-Developer-level bug reports, questions, and comments should generally
-be sent to the developer's mailing list (devel@lists.open-mpi.org).
-Please do not post the same question to both lists. As with the
-user's list, only subscribers are allowed to post to the developer's
-list. Visit the following web page to subscribe:
-
-  https://lists.open-mpi.org/mailman/listinfo/devel
-
-Make today an Open MPI day!
diff --git a/README.JAVA.md b/README.JAVA.md
new file mode 100644
index 0000000000..234c7a6a1c
--- /dev/null
+++ b/README.JAVA.md
@@ -0,0 +1,281 @@
+# Open MPI Java Bindings
+
+## Important note
+
+JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
+NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
+JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
+THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
+CONTINUED DEVELOPER SUPPORT.
+
+## Overview
+
+This version of Open MPI provides support for Java-based
+MPI applications.
+
+The rest of this document provides step-by-step instructions on
+building OMPI with Java bindings, and compiling and running Java-based
+MPI applications. Also, part of the functionality is explained with
+examples. Further details about the design, implementation and usage
+of Java bindings in Open MPI can be found in [1]. The bindings follow
+a JNI approach, that is, we do not provide a pure Java implementation
+of MPI primitives, but a thin layer on top of the C
+implementation.
This is the same approach as in mpiJava [2]; in fact, +mpiJava was taken as a starting point for Open MPI Java bindings, but +they were later totally rewritten. + +1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and + implementation of Java bindings in Open MPI". Parallel Comput. + 59: 1-20 (2016). +2. M. Baker et al. "mpiJava: An object-oriented Java interface to + MPI". In Parallel and Distributed Processing, LNCS vol. 1586, + pp. 748-762, Springer (1999). + +## Building Java Bindings + +If this software was obtained as a developer-level checkout as opposed +to a tarball, you will need to start your build by running +`./autogen.pl`. This will also require that you have a fairly recent +version of GNU Autotools on your system - see the HACKING.md file for +details. + +Java support requires that Open MPI be built at least with shared libraries +(i.e., `--enable-shared`) - any additional options are fine and will not +conflict. Note that this is the default for Open MPI, so you don't +have to explicitly add the option. The Java bindings will build only +if `--enable-mpi-java` is specified, and a JDK is found in a typical +system default location. + +If the JDK is not in a place where we automatically find it, you can +specify the location. For example, this is required on the Mac +platform as the JDK headers are located in a non-typical location. Two +options are available for this purpose: + +1. `--with-jdk-bindir=`: the location of `javac` and `javah` +1. `--with-jdk-headers=`: the directory containing `jni.h` + +For simplicity, typical configurations are provided in platform files +under `contrib/platform/hadoop`. These will meet the needs of most +users, or at least provide a starting point for your own custom +configuration. + +In summary, therefore, you can configure the system using the +following Java-related options: + +``` +$ ./configure --with-platform=contrib/platform/hadoop/ ... + +```` + +or + +``` +$ ./configure --enable-mpi-java --with-jdk-bindir= --with-jdk-headers= ... +``` + +or simply + +``` +$ ./configure --enable-mpi-java ... +``` + +if JDK is in a "standard" place that we automatically find. + +## Running Java Applications + +For convenience, the `mpijavac` wrapper compiler has been provided for +compiling Java-based MPI applications. It ensures that all required MPI +libraries and class paths are defined. You can see the actual command +line using the `--showme` option, if you are interested. + +Once your application has been compiled, you can run it with the +standard `mpirun` command line: + +``` +$ mpirun java +``` + +For convenience, `mpirun` has been updated to detect the `java` command +and ensure that the required MPI libraries and class paths are defined +to support execution. You therefore do _NOT_ need to specify the Java +library path to the MPI installation, nor the MPI classpath. Any class +path definitions required for your application should be specified +either on the command line or via the `CLASSPATH` environment +variable. Note that the local directory will be added to the class +path if nothing is specified. + +As always, the `java` executable, all required libraries, and your +application classes must be available on all nodes. + +## Basic usage of Java bindings + +There is an MPI package that contains all classes of the MPI Java +bindings: `Comm`, `Datatype`, `Request`, etc. These classes have a +direct correspondence with classes defined by the MPI standard. MPI +primitives are just methods included in these classes. 
The convention +used for naming Java methods and classes is the usual camel-case +convention, e.g., the equivalent of `MPI_File_set_info(fh,info)` is +`fh.setInfo(info)`, where `fh` is an object of the class `File`. + +Apart from classes, the MPI package contains predefined public +attributes under a convenience class `MPI`. Examples are the +predefined communicator `MPI.COMM_WORLD` or predefined datatypes such +as `MPI.DOUBLE`. Also, MPI initialization and finalization are methods +of the `MPI` class and must be invoked by all MPI Java +applications. The following example illustrates these concepts: + +```java +import mpi.*; + +class ComputePi { + + public static void main(String args[]) throws MPIException { + + MPI.Init(args); + + int rank = MPI.COMM_WORLD.getRank(), + size = MPI.COMM_WORLD.getSize(), + nint = 100; // Intervals. + double h = 1.0/(double)nint, sum = 0.0; + + for(int i=rank+1; i<=nint; i+=size) { + double x = h * ((double)i - 0.5); + sum += (4.0 / (1.0 + x * x)); + } + + double sBuf[] = { h * sum }, + rBuf[] = new double[1]; + + MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0); + + if(rank == 0) System.out.println("PI: " + rBuf[0]); + MPI.Finalize(); + } +} +``` + +## Exception handling + +Java bindings in Open MPI support exception handling. By default, errors +are fatal, but this behavior can be changed. The Java API will throw +exceptions if the MPI.ERRORS_RETURN error handler is set: + +```java +MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN); +``` + +If you add this statement to your program, it will show the line +where it breaks, instead of just crashing in case of an error. +Error-handling code can be separated from main application code by +means of try-catch blocks, for instance: + +```java +try +{ + File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY); +} +catch(MPIException ex) +{ + System.err.println("Error Message: "+ ex.getMessage()); + System.err.println(" Error Class: "+ ex.getErrorClass()); + ex.printStackTrace(); + System.exit(-1); +} +``` + +## How to specify buffers + +In MPI primitives that require a buffer (either send or receive) the +Java API admits a Java array. Since Java arrays can be relocated by +the Java runtime environment, the MPI Java bindings need to make a +copy of the contents of the array to a temporary buffer, then pass the +pointer to this buffer to the underlying C implementation. From the +practical point of view, this implies an overhead associated to all +buffers that are represented by Java arrays. The overhead is small +for small buffers but increases for large arrays. + +There is a pool of temporary buffers with a default capacity of 64K. +If a temporary buffer of 64K or less is needed, then the buffer will +be obtained from the pool. But if the buffer is larger, then it will +be necessary to allocate the buffer and free it later. + +The default capacity of pool buffers can be modified with an Open MPI +MCA parameter: + +``` +shell$ mpirun --mca mpi_java_eager size ... +``` + +Where `size` is the number of bytes, or kilobytes if it ends with 'k', +or megabytes if it ends with 'm'. + +An alternative is to use "direct buffers" provided by standard classes +available in the Java SDK such as `ByteBuffer`. For convenience we +provide a few static methods `new[Type]Buffer` in the `MPI` class to +create direct buffers for a number of basic datatypes. 
Elements of the +direct buffer can be accessed with methods `put()` and `get()`, and +the number of elements in the buffer can be obtained with the method +`capacity()`. This example illustrates its use: + +```java +int myself = MPI.COMM_WORLD.getRank(); +int tasks = MPI.COMM_WORLD.getSize(); + +IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks), + out = MPI.newIntBuffer(MAXLEN); + +for(int i = 0; i < MAXLEN; i++) + out.put(i, myself); // fill the buffer with the rank + +Request request = MPI.COMM_WORLD.iAllGather( + out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT); +request.waitFor(); +request.free(); + +for(int i = 0; i < tasks; i++) +{ + for(int k = 0; k < MAXLEN; k++) + { + if(in.get(k + i * MAXLEN) != i) + throw new AssertionError("Unexpected value"); + } +} +``` + +Direct buffers are available for: `BYTE`, `CHAR`, `SHORT`, `INT`, +`LONG`, `FLOAT`, and `DOUBLE`. There is no direct buffer for booleans. + +Direct buffers are not a replacement for arrays, because they have +higher allocation and deallocation costs than arrays. In some +cases arrays will be a better choice. You can easily convert a +buffer into an array and vice versa. + +All non-blocking methods must use direct buffers and only +blocking methods can choose between arrays and direct buffers. + +The above example also illustrates that it is necessary to call +the `free()` method on objects whose class implements the `Freeable` +interface. Otherwise a memory leak is produced. + +## Specifying offsets in buffers + +In a C program, it is common to specify an offset in a array with +`&array[i]` or `array+i`, for instance to send data starting from +a given position in the array. The equivalent form in the Java bindings +is to `slice()` the buffer to start at an offset. Making a `slice()` +on a buffer is only necessary, when the offset is not zero. Slices +work for both arrays and direct buffers. + +```java +import static mpi.MPI.slice; +// ... +int numbers[] = new int[SIZE]; +// ... +MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0); +``` + +## Questions? Problems? + +If you have any problems, or find any bugs, please feel free to report +them to [Open MPI user's mailing +list](https://www.open-mpi.org/community/lists/ompi.php). diff --git a/README.JAVA.txt b/README.JAVA.txt deleted file mode 100644 index 312601ab8e..0000000000 --- a/README.JAVA.txt +++ /dev/null @@ -1,275 +0,0 @@ -*************************************************************************** -IMPORTANT NOTE - -JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE -NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF -JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF -THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND -CONTINUED DEVELOPER SUPPORT. - -*************************************************************************** - -This version of Open MPI provides support for Java-based -MPI applications. - -The rest of this document provides step-by-step instructions on -building OMPI with Java bindings, and compiling and running -Java-based MPI applications. Also, part of the functionality is -explained with examples. Further details about the design, -implementation and usage of Java bindings in Open MPI can be found -in [1]. The bindings follow a JNI approach, that is, we do not -provide a pure Java implementation of MPI primitives, but a thin -layer on top of the C implementation. 
This is the same approach -as in mpiJava [2]; in fact, mpiJava was taken as a starting point -for Open MPI Java bindings, but they were later totally rewritten. - - [1] O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and - implementation of Java bindings in Open MPI". Parallel Comput. - 59: 1-20 (2016). - - [2] M. Baker et al. "mpiJava: An object-oriented Java interface to - MPI". In Parallel and Distributed Processing, LNCS vol. 1586, - pp. 748-762, Springer (1999). - -============================================================================ - -Building Java Bindings - -If this software was obtained as a developer-level -checkout as opposed to a tarball, you will need to start your build by -running ./autogen.pl. This will also require that you have a fairly -recent version of autotools on your system - see the HACKING file for -details. - -Java support requires that Open MPI be built at least with shared libraries -(i.e., --enable-shared) - any additional options are fine and will not -conflict. Note that this is the default for Open MPI, so you don't -have to explicitly add the option. The Java bindings will build only -if --enable-mpi-java is specified, and a JDK is found in a typical -system default location. - -If the JDK is not in a place where we automatically find it, you can -specify the location. For example, this is required on the Mac -platform as the JDK headers are located in a non-typical location. Two -options are available for this purpose: - ---with-jdk-bindir= - the location of javac and javah ---with-jdk-headers= - the directory containing jni.h - -For simplicity, typical configurations are provided in platform files -under contrib/platform/hadoop. These will meet the needs of most -users, or at least provide a starting point for your own custom -configuration. - -In summary, therefore, you can configure the system using the -following Java-related options: - -$ ./configure --with-platform=contrib/platform/hadoop/ -... - -or - -$ ./configure --enable-mpi-java --with-jdk-bindir= - --with-jdk-headers= ... - -or simply - -$ ./configure --enable-mpi-java ... - -if JDK is in a "standard" place that we automatically find. - ----------------------------------------------------------------------------- - -Running Java Applications - -For convenience, the "mpijavac" wrapper compiler has been provided for -compiling Java-based MPI applications. It ensures that all required MPI -libraries and class paths are defined. You can see the actual command -line using the --showme option, if you are interested. - -Once your application has been compiled, you can run it with the -standard "mpirun" command line: - -$ mpirun java - -For convenience, mpirun has been updated to detect the "java" command -and ensure that the required MPI libraries and class paths are defined -to support execution. You therefore do NOT need to specify the Java -library path to the MPI installation, nor the MPI classpath. Any class -path definitions required for your application should be specified -either on the command line or via the CLASSPATH environmental -variable. Note that the local directory will be added to the class -path if nothing is specified. - -As always, the "java" executable, all required libraries, and your -application classes must be available on all nodes. - ----------------------------------------------------------------------------- - -Basic usage of Java bindings - -There is an MPI package that contains all classes of the MPI Java -bindings: Comm, Datatype, Request, etc. 
These classes have a direct -correspondence with classes defined by the MPI standard. MPI primitives -are just methods included in these classes. The convention used for -naming Java methods and classes is the usual camel-case convention, -e.g., the equivalent of MPI_File_set_info(fh,info) is fh.setInfo(info), -where fh is an object of the class File. - -Apart from classes, the MPI package contains predefined public attributes -under a convenience class MPI. Examples are the predefined communicator -MPI.COMM_WORLD or predefined datatypes such as MPI.DOUBLE. Also, MPI -initialization and finalization are methods of the MPI class and must -be invoked by all MPI Java applications. The following example illustrates -these concepts: - -import mpi.*; - -class ComputePi { - - public static void main(String args[]) throws MPIException { - - MPI.Init(args); - - int rank = MPI.COMM_WORLD.getRank(), - size = MPI.COMM_WORLD.getSize(), - nint = 100; // Intervals. - double h = 1.0/(double)nint, sum = 0.0; - - for(int i=rank+1; i<=nint; i+=size) { - double x = h * ((double)i - 0.5); - sum += (4.0 / (1.0 + x * x)); - } - - double sBuf[] = { h * sum }, - rBuf[] = new double[1]; - - MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0); - - if(rank == 0) System.out.println("PI: " + rBuf[0]); - MPI.Finalize(); - } -} - ----------------------------------------------------------------------------- - -Exception handling - -Java bindings in Open MPI support exception handling. By default, errors -are fatal, but this behavior can be changed. The Java API will throw -exceptions if the MPI.ERRORS_RETURN error handler is set: - - MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN); - -If you add this statement to your program, it will show the line -where it breaks, instead of just crashing in case of an error. -Error-handling code can be separated from main application code by -means of try-catch blocks, for instance: - - try - { - File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY); - } - catch(MPIException ex) - { - System.err.println("Error Message: "+ ex.getMessage()); - System.err.println(" Error Class: "+ ex.getErrorClass()); - ex.printStackTrace(); - System.exit(-1); - } - - ----------------------------------------------------------------------------- - -How to specify buffers - -In MPI primitives that require a buffer (either send or receive) the -Java API admits a Java array. Since Java arrays can be relocated by -the Java runtime environment, the MPI Java bindings need to make a -copy of the contents of the array to a temporary buffer, then pass the -pointer to this buffer to the underlying C implementation. From the -practical point of view, this implies an overhead associated to all -buffers that are represented by Java arrays. The overhead is small -for small buffers but increases for large arrays. - -There is a pool of temporary buffers with a default capacity of 64K. -If a temporary buffer of 64K or less is needed, then the buffer will -be obtained from the pool. But if the buffer is larger, then it will -be necessary to allocate the buffer and free it later. - -The default capacity of pool buffers can be modified with an 'mca' -parameter: - - mpirun --mca mpi_java_eager size ... - -Where 'size' is the number of bytes, or kilobytes if it ends with 'k', -or megabytes if it ends with 'm'. - -An alternative is to use "direct buffers" provided by standard -classes available in the Java SDK such as ByteBuffer. 
For convenience -we provide a few static methods "new[Type]Buffer" in the MPI class -to create direct buffers for a number of basic datatypes. Elements -of the direct buffer can be accessed with methods put() and get(), -and the number of elements in the buffer can be obtained with the -method capacity(). This example illustrates its use: - - int myself = MPI.COMM_WORLD.getRank(); - int tasks = MPI.COMM_WORLD.getSize(); - - IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks), - out = MPI.newIntBuffer(MAXLEN); - - for(int i = 0; i < MAXLEN; i++) - out.put(i, myself); // fill the buffer with the rank - - Request request = MPI.COMM_WORLD.iAllGather( - out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT); - request.waitFor(); - request.free(); - - for(int i = 0; i < tasks; i++) - { - for(int k = 0; k < MAXLEN; k++) - { - if(in.get(k + i * MAXLEN) != i) - throw new AssertionError("Unexpected value"); - } - } - -Direct buffers are available for: BYTE, CHAR, SHORT, INT, LONG, -FLOAT, and DOUBLE. There is no direct buffer for booleans. - -Direct buffers are not a replacement for arrays, because they have -higher allocation and deallocation costs than arrays. In some -cases arrays will be a better choice. You can easily convert a -buffer into an array and vice versa. - -All non-blocking methods must use direct buffers and only -blocking methods can choose between arrays and direct buffers. - -The above example also illustrates that it is necessary to call -the free() method on objects whose class implements the Freeable -interface. Otherwise a memory leak is produced. - ----------------------------------------------------------------------------- - -Specifying offsets in buffers - -In a C program, it is common to specify an offset in a array with -"&array[i]" or "array+i", for instance to send data starting from -a given position in the array. The equivalent form in the Java bindings -is to "slice()" the buffer to start at an offset. Making a "slice()" -on a buffer is only necessary, when the offset is not zero. Slices -work for both arrays and direct buffers. - - import static mpi.MPI.slice; - ... - int numbers[] = new int[SIZE]; - ... - MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0); - ----------------------------------------------------------------------------- - -If you have any problems, or find any bugs, please feel free to report -them to Open MPI user's mailing list (see -https://www.open-mpi.org/community/lists/ompi.php). diff --git a/README.md b/README.md new file mode 100644 index 0000000000..572ee880d1 --- /dev/null +++ b/README.md @@ -0,0 +1,2191 @@ +# Open MPI + +The Open MPI Project is an open source Message Passing Interface (MPI) +implementation that is developed and maintained by a consortium of +academic, research, and industry partners. Open MPI is therefore able +to combine the expertise, technologies, and resources from all across +the High Performance Computing community in order to build the best +MPI library available. Open MPI offers advantages for system and +software vendors, application developers and computer science +researchers. + +See [the MPI Forum web site](https://mpi-forum.org/) for information +about the MPI API specification. + +## Quick start + +In many cases, Open MPI can be built and installed by simply +indicating the installation directory on the command line: + +``` +$ tar xf openmpi-.tar.bz2 +$ cd openmpi- +$ ./configure --prefix= |& tee config.out +...lots of output... +$ make -j 8 |& tee make.out +...lots of output... 
+$ make install |& tee install.out
+...lots of output...
+```
+
+Note that there are many, many configuration options to the
+`./configure` step. Some of them may be needed for your particular
+environment; see below for descriptions of the options available.
+
+If your installation prefix path is not writable by a regular user,
+you may need to use sudo or su to run the `make install` step. For
+example:
+
+```
+$ sudo make install |& tee install.out
+[sudo] password for jsquyres:
+...lots of output...
+```
+
+Finally, note that VPATH builds are fully supported. For example:
+
+```
+$ tar xf openmpi-.tar.bz2
+$ cd openmpi-
+$ mkdir build
+$ cd build
+$ ../configure --prefix= |& tee config.out
+...etc.
+```
+
+## Table of contents
+
+The rest of this file contains:
+
+* [General release notes about Open MPI](#general-notes)
+  * [Platform-specific notes](#platform-notes)
+  * [Compiler-specific notes](#compiler-notes)
+  * [Run-time support notes](#general-run-time-support-notes)
+  * [MPI functionality and features](#mpi-functionality-and-features)
+  * [OpenSHMEM functionality and
+    features](#openshmem-functionality-and-features)
+  * [MPI collectives](#mpi-collectives)
+  * [OpenSHMEM collectives](#openshmem-collectives)
+  * [Network support](#network-support)
+  * [Open MPI extensions](#open-mpi-extensions)
+* [Detailed information on building Open MPI](#building-open-mpi)
+  * [Installation options](#installation-options)
+  * [Networking support and options](#networking-support--options)
+  * [Run-time system support and options](#run-time-system-support)
+  * [Miscellaneous support
+    libraries](#miscellaneous-support-libraries)
+  * [MPI functionality options](#mpi-functionality)
+  * [OpenSHMEM functionality options](#openshmem-functionality)
+  * [Miscellaneous functionality
+    options](#miscellaneous-functionality)
+* [Open MPI version and library numbering
+  policies](#open-mpi-version-numbers-and-binary-compatibility)
+  * [Backwards compatibility policies](#backwards-compatibility)
+  * [Software version numbering](#software-version-number)
+  * [Shared library version numbering](#shared-library-version-number)
+* [Information on how to both query and validate your Open MPI
+  installation](#checking-your-open-mpi-installation)
+* [Description of Open MPI extensions](#open-mpi-api-extensions)
+  * [Compiling the extensions](#compiling-the-extensions)
+  * [Using the extensions](#using-the-extensions)
+* [Examples showing how to compile Open MPI applications](#compiling-open-mpi-applications)
+* [Examples showing how to run Open MPI applications](#running-open-mpi-applications)
+* [Summary information on the various plugin
+  frameworks](#the-modular-component-architecture-mca)
+  * [MPI layer frameworks](#mpi-layer-frameworks)
+  * [OpenSHMEM component frameworks](#openshmem-component-frameworks)
+  * [Run-time environment
+    frameworks](#back-end-run-time-environment-rte-component-frameworks)
+  * [Miscellaneous frameworks](#miscellaneous-frameworks)
+  * [Other notes about frameworks](#framework-notes)
+* [How to get more help](#questions--problems)
+
+Also, note that much, much more information is also available [in the
+Open MPI FAQ](https://www.open-mpi.org/faq/).
+
+
+## General notes
+
+The following abbreviated list of release notes applies to this code
+base as of this writing (April 2020):
+
+* Open MPI now includes two public software layers: MPI and OpenSHMEM.
+  Throughout this document, references to Open MPI implicitly include
+  both of these layers.
When distinction between these two layers is + necessary, we will reference them as the "MPI" and "OpenSHMEM" + layers respectively. + +* OpenSHMEM is a collaborative effort between academia, industry, and + the U.S. Government to create a specification for a standardized API + for parallel programming in the Partitioned Global Address Space + (PGAS). For more information about the OpenSHMEM project, including + access to the current OpenSHMEM specification, please visit + http://openshmem.org/. + + This OpenSHMEM implementation will only work in Linux environments + with a restricted set of supported networks. + +* Open MPI includes support for a wide variety of supplemental + hardware and software package. When configuring Open MPI, you may + need to supply additional flags to the `configure` script in order + to tell Open MPI where the header files, libraries, and any other + required files are located. As such, running `configure` by itself + may not include support for all the devices (etc.) that you expect, + especially if their support headers / libraries are installed in + non-standard locations. Network interconnects are an easy example + to discuss -- Libfabric and OpenFabrics networks, for example, both + have supplemental headers and libraries that must be found before + Open MPI can build support for them. You must specify where these + files are with the appropriate options to configure. See the + listing of configure command-line switches, below, for more details. + +* The majority of Open MPI's documentation is here in this file, the + included man pages, and on [the web site + FAQ](https://www.open-mpi.org/). + +* Note that Open MPI documentation uses the word "component" + frequently; the word "plugin" is probably more familiar to most + users. As such, end users can probably completely substitute the + word "plugin" wherever you see "component" in our documentation. + For what it's worth, we use the word "component" for historical + reasons, mainly because it is part of our acronyms and internal API + function calls. + +* The run-time systems that are currently supported are: + * rsh / ssh + * PBS Pro, Torque + * Platform LSF (tested with v9.1.1 and later) + * SLURM + * Cray XE, XC, and XK + * Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine + +* Systems that have been tested are: + * Linux (various flavors/distros), 64 bit (x86, ppc, aarch64), + with gcc (>=4.8.x+), clang (>=3.6.0), Absoft (fortran), Intel, + and Portland (*) + * macOS (10.14-10.15), 64 bit (x86_64) with XCode compilers + + (*) Be sure to read the Compiler Notes, below. + +* Other systems have been lightly (but not fully) tested: + * Linux (various flavors/distros), 32 bit, with gcc + * Cygwin 32 & 64 bit with gcc + * ARMv6, ARMv7 + * Other 64 bit platforms. + * OpenBSD. Requires configure options `--enable-mca-no-build=patcher` + and `--disable-dlopen` with this release. + * Problems have been reported when building Open MPI on FreeBSD 11.1 + using the clang-4.0 system compiler. A workaround is to build + Open MPI using the GNU compiler. + +* Open MPI has taken some steps towards [Reproducible + Builds](https://reproducible-builds.org/). Specifically, Open MPI's + `configure` and `make` process, by default, records the build date + and some system-specific information such as the hostname where Open + MPI was built and the username who built it. 
If you desire a + Reproducible Build, set the `$SOURCE_DATE_EPOCH`, `$USER` and + `$HOSTNAME` environment variables before invoking `configure` and + `make`, and Open MPI will use those values instead of invoking + `whoami` and/or `hostname`, respectively. See + https://reproducible-builds.org/docs/source-date-epoch/ for + information on the expected format and content of the + `$SOURCE_DATE_EPOCH` variable. + + +### Platform Notes + +- N/A + + +### Compiler Notes + +* Open MPI requires a C99-capable compiler to build. + +* On platforms other than x86-64, ARM, and PPC, Open MPI requires a + compiler that either supports C11 atomics or the GCC `__atomic` + atomics (e.g., GCC >= v4.7.2). + +* Mixing compilers from different vendors when building Open MPI + (e.g., using the C/C++ compiler from one vendor and the Fortran + compiler from a different vendor) has been successfully employed by + some Open MPI users (discussed on the Open MPI user's mailing list), + but such configurations are not tested and not documented. For + example, such configurations may require additional compiler / + linker flags to make Open MPI build properly. + + A not-uncommon case for this is when building on MacOS with the + system-default GCC compiler (i.e., `/usr/bin/gcc`), but a 3rd party + gfortran (e.g., provided by Homebrew, in `/usr/local/bin/gfortran`). + Since these compilers are provided by different organizations, they + have different default search paths. For example, if Homebrew has + also installed a local copy of Libevent (a 3rd party package that + Open MPI requires), the MacOS-default `gcc` linker will find it + without any additional command line flags, but the Homebrew-provided + gfortran linker will not. In this case, it may be necessary to + provide the following on the configure command line: + + ``` + $ ./configure FCFLAGS=-L/usr/local/lib ... + ``` + + This `-L` flag will then be passed to the Fortran linker when + creating Open MPI's Fortran libraries, and it will therefore be able + to find the installed Libevent. + +* In general, the latest versions of compilers of a given vendor's + series have the least bugs. We have seen cases where Vendor XYZ's + compiler version A.B fails to compile Open MPI, but version A.C + (where C>B) works just fine. If you run into a compile failure, you + might want to double check that you have the latest bug fixes and + patches for your compiler. + +* Users have reported issues with older versions of the Fortran PGI + compiler suite when using Open MPI's (non-default) `--enable-debug` + configure option. Per the above advice of using the most recent + version of a compiler series, the Open MPI team recommends using the + latest version of the PGI suite, and/or not using the `--enable-debug` + configure option. If it helps, here's what we have found with some + (not comprehensive) testing of various versions of the PGI compiler + suite: + + * pgi-8 : NO known good version with `--enable-debug` + * pgi-9 : 9.0-4 known GOOD + * pgi-10: 10.0-0 known GOOD + * pgi-11: NO known good version with `--enable-debug` + * pgi-12: 12.10 known BAD with `-m32`, but known GOOD without `-m32` + (and 12.8 and 12.9 both known BAD with `--enable-debug`) + * pgi-13: 13.9 known BAD with `-m32`, 13.10 known GOOD without `-m32` + * pgi-15: 15.10 known BAD with `-m32` + +* Similarly, there is a known Fortran PGI compiler issue with long + source directory path names that was resolved in 9.0-4 (9.0-3 is + known to be broken in this regard). 
+ +* Open MPI does not support the PGI compiler suite on OS X or MacOS. + See issues below for more details: + * https://github.com/open-mpi/ompi/issues/2604 + * https://github.com/open-mpi/ompi/issues/2605 + +* OpenSHMEM Fortran bindings do not support the "no underscore" + Fortran symbol convention. IBM's `xlf` compilers build in that mode + by default. As such, IBM's `xlf` compilers cannot build/link the + OpenSHMEM Fortran bindings by default. A workaround is to pass + `FC="xlf -qextname"` at configure time to force a trailing + underscore. See [this + issue](https://github.com/open-mpi/ompi/issues/3612) for more + details. + +* MPI applications that use the mpi_f08 module on PowerPC platforms + (tested ppc64le) will likely experience runtime failures if: + * they are using a GNU linker (ld) version after v2.25.1 and before v2.28, + *and* + * they compiled with PGI (tested 17.5) or XL (tested v15.1.5) compilers. + This was noticed on Ubuntu 16.04 which uses the 2.26.1 version of + `ld` by default. However, this issue impacts any OS using a version + of `ld` noted above. This GNU linker regression will be fixed in + version 2.28. [Here is a link to the GNU bug on this + issue](https://sourceware.org/bugzilla/show_bug.cgi?id=21306). The + XL compiler will include a fix for this issue in a future release. + +* On NetBSD-6 (at least AMD64 and i386), and possibly on OpenBSD, + Libtool misidentifies properties of f95/g95, leading to obscure + compile-time failures if used to build Open MPI. You can work + around this issue by ensuring that libtool will not use f95/g95 + (e.g., by specifying `FC=`, or otherwise ensuring + a different Fortran compiler will be found earlier in the path than + `f95`/`g95`), or by disabling the Fortran MPI bindings with + `--disable-mpi-fortran`. + +* On OpenBSD/i386, if you configure with + `--enable-mca-no-build=patcher`, you will also need to add + `--disable-dlopen`. Otherwise, odd crashes can occur + nondeterministically. + +* Absoft 11.5.2 plus a service pack from September 2012 (which Absoft + says is available upon request), or a version later than 11.5.2 + (e.g., 11.5.3), is required to compile the Fortran `mpi_f08` + module. + +* Open MPI does not support the Sparc v8 CPU target. However, + as of Solaris Studio 12.1, and later compilers, one should not + specify `-xarch=v8plus` or `-xarch=v9`. The use of the options + `-m32` and `-m64` for producing 32 and 64 bit targets, respectively, + are now preferred by the Solaris Studio compilers. GCC may + require either `-m32` or `-mcpu=v9 -m32`, depending on GCC version. + +* If one tries to build OMPI on Ubuntu with Solaris Studio using the C++ + compiler and the `-m32` option, you might see a warning: + + ``` + CC: Warning: failed to detect system linker version, falling back to custom linker usage + ``` + + And the build will fail. One can overcome this error by either + setting `LD_LIBRARY_PATH` to the location of the 32 bit libraries + (most likely /lib32), or giving `LDFLAGS="-L/lib32 -R/lib32"` to the + `configure` command. Officially, Solaris Studio is not supported on + Ubuntu Linux distributions, so additional problems might be + incurred. + +* Open MPI does not support the `gccfss` compiler (GCC For SPARC + Systems; a now-defunct compiler project from Sun). + +* At least some versions of the Intel 8.1 compiler seg fault while + compiling certain Open MPI source code files. As such, it is not + supported. 
+
+* It has been reported that the Intel 9.1 and 10.0 compilers fail to
+  compile Open MPI on IA64 platforms. As of 12 Sep 2012, there is
+  very little (if any) testing performed on IA64 platforms (with any
+  compiler). Support is "best effort" for these platforms, but it is
+  doubtful that any effort will be expended to fix the Intel 9.1 /
+  10.0 compiler issues on this platform.
+
+* Early versions of the Intel 12.1 Linux compiler suite on x86_64 seem
+  to have a bug that prevents Open MPI from working. Symptoms
+  include immediate segv of the wrapper compilers (e.g., `mpicc`) and
+  MPI applications. As of 1 Feb 2012, if you upgrade to the latest
+  version of the Intel 12.1 Linux compiler suite, the problem will go
+  away.
+
+* The Portland Group compilers prior to version 7.0 require the
+  `-Msignextend` compiler flag to extend the sign bit when converting
+  from a shorter to longer integer. This is different than other
+  compilers (such as GNU). When compiling Open MPI with the Portland
+  compiler suite, the following flags should be passed to Open MPI's
+  `configure` script:
+
+  ```
+  shell$ ./configure CFLAGS=-Msignextend CXXFLAGS=-Msignextend \
+         --with-wrapper-cflags=-Msignextend \
+         --with-wrapper-cxxflags=-Msignextend ...
+  ```
+
+  This will both compile Open MPI with the proper compile flags and
+  also automatically add "-Msignextend" when the C and C++ MPI wrapper
+  compilers are used to compile user MPI applications.
+
+* It has been reported that Pathscale 5.0.5 and 6.0.527 compilers
+  give an internal compiler error when trying to build Open MPI.
+
+* As of July 2017, the Pathscale compiler suite apparently has no
+  further commercial support, and it does not look like there will be
+  further releases. Any issues discovered regarding building /
+  running Open MPI with the Pathscale compiler suite therefore may not
+  be able to be resolved.
+
+* Using the Absoft compiler to build the MPI Fortran bindings on Suse
+  9.3 is known to fail due to a Libtool compatibility issue.
+
+* MPI Fortran API support has been completely overhauled since the
+  Open MPI v1.5/v1.6 series.
+
+  There is now only a single Fortran MPI wrapper compiler and a
+  single Fortran OpenSHMEM wrapper compiler: `mpifort` and `oshfort`,
+  respectively. `mpif77` and `mpif90` still exist, but they are
+  symbolic links to `mpifort`.
+
+  Similarly, Open MPI's `configure` script only recognizes the `FC`
+  and `FCFLAGS` environment variables (to specify the Fortran
+  compiler and compiler flags, respectively). The `F77` and `FFLAGS`
+  environment variables are ***IGNORED***.
+
+  As a direct result, it is ***STRONGLY*** recommended that you
+  specify a Fortran compiler that uses file suffixes to determine
+  Fortran code layout (e.g., free form vs. fixed). For example, with
+  some versions of the IBM XLF compiler, it is preferable to use
+  `FC=xlf` instead of `FC=xlf90`, because `xlf` will automatically
+  determine the difference between free form and fixed Fortran source
+  code.
+
+  However, many Fortran compilers allow specifying additional
+  command-line arguments to indicate which Fortran dialect to use.
+  For example, if `FC=xlf90`, you may need to use `mpifort --qfixed ...`
+  to compile fixed format Fortran source files.
+
+  You can use either `ompi_info` or `oshmem_info` to see with which
+  Fortran compiler Open MPI was configured and compiled.
+
+  There are up to three sets of Fortran MPI bindings that may be
+  provided (depending on your Fortran compiler):
+
+  1. `mpif.h`: This is the first MPI Fortran interface that was
+     defined in MPI-1. It is a file that is included in Fortran
+     source code. Open MPI's `mpif.h` does not declare any MPI
+     subroutines; they are all implicit.
+
+  1. `mpi` module: The `mpi` module file was added in MPI-2. It
+     provides strong compile-time parameter type checking for MPI
+     subroutines.
+
+  1. `mpi_f08` module: The `mpi_f08` module was added in MPI-3. It
+     provides many advantages over the `mpif.h` file and `mpi` module.
+     For example, MPI handles have distinct types (vs. all being
+     integers). See the MPI-3 document for more details.
+
+  ***NOTE:*** The `mpi_f08` module is ***STRONGLY*** recommended for
+  all new MPI Fortran subroutines and applications. Note that the
+  `mpi_f08` module can be used in conjunction with the other two
+  Fortran MPI bindings in the same application (only one binding can
+  be used per subroutine/function, however). Full interoperability
+  between `mpif.h`/`mpi` module and `mpi_f08` module MPI handle types
+  is provided, allowing `mpi_f08` to be used in new subroutines in
+  legacy MPI applications.
+
+  Per the OpenSHMEM specification, there is only one Fortran OpenSHMEM
+  binding provided:
+
+  * `shmem.fh`: All Fortran OpenSHMEM programs should include
+    `shmem.fh`, and Fortran OpenSHMEM programs that use constants
+    defined by OpenSHMEM ***MUST*** include `shmem.fh`.
+
+  The following notes apply to the above-listed Fortran bindings:
+
+  * All Fortran compilers support the `mpif.h`/`shmem.fh`-based
+    bindings, with one exception: the `MPI_SIZEOF` interfaces will
+    only be present when Open MPI is built with a Fortran compiler
+    that supports the `INTERFACE` keyword and `ISO_FORTRAN_ENV`. Most
+    notably, this excludes the GNU Fortran compiler suite before
+    version 4.9.
+
+  * The level of support provided by the `mpi` module is based on your
+    Fortran compiler.
+
+    If Open MPI is built with a non-GNU Fortran compiler, or if Open
+    MPI is built with the GNU Fortran compiler >= v4.9, all MPI
+    subroutines will be prototyped in the `mpi` module. All calls to
+    MPI subroutines will therefore have their parameter types checked
+    at compile time.
+
+    If Open MPI is built with an old `gfortran` (i.e., < v4.9), a
+    limited `mpi` module will be built. Due to the limitations of
+    these compilers, and per guidance from the MPI-3 specification,
+    all MPI subroutines with "choice" buffers are specifically *not*
+    included in the `mpi` module, and their parameters will not be
+    checked at compile time. Specifically, all MPI subroutines with
+    no "choice" buffers are prototyped and will receive strong
+    parameter type checking at run-time (e.g., `MPI_INIT`,
+    `MPI_COMM_RANK`, etc.).
+
+    Similar to the `mpif.h` interface, `MPI_SIZEOF` is only supported
+    on Fortran compilers that support `INTERFACE` and
+    `ISO_FORTRAN_ENV`.
+
+  * The `mpi_f08` module has been tested with the Intel Fortran
+    compiler and gfortran >= 4.9. Other modern Fortran compilers
+    likely also work.
+
+    Many older Fortran compilers do not provide enough modern Fortran
+    features to support the `mpi_f08` module. For example, `gfortran`
+    < v4.9 does not provide enough support for the `mpi_f08` module.
+ + You can examine the output of the following command to see all + the Fortran features that are/are not enabled in your Open MPI + installation: + + ``` + shell$ ompi_info | grep -i fort + ``` + + +### General Run-Time Support Notes + +* The Open MPI installation must be in your `PATH` on all nodes (and + potentially `LD_LIBRARY_PATH` or `DYLD_LIBRARY_PATH`, if + `libmpi`/`libshmem` is a shared library), unless using the + `--prefix` or `--enable-mpirun-prefix-by-default` functionality (see + below). + +* Open MPI's run-time behavior can be customized via Modular Component + Architecture (MCA) parameters (see below for more information on how + to get/set MCA parameter values). Some MCA parameters can be set in + a way that renders Open MPI inoperable (see notes about MCA + parameters later in this file). In particular, some parameters have + required options that must be included. + + * If specified, the `btl` parameter must include the `self` + component, or Open MPI will not be able to deliver messages to the + same rank as the sender. For example: `mpirun --mca btl tcp,self + ...` + * If specified, the `btl_tcp_if_exclude` parameter must include the + loopback device (`lo` on many Linux platforms), or Open MPI will + not be able to route MPI messages using the TCP BTL. For example: + `mpirun --mca btl_tcp_if_exclude lo,eth1 ...` + +* Running on nodes with different endian and/or different datatype + sizes within a single parallel job is supported in this release. + However, Open MPI does not resize data when datatypes differ in size + (for example, sending a 4 byte `MPI_DOUBLE` and receiving an 8 byte + `MPI_DOUBLE` will fail). + + +### MPI Functionality and Features + +* All MPI-3.1 functionality is supported. + +* Note that starting with Open MPI v4.0.0, prototypes for several + legacy MPI-1 symbols that were deleted in the MPI-3.0 specification + (which was published in 2012) are no longer available by default in + `mpi.h`. Specifically, several MPI-1 symbols were deprecated in the + 1996 publishing of the MPI-2.0 specification. These deprecated + symbols were eventually removed from the MPI-3.0 specification in + 2012. + + The symbols that now no longer appear by default in Open MPI's + `mpi.h` are: + + * `MPI_Address` (replaced by `MPI_Get_address`) + * `MPI_Errhandler_create` (replaced by `MPI_Comm_create_errhandler`) + * `MPI_Errhandler_get` (replaced by `MPI_Comm_get_errhandler`) + * `MPI_Errhandler_set` (replaced by `MPI_Comm_set_errhandler`) + * `MPI_Type_extent` (replaced by `MPI_Type_get_extent`) + * `MPI_Type_hindexed` (replaced by `MPI_Type_create_hindexed`) + * `MPI_Type_hvector` (replaced by `MPI_Type_create_hvector`) + * `MPI_Type_lb` (replaced by `MPI_Type_get_extent`) + * `MPI_Type_struct` (replaced by `MPI_Type_create_struct`) + * `MPI_Type_ub` (replaced by `MPI_Type_get_extent`) + * `MPI_LB` (replaced by `MPI_Type_create_resized`) + * `MPI_UB` (replaced by `MPI_Type_create_resized`) + * `MPI_COMBINER_HINDEXED_INTEGER` + * `MPI_COMBINER_HVECTOR_INTEGER` + * `MPI_COMBINER_STRUCT_INTEGER` + * `MPI_Handler_function` (replaced by `MPI_Comm_errhandler_function`) + + Although these symbols are no longer prototyped in `mpi.h`, they + are still present in the MPI library in Open MPI v4.0.x. This + enables legacy MPI applications to link and run successfully with + Open MPI v4.0.x, even though they will fail to compile. + + ***WARNING:*** Future releases of Open MPI beyond the v4.0.x series + may remove these symbols altogether. 
+ + ***WARNING:*** The Open MPI team ***STRONGLY*** encourages all MPI + application developers to stop using these constructs that were + first deprecated over 20 years ago, and finally removed from the MPI + specification in MPI-3.0 (in 2012). + + ***WARNING:*** [The Open MPI + FAQ](https://www.open-mpi.org/faq/?category=mpi-removed) contains + examples of how to update legacy MPI applications using these + deleted symbols to use the "new" symbols. + + All that being said, if you are unable to immediately update your + application to stop using these legacy MPI-1 symbols, you can + re-enable them in `mpi.h` by configuring Open MPI with the + `--enable-mpi1-compatibility` flag. + +* Rank reordering support is available using the TreeMatch library. It + is activated for the graph and `dist_graph` communicator topologies. + +* When using MPI deprecated functions, some compilers will emit + warnings. For example: + + ``` + shell$ cat deprecated_example.c + #include + void foo(void) { + MPI_Datatype type; + MPI_Type_struct(1, NULL, NULL, NULL, &type); + } + shell$ mpicc -c deprecated_example.c + deprecated_example.c: In function 'foo': + deprecated_example.c:4: warning: 'MPI_Type_struct' is deprecated (declared at /opt/openmpi/include/mpi.h:1522) + shell$ + ``` + +* `MPI_THREAD_MULTIPLE` is supported with some exceptions. + + The following PMLs support `MPI_THREAD_MULTIPLE`: + 1. `cm` (see list (1) of supported MTLs, below) + 1. `ob1` (see list (2) of supported BTLs, below) + 1. `ucx` + + (1) The `cm` PML and the following MTLs support `MPI_THREAD_MULTIPLE`: + 1. `ofi` (Libfabric) + 1. `portals4` + + (2) The `ob1` PML and the following BTLs support `MPI_THREAD_MULTIPLE`: + 1. `self` + 1. `sm` + 1. `smcuda` + 1. `tcp` + 1. `ugni` + 1. `usnic` + + Currently, MPI File operations are not thread safe even if MPI is + initialized for `MPI_THREAD_MULTIPLE` support. + +* `MPI_REAL16` and `MPI_COMPLEX32` are only supported on platforms + where a portable C datatype can be found that matches the Fortran + type `REAL*16`, both in size and bit representation. + +* The "libompitrace" library is bundled in Open MPI and is installed + by default (it can be disabled via the `--disable-libompitrace` + flag). This library provides a simplistic tracing of select MPI + function calls via the MPI profiling interface. Linking it in to + your application via (e.g., via `-lompitrace`) will automatically + output to stderr when some MPI functions are invoked: + + ``` + shell$ cd examples/ + shell$ mpicc hello_c.c -o hello_c -lompitrace + shell$ mpirun -np 1 hello_c + MPI_INIT: argc 1 + Hello, world, I am 0 of 1 + MPI_BARRIER[0]: comm MPI_COMM_WORLD + MPI_FINALIZE[0] + shell$ + ``` + + Keep in mind that the output from the trace library is going to + `stderr`, so it may output in a slightly different order than the + `stdout` from your application. + + This library is being offered as a "proof of concept" / convenience + from Open MPI. If there is interest, it is trivially easy to extend + it to printf for other MPI functions. Pull requests on github.com + would be greatly appreciated. + + +### OpenSHMEM Functionality and Features + +All OpenSHMEM-1.3 functionality is supported. + + +### MPI Collectives + +* The `cuda` coll component provides CUDA-aware support for the + reduction type collectives with GPU buffers. This component is only + compiled into the library when the library has been configured with + CUDA-aware support. 
It intercepts calls to the reduction + collectives, copies the data to staging buffers if GPU buffers, then + calls underlying collectives to do the work. + + +### OpenSHMEM Collectives + +* The `fca` scoll component: the Mellanox Fabric Collective + Accelerator (FCA) is a solution for offloading collective operations + from the MPI process onto Mellanox QDR InfiniBand switch CPUs and + HCAs. + +* The `basic` scoll component: Reference implementation of all + OpenSHMEM collective operations. + + +### Network Support + +* There are several main MPI network models available: `ob1`, `cm`, + and `ucx`. `ob1` uses BTL ("Byte Transfer Layer") + components for each supported network. `cm` uses MTL ("Matching + Transport Layer") components for each supported network. `ucx` uses + the OpenUCX transport. + + * `ob1` supports a variety of networks that can be used in + combination with each other: + * OpenFabrics: InfiniBand, iWARP, and RoCE + * Loopback (send-to-self) + * Shared memory + * TCP + * SMCUDA + * Cisco usNIC + * uGNI (Cray Gemini, Aries) + * shared memory (XPMEM, Linux CMA, Linux KNEM, and + copy-in/copy-out shared memory) + + * `cm` supports a smaller number of networks (and they cannot be + used together), but may provide better overall MPI performance: + * Intel Omni-Path PSM2 (version 11.2.173 or later) + * Intel True Scale PSM (QLogic InfiniPath) + * OpenFabrics Interfaces ("libfabric" tag matching) + * Portals 4 + + * UCX is the [Unified Communication X (UCX) communication + library](https://www.openucx.org/). This is an open-source + project developed in collaboration between industry, laboratories, + and academia to create an open-source production grade + communication framework for data centric and high-performance + applications. The UCX library can be downloaded from repositories + (e.g., Fedora/RedHat yum repositories). The UCX library is also + part of Mellanox OFED and Mellanox HPC-X binary distributions. + + UCX currently supports: + + * OpenFabrics Verbs (including InfiniBand and RoCE) + * Cray's uGNI + * TCP + * Shared memory + * NVIDIA CUDA drivers + + While users can manually select any of the above transports at run + time, Open MPI will select a default transport as follows: + + 1. If InfiniBand devices are available, use the UCX PML. + 1. If PSM, PSM2, or other tag-matching-supporting Libfabric + transport devices are available (e.g., Cray uGNI), use the `cm` + PML and a single appropriate corresponding `mtl` module. + 1. Otherwise, use the `ob1` PML and one or more appropriate `btl` + modules. + + Users can override Open MPI's default selection algorithms and force + the use of a specific transport if desired by setting the `pml` MCA + parameter (and potentially the `btl` and/or `mtl` MCA parameters) at + run-time: + + ``` + shell$ mpirun --mca pml ob1 --mca btl [comma-delimted-BTLs] ... + or + shell$ mpirun --mca pml cm --mca mtl [MTL] ... + or + shell$ mpirun --mca pml ucx ... + ``` + + There is a known issue when using UCX with very old Mellanox + Infiniband HCAs, in particular HCAs preceding the introduction of + the ConnectX product line, which can result in Open MPI crashing in + MPI_Finalize. This issue is addressed by UCX release 1.9.0 and + newer. + +* The main OpenSHMEM network model is `ucx`; it interfaces directly + with UCX. + +* In prior versions of Open MPI, InfiniBand and RoCE support was + provided through the `openib` BTL and `ob1` PML plugins. 
Starting + with Open MPI 4.0.0, InfiniBand support through the `openib` plugin + is both deprecated and superseded by the `ucx` PML component. The + `openib` BTL was removed in Open MPI v5.0.0. + + While the `openib` BTL depended on `libibverbs`, the UCX PML depends + on the UCX library. + + Once installed, Open MPI can be built with UCX support by adding + `--with-ucx` to the Open MPI configure command. Once Open MPI is + configured to use UCX, the runtime will automatically select the + `ucx` PML if one of the supported networks is detected (e.g., + InfiniBand). It's possible to force using UCX in the `mpirun` or + `oshrun` command lines by specifying any or all of the following mca + parameters: `--mca pml ucx` for MPI point-to-point operations, + `--mca spml ucx` for OpenSHMEM support, and `--mca osc ucx` for MPI + RMA (one-sided) operations. + +* The `usnic` BTL is support for Cisco's usNIC device ("userspace NIC") + on Cisco UCS servers with the Virtualized Interface Card (VIC). + Although the usNIC is accessed via the OpenFabrics Libfabric API + stack, this BTL is specific to Cisco usNIC devices. + +* uGNI is a Cray library for communicating over the Gemini and Aries + interconnects. + +* The OpenFabrics Enterprise Distribution (OFED) software package v1.0 + will not work properly with Open MPI v1.2 (and later) due to how its + Mellanox InfiniBand plugin driver is created. The problem is fixed + with OFED v1.1 (and later). + +* The use of `fork()` with Libiverbs-based networks (i.e., the UCX + PML) is only partially supported, and only on Linux kernels >= + v2.6.15 with `libibverbs` v1.1 or later (first released as part of + OFED v1.2), per restrictions imposed by the OFED network stack. + +* Linux `knem` support is used when the `sm` (shared memory) BTL is + compiled with knem support (see the `--with-knem` configure option) + and the `knem` Linux module is loaded in the running kernel. If the + `knem` Linux kernel module is not loaded, the `knem` support is (by + default) silently deactivated during Open MPI jobs. + + See https://knem.gforge.inria.fr/ for details on Knem. + +* Linux Cross-Memory Attach (CMA) or XPMEM is used by the `sm` shared + memory BTL when the CMA/XPMEM libraries are installed, + respectively. Linux CMA and XPMEM are similar (but different) + mechanisms for Open MPI to utilize single-copy semantics for shared + memory. + + +### Open MPI Extensions + +An MPI "extensions" framework is included in Open MPI, but is not +enabled by default. See the "Open MPI API Extensions" section below +for more information on compiling and using MPI extensions. + +The following extensions are included in this version of Open MPI: + +1. `pcollreq`: Provides routines for persistent collective + communication operations and persistent neighborhood collective + communication operations, which are planned to be included in + MPI-4.0. The function names are prefixed with `MPIX_` instead of + `MPI_`, like `MPIX_Barrier_init`, because they are not + standardized yet. Future versions of Open MPI will switch to the + `MPI_` prefix once the MPI Standard which includes this feature is + published. See their man page for more details. +1. `shortfloat`: Provides MPI datatypes `MPIX_C_FLOAT16`, + `MPIX_SHORT_FLOAT`, `MPIX_SHORT_FLOAT`, and + `MPIX_CXX_SHORT_FLOAT_COMPLEX` if corresponding language types are + available. See `ompi/mpiext/shortfloat/README.txt` for details. +1. 
`affinity`: Provides the `OMPI_Affinity_str()` API, which returns
+   a string indicating the resources to which a process is bound.  For
+   more details, see its man page.
+1. `cuda`: When the library is compiled with CUDA-aware support, it
+   provides two things.  First, a macro
+   `MPIX_CUDA_AWARE_SUPPORT`.  Secondly, the function
+   `MPIX_Query_cuda_support()` that can be used to query for support.
+1. `example`: A non-functional extension; its only purpose is to
+   provide an example for how to create other extensions.
+
+
+## Building Open MPI
+
+If you have checked out a ***developer's copy*** of Open MPI (i.e.,
+you cloned from Git), you really need to read the `HACKING` file
+before attempting to build Open MPI.  Really.
+
+If you have downloaded a tarball, then things are much simpler.
+Open MPI uses a traditional `configure` script paired with `make` to
+build.  Typical installs can be of the pattern:
+
+```
+shell$ ./configure [...options...]
+shell$ make [-j N] all install
+    (use an integer value of N for parallel builds)
+```
+
+There are many available `configure` options (see `./configure --help`
+for a full list); a summary of the more commonly used ones is included
+below.
+
+***NOTE:*** if you are building Open MPI on a network filesystem, the
+machine on which you are building *must* be time-synchronized with
+the file server.  Specifically: Open MPI's build system *requires*
+accurate filesystem timestamps.  If your `make` output includes
+warnings about timestamps in the future or runs GNU Automake, Autoconf,
+and/or Libtool, this is *not normal*, and you may have an invalid
+build.  Ensure that the time on your build machine is synchronized
+with the time on your file server, or build on a local filesystem.
+Then remove the Open MPI source directory and start over (e.g., by
+re-extracting the Open MPI tarball).
+
+Note that for many of Open MPI's `--with-FOO` options, Open MPI will,
+by default, search for header files and/or libraries for `FOO`.  If
+the relevant files are found, Open MPI will build support for `FOO`;
+if they are not found, Open MPI will skip building support for `FOO`.
+However, if you specify `--with-FOO` on the configure command line and
+Open MPI is unable to find relevant support for `FOO`, configure will
+assume that it was unable to provide a feature that was specifically
+requested and will abort so that a human can resolve the issue.
+
+Additionally, if a search directory is specified in the form
+`--with-FOO=DIR`, Open MPI will:
+
+1. Search for `FOO`'s header files in `DIR/include`.
+2. Search for `FOO`'s library files:
+   1. If `--with-FOO-libdir=LIBDIR` was specified, search in
+      `LIBDIR`.
+   1. Otherwise, search in `DIR/lib`, and if they are not found
+      there, search again in `DIR/lib64`.
+3. If both the relevant header files and libraries are found:
+   1. Open MPI will build support for `FOO`.
+   1. If the root path where the FOO libraries are found is neither
+      `/usr` nor `/usr/local`, Open MPI will compile itself with
+      RPATH flags pointing to the directory where FOO's libraries
+      are located.  Open MPI does not RPATH `/usr/lib[64]` and
+      `/usr/local/lib[64]` because many systems already search these
+      directories for run-time libraries by default; adding RPATH for
+      them could have unintended consequences for the search path
+      ordering.
+
+
+### Installation Options
+
+* `--prefix=DIR`:
+  Install Open MPI into the base directory named `DIR`.  Hence, Open
+  MPI will place its executables in `DIR/bin`, its header files in
+  `DIR/include`, its libraries in `DIR/lib`, etc.
+
+* `--disable-shared`:
+  By default, Open MPI and OpenSHMEM build shared libraries, and all
+  components are built as dynamic shared objects (DSOs).  This switch
+  disables this default; it is really only useful when used with
+  `--enable-static`.  Specifically, this option does *not* imply
+  `--enable-static`; enabling static libraries and disabling shared
+  libraries are two independent options.
+
+* `--enable-static`:
+  Build MPI and OpenSHMEM as static libraries, and statically link in
+  all components.  Note that this option does *not* imply
+  `--disable-shared`; enabling static libraries and disabling shared
+  libraries are two independent options.
+
+  Be sure to read the description of `--without-memory-manager`,
+  below; it may have some effect on `--enable-static`.
+
+* `--disable-wrapper-rpath`:
+  By default, the wrapper compilers (e.g., `mpicc`) will enable
+  "rpath" support in generated executables on systems that support it.
+  That is, they will include a file reference to the location of Open
+  MPI's libraries in the application executable itself.  This means
+  that the user does not have to set `LD_LIBRARY_PATH` to find Open
+  MPI's libraries (e.g., if they are installed in a location that the
+  run-time linker does not search by default).
+
+  On systems that utilize the GNU `ld` linker, recent enough versions
+  will actually utilize "runpath" functionality, not "rpath".  There
+  is an important difference between the two:
+
+  1. "rpath": the location of the Open MPI libraries is hard-coded into
+     the MPI/OpenSHMEM application and cannot be overridden at
+     run-time.
+  1. "runpath": the location of the Open MPI libraries is hard-coded into
+     the MPI/OpenSHMEM application, but can be overridden at run-time
+     by setting the `LD_LIBRARY_PATH` environment variable.
+
+  For example, consider that you install Open MPI vA.B.0 and
+  compile/link your MPI/OpenSHMEM application against it.  Later, you
+  install Open MPI vA.B.1 to a different installation prefix (e.g.,
+  `/opt/openmpi/A.B.1` vs. `/opt/openmpi/A.B.0`), and you leave the old
+  installation intact.
+
+  In the rpath case, your MPI application will always use the
+  libraries from your A.B.0 installation.  In the runpath case, you
+  can set the `LD_LIBRARY_PATH` environment variable to point to the
+  A.B.1 installation, and then your MPI application will use those
+  libraries.
+
+  Note that in both cases, however, if you remove the original A.B.0
+  installation and set `LD_LIBRARY_PATH` to point to the A.B.1
+  installation, your application will use the A.B.1 libraries.
+
+  This rpath/runpath behavior can be disabled via
+  `--disable-wrapper-rpath`.
+
+  If you would like to keep the rpath option but not enable runpath,
+  a different configure option is available:
+  `--disable-wrapper-runpath`.
+
+* `--enable-dlopen`:
+  Build all of Open MPI's components as standalone Dynamic Shared
+  Objects (DSO's) that are loaded at run-time (this is the default).
+  The opposite of this option, `--disable-dlopen`, causes two things:
+
+  1. All of Open MPI's components will be built as part of Open MPI's
+     normal libraries (e.g., `libmpi`).
+  1. Open MPI will not attempt to open any DSO's at run-time.
+
+  Note that this option does *not* imply that OMPI's libraries will be
+  built as static objects (e.g., `libmpi.a`).
It only specifies the + location of OMPI's components: standalone DSOs or folded into the + Open MPI libraries. You can control whether Open MPI's libraries + are build as static or dynamic via `--enable|disable-static` and + `--enable|disable-shared`. + +* `--disable-show-load-errors-by-default`: + Set the default value of the `mca_base_component_show_load_errors` + MCA variable: the `--enable` form of this option sets the MCA + variable to true, the `--disable` form sets the MCA variable to + false. The MCA `mca_base_component_show_load_errors` variable can + still be overridden at run time via the usual MCA-variable-setting + mechanisms; this configure option simply sets the default value. + + The `--disable` form of this option is intended for Open MPI + packagers who tend to enable support for many different types of + networks and systems in their packages. For example, consider a + packager who includes support for both the FOO and BAR networks in + their Open MPI package, both of which require support libraries + (`libFOO.so` and `libBAR.so`). If an end user only has BAR + hardware, they likely only have `libBAR.so` available on their + systems -- not `libFOO.so`. Disabling load errors by default will + prevent the user from seeing potentially confusing warnings about + the FOO components failing to load because `libFOO.so` is not + available on their systems. + + Conversely, system administrators tend to build an Open MPI that is + targeted at their specific environment, and contains few (if any) + components that are not needed. In such cases, they might want + their users to be warned that the FOO network components failed to + load (e.g., if `libFOO.so` was mistakenly unavailable), because Open + MPI may otherwise silently failover to a slower network path for MPI + traffic. + +* `--with-platform=FILE`: + Load configure options for the build from `FILE`. Options on the + command line that are not in `FILE` are also used. Options on the + command line and in `FILE` are replaced by what is in `FILE`. + +* `--with-libmpi-name=STRING`: + Replace `libmpi.*` and `libmpi_FOO.*` (where `FOO` is one of the + fortran supporting libraries installed in lib) with `libSTRING.*` + and `libSTRING_FOO.*`. This is provided as a convenience mechanism + for third-party packagers of Open MPI that might want to rename + these libraries for their own purposes. This option is *not* + intended for typical users of Open MPI. + +* `--enable-mca-no-build=LIST`: + Comma-separated list of `-` pairs that will not be + built. For example, `--enable-mca-no-build=btl-portals,oob-ud` will + disable building the portals BTL and the ud OOB component. + + +### Networking support / options + +* `--with-fca=DIR`: + Specify the directory where the Mellanox FCA library and + header files are located. + + FCA is the support library for Mellanox switches and HCAs. + +* `--with-hcoll=DIR`: + Specify the directory where the Mellanox hcoll library and header + files are located. This option is generally only necessary if the + hcoll headers and libraries are not in default compiler/linker + search paths. + + hcoll is the support library for MPI collective operation offload on + Mellanox ConnectX-3 HCAs (and later). + +* `--with-knem=DIR`: + Specify the directory where the knem libraries and header files are + located. This option is generally only necessary if the knem headers + and libraries are not in default compiler/linker search paths. 
+ + knem is a Linux kernel module that allows direct process-to-process + memory copies (optionally using hardware offload), potentially + increasing bandwidth for large messages sent between messages on the + same server. See [the Knem web site](https://knem.gforge.inria.fr/) + for details. + +* `--with-libfabric=DIR`: + Specify the directory where the OpenFabrics Interfaces `libfabric` + library and header files are located. This option is generally only + necessary if the libfabric headers and libraries are not in default + compiler/linker search paths. + + Libfabric is the support library for OpenFabrics Interfaces-based + network adapters, such as Cisco usNIC, Intel True Scale PSM, Cray + uGNI, etc. + +* `--with-libfabric-libdir=DIR`: + Look in directory for the libfabric libraries. By default, Open MPI + will look in `DIR/lib` and `DIR/lib64`, which covers most cases. + This option is only needed for special configurations. + +* `--with-portals4=DIR`: + Specify the directory where the Portals4 libraries and header files + are located. This option is generally only necessary if the Portals4 + headers and libraries are not in default compiler/linker search + paths. + + Portals is a low-level network API for high-performance networking + on high-performance computing systems developed by Sandia National + Laboratories, Intel Corporation, and the University of New Mexico. + The Portals 4 Reference Implementation is a complete implementation + of Portals 4, with transport over InfiniBand verbs and UDP. + +* `--with-portals4-libdir=DIR`: + Location of libraries to link with for Portals4 support. + +* `--with-portals4-max-md-size=SIZE` and + `--with-portals4-max-va-size=SIZE`: + Set configuration values for Portals 4 + +* `--with-psm=`: + Specify the directory where the QLogic InfiniPath / Intel True Scale + PSM library and header files are located. This option is generally + only necessary if the PSM headers and libraries are not in default + compiler/linker search paths. + + PSM is the support library for QLogic InfiniPath and Intel TrueScale + network adapters. + +* `--with-psm-libdir=DIR`: + Look in directory for the PSM libraries. By default, Open MPI will + look in `DIR/lib` and `DIR/lib64`, which covers most cases. This + option is only needed for special configurations. + +* `--with-psm2=DIR`: + Specify the directory where the Intel Omni-Path PSM2 library and + header files are located. This option is generally only necessary + if the PSM2 headers and libraries are not in default compiler/linker + search paths. + + PSM is the support library for Intel Omni-Path network adapters. + +* `--with-psm2-libdir=DIR`: + Look in directory for the PSM2 libraries. By default, Open MPI will + look in `DIR/lib` and `DIR/lib64`, which covers most cases. This + option is only needed for special configurations. + +* `--with-ucx=DIR`: + Specify the directory where the UCX libraries and header files are + located. This option is generally only necessary if the UCX headers + and libraries are not in default compiler/linker search paths. + +* `--with-ucx-libdir=DIR`: + Look in directory for the UCX libraries. By default, Open MPI will + look in `DIR/lib` and `DIR/lib64`, which covers most cases. This + option is only needed for special configurations. + +* `--with-usnic`: + Abort configure if Cisco usNIC support cannot be built. 
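+
+As an illustration of how the options in this section are commonly
+combined, here is a hypothetical `configure` invocation for a system
+that has UCX, knem, and libfabric installed under `/opt` (the
+installation paths are examples only; substitute the locations used on
+your system):
+
+```
+shell$ ./configure --prefix=/opt/openmpi \
+            --with-ucx=/opt/ucx \
+            --with-knem=/opt/knem \
+            --with-libfabric=/opt/libfabric
+```
+
+Because each package is requested explicitly, configure will abort if
+any of them cannot be found, rather than silently skipping that
+support.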
+ + +### Run-time system support + +* `--enable-mpirun-prefix-by-default`: + This option forces the `mpirun` command to always behave as if + `--prefix $prefix` was present on the command line (where `$prefix` + is the value given to the `--prefix` option to configure). This + prevents most `rsh`/`ssh`-based users from needing to modify their + shell startup files to set the `PATH` and/or `LD_LIBRARY_PATH` for + Open MPI on remote nodes. Note, however, that such users may still + desire to set `PATH` -- perhaps even in their shell startup files -- + so that executables such as `mpicc` and `mpirun` can be found + without needing to type long path names. + +* `--enable-orte-static-ports`: + Enable ORTE static ports for TCP OOB (default: enabled). + +* `--with-alps`: + Force the building of for the Cray Alps run-time environment. If + Alps support cannot be found, configure will abort. + +* `--with-lsf=DIR`: + Specify the directory where the LSF libraries and header files are + located. This option is generally only necessary if the LSF headers + and libraries are not in default compiler/linker search paths. + + LSF is a resource manager system, frequently used as a batch + scheduler in HPC systems. + +* `--with-lsf-libdir=DIR`: + Look in directory for the LSF libraries. By default, Open MPI will + look in `DIR/lib` and `DIR/lib64`, which covers most cases. This + option is only needed for special configurations. + +* `--with-slurm`: + Force the building of SLURM scheduler support. + +* `--with-sge`: + Specify to build support for the Oracle Grid Engine (OGE) resource + manager and/or the Open Grid Engine. OGE support is disabled by + default; this option must be specified to build OMPI's OGE support. + + The Oracle Grid Engine (OGE) and open Grid Engine packages are + resource manager systems, frequently used as a batch scheduler in + HPC systems. It used to be called the "Sun Grid Engine", which is + why the option is still named `--with-sge`. + +* `--with-tm=DIR`: + Specify the directory where the TM libraries and header files are + located. This option is generally only necessary if the TM headers + and libraries are not in default compiler/linker search paths. + + TM is the support library for the Torque and PBS Pro resource + manager systems, both of which are frequently used as a batch + scheduler in HPC systems. + + +### Miscellaneous support libraries + +* `--with-libevent(=VALUE)` + This option specifies where to find the libevent support headers and + library. The following `VALUE`s are permitted: + + * `internal`: Use Open MPI's internal copy of libevent. + * `external`: Use an external Libevent installation (rely on default + compiler and linker paths to find it) + * ``: Same as `internal`. + * `DIR`: Specify the location of a specific libevent + installation to use + + By default (or if `--with-libevent` is specified with no `VALUE`), + Open MPI will build and use the copy of libevent that it has in its + source tree. However, if the `VALUE` is `external`, Open MPI will + look for the relevant libevent header file and library in default + compiler / linker locations. Or, `VALUE` can be a directory tree + where the libevent header file and library can be found. This + option allows operating systems to include Open MPI and use their + default libevent installation instead of Open MPI's bundled + libevent. + + libevent is a support library that provides event-based processing, + timers, and signal handlers. 
Open MPI requires libevent to build; + passing --without-libevent will cause configure to abort. + +* `--with-libevent-libdir=DIR`: + Look in directory for the libevent libraries. This option is only + usable when building Open MPI against an external libevent + installation. Just like other `--with-FOO-libdir` configure + options, this option is only needed for special configurations. + +* `--with-hwloc(=VALUE)`: + hwloc is a support library that provides processor and memory + affinity information for NUMA platforms. It is required by Open + MPI. Therefore, specifying `--with-hwloc=no` (or `--without-hwloc`) + is disallowed. + + By default (i.e., if `--with-hwloc` is not specified, or if + `--with-hwloc` is specified without a value), Open MPI will first try + to find/use an hwloc installation on the current system. If Open + MPI cannot find one, it will fall back to build and use the internal + copy of hwloc included in the Open MPI source tree. + + Alternatively, the `--with-hwloc` option can be used to specify + where to find the hwloc support headers and library. The following + `VALUE`s are permitted: + + * `internal`: Only use Open MPI's internal copy of hwloc. + * `external`: Only use an external hwloc installation (rely on + default compiler and linker paths to find it). + * `DIR`: Only use the specific hwloc installation found in + the specified directory. + +* `--with-hwloc-libdir=DIR`: + Look in directory for the hwloc libraries. This option is only + usable when building Open MPI against an external hwloc + installation. Just like other `--with-FOO-libdir` configure options, + this option is only needed for special configurations. + +* `--disable-hwloc-pci`: + Disable building hwloc's PCI device-sensing capabilities. On some + platforms (e.g., SusE 10 SP1, x86-64), the libpci support library is + broken. Open MPI's configure script should usually detect when + libpci is not usable due to such brokenness and turn off PCI + support, but there may be cases when configure mistakenly enables + PCI support in the presence of a broken libpci. These cases may + result in `make` failing with warnings about relocation symbols in + libpci. The `--disable-hwloc-pci` switch can be used to force Open + MPI to not build hwloc's PCI device-sensing capabilities in these + cases. + + Similarly, if Open MPI incorrectly decides that libpci is broken, + you can force Open MPI to build hwloc's PCI device-sensing + capabilities by using `--enable-hwloc-pci`. + + hwloc can discover PCI devices and locality, which can be useful for + Open MPI in assigning message passing resources to MPI processes. + +* `--with-libltdl=DIR`: + Specify the directory where the GNU Libtool libltdl libraries and + header files are located. This option is generally only necessary + if the libltdl headers and libraries are not in default + compiler/linker search paths. + + Note that this option is ignored if `--disable-dlopen` is specified. + +* `--disable-libompitrace`: + Disable building the simple `libompitrace` library (see note above + about libompitrace) + +* `--with-valgrind(=DIR)`: + Directory where the valgrind software is installed. If Open MPI + finds Valgrind's header files, it will include additional support + for Valgrind's memory-checking debugger. + + Specifically, it will eliminate a lot of false positives from + running Valgrind on MPI applications. There is a minor performance + penalty for enabling this option. 
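+
+As a sketch of how these support-library options fit together, the
+following hypothetical invocation builds Open MPI against an external
+libevent found in the default compiler/linker search paths and an hwloc
+installation under `/opt/hwloc` (both locations are examples only):
+
+```
+shell$ ./configure --prefix=/opt/openmpi \
+            --with-libevent=external \
+            --with-hwloc=/opt/hwloc \
+            --with-hwloc-libdir=/opt/hwloc/lib64
+```
+
+If neither option is given, the defaults described above apply: Open
+MPI builds its internal copy of libevent, and uses an external hwloc if
+one can be found (falling back to the internal copy otherwise).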
+ + +### MPI Functionality + +* `--with-mpi-param-check(=VALUE)`: + Whether or not to check MPI function parameters for errors at + runtime. The following `VALUE`s are permitted: + + * `always`: MPI function parameters are always checked for errors + * `never`: MPI function parameters are never checked for errors + * `runtime`: Whether MPI function parameters are checked depends on + the value of the MCA parameter `mpi_param_check` (default: yes). + * `yes`: Synonym for "always" (same as `--with-mpi-param-check`). + * `no`: Synonym for "never" (same as `--without-mpi-param-check`). + + If `--with-mpi-param` is not specified, `runtime` is the default. + +* `--disable-mpi-thread-multiple`: + Disable the MPI thread level `MPI_THREAD_MULTIPLE` (it is enabled by + default). + +* `--enable-mpi-java`: + Enable building of an ***EXPERIMENTAL*** Java MPI interface + (disabled by default). You may also need to specify + `--with-jdk-dir`, `--with-jdk-bindir`, and/or `--with-jdk-headers`. + See [README.JAVA.md](README.JAVA.md) for details. + + Note that this Java interface is ***INCOMPLETE*** (meaning: it does + not support all MPI functionality) and ***LIKELY TO CHANGE***. The + Open MPI developers would very much like to hear your feedback about + this interface. See [README.JAVA.md](README.JAVA.md) for more + details. + +* `--enable-mpi-fortran(=VALUE)`: + By default, Open MPI will attempt to build all 3 Fortran bindings: + `mpif.h`, the `mpi` module, and the `mpi_f08` module. The following + `VALUE`s are permitted: + + * `all`: Synonym for `yes`. + * `yes`: Attempt to build all 3 Fortran bindings; skip + any binding that cannot be built (same as + `--enable-mpi-fortran`). + * `mpifh`: Only build `mpif.h` support. + * `usempi`: Only build `mpif.h` and `mpi` module support. + * `usempif08`: Build `mpif.h`, `mpi` module, and `mpi_f08` + module support. + * `none`: Synonym for `no`. + * `no`: Do not build any MPI Fortran support (same as + `--disable-mpi-fortran`). This is mutually exclusive + with building the OpenSHMEM Fortran interface. + +* `--enable-mpi-ext(=LIST)`: + Enable Open MPI's non-portable API extensions. `LIST` is a + comma-delmited list of extensions. If no `LIST` is specified, all + of the extensions are enabled. + + See the "Open MPI API Extensions" section for more details. + +* `--disable-mpi-io`: + Disable built-in support for MPI-2 I/O, likely because an + externally-provided MPI I/O package will be used. Default is to use + the internal framework system that uses the ompio component and a + specially modified version of ROMIO that fits inside the romio + component + +* `--disable-io-romio`: + Disable the ROMIO MPI-IO component + +* `--with-io-romio-flags=FLAGS`: + Pass `FLAGS` to the ROMIO distribution configuration script. This + option is usually only necessary to pass + parallel-filesystem-specific preprocessor/compiler/linker flags back + to the ROMIO system. + +* `--disable-io-ompio`: + Disable the ompio MPI-IO component + +* `--enable-sparse-groups`: + Enable the usage of sparse groups. This would save memory + significantly especially if you are creating large + communicators. (Disabled by default) + + +### OpenSHMEM Functionality + +* `--disable-oshmem`: + Disable building the OpenSHMEM implementation (by default, it is + enabled). + +* `--disable-oshmem-fortran`: + Disable building only the Fortran OpenSHMEM bindings. Please see + the "Compiler Notes" section herein which contains further + details on known issues with various Fortran compilers. 
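+
+To illustrate how the MPI and OpenSHMEM options above can be combined,
+here is one hypothetical `configure` invocation; the specific choices
+are examples only, not recommendations:
+
+```
+shell$ ./configure --prefix=/opt/openmpi \
+            --with-mpi-param-check=runtime \
+            --enable-mpi-fortran=usempi \
+            --disable-oshmem
+```
+
+This builds MPI parameter checking that can be toggled at run time via
+the `mpi_param_check` MCA parameter, builds only the `mpif.h` and `mpi`
+module Fortran bindings, and skips the OpenSHMEM layer entirely.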
+ + +### Miscellaneous Functionality + +* `--without-memory-manager`: + Disable building Open MPI's memory manager. Open MPI's memory + manager is usually built on Linux based platforms, and is generally + only used for optimizations with some OpenFabrics-based networks (it + is not *necessary* for OpenFabrics networks, but some performance + loss may be observed without it). + + However, it may be necessary to disable the memory manager in order + to build Open MPI statically. + +* `--with-ft=TYPE`: + Specify the type of fault tolerance to enable. Options: LAM + (LAM/MPI-like), cr (Checkpoint/Restart). Fault tolerance support is + disabled unless this option is specified. + +* `--enable-peruse`: + Enable the PERUSE MPI data analysis interface. + +* `--enable-heterogeneous`: + Enable support for running on heterogeneous clusters (e.g., machines + with different endian representations). Heterogeneous support is + disabled by default because it imposes a minor performance penalty. + + ***THIS FUNCTIONALITY IS CURRENTLY BROKEN - DO NOT USE*** + +* `--with-wrapper-cflags=CFLAGS` +* `--with-wrapper-cxxflags=CXXFLAGS` +* `--with-wrapper-fflags=FFLAGS` +* `--with-wrapper-fcflags=FCFLAGS` +* `--with-wrapper-ldflags=LDFLAGS` +* `--with-wrapper-libs=LIBS`: + Add the specified flags to the default flags that are used in Open + MPI's "wrapper" compilers (e.g., `mpicc` -- see below for more + information about Open MPI's wrapper compilers). By default, Open + MPI's wrapper compilers use the same compilers used to build Open + MPI and specify a minimum set of additional flags that are necessary + to compile/link MPI applications. These configure options give + system administrators the ability to embed additional flags in + OMPI's wrapper compilers (which is a local policy decision). The + meanings of the different flags are: + + `CFLAGS`: Flags passed by the `mpicc` wrapper to the C compiler + `CXXFLAGS`: Flags passed by the `mpic++` wrapper to the C++ compiler + `FCFLAGS`: Flags passed by the `mpifort` wrapper to the Fortran compiler + `LDFLAGS`: Flags passed by all the wrappers to the linker + `LIBS`: Flags passed by all the wrappers to the linker + + There are other ways to configure Open MPI's wrapper compiler + behavior; see [the Open MPI FAQ](https://www.open-mpi.org/faq/) for + more information. + +There are many other options available -- see `./configure --help`. + +Changing the compilers that Open MPI uses to build itself uses the +standard Autoconf mechanism of setting special environment variables +either before invoking configure or on the configure command line. +The following environment variables are recognized by configure: + +* `CC`: C compiler to use +* `CFLAGS`: Compile flags to pass to the C compiler +* `CPPFLAGS`: Preprocessor flags to pass to the C compiler +* `CXX`: C++ compiler to use +* `CXXFLAGS`: Compile flags to pass to the C++ compiler +* `CXXCPPFLAGS`: Preprocessor flags to pass to the C++ compiler +* `FC`: Fortran compiler to use +* `FCFLAGS`: Compile flags to pass to the Fortran compiler +* `LDFLAGS`: Linker flags to pass to all compilers +* `LIBS`: Libraries to pass to all compilers (it is rarely + necessary for users to need to specify additional `LIBS`) +* `PKG_CONFIG`: Path to the `pkg-config` utility + +For example: + +``` +shell$ ./configure CC=mycc CXX=myc++ FC=myfortran ... +``` + +***NOTE:*** We generally suggest using the above command line form for +setting different compilers (vs. setting environment variables and +then invoking `./configure`). 
The above form will save all variables +and values in the `config.log` file, which makes post-mortem analysis +easier if problems occur. + +Note that if you intend to compile Open MPI with a `make` other than +the default one in your `PATH`, then you must either set the `$MAKE` +environment variable before invoking Open MPI's `configure` script, or +pass `MAKE=your_make_prog` to configure. For example: + +``` +shell$ ./configure MAKE=/path/to/my/make ... +``` + +This could be the case, for instance, if you have a shell alias for +`make`, or you always type `gmake` out of habit. Failure to tell +`configure` which non-default `make` you will use to compile Open MPI +can result in undefined behavior (meaning: don't do that). + +Note that you may also want to ensure that the value of +`LD_LIBRARY_PATH` is set appropriately (or not at all) for your build +(or whatever environment variable is relevant for your operating +system). For example, some users have been tripped up by setting to +use a non-default Fortran compiler via the `FC` environment variable, +but then failing to set `LD_LIBRARY_PATH` to include the directory +containing that non-default Fortran compiler's support libraries. +This causes Open MPI's `configure` script to fail when it tries to +compile / link / run simple Fortran programs. + +It is required that the compilers specified be compile and link +compatible, meaning that object files created by one compiler must be +able to be linked with object files from the other compilers and +produce correctly functioning executables. + +Open MPI supports all the `make` targets that are provided by GNU +Automake, such as: + +* `all`: build the entire Open MPI package +* `install`: install Open MPI +* `uninstall`: remove all traces of Open MPI from the $prefix +* `clean`: clean out the build tree + +Once Open MPI has been built and installed, it is safe to run `make +clean` and/or remove the entire build tree. + +VPATH and parallel builds are fully supported. + +Generally speaking, the only thing that users need to do to use Open +MPI is ensure that `PREFIX/bin` is in their `PATH` and `PREFIX/lib` is +in their `LD_LIBRARY_PATH`. Users may need to ensure to set the +`PATH` and `LD_LIBRARY_PATH` in their shell setup files (e.g., +`.bashrc`, `.cshrc`) so that non-interactive `rsh`/`ssh`-based logins +will be able to find the Open MPI executables. + + +## Open MPI Version Numbers and Binary Compatibility + +Open MPI has two sets of version numbers that are likely of interest +to end users / system administrator: + +1. Software version number +1. Shared library version numbers + +Both are predicated on Open MPI's definition of "backwards +compatibility." + +***NOTE:*** The version numbering conventions were changed with the +release of v1.10.0. Most notably, Open MPI no longer uses an +"odd/even" release schedule to indicate feature development vs. stable +releases. See the README in releases prior to v1.10.0 for more +information (e.g., +https://github.com/open-mpi/ompi/blob/v1.8/README#L1392-L1475). + + +### Backwards Compatibility + +Open MPI version Y is backwards compatible with Open MPI version X +(where Y>X) if users can: + +* Compile an MPI/OpenSHMEM application with version X, + `mpirun`/`oshrun` it with version Y, and get the same + user-observable behavior. +* Invoke `ompi_info` with the same CLI options in versions X and Y and + get the same user-observable behavior. 
+
+Note that this definition encompasses several things:
+
+* Application Binary Interface (ABI)
+* MPI / OpenSHMEM run time system
+* `mpirun` / `oshrun` command line options
+* MCA parameter names / values / meanings
+
+However, this definition only applies when the same version of Open
+MPI is used with all instances of the runtime and MPI / OpenSHMEM
+processes in a single MPI job.  If the versions are not exactly the
+same everywhere, Open MPI is not guaranteed to work properly in any
+scenario.
+
+Backwards compatibility tends to work best when user applications are
+dynamically linked to one version of the Open MPI / OSHMEM libraries,
+and can be updated at run time to link to a new version of the Open
+MPI / OSHMEM libraries.
+
+For example, if an MPI / OSHMEM application links statically against
+the libraries from Open MPI vX, then attempting to launch that
+application with `mpirun` / `oshrun` from Open MPI vY is not guaranteed
+to work (because it is mixing vX and vY of Open MPI in a single job).
+
+Similarly, if using a container technology that internally bundles all
+the libraries from Open MPI vX, attempting to launch that container
+with `mpirun` / `oshrun` from Open MPI vY is not guaranteed to work.
+
+### Software Version Number
+
+Official Open MPI releases use the common "A.B.C" version identifier
+format.  Each of the three numbers has a specific meaning:
+
+* Major: The major number is the first integer in the version string.
+  Changes in the major number typically indicate a significant
+  change in the code base and/or end-user functionality, and also
+  indicate a break from backwards compatibility.  Specifically: Open
+  MPI releases with different major version numbers are not
+  backwards compatible with each other.
+
+  ***CAVEAT:*** This rule does not extend to versions prior to v1.10.0.
+  Specifically: v1.10.x is not guaranteed to be backwards
+  compatible with other v1.x releases.
+
+* Minor: The minor number is the second integer in the version string.
+  Changes in the minor number indicate a user-observable change in the
+  code base and/or end-user functionality.  Backwards compatibility
+  will still be preserved with prior releases that have the same major
+  version number (e.g., v2.5.3 is backwards compatible with v2.3.1).
+
+* Release: The release number is the third integer in the version
+  string.  Changes in the release number typically indicate a bug fix
+  in the code base and/or end-user functionality.  For example, if
+  there is a release that only contains bug fixes and no other
+  user-observable changes or new features, only the third integer will
+  be increased (e.g., from v4.3.0 to v4.3.1).
+
+The "A.B.C" version number may optionally be followed by a Quantifier:
+
+* Quantifier: Open MPI version numbers sometimes have an arbitrary
+  string affixed to the end of the version number.  Common strings
+  include:
+  * aX: Indicates an alpha release.  X is an integer indicating the
+    number of the alpha release (e.g., v1.10.3a5 indicates the 5th
+    alpha release of version 1.10.3).
+  * bX: Indicates a beta release.  X is an integer indicating the
+    number of the beta release (e.g., v1.10.3b3 indicates the 3rd beta
+    release of version 1.10.3).
+  * rcX: Indicates a release candidate.  X is an integer indicating the
+    number of the release candidate (e.g., v1.10.3rc4 indicates the
+    4th release candidate of version 1.10.3).
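+
+A quick way to see which "A.B.C" version (plus any quantifier) an
+existing installation reports is to ask the installed commands
+themselves, for example:
+
+```
+shell$ ompi_info --version
+shell$ mpirun --version
+```
+
+Both commands print the version of the Open MPI installation that is
+found first in your `PATH`, which can be helpful when more than one
+installation is present.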
+ +Nightly development snapshot tarballs use a different version number +scheme; they contain three distinct values: + +* The git branch name from which the tarball was created. +* The date/timestamp, in `YYYYMMDDHHMM` format. +* The hash of the git commit from which the tarball was created. + +For example, a snapshot tarball filename of +`openmpi-v2.x-201703070235-e4798fb.tar.gz` indicates that this tarball +was created from the v2.x branch, on March 7, 2017, at 2:35am GMT, +from git hash e4798fb. + +### Shared Library Version Number + +The GNU Libtool official documentation details how the versioning +scheme works. The quick version is that the shared library versions +are a triple of integers: (current,revision,age), or `c:r:a`. This +triple is not related to the Open MPI software version number. There +are six simple rules for updating the values (taken almost verbatim +from the Libtool docs): + +1. Start with version information of `0:0:0` for each shared library. +1. Update the version information only immediately before a public + release of your software. More frequent updates are unnecessary, + and only guarantee that the current interface number gets larger + faster. +1. If the library source code has changed at all since the last + update, then increment revision (`c:r:a` becomes `c:r+1:a`). +1. If any interfaces have been added, removed, or changed since the + last update, increment current, and set revision to 0. +1. If any interfaces have been added since the last public release, + then increment age. +1. If any interfaces have been removed since the last public release, + then set age to 0. + +Here's how we apply those rules specifically to Open MPI: + +1. The above rules do not apply to MCA components (a.k.a. "plugins"); + MCA component `.so` versions stay unspecified. +1. The above rules apply exactly as written to the following libraries + starting with Open MPI version v1.5 (prior to v1.5, `libopen-pal` + and `libopen-rte` were still at `0:0:0` for reasons discussed in bug + ticket #2092 https://svn.open-mpi.org/trac/ompi/ticket/2092): + * `libopen-rte` + * `libopen-pal` + * `libmca_common_*` +1. The following libraries use a slightly modified version of the + above rules: rules 4, 5, and 6 only apply to the official MPI and + OpenSHMEM interfaces (functions, global variables). The rationale + for this decision is that the vast majority of our users only care + about the official/public MPI/OpenSHMEM interfaces; we therefore + want the `.so` version number to reflect only changes to the + official MPI/OpenSHMEM APIs. Put simply: non-MPI/OpenSHMEM API / + internal changes to the MPI-application-facing libraries are + irrelevant to pure MPI/OpenSHMEM applications. + * `libmpi` + * `libmpi_mpifh` + * `libmpi_usempi_tkr` + * `libmpi_usempi_ignore_tkr` + * `libmpi_usempif08` + * `libmpi_cxx` + * `libmpi_java` + * `liboshmem` + + +## Checking Your Open MPI Installation + +The `ompi_info` command can be used to check the status of your Open +MPI installation (located in `PREFIX/bin/ompi_info`). Running it with +no arguments provides a summary of information about your Open MPI +installation. + +Note that the `ompi_info` command is extremely helpful in determining +which components are installed as well as listing all the run-time +settable parameters that are available in each component (as well as +their default values). + +The following options may be helpful: + +* `--all`: Show a *lot* of information about your Open MPI + installation. 
+* `--parsable`: Display all the information in an easily + `grep`/`cut`/`awk`/`sed`-able format. +* `--param FRAMEWORK COMPONENT`: + A `FRAMEWORK` value of `all` and a `COMPONENT` value of `all` will + show all parameters to all components. Otherwise, the parameters of + all the components in a specific framework, or just the parameters + of a specific component can be displayed by using an appropriate + FRAMEWORK and/or COMPONENT name. +* `--level LEVEL`: + By default, `ompi_info` only shows "Level 1" MCA parameters -- + parameters that can affect whether MPI processes can run + successfully or not (e.g., determining which network interfaces to + use). The `--level` option will display all MCA parameters from + level 1 to `LEVEL` (the max `LEVEL` value is 9). Use `ompi_info + --param FRAMEWORK COMPONENT --level 9` to see *all* MCA parameters + for a given component. See "The Modular Component Architecture + (MCA)" section, below, for a fuller explanation. + +Changing the values of these parameters is explained in the "The +Modular Component Architecture (MCA)" section, below. + +When verifying a new Open MPI installation, we recommend running six +tests: + +1. Use `mpirun` to launch a non-MPI program (e.g., `hostname` or + `uptime`) across multiple nodes. +1. Use `mpirun` to launch a trivial MPI program that does no MPI + communication (e.g., the `hello_c` program in the `examples/` + directory in the Open MPI distribution). +1. Use `mpirun` to launch a trivial MPI program that sends and + receives a few MPI messages (e.g., the `ring_c` program in the + `examples/` directory in the Open MPI distribution). +1. Use `oshrun` to launch a non-OpenSHMEM program across multiple + nodes. +1. Use `oshrun` to launch a trivial MPI program that does no OpenSHMEM + communication (e.g., `hello_shmem.c` program in the `examples/` + directory in the Open MPI distribution.) +1. Use `oshrun` to launch a trivial OpenSHMEM program that puts and + gets a few messages (e.g., the `ring_shmem.c` in the `examples/` + directory in the Open MPI distribution.) + +If you can run all six of these tests successfully, that is a good +indication that Open MPI built and installed properly. + + +## Open MPI API Extensions + +Open MPI contains a framework for extending the MPI API that is +available to applications. Each extension is usually a standalone set +of functionality that is distinct from other extensions (similar to +how Open MPI's plugins are usually unrelated to each other). These +extensions provide new functions and/or constants that are available +to MPI applications. + +WARNING: These extensions are neither standard nor portable to other +MPI implementations! + +### Compiling the extensions + +Open MPI extensions are all enabled by default; they can be disabled +via the `--disable-mpi-ext` command line switch. + +Since extensions are meant to be used by advanced users only, this +file does not document which extensions are available or what they +do. Look in the ompi/mpiext/ directory to see the extensions; each +subdirectory of that directory contains an extension. Each has a +README file that describes what it does. + +### Using the extensions + +To reinforce the fact that these extensions are non-standard, you must +include a separate header file after `` to obtain the function +prototypes, constant declarations, etc. 
For example:
+
+```c
+#include <mpi.h>
+#if defined(OPEN_MPI) && OPEN_MPI
+#include <mpi-ext.h>
+#endif
+
+int main() {
+    MPI_Init(NULL, NULL);
+
+#if defined(OPEN_MPI) && OPEN_MPI
+    {
+        /* Query the resources to which this process is bound (see the
+           "affinity" extension described above) */
+        char ompi_bound[OMPI_AFFINITY_STRING_MAX];
+        char current_binding[OMPI_AFFINITY_STRING_MAX];
+        char exists[OMPI_AFFINITY_STRING_MAX];
+        OMPI_Affinity_str(OMPI_AFFINITY_LAYOUT_FMT, ompi_bound,
+                          current_binding, exists);
+    }
+#endif
+    MPI_Finalize();
+    return 0;
+}
+```
+
+Notice that the Open MPI-specific code is surrounded by the `#if`
+statement to ensure that it is only ever compiled by Open MPI.
+
+The Open MPI wrapper compilers (`mpicc` and friends) should
+automatically insert all relevant compiler and linker flags necessary
+to use the extensions.  No special flags or steps should be necessary
+compared to "normal" MPI applications.
+
+
+## Compiling Open MPI Applications
+
+Open MPI provides "wrapper" compilers that should be used for
+compiling MPI and OpenSHMEM applications:
+
+* C: `mpicc`, `oshcc`
+* C++: `mpiCC`, `oshCC` (or `mpic++` if your filesystem is case-insensitive)
+* Fortran: `mpifort`, `oshfort`
+
+For example:
+
+```
+shell$ mpicc hello_world_mpi.c -o hello_world_mpi -g
+shell$
+```
+
+For OpenSHMEM applications:
+
+```
+shell$ oshcc hello_shmem.c -o hello_shmem -g
+shell$
+```
+
+All the wrapper compilers do is add a variety of compiler and linker
+flags to the command line and then invoke a back-end compiler.  To be
+specific: the wrapper compilers do not parse source code at all; they
+are solely command-line manipulators, and have nothing to do with the
+actual compilation or linking of programs.  The end result is an MPI
+executable that is properly linked to all the relevant libraries.
+
+Customizing the behavior of the wrapper compilers is possible (e.g.,
+changing the compiler [not recommended] or specifying additional
+compiler/linker flags); see the Open MPI FAQ for more information.
+
+Alternatively, Open MPI also installs `pkg-config(1)` configuration
+files under `$libdir/pkgconfig`.  If `pkg-config` is configured to find
+these files, then compiling / linking Open MPI programs can be
+performed like this:
+
+```
+shell$ gcc hello_world_mpi.c -o hello_world_mpi -g \
+            `pkg-config ompi-c --cflags --libs`
+shell$
+```
+
+Open MPI supplies multiple `pkg-config(1)` configuration files; one
+for each different wrapper compiler (language):
+
+* `ompi`: Synonym for `ompi-c`; Open MPI applications using the C
+  MPI bindings
+* `ompi-c`: Open MPI applications using the C MPI bindings
+* `ompi-cxx`: Open MPI applications using the C MPI bindings
+* `ompi-fort`: Open MPI applications using the Fortran MPI bindings
+
+The following `pkg-config(1)` configuration files *may* be installed,
+depending on which command line options were specified to Open MPI's
+configure script.  They are not necessary for MPI applications, but
+may be used by applications that use Open MPI's lower layer support
+libraries.
+
+* `opal`: Open Portable Access Layer applications
+
+
+## Running Open MPI Applications
+
+Open MPI supports both `mpirun` and `mpiexec` (they are exactly
+equivalent) to launch MPI applications.  For example:
+
+```
+shell$ mpirun -np 2 hello_world_mpi
+or
+shell$ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi
+```
+
+are equivalent.
+ +The `rsh` launcher (which defaults to using `ssh`) accepts a +`--hostfile` parameter (the option `--machinefile` is equivalent); you +can specify a `--hostfile` parameter indicating a standard +`mpirun`-style hostfile (one hostname per line): + +``` +shell$ mpirun --hostfile my_hostfile -np 2 hello_world_mpi +``` + +If you intend to run more than one process on a node, the hostfile can +use the "slots" attribute. If "slots" is not specified, a count of 1 +is assumed. For example, using the following hostfile: + +``` +shell$ cat my_hostfile +node1.example.com +node2.example.com +node3.example.com slots=2 +node4.example.com slots=4 +``` + +``` +shell$ mpirun --hostfile my_hostfile -np 8 hello_world_mpi +``` + +will launch `MPI_COMM_WORLD` rank 0 on node1, rank 1 on node2, ranks 2 +and 3 on node3, and ranks 4 through 7 on node4. + +Other starters, such as the resource manager / batch scheduling +environments, do not require hostfiles (and will ignore the hostfile +if it is supplied). They will also launch as many processes as slots +have been allocated by the scheduler if no "-np" argument has been +provided. For example, running a SLURM job with 8 processors: + +``` +shell$ salloc -n 8 mpirun a.out +``` + +The above command will reserve 8 processors and run 1 copy of mpirun, +which will, in turn, launch 8 copies of a.out in a single +`MPI_COMM_WORLD` on the processors that were allocated by SLURM. + +Note that the values of component parameters can be changed on the +`mpirun` / `mpiexec` command line. This is explained in the section +below, "The Modular Component Architecture (MCA)". + +Open MPI supports `oshrun` to launch OpenSHMEM applications. For +example: + +``` +shell$ oshrun -np 2 hello_world_oshmem +``` + +OpenSHMEM applications may also be launched directly by resource +managers such as SLURM. For example, when OMPI is configured +`--with-pmix` and `--with-slurm`, one may launch OpenSHMEM applications +via `srun`: + +``` +shell$ srun -N 2 hello_world_oshmem +``` + +## The Modular Component Architecture (MCA) + +The MCA is the backbone of Open MPI -- most services and functionality +are implemented through MCA components. 
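+
+Before looking at the individual frameworks listed below, note that
+`ompi_info` will show which components of any given framework are
+present in your installation.  For example, to list the available BTL
+components (the exact set will vary with how your copy of Open MPI was
+configured and built):
+
+```
+shell$ ompi_info | grep "MCA btl"
+```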
+ +### MPI layer frameworks + +Here is a list of all the component frameworks in the MPI layer of +Open MPI: + +* `bml`: BTL management layer +* `coll`: MPI collective algorithms +* `fbtl`: file byte transfer layer: abstraction for individual + read: collective read and write operations for MPI I/O +* `fs`: file system functions for MPI I/O +* `io`: MPI I/O +* `mtl`: Matching transport layer, used for MPI point-to-point + messages on some types of networks +* `op`: Back end computations for intrinsic MPI_Op operators +* `osc`: MPI one-sided communications +* `pml`: MPI point-to-point management layer +* `rte`: Run-time environment operations +* `sharedfp`: shared file pointer operations for MPI I/O +* `topo`: MPI topology routines +* `vprotocol`: Protocols for the "v" PML + +### OpenSHMEM component frameworks + +* `atomic`: OpenSHMEM atomic operations +* `memheap`: OpenSHMEM memory allocators that support the + PGAS memory model +* `scoll`: OpenSHMEM collective operations +* `spml`: OpenSHMEM "pml-like" layer: supports one-sided, + point-to-point operations +* `sshmem`: OpenSHMEM shared memory backing facility + +### Back-end run-time environment (RTE) component frameworks: + +* `dfs`: Distributed file system +* `errmgr`: RTE error manager +* `ess`: RTE environment-specific services +* `filem`: Remote file management +* `grpcomm`: RTE group communications +* `iof`: I/O forwarding +* `notifier`: System-level notification support +* `odls`: OpenRTE daemon local launch subsystem +* `oob`: Out of band messaging +* `plm`: Process lifecycle management +* `ras`: Resource allocation system +* `rmaps`: Resource mapping system +* `rml`: RTE message layer +* `routed`: Routing table for the RML +* `rtc`: Run-time control framework +* `schizo`: OpenRTE personality framework +* `state`: RTE state machine + +### Miscellaneous frameworks: + +* `allocator`: Memory allocator +* `backtrace`: Debugging call stack backtrace support +* `btl`: Point-to-point Byte Transfer Layer +* `dl`: Dynamic loading library interface +* `event`: Event library (libevent) versioning support +* `hwloc`: Hardware locality (hwloc) versioning support +* `if`: OS IP interface support +* `installdirs`: Installation directory relocation services +* `memchecker`: Run-time memory checking +* `memcpy`: Memory copy support +* `memory`: Memory management hooks +* `mpool`: Memory pooling +* `patcher`: Symbol patcher hooks +* `pmix`: Process management interface (exascale) +* `pstat`: Process status +* `rcache`: Memory registration cache +* `sec`: Security framework +* `shmem`: Shared memory support (NOT related to OpenSHMEM) +* `timer`: High-resolution timers + +### Framework notes + +Each framework typically has one or more components that are used at +run-time. For example, the `btl` framework is used by the MPI layer +to send bytes across different types underlying networks. The `tcp` +`btl`, for example, sends messages across TCP-based networks; the +`ucx` `pml` sends messages across InfiniBand-based networks. + +Each component typically has some tunable parameters that can be +changed at run-time. Use the `ompi_info` command to check a component +to see what its tunable parameters are. For example: + +``` +shell$ ompi_info --param btl tcp +``` + +shows some of the parameters (and default values) for the `tcp` `btl` +component (use `--level` to show *all* the parameters; see below). + +Note that `ompi_info` only shows a small number a component's MCA +parameters by default. 
Each MCA parameter has a "level" value from 1 +to 9, corresponding to the MPI-3 MPI_T tool interface levels. In Open +MPI, we have interpreted these nine levels as three groups of three: + +1. End user / basic +1. End user / detailed +1. End user / all +1. Application tuner / basic +1. Application tuner / detailed +1. Application tuner / all +1. MPI/OpenSHMEM developer / basic +1. MPI/OpenSHMEM developer / detailed +1. MPI/OpenSHMEM developer / all + +Here's how the three sub-groups are defined: + +1. End user: Generally, these are parameters that are required for + correctness, meaning that someone may need to set these just to + get their MPI/OpenSHMEM application to run correctly. +1. Application tuner: Generally, these are parameters that can be + used to tweak MPI application performance. +1. MPI/OpenSHMEM developer: Parameters that either don't fit in the + other two, or are specifically intended for debugging / + development of Open MPI itself. + +Each sub-group is broken down into three classifications: + +1. Basic: For parameters that everyone in this category will want to + see. +1. Detailed: Parameters that are useful, but you probably won't need + to change them often. +1. All: All other parameters -- probably including some fairly + esoteric parameters. + +To see *all* available parameters for a given component, specify that +ompi_info should use level 9: + +``` +shell$ ompi_info --param btl tcp --level 9 +``` + +These values can be overridden at run-time in several ways. At +run-time, the following locations are examined (in order) for new +values of parameters: + +1. `PREFIX/etc/openmpi-mca-params.conf`: + This file is intended to set any system-wide default MCA parameter + values -- it will apply, by default, to all users who use this Open + MPI installation. The default file that is installed contains many + comments explaining its format. + +1. `$HOME/.openmpi/mca-params.conf`: + If this file exists, it should be in the same format as + `PREFIX/etc/openmpi-mca-params.conf`. It is intended to provide + per-user default parameter values. + +1. environment variables of the form `OMPI_MCA_` set equal to a + `VALUE`: + + Where `` is the name of the parameter. For example, set the + variable named `OMPI_MCA_btl_tcp_frag_size` to the value 65536 + (Bourne-style shells): + + ``` + shell$ OMPI_MCA_btl_tcp_frag_size=65536 + shell$ export OMPI_MCA_btl_tcp_frag_size + ``` + +4. the `mpirun`/`oshrun` command line: `--mca NAME VALUE` + + Where is the name of the parameter. For example: + + ``` + shell$ mpirun --mca btl_tcp_frag_size 65536 -np 2 hello_world_mpi + ``` + +These locations are checked in order. For example, a parameter value +passed on the `mpirun` command line will override an environment +variable; an environment variable will override the system-wide +defaults. + +Each component typically activates itself when relevant. For example, +the usNIC component will detect that usNIC devices are present and +will automatically be used for MPI communications. The SLURM +component will automatically detect when running inside a SLURM job +and activate itself. And so on. + +Components can be manually activated or deactivated if necessary, of +course. The most common components that are manually activated, +deactivated, or tuned are the `btl` components -- components that are +used for MPI point-to-point communications on many types common +networks. 
+ +For example, to *only* activate the `tcp` and `self` (process loopback) +components are used for MPI communications, specify them in a +comma-delimited list to the `btl` MCA parameter: + +``` +shell$ mpirun --mca btl tcp,self hello_world_mpi +``` + +To add shared memory support, add `sm` into the command-delimited list +(list order does not matter): + +``` +shell$ mpirun --mca btl tcp,sm,self hello_world_mpi +``` + +(there used to be a `vader` BTL for shared memory support; it was +renamed to `sm` in Open MPI v5.0.0, but the alias `vader` still works +as well) + +To specifically deactivate a specific component, the comma-delimited +list can be prepended with a `^` to negate it: + +``` +shell$ mpirun --mca btl ^tcp hello_mpi_world +``` + +The above command will use any other `btl` component other than the +`tcp` component. + + +## Questions? Problems? + +Found a bug? Got a question? Want to make a suggestion? Want to +contribute to Open MPI? Please let us know! + +When submitting questions and problems, be sure to include as much +extra information as possible. [See the community help web +page](https://www.open-mpi.org/community/help/) for details on all the +information that we request in order to provide assistance: + +The best way to report bugs, send comments, or ask questions is to +sign up on the user's and/or developer's mailing list (for user-level +and developer-level questions; when in doubt, send to the user's +list): + +* users@lists.open-mpi.org +* devel@lists.open-mpi.org + +Because of spam, only subscribers are allowed to post to these lists +(ensure that you subscribe with and post from exactly the same e-mail +address -- joe@example.com is considered different than +joe@mycomputer.example.com!). Visit these pages to subscribe to the +lists: + +* [Subscribe to the users mailing + list](https://lists.open-mpi.org/mailman/listinfo/users) +* [Subscribe to the developers mailing + list](https://lists.open-mpi.org/mailman/listinfo/devel) + +Make today an Open MPI day! diff --git a/contrib/Makefile.am b/contrib/Makefile.am index 8783d6fb60..2f43ef065f 100644 --- a/contrib/Makefile.am +++ b/contrib/Makefile.am @@ -64,7 +64,7 @@ EXTRA_DIST = \ platform/lanl/cray_xc_cle5.2/optimized-common \ platform/lanl/cray_xc_cle5.2/optimized-lustre \ platform/lanl/cray_xc_cle5.2/optimized-lustre.conf \ - platform/lanl/toss/README \ + platform/lanl/toss/README.md \ platform/lanl/toss/common \ platform/lanl/toss/common-optimized \ platform/lanl/toss/cray-lustre-optimized \ diff --git a/contrib/build-mca-comps-outside-of-tree/README.txt b/contrib/build-mca-comps-outside-of-tree/README.md similarity index 52% rename from contrib/build-mca-comps-outside-of-tree/README.txt rename to contrib/build-mca-comps-outside-of-tree/README.md index b359239661..f25a2a36af 100644 --- a/contrib/build-mca-comps-outside-of-tree/README.txt +++ b/contrib/build-mca-comps-outside-of-tree/README.md @@ -1,121 +1,108 @@ +# Description + 2 Feb 2011 -Description -=========== - -This sample "tcp2" BTL component is a simple example of how to build +This sample `tcp2` BTL component is a simple example of how to build an Open MPI MCA component from outside of the Open MPI source tree. This is a valuable technique for 3rd parties who want to provide their own components for Open MPI, but do not want to be in the mainstream distribution (i.e., their code is not part of the main Open MPI code base). 
-NOTE: We do recommend that 3rd party developers investigate using a - DVCS such as Mercurial or Git to keep up with Open MPI - development. Using a DVCS allows you to host your component in - your own copy of the Open MPI source tree, and yet still keep up - with development changes, stable releases, etc. - Previous colloquial knowledge held that building a component from outside of the Open MPI source tree required configuring Open MPI ---with-devel-headers, and then building and installing it. This -configure switch installs all of OMPI's internal .h files under -$prefix/include/openmpi, and therefore allows 3rd party code to be +`--with-devel-headers`, and then building and installing it. This +configure switch installs all of OMPI's internal `.h` files under +`$prefix/include/openmpi`, and therefore allows 3rd party code to be compiled outside of the Open MPI tree. This method definitely works, but is annoying: - * You have to ask users to use this special configure switch. - * Not all users install from source; many get binary packages (e.g., - RPMs). +* You have to ask users to use this special configure switch. +* Not all users install from source; many get binary packages (e.g., + RPMs). This example package shows two ways to build an Open MPI MCA component from outside the Open MPI source tree: - 1. Using the above --with-devel-headers technique - 2. Compiling against the Open MPI source tree itself (vs. the - installation tree) +1. Using the above `--with-devel-headers` technique +2. Compiling against the Open MPI source tree itself (vs. the + installation tree) The user still has to have a source tree, but at least they don't have -to be required to use --with-devel-headers (which most users don't) -- +to be required to use `--with-devel-headers` (which most users don't) -- they can likely build off the source tree that they already used. -Example project contents -======================== +# Example project contents -The "tcp2" component is a direct copy of the TCP BTL as of January +The `tcp2` component is a direct copy of the TCP BTL as of January 2011 -- it has just been renamed so that it can be built separately and installed alongside the real TCP BTL component. Most of the mojo for both methods is handled in the example -components' configure.ac, but the same techniques are applicable +components' `configure.ac`, but the same techniques are applicable outside of the GNU Auto toolchain. -This sample "tcp2" component has an autogen.sh script that requires +This sample `tcp2` component has an `autogen.sh` script that requires the normal Autoconf, Automake, and Libtool. It also adds the following two configure switches: - --with-openmpi-install=DIR +1. `--with-openmpi-install=DIR`: + If provided, `DIR` is an Open MPI installation tree that was + installed `--with-devel-headers`. - If provided, DIR is an Open MPI installation tree that was - installed --with-devel-headers. - - This switch uses the installed mpicc --showme: functionality - to extract the relevant CPPFLAGS, LDFLAGS, and LIBS. - - --with-openmpi-source=DIR - - If provided, DIR is the source of a configured and built Open MPI + This switch uses the installed `mpicc --showme:` functionality + to extract the relevant `CPPFLAGS`, `LDFLAGS`, and `LIBS`. +1. `--with-openmpi-source=DIR`: + If provided, `DIR` is the source of a configured and built Open MPI source tree (corresponding to the version expected by the example component). The source tree is not required to have been - configured --with-devel-headers. 
+ configured `--with-devel-headers`. - This switch uses the source tree's config.status script to extract - the relevant CPPFLAGS and CFLAGS. + This switch uses the source tree's `config.status` script to + extract the relevant `CPPFLAGS` and `CFLAGS`. Either one of these two switches must be provided, or appropriate -CPPFLAGS, CFLAGS, LDFLAGS, and/or LIBS must be provided such that -valid Open MPI header and library files can be found and compiled / -linked against, respectively. +`CPPFLAGS`, `CFLAGS`, `LDFLAGS`, and/or `LIBS` must be provided such +that valid Open MPI header and library files can be found and compiled +/ linked against, respectively. -Example use -=========== +# Example use First, download, build, and install Open MPI: ------ +``` $ cd $HOME -$ wget \ - https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2 - [lots of output] +$ wget https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2 +[...lots of output...] $ tar jxf openmpi-X.Y.Z.tar.bz2 $ cd openmpi-X.Y.Z $ ./configure --prefix=/opt/openmpi ... - [lots of output] +[...lots of output...] $ make -j 4 install - [lots of output] +[...lots of output...] $ /opt/openmpi/bin/ompi_info | grep btl MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z) MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z) MCA btl: tcp (MCA vA.B, API vM.N, Component vX.Y.Z) [where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI] $ ------ +``` -Notice the installed BTLs from ompi_info. +Notice the installed BTLs from `ompi_info`. -Now cd into this example project and build it, pointing it to the +Now `cd` into this example project and build it, pointing it to the source directory of the Open MPI that you just built. Note that we -use the same --prefix as when installing Open MPI (so that the built +use the same `--prefix` as when installing Open MPI (so that the built component will be installed into the Right place): ------ +``` $ cd /path/to/this/sample $ ./autogen.sh $ ./configure --prefix=/opt/openmpi --with-openmpi-source=$HOME/openmpi-X.Y.Z - [lots of output] +[...lots of output...] $ make -j 4 install - [lots of output] +[...lots of output...] $ /opt/openmpi/bin/ompi_info | grep btl MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z) MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z) @@ -123,12 +110,11 @@ $ /opt/openmpi/bin/ompi_info | grep btl MCA btl: tcp2 (MCA vA.B, API vM.N, Component vX.Y.Z) [where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI] $ ------ +``` -Notice that the "tcp2" BTL is now installed. +Notice that the `tcp2` BTL is now installed. -Random notes -============ +# Random notes The component in this project is just an example; I whipped it up in the span of several hours. Your component may be a bit more complex @@ -139,17 +125,15 @@ what you need. Changes required to the component to make it build in a standalone mode: -1. Write your own configure script. This component is just a sample. - You basically need to build against an OMPI install that was - installed --with-devel-headers or a built OMPI source tree. See - ./configure --help for details. - -2. I also provided a bogus btl_tcp2_config.h (generated by configure). - This file is not included anywhere, but it does provide protection - against re-defined PACKAGE_* macros when running configure, which - is quite annoying. - -3. Modify Makefile.am to only build DSOs. I.e., you can optionally +1. Write your own `configure` script. This component is just a + sample. 
You basically need to build against an OMPI install that + was installed `--with-devel-headers` or a built OMPI source tree. + See `./configure --help` for details. +1. I also provided a bogus `btl_tcp2_config.h` (generated by + `configure`). This file is not included anywhere, but it does + provide protection against re-defined `PACKAGE_*` macros when + running `configure`, which is quite annoying. +1. Modify `Makefile.am` to only build DSOs. I.e., you can optionally take the static option out since the component can *only* build in DSO mode when building standalone. That being said, it doesn't hurt to leave the static builds in -- this would (hypothetically) diff --git a/contrib/dist/linux/README b/contrib/dist/linux/README deleted file mode 100644 index f9a3aa8841..0000000000 --- a/contrib/dist/linux/README +++ /dev/null @@ -1,105 +0,0 @@ -Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana - University Research and Technology - Corporation. All rights reserved. -Copyright (c) 2004-2006 The University of Tennessee and The University - of Tennessee Research Foundation. All rights - reserved. -Copyright (c) 2004-2006 High Performance Computing Center Stuttgart, - University of Stuttgart. All rights reserved. -Copyright (c) 2004-2006 The Regents of the University of California. - All rights reserved. -Copyright (c) 2006-2016 Cisco Systems, Inc. All rights reserved. -$COPYRIGHT$ - -Additional copyrights may follow - -$HEADER$ - -=========================================================================== - -Note that you probably want to download the latest release of the SRPM -for any given Open MPI version. The SRPM release number is the -version after the dash in the SRPM filename. For example, -"openmpi-1.6.3-2.src.rpm" is the 2nd release of the SRPM for Open MPI -v1.6.3. Subsequent releases of SRPMs typically contain bug fixes for -the RPM packaging, but not Open MPI itself. - -The buildrpm.sh script takes a single mandatory argument -- a filename -pointing to an Open MPI tarball (may be either .gz or .bz2). It will -create one or more RPMs from this tarball: - -1. Source RPM -2. "All in one" RPM, where all of Open MPI is put into a single RPM. -3. "Multiple" RPM, where Open MPI is split into several sub-package - RPMs: - - openmpi-runtime - - openmpi-devel - - openmpi-docs - -The folowing arguments could be used to affect script behaviour. -Please, do NOT set the same settings with parameters and config vars. - --b - If you specify this option, only the all-in-one binary RPM will - be built. By default, only the source RPM (SRPM) is built. Other - parameters that affect the all-in-one binary RPM will be ignored - unless this option is specified. - --n name - This option will change the name of the produced RPM to the "name". - It is useful to use with "-o" and "-m" options if you want to have - multiple Open MPI versions installed simultaneously in the same - enviroment. Requires use of option "-b". - --o - With this option the install path of the binary RPM will be changed - to /opt/_NAME_/_VERSION_. Requires use of option "-b". - --m - This option causes the RPM to also install modulefiles - to the location specified in the specfile. Requires use of option "-b". - --i - Also build a debuginfo RPM. By default, the debuginfo RPM is not built. - Requires use of option "-b". - --f lf_location - Include support for Libfabric. "lf_location" is Libfabric install - path. Requires use of option "-b". - --t tm_location - Include support for Torque/PBS Pro. 
"tm_location" is path of the - Torque/PBS Pro header files. Requires use of option "-b". - --d - Build with debugging support. By default, - the RPM is built without debugging support. - --c parameter - Add custom configure parameter. - --r parameter - Add custom RPM build parameter. - --s - If specified, the script will try to unpack the openmpi.spec - file from the tarball specified on the command line. By default, - the script will look for the specfile in the current directory. - --R directory - Specifies the top level RPM build direcotry. - --h - Prints script usage information. - - -Target architecture is currently hard-coded in the beginning -of the buildrpm.sh script. - -Alternatively, you can build directly from the openmpi.spec spec file -or SRPM directly. Many options can be passed to the building process -via rpmbuild's --define option (there are older versions of rpmbuild -that do not seem to handle --define'd values properly in all cases, -but we generally don't care about those old versions of rpmbuild...). -The available options are described in the comments in the beginning -of the spec file in this directory. diff --git a/contrib/dist/linux/README.md b/contrib/dist/linux/README.md new file mode 100644 index 0000000000..65aae3c3c2 --- /dev/null +++ b/contrib/dist/linux/README.md @@ -0,0 +1,88 @@ +# Open MPI Linux distribution helpers + +Note that you probably want to download the latest release of the SRPM +for any given Open MPI version. The SRPM release number is the +version after the dash in the SRPM filename. For example, +`openmpi-1.6.3-2.src.rpm` is the 2nd release of the SRPM for Open MPI +v1.6.3. Subsequent releases of SRPMs typically contain bug fixes for +the RPM packaging, but not Open MPI itself. + +The `buildrpm.sh` script takes a single mandatory argument -- a +filename pointing to an Open MPI tarball (may be either `.gz` or +`.bz2`). It will create one or more RPMs from this tarball: + +1. Source RPM +1. "All in one" RPM, where all of Open MPI is put into a single RPM. +1. "Multiple" RPM, where Open MPI is split into several sub-package + RPMs: + * `openmpi-runtime` + * `openmpi-devel` + * `openmpi-docs` + +The folowing arguments could be used to affect script behaviour. +Please, do NOT set the same settings with parameters and config vars. + +* `-b`: + If you specify this option, only the all-in-one binary RPM will + be built. By default, only the source RPM (SRPM) is built. Other + parameters that affect the all-in-one binary RPM will be ignored + unless this option is specified. + +* `-n name`: + This option will change the name of the produced RPM to the "name". + It is useful to use with "-o" and "-m" options if you want to have + multiple Open MPI versions installed simultaneously in the same + enviroment. Requires use of option `-b`. + +* `-o`: + With this option the install path of the binary RPM will be changed + to `/opt/_NAME_/_VERSION_`. Requires use of option `-b`. + +* `-m`: + This option causes the RPM to also install modulefiles + to the location specified in the specfile. Requires use of option `-b`. + +* `-i`: + Also build a debuginfo RPM. By default, the debuginfo RPM is not built. + Requires use of option `-b`. + +* `-f lf_location`: + Include support for Libfabric. "lf_location" is Libfabric install + path. Requires use of option `-b`. + +* `-t tm_location`: + Include support for Torque/PBS Pro. "tm_location" is path of the + Torque/PBS Pro header files. Requires use of option `-b`. + +* `-d`: + Build with debugging support. 
By default, + the RPM is built without debugging support. + +* `-c parameter`: + Add custom configure parameter. + +* `-r parameter`: + Add custom RPM build parameter. + +* `-s`: + If specified, the script will try to unpack the openmpi.spec + file from the tarball specified on the command line. By default, + the script will look for the specfile in the current directory. + +* `-R directory`: + Specifies the top level RPM build direcotry. + +* `-h`: + Prints script usage information. + + +Target architecture is currently hard-coded in the beginning +of the `buildrpm.sh` script. + +Alternatively, you can build directly from the `openmpi.spec` spec +file or SRPM directly. Many options can be passed to the building +process via `rpmbuild`'s `--define` option (there are older versions +of `rpmbuild` that do not seem to handle `--define`'d values properly +in all cases, but we generally don't care about those old versions of +`rpmbuild`...). The available options are described in the comments +in the beginning of the spec file in this directory. diff --git a/contrib/platform/lanl/toss/README b/contrib/platform/lanl/toss/README.md similarity index 99% rename from contrib/platform/lanl/toss/README rename to contrib/platform/lanl/toss/README.md index 9a198d2531..7f2ada9c57 100644 --- a/contrib/platform/lanl/toss/README +++ b/contrib/platform/lanl/toss/README.md @@ -61,7 +61,7 @@ created. - copy of toss3-hfi-optimized.conf with the following changes: - change: comment "Add the interface for out-of-band communication and set it up" to "Set up the interface for out-of-band communication" - - remove: oob_tcp_if_exclude = ib0 + - remove: oob_tcp_if_exclude = ib0 - remove: btl (let Open MPI figure out what best to use for ethernet- connected hardware) - remove: btl_openib_want_fork_support (no infiniband) diff --git a/examples/Makefile.include b/examples/Makefile.include index ef3616568e..92afc1175b 100644 --- a/examples/Makefile.include +++ b/examples/Makefile.include @@ -33,7 +33,7 @@ # Automake). EXTRA_DIST += \ - examples/README \ + examples/README.md \ examples/Makefile \ examples/hello_c.c \ examples/hello_mpifh.f \ diff --git a/examples/README b/examples/README deleted file mode 100644 index a4c9b5d5f7..0000000000 --- a/examples/README +++ /dev/null @@ -1,67 +0,0 @@ -Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana - University Research and Technology - Corporation. All rights reserved. -Copyright (c) 2006-2012 Cisco Systems, Inc. All rights reserved. -Copyright (c) 2007-2009 Sun Microsystems, Inc. All rights reserved. -Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved. -Copyright (c) 2013 Mellanox Technologies, Inc. All rights reserved. - -$COPYRIGHT$ - -The files in this directory are sample MPI applications provided both -as a trivial primer to MPI as well as simple tests to ensure that your -Open MPI installation is working properly. - -If you are looking for a comprehensive MPI tutorial, these samples are -not enough. Excellent MPI tutorials are available here: - - http://www.citutor.org/login.php - -Get a free account and login; you can then browse to the list of -available courses. Look for the ones with "MPI" in the title. 
- -There are two MPI examples in this directory, each using one of six -different MPI interfaces: - -- Hello world - C: hello_c.c - C++: hello_cxx.cc - Fortran mpif.h: hello_mpifh.f - Fortran use mpi: hello_usempi.f90 - Fortran use mpi_f08: hello_usempif08.f90 - Java: Hello.java - C shmem.h: hello_oshmem_c.c - Fortran shmem.fh: hello_oshmemfh.f90 - -- Send a trivial message around in a ring - C: ring_c.c - C++: ring_cxx.cc - Fortran mpif.h: ring_mpifh.f - Fortran use mpi: ring_usempi.f90 - Fortran use mpi_f08: ring_usempif08.f90 - Java: Ring.java - C shmem.h: ring_oshmem_c.c - Fortran shmem.fh: ring_oshmemfh.f90 - -Additionally, there's one further example application, but this one -only uses the MPI C bindings: - -- Test the connectivity between all processes - C: connectivity_c.c - -The Makefile in this directory will build as many of the examples as -you have language support (e.g., if you do not have the Fortran "use -mpi" bindings compiled as part of Open MPI, the those examples will be -skipped). - -The Makefile assumes that the wrapper compilers mpicc, mpic++, and -mpifort are in your path. - -Although the Makefile is tailored for Open MPI (e.g., it checks the -"ompi_info" command to see if you have support for C++, mpif.h, use -mpi, and use mpi_f08 F90), all of the example programs are pure MPI, -and therefore not specific to Open MPI. Hence, you can use a -different MPI implementation to compile and run these programs if you -wish. - -Make today an Open MPI day! diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 0000000000..ad924d2e79 --- /dev/null +++ b/examples/README.md @@ -0,0 +1,66 @@ +The files in this directory are sample MPI applications provided both +as a trivial primer to MPI as well as simple tests to ensure that your +Open MPI installation is working properly. + +If you are looking for a comprehensive MPI tutorial, these samples are +not enough. [Excellent MPI tutorials are available +here](http://www.citutor.org/login.php). + +Get a free account and login; you can then browse to the list of +available courses. Look for the ones with "MPI" in the title. + +There are two MPI examples in this directory, each using one of six +different MPI interfaces: + +## Hello world + +The MPI version of the canonical "hello world" program: + +* C: `hello_c.c` +* C++: `hello_cxx.cc` +* Fortran mpif.h: `hello_mpifh.f` +* Fortran use mpi: `hello_usempi.f90` +* Fortran use mpi_f08: `hello_usempif08.f90` +* Java: `Hello.java` +* C shmem.h: `hello_oshmem_c.c` +* Fortran shmem.fh: `hello_oshmemfh.f90` + +## Ring + +Send a trivial message around in a ring: + +* C: `ring_c.c` +* C++: `ring_cxx.cc` +* Fortran mpif.h: `ring_mpifh.f` +* Fortran use mpi: `ring_usempi.f90` +* Fortran use mpi_f08: `ring_usempif08.f90` +* Java: `Ring.java` +* C shmem.h: `ring_oshmem_c.c` +* Fortran shmem.fh: `ring_oshmemfh.f90` + +## Connectivity Test + +Additionally, there's one further example application, but this one +only uses the MPI C bindings to test the connectivity between all +processes: + +* C: `connectivity_c.c` + +## Makefile + +The `Makefile` in this directory will build as many of the examples as +you have language support (e.g., if you do not have the Fortran `use +mpi` bindings compiled as part of Open MPI, the those examples will be +skipped). + +The `Makefile` assumes that the wrapper compilers `mpicc`, `mpic++`, and +`mpifort` are in your path. 
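+
+For example, building and running the C "hello world" program by hand
+(a minimal sketch; it assumes the Open MPI wrapper compilers are in
+your `PATH`, and the output will vary with the number of processes)
+looks like this:
+
+```
+shell$ mpicc hello_c.c -o hello_c
+shell$ mpirun -np 4 ./hello_c
+```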
+ +Although the `Makefile` is tailored for Open MPI (e.g., it checks the +`ompi_info` command to see if you have support for `mpif.h`, the `mpi` +module, and the `use mpi_f08` module), all of the example programs are +pure MPI, and therefore not specific to Open MPI. Hence, you can use +a different MPI implementation to compile and run these programs if +you wish. + +Make today an Open MPI day! diff --git a/ompi/contrib/README.md b/ompi/contrib/README.md new file mode 100644 index 0000000000..bcaae6d43d --- /dev/null +++ b/ompi/contrib/README.md @@ -0,0 +1,19 @@ +This is the OMPI contrib system. It is (far) less functional and +flexible than the OMPI MCA framework/component system. + +Each contrib package must have a `configure.m4`. It may optionally also +have an `autogen.subdirs` file. + +If it has a `configure.m4` file, it must specify its own relevant +files to `AC_CONFIG_FILES` to create during `AC_OUTPUT` -- just like +MCA components (at a minimum, usually its own `Makefile`). The +`configure.m4` file will be slurped up into the main `configure` +script, just like other MCA components. Note that there is currently +no "no configure" option for contrib packages -- you *must* have a +`configure.m4` (even if all it does it call `$1`). Feel free to fix +this situation if you want -- it probably won't not be too difficult +to extend `autogen.pl` to support this scenario, similar to how it is +done for MCA components. :smile: + +If it has an `autogen.subdirs` file, then it needs to be a +subdirectory that is autogen-able. diff --git a/ompi/contrib/README.txt b/ompi/contrib/README.txt deleted file mode 100644 index ba38c11da9..0000000000 --- a/ompi/contrib/README.txt +++ /dev/null @@ -1,19 +0,0 @@ -This is the OMPI contrib system. It is (far) less functional and -flexible than the OMPI MCA framework/component system. - -Each contrib package must have a configure.m4. It may optionally also -have an autogen.subdirs file. - -If it has a configure.m4 file, it must specify its own relevant files -to AC_CONFIG_FILES to create during AC_OUTPUT -- just like MCA -components (at a minimum, usually its own Makefile). The configure.m4 -file will be slurped up into the main configure script, just like -other MCA components. Note that there is currently no "no configure" -option for contrib packages -- you *must* have a configure.m4 (even if -all it does it call $1). Feel free to fix this situation if you want --- it probably won't not be too difficult to extend autogen.pl to -support this scenario, similar to how it is done for MCA components. -:-) - -If it has an autogen.subdirs file, then it needs to be a subdirectory -that is autogen-able. diff --git a/ompi/mca/common/monitoring/Makefile.am b/ompi/mca/common/monitoring/Makefile.am index 1812245cde..942c218c98 100644 --- a/ompi/mca/common/monitoring/Makefile.am +++ b/ompi/mca/common/monitoring/Makefile.am @@ -13,7 +13,7 @@ # $HEADER$ # -EXTRA_DIST = profile2mat.pl aggregate_profile.pl +EXTRA_DIST = profile2mat.pl aggregate_profile.pl README.md sources = common_monitoring.c common_monitoring_coll.c headers = common_monitoring.h common_monitoring_coll.h diff --git a/ompi/mca/common/monitoring/README b/ompi/mca/common/monitoring/README deleted file mode 100644 index 8361027d65..0000000000 --- a/ompi/mca/common/monitoring/README +++ /dev/null @@ -1,181 +0,0 @@ - - Copyright (c) 2013-2015 The University of Tennessee and The University - of Tennessee Research Foundation. All rights - reserved. - Copyright (c) 2013-2015 Inria. All rights reserved. 
- $COPYRIGHT$ - - Additional copyrights may follow - - $HEADER$ - -=========================================================================== - -Low level communication monitoring interface in Open MPI - -Introduction ------------- -This interface traces and monitors all messages sent by MPI before they go to the -communication channels. At that levels all communication are point-to-point communications: -collectives are already decomposed in send and receive calls. - -The monitoring is stored internally by each process and output on stderr at the end of the -application (during MPI_Finalize()). - - -Enabling the monitoring ------------------------ -To enable the monitoring add --mca pml_monitoring_enable x to the mpirun command line. -If x = 1 it monitors internal and external tags indifferently and aggregate everything. -If x = 2 it monitors internal tags and external tags separately. -If x = 0 the monitoring is disabled. -Other value of x are not supported. - -Internal tags are tags < 0. They are used to tag send and receive coming from -collective operations or from protocol communications - -External tags are tags >=0. They are used by the application in point-to-point communication. - -Therefore, distinguishing external and internal tags help to distinguish between point-to-point -and other communication (mainly collectives). - -Output format -------------- -The output of the monitoring looks like (with --mca pml_monitoring_enable 2): -I 0 1 108 bytes 27 msgs sent -E 0 1 1012 bytes 30 msgs sent -E 0 2 23052 bytes 61 msgs sent -I 1 2 104 bytes 26 msgs sent -I 1 3 208 bytes 52 msgs sent -E 1 0 860 bytes 24 msgs sent -E 1 3 2552 bytes 56 msgs sent -I 2 3 104 bytes 26 msgs sent -E 2 0 22804 bytes 49 msgs sent -E 2 3 860 bytes 24 msgs sent -I 3 0 104 bytes 26 msgs sent -I 3 1 204 bytes 51 msgs sent -E 3 1 2304 bytes 44 msgs sent -E 3 2 860 bytes 24 msgs sent - -Where: - - the first column distinguishes internal (I) and external (E) tags. - - the second column is the sender rank - - the third column is the receiver rank - - the fourth column is the number of bytes sent - - the last column is the number of messages. - -In this example process 0 as sent 27 messages to process 1 using point-to-point call -for 108 bytes and 30 messages with collectives and protocol related communication -for 1012 bytes to process 1. - -If the monitoring was called with --mca pml_monitoring_enable 1 everything is aggregated -under the internal tags. With te above example, you have: -I 0 1 1120 bytes 57 msgs sent -I 0 2 23052 bytes 61 msgs sent -I 1 0 860 bytes 24 msgs sent -I 1 2 104 bytes 26 msgs sent -I 1 3 2760 bytes 108 msgs sent -I 2 0 22804 bytes 49 msgs sent -I 2 3 964 bytes 50 msgs sent -I 3 0 104 bytes 26 msgs sent -I 3 1 2508 bytes 95 msgs sent -I 3 2 860 bytes 24 msgs sent - -Monitoring phases ------------------ -If one wants to monitor phases of the application, it is possible to flush the monitoring -at the application level. In this case all the monitoring since the last flush is stored -by every process in a file. - -An example of how to flush such monitoring is given in test/monitoring/monitoring_test.c - -Moreover, all the different flushed phased are aggregated at runtime and output at the end -of the application as described above. 
- -Example -------- -A working example is given in test/monitoring/monitoring_test.c -It features, MPI_COMM_WORLD monitoring , sub-communicator monitoring, collective and -point-to-point communication monitoring and phases monitoring - -To compile: -> make monitoring_test - -Helper scripts --------------- -Two perl scripts are provided in test/monitoring -- aggregate_profile.pl is for aggregating monitoring phases of different processes - This script aggregates the profiles generated by the flush_monitoring function. - The files need to be in in given format: name__ - They are then aggregated by phases. - If one needs the profile of all the phases he can concatenate the different files, - or use the output of the monitoring system done at MPI_Finalize - in the example it should be call as: - ./aggregate_profile.pl prof/phase to generate - prof/phase_1.prof - prof/phase_2.prof - -- profile2mat.pl is for transforming a the monitoring output into a communication matrix. - Take a profile file and aggregates all the recorded communicator into matrices. - It generated a matrices for the number of messages, (msg), - for the total bytes transmitted (size) and - the average number of bytes per messages (avg) - - The output matrix is symmetric - -Do not forget to enable the execution right to these scripts. - -For instance, the provided examples store phases output in ./prof - -If you type: -> mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test -you should have the following output -Proc 3 flushing monitoring to: ./prof/phase_1_3.prof -Proc 0 flushing monitoring to: ./prof/phase_1_0.prof -Proc 2 flushing monitoring to: ./prof/phase_1_2.prof -Proc 1 flushing monitoring to: ./prof/phase_1_1.prof -Proc 1 flushing monitoring to: ./prof/phase_2_1.prof -Proc 3 flushing monitoring to: ./prof/phase_2_3.prof -Proc 0 flushing monitoring to: ./prof/phase_2_0.prof -Proc 2 flushing monitoring to: ./prof/phase_2_2.prof -I 2 3 104 bytes 26 msgs sent -E 2 0 22804 bytes 49 msgs sent -E 2 3 860 bytes 24 msgs sent -I 3 0 104 bytes 26 msgs sent -I 3 1 204 bytes 51 msgs sent -E 3 1 2304 bytes 44 msgs sent -E 3 2 860 bytes 24 msgs sent -I 0 1 108 bytes 27 msgs sent -E 0 1 1012 bytes 30 msgs sent -E 0 2 23052 bytes 61 msgs sent -I 1 2 104 bytes 26 msgs sent -I 1 3 208 bytes 52 msgs sent -E 1 0 860 bytes 24 msgs sent -E 1 3 2552 bytes 56 msgs sent - -you can parse the phases with: -> /aggregate_profile.pl prof/phase -Building prof/phase_1.prof -Building prof/phase_2.prof - -And you can build the different communication matrices of phase 1 with: -> ./profile2mat.pl prof/phase_1.prof -prof/phase_1.prof -> all -prof/phase_1_size_all.mat -prof/phase_1_msg_all.mat -prof/phase_1_avg_all.mat - -prof/phase_1.prof -> external -prof/phase_1_size_external.mat -prof/phase_1_msg_external.mat -prof/phase_1_avg_external.mat - -prof/phase_1.prof -> internal -prof/phase_1_size_internal.mat -prof/phase_1_msg_internal.mat -prof/phase_1_avg_internal.mat - -Credit ------- -Designed by George Bosilca and -Emmanuel Jeannot diff --git a/ompi/mca/common/monitoring/README.md b/ompi/mca/common/monitoring/README.md new file mode 100644 index 0000000000..4f46523e6c --- /dev/null +++ b/ompi/mca/common/monitoring/README.md @@ -0,0 +1,209 @@ +# Open MPI common monitoring module + +Copyright (c) 2013-2015 The University of Tennessee and The University + of Tennessee Research Foundation. All rights + reserved. + Copyright (c) 2013-2015 Inria. All rights reserved. 
+
+Low level communication monitoring interface in Open MPI
+
+## Introduction
+
+This interface traces and monitors all messages sent by MPI before
+they go to the communication channels. At that level, all
+communications are point-to-point: collectives are already decomposed
+into send and receive calls.
+
+The monitoring is stored internally by each process and output on
+stderr at the end of the application (during `MPI_Finalize()`).
+
+
+## Enabling the monitoring
+
+To enable the monitoring, add `--mca pml_monitoring_enable x` to the
+`mpirun` command line:
+
+* If x = 1, it monitors internal and external tags indifferently and
+  aggregates everything.
+* If x = 2, it monitors internal tags and external tags separately.
+* If x = 0, the monitoring is disabled.
+* Other values of x are not supported.
+
+Internal tags are tags < 0. They are used to tag sends and receives
+coming from collective operations or from protocol communications.
+
+External tags are tags >= 0. They are used by the application in
+point-to-point communication.
+
+Therefore, distinguishing external and internal tags helps to
+distinguish between point-to-point and other communication (mainly
+collectives).
+
+## Output format
+
+The output of the monitoring looks like (with `--mca
+pml_monitoring_enable 2`):
+
+```
+I 0 1 108 bytes 27 msgs sent
+E 0 1 1012 bytes 30 msgs sent
+E 0 2 23052 bytes 61 msgs sent
+I 1 2 104 bytes 26 msgs sent
+I 1 3 208 bytes 52 msgs sent
+E 1 0 860 bytes 24 msgs sent
+E 1 3 2552 bytes 56 msgs sent
+I 2 3 104 bytes 26 msgs sent
+E 2 0 22804 bytes 49 msgs sent
+E 2 3 860 bytes 24 msgs sent
+I 3 0 104 bytes 26 msgs sent
+I 3 1 204 bytes 51 msgs sent
+E 3 1 2304 bytes 44 msgs sent
+E 3 2 860 bytes 24 msgs sent
+```
+
+Where:
+
+1. the first column distinguishes internal (I) and external (E) tags.
+1. the second column is the sender rank.
+1. the third column is the receiver rank.
+1. the fourth column is the number of bytes sent.
+1. the last column is the number of messages.
+
+In this example, process 0 has sent 27 messages to process 1 using
+point-to-point calls for 108 bytes, and 30 messages with collectives
+and protocol-related communication for 1012 bytes.
+
+If the monitoring was called with `--mca pml_monitoring_enable 1`,
+everything is aggregated under the internal tags. With the above
+example, you have:
+
+```
+I 0 1 1120 bytes 57 msgs sent
+I 0 2 23052 bytes 61 msgs sent
+I 1 0 860 bytes 24 msgs sent
+I 1 2 104 bytes 26 msgs sent
+I 1 3 2760 bytes 108 msgs sent
+I 2 0 22804 bytes 49 msgs sent
+I 2 3 964 bytes 50 msgs sent
+I 3 0 104 bytes 26 msgs sent
+I 3 1 2508 bytes 95 msgs sent
+I 3 2 860 bytes 24 msgs sent
+```
+
+## Monitoring phases
+
+If one wants to monitor phases of the application, it is possible to
+flush the monitoring at the application level. In this case, all the
+monitoring since the last flush is stored by every process in a file.
+
+An example of how to flush such monitoring is given in
+`test/monitoring/monitoring_test.c`.
+
+Moreover, all the different flushed phases are aggregated at runtime
+and output at the end of the application as described above.
+
+## Example
+
+A working example is given in `test/monitoring/monitoring_test.c`. It
+features `MPI_COMM_WORLD` monitoring, sub-communicator monitoring,
+collective and point-to-point communication monitoring, and phase
+monitoring.
+
+To compile:
+
+```
+shell$ make monitoring_test
+```
+
+## Helper scripts
+
+Two perl scripts are provided in `test/monitoring`:
+
+1. 
`aggregate_profile.pl` is for aggregating monitoring phases of + different processes This script aggregates the profiles generated by + the `flush_monitoring` function. + + The files need to be in in given format: `name__` + They are then aggregated by phases. + If one needs the profile of all the phases he can concatenate the different files, + or use the output of the monitoring system done at `MPI_Finalize` + in the example it should be call as: + ``` + ./aggregate_profile.pl prof/phase to generate + prof/phase_1.prof + prof/phase_2.prof + ``` + +1. `profile2mat.pl` is for transforming a the monitoring output into a + communication matrix. Take a profile file and aggregates all the + recorded communicator into matrices. It generated a matrices for + the number of messages, (msg), for the total bytes transmitted + (size) and the average number of bytes per messages (avg) + + The output matrix is symmetric. + +For instance, the provided examples store phases output in `./prof`: + +``` +shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test +``` + +Should provide the following output: + +``` +Proc 3 flushing monitoring to: ./prof/phase_1_3.prof +Proc 0 flushing monitoring to: ./prof/phase_1_0.prof +Proc 2 flushing monitoring to: ./prof/phase_1_2.prof +Proc 1 flushing monitoring to: ./prof/phase_1_1.prof +Proc 1 flushing monitoring to: ./prof/phase_2_1.prof +Proc 3 flushing monitoring to: ./prof/phase_2_3.prof +Proc 0 flushing monitoring to: ./prof/phase_2_0.prof +Proc 2 flushing monitoring to: ./prof/phase_2_2.prof +I 2 3 104 bytes 26 msgs sent +E 2 0 22804 bytes 49 msgs sent +E 2 3 860 bytes 24 msgs sent +I 3 0 104 bytes 26 msgs sent +I 3 1 204 bytes 51 msgs sent +E 3 1 2304 bytes 44 msgs sent +E 3 2 860 bytes 24 msgs sent +I 0 1 108 bytes 27 msgs sent +E 0 1 1012 bytes 30 msgs sent +E 0 2 23052 bytes 61 msgs sent +I 1 2 104 bytes 26 msgs sent +I 1 3 208 bytes 52 msgs sent +E 1 0 860 bytes 24 msgs sent +E 1 3 2552 bytes 56 msgs sent +``` + +You can then parse the phases with: + +``` +shell$ /aggregate_profile.pl prof/phase +Building prof/phase_1.prof +Building prof/phase_2.prof +``` + +And you can build the different communication matrices of phase 1 +with: + +``` +shell$ ./profile2mat.pl prof/phase_1.prof +prof/phase_1.prof -> all +prof/phase_1_size_all.mat +prof/phase_1_msg_all.mat +prof/phase_1_avg_all.mat + +prof/phase_1.prof -> external +prof/phase_1_size_external.mat +prof/phase_1_msg_external.mat +prof/phase_1_avg_external.mat + +prof/phase_1.prof -> internal +prof/phase_1_size_internal.mat +prof/phase_1_msg_internal.mat +prof/phase_1_avg_internal.mat +``` + +## Authors + +Designed by George Bosilca and +Emmanuel Jeannot diff --git a/ompi/mca/mtl/ofi/README b/ompi/mca/mtl/ofi/README deleted file mode 100644 index 7a8a6838a7..0000000000 --- a/ompi/mca/mtl/ofi/README +++ /dev/null @@ -1,340 +0,0 @@ -OFI MTL: --------- -The OFI MTL supports Libfabric (a.k.a. Open Fabrics Interfaces OFI, -https://ofiwg.github.io/libfabric/) tagged APIs (fi_tagged(3)). At -initialization time, the MTL queries libfabric for providers supporting tag matching -(fi_getinfo(3)). Libfabric will return a list of providers that satisfy the requested -capabilities, having the most performant one at the top of the list. -The user may modify the OFI provider selection with mca parameters -mtl_ofi_provider_include or mtl_ofi_provider_exclude. - -PROGRESS: ---------- -The MTL registers a progress function to opal_progress. There is currently -no support for asynchronous progress. 
The progress function reads multiple events -from the OFI provider Completion Queue (CQ) per iteration (defaults to 100, can be -modified with the mca mtl_ofi_progress_event_cnt) and iterates until the -completion queue is drained. - -COMPLETIONS: ------------- -Each operation uses a request type ompi_mtl_ofi_request_t which includes a reference -to an operation specific completion callback, an MPI request, and a context. The -context (fi_context) is used to map completion events with MPI_requests when reading the -CQ. - -OFI TAG: --------- -MPI needs to send 96 bits of information per message (32 bits communicator id, -32 bits source rank, 32 bits MPI tag) but OFI only offers 64 bits tags. In -addition, the OFI MTL uses 2 bits of the OFI tag for the synchronous send protocol. -Therefore, there are only 62 bits available in the OFI tag for message usage. The -OFI MTL offers the mtl_ofi_tag_mode mca parameter with 4 modes to address this: - -"auto" (Default): -After the OFI provider is selected, a runtime check is performed to assess -FI_REMOTE_CQ_DATA and FI_DIRECTED_RECV support (see fi_tagged(3), fi_msg(2) -and fi_getinfo(3)). If supported, "ofi_tag_full" is used. If not supported, -fall back to "ofi_tag_1". - -"ofi_tag_1": -For providers that do not support FI_REMOTE_CQ_DATA, the OFI MTL will -trim the fields (Communicator ID, Source Rank, MPI tag) to make them fit the 62 -bits available bit in the OFI tag. There are two options available with different -number of bits for the Communicator ID and MPI tag fields. This tag distribution -offers: 12 bits for Communicator ID (max Communicator ID 4,095) subject to -provider reserved bits (see mem_tag_format below), 18 bits for Source Rank (max -Source Rank 262,143), 32 bits for MPI tag (max MPI tag is INT_MAX). - -"ofi_tag_2": -Same as 2 "ofi_tag_1" but offering a different OFI tag distribution for -applications that may require a greater number of supported Communicators at the -expense of fewer MPI tag bits. This tag distribution offers: 24 bits for -Communicator ID (max Communicator ED 16,777,215. See mem_tag_format below), 18 -bits for Source Rank (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag -524,287). - -"ofi_tag_full": -For executions that cannot accept trimming source rank or MPI tag, this mode sends -source rank for each message in the CQ DATA. The Source Rank is made available at -the remote process CQ (FI_CQ_FORMAT_TAGGED is used, see fi_cq(3)) at the completion -of the matching receive operation. Since the minimum size for FI_REMOTE_CQ_DATA -is 32 bits, the Source Rank fits with no limitations. The OFI tag is used for the -Communicator id (28 bits, max Communicator ID 268,435,455. See mem_tag_format below), -and the MPI tag (max MPI tag is INT_MAX). If this mode is selected by the user -and FI_REMOTE_CQ_DATA or FI_DIRECTED_RECV are not supported, the execution will abort. - -mem_tag_format (fi_endpoint(3)) -Some providers can reserve the higher order bits from the OFI tag for internal purposes. -This is signaled in mem_tag_format (see fi_endpoint(3)) by setting higher order bits -to zero. In such cases, the OFI MTL will reduce the number of communicator ids supported -by reducing the bits available for the communicator ID field in the OFI tag. - -SCALABLE ENDPOINTS: -------------------- -OFI MTL supports OFI Scalable Endpoints (SEP) feature as a means to improve -multi-threaded application throughput and message rate. 
Currently the feature -is designed to utilize multiple TX/RX contexts exposed by the OFI provider in -conjunction with a multi-communicator MPI application model. Therefore, new OFI -contexts are created as and when communicators are duplicated in a lazy fashion -instead of creating them all at once during init time and this approach also -favours only creating as many contexts as needed. - -1. Multi-communicator model: - With this approach, the MPI application is requried to first duplicate - the communicators it wants to use with MPI operations (ideally creating - as many communicators as the number of threads it wants to use to call - into MPI). The duplicated communicators are then used by the - corresponding threads to perform MPI operations. A possible usage - scenario could be in an MPI + OMP application as follows - (example limited to 2 ranks): - - MPI_Comm dup_comm[n]; - MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided); - for (i = 0; i < n; i++) { - MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]); - } - if (rank == 0) { -#pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n) - for (i = 0; i < n ; i++) { - MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR, - 1, MSG_TAG, dup_comm[i]); - MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR, - 1, MSG_TAG, dup_comm[i], &status); - } - } else if (rank == 1) { -#pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n) - for (i = 0; i < n ; i++) { - MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR, - 0, MSG_TAG, dup_comm[i], &status); - MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR, - 0, MSG_TAG, dup_comm[i]); - } - } - -2. MCA variables: - To utilize the feature, the following MCA variables need to be set: - mtl_ofi_enable_sep: - This MCA variable needs to be set to enable the use of Scalable Endpoints (SEP) - feature in the OFI MTL. The underlying provider is also checked to ensure the - feature is supported. If the provider chosen does not support it, user needs - to either set this variable to 0 or select a different provider which supports - the feature. - For single-threaded applications one OFI context is sufficient, so OFI SEPs - may not add benefit. - Note that mtl_ofi_thread_grouping (see below) needs to be enabled to use the - different OFI SEP contexts. Otherwise, only one context (ctxt 0) will be used. - - Default: 0 - - Command-line syntax: - "-mca mtl_ofi_enable_sep 1" - - mtl_ofi_thread_grouping: - Turn Thread Grouping feature on. This is needed to use the Multi-communicator - model explained above. This means that the OFI MTL will use the communicator - ID to decide the SEP contexts to be used by the thread. In this way, each - thread will have direct access to different OFI resources. If disabled, - only context 0 will be used. - Requires mtl_ofi_enable_sep to be set to 1. - - Default: 0 - - It is not recommended to set the MCA variable for: - - Multi-threaded MPI applications not following multi-communicator approach. - - Applications that have multiple threads using a single communicator as - it may degrade performance. - - Command-line syntax: - "-mca mtl_ofi_thread_grouping 1" - - mtl_ofi_num_ctxts: - This MCA variable allows user to set the number of OFI SEP contexts the - application expects to use. For multi-threaded applications using Thread - Grouping feature, this number should be set to the number of user threads - that will call into MPI. This variable will only have effect if - mtl_ofi_enable_sep is set to 1. 
- - Default: 1 - - Command-line syntax: - "-mca mtl_ofi_num_ctxts N" [ N: number of OFI contexts required by - application ] - -3. Notes on performance: - - OFI MTL will create as many TX/RX contexts as set by MCA mtl_ofi_num_ctxts. - The number of contexts that can be created is also limited by the underlying - provider as each provider may have different thresholds. Once the threshold - is exceeded, contexts are used in a round-robin fashion which leads to - resource sharing among threads. Therefore locks are required to guard - against race conditions. For performance, it is recommended to have - - Number of threads = Number of communicators = Number of contexts - - For example, when using PSM2 provider, the number of contexts is dictated - by the Intel Omni-Path HFI1 driver module. - - - OPAL layer allows for multiple threads to enter progress simultaneously. To - enable this feature, user needs to set MCA variable - "max_thread_in_progress". When using Thread Grouping feature, it is - recommended to set this MCA parameter to the number of threads expected to - call into MPI as it provides performance benefits. - - Command-line syntax: - "-mca opal_max_thread_in_progress N" [ N: number of threads expected to - make MPI calls ] - Default: 1 - - - For applications using a single thread with multiple communicators and MCA - variable "mtl_ofi_thread_grouping" set to 1, the MTL will use multiple - contexts, but the benefits may be negligible as only one thread is driving - progress. - -SPECIALIZED FUNCTIONS: -------------------- -To improve performance when calling message passing APIs in the OFI mtl -specialized functions are generated at compile time that eliminate all the -if conditionals that can be determined at init and don't need to be -queried again during the critical path. These functions are generated by -perl scripts during make which generate functions and symbols for every -combination of flags for each function. - -1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION: - To add a new flag to an existing specialized function for handling cases - where different OFI providers may or may not support a particular feature, - then you must follow these steps: - 1) Update the "_generic" function in mtl_ofi.h with the new flag and - the if conditionals to read the new value. - 2) Update the *.pm file corresponding to the function with the new flag in: - gen_funcs(), gen_*_function(), & gen_*_sym_init() - 3) Update mtl_ofi_opt.h with: - The new flag as #define NEW_FLAG_TYPES #NUMBER_OF_STATES - example: #define OFI_CQ_DATA 2 (only has TRUE/FALSE states) - Update the function's types with: - #define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES] - -2. ADDING A NEW FUNCTION FOR SPECIALIZATION: - To add a new function to be specialized you must - follow these steps: - 1) Create a new mtl_ofi_"function_name"_opt.pm based off opt_common/mtl_ofi_opt.pm.template - 2) Add new .pm file to generated_source_modules in Makefile.am - 3) Add .c file to generated_sources in Makefile.am named the same as the corresponding .pm file - 4) Update existing or create function in mtl_ofi.h to _generic with new flags. - 5) Update mtl_ofi_opt.h with: - a) New function types: #define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES] - b) Add new function to the struct ompi_mtl_ofi_symtable: - struct ompi_mtl_ofi_symtable { - ... 
- int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES ) - } - c) Add new symbol table init function definition: - void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table); - 6) Add calls to init the new function in the symbol table and assign the function - pointer to be used based off the flags in mtl_ofi_component.c: - ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table); - ompi_mtl_ofi.base.mtl_FUNCTION = - ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag]; - -3. EXAMPLE SPECIALIZED FILE: -The code below is an example of what is generated by the specialization -scripts for use in the OFI mtl. This code specializes the blocking -send functionality based on FI_REMOTE_CQ_DATA & OFI Scalable Endpoint support -provided by an OFI Provider. Only one function and symbol is used during -runtime based on if FI_REMOTE_CQ_DATA is supported and/or if OFI Scalable -Endpoint support is enabled. -/* - * Copyright (c) 2013-2018 Intel, Inc. All rights reserved - * - * $COPYRIGHT$ - * - * Additional copyrights may follow - * - * $HEADER$ - */ - -#include "mtl_ofi.h" - -__opal_attribute_always_inline__ static inline int -ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl, - struct ompi_communicator_t *comm, - int dest, - int tag, - struct opal_convertor_t *convertor, - mca_pml_base_send_mode_t mode) -{ - const bool OFI_CQ_DATA = false; - const bool OFI_SCEP_EPS = false; - - return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag, - convertor, mode, - OFI_CQ_DATA, OFI_SCEP_EPS); -} - -__opal_attribute_always_inline__ static inline int -ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl, - struct ompi_communicator_t *comm, - int dest, - int tag, - struct opal_convertor_t *convertor, - mca_pml_base_send_mode_t mode) -{ - const bool OFI_CQ_DATA = false; - const bool OFI_SCEP_EPS = true; - - return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag, - convertor, mode, - OFI_CQ_DATA, OFI_SCEP_EPS); -} - -__opal_attribute_always_inline__ static inline int -ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl, - struct ompi_communicator_t *comm, - int dest, - int tag, - struct opal_convertor_t *convertor, - mca_pml_base_send_mode_t mode) -{ - const bool OFI_CQ_DATA = true; - const bool OFI_SCEP_EPS = false; - - return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag, - convertor, mode, - OFI_CQ_DATA, OFI_SCEP_EPS); -} - -__opal_attribute_always_inline__ static inline int -ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl, - struct ompi_communicator_t *comm, - int dest, - int tag, - struct opal_convertor_t *convertor, - mca_pml_base_send_mode_t mode) -{ - const bool OFI_CQ_DATA = true; - const bool OFI_SCEP_EPS = true; - - return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag, - convertor, mode, - OFI_CQ_DATA, OFI_SCEP_EPS); -} - -void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table) -{ - - sym_table->ompi_mtl_ofi_send[false][false] - = ompi_mtl_ofi_send_false_false; - - - sym_table->ompi_mtl_ofi_send[false][true] - = ompi_mtl_ofi_send_false_true; - - - sym_table->ompi_mtl_ofi_send[true][false] - = ompi_mtl_ofi_send_true_false; - - - sym_table->ompi_mtl_ofi_send[true][true] - = ompi_mtl_ofi_send_true_true; - -} -### diff --git a/ompi/mca/mtl/ofi/README.md b/ompi/mca/mtl/ofi/README.md new file mode 100644 index 0000000000..109ab23cd4 --- /dev/null +++ b/ompi/mca/mtl/ofi/README.md @@ -0,0 +1,368 @@ +# Open MPI OFI MTL + +The OFI MTL supports Libfabric (a.k.a., [Open Fabrics Interfaces 
+OFI](https://ofiwg.github.io/libfabric/)) tagged APIs +(`fi_tagged(3)`). At initialization time, the MTL queries libfabric +for providers supporting tag matching (`fi_getinfo(3)`). Libfabric +will return a list of providers that satisfy the requested +capabilities, having the most performant one at the top of the list. +The user may modify the OFI provider selection with mca parameters +`mtl_ofi_provider_include` or `mtl_ofi_provider_exclude`. + +## PROGRESS + +The MTL registers a progress function to `opal_progress`. There is +currently no support for asynchronous progress. The progress function +reads multiple events from the OFI provider Completion Queue (CQ) per +iteration (defaults to 100, can be modified with the mca +`mtl_ofi_progress_event_cnt`) and iterates until the completion queue is +drained. + +## COMPLETIONS + +Each operation uses a request type `ompi_mtl_ofi_request_t` which +includes a reference to an operation specific completion callback, an +MPI request, and a context. The context (`fi_context`) is used to map +completion events with `MPI_requests` when reading the CQ. + +## OFI TAG + +MPI needs to send 96 bits of information per message (32 bits +communicator id, 32 bits source rank, 32 bits MPI tag) but OFI only +offers 64 bits tags. In addition, the OFI MTL uses 2 bits of the OFI +tag for the synchronous send protocol. Therefore, there are only 62 +bits available in the OFI tag for message usage. The OFI MTL offers +the `mtl_ofi_tag_mode` mca parameter with 4 modes to address this: + +* `auto` (Default): + After the OFI provider is selected, a runtime check is performed to + assess `FI_REMOTE_CQ_DATA` and `FI_DIRECTED_RECV` support (see + `fi_tagged(3)`, `fi_msg(2)` and `fi_getinfo(3)`). If supported, + `ofi_tag_full` is used. If not supported, fall back to `ofi_tag_1`. + +* `ofi_tag_1`: + For providers that do not support `FI_REMOTE_CQ_DATA`, the OFI MTL + will trim the fields (Communicator ID, Source Rank, MPI tag) to make + them fit the 62 bits available bit in the OFI tag. There are two + options available with different number of bits for the Communicator + ID and MPI tag fields. This tag distribution offers: 12 bits for + Communicator ID (max Communicator ID 4,095) subject to provider + reserved bits (see `mem_tag_format` below), 18 bits for Source Rank + (max Source Rank 262,143), 32 bits for MPI tag (max MPI tag is + `INT_MAX`). + +* `ofi_tag_2`: + Same as 2 `ofi_tag_1` but offering a different OFI tag distribution + for applications that may require a greater number of supported + Communicators at the expense of fewer MPI tag bits. This tag + distribution offers: 24 bits for Communicator ID (max Communicator + ED 16,777,215. See mem_tag_format below), 18 bits for Source Rank + (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag + 524,287). + +* `ofi_tag_full`: + For executions that cannot accept trimming source rank or MPI tag, + this mode sends source rank for each message in the CQ DATA. The + Source Rank is made available at the remote process CQ + (`FI_CQ_FORMAT_TAGGED` is used, see `fi_cq(3)`) at the completion of + the matching receive operation. Since the minimum size for + `FI_REMOTE_CQ_DATA` is 32 bits, the Source Rank fits with no + limitations. The OFI tag is used for the Communicator id (28 bits, + max Communicator ID 268,435,455. See `mem_tag_format` below), and + the MPI tag (max MPI tag is `INT_MAX`). If this mode is selected by + the user and `FI_REMOTE_CQ_DATA` or `FI_DIRECTED_RECV` are not + supported, the execution will abort. 
+ +* `mem_tag_format` (`fi_endpoint(3)`) + Some providers can reserve the higher order bits from the OFI tag + for internal purposes. This is signaled in `mem_tag_format` (see + `fi_endpoint(3)`) by setting higher order bits to zero. In such + cases, the OFI MTL will reduce the number of communicator ids + supported by reducing the bits available for the communicator ID + field in the OFI tag. + +## SCALABLE ENDPOINTS + +OFI MTL supports OFI Scalable Endpoints (SEP) feature as a means to +improve multi-threaded application throughput and message +rate. Currently the feature is designed to utilize multiple TX/RX +contexts exposed by the OFI provider in conjunction with a +multi-communicator MPI application model. Therefore, new OFI contexts +are created as and when communicators are duplicated in a lazy fashion +instead of creating them all at once during init time and this +approach also favours only creating as many contexts as needed. + +1. Multi-communicator model: + With this approach, the MPI application is requried to first duplicate + the communicators it wants to use with MPI operations (ideally creating + as many communicators as the number of threads it wants to use to call + into MPI). The duplicated communicators are then used by the + corresponding threads to perform MPI operations. A possible usage + scenario could be in an MPI + OMP application as follows + (example limited to 2 ranks): + + ```c + MPI_Comm dup_comm[n]; + MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided); + for (i = 0; i < n; i++) { + MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]); + } + if (rank == 0) { + #pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n) + for (i = 0; i < n ; i++) { + MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR, + 1, MSG_TAG, dup_comm[i]); + MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR, + 1, MSG_TAG, dup_comm[i], &status); + } + } else if (rank == 1) { + #pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n) + for (i = 0; i < n ; i++) { + MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR, + 0, MSG_TAG, dup_comm[i], &status); + MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR, + 0, MSG_TAG, dup_comm[i]); + } + } + ``` + +2. MCA variables: + To utilize the feature, the following MCA variables need to be set: + + * `mtl_ofi_enable_sep`: + This MCA variable needs to be set to enable the use of Scalable + Endpoints (SEP) feature in the OFI MTL. The underlying provider + is also checked to ensure the feature is supported. If the + provider chosen does not support it, user needs to either set + this variable to 0 or select a different provider which supports + the feature. For single-threaded applications one OFI context is + sufficient, so OFI SEPs may not add benefit. Note that + `mtl_ofi_thread_grouping` (see below) needs to be enabled to use + the different OFI SEP contexts. Otherwise, only one context (ctxt + 0) will be used. + + Default: 0 + + Command-line syntax: `--mca mtl_ofi_enable_sep 1` + + * `mtl_ofi_thread_grouping`: + Turn Thread Grouping feature on. This is needed to use the + Multi-communicator model explained above. This means that the OFI + MTL will use the communicator ID to decide the SEP contexts to be + used by the thread. In this way, each thread will have direct + access to different OFI resources. If disabled, only context 0 + will be used. Requires `mtl_ofi_enable_sep` to be set to 1. + + Default: 0 + + It is not recommended to set the MCA variable for: + + * Multi-threaded MPI applications not following multi-communicator + approach. 
+ * Applications that have multiple threads using a single + communicator as it may degrade performance. + + Command-line syntax: `--mca mtl_ofi_thread_grouping 1` + + * `mtl_ofi_num_ctxts`: + This MCA variable allows user to set the number of OFI SEP + contexts the application expects to use. For multi-threaded + applications using Thread Grouping feature, this number should be + set to the number of user threads that will call into MPI. This + variable will only have effect if `mtl_ofi_enable_sep` is set to 1. + + Default: 1 + + Command-line syntax: `--mca mtl_ofi_num_ctxts N` (`N`: number of OFI contexts required by application) + +3. Notes on performance: + * OFI MTL will create as many TX/RX contexts as set by MCA + mtl_ofi_num_ctxts. The number of contexts that can be created is + also limited by the underlying provider as each provider may have + different thresholds. Once the threshold is exceeded, contexts are + used in a round-robin fashion which leads to resource sharing + among threads. Therefore locks are required to guard against race + conditions. For performance, it is recommended to have + + Number of threads = Number of communicators = Number of contexts + + For example, when using PSM2 provider, the number of contexts is + dictated by the Intel Omni-Path HFI1 driver module. + + * OPAL layer allows for multiple threads to enter progress + simultaneously. To enable this feature, user needs to set MCA + variable `max_thread_in_progress`. When using Thread Grouping + feature, it is recommended to set this MCA parameter to the number + of threads expected to call into MPI as it provides performance + benefits. + + Default: 1 + + Command-line syntax: `--mca opal_max_thread_in_progress N` (`N`: number of threads expected to make MPI calls ) + + * For applications using a single thread with multiple communicators + and MCA variable `mtl_ofi_thread_grouping` set to 1, the MTL will + use multiple contexts, but the benefits may be negligible as only + one thread is driving progress. + +## SPECIALIZED FUNCTIONS + +To improve performance when calling message passing APIs in the OFI +mtl specialized functions are generated at compile time that eliminate +all the if conditionals that can be determined at init and don't need +to be queried again during the critical path. These functions are +generated by perl scripts during make which generate functions and +symbols for every combination of flags for each function. + +1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION: + To add a new flag to an existing specialized function for handling + cases where different OFI providers may or may not support a + particular feature, then you must follow these steps: + + 1. Update the `_generic` function in `mtl_ofi.h` with the new flag + and the if conditionals to read the new value. + 1. Update the `*.pm` file corresponding to the function with the + new flag in: `gen_funcs()`, `gen_*_function()`, & + `gen_*_sym_init()` + 1. Update `mtl_ofi_opt.h` with: + * The new flag as `#define NEW_FLAG_TYPES #NUMBER_OF_STATES`. + Example: #define OFI_CQ_DATA 2 (only has TRUE/FALSE states) + * Update the function's types with: + `#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]` + +1. ADDING A NEW FUNCTION FOR SPECIALIZATION: + To add a new function to be specialized you must + follow these steps: + 1. Create a new `mtl_ofi__opt.pm` based off + `opt_common/mtl_ofi_opt.pm.template` + 1. Add new `.pm` file to `generated_source_modules` in `Makefile.am` + 1. 
Add `.c` file to `generated_sources` in `Makefile.am` named the + same as the corresponding `.pm` file + 1. Update existing or create function in `mtl_ofi.h` to `_generic` + with new flags. + 1. Update `mtl_ofi_opt.h` with: + 1. New function types: `#define OMPI_MTL_OFI_FUNCTION_TYPES` `[FLAG_TYPES]` + 1. Add new function to the `struct ompi_mtl_ofi_symtable`: + ```c + struct ompi_mtl_ofi_symtable { + ... + int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES ) + } + ``` + 1. Add new symbol table init function definition: + ```c + void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table); + ``` + 1. Add calls to init the new function in the symbol table and + assign the function pointer to be used based off the flags in + `mtl_ofi_component.c`: + * `ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);` + * `ompi_mtl_ofi.base.mtl_FUNCTION = ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];` + +## EXAMPLE SPECIALIZED FILE + +The code below is an example of what is generated by the +specialization scripts for use in the OFI mtl. This code specializes +the blocking send functionality based on `FI_REMOTE_CQ_DATA` & OFI +Scalable Endpoint support provided by an OFI Provider. Only one +function and symbol is used during runtime based on if +`FI_REMOTE_CQ_DATA` is supported and/or if OFI Scalable Endpoint support +is enabled. + +```c +/* + * Copyright (c) 2013-2018 Intel, Inc. All rights reserved + * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "mtl_ofi.h" + +__opal_attribute_always_inline__ static inline int +ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl, + struct ompi_communicator_t *comm, + int dest, + int tag, + struct opal_convertor_t *convertor, + mca_pml_base_send_mode_t mode) +{ + const bool OFI_CQ_DATA = false; + const bool OFI_SCEP_EPS = false; + + return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag, + convertor, mode, + OFI_CQ_DATA, OFI_SCEP_EPS); +} + +__opal_attribute_always_inline__ static inline int +ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl, + struct ompi_communicator_t *comm, + int dest, + int tag, + struct opal_convertor_t *convertor, + mca_pml_base_send_mode_t mode) +{ + const bool OFI_CQ_DATA = false; + const bool OFI_SCEP_EPS = true; + + return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag, + convertor, mode, + OFI_CQ_DATA, OFI_SCEP_EPS); +} + +__opal_attribute_always_inline__ static inline int +ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl, + struct ompi_communicator_t *comm, + int dest, + int tag, + struct opal_convertor_t *convertor, + mca_pml_base_send_mode_t mode) +{ + const bool OFI_CQ_DATA = true; + const bool OFI_SCEP_EPS = false; + + return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag, + convertor, mode, + OFI_CQ_DATA, OFI_SCEP_EPS); +} + +__opal_attribute_always_inline__ static inline int +ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl, + struct ompi_communicator_t *comm, + int dest, + int tag, + struct opal_convertor_t *convertor, + mca_pml_base_send_mode_t mode) +{ + const bool OFI_CQ_DATA = true; + const bool OFI_SCEP_EPS = true; + + return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag, + convertor, mode, + OFI_CQ_DATA, OFI_SCEP_EPS); +} + +void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table) +{ + + sym_table->ompi_mtl_ofi_send[false][false] + = ompi_mtl_ofi_send_false_false; + + + sym_table->ompi_mtl_ofi_send[false][true] + = ompi_mtl_ofi_send_false_true; + + + 
sym_table->ompi_mtl_ofi_send[true][false] + = ompi_mtl_ofi_send_true_false; + + + sym_table->ompi_mtl_ofi_send[true][true] + = ompi_mtl_ofi_send_true_true; + +} +``` diff --git a/ompi/mca/op/example/README.txt b/ompi/mca/op/example/README.md similarity index 50% rename from ompi/mca/op/example/README.txt rename to ompi/mca/op/example/README.md index af4d75d58a..15ffc68cb1 100644 --- a/ompi/mca/op/example/README.txt +++ b/ompi/mca/op/example/README.md @@ -1,5 +1,3 @@ -Copyright 2009 Cisco Systems, Inc. All rights reserved. - This is a simple example op component meant to be a template / springboard for people to write their own op components. There are many different ways to write components and modules; this is but one @@ -13,28 +11,26 @@ same end effect. Feel free to customize / simplify / strip out what you don't need from this example. This example component supports a ficticious set of hardware that -provides acceleation for the MPI_MAX and MPI_BXOR MPI_Ops. The +provides acceleation for the `MPI_MAX` and `MPI_BXOR` `MPI_Ops`. The ficticious hardware has multiple versions, too: some versions only -support single precision floating point types for MAX and single -precision integer types for BXOR, whereas later versions support both -single and double precision floating point types for MAX and both -single and double precision integer types for BXOR. Hence, this -example walks through setting up particular MPI_Op function pointers -based on: +support single precision floating point types for `MAX` and single +precision integer types for `BXOR`, whereas later versions support +both single and double precision floating point types for `MAX` and +both single and double precision integer types for `BXOR`. Hence, +this example walks through setting up particular `MPI_Op` function +pointers based on: -a) hardware availability (e.g., does the node where this MPI process +1. hardware availability (e.g., does the node where this MPI process is running have the relevant hardware/resources?) - -b) MPI_Op (e.g., in this example, only MPI_MAX and MPI_BXOR are +1. `MPI_Op` (e.g., in this example, only `MPI_MAX` and `MPI_BXOR` are supported) - -c) datatype (e.g., single/double precision floating point for MAX and - single/double precision integer for BXOR) +1. datatype (e.g., single/double precision floating point for `MAX` + and single/double precision integer for `BXOR`) Additionally, there are other considerations that should be factored in at run time. Hardware accelerators are great, but they do induce overhead -- for example, some accelerator hardware require registered -memory. So even if a particular MPI_Op and datatype are supported, it +memory. So even if a particular `MPI_Op` and datatype are supported, it may not be worthwhile to use the hardware unless the amount of data to be processed is "big enough" (meaning that the cost of the registration and/or copy-in/copy-out is ameliorated) or the memory to @@ -47,57 +43,65 @@ failover strategy is well-supported by the op framework; during the query process, a component can "stack" itself similar to how POSIX signal handlers can be stacked. Specifically, op components can cache other implementations of operation functions for use in the case of -failover. The MAX and BXOR module implementations show one way of +failover. The `MAX` and `BXOR` module implementations show one way of using this method. 
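+
+As a sketch of the caching/fallback pattern described above (the
+types and names below are invented for illustration and are not the
+actual op framework API; see `op_example_module.c` for the real
+interfaces), a module can keep a pointer to the previously selected
+implementation and fall back to it whenever the accelerated path is
+unavailable or not worthwhile:
+
+```c
+#include <stddef.h>
+
+/* Hypothetical module state: the accelerated MAX implementation plus
+ * the cached (software) implementation that was selected before us. */
+typedef void (*reduce_fn_t)(const void *in, void *inout, size_t count);
+
+typedef struct {
+    reduce_fn_t hw_max_float;     /* accelerated path; may be NULL      */
+    reduce_fn_t cached_max_float; /* previously selected implementation */
+} example_max_module_t;
+
+static void example_max_float(example_max_module_t *m,
+                              const void *in, void *inout, size_t count)
+{
+    if (NULL != m->hw_max_float && count > 4096) {
+        /* Large enough to amortize registration / copy-in overhead. */
+        m->hw_max_float(in, inout, count);
+    } else {
+        /* Fail over to the cached implementation. */
+        m->cached_max_float(in, inout, count);
+    }
+}
+```
+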
Here's a listing of the files in the example component and what they do: -- configure.m4: Tests that get slurped into OMPI's top-level configure - script to determine whether this component will be built or not. -- Makefile.am: Automake makefile that builds this component. -- op_example_component.c: The main "component" source file. -- op_example_module.c: The main "module" source file. -- op_example.h: information that is shared between the .c files. -- .ompi_ignore: the presence of this file causes OMPI's autogen.pl to - skip this component in the configure/build/install process (see +- `configure.m4`: Tests that get slurped into OMPI's top-level + `configure` script to determine whether this component will be built + or not. +- `Makefile.am`: Automake makefile that builds this component. +- `op_example_component.c`: The main "component" source file. +- `op_example_module.c`: The main "module" source file. +- `op_example.h`: information that is shared between the `.c` files. +- `.ompi_ignore`: the presence of this file causes OMPI's `autogen.pl` + to skip this component in the configure/build/install process (see below). To use this example as a template for your component (assume your new -component is named "foo"): +component is named `foo`): +``` shell$ cd (top_ompi_dir)/ompi/mca/op shell$ cp -r example foo shell$ cd foo +``` -Remove the .ompi_ignore file (which makes the component "visible" to -all developers) *OR* add an .ompi_unignore file with one username per -line (as reported by `whoami`). OMPI's autogen.pl will skip any -component with a .ompi_ignore file *unless* there is also an +Remove the `.ompi_ignore` file (which makes the component "visible" to +all developers) *OR* add an `.ompi_unignore` file with one username per +line (as reported by `whoami`). OMPI's `autogen.pl` will skip any +component with a `.ompi_ignore` file *unless* there is also an .ompi_unignore file containing your user ID in it. This is a handy mechanism to have a component in the tree but have it not built / used by most other developers: +``` shell$ rm .ompi_ignore *OR* shell$ whoami > .ompi_unignore +``` -Now rename any file that contains "example" in the filename to have -"foo", instead. For example: +Now rename any file that contains `example` in the filename to have +`foo`, instead. For example: +``` shell$ mv op_example_component.c op_foo_component.c #...etc. +``` -Now edit all the files and s/example/foo/gi. Specifically, replace -all instances of "example" with "foo" in all function names, type -names, header #defines, strings, and global variables. +Now edit all the files and `s/example/foo/gi`. Specifically, replace +all instances of `example` with `foo` in all function names, type +names, header `#defines`, strings, and global variables. Now your component should be fully functional (although entirely -renamed as "foo" instead of "example"). You can go to the top-level -OMPI directory and run "autogen.pl" (which will find your component -and att it to the configure/build process) and then "configure ..." -and "make ..." as normal. +renamed as `foo` instead of `example`). You can go to the top-level +OMPI directory and run `autogen.pl` (which will find your component +and att it to the configure/build process) and then `configure ...` +and `make ...` as normal. +``` shell$ cd (top_ompi_dir) shell$ ./autogen.pl # ...lots of output... @@ -107,19 +111,21 @@ shell$ make -j 4 all # ...lots of output... shell$ make install # ...lots of output... 
+``` -After you have installed Open MPI, running "ompi_info" should show -your "foo" component in the output. +After you have installed Open MPI, running `ompi_info` should show +your `foo` component in the output. +``` shell$ ompi_info | grep op: MCA op: example (MCA v2.0, API v1.0, Component v1.4) MCA op: foo (MCA v2.0, API v1.0, Component v1.4) shell$ +``` -If you do not see your foo component, check the above steps, and check -the output of autogen.pl, configure, and make to ensure that "foo" was -found, configured, and built successfully. - -Once ompi_info sees your component, start editing the "foo" component -files in a meaningful way. +If you do not see your `foo` component, check the above steps, and +check the output of `autogen.pl`, `configure`, and `make` to ensure +that `foo` was found, configured, and built successfully. +Once `ompi_info` sees your component, start editing the `foo` +component files in a meaningful way. diff --git a/ompi/mpi/java/Makefile.am b/ompi/mpi/java/Makefile.am index 9e516a704a..943f3ecc75 100644 --- a/ompi/mpi/java/Makefile.am +++ b/ompi/mpi/java/Makefile.am @@ -10,3 +10,5 @@ # SUBDIRS = java c + +EXTRA_DIST = README.md diff --git a/ompi/mpi/java/README b/ompi/mpi/java/README.md similarity index 64% rename from ompi/mpi/java/README rename to ompi/mpi/java/README.md index a3641303d0..93b43d3521 100644 --- a/ompi/mpi/java/README +++ b/ompi/mpi/java/README.md @@ -1,26 +1,27 @@ -*************************************************************************** +# Open MPI Java bindings Note about the Open MPI Java bindings -The Java bindings in this directory are not part of the MPI specification, -as noted in the README.JAVA.txt file in the root directory. That file also -contains some information regarding the installation and use of the Java -bindings. Further details can be found in the paper [1]. +The Java bindings in this directory are not part of the MPI +specification, as noted in the README.JAVA.md file in the root +directory. That file also contains some information regarding the +installation and use of the Java bindings. Further details can be +found in the paper [1]. We originally took the code from the mpiJava project [2] as starting point for our developments, but we have pretty much rewritten 100% of it. The original copyrights and license terms of mpiJava are listed below. - [1] O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and - implementation of Java bindings in Open MPI". Parallel Comput. - 59: 1-20 (2016). +1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and + implementation of Java bindings in Open MPI". Parallel Comput. + 59: 1-20 (2016). +1. M. Baker et al. "mpiJava: An object-oriented Java interface to + MPI". In Parallel and Distributed Processing, LNCS vol. 1586, + pp. 748-762, Springer (1999). - [2] M. Baker et al. "mpiJava: An object-oriented Java interface to - MPI". In Parallel and Distributed Processing, LNCS vol. 1586, - pp. 748-762, Springer (1999). - -*************************************************************************** +## Original citation +``` mpiJava - A Java Interface to MPI --------------------------------- Copyright 2003 @@ -39,6 +40,7 @@ original copyrights and license terms of mpiJava are listed below. (Bugfixes/Additions, CMake based configure/build) Blasius Czink HLRS, University of Stuttgart +``` Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
diff --git a/ompi/mpiext/README.txt b/ompi/mpiext/README.md similarity index 63% rename from ompi/mpiext/README.txt rename to ompi/mpiext/README.md index ccfb63dfe1..ff4e7827c4 100644 --- a/ompi/mpiext/README.txt +++ b/ompi/mpiext/README.md @@ -1,4 +1,5 @@ -Symbol conventions for Open MPI extensions +# Symbol conventions for Open MPI extensions + Last updated: January 2015 This README provides some rule-of-thumb guidance for how to name @@ -15,26 +16,22 @@ Generally speaking, there are usually three kinds of extensions: 3. Functionality that is strongly expected to be in an upcoming version of the MPI specification. ----------------------------------------------------------------------- +## Case 1 -Case 1 - -The OMPI_Paffinity_str() extension is a good example of this type: it -is solely intended to be for Open MPI. It will likely never be pushed -to other MPI implementations, and it will likely never be pushed to -the MPI Forum. +The `OMPI_Paffinity_str()` extension is a good example of this type: +it is solely intended to be for Open MPI. It will likely never be +pushed to other MPI implementations, and it will likely never be +pushed to the MPI Forum. It's Open MPI-specific functionality, through and through. Public symbols of this type of functionality should be named with an -"OMPI_" prefix to emphasize its Open MPI-specific nature. To be -clear: the "OMPI_" prefix clearly identifies parts of user code that +`OMPI_` prefix to emphasize its Open MPI-specific nature. To be +clear: the `OMPI_` prefix clearly identifies parts of user code that are relying on Open MPI (and likely need to be surrounded with #if -OPEN_MPI blocks, etc.). +`OPEN_MPI` blocks, etc.). ----------------------------------------------------------------------- - -Case 2 +## Case 2 The MPI extensions mechanism in Open MPI was designed to help MPI Forum members prototype new functionality that is intended for the @@ -43,23 +40,21 @@ functionality is not only to be included in the MPI spec, but possibly also be included in another MPI implementation. As such, it seems reasonable to prefix public symbols in this type of -functionality with "MPIX_". This commonly-used prefix allows the same +functionality with `MPIX_`. This commonly-used prefix allows the same symbols to be available in multiple MPI implementations, and therefore allows user code to easily check for it. E.g., user apps can check -for the presence of MPIX_Foo to know if both Open MPI and Other MPI -support the proposed MPIX_Foo functionality. +for the presence of `MPIX_Foo` to know if both Open MPI and Other MPI +support the proposed `MPIX_Foo` functionality. -Of course, when using the MPIX_ namespace, there is the possibility of -symbol name collisions. E.g., what if Open MPI has an MPIX_Foo and -Other MPI has a *different* MPIX_Foo? +Of course, when using the `MPIX_` namespace, there is the possibility of +symbol name collisions. E.g., what if Open MPI has an `MPIX_Foo` and +Other MPI has a *different* `MPIX_Foo`? While we technically can't prevent such collisions from happening, we encourage extension authors to avoid such symbol clashes whenever possible. ----------------------------------------------------------------------- - -Case 3 +## Case 3 It is well-known that the MPI specification (intentionally) takes a long time to publish. MPI implementers can typically know, with a @@ -72,13 +67,13 @@ functionality early (i.e., before the actual publication of the corresponding MPI specification document). 
Case in point: the non-blocking collective operations that were -included in MPI-3.0 (e.g., MPI_Ibarrier). It was known for a year or -two before MPI-3.0 was published that these functions would be +included in MPI-3.0 (e.g., `MPI_Ibarrier()`). It was known for a year +or two before MPI-3.0 was published that these functions would be included in MPI-3.0. There is a continual debate among the developer community: when implementing such functionality, should the symbols be in the MPIX_ -namespace or in the MPI_ namespace? On one hand, the symbols are not +namespace or in the `MPI_` namespace? On one hand, the symbols are not yet officially standardized -- *they could change* before publication. On the other hand, developers who participate in the Forum typically have a good sense for whether symbols are going to change before @@ -89,35 +84,31 @@ before the MPI specification is published. ...and so on. After much debate: for functionality that has a high degree of confidence that it will be included in an upcoming spec (e.g., it has passed at least one vote in the MPI Forum), our conclusion is that it -is OK to use the MPI_ namespace. +is OK to use the `MPI_` namespace. Case in point: Open MPI released non-blocking collectives with the -MPI_ prefix (not the MPIX_ prefix) before the MPI-3.0 specification -officially standardized these functions. +`MPI_` prefix (not the `MPIX_` prefix) before the MPI-3.0 +specification officially standardized these functions. The rationale was threefold: 1. Let users use the functionality as soon as possible. - -2. If OMPI initially creates MPIX_Foo, but eventually renames it to - MPI_Foo when the MPI specification is published, then users will +1. If OMPI initially creates `MPIX_Foo`, but eventually renames it to + `MPI_Foo` when the MPI specification is published, then users will have to modify their codes to match. This is an artificial change inserted just to be "pure" to the MPI spec (i.e., it's a "lawyer's - answer"). But since the MPIX_Foo -> MPI_Foo change is inevitable, - it just ends up annoying users. - -3. Once OMPI introduces MPIX_ symbols, if we want to *not* annoy + answer"). But since the `MPIX_Foo` -> `MPI_Foo` change is + inevitable, it just ends up annoying users. +1. Once OMPI introduces `MPIX_` symbols, if we want to *not* annoy users, we'll likely have weak symbols / aliased versions of both - MPIX_Foo and MPI_Foo once the Foo functionality is included in a - published MPI specification. However, when can we delete the - MPIX_Foo symbol? It becomes a continuing annoyance of backwards + `MPIX_Foo` and `MPI_Foo` once the Foo functionality is included in + a published MPI specification. However, when can we delete the + `MPIX_Foo` symbol? It becomes a continuing annoyance of backwards compatibility that we have to keep carrying forward. For all these reasons, we believe that it's better to put -expected-upcoming official MPI functionality in the MPI_ namespace, -not the MPIX_ namespace. - ----------------------------------------------------------------------- +expected-upcoming official MPI functionality in the `MPI_` namespace, +not the `MPIX_` namespace. All that being said, these are rules of thumb. They are not an official mandate. 
There may well be cases where there are reasons to diff --git a/ompi/mpiext/affinity/Makefile.am b/ompi/mpiext/affinity/Makefile.am index de819bd32c..188936c652 100644 --- a/ompi/mpiext/affinity/Makefile.am +++ b/ompi/mpiext/affinity/Makefile.am @@ -2,7 +2,7 @@ # Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana # University Research and Technology # Corporation. All rights reserved. -# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved. +# Copyright (c) 2010-2020 Cisco Systems, Inc. All rights reserved. # $COPYRIGHT$ # # Additional copyrights may follow @@ -20,4 +20,4 @@ SUBDIRS = c -EXTRA_DIST = README.txt +EXTRA_DIST = README.md diff --git a/ompi/mpiext/affinity/README.md b/ompi/mpiext/affinity/README.md new file mode 100644 index 0000000000..5ff6a043fb --- /dev/null +++ b/ompi/mpiext/affinity/README.md @@ -0,0 +1,30 @@ +# Open MPI extension: Affinity + +## Copyrights + +``` +Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved. +Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved. +``` + +## Authors + +* Jeff Squyres, 19 April 2010, and 16 April 2012 +* Terry Dontje, 18 November 2010 + +## Description + +This extension provides a single new function, `OMPI_Affinity_str()`, +that takes a format value and then provides 3 prettyprint strings as +output: + +* `fmt_type`: is an enum that tells `OMPI_Affinity_str()` whether to + use a resource description string or layout string format for + `ompi_bound` and `currently_bound` output strings. +* `ompi_bound`: describes what sockets/cores Open MPI bound this process + to (or indicates that Open MPI did not bind this process). +* `currently_bound`: describes what sockets/cores this process is + currently bound to (or indicates that it is unbound). +* `exists`: describes what processors are available in the current host. + +See `OMPI_Affinity_str(3)` for more details. diff --git a/ompi/mpiext/affinity/README.txt b/ompi/mpiext/affinity/README.txt deleted file mode 100644 index c3f1fb09bf..0000000000 --- a/ompi/mpiext/affinity/README.txt +++ /dev/null @@ -1,29 +0,0 @@ -# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved. -Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved. - -$COPYRIGHT$ - -Jeff Squyres -19 April 2010, and -16 April 2012 - -Terry Dontje -18 November 2010 - -This extension provides a single new function, OMPI_Affinity_str(), -that takes a format value and then provides 3 prettyprint strings as -output: - -fmt_type: is an enum that tells OMPI_Affinity_str() whether to use a -resource description string or layout string format for ompi_bound and -currently_bound output strings. - -ompi_bound: describes what sockets/cores Open MPI bound this process -to (or indicates that Open MPI did not bind this process). - -currently_bound: describes what sockets/cores this process is -currently bound to (or indicates that it is unbound). - -exists: describes what processors are available in the current host. - -See OMPI_Affinity_str(3) for more details. diff --git a/ompi/mpiext/cuda/Makefile.am b/ompi/mpiext/cuda/Makefile.am index 3d8db46ce9..17a5469b3c 100644 --- a/ompi/mpiext/cuda/Makefile.am +++ b/ompi/mpiext/cuda/Makefile.am @@ -21,4 +21,4 @@ SUBDIRS = c -EXTRA_DIST = README.txt +EXTRA_DIST = README.md diff --git a/ompi/mpiext/cuda/README.md b/ompi/mpiext/cuda/README.md new file mode 100644 index 0000000000..a6acbde5d1 --- /dev/null +++ b/ompi/mpiext/cuda/README.md @@ -0,0 +1,11 @@ +# Open MPI extension: Cuda + +Copyright (c) 2015 NVIDIA, Inc. 
All rights reserved. + +Author: Rolf vandeVaart + +This extension provides a macro for compile time check of CUDA aware +support. It also provides a function for runtime check of CUDA aware +support. + +See `MPIX_Query_cuda_support(3)` for more details. diff --git a/ompi/mpiext/cuda/README.txt b/ompi/mpiext/cuda/README.txt deleted file mode 100644 index cc46fc3ef9..0000000000 --- a/ompi/mpiext/cuda/README.txt +++ /dev/null @@ -1,11 +0,0 @@ -# Copyright (c) 2015 NVIDIA, Inc. All rights reserved. - -$COPYRIGHT$ - -Rolf vandeVaart - - -This extension provides a macro for compile time check of CUDA aware support. -It also provides a function for runtime check of CUDA aware support. - -See MPIX_Query_cuda_support(3) for more details. diff --git a/ompi/mpiext/example/Makefile.am b/ompi/mpiext/example/Makefile.am index 8b2b03942d..6e2c66e5fe 100644 --- a/ompi/mpiext/example/Makefile.am +++ b/ompi/mpiext/example/Makefile.am @@ -1,5 +1,5 @@ # -# Copyright (c) 2012 Cisco Systems, Inc. All rights reserved. +# Copyright (c) 2020 Cisco Systems, Inc. All rights reserved. # $COPYRIGHT$ # # Additional copyrights may follow @@ -17,4 +17,4 @@ SUBDIRS = c mpif-h use-mpi use-mpi-f08 -EXTRA_DIST = README.txt +EXTRA_DIST = README.md diff --git a/ompi/mpiext/example/README.md b/ompi/mpiext/example/README.md new file mode 100644 index 0000000000..36bb1881f7 --- /dev/null +++ b/ompi/mpiext/example/README.md @@ -0,0 +1,148 @@ +# Open MPI extension: Example + +## Overview + +This example MPI extension shows how to make an MPI extension for Open +MPI. + +An MPI extension provides new top-level APIs in Open MPI that are +available to user-level applications (vs. adding new code/APIs that is +wholly internal to Open MPI). MPI extensions are generally used to +prototype new MPI APIs, or provide Open MPI-specific APIs to +applications. This example MPI extension provides a new top-level MPI +API named `OMPI_Progress` that is callable in both C and Fortran. + +MPI extensions are similar to Open MPI components, but due to +complex ordering requirements for the Fortran-based MPI bindings, +their build order is a little different. + +Note that MPI has 4 different sets of bindings (C, Fortran `mpif.h`, +the Fortran `mpi` module, and the Fortran `mpi_f08` module), and Open +MPI extensions allow adding API calls to all 4 of them. Prototypes +for the user-accessible functions/subroutines/constants are included +in the following publicly-available mechanisms: + +* C: `mpi-ext.h` +* Fortran mpif.h: `mpif-ext.h` +* Fortran "use mpi": `use mpi_ext` +* Fortran "use mpi_f08": `use mpi_f08_ext` + +This example extension defines a new top-level API named +`OMPI_Progress()` in all four binding types, and provides test programs +to call this API in each of the four binding types. Code (and +comments) is worth 1,000 words -- see the code in this example +extension to understand how it works and how the build system builds +and inserts each piece into the publicly-available mechansisms (e.g., +`mpi-ext.h` and the `mpi_f08_ext` module). + +## Comparison to General Open MPI MCA Components + +Here's the ways that MPI extensions are similar to Open MPI +components: + +1. Extensions have a top-level `configure.m4` with a well-known m4 macro + that is run during Open MPI's configure that determines whether the + component wants to build or not. + + Note, however, that unlike components, extensions *must* have a + `configure.m4`. No other method of configuration is supported. + +1. Extensions must adhere to normal Automake-based targets. 
We + strongly suggest that you use `Makefile.am`'s and have the + extension's `configure.m4` `AC_CONFIG_FILE` each `Makefile.am` in + the extension. Using other build systems may work, but are + untested and unsupported. + +1. Extensions create specifically-named libtool convenience archives + (i.e., `*.la` files) that the build system slurps into higher-level + libraries. + +Unlike components, however, extensions: + +1. Have a bit more rigid directory and file naming scheme. +1. Have up to four different, specifically-named subdirectories (one + for each MPI binding type). +1. Also install some specifically-named header files (for C and the + Fortran `mpif.h` bindings). + +Similar to components, an MPI extension's name is determined by its +directory name: `ompi/mpiext/EXTENSION_NAME` + +## Extension requirements + +### Required: C API + +Under this top-level directory, the extension *must* have a directory +named `c` (for the C bindings) that: + +1. contains a file named `mpiext_EXTENSION_NAME_c.h` +1. installs `mpiext_EXTENSION_NAME_c.h` to + `$includedir/openmpi/mpiext/EXTENSION_NAME/c` +1. builds a Libtool convenience library named + `libmpiext_EXTENSION_NAME_c.la` + +### Optional: `mpif.h` bindings + +Optionally, the extension may have a director named `mpif-h` (for the +Fortran `mpif.h` bindings) that: + +1. contains a file named `mpiext_EXTENSION_NAME_mpifh.h` +1. installs `mpiext_EXTENSION_NAME_mpih.h` to + `$includedir/openmpi/mpiext/EXTENSION_NAME/mpif-h` +1. builds a Libtool convenience library named + `libmpiext_EXTENSION_NAME_mpifh.la` + +### Optional: `mpi` module bindings + +Optionally, the extension may have a directory named `use-mpi` (for the +Fortran `mpi` module) that: + +1. contains a file named `mpiext_EXTENSION_NAME_usempi.h` + +***NOTE:*** The MPI extension system does NOT support building an +additional library in the `use-mpi` extension directory. It is +assumed that the `use-mpi` bindings will use the same back-end symbols +as the `mpif.h` bindings, and that the only output product of the +`use-mpi` directory is a file to be included in the `mpi-ext` module +(i.e., strong Fortran prototypes for the functions/global variables in +this extension). + +### Optional: `mpi_f08` module bindings + +Optionally, the extension may have a directory named `use-mpi-f08` (for +the Fortran `mpi_f08` module) that: + +1. contains a file named `mpiext_EXTENSION_NAME_usempif08.h` +1. builds a Libtool convenience library named + `libmpiext_EXTENSION_NAME_usempif08.la` + +See the comments in all the header and source files in this tree to +see what each file is for and what should be in each. + +## Notes + +Note that the build order of MPI extensions is a bit strange. The +directories in a MPI extensions are NOT traversed top-down in +sequential order. Instead, due to ordering requirements when building +the Fortran module-based interfaces, each subdirectory in extensions +are traversed individually at different times in the overall Open MPI +build. + +As such, `ompi/mpiext/EXTENSION_NAME/Makefile.am` is not traversed +during a normal top-level `make all` target. This `Makefile.am` +exists for two reasons, however: + +1. For the conveneince of the developer, so that you can issue normal + `make` commands at the top of your extension tree (e.g., `make all` + will still build all bindings in an extension). +1. During a top-level `make dist`, extension directories *are* + traversed top-down in sequence order. 
Having a top-level + `Makefile.am` in an extension allows `EXTRA_DIST`ing of files, such + as this `README.md` file. + +This are reasons for this strange ordering, but suffice it to say that +`make dist` doesn't have the same ordering requiements as `make all`, +and is therefore easier to have a "normal" Automake-usual top-down +sequential directory traversal. + +Enjoy! diff --git a/ompi/mpiext/example/README.txt b/ompi/mpiext/example/README.txt deleted file mode 100644 index 13e237df3c..0000000000 --- a/ompi/mpiext/example/README.txt +++ /dev/null @@ -1,138 +0,0 @@ -Copyright (C) 2012 Cisco Systems, Inc. All rights reserved. - -$COPYRIGHT$ - -This example MPI extension shows how to make an MPI extension for Open -MPI. - -An MPI extension provides new top-level APIs in Open MPI that are -available to user-level applications (vs. adding new code/APIs that is -wholly internal to Open MPI). MPI extensions are generally used to -prototype new MPI APIs, or provide Open MPI-specific APIs to -applications. This example MPI extension provides a new top-level MPI -API named "OMPI_Progress" that is callable in both C and Fortran. - -MPI extensions are similar to Open MPI components, but due to -complex ordering requirements for the Fortran-based MPI bindings, -their build order is a little different. - -Note that MPI has 4 different sets of bindings (C, Fortran mpif.h, -Fortran "use mpi", and Fortran "use mpi_f08"), and Open MPI extensions -allow adding API calls to all 4 of them. Prototypes for the -user-accessible functions/subroutines/constants are included in the -following publicly-available mechanisms: - -- C: mpi-ext.h -- Fortran mpif.h: mpif-ext.h -- Fortran "use mpi": use mpi_ext -- Fortran "use mpi_f08": use mpi_f08_ext - -This example extension defines a new top-level API named -"OMPI_Progress" in all four binding types, and provides test programs -to call this API in each of the four binding types. Code (and -comments) is worth 1,000 words -- see the code in this example -extension to understand how it works and how the build system builds -and inserts each piece into the publicly-available mechansisms (e.g., -mpi-ext.h and the mpi_f08_ext module). - --------------------------------------------------------------------------------- - -Here's the ways that MPI extensions are similar to Open MPI -components: - -- Extensions have a top-level configure.m4 with a well-known m4 macro - that is run during Open MPI's configure that determines whether the - component wants to build or not. - - Note, however, that unlike components, extensions *must* have a - configure.m4. No other method of configuration is supported. - -- Extensions must adhere to normal Automake-based targets. We - strongly suggest that you use Makefile.am's and have the extension's - configure.m4 AC_CONFIG_FILE each Makefile.am in the extension. - Using other build systems may work, but are untested and - unsupported. - -- Extensions create specifically-named libtool convenience archives - (i.e., *.la files) that the build system slurps into higher-level - libraries. - -Unlike components, however, extensions: - -- Have a bit more rigid directory and file naming scheme. - -- Have up to four different, specifically-named subdirectories (one - for each MPI binding type). - -- Also install some specifically-named header files (for C and the - Fortran mpif.h bindings). 
- -Similar to components, an MPI extension's name is determined by its -directory name: ompi/mpiext/ - -Under this top-level directory, the extension *must* have a directory -named "c" (for the C bindings) that: - -- contains a file named mpiext__c.h -- installs mpiext__c.h to - $includedir/openmpi/mpiext//c -- builds a Libtool convenience library named libmpiext__c.la - -Optionally, the extension may have a director named "mpif-h" (for the -Fortran mpif.h bindings) that: - -- contains a file named mpiext__mpifh.h -- installs mpiext__mpih.h to - $includedir/openmpi/mpiext//mpif-h -- builds a Libtool convenience library named libmpiext__mpifh.la - -Optionally, the extension may have a director named "use-mpi" (for the -Fortran "use mpi" bindings) that: - -- contains a file named mpiext__usempi.h - -NOTE: The MPI extension system does NOT support building an additional -library in the use-mpi extension directory. It is assumed that the -use-mpi bindings will use the same back-end symbols as the mpif.h -bindings, and that the only output product of the use-mpi directory is -a file to be included in the mpi-ext module (i.e., strong Fortran -prototypes for the functions/global variables in this extension). - -Optionally, the extension may have a director named "use-mpi-f08" (for -the Fortran mpi_f08 bindings) that: - -- contains a file named mpiext__usempif08.h -- builds a Libtool convenience library named - libmpiext__usempif08.la - -See the comments in all the header and source files in this tree to -see what each file is for and what should be in each. - --------------------------------------------------------------------------------- - -Note that the build order of MPI extensions is a bit strange. The -directories in a MPI extensions are NOT traversed top-down in -sequential order. Instead, due to ordering requirements when building -the Fortran module-based interfaces, each subdirectory in extensions -are traversed individually at different times in the overall Open MPI -build. - -As such, ompi/mpiext//Makefile.am is not traversed during a -normal top-level "make all" target. This Makefile.am exists for two -reasons, however: - -1. For the conveneince of the developer, so that you can issue normal -"make" commands at the top of your extension tree (e.g., "make all" -will still build all bindings in an extension). - -2. During a top-level "make dist", extension directories *are* -traversed top-down in sequence order. Having a top-level Makefile.am -in an extension allows EXTRA_DISTing of files, such as this README -file. - -This are reasons for this strange ordering, but suffice it to say that -"make dist" doesn't have the same ordering requiements as "make all", -and is therefore easier to have a "normal" Automake-usual top-down -sequential directory traversal. - -Enjoy! diff --git a/ompi/mpiext/pcollreq/Makefile.am b/ompi/mpiext/pcollreq/Makefile.am index 00da7f5ff0..329a4d1a9d 100644 --- a/ompi/mpiext/pcollreq/Makefile.am +++ b/ompi/mpiext/pcollreq/Makefile.am @@ -8,3 +8,5 @@ # SUBDIRS = c mpif-h use-mpi use-mpi-f08 + +EXTRA_DIST = README.md diff --git a/ompi/mpiext/pcollreq/README.md b/ompi/mpiext/pcollreq/README.md new file mode 100644 index 0000000000..124019ea73 --- /dev/null +++ b/ompi/mpiext/pcollreq/README.md @@ -0,0 +1,14 @@ +# Open MPI extension: pcollreq + +Copyright (c) 2018 FUJITSU LIMITED. All rights reserved. 
+ +This extension provides the feature of persistent collective +communication operations and persistent neighborhood collective +communication operations, which is planned to be included in the next +MPI Standard after MPI-3.1 as of Nov. 2018. + +See `MPIX_Barrier_init(3)` for more details. + +The code will be moved to the `ompi/mpi` directory and the `MPIX_` +prefix will be switch to the `MPI_` prefix once the MPI Standard which +includes this feature is published. diff --git a/ompi/mpiext/pcollreq/README.txt b/ompi/mpiext/pcollreq/README.txt deleted file mode 100644 index 7dd491f81d..0000000000 --- a/ompi/mpiext/pcollreq/README.txt +++ /dev/null @@ -1,14 +0,0 @@ -Copyright (c) 2018 FUJITSU LIMITED. All rights reserved. - -$COPYRIGHT$ - -This extension provides the feature of persistent collective communication -operations and persistent neighborhood collective communication operations, -which is planned to be included in the next MPI Standard after MPI-3.1 as -of Nov. 2018. - -See MPIX_Barrier_init(3) for more details. - -The code will be moved to the ompi/mpi directory and the MPIX_ prefix will -be switch to the MPI_ prefix once the MPI Standard which includes this -feature is published. diff --git a/ompi/mpiext/shortfloat/Makefile.am b/ompi/mpiext/shortfloat/Makefile.am index 32c354bbc2..7b5597df40 100644 --- a/ompi/mpiext/shortfloat/Makefile.am +++ b/ompi/mpiext/shortfloat/Makefile.am @@ -8,3 +8,5 @@ # SUBDIRS = c mpif-h use-mpi use-mpi-f08 + +EXTRA_DIST = README.md diff --git a/ompi/mpiext/shortfloat/README.md b/ompi/mpiext/shortfloat/README.md new file mode 100644 index 0000000000..c41ac5941b --- /dev/null +++ b/ompi/mpiext/shortfloat/README.md @@ -0,0 +1,35 @@ +# Open MPI extension: shortfloat + +Copyright (c) 2018 FUJITSU LIMITED. All rights reserved. + +This extension provides additional MPI datatypes `MPIX_SHORT_FLOAT`, +`MPIX_C_SHORT_FLOAT_COMPLEX`, and `MPIX_CXX_SHORT_FLOAT_COMPLEX`, +which are proposed with the `MPI_` prefix in June 2017 for proposal in +the MPI 4.0 standard. As of February 2019, it is not accepted yet. + +See https://github.com/mpi-forum/mpi-issues/issues/65 for moe details + +Each MPI datatype corresponds to the C/C++ type `short float`, the C +type `short float _Complex`, and the C++ type `std::complex`, respectively. + +In addition, this extension provides a datatype `MPIX_C_FLOAT16` for +the C type `_Float16`, which is defined in ISO/IEC JTC 1/SC 22/WG 14 +N1945 (ISO/IEC TS 18661-3:2015). This name and meaning are same as +that of MPICH. See https://github.com/pmodels/mpich/pull/3455. + +This extension is enabled only if the C compiler supports `short float` +or `_Float16`, or the `--enable-alt-short-float=TYPE` option is passed +to the Open MPI `configure` script. + +NOTE: The Clang 6.0.x and 7.0.x compilers support the `_Float16` type +(via software emulation), but require an additional linker flag to +function properly. If you wish to enable Clang 6.0.x or 7.0.x's +software emulation of `_Float16`, use the following CLI options to Open +MPI configure script: + +``` +./configure \ + LDFLAGS=--rtlib=compiler-rt \ + --with-wrapper-ldflags=--rtlib=compiler-rt ... +``` diff --git a/ompi/mpiext/shortfloat/README.txt b/ompi/mpiext/shortfloat/README.txt deleted file mode 100644 index 3406fc5d1b..0000000000 --- a/ompi/mpiext/shortfloat/README.txt +++ /dev/null @@ -1,35 +0,0 @@ -Copyright (c) 2018 FUJITSU LIMITED. All rights reserved. 
- -$COPYRIGHT$ - -This extension provides additional MPI datatypes MPIX_SHORT_FLOAT, -MPIX_C_SHORT_FLOAT_COMPLEX, and MPIX_CXX_SHORT_FLOAT_COMPLEX, which -are proposed with the MPI_ prefix in June 2017 for proposal in the -MPI 4.0 standard. As of February 2019, it is not accepted yet. - - https://github.com/mpi-forum/mpi-issues/issues/65 - -Each MPI datatype corresponds to the C/C++ type 'short float', the C type -'short float _Complex', and the C++ type 'std::complex', -respectively. - -In addition, this extension provides a datatype MPIX_C_FLOAT16 for -the C type _Float16, which is defined in ISO/IEC JTC 1/SC 22/WG 14 -N1945 (ISO/IEC TS 18661-3:2015). This name and meaning are same as -that of MPICH. - - https://github.com/pmodels/mpich/pull/3455 - -This extension is enabled only if the C compiler supports 'short float' -or '_Float16', or the '--enable-alt-short-float=TYPE' option is passed -to the configure script. - -NOTE: The Clang 6.0.x and 7.0.x compilers support the "_Float16" type -(via software emulation), but require an additional linker flag to -function properly. If you wish to enable Clang 6.0.x or 7.0.x's -software emulation of _Float16, use the following CLI options to Open -MPI configure script: - - ./configure \ - LDFLAGS=--rtlib=compiler-rt \ - --with-wrapper-ldflags=--rtlib=compiler-rt ... diff --git a/opal/mca/btl/ofi/README b/opal/mca/btl/ofi/README deleted file mode 100644 index 0872da0aab..0000000000 --- a/opal/mca/btl/ofi/README +++ /dev/null @@ -1,110 +0,0 @@ -======================================== -Design notes on BTL/OFI -======================================== - -This is the RDMA only btl based on OFI Libfabric. The goal is to enable RDMA -with multiple vendor hardware through one interface. Most of the operations are -managed by upper layer (osc/rdma). This BTL is mostly doing the low level work. - -Tested providers: sockets,psm2,ugni - -======================================== - -Component - -This BTL is requesting libfabric version 1.5 API and will not support older versions. - -The required capabilities of this BTL is FI_ATOMIC and FI_RMA with the endpoint type -of FI_EP_RDM only. This BTL does NOT support libfabric provider that requires local -memory registration (FI_MR_LOCAL). - -BTL/OFI will initialize a module with ONLY the first compatible info returned from OFI. -This means it will rely on OFI provider to do load balancing. The support for multiple -device might be added later. - -The BTL creates only one endpoint and one CQ. - -======================================== - -Memory Registration - -Open MPI has a system in place to exchange remote address and always use the remote -virtual address to refer to a piece of memory. However, some libfabric providers might -not support the use of virtual address and instead will use zero-based offset addressing. - -FI_MR_VIRT_ADDR is the flag that determine this behavior. mca_btl_ofi_reg_mem() handles -this by storing the base address in registration handle in case of the provider does not -support FI_MR_VIRT_ADDR. This base address will be used to calculate the offset later in -RDMA/Atomic operations. - -The BTL will try to use the address of registration handle as the key. However, if the -provider supports FI_MR_PROV_KEY, it will use provider provided key. Simply does not care. - -The BTL does not register local operand or compare. This is why this BTL does not support -FI_MR_LOCAL and will allocate every buffer before registering. This means FI_MR_ALLOCATED -is supported. So to be explicit. 
- -Supported MR mode bits (will work with or without): - enum: - - FI_MR_BASIC - - FI_MR_SCALABLE - - mode bits: - - FI_MR_VIRT_ADDR - - FI_MR_ALLOCATED - - FI_MR_PROV_KEY - -The BTL does NOT support (will not work with): - - FI_MR_LOCAL - - FI_MR_MMU_NOTIFY - - FI_MR_RMA_EVENT - - FI_MR_ENDPOINT - -Just a reminder, in libfabric API 1.5... -FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR) - -======================================== - -Completions - -Every operation in this BTL is asynchronous. The completion handling will occur in -mca_btl_ofi_component_progress() where we read the CQ with the completion context and -execute the callback functions. The completions are local. No remote completion event is -generated as local completion already guarantee global completion. - -The BTL keep tracks of number of outstanding operations and provide flush interface. - -======================================== - -Sockets Provider - -Sockets provider is the proof of concept provider for libfabric. It is supposed to support -all the OFI API with emulations. This provider is considered very slow and bound to raise -problems that we might not see from other faster providers. - -Known Problems: - - sockets provider uses progress thread and can cause segfault in finalize as we free - the resources while progress thread is still using it. sleep(1) was put in - mca_btl_ofi_componenet_close() for this reason. - - sockets provider deadlock in two-sided mode. Might be something about buffered recv. - (August 2018). - -======================================== - -Scalable Endpoint - -This BTL will try to use scalable endpoint to create communication context. This will increase -multithreaded performance for some application. The default number of context created is 1 and -can be tuned VIA MCA parameter "btl_ofi_num_contexts_per_module". It is advised that the number -of context should be equal to number of physical core for optimal performance. - -User can disable scalable endpoint by MCA parameter "btl_ofi_disable_sep". -With scalable endpoint disbled, the BTL will alias OFI endpoint to both tx and rx context. - -======================================== - -Two sided communication - -Two sided communication is added later on to BTL OFI to enable non tag-matching provider -to be able to use in Open MPI with this BTL. However, the support is only for "functional" -and has not been optimized for performance at this point. (August 2018) diff --git a/opal/mca/btl/ofi/README.md b/opal/mca/btl/ofi/README.md new file mode 100644 index 0000000000..bab4ef1c1b --- /dev/null +++ b/opal/mca/btl/ofi/README.md @@ -0,0 +1,113 @@ +# Design notes on BTL/OFI + +This is the RDMA only btl based on OFI Libfabric. The goal is to +enable RDMA with multiple vendor hardware through one interface. Most +of the operations are managed by upper layer (osc/rdma). This BTL is +mostly doing the low level work. + +Tested providers: sockets,psm2,ugni + +## Component + +This BTL is requesting libfabric version 1.5 API and will not support +older versions. + +The required capabilities of this BTL is `FI_ATOMIC` and `FI_RMA` with +the endpoint type of `FI_EP_RDM` only. This BTL does NOT support +libfabric provider that requires local memory registration +(`FI_MR_LOCAL`). + +BTL/OFI will initialize a module with ONLY the first compatible info +returned from OFI. This means it will rely on OFI provider to do load +balancing. The support for multiple device might be added later. + +The BTL creates only one endpoint and one CQ. 
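+
+As a rough sketch (not the BTL's actual initialization code), the
+capability request described above could look like the following with
+the libfabric API; the function name is invented for illustration:
+
+```c
+#include <rdma/fabric.h>
+#include <rdma/fi_errno.h>
+
+/* Ask libfabric (API 1.5) for FI_RMA + FI_ATOMIC capable providers
+ * with reliable datagram endpoints. FI_MR_LOCAL is deliberately not
+ * listed in mr_mode, since this BTL does not support it. */
+static int example_query_providers(struct fi_info **providers)
+{
+    struct fi_info *hints = fi_allocinfo();
+    if (NULL == hints) {
+        return -FI_ENOMEM;
+    }
+
+    hints->caps = FI_RMA | FI_ATOMIC;
+    hints->ep_attr->type = FI_EP_RDM;
+    hints->domain_attr->mr_mode = FI_MR_VIRT_ADDR | FI_MR_ALLOCATED |
+                                  FI_MR_PROV_KEY;
+
+    int ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, providers);
+    fi_freeinfo(hints);
+    /* The BTL would then build its module from only the first entry. */
+    return ret;
+}
+```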
+ +## Memory Registration + +Open MPI has a system in place to exchange remote address and always +use the remote virtual address to refer to a piece of memory. However, +some libfabric providers might not support the use of virtual address +and instead will use zero-based offset addressing. + +`FI_MR_VIRT_ADDR` is the flag that determine this +behavior. `mca_btl_ofi_reg_mem()` handles this by storing the base +address in registration handle in case of the provider does not +support `FI_MR_VIRT_ADDR`. This base address will be used to calculate +the offset later in RDMA/Atomic operations. + +The BTL will try to use the address of registration handle as the +key. However, if the provider supports `FI_MR_PROV_KEY`, it will use +provider provided key. Simply does not care. + +The BTL does not register local operand or compare. This is why this +BTL does not support `FI_MR_LOCAL` and will allocate every buffer +before registering. This means `FI_MR_ALLOCATED` is supported. So to +be explicit. + +Supported MR mode bits (will work with or without): + +* enum: + * `FI_MR_BASIC` + * `FI_MR_SCALABLE` +* mode bits: + * `FI_MR_VIRT_ADDR` + * `FI_MR_ALLOCATED` + * `FI_MR_PROV_KEY` + +The BTL does NOT support (will not work with): + +* `FI_MR_LOCAL` +* `FI_MR_MMU_NOTIFY` +* `FI_MR_RMA_EVENT` +* `FI_MR_ENDPOINT` + +Just a reminder, in libfabric API 1.5... +`FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)` + +## Completions + +Every operation in this BTL is asynchronous. The completion handling +will occur in `mca_btl_ofi_component_progress()` where we read the CQ +with the completion context and execute the callback functions. The +completions are local. No remote completion event is generated as +local completion already guarantee global completion. + +The BTL keep tracks of number of outstanding operations and provide +flush interface. + +## Sockets Provider + +Sockets provider is the proof of concept provider for libfabric. It is +supposed to support all the OFI API with emulations. This provider is +considered very slow and bound to raise problems that we might not see +from other faster providers. + +Known Problems: + +* sockets provider uses progress thread and can cause segfault in + finalize as we free the resources while progress thread is still + using it. `sleep(1)` was put in `mca_btl_ofi_component_close()` for + this reason. +* sockets provider deadlock in two-sided mode. Might be something + about buffered recv. (August 2018). + +## Scalable Endpoint + +This BTL will try to use scalable endpoint to create communication +context. This will increase multithreaded performance for some +application. The default number of context created is 1 and can be +tuned VIA MCA parameter `btl_ofi_num_contexts_per_module`. It is +advised that the number of context should be equal to number of +physical core for optimal performance. + +User can disable scalable endpoint by MCA parameter +`btl_ofi_disable_sep`. With scalable endpoint disbled, the BTL will +alias OFI endpoint to both tx and rx context. + +## Two sided communication + +Two sided communication is added later on to BTL OFI to enable non +tag-matching provider to be able to use in Open MPI with this +BTL. However, the support is only for "functional" and has not been +optimized for performance at this point. 
(August 2018) diff --git a/opal/mca/btl/smcuda/README b/opal/mca/btl/smcuda/README deleted file mode 100644 index 859015e1a4..0000000000 --- a/opal/mca/btl/smcuda/README +++ /dev/null @@ -1,113 +0,0 @@ -Copyright (c) 2013 NVIDIA Corporation. All rights reserved. -August 21, 2013 - -SMCUDA DESIGN DOCUMENT -This document describes the design and use of the smcuda BTL. - -BACKGROUND -The smcuda btl is a copy of the sm btl but with some additional features. -The main extra feature is the ability to make use of the CUDA IPC APIs to -quickly move GPU buffers from one GPU to another. Without this support, -the GPU buffers would all be moved into and then out of host memory. - -GENERAL DESIGN - -The general design makes use of the large message RDMA RGET support in the -OB1 PML. However, there are some interesting choices to make use of it. -First, we disable any large message RDMA support in the BTL for host -messages. This is done because we need to use the mca_btl_smcuda_get() for -the GPU buffers. This is also done because the upper layers expect there -to be a single mpool but we need one for the GPU memory and one for the -host memory. Since the advantages of using RDMA with host memory is -unclear, we disabled it. This means no KNEM or CMA support built in to the -smcuda BTL. - -Also note that we give the smcuda BTL a higher rank than the sm BTL. This -means it will always be selected even if we are doing host only data -transfers. The smcuda BTL is not built if it is not requested via the ---with-cuda flag to the configure line. - -Secondly, the smcuda does not make use of the traditional method of -enabling RDMA operations. The traditional method checks for the existence -of an RDMA btl hanging off the endpoint. The smcuda works in conjunction -with the OB1 PML and uses flags that it sends in the BML layer. - -OTHER CONSIDERATIONS -CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA -nodes, CUDA IPC may only work between GPUs that are not connected -over the IOH. In addition, we want to check for CUDA IPC support lazily, -when the first GPU access occurs, rather than during MPI_Init() time. -This complicates the design. - -INITIALIZATION -When the smcuda BTL initializes, it starts with no support for CUDA IPC. -Upon the first access of a GPU buffer, the smcuda checks which GPU device -it has and sends that to the remote side using a smcuda specific control -message. The other rank receives the message, and checks to see if there -is CUDA IPC support between the two GPUs via a call to -cuDeviceCanAccessPeer(). If it is true, then the smcuda BTL piggy backs on -the PML error handler callback to make a call into the PML and let it know -to enable CUDA IPC. We created a new flag so that the error handler does -the right thing. Large message RDMA is enabled by setting a flag in the -bml->btl_flags field. Control returns to the smcuda BTL where a reply -message is sent so the sending side can set its flag. - -At that point, the PML layer starts using the large message RDMA support -in the smcuda BTL. This is done in some special CUDA code in the PML layer. - -ESTABLISHING CUDA IPC SUPPORT -A check has been added into both the send and sendi path in the smcuda btl -that checks to see if it should send a request for CUDA IPC setup message. - - /* Initiate setting up CUDA IPC support. 
*/ - if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) { - mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint); - } - -The first check is to see if the CUDA environment has been initialized. If -not, then presumably we are not sending any GPU buffers yet and there is -nothing to be done. If we are initialized, then check the status of the -CUDA IPC endpoint. If it is in the IPC_INIT stage, then call the function -to send of a control message to the endpoint. - -On the receiving side, we first check to see if we are initialized. If -not, then send a message back to the sender saying we are not initialized. -This will cause the sender to reset its state to IPC_INIT so it can try -again on the next send. - -I considered putting the receiving side into a new state like IPC_NOTREADY, -and then when it switches to ready, to then sending the ACK to the sender. -The problem with this is that we would need to do these checks during the -progress loop which adds some extra overhead as we would have to check all -endpoints to see if they were ready. - -Note that any rank can initiate the setup of CUDA IPC. It is triggered by -whichever side does a send or sendi call of a GPU buffer. - -I have the sender attempt 5 times to set up the connection. After that, we -give up. Note that I do not expect many scenarios where the sender has to -resend. It could happen in a race condition where one rank has initialized -its CUDA environment but the other side has not. - -There are several states the connections can go through. - -IPC_INIT - nothing has happened -IPC_SENT - message has been sent to other side -IPC_ACKING - Received request and figuring out what to send back -IPC_ACKED - IPC ACK sent -IPC_OK - IPC ACK received back -IPC_BAD - Something went wrong, so marking as no IPC support - -NOTE ABOUT CUDA IPC AND MEMORY POOLS -The CUDA IPC support works in the following way. A sender makes a call to -cuIpcGetMemHandle() and gets a memory handle for its local memory. The -sender then sends that handle to receiving side. The receiver calls -cuIpcOpenMemHandle() using that handle and gets back an address to the -remote memory. The receiver then calls cuMemcpyAsync() to initiate a -remote read of the GPU data. - -The receiver maintains a cache of remote memory that it has handles open on. -This is because a call to cuIpcOpenMemHandle() can be very expensive (90usec) so -we want to avoid it when we can. The cache of remote memory is kept in a memory -pool that is associated with each endpoint. Note that we do not cache the local -memory handles because getting them is very cheap and there is no need. diff --git a/opal/mca/btl/smcuda/README.md b/opal/mca/btl/smcuda/README.md new file mode 100644 index 0000000000..6d90148924 --- /dev/null +++ b/opal/mca/btl/smcuda/README.md @@ -0,0 +1,126 @@ +# Open MPI SMCUDA design document + +Copyright (c) 2013 NVIDIA Corporation. All rights reserved. +August 21, 2013 + +This document describes the design and use of the `smcuda` BTL. + +## BACKGROUND + +The `smcuda` btl is a copy of the `sm` btl but with some additional +features. The main extra feature is the ability to make use of the +CUDA IPC APIs to quickly move GPU buffers from one GPU to another. +Without this support, the GPU buffers would all be moved into and then +out of host memory. + +## GENERAL DESIGN + +The general design makes use of the large message RDMA RGET support in +the OB1 PML. However, there are some interesting choices to make use +of it. 
First, we disable any large message RDMA support in the BTL +for host messages. This is done because we need to use the +`mca_btl_smcuda_get()` for the GPU buffers. This is also done because +the upper layers expect there to be a single mpool but we need one for +the GPU memory and one for the host memory. Since the advantages of +using RDMA with host memory is unclear, we disabled it. This means no +KNEM or CMA support built in to the `smcuda` BTL. + +Also note that we give the `smcuda` BTL a higher rank than the `sm` +BTL. This means it will always be selected even if we are doing host +only data transfers. The `smcuda` BTL is not built if it is not +requested via the `--with-cuda` flag to the configure line. + +Secondly, the `smcuda` does not make use of the traditional method of +enabling RDMA operations. The traditional method checks for the existence +of an RDMA btl hanging off the endpoint. The `smcuda` works in conjunction +with the OB1 PML and uses flags that it sends in the BML layer. + +## OTHER CONSIDERATIONS + +CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA +nodes, CUDA IPC may only work between GPUs that are not connected +over the IOH. In addition, we want to check for CUDA IPC support lazily, +when the first GPU access occurs, rather than during `MPI_Init()` time. +This complicates the design. + +## INITIALIZATION + +When the `smcuda` BTL initializes, it starts with no support for CUDA IPC. +Upon the first access of a GPU buffer, the `smcuda` checks which GPU device +it has and sends that to the remote side using a `smcuda` specific control +message. The other rank receives the message, and checks to see if there +is CUDA IPC support between the two GPUs via a call to +`cuDeviceCanAccessPeer()`. If it is true, then the `smcuda` BTL piggy backs on +the PML error handler callback to make a call into the PML and let it know +to enable CUDA IPC. We created a new flag so that the error handler does +the right thing. Large message RDMA is enabled by setting a flag in the +`bml->btl_flags` field. Control returns to the `smcuda` BTL where a reply +message is sent so the sending side can set its flag. + +At that point, the PML layer starts using the large message RDMA +support in the `smcuda` BTL. This is done in some special CUDA code +in the PML layer. + +## ESTABLISHING CUDA IPC SUPPORT + +A check has been added into both the `send` and `sendi` path in the +`smcuda` btl that checks to see if it should send a request for CUDA +IPC setup message. + +```c +/* Initiate setting up CUDA IPC support. */ +if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) { + mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint); +} +``` + +The first check is to see if the CUDA environment has been +initialized. If not, then presumably we are not sending any GPU +buffers yet and there is nothing to be done. If we are initialized, +then check the status of the CUDA IPC endpoint. If it is in the +IPC_INIT stage, then call the function to send of a control message to +the endpoint. + +On the receiving side, we first check to see if we are initialized. +If not, then send a message back to the sender saying we are not +initialized. This will cause the sender to reset its state to +IPC_INIT so it can try again on the next send. + +I considered putting the receiving side into a new state like +IPC_NOTREADY, and then when it switches to ready, to then sending the +ACK to the sender. 
The problem with this is that we would need to do +these checks during the progress loop which adds some extra overhead +as we would have to check all endpoints to see if they were ready. + +Note that any rank can initiate the setup of CUDA IPC. It is +triggered by whichever side does a send or sendi call of a GPU buffer. + +I have the sender attempt 5 times to set up the connection. After +that, we give up. Note that I do not expect many scenarios where the +sender has to resend. It could happen in a race condition where one +rank has initialized its CUDA environment but the other side has not. + +There are several states the connections can go through. + +1. IPC_INIT - nothing has happened +1. IPC_SENT - message has been sent to other side +1. IPC_ACKING - Received request and figuring out what to send back +1. IPC_ACKED - IPC ACK sent +1. IPC_OK - IPC ACK received back +1. IPC_BAD - Something went wrong, so marking as no IPC support + +## NOTE ABOUT CUDA IPC AND MEMORY POOLS + +The CUDA IPC support works in the following way. A sender makes a +call to `cuIpcGetMemHandle()` and gets a memory handle for its local +memory. The sender then sends that handle to receiving side. The +receiver calls `cuIpcOpenMemHandle()` using that handle and gets back +an address to the remote memory. The receiver then calls +`cuMemcpyAsync()` to initiate a remote read of the GPU data. + +The receiver maintains a cache of remote memory that it has handles +open on. This is because a call to `cuIpcOpenMemHandle()` can be very +expensive (90usec) so we want to avoid it when we can. The cache of +remote memory is kept in a memory pool that is associated with each +endpoint. Note that we do not cache the local memory handles because +getting them is very cheap and there is no need. diff --git a/opal/mca/btl/usnic/Makefile.am b/opal/mca/btl/usnic/Makefile.am index 42b38b32e3..48ec0bb751 100644 --- a/opal/mca/btl/usnic/Makefile.am +++ b/opal/mca/btl/usnic/Makefile.am @@ -27,7 +27,7 @@ AM_CPPFLAGS = $(opal_ofi_CPPFLAGS) -DOMPI_LIBMPI_NAME=\"$(OMPI_LIBMPI_NAME)\" -EXTRA_DIST = README.txt README.test +EXTRA_DIST = README.md README.test dist_opaldata_DATA = \ help-mpi-btl-usnic.txt diff --git a/opal/mca/btl/usnic/README.md b/opal/mca/btl/usnic/README.md new file mode 100644 index 0000000000..a1875a818f --- /dev/null +++ b/opal/mca/btl/usnic/README.md @@ -0,0 +1,330 @@ +# Design notes on usnic BTL + +## nomenclature + +* fragment - something the PML asks us to send or put, any size +* segment - something we can put on the wire in a single packet +* chunk - a piece of a fragment that fits into one segment + +a segment can contain either an entire fragment or a chunk of a fragment + +each segment and fragment has associated descriptor. + +Each segment data structure has a block of registered memory associated with +it which matches MTU for that segment + +* ACK - acks get special small segments with only enough memory for an ACK +* non-ACK segments always have a parent fragment + +* fragments are either large (> MTU) or small (<= MTU) +* a small fragment has a segment descriptor embedded within it since it + always needs exactly one. +* a large fragment has no permanently associated segments, but allocates them + as needed. 
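+
+To make these relationships concrete, here is a minimal sketch; the
+type and field names are illustrative assumptions, not the real usnic
+BTL declarations:
+
+```c
+/*
+ * Illustrative sketch only -- names are assumptions, not the real usnic
+ * BTL types.  It shows the relationships described above: every segment
+ * owns a block of registered memory sized to its MTU, a small fragment
+ * embeds its single segment, and a large fragment allocates segments as
+ * needed.
+ */
+#include <stddef.h>
+
+typedef struct example_segment {
+    void   *registered_mem;            /* registered memory, MTU-sized      */
+    size_t  mtu;
+    struct example_fragment *parent;   /* NULL only for ACK segments        */
+} example_segment_t;
+
+typedef struct example_fragment {
+    size_t length;                     /* what the PML asked us to send/put */
+} example_fragment_t;
+
+typedef struct example_small_fragment {
+    example_fragment_t base;
+    example_segment_t  segment;        /* embedded: always exactly one      */
+} example_small_fragment_t;
+
+typedef struct example_large_fragment {
+    example_fragment_t  base;
+    example_segment_t **segments;      /* allocated on demand, per chunk    */
+    size_t              num_segments;
+} example_large_fragment_t;
+```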
+
+## channels
+
+A channel is a queue pair with an associated completion queue.
+Each channel has its own MTU and r/w queue entry counts.
+
+There are 2 channels, command and data:
+* command queue is generally for higher priority fragments
+* data queue is for standard data traffic
+* command queue should possibly be called "priority" queue
+
+The command queue is shorter and has a smaller MTU than the data
+queue. This makes the command queue a lot faster than the data queue,
+so we hijack it for sending very small fragments (<= tiny_mtu,
+currently 768 bytes).
+
+The command queue is used for ACKs and tiny fragments.
+The data queue is used for everything else.
+
+PML fragments marked priority should perhaps use the command queue.
+
+## sending
+
+Normally, all send requests are simply enqueued and then actually posted
+to the NIC by the routine `opal_btl_usnic_module_progress_sends()`.
+"Fastpath" tiny sends are the exception.
+
+Each module maintains a queue of endpoints that are ready to send.
+An endpoint is ready to send if all of the following are met:
+1. the endpoint has fragments to send
+1. the endpoint has send credits
+1. the endpoint's send window is "open" (not full of un-ACKed segments)
+
+Each module also maintains a list of segments that need to be
+retransmitted. Note that the list of pending retransmissions is
+per-module, not per-endpoint.
+
+Send progression first posts any pending retransmissions, always using
+the data channel. (The reason is that if we start getting heavy
+congestion and there are lots of retransmits, it becomes more
+important than ever to prioritize ACKs; clogging the command channel
+with retransmitted data makes things worse, not better.)
+
+Next, progression loops sending segments to the endpoint at the top of
+the `endpoints_with_sends` queue. When an endpoint exhausts its send
+credits, fills its send window, or runs out of segments to send, it
+removes itself from the `endpoint_with_sends` list. Any pending ACKs
+will be picked up and piggy-backed on these sends.
+
+Finally, any endpoints that still need ACKs and whose timer has
+expired will be sent explicit ACK packets.
+
+## fragment sending
+
+The middle part of the progression loop handles both small
+(single-segment) and large (multi-segment) sends.
+
+For small fragments, the verbs descriptor within the embedded segment
+is updated with the length, the BTL header is updated, and then we
+call `opal_btl_usnic_endpoint_send_segment()` to send the segment.
+After posting, we make a PML callback if needed.
+
+For large fragments, a little more is needed. Segments from a large
+fragment have a slightly larger BTL header which contains a fragment
+ID, an offset, and a size. The fragment ID is allocated when the
+first chunk of the fragment is sent. A segment gets allocated, the
+next blob of data is copied into this segment, and the segment is
+posted. If the last chunk of the fragment has been sent, we perform a
+callback if needed and then remove the fragment from the endpoint
+send queue.
+
+## `opal_btl_usnic_endpoint_send_segment()`
+
+This is common posting code for large or small segments. It assigns a
+sequence number to a segment, checks for an ACK to piggy-back,
+posts the segment to the NIC, and then starts the retransmit timer
+by checking the segment into the hotel. Send credits are consumed
+here.
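+
+A rough sketch of that posting path is below; the struct fields and
+helper names are assumptions for exposition, not the real usnic BTL
+symbols:
+
+```c
+/*
+ * Illustrative sketch only -- the struct fields and helper names are
+ * assumptions for exposition, not the real usnic BTL symbols.  It mirrors
+ * the steps described above: assign a sequence number, piggy-back any
+ * pending ACK, post to the NIC, start the retransmit timer ("check the
+ * segment into the hotel"), and consume a send credit.
+ */
+#include <stdbool.h>
+#include <stdint.h>
+
+typedef struct {
+    uint16_t piggyback_ack;      /* sequence number being ACKed, if any  */
+} ex_btl_header_t;
+
+typedef struct {
+    uint16_t        seq;
+    ex_btl_header_t hdr;
+} ex_segment_t;
+
+typedef struct {
+    uint16_t seq_next;           /* next sequence number to assign       */
+    uint16_t ack_seq;            /* highest in-order sequence received   */
+    bool     ack_needed;         /* an ACK is waiting to be piggy-backed */
+    int      send_credits;
+} ex_endpoint_t;
+
+static void post_to_nic(ex_endpoint_t *ep, ex_segment_t *seg)
+{
+    (void)ep; (void)seg;         /* stand-in for the real NIC post       */
+}
+
+static void retrans_timer_start(ex_segment_t *seg)
+{
+    (void)seg;                   /* stand-in for the hotel check-in      */
+}
+
+static void ex_send_segment(ex_endpoint_t *ep, ex_segment_t *seg)
+{
+    seg->seq = ep->seq_next++;
+
+    if (ep->ack_needed) {
+        seg->hdr.piggyback_ack = ep->ack_seq;
+        ep->ack_needed = false;
+    }
+
+    post_to_nic(ep, seg);
+    retrans_timer_start(seg);
+    ep->send_credits--;          /* send credits are consumed here       */
+}
+```
+
+The real routine does more bookkeeping than shown; this only captures
+the ordering of the steps described above.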
+
+
+## send dataflow
+
+PML control messages with no user data are sent via:
+* `desc = usnic_alloc(size)`
+* `usnic_send(desc)`
+
+User messages below the eager limit, and the first part of larger
+messages, are sent via:
+* `desc = usnic_prepare_src(convertor, size)`
+* `usnic_send(desc)`
+
+Larger messages:
+* `desc = usnic_prepare_src(convertor, size)`
+* `usnic_put(desc)`
+
+
+`usnic_alloc()` currently asserts that the length is "small", and
+allocates and fills in a small fragment. The src pointer will point
+to the start of the associated registered memory + sizeof the BTL
+header, and the PML will put its data there.
+
+`usnic_prepare_src()` allocates either a large or small fragment based
+on size. The fragment descriptor is filled in to have 2 SG entries,
+the 1st pointing to the place where the PML should construct its
+header. If the data convertor says the data is contiguous, the 2nd SG
+entry points to the user buffer; otherwise it is null and sf_convertor
+is filled in with the address of the convertor.
+
+### `usnic_send()`
+
+If the fragment being sent is small enough, has contiguous data, and
+"very few" command queue send WQEs have been consumed, `usnic_send()`
+does a fastpath send. This means it posts the segment immediately to
+the NIC with the INLINE flag set.
+
+If all of the conditions for a fastpath send are not met, and this is
+a small fragment, the user data is copied into the associated
+registered memory at this time and the SG list in the descriptor is
+collapsed to one entry.
+
+After the checks above are done, the fragment is enqueued to be sent
+via `opal_btl_usnic_endpoint_enqueue_frag()`.
+
+### `usnic_put()`
+
+Do a fast version of what happens in `prepare_src()` (we can take
+shortcuts because we know it will always be a contiguous buffer / no
+convertor needed). The PML gives us the destination address, which we
+save on the fragment (this is the sentinel value that the underlying
+engine uses to know that this is a PUT and not a SEND), and the
+fragment is enqueued for processing.
+
+### `opal_btl_usnic_endpoint_enqueue_frag()`
+
+This appends the fragment to the "to be sent" list of the endpoint and
+conditionally adds the endpoint to the list of endpoints with data to
+send via `opal_btl_usnic_check_rts()`.
+
+## receive dataflow
+
+BTL packets have one of 3 types in the header: frag, chunk, or ack.
+
+* A frag packet is a full PML fragment.
+* A chunk packet is a piece of a fragment that needs to be reassembled.
+* An ack packet is header only with a sequence number being ACKed.
+
+* Both frag and chunk packets go through some of the same processing.
+* Both may carry piggy-backed ACKs which may need to be processed.
+* Both have sequence numbers which must be processed and may result in
+  dropping the packet and/or queueing an ACK to the sender.
+
+Frag packets may be either regular PML fragments or PUT segments. If
+the "put_addr" field of the BTL header is set, this is a PUT and the
+data is copied directly to the user buffer. If this field is NULL,
+the segment is passed up to the PML. The PML is expected to do
+everything it needs with this packet in the callback, including
+copying data out if needed. Once the callback is complete, the
+receive buffer is recycled.
+
+Chunk packets are parts of a larger fragment. If an active fragment
+receive for the matching fragment ID cannot be found, a new fragment
+info descriptor is allocated. If this is not a PUT
+(`put_addr == NULL`), we `malloc()` a buffer to reassemble the
+fragment into. Each subsequent chunk is copied either into this
+reassembly buffer or directly into user memory.
When the last chunk of a fragment arrives, +a PML callback is made for non-PUTs, then the fragment info descriptor +is released. + +## fast receive optimization + +In order to optimize latency of small packets, the component progress +routine implements a fast path for receives. If the first completion +is a receive on the priority queue, then it is handled by a routine +called `opal_btl_usnic_recv_fast()` which does nothing but validates +that the packet is OK to be received (sequence number OK and not a +DUP) and then delivers it to the PML. This packet is recorded in the +channel structure, and all bookeeping for the packet is deferred until +the next time `component_progress` is called again. + +This fast path cannot be taken every time we pass through +`component_progress` because there will be other completions that need +processing, and the receive bookeeping for one fast receive must be +complete before allowing another fast receive to occur, as only one +recv segment can be saved for deferred processing at a time. This is +handled by maintaining a variable in `opal_btl_usnic_recv_fast()` +called fastpath_ok which is set to false every time the fastpath is +taken. A call into the regular progress routine will set this flag +back to true. + +## reliability: + +* every packet has sequence # +* each endpoint has a "send window" , currently 4096 entries. +* once a segment is sent, it is saved in window array until ACK is received +* ACKs acknowledge all packets <= specified sequence # +* rcvr only ACKs a sequence # when all packets up to that sequence have arrived + +* each pkt has dflt retrans timer of 100ms +* packet will be scheduled for retrans if timer expires + +Once a segment is sent, it always has its retransmit timer started. +This is accomplished by `opal_hotel_checkin()`. +Any time a segment is posted to the NIC for retransmit, it is checked out +of the hotel (timer stopped). +So, a send segment is always in one of 4 states: +* on free list, unallocated +* on endpoint to-send list in the case of segment associated with small fragment +* posted to NIC and in hotel awaiting ACK +* on module re-send list awaiting retransmission + +rcvr: +* if a pkt with seq >= expected seq is received, schedule ack of largest + in-order sequence received if not already scheduled. dflt time is 50us +* if a packet with seq < expected seq arrives, we send an ACK immediately, + as this indicates a lost ACK + +sender: +* duplicate ACK triggers immediate retrans if one is not pending for + that segment + +## Reordering induced by two queues and piggy-backing: + +ACKs can be reordered- +* not an issue at all, old ACKs are simply ignored + +Sends can be reordered- +* (small send can jump far ahead of large sends) +* large send followed by lots of small sends could trigger many + retrans of the large sends. smalls would have to be paced pretty + precisely to keep command queue empty enough and also beat out the + large sends. send credits limit how many larges can be queued on + the sender, but there could be many on the receiver + + +## RDMA emulation + +We emulate the RDMA PUT because it's more efficient than regular send: +it allows the receive to copy directly to the target buffer +(vs. making an intermediate copy out of the bounce buffer). + +It would actually be better to morph this PUT into a GET -- GET would +be slightly more efficient. In short, when the target requests the +actual RDMA data, with PUT, the request has to go up to the PML, which +will then invoke PUT on the source's BTL module. 
With GET, the target +issues the GET, and the source BTL module can reply without needing to +go up the stack to the PML. + +Once we start supporting RDMA in hardware: + +* we need to provide `module.btl_register_mem` and + `module.btl_deregister_mem` functions (see openib for an example) +* we need to put something meaningful in + `btl_usnic_frag.h:mca_btl_base_registration_handle_t`. +* we need to set `module.btl_registration_handle_size` to `sizeof(struct + mca_btl_base_registration_handle_t`). +* `module.btl_put` / `module.btl_get` will receive the + `mca_btl_base_registration_handle_t` from the peer as a cookie. + +Also, `module.btl_put` / `module.btl_get` do not need to make +descriptors (this was an optimization added in BTL 3.0). They are now +called with enough information to do whatever they need to do. +module.btl_put still makes a descriptor and submits it to the usnic +sending engine so as to utilize a common infrastructure for send and +put. + +But it doesn't necessarily have to be that way -- we could optimize +out the use of the descriptors. Have not investigated how easy/hard +that would be. + +## libfabric abstractions: + +* `fi_fabric`: corresponds to a VIC PF +* `fi_domain`: corresponds to a VIC VF +* `fi_endpoint`: resources inside the VIC VF (basically a QP) + +## `MPI_THREAD_MULTIPLE` support + +In order to make usnic btl thread-safe, the mutex locks are issued to +protect the critical path. ie; libfabric routines, book keeping, etc. + +The said lock is `btl_usnic_lock`. It is a RECURSIVE lock, meaning +that the same thread can take the lock again even if it already has +the lock to allow the callback function to post another segment right +away if we know that the current segment is completed inline. (So we +can call send in send without deadlocking) + +These two functions taking care of hotel checkin/checkout and we have +to protect that part. So we take the mutex lock before we enter the +function. + +* `opal_btl_usnic_check_rts()` +* `opal_btl_usnic_handle_ack()` + +We also have to protect the call to libfabric routines + +* `opal_btl_usnic_endpoint_send_segment()` (`fi_send`) +* `opal_btl_usnic_recv_call()` (`fi_recvmsg`) + +have to be protected as well. + +Also cclient connection checking (`opal_btl_usnic_connectivity_ping`) +has to be protected. This happens only in the beginning but cclient +communicate with cagent through `opal_fd_read/write()` and if two or +more clients do `opal_fd_write()` at the same time, the data might be +corrupt. + +With this concept, many functions in btl/usnic that make calls to the +listed functions are protected by `OPAL_THREAD_LOCK` macro which will +only be active if the user specify `MPI_Init_thread()` with +`MPI_THREAD_MULTIPLE` support. diff --git a/opal/mca/btl/usnic/README.txt b/opal/mca/btl/usnic/README.txt deleted file mode 100644 index bc589e36c7..0000000000 --- a/opal/mca/btl/usnic/README.txt +++ /dev/null @@ -1,383 +0,0 @@ -Design notes on usnic BTL - -====================================== -nomenclature - -fragment - something the PML asks us to send or put, any size -segment - something we can put on the wire in a single packet -chunk - a piece of a fragment that fits into one segment - -a segment can contain either an entire fragment or a chunk of a fragment - -each segment and fragment has associated descriptor. 
- -Each segment data structure has a block of registered memory associated with -it which matches MTU for that segment -ACK - acks get special small segments with only enough memory for an ACK -non-ACK segments always have a parent fragment - -fragments are either large (> MTU) or small (<= MTU) -a small fragment has a segment descriptor embedded within it since it -always needs exactly one. - -a large fragment has no permanently associated segments, but allocates them -as needed. - -====================================== -channels - -a channel is a queue pair with an associated completion queue -each channel has its own MTU and r/w queue entry counts - -There are 2 channels, command and data -command queue is generally for higher priority fragments -data queue is for standard data traffic -command queue should possibly be called "priority" queue - -command queue is shorter and has a smaller MTU that the data queue -this makes the command queue a lot faster than the data queue, so we -hijack it for sending very small fragments (<= tiny_mtu, currently 768 bytes) - -command queue is used for ACKs and tiny fragments -data queue is used for everything else - -PML fragments marked priority should perhaps use command queue - -====================================== -sending - -Normally, all send requests are simply enqueued and then actually posted -to the NIC by the routine opal_btl_usnic_module_progress_sends(). -"fastpath" tiny sends are the exception. - -Each module maintains a queue of endpoints that are ready to send. -An endpoint is ready to send if all of the following are met: -- the endpoint has fragments to send -- the endpoint has send credits -- the endpoint's send window is "open" (not full of un-ACKed segments) - -Each module also maintains a list of segments that need to be retransmitted. -Note that the list of pending retrans is per-module, not per-endpoint. - -send progression first posts any pending retransmissions, always using the -data channel. (reason is that if we start getting heavy congestion and -there are lots of retransmits, it becomes more important than ever to -prioritize ACKs, clogging command channel with retrans data makes things worse, -not better) - -Next, progression loops sending segments to the endpoint at the top of -the "endpoints_with_sends" queue. When an endpoint exhausts its send -credits or fills its send window or runs out of segments to send, it removes -itself from the endpoint_with_sends list. Any pending ACKs will be -picked up and piggy-backed on these sends. - -Finally, any endpoints that still need ACKs whose timer has expired will -be sent explicit ACK packets. - -[double-click fragment sending] -The middle part of the progression loop handles both small (single-segment) -and large (multi-segment) sends. - -For small fragments, the verbs descriptor within the embedded segment is -updated with length, BTL header is updated, then we call -opal_btl_usnic_endpoint_send_segment() to send the segment. -After posting, we make a PML callback if needed. - -For large fragments, a little more is needed. segments froma large -fragment have a slightly larger BTL header which contains a fragment ID, -and offset, and a size. The fragment ID is allocated when the first chunk -the fragment is sent. A segment gets allocated, next blob of data is -copied into this segment, segment is posted. If last chunk of fragment -sent, perform callback if needed, then remove fragment from endpoint -send queue. 
- -[double-click opal_btl_usnic_endpoint_send_segment()] - -This is common posting code for large or small segments. It assigns a -sequence number to a segment, checks for an ACK to piggy-back, -posts the segment to the NIC, and then starts the retransmit timer -by checking the segment into hotel. Send credits are consumed here. - - -====================================== -send dataflow - -PML control messages with no user data are sent via: -desc = usnic_alloc(size) -usnic_send(desc) - -user messages less than eager limit and 1st part of larger -messages are sent via: -desc = usnic_prepare_src(convertor, size) -usnic_send(desc) - -larger msgs -desc = usnic_prepare_src(convertor, size) -usnic_put(desc) - - -usnic_alloc() currently asserts the length is "small", allocates and -fills in a small fragment. src pointer will point to start of -associated registered mem + sizeof BTL header, and PML will put its -data there. - -usnic_prepare_src() allocated either a large or small fragment based on size -The fragment descriptor is filled in to have 2 SG entries, 1st pointing to -place where PML should construct its header. If the data convertor says -data is contiguous, 2nd SG entry points to user buffer, else it is null and -sf_convertor is filled in with address of convertor. - -usnic_send() -If the fragment being sent is small enough, has contiguous data, and -"very few" command queue send WQEs have been consumed, usnic_send() does -a fastpath send. This means it posts the segment immediately to the NIC -with INLINE flag set. - -If all of the conditions for fastpath send are not met, and this is a small -fragment, the user data is copied into the associated registered memory at this -time and the SG list in the descriptor is collapsed to one entry. - -After the checks above are done, the fragment is enqueued to be sent -via opal_btl_usnic_endpoint_enqueue_frag() - -usnic_put() -Do a fast version of what happens in prepare_src() (can take shortcuts -because we know it will always be a contiguous buffer / no convertor -needed). PML gives us the destination address, which we save on the -fragment (which is the sentinel value that the underlying engine uses -to know that this is a PUT and not a SEND), and the fragment is -enqueued for processing. - -opal_btl_usnic_endpoint_enqueue_frag() -This appends the fragment to the "to be sent" list of the endpoint and -conditionally adds the endpoint to the list of endpoints with data to send -via opal_btl_usnic_check_rts() - -====================================== -receive dataflow - -BTL packets has one of 3 types in header: frag, chunk, or ack. - -A frag packet is a full PML fragment. -A chunk packet is a piece of a fragment that needs to be reassembled. -An ack packet is header only with a sequence number being ACKed. - -Both frag and chunk packets go through some of the same processing. -Both may carry piggy-backed ACKs which may need to be processed. -Both have sequence numbers which must be processed and may result in -dropping the packet and/or queueing an ACK to the sender. - -frag packets may be either regular PML fragments or PUT segments. -If the "put_addr" field of the BTL header is set, this is a PUT and -the data is copied directly to the user buffer. If this field is NULL, -the segment is passed up to the PML. The PML is expected to do everything -it needs with this packet in the callback, including copying data out if -needed. Once the callback is complete, the receive buffer is recycled. - -chunk packets are parts of a larger fragment. 
If an active fragment receive -for the matching fragment ID cannot be found, and new fragment info -descriptor is allocated. If this is not a PUT (put_addr == NULL), we -malloc() data to reassemble the fragment into. Each subsequent chunk -is copied either into this reassembly buffer or directly into user memory. -When the last chunk of a fragment arrives, a PML callback is made for non-PUTs, -then the fragment info descriptor is released. - -====================================== -fast receive optimization - -In order to optimize latency of small packets, the component progress routine -implements a fast path for receives. If the first completion is a receive on -the priority queue, then it is handled by a routine called -opal_btl_usnic_recv_fast() which does nothing but validates that the packet -is OK to be received (sequence number OK and not a DUP) and then delivers it -to the PML. This packet is recorded in the channel structure, and all -bookeeping for the packet is deferred until the next time component_progress -is called again. - -This fast path cannot be taken every time we pass through component_progress -because there will be other completions that need processing, and the receive -bookeeping for one fast receive must be complete before allowing another fast -receive to occur, as only one recv segment can be saved for deferred -processing at a time. This is handled by maintaining a variable in -opal_btl_usnic_recv_fast() called fastpath_ok which is set to false every time -the fastpath is taken. A call into the regular progress routine will set this -flag back to true. - -====================================== -reliability: - -every packet has sequence # -each endpoint has a "send window" , currently 4096 entries. -once a segment is sent, it is saved in window array until ACK is received -ACKs acknowledge all packets <= specified sequence # -rcvr only ACKs a sequence # when all packets up to that sequence have arrived - -each pkt has dflt retrans timer of 100ms -packet will be scheduled for retrans if timer expires - -Once a segment is sent, it always has its retransmit timer started. -This is accomplished by opal_hotel_checkin() -Any time a segment is posted to the NIC for retransmit, it is checked out -of the hotel (timer stopped). -So, a send segment is always in one of 4 states: -- on free list, unallocated -- on endpoint to-send list in the case of segment associated with small fragment -- posted to NIC and in hotel awaiting ACK -- on module re-send list awaiting retransmission - -rcvr: -- if a pkt with seq >= expected seq is received, schedule ack of largest - in-order sequence received if not already scheduled. dflt time is 50us -- if a packet with seq < expected seq arrives, we send an ACK immediately, - as this indicates a lost ACK - -sender: -duplicate ACK triggers immediate retrans if one is not pending for that segment - -====================================== -Reordering induced by two queues and piggy-backing: - -ACKs can be reordered- - not an issue at all, old ACKs are simply ignored - -Sends can be reordered- -(small send can jump far ahead of large sends) -large send followed by lots of small sends could trigger many retrans -of the large sends. smalls would have to be paced pretty precisely to -keep command queue empty enough and also beat out the large sends. 
-send credits limit how many larges can be queued on the sender, but there -could be many on the receiver - - -====================================== -RDMA emulation - -We emulate the RDMA PUT because it's more efficient than regular send: -it allows the receive to copy directly to the target buffer -(vs. making an intermediate copy out of the bounce buffer). - -It would actually be better to morph this PUT into a GET -- GET would -be slightly more efficient. In short, when the target requests the -actual RDMA data, with PUT, the request has to go up to the PML, which -will then invoke PUT on the source's BTL module. With GET, the target -issues the GET, and the source BTL module can reply without needing to -go up the stack to the PML. - -Once we start supporting RDMA in hardware: - -- we need to provide module.btl_register_mem and - module.btl_deregister_mem functions (see openib for an example) -- we need to put something meaningful in - btl_usnic_frag.h:mca_btl_base_registration_handle_t. -- we need to set module.btl_registration_handle_size to sizeof(struct - mca_btl_base_registration_handle_t). -- module.btl_put / module.btl_get will receive the - mca_btl_base_registration_handle_t from the peer as a cookie. - -Also, module.btl_put / module.btl_get do not need to make descriptors -(this was an optimization added in BTL 3.0). They are now called with -enough information to do whatever they need to do. module.btl_put -still makes a descriptor and submits it to the usnic sending engine so -as to utilize a common infrastructure for send and put. - -But it doesn't necessarily have to be that way -- we could optimize -out the use of the descriptors. Have not investigated how easy/hard -that would be. - -====================================== - -November 2014 / SC 2014 -Update February 2015 - -The usnic BTL code has been unified across master and the v1.8 -branches. - - NOTE: As of May 2018, this is no longer true. This was generally - only necessary back when the BTLs were moved from the OMPI layer to - the OPAL layer. Now that the BTLs have been down in OPAL for - several years, this tomfoolery is no longer necessary. This note - is kept for historical purposes, just in case someone needs to go - back and look at the v1.8 series. - -That is, you can copy the code from v1.8:ompi/mca/btl/usnic/* to -master:opal/mca/btl/usnic*, and then only have to make 3 changes in -the resulting code in master: - -1. Edit Makefile.am: s/ompi/opal/gi -2. Edit configure.m4: s/ompi/opal/gi - --> EXCEPT for: - - opal_common_libfabric_* (which will eventually be removed, - when the embedded libfabric goes away) - - OPAL_BTL_USNIC_FI_EXT_USNIC_H (which will eventually be - removed, when the embedded libfabric goes away) - - OPAL_VAR_SCOPE_* -3. Edit Makefile.am: change -DBTL_IN_OPAL=0 to -DBTL_IN_OPAL=1 - -*** Note: the BTL_IN_OPAL preprocessor macro is set in Makefile.am - rather that in btl_usnic_compat.h to avoid all kinds of include - file dependency issues (i.e., btl_usnic_compat.h would need to be - included first, but it requires some data structures to be - defined, which means it either can't be first or we have to - declare various structs first... just put BTL_IN_OPAL in - Makefile.am and be happy). - -*** Note 2: CARE MUST BE TAKEN WHEN COPYING THE OTHER DIRECTION! It - is *not* as simple as simple s/opal/ompi/gi in configure.m4 and - Makefile.am. It certainly can be done, but there's a few strings - that need to stay "opal" or "OPAL" (e.g., OPAL_HAVE_FOO). 
- Hence, the string replace will likely need to be done via manual - inspection. - -Things still to do: - -- VF/PF sanity checks in component.c:check_usnic_config() uses - usnic-specific fi_provider info. The exact mechanism might change - as provider-specific info is still being discussed upstream. - -- component.c:usnic_handle_cq_error is using a USD_* constant from - usnic_direct. Need to get that value through libfabric somehow. - -====================================== - -libfabric abstractions: - -fi_fabric: corresponds to a VIC PF -fi_domain: corresponds to a VIC VF -fi_endpoint: resources inside the VIC VF (basically a QP) - -====================================== - -MPI_THREAD_MULTIPLE support - -In order to make usnic btl thread-safe, the mutex locks are issued -to protect the critical path. ie; libfabric routines, book keeping, etc. - -The said lock is btl_usnic_lock. It is a RECURSIVE lock, meaning that -the same thread can take the lock again even if it already has the lock to -allow the callback function to post another segment right away if we know -that the current segment is completed inline. (So we can call send in send -without deadlocking) - -These two functions taking care of hotel checkin/checkout and we -have to protect that part. So we take the mutex lock before we enter the -function. - -- opal_btl_usnic_check_rts() -- opal_btl_usnic_handle_ack() - -We also have to protect the call to libfabric routines - -- opal_btl_usnic_endpoint_send_segment() (fi_send) -- opal_btl_usnic_recv_call() (fi_recvmsg) - -have to be protected as well. - -Also cclient connection checking (opal_btl_usnic_connectivity_ping) has to be -protected. This happens only in the beginning but cclient communicate with cagent -through opal_fd_read/write() and if two or more clients do opal_fd_write() at the -same time, the data might be corrupt. - -With this concept, many functions in btl/usnic that make calls to the -listed functions are protected by OPAL_THREAD_LOCK macro which will only -be active if the user specify MPI_Init_thread() with MPI_THREAD_MULTIPLE -support. diff --git a/oshmem/mca/memheap/README b/oshmem/mca/memheap/README deleted file mode 100644 index 88f93e8a0b..0000000000 --- a/oshmem/mca/memheap/README +++ /dev/null @@ -1,50 +0,0 @@ -# Copyright (c) 2013 Mellanox Technologies, Inc. -# All rights reserved -# $COPYRIGHT$ -MEMHEAP Infrustructure documentation ------------------------------------- - -MEMHEAP Infrustructure is responsible for managing the symmetric heap. -The framework currently has following components: buddy and ptmalloc. buddy which uses a buddy allocator in order to manage the Memory allocations on the symmetric heap. Ptmalloc is an adaptation of ptmalloc3. - -Additional components may be added easily to the framework by defining the component's and the module's base and extended structures, and their funtionalities. - -The buddy allocator has the following data structures: -1. Base component - of type struct mca_memheap_base_component_2_0_0_t -2. Base module - of type struct mca_memheap_base_module_t -3. Buddy component - of type struct mca_memheap_base_component_2_0_0_t -4. Buddy module - of type struct mca_memheap_buddy_module_t extending the base module (struct mca_memheap_base_module_t) - -Each data structure includes the following fields: -1. Base component - memheap_version, memheap_data and memheap_init -2. Base module - Holds pointers to the base component and to the functions: alloc, free and finalize -3. Buddy component - is a base component. -4. 
Buddy module - Extends the base module and holds additional data on the components's priority, buddy allocator, - maximal order of the symmetric heap, symmetric heap, pointer to the symmetric heap and hashtable maintaining the size of each allocated address. - -In the case that the user decides to implement additional components, the Memheap infrastructure chooses a component with the maximal priority. -Handling the component opening is done under the base directory, in three stages: -1. Open all available components. Implemented by memheap_base_open.c and called from shmem_init. -2. Select the maximal priority component. This procedure involves the initialization of all components and then their - finalization except to the chosen component. It is implemented by memheap_base_select.c and called from shmem_init. -3. Close the max priority active cmponent. Implemented by memheap_base_close.c and called from shmem finalize. - - -Buddy Component/Module ----------------------- - -Responsible for handling the entire activities of the symmetric heap. -The supported activities are: - - buddy_init (Initialization) - - buddy_alloc (Allocates a variable on the symmetric heap) - - buddy_free (frees a variable previously allocated on the symetric heap) - - buddy_finalize (Finalization). - -Data members of buddy module: - priority. The module's priority. - - buddy allocator: bits, num_free, lock and the maximal order (log2 of the maximal size) - of a variable on the symmetric heap. Buddy Allocator gives the offset in the symmetric heap - where a variable should be allocated. - - symmetric_heap: a range of reserved addresses (equal in all executing PE's) dedicated to "shared memory" allocation. - - symmetric_heap_hashtable (holding the size of an allocated variable on the symmetric heap. - used to free an allocated variable on the symmetric heap) - diff --git a/oshmem/mca/memheap/README.md b/oshmem/mca/memheap/README.md new file mode 100644 index 0000000000..b487eb08e5 --- /dev/null +++ b/oshmem/mca/memheap/README.md @@ -0,0 +1,71 @@ +# MEMHEAP infrastructure documentation + +Copyright (c) 2013 Mellanox Technologies, Inc. + All rights reserved + +MEMHEAP Infrustructure is responsible for managing the symmetric heap. +The framework currently has following components: buddy and +ptmalloc. buddy which uses a buddy allocator in order to manage the +Memory allocations on the symmetric heap. Ptmalloc is an adaptation of +ptmalloc3. + +Additional components may be added easily to the framework by defining +the component's and the module's base and extended structures, and +their funtionalities. + +The buddy allocator has the following data structures: + +1. Base component - of type struct mca_memheap_base_component_2_0_0_t +2. Base module - of type struct mca_memheap_base_module_t +3. Buddy component - of type struct mca_memheap_base_component_2_0_0_t +4. Buddy module - of type struct mca_memheap_buddy_module_t extending + the base module (struct mca_memheap_base_module_t) + +Each data structure includes the following fields: + +1. Base component - memheap_version, memheap_data and memheap_init +2. Base module - Holds pointers to the base component and to the + functions: alloc, free and finalize +3. Buddy component - is a base component. +4. Buddy module - Extends the base module and holds additional data on + the components's priority, buddy allocator, + maximal order of the symmetric heap, symmetric heap, pointer to the + symmetric heap and hashtable maintaining the size of each allocated + address. 
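+
+As a rough illustration of the layering described above, here is a
+minimal sketch; the struct and field names are assumptions and do not
+match the actual MCA definitions:
+
+```c
+/*
+ * Illustrative sketch only -- the struct and field names are assumptions
+ * and do not match the actual mca_memheap_* definitions.  It shows the
+ * shape described above: a base module holding alloc/free/finalize entry
+ * points, and a buddy module extending it with allocator-specific state.
+ */
+#include <stddef.h>
+
+typedef struct example_memheap_module {
+    void *(*alloc)(size_t size);       /* allocate on the symmetric heap    */
+    int   (*free)(void *ptr);          /* free a symmetric-heap variable    */
+    int   (*finalize)(void);           /* tear the module down              */
+} example_memheap_module_t;
+
+typedef struct example_memheap_buddy_module {
+    example_memheap_module_t super;    /* "extends" the base module          */
+    int          priority;             /* used to select the active component */
+    unsigned int max_order;            /* log2 of the largest allocation     */
+    void        *symmetric_heap;       /* reserved, symmetric address range  */
+    void        *allocation_sizes;     /* hashtable: address -> size         */
+} example_memheap_buddy_module_t;
+```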
+ +In the case that the user decides to implement additional components, +the Memheap infrastructure chooses a component with the maximal +priority. Handling the component opening is done under the base +directory, in three stages: +1. Open all available components. Implemented by memheap_base_open.c + and called from shmem_init. +2. Select the maximal priority component. This procedure involves the + initialization of all components and then their finalization except + to the chosen component. It is implemented by memheap_base_select.c + and called from shmem_init. +3. Close the max priority active cmponent. Implemented by + memheap_base_close.c and called from shmem finalize. + + +## Buddy Component/Module + +Responsible for handling the entire activities of the symmetric heap. +The supported activities are: + +1. buddy_init (Initialization) +1. buddy_alloc (Allocates a variable on the symmetric heap) +1. buddy_free (frees a variable previously allocated on the symetric heap) +1. buddy_finalize (Finalization). + +Data members of buddy module: + +1. priority. The module's priority. +1. buddy allocator: bits, num_free, lock and the maximal order (log2 + of the maximal size) of a variable on the symmetric heap. Buddy + Allocator gives the offset in the symmetric heap where a variable + should be allocated. +1. symmetric_heap: a range of reserved addresses (equal in all + executing PE's) dedicated to "shared memory" allocation. +1. symmetric_heap_hashtable (holding the size of an allocated variable + on the symmetric heap. used to free an allocated variable on the + symmetric heap) diff --git a/test/runtime/README b/test/runtime/README deleted file mode 100644 index 9ee84a67b9..0000000000 --- a/test/runtime/README +++ /dev/null @@ -1,7 +0,0 @@ -The functions in this directory are all intended to test registry operations against a persistent seed. Thus, they perform a system init/finalize. The functions in the directory above this one should be used to test basic registry operations within the replica - they will isolate the replica so as to avoid the communications issues and the init/finalize problems in other subsystems that may cause problems here. - -To run these tests, you need to first start a persistent daemon. This can be done using the command: - -orted --seed --scope public --persistent - -The daemon will "daemonize" itself and establish the registry (as well as other central services) replica, and then return a system prompt. You can then run any of these functions. If desired, you can utilize gdb and/or debug options on the persistent orted to watch/debug replica operations as well. diff --git a/test/runtime/README.md b/test/runtime/README.md new file mode 100644 index 0000000000..9af61944f5 --- /dev/null +++ b/test/runtime/README.md @@ -0,0 +1,20 @@ +The functions in this directory are all intended to test registry +operations against a persistent seed. Thus, they perform a system +init/finalize. The functions in the directory above this one should be +used to test basic registry operations within the replica - they will +isolate the replica so as to avoid the communications issues and the +init/finalize problems in other subsystems that may cause problems +here. + +To run these tests, you need to first start a persistent daemon. This +can be done using the command: + +``` +orted --seed --scope public --persistent +``` + +The daemon will "daemonize" itself and establish the registry (as well +as other central services) replica, and then return a system +prompt. 
You can then run any of these functions. If desired, you can +utilize gdb and/or debug options on the persistent orted to +watch/debug replica operations as well.