
Convert all README files to Markdown

A mindless task for a lazy weekend: convert all the README and
README.txt files to Markdown.  Paired with the slow conversion of all
of our man pages to Markdown, this gives a uniform language to the
Open MPI docs.

This commit moved a bunch of copyright headers out of the top-level
README.txt file, so I updated the relevant copyright header years in
the top-level LICENSE file to match what was removed from README.txt.

Additionally, this commit did (very) little to update the actual
content of the README files.  A very small number of updates were made
for topics that I found blatantly obvious while Markdown-izing the
content, but in general, I did not update content during this commit.
For example, there's still quite a bit of text about ORTE that was not
meaningfully updated.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Co-authored-by: Josh Hursey <jhursey@us.ibm.com>
This commit is contained in:
Jeff Squyres 2020-11-08 13:19:39 -05:00
parent 686c2142e2
commit c960d292ec
53 changed files with 4558 additions and 4582 deletions

272
HACKING

@@ -1,272 +0,0 @@
Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
University Research and Technology
Corporation. All rights reserved.
Copyright (c) 2004-2005 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
University of Stuttgart. All rights reserved.
Copyright (c) 2004-2005 The Regents of the University of California.
All rights reserved.
Copyright (c) 2008-2020 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2013 Intel, Inc. All rights reserved.
$COPYRIGHT$
Additional copyrights may follow
$HEADER$
Overview
========
This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).
Developer Builds: Compiler Pickyness by Default
===============================================
If you are building Open MPI from a Git clone (i.e., there is a ".git"
directory in your build tree), the default build includes extra
compiler pickyness, which will result in more compiler warnings than
in non-developer builds. Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.
Developers can disable this picky-by-default behavior by using the
--disable-picky configure option. Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
".git" is in your source tree, but not in your build tree).
Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if ".git" was found
in your build tree. This is no longer true. You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:
--enable-debug
--enable-mem-debug
--enable-mem-profile
NOTE: These options are really only relevant to those who are
developing Open MPI itself. They are not generally helpful for
debugging general MPI applications.
Use of GNU Autoconf, Automake, and Libtool (and m4)
===================================================
You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree). If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.
If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements). The specific versions
required depend on if you are using the Git master branch or a release
branch (and which release branch you are using). The specific
versions can be found here:
https://www.open-mpi.org/source/building.php
You can check what versions of the autotools you have installed with
the following:
shell$ m4 --version
shell$ autoconf --version
shell$ automake --version
shell$ libtoolize --version
Required version levels for all the OMPI releases can be found here:
https://www.open-mpi.org/source/building.php
To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools. There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example). You *WILL* have problems if you
do not use recent versions of the GNU tools.
If you need newer versions, you are *strongly* encouraged to heed the
following advice:
NOTE: On MacOS/X, the default "libtool" program is different than the
GNU libtool. You must download and install the GNU version
(e.g., via MacPorts, Homebrew, or some other mechanism).
1. Unless your OS distribution has easy-to-use binary installations,
the sources can be can be downloaded from:
ftp://ftp.gnu.org/gnu/autoconf/
ftp://ftp.gnu.org/gnu/automake/
ftp://ftp.gnu.org/gnu/libtool/
and if you need it:
ftp://ftp.gnu.org/gnu/m4/
NOTE: It is certainly easiest to download/build/install all four of
these tools together. But note that Open MPI has no specific m4
requirements; it is only listed here because Autoconf requires
minimum versions of GNU m4. Hence, you may or may not *need* to
actually install a new version of GNU m4. That being said, if you
are confused or don't know, just install the latest GNU m4 with the
rest of the GNU Autotools and everything will work out fine.
2. Build and install the tools in the following order:
2a. m4
2b. Autoconf
2c. Automake
2d. Libtool
3. You MUST install the last three tools (Autoconf, Automake, Libtool)
into the same prefix directory. These three tools are somewhat
inter-related, and if they're going to be used together, they MUST
share a common installation prefix.
You can install m4 anywhere as long as it can be found in the path;
it may be convenient to install it in the same prefix as the other
three. Or you can use any recent-enough m4 that is in your path.
3a. It is *strongly* encouraged that you do not install your new
versions over the OS-installed versions. This could cause
other things on your system to break. Instead, install into
$HOME/local, or /usr/local, or wherever else you tend to
install "local" kinds of software.
3b. In doing so, be sure to prefix your $path with the directory
where they are installed. For example, if you install into
$HOME/local, you may want to edit your shell startup file
(.bashrc, .cshrc, .tcshrc, etc.) to have something like:
# For bash/sh:
export PATH=$HOME/local/bin:$PATH
# For csh/tcsh:
set path = ($HOME/local/bin $path)
3c. Ensure to set your $path *BEFORE* you configure/build/install
the four packages.
4. All four packages require two simple commands to build and
install (where PREFIX is the prefix discussed in 3, above).
shell$ cd <m4 directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
--> If you are using the csh or tcsh shells, be sure to run the
"rehash" command after you install each package.
shell$ cd <autoconf directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
--> If you are using the csh or tcsh shells, be sure to run the
"rehash" command after you install each package.
shell$ cd <automake directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
--> If you are using the csh or tcsh shells, be sure to run the
"rehash" command after you install each package.
shell$ cd <libtool directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
--> If you are using the csh or tcsh shells, be sure to run the
"rehash" command after you install each package.
m4, Autoconf and Automake build and install very quickly; Libtool will
take a minute or two.
5. You can now run OMPI's top-level "autogen.pl" script. This script
will invoke the GNU Autoconf, Automake, and Libtool commands in the
proper order and setup to run OMPI's top-level "configure" script.
Running autogen.pl may take a few minutes, depending on your
system. It's not very exciting to watch. :-)
If you have a multi-processor system, enabling the multi-threaded
behavior in Automake 1.11 (or newer) can result in autogen.pl
running faster. Do this by setting the AUTOMAKE_JOBS environment
variable to the number of processors (threads) that you want it to
use before invoking autogen.pl. For example (you can again put
this in your shell startup files):
# For bash/sh:
export AUTOMAKE_JOBS=4
# For csh/tcsh:
set AUTOMAKE_JOBS 4
5a. You generally need to run autogen.pl whenever the top-level
file "configure.ac" changes, or any files in the config/ or
<project>/config/ directories change (these directories are
where a lot of "include" files for OMPI's configure script
live).
5b. You do *NOT* need to re-run autogen.pl if you modify a
Makefile.am.
Use of Flex
===========
Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs). Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.
Note that no testing has been performed to see what the minimum
version of Flex is required by Open MPI. We suggest that you use
v2.5.35 at the earliest.
*** NOTE: Windows developer builds of Open MPI *require* Flex version
2.5.35. Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
flex version is on Windows; we suggest that you use 2.5.35 at the
earliest. It is for this reason that the
contrib/dist/make_dist_tarball script checks for a Windows-friendly
version of flex before continuing.
For now, Open MPI will allow developer builds with Flex 2.5.4. This
is primarily motivated by the fact that RedHat/Centos 5 ships with
Flex 2.5.4. It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.
Note that the flex-generated code generates some compiler warnings on
some platforms, but the warnings do not seem to be consistent or
uniform on all platforms, compilers, and flex versions. As such, we
have done little to try to remove those warnings.
If you do not have Flex installed, it can be downloaded from the
following URL:
https://github.com/westes/flex
Use of Pandoc
=============
Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree). If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.
The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).
You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree. If configure cannot find Pandoc >=v1.12, it will abort.
If you need to install Pandoc, check your operating system-provided
packages (to include MacOS Homebrew and MacPorts). The Pandoc project
itself also offers binaries for their releases:
https://pandoc.org/

258
HACKING.md (new file)

@@ -0,0 +1,258 @@
# Open MPI Hacking / Developer's Guide
## Overview
This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).
## Developer Builds: Compiler Pickyness by Default
If you are building Open MPI from a Git clone (i.e., there is a `.git`
directory in your build tree), the default build includes extra
compiler pickyness, which will result in more compiler warnings than
in non-developer builds. Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.
Developers can disable this picky-by-default behavior by using the
`--disable-picky` configure option. Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
`.git` is in your source tree, but not in your build tree).
Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if `.git` was found
in your build tree. This is no longer true. You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:
* `--enable-debug`
* `--enable-mem-debug`
* `--enable-mem-profile`
***NOTE:*** These options are really only relevant to those who are
developing Open MPI itself. They are not generally helpful for
debugging general MPI applications.
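For instance, the debugging features can be enabled at configure time like this (the install prefix is just an example):

```sh
# Example developer-tree configure with debugging enabled (prefix is illustrative)
./configure --prefix=$HOME/ompi-debug --enable-debug --enable-mem-debug
```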
## Use of GNU Autoconf, Automake, and Libtool (and m4)
You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree). If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.
If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements). The specific versions
required depend on if you are using the Git master branch or a release
branch (and which release branch you are using). [The specific
versions can be found
here](https://www.open-mpi.org/source/building.php).
You can check what versions of the autotools you have installed with
the following:
```
shell$ m4 --version
shell$ autoconf --version
shell$ automake --version
shell$ libtoolize --version
```
[Required version levels for all the OMPI releases can be found
here](https://www.open-mpi.org/source/building.php).
To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools. There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example). You *WILL* have problems if you
do not use recent versions of the GNU tools.
***NOTE:*** On MacOS/X, the default `libtool` program is different
than the GNU libtool. You must download and install the GNU version
(e.g., via MacPorts, Homebrew, or some other mechanism).
If you need newer versions, you are *strongly* encouraged to heed the
following advice:
1. Unless your OS distribution has easy-to-use binary installations,
the sources can be downloaded from:
* https://ftp.gnu.org/gnu/autoconf/
* https://ftp.gnu.org/gnu/automake/
* https://ftp.gnu.org/gnu/libtool/
* And if you need it: https://ftp.gnu.org/gnu/m4/
***NOTE:*** It is certainly easiest to download/build/install all
four of these tools together. But note that Open MPI has no
specific m4 requirements; it is only listed here because Autoconf
requires minimum versions of GNU m4. Hence, you may or may not
*need* to actually install a new version of GNU m4. That being
said, if you are confused or don't know, just install the latest
GNU m4 with the rest of the GNU Autotools and everything will work
out fine.
1. Build and install the tools in the following order:
1. m4
1. Autoconf
1. Automake
1. Libtool
1. You MUST install the last three tools (Autoconf, Automake, Libtool)
into the same prefix directory. These three tools are somewhat
inter-related, and if they're going to be used together, they MUST
share a common installation prefix.
You can install m4 anywhere as long as it can be found in the path;
it may be convenient to install it in the same prefix as the other
three. Or you can use any recent-enough m4 that is in your path.
1. It is *strongly* encouraged that you do not install your new
versions over the OS-installed versions. This could cause
other things on your system to break. Instead, install into
`$HOME/local`, or `/usr/local`, or wherever else you tend to
install "local" kinds of software.
1. In doing so, be sure to prefix your $path with the directory
where they are installed. For example, if you install into
`$HOME/local`, you may want to edit your shell startup file
(`.bashrc`, `.cshrc`, `.tcshrc`, etc.) to have something like:
```sh
# For bash/sh:
export PATH=$HOME/local/bin:$PATH
# For csh/tcsh:
set path = ($HOME/local/bin $path)
```
1. Be sure to set your `$PATH` *BEFORE* you configure/build/install
the four packages.
1. All four packages require two simple commands to build and
install (where PREFIX is the prefix discussed in 3, above).
```
shell$ cd <m4 directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
```
***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
run the `rehash` command after you install each package.
```
shell$ cd <autoconf directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
```
***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
run the `rehash` command after you install each package.
```
shell$ cd <automake directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
```
***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
run the `rehash` command after you install each package.
```
shell$ cd <libtool directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
```
***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
run the `rehash` command after you install each package.
m4, Autoconf and Automake build and install very quickly; Libtool
will take a minute or two.
1. You can now run OMPI's top-level `autogen.pl` script. This script
will invoke the GNU Autoconf, Automake, and Libtool commands in the
proper order and setup to run OMPI's top-level `configure` script.
Running `autogen.pl` may take a few minutes, depending on your
system. It's not very exciting to watch. :smile:
If you have a multi-processor system, enabling the multi-threaded
behavior in Automake 1.11 (or newer) can result in `autogen.pl`
running faster. Do this by setting the `AUTOMAKE_JOBS` environment
variable to the number of processors (threads) that you want it to
use before invoking `autogen.pl`. For example (you can again put
this in your shell startup files):
```sh
# For bash/sh:
export AUTOMAKE_JOBS=4
# For csh/tcsh:
set AUTOMAKE_JOBS 4
```
1. You generally need to run autogen.pl whenever the top-level file
`configure.ac` changes, or any files in the `config/` or
`<project>/config/` directories change (these directories are
where a lot of "include" files for Open MPI's `configure` script
live).
1. You do *NOT* need to re-run `autogen.pl` if you modify a
`Makefile.am`.
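Putting the preceding steps together, a typical developer-tree build might look like the following sketch (the install prefix and job count are just examples):

```sh
# Typical developer build after installing recent GNU Autotools
./autogen.pl
./configure --prefix=$HOME/ompi-install
make -j 8 all
make install
```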
## Use of Flex
Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs). Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.
Note that no testing has been performed to determine the minimum
version of Flex required by Open MPI. We suggest that you use at
least v2.5.35.
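You can check which version of Flex is installed with:

```sh
shell$ flex --version
```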
***NOTE:*** Windows developer builds of Open MPI *require* Flex version
2.5.35. Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
flex version is on Windows; we suggest that you use 2.5.35 at the
earliest. It is for this reason that the
`contrib/dist/make_dist_tarball` script checks for a Windows-friendly
version of Flex before continuing.
For now, Open MPI will allow developer builds with Flex 2.5.4. This
is primarily motivated by the fact that RedHat/Centos 5 ships with
Flex 2.5.4. It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.
Note that the `flex`-generated code generates some compiler warnings
on some platforms, but the warnings do not seem to be consistent or
uniform on all platforms, compilers, and flex versions. As such, we
have done little to try to remove those warnings.
If you do not have Flex installed, see [the Flex Github
repository](https://github.com/westes/flex).
## Use of Pandoc
Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree). If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.
The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).
You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree. If configure cannot find Pandoc >=v1.12, it will abort.
If you need to install Pandoc, check your operating system-provided
packages (to include MacOS Homebrew and MacPorts). [The Pandoc
project web site](https://pandoc.org/) itself also offers binaries for
their releases.
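As a rough sketch of the kind of conversion involved (the filenames are hypothetical, and the exact flags used by Open MPI's build may differ):

```sh
# Convert a Markdown man page source to nroff man format (illustrative filenames)
pandoc -s --from=markdown --to=man MPI_Send.3.md -o MPI_Send.3
```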

11
LICENSE

@@ -15,9 +15,9 @@ Copyright (c) 2004-2010 High Performance Computing Center Stuttgart,
University of Stuttgart. All rights reserved.
Copyright (c) 2004-2008 The Regents of the University of California.
All rights reserved.
Copyright (c) 2006-2017 Los Alamos National Security, LLC. All rights
Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
reserved.
Copyright (c) 2006-2017 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2006-2020 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2006-2010 Voltaire, Inc. All rights reserved.
Copyright (c) 2006-2017 Sandia National Laboratories. All rights reserved.
Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved.
@@ -25,7 +25,7 @@ Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved.
Copyright (c) 2006-2017 The University of Houston. All rights reserved.
Copyright (c) 2006-2009 Myricom, Inc. All rights reserved.
Copyright (c) 2007-2017 UT-Battelle, LLC. All rights reserved.
Copyright (c) 2007-2017 IBM Corporation. All rights reserved.
Copyright (c) 2007-2020 IBM Corporation. All rights reserved.
Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing
Centre, Federal Republic of Germany
Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
@@ -45,7 +45,7 @@ Copyright (c) 2016 ARM, Inc. All rights reserved.
Copyright (c) 2010-2011 Alex Brick <bricka@ccs.neu.edu>. All rights reserved.
Copyright (c) 2012 The University of Wisconsin-La Crosse. All rights
reserved.
Copyright (c) 2013-2016 Intel, Inc. All rights reserved.
Copyright (c) 2013-2020 Intel, Inc. All rights reserved.
Copyright (c) 2011-2017 NVIDIA Corporation. All rights reserved.
Copyright (c) 2016 Broadcom Limited. All rights reserved.
Copyright (c) 2011-2017 Fujitsu Limited. All rights reserved.
@@ -56,7 +56,8 @@ Copyright (c) 2013-2017 Research Organization for Information Science (RIST).
Copyright (c) 2017-2020 Amazon.com, Inc. or its affiliates. All Rights
reserved.
Copyright (c) 2018 DataDirect Networks. All rights reserved.
Copyright (c) 2018-2019 Triad National Security, LLC. All rights reserved.
Copyright (c) 2018-2020 Triad National Security, LLC. All rights reserved.
Copyright (c) 2020 Google, LLC. All rights reserved.
$COPYRIGHT$

Makefile.am

@@ -24,7 +24,7 @@
SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_SUBDIRS) test
DIST_SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_DIST_SUBDIRS) test
EXTRA_DIST = README INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.txt AUTHORS
EXTRA_DIST = README.md INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.md AUTHORS
include examples/Makefile.include

2243
README

The diff for this file is not shown because it is too large.

281
README.JAVA.md (new file)

@@ -0,0 +1,281 @@
# Open MPI Java Bindings
## Important note
JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
CONTINUED DEVELOPER SUPPORT.
## Overview
This version of Open MPI provides support for Java-based
MPI applications.
The rest of this document provides step-by-step instructions on
building OMPI with Java bindings, and compiling and running Java-based
MPI applications. Also, part of the functionality is explained with
examples. Further details about the design, implementation and usage
of Java bindings in Open MPI can be found in [1]. The bindings follow
a JNI approach, that is, we do not provide a pure Java implementation
of MPI primitives, but a thin layer on top of the C
implementation. This is the same approach as in mpiJava [2]; in fact,
mpiJava was taken as a starting point for Open MPI Java bindings, but
they were later totally rewritten.
1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
implementation of Java bindings in Open MPI". Parallel Comput.
59: 1-20 (2016).
2. M. Baker et al. "mpiJava: An object-oriented Java interface to
MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
pp. 748-762, Springer (1999).
## Building Java Bindings
If this software was obtained as a developer-level checkout as opposed
to a tarball, you will need to start your build by running
`./autogen.pl`. This will also require that you have a fairly recent
version of GNU Autotools on your system - see the HACKING.md file for
details.
Java support requires that Open MPI be built at least with shared libraries
(i.e., `--enable-shared`) - any additional options are fine and will not
conflict. Note that this is the default for Open MPI, so you don't
have to explicitly add the option. The Java bindings will build only
if `--enable-mpi-java` is specified, and a JDK is found in a typical
system default location.
If the JDK is not in a place where we automatically find it, you can
specify the location. For example, this is required on the Mac
platform as the JDK headers are located in a non-typical location. Two
options are available for this purpose:
1. `--with-jdk-bindir=<foo>`: the location of `javac` and `javah`
1. `--with-jdk-headers=<bar>`: the directory containing `jni.h`
For simplicity, typical configurations are provided in platform files
under `contrib/platform/hadoop`. These will meet the needs of most
users, or at least provide a starting point for your own custom
configuration.
In summary, therefore, you can configure the system using the
following Java-related options:
```
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform> ...
```
or
```
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo> --with-jdk-headers=<bar> ...
```
or simply
```
$ ./configure --enable-mpi-java ...
```
if JDK is in a "standard" place that we automatically find.
## Running Java Applications
For convenience, the `mpijavac` wrapper compiler has been provided for
compiling Java-based MPI applications. It ensures that all required MPI
libraries and class paths are defined. You can see the actual command
line using the `--showme` option, if you are interested.
Once your application has been compiled, you can run it with the
standard `mpirun` command line:
```
$ mpirun <options> java <your-java-options> <my-app>
```
For convenience, `mpirun` has been updated to detect the `java` command
and ensure that the required MPI libraries and class paths are defined
to support execution. You therefore do _NOT_ need to specify the Java
library path to the MPI installation, nor the MPI classpath. Any class
path definitions required for your application should be specified
either on the command line or via the `CLASSPATH` environment
variable. Note that the local directory will be added to the class
path if nothing is specified.
As always, the `java` executable, all required libraries, and your
application classes must be available on all nodes.
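For example, compiling and launching the `ComputePi` program shown later in this document might look like this (the process count is arbitrary):

```sh
# Compile with the mpijavac wrapper compiler, then launch via mpirun
mpijavac ComputePi.java
mpirun -np 4 java ComputePi
```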
## Basic usage of Java bindings
There is an MPI package that contains all classes of the MPI Java
bindings: `Comm`, `Datatype`, `Request`, etc. These classes have a
direct correspondence with classes defined by the MPI standard. MPI
primitives are just methods included in these classes. The convention
used for naming Java methods and classes is the usual camel-case
convention, e.g., the equivalent of `MPI_File_set_info(fh,info)` is
`fh.setInfo(info)`, where `fh` is an object of the class `File`.
Apart from classes, the MPI package contains predefined public
attributes under a convenience class `MPI`. Examples are the
predefined communicator `MPI.COMM_WORLD` or predefined datatypes such
as `MPI.DOUBLE`. Also, MPI initialization and finalization are methods
of the `MPI` class and must be invoked by all MPI Java
applications. The following example illustrates these concepts:
```java
import mpi.*;
class ComputePi {
public static void main(String args[]) throws MPIException {
MPI.Init(args);
int rank = MPI.COMM_WORLD.getRank(),
size = MPI.COMM_WORLD.getSize(),
nint = 100; // Intervals.
double h = 1.0/(double)nint, sum = 0.0;
for(int i=rank+1; i<=nint; i+=size) {
double x = h * ((double)i - 0.5);
sum += (4.0 / (1.0 + x * x));
}
double sBuf[] = { h * sum },
rBuf[] = new double[1];
MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);
if(rank == 0) System.out.println("PI: " + rBuf[0]);
MPI.Finalize();
}
}
```
## Exception handling
Java bindings in Open MPI support exception handling. By default, errors
are fatal, but this behavior can be changed. The Java API will throw
exceptions if the MPI.ERRORS_RETURN error handler is set:
```java
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
```
If you add this statement to your program, it will show the line
where it breaks, instead of just crashing in case of an error.
Error-handling code can be separated from main application code by
means of try-catch blocks, for instance:
```java
try
{
File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
}
catch(MPIException ex)
{
System.err.println("Error Message: "+ ex.getMessage());
System.err.println(" Error Class: "+ ex.getErrorClass());
ex.printStackTrace();
System.exit(-1);
}
```
## How to specify buffers
In MPI primitives that require a buffer (either send or receive) the
Java API admits a Java array. Since Java arrays can be relocated by
the Java runtime environment, the MPI Java bindings need to make a
copy of the contents of the array to a temporary buffer, then pass the
pointer to this buffer to the underlying C implementation. From the
practical point of view, this implies an overhead associated to all
buffers that are represented by Java arrays. The overhead is small
for small buffers but increases for large arrays.
There is a pool of temporary buffers with a default capacity of 64K.
If a temporary buffer of 64K or less is needed, then the buffer will
be obtained from the pool. But if the buffer is larger, then it will
be necessary to allocate the buffer and free it later.
The default capacity of pool buffers can be modified with an Open MPI
MCA parameter:
```
shell$ mpirun --mca mpi_java_eager size ...
```
Where `size` is the number of bytes, or kilobytes if it ends with 'k',
or megabytes if it ends with 'm'.
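For example, to raise the pool buffer capacity to 4 megabytes for a run (the application name is hypothetical):

```sh
# Increase the Java temporary buffer pool capacity to 4 MB
mpirun --mca mpi_java_eager 4m -np 2 java MyApp
```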
An alternative is to use "direct buffers" provided by standard classes
available in the Java SDK such as `ByteBuffer`. For convenience we
provide a few static methods `new[Type]Buffer` in the `MPI` class to
create direct buffers for a number of basic datatypes. Elements of the
direct buffer can be accessed with methods `put()` and `get()`, and
the number of elements in the buffer can be obtained with the method
`capacity()`. This example illustrates its use:
```java
int myself = MPI.COMM_WORLD.getRank();
int tasks = MPI.COMM_WORLD.getSize();
IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
out = MPI.newIntBuffer(MAXLEN);
for(int i = 0; i < MAXLEN; i++)
out.put(i, myself); // fill the buffer with the rank
Request request = MPI.COMM_WORLD.iAllGather(
out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
request.waitFor();
request.free();
for(int i = 0; i < tasks; i++)
{
for(int k = 0; k < MAXLEN; k++)
{
if(in.get(k + i * MAXLEN) != i)
throw new AssertionError("Unexpected value");
}
}
```
Direct buffers are available for: `BYTE`, `CHAR`, `SHORT`, `INT`,
`LONG`, `FLOAT`, and `DOUBLE`. There is no direct buffer for booleans.
Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays. In some
cases arrays will be a better choice. You can easily convert a
buffer into an array and vice versa.
All non-blocking methods must use direct buffers and only
blocking methods can choose between arrays and direct buffers.
The above example also illustrates that it is necessary to call
the `free()` method on objects whose class implements the `Freeable`
interface. Otherwise a memory leak is produced.
## Specifying offsets in buffers
In a C program, it is common to specify an offset in an array with
`&array[i]` or `array+i`, for instance to send data starting from
a given position in the array. The equivalent form in the Java bindings
is to `slice()` the buffer to start at an offset. Making a `slice()`
on a buffer is only necessary when the offset is not zero. Slices
work for both arrays and direct buffers.
```java
import static mpi.MPI.slice;
// ...
int numbers[] = new int[SIZE];
// ...
MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);
```
## Questions? Problems?
If you have any problems, or find any bugs, please feel free to report
them to [Open MPI user's mailing
list](https://www.open-mpi.org/community/lists/ompi.php).

README.JAVA.txt

@@ -1,275 +0,0 @@
***************************************************************************
IMPORTANT NOTE
JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
CONTINUED DEVELOPER SUPPORT.
***************************************************************************
This version of Open MPI provides support for Java-based
MPI applications.
The rest of this document provides step-by-step instructions on
building OMPI with Java bindings, and compiling and running
Java-based MPI applications. Also, part of the functionality is
explained with examples. Further details about the design,
implementation and usage of Java bindings in Open MPI can be found
in [1]. The bindings follow a JNI approach, that is, we do not
provide a pure Java implementation of MPI primitives, but a thin
layer on top of the C implementation. This is the same approach
as in mpiJava [2]; in fact, mpiJava was taken as a starting point
for Open MPI Java bindings, but they were later totally rewritten.
[1] O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
implementation of Java bindings in Open MPI". Parallel Comput.
59: 1-20 (2016).
[2] M. Baker et al. "mpiJava: An object-oriented Java interface to
MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
pp. 748-762, Springer (1999).
============================================================================
Building Java Bindings
If this software was obtained as a developer-level
checkout as opposed to a tarball, you will need to start your build by
running ./autogen.pl. This will also require that you have a fairly
recent version of autotools on your system - see the HACKING file for
details.
Java support requires that Open MPI be built at least with shared libraries
(i.e., --enable-shared) - any additional options are fine and will not
conflict. Note that this is the default for Open MPI, so you don't
have to explicitly add the option. The Java bindings will build only
if --enable-mpi-java is specified, and a JDK is found in a typical
system default location.
If the JDK is not in a place where we automatically find it, you can
specify the location. For example, this is required on the Mac
platform as the JDK headers are located in a non-typical location. Two
options are available for this purpose:
--with-jdk-bindir=<foo> - the location of javac and javah
--with-jdk-headers=<bar> - the directory containing jni.h
For simplicity, typical configurations are provided in platform files
under contrib/platform/hadoop. These will meet the needs of most
users, or at least provide a starting point for your own custom
configuration.
In summary, therefore, you can configure the system using the
following Java-related options:
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform>
...
or
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo>
--with-jdk-headers=<bar> ...
or simply
$ ./configure --enable-mpi-java ...
if JDK is in a "standard" place that we automatically find.
----------------------------------------------------------------------------
Running Java Applications
For convenience, the "mpijavac" wrapper compiler has been provided for
compiling Java-based MPI applications. It ensures that all required MPI
libraries and class paths are defined. You can see the actual command
line using the --showme option, if you are interested.
Once your application has been compiled, you can run it with the
standard "mpirun" command line:
$ mpirun <options> java <your-java-options> <my-app>
For convenience, mpirun has been updated to detect the "java" command
and ensure that the required MPI libraries and class paths are defined
to support execution. You therefore do NOT need to specify the Java
library path to the MPI installation, nor the MPI classpath. Any class
path definitions required for your application should be specified
either on the command line or via the CLASSPATH environmental
variable. Note that the local directory will be added to the class
path if nothing is specified.
As always, the "java" executable, all required libraries, and your
application classes must be available on all nodes.
----------------------------------------------------------------------------
Basic usage of Java bindings
There is an MPI package that contains all classes of the MPI Java
bindings: Comm, Datatype, Request, etc. These classes have a direct
correspondence with classes defined by the MPI standard. MPI primitives
are just methods included in these classes. The convention used for
naming Java methods and classes is the usual camel-case convention,
e.g., the equivalent of MPI_File_set_info(fh,info) is fh.setInfo(info),
where fh is an object of the class File.
Apart from classes, the MPI package contains predefined public attributes
under a convenience class MPI. Examples are the predefined communicator
MPI.COMM_WORLD or predefined datatypes such as MPI.DOUBLE. Also, MPI
initialization and finalization are methods of the MPI class and must
be invoked by all MPI Java applications. The following example illustrates
these concepts:
import mpi.*;
class ComputePi {
public static void main(String args[]) throws MPIException {
MPI.Init(args);
int rank = MPI.COMM_WORLD.getRank(),
size = MPI.COMM_WORLD.getSize(),
nint = 100; // Intervals.
double h = 1.0/(double)nint, sum = 0.0;
for(int i=rank+1; i<=nint; i+=size) {
double x = h * ((double)i - 0.5);
sum += (4.0 / (1.0 + x * x));
}
double sBuf[] = { h * sum },
rBuf[] = new double[1];
MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);
if(rank == 0) System.out.println("PI: " + rBuf[0]);
MPI.Finalize();
}
}
----------------------------------------------------------------------------
Exception handling
Java bindings in Open MPI support exception handling. By default, errors
are fatal, but this behavior can be changed. The Java API will throw
exceptions if the MPI.ERRORS_RETURN error handler is set:
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
If you add this statement to your program, it will show the line
where it breaks, instead of just crashing in case of an error.
Error-handling code can be separated from main application code by
means of try-catch blocks, for instance:
try
{
File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
}
catch(MPIException ex)
{
System.err.println("Error Message: "+ ex.getMessage());
System.err.println(" Error Class: "+ ex.getErrorClass());
ex.printStackTrace();
System.exit(-1);
}
----------------------------------------------------------------------------
How to specify buffers
In MPI primitives that require a buffer (either send or receive) the
Java API admits a Java array. Since Java arrays can be relocated by
the Java runtime environment, the MPI Java bindings need to make a
copy of the contents of the array to a temporary buffer, then pass the
pointer to this buffer to the underlying C implementation. From the
practical point of view, this implies an overhead associated to all
buffers that are represented by Java arrays. The overhead is small
for small buffers but increases for large arrays.
There is a pool of temporary buffers with a default capacity of 64K.
If a temporary buffer of 64K or less is needed, then the buffer will
be obtained from the pool. But if the buffer is larger, then it will
be necessary to allocate the buffer and free it later.
The default capacity of pool buffers can be modified with an 'mca'
parameter:
mpirun --mca mpi_java_eager size ...
Where 'size' is the number of bytes, or kilobytes if it ends with 'k',
or megabytes if it ends with 'm'.
An alternative is to use "direct buffers" provided by standard
classes available in the Java SDK such as ByteBuffer. For convenience
we provide a few static methods "new[Type]Buffer" in the MPI class
to create direct buffers for a number of basic datatypes. Elements
of the direct buffer can be accessed with methods put() and get(),
and the number of elements in the buffer can be obtained with the
method capacity(). This example illustrates its use:
int myself = MPI.COMM_WORLD.getRank();
int tasks = MPI.COMM_WORLD.getSize();
IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
out = MPI.newIntBuffer(MAXLEN);
for(int i = 0; i < MAXLEN; i++)
out.put(i, myself); // fill the buffer with the rank
Request request = MPI.COMM_WORLD.iAllGather(
out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
request.waitFor();
request.free();
for(int i = 0; i < tasks; i++)
{
for(int k = 0; k < MAXLEN; k++)
{
if(in.get(k + i * MAXLEN) != i)
throw new AssertionError("Unexpected value");
}
}
Direct buffers are available for: BYTE, CHAR, SHORT, INT, LONG,
FLOAT, and DOUBLE. There is no direct buffer for booleans.
Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays. In some
cases arrays will be a better choice. You can easily convert a
buffer into an array and vice versa.
All non-blocking methods must use direct buffers and only
blocking methods can choose between arrays and direct buffers.
The above example also illustrates that it is necessary to call
the free() method on objects whose class implements the Freeable
interface. Otherwise a memory leak is produced.
----------------------------------------------------------------------------
Specifying offsets in buffers
In a C program, it is common to specify an offset in a array with
"&array[i]" or "array+i", for instance to send data starting from
a given position in the array. The equivalent form in the Java bindings
is to "slice()" the buffer to start at an offset. Making a "slice()"
on a buffer is only necessary, when the offset is not zero. Slices
work for both arrays and direct buffers.
import static mpi.MPI.slice;
...
int numbers[] = new int[SIZE];
...
MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);
----------------------------------------------------------------------------
If you have any problems, or find any bugs, please feel free to report
them to Open MPI user's mailing list (see
https://www.open-mpi.org/community/lists/ompi.php).

2191
README.md (new file)

The diff for this file is not shown because it is too large.


@@ -64,7 +64,7 @@ EXTRA_DIST = \
platform/lanl/cray_xc_cle5.2/optimized-common \
platform/lanl/cray_xc_cle5.2/optimized-lustre \
platform/lanl/cray_xc_cle5.2/optimized-lustre.conf \
platform/lanl/toss/README \
platform/lanl/toss/README.md \
platform/lanl/toss/common \
platform/lanl/toss/common-optimized \
platform/lanl/toss/cray-lustre-optimized \


@@ -1,121 +1,108 @@
# Description
2 Feb 2011
Description
===========
This sample "tcp2" BTL component is a simple example of how to build
This sample `tcp2` BTL component is a simple example of how to build
an Open MPI MCA component from outside of the Open MPI source tree.
This is a valuable technique for 3rd parties who want to provide their
own components for Open MPI, but do not want to be in the mainstream
distribution (i.e., their code is not part of the main Open MPI code
base).
NOTE: We do recommend that 3rd party developers investigate using a
DVCS such as Mercurial or Git to keep up with Open MPI
development. Using a DVCS allows you to host your component in
your own copy of the Open MPI source tree, and yet still keep up
with development changes, stable releases, etc.
Previous colloquial knowledge held that building a component from
outside of the Open MPI source tree required configuring Open MPI
--with-devel-headers, and then building and installing it. This
configure switch installs all of OMPI's internal .h files under
$prefix/include/openmpi, and therefore allows 3rd party code to be
`--with-devel-headers`, and then building and installing it. This
configure switch installs all of OMPI's internal `.h` files under
`$prefix/include/openmpi`, and therefore allows 3rd party code to be
compiled outside of the Open MPI tree.
This method definitely works, but is annoying:
* You have to ask users to use this special configure switch.
* Not all users install from source; many get binary packages (e.g.,
RPMs).
* You have to ask users to use this special configure switch.
* Not all users install from source; many get binary packages (e.g.,
RPMs).
This example package shows two ways to build an Open MPI MCA component
from outside the Open MPI source tree:
1. Using the above --with-devel-headers technique
2. Compiling against the Open MPI source tree itself (vs. the
installation tree)
1. Using the above `--with-devel-headers` technique
2. Compiling against the Open MPI source tree itself (vs. the
installation tree)
The user still has to have a source tree, but at least they don't have
to be required to use --with-devel-headers (which most users don't) --
to be required to use `--with-devel-headers` (which most users don't) --
they can likely build off the source tree that they already used.
Example project contents
========================
# Example project contents
The "tcp2" component is a direct copy of the TCP BTL as of January
The `tcp2` component is a direct copy of the TCP BTL as of January
2011 -- it has just been renamed so that it can be built separately
and installed alongside the real TCP BTL component.
Most of the mojo for both methods is handled in the example
components' configure.ac, but the same techniques are applicable
components' `configure.ac`, but the same techniques are applicable
outside of the GNU Auto toolchain.
This sample "tcp2" component has an autogen.sh script that requires
This sample `tcp2` component has an `autogen.sh` script that requires
the normal Autoconf, Automake, and Libtool. It also adds the
following two configure switches:
--with-openmpi-install=DIR
1. `--with-openmpi-install=DIR`:
If provided, `DIR` is an Open MPI installation tree that was
installed `--with-devel-headers`.
If provided, DIR is an Open MPI installation tree that was
installed --with-devel-headers.
This switch uses the installed mpicc --showme:<foo> functionality
to extract the relevant CPPFLAGS, LDFLAGS, and LIBS.
--with-openmpi-source=DIR
If provided, DIR is the source of a configured and built Open MPI
This switch uses the installed `mpicc --showme:<foo>` functionality
to extract the relevant `CPPFLAGS`, `LDFLAGS`, and `LIBS`.
1. `--with-openmpi-source=DIR`:
If provided, `DIR` is the source of a configured and built Open MPI
source tree (corresponding to the version expected by the example
component). The source tree is not required to have been
configured --with-devel-headers.
configured `--with-devel-headers`.
This switch uses the source tree's config.status script to extract
the relevant CPPFLAGS and CFLAGS.
This switch uses the source tree's `config.status` script to
extract the relevant `CPPFLAGS` and `CFLAGS`.
Either one of these two switches must be provided, or appropriate
CPPFLAGS, CFLAGS, LDFLAGS, and/or LIBS must be provided such that
valid Open MPI header and library files can be found and compiled /
linked against, respectively.
`CPPFLAGS`, `CFLAGS`, `LDFLAGS`, and/or `LIBS` must be provided such
that valid Open MPI header and library files can be found and compiled
/ linked against, respectively.
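As a quick way to see what the first switch would extract from an installed Open MPI, the wrapper compiler can be asked to print its flags (the output varies by installation):

```sh
# Inspect the flags the installed Open MPI wrapper compiler would use
mpicc --showme:compile   # preprocessor/compile flags
mpicc --showme:link      # linker flags and libraries
```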
Example use
===========
# Example use
First, download, build, and install Open MPI:
-----
```
$ cd $HOME
$ wget \
https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
[lots of output]
$ wget https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
[...lots of output...]
$ tar jxf openmpi-X.Y.Z.tar.bz2
$ cd openmpi-X.Y.Z
$ ./configure --prefix=/opt/openmpi ...
[lots of output]
[...lots of output...]
$ make -j 4 install
[lots of output]
[...lots of output...]
$ /opt/openmpi/bin/ompi_info | grep btl
MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
MCA btl: tcp (MCA vA.B, API vM.N, Component vX.Y.Z)
[where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
$
-----
```
Notice the installed BTLs from ompi_info.
Notice the installed BTLs from `ompi_info`.
Now cd into this example project and build it, pointing it to the
Now `cd` into this example project and build it, pointing it to the
source directory of the Open MPI that you just built. Note that we
use the same --prefix as when installing Open MPI (so that the built
use the same `--prefix` as when installing Open MPI (so that the built
component will be installed into the Right place):
-----
```
$ cd /path/to/this/sample
$ ./autogen.sh
$ ./configure --prefix=/opt/openmpi --with-openmpi-source=$HOME/openmpi-X.Y.Z
[lots of output]
[...lots of output...]
$ make -j 4 install
[lots of output]
[...lots of output...]
$ /opt/openmpi/bin/ompi_info | grep btl
MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
@@ -123,12 +110,11 @@ $ /opt/openmpi/bin/ompi_info | grep btl
MCA btl: tcp2 (MCA vA.B, API vM.N, Component vX.Y.Z)
[where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
$
-----
```
Notice that the "tcp2" BTL is now installed.
Notice that the `tcp2` BTL is now installed.
Random notes
============
# Random notes
The component in this project is just an example; I whipped it up in
the span of several hours. Your component may be a bit more complex
@@ -139,17 +125,15 @@ what you need.
Changes required to the component to make it build in a standalone
mode:
1. Write your own configure script. This component is just a sample.
You basically need to build against an OMPI install that was
installed --with-devel-headers or a built OMPI source tree. See
./configure --help for details.
2. I also provided a bogus btl_tcp2_config.h (generated by configure).
This file is not included anywhere, but it does provide protection
against re-defined PACKAGE_* macros when running configure, which
is quite annoying.
3. Modify Makefile.am to only build DSOs. I.e., you can optionally
1. Write your own `configure` script. This component is just a
sample. You basically need to build against an OMPI install that
was installed `--with-devel-headers` or a built OMPI source tree.
See `./configure --help` for details.
1. I also provided a bogus `btl_tcp2_config.h` (generated by
`configure`). This file is not included anywhere, but it does
provide protection against re-defined `PACKAGE_*` macros when
running `configure`, which is quite annoying.
1. Modify `Makefile.am` to only build DSOs. I.e., you can optionally
take the static option out since the component can *only* build in
DSO mode when building standalone. That being said, it doesn't
hurt to leave the static builds in -- this would (hypothetically)

105
contrib/dist/linux/README (vendored)

@@ -1,105 +0,0 @@
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
University Research and Technology
Corporation. All rights reserved.
Copyright (c) 2004-2006 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2004-2006 High Performance Computing Center Stuttgart,
University of Stuttgart. All rights reserved.
Copyright (c) 2004-2006 The Regents of the University of California.
All rights reserved.
Copyright (c) 2006-2016 Cisco Systems, Inc. All rights reserved.
$COPYRIGHT$
Additional copyrights may follow
$HEADER$
===========================================================================
Note that you probably want to download the latest release of the SRPM
for any given Open MPI version. The SRPM release number is the
version after the dash in the SRPM filename. For example,
"openmpi-1.6.3-2.src.rpm" is the 2nd release of the SRPM for Open MPI
v1.6.3. Subsequent releases of SRPMs typically contain bug fixes for
the RPM packaging, but not Open MPI itself.
The buildrpm.sh script takes a single mandatory argument -- a filename
pointing to an Open MPI tarball (may be either .gz or .bz2). It will
create one or more RPMs from this tarball:
1. Source RPM
2. "All in one" RPM, where all of Open MPI is put into a single RPM.
3. "Multiple" RPM, where Open MPI is split into several sub-package
RPMs:
- openmpi-runtime
- openmpi-devel
- openmpi-docs
The folowing arguments could be used to affect script behaviour.
Please, do NOT set the same settings with parameters and config vars.
-b
If you specify this option, only the all-in-one binary RPM will
be built. By default, only the source RPM (SRPM) is built. Other
parameters that affect the all-in-one binary RPM will be ignored
unless this option is specified.
-n name
This option will change the name of the produced RPM to the "name".
It is useful to use with "-o" and "-m" options if you want to have
multiple Open MPI versions installed simultaneously in the same
enviroment. Requires use of option "-b".
-o
With this option the install path of the binary RPM will be changed
to /opt/_NAME_/_VERSION_. Requires use of option "-b".
-m
This option causes the RPM to also install modulefiles
to the location specified in the specfile. Requires use of option "-b".
-i
Also build a debuginfo RPM. By default, the debuginfo RPM is not built.
Requires use of option "-b".
-f lf_location
Include support for Libfabric. "lf_location" is Libfabric install
path. Requires use of option "-b".
-t tm_location
Include support for Torque/PBS Pro. "tm_location" is path of the
Torque/PBS Pro header files. Requires use of option "-b".
-d
Build with debugging support. By default,
the RPM is built without debugging support.
-c parameter
Add custom configure parameter.
-r parameter
Add custom RPM build parameter.
-s
If specified, the script will try to unpack the openmpi.spec
file from the tarball specified on the command line. By default,
the script will look for the specfile in the current directory.
-R directory
Specifies the top level RPM build direcotry.
-h
Prints script usage information.
Target architecture is currently hard-coded in the beginning
of the buildrpm.sh script.
Alternatively, you can build directly from the openmpi.spec spec file
or SRPM directly. Many options can be passed to the building process
via rpmbuild's --define option (there are older versions of rpmbuild
that do not seem to handle --define'd values properly in all cases,
but we generally don't care about those old versions of rpmbuild...).
The available options are described in the comments in the beginning
of the spec file in this directory.

88
contrib/dist/linux/README.md (vendored, new file)

@@ -0,0 +1,88 @@
# Open MPI Linux distribution helpers
Note that you probably want to download the latest release of the SRPM
for any given Open MPI version. The SRPM release number is the
version after the dash in the SRPM filename. For example,
`openmpi-1.6.3-2.src.rpm` is the 2nd release of the SRPM for Open MPI
v1.6.3. Subsequent releases of SRPMs typically contain bug fixes for
the RPM packaging, but not Open MPI itself.
The `buildrpm.sh` script takes a single mandatory argument -- a
filename pointing to an Open MPI tarball (may be either `.gz` or
`.bz2`). It will create one or more RPMs from this tarball:
1. Source RPM
1. "All in one" RPM, where all of Open MPI is put into a single RPM.
1. "Multiple" RPM, where Open MPI is split into several sub-package
RPMs:
* `openmpi-runtime`
* `openmpi-devel`
* `openmpi-docs`
The following arguments can be used to affect the script's behavior.
Please do NOT set the same settings both via command-line parameters and via config vars.
* `-b`:
If you specify this option, only the all-in-one binary RPM will
be built. By default, only the source RPM (SRPM) is built. Other
parameters that affect the all-in-one binary RPM will be ignored
unless this option is specified.
* `-n name`:
This option changes the name of the produced RPM to "name".
It is useful together with the `-o` and `-m` options if you want to have
multiple Open MPI versions installed simultaneously in the same
environment. Requires use of option `-b`.
* `-o`:
With this option the install path of the binary RPM will be changed
to `/opt/_NAME_/_VERSION_`. Requires use of option `-b`.
* `-m`:
This option causes the RPM to also install modulefiles
to the location specified in the specfile. Requires use of option `-b`.
* `-i`:
Also build a debuginfo RPM. By default, the debuginfo RPM is not built.
Requires use of option `-b`.
* `-f lf_location`:
Include support for Libfabric. `lf_location` is the Libfabric install
path. Requires use of option `-b`.
* `-t tm_location`:
Include support for Torque/PBS Pro. `tm_location` is the path to the
Torque/PBS Pro header files. Requires use of option `-b`.
* `-d`:
Build with debugging support. By default,
the RPM is built without debugging support.
* `-c parameter`:
Add custom configure parameter.
* `-r parameter`:
Add custom RPM build parameter.
* `-s`:
If specified, the script will try to unpack the openmpi.spec
file from the tarball specified on the command line. By default,
the script will look for the specfile in the current directory.
* `-R directory`:
Specifies the top-level RPM build directory.
* `-h`:
Prints script usage information.
The target architecture is currently hard-coded at the beginning
of the `buildrpm.sh` script.
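For example, a hypothetical invocation that builds an all-in-one
binary RPM from a tarball, installs under `/opt`, and also installs a
modulefile might look like the following (the tarball filename is only
an illustration):

```
shell$ ./buildrpm.sh -b -o -m openmpi-4.1.0.tar.bz2
```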
Alternatively, you can build directly from the `openmpi.spec` spec
file or the SRPM. Many options can be passed to the build process via
`rpmbuild`'s `--define` option (some older versions of `rpmbuild` do
not handle `--define`'d values properly in all cases, but we generally
don't care about those old versions of `rpmbuild`...). The available
options are described in the comments at the beginning of the spec
file in this directory.
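As a sketch only, a rebuild of the SRPM with a custom configure option
might look like the example below; the `configure_options` macro name
is a placeholder here, so check the comments at the top of
`openmpi.spec` for the names the spec file actually supports:

```
shell$ rpmbuild --rebuild \
    --define 'configure_options --with-slurm' \
    openmpi-4.1.0-1.src.rpm
```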

@ -61,7 +61,7 @@ created.
- copy of toss3-hfi-optimized.conf with the following changes:
- change: comment "Add the interface for out-of-band communication and set
it up" to "Set up the interface for out-of-band communication"
- remove: oob_tcp_if_exclude = ib0
- remove: oob_tcp_if_exclude = ib0
- remove: btl (let Open MPI figure out what best to use for ethernet-
connected hardware)
- remove: btl_openib_want_fork_support (no infiniband)

@ -33,7 +33,7 @@
# Automake).
EXTRA_DIST += \
examples/README \
examples/README.md \
examples/Makefile \
examples/hello_c.c \
examples/hello_mpifh.f \

@ -1,67 +0,0 @@
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
University Research and Technology
Corporation. All rights reserved.
Copyright (c) 2006-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2007-2009 Sun Microsystems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
Copyright (c) 2013 Mellanox Technologies, Inc. All rights reserved.
$COPYRIGHT$
The files in this directory are sample MPI applications provided both
as a trivial primer to MPI as well as simple tests to ensure that your
Open MPI installation is working properly.
If you are looking for a comprehensive MPI tutorial, these samples are
not enough. Excellent MPI tutorials are available here:
http://www.citutor.org/login.php
Get a free account and login; you can then browse to the list of
available courses. Look for the ones with "MPI" in the title.
There are two MPI examples in this directory, each using one of six
different MPI interfaces:
- Hello world
C: hello_c.c
C++: hello_cxx.cc
Fortran mpif.h: hello_mpifh.f
Fortran use mpi: hello_usempi.f90
Fortran use mpi_f08: hello_usempif08.f90
Java: Hello.java
C shmem.h: hello_oshmem_c.c
Fortran shmem.fh: hello_oshmemfh.f90
- Send a trivial message around in a ring
C: ring_c.c
C++: ring_cxx.cc
Fortran mpif.h: ring_mpifh.f
Fortran use mpi: ring_usempi.f90
Fortran use mpi_f08: ring_usempif08.f90
Java: Ring.java
C shmem.h: ring_oshmem_c.c
Fortran shmem.fh: ring_oshmemfh.f90
Additionally, there's one further example application, but this one
only uses the MPI C bindings:
- Test the connectivity between all processes
C: connectivity_c.c
The Makefile in this directory will build as many of the examples as
you have language support for (e.g., if you do not have the Fortran "use
mpi" bindings compiled as part of Open MPI, then those examples will be
skipped).
The Makefile assumes that the wrapper compilers mpicc, mpic++, and
mpifort are in your path.
Although the Makefile is tailored for Open MPI (e.g., it checks the
"ompi_info" command to see if you have support for C++, mpif.h, use
mpi, and use mpi_f08 F90), all of the example programs are pure MPI,
and therefore not specific to Open MPI. Hence, you can use a
different MPI implementation to compile and run these programs if you
wish.
Make today an Open MPI day!

examples/README.md
@ -0,0 +1,66 @@
The files in this directory are sample MPI applications provided both
as a trivial primer to MPI as well as simple tests to ensure that your
Open MPI installation is working properly.
If you are looking for a comprehensive MPI tutorial, these samples are
not enough. [Excellent MPI tutorials are available
here](http://www.citutor.org/login.php).
Get a free account and login; you can then browse to the list of
available courses. Look for the ones with "MPI" in the title.
There are two MPI examples in this directory, each using one of six
different MPI interfaces:
## Hello world
The MPI version of the canonical "hello world" program:
* C: `hello_c.c`
* C++: `hello_cxx.cc`
* Fortran mpif.h: `hello_mpifh.f`
* Fortran use mpi: `hello_usempi.f90`
* Fortran use mpi_f08: `hello_usempif08.f90`
* Java: `Hello.java`
* C shmem.h: `hello_oshmem_c.c`
* Fortran shmem.fh: `hello_oshmemfh.f90`
## Ring
Send a trivial message around in a ring:
* C: `ring_c.c`
* C++: `ring_cxx.cc`
* Fortran mpif.h: `ring_mpifh.f`
* Fortran use mpi: `ring_usempi.f90`
* Fortran use mpi_f08: `ring_usempif08.f90`
* Java: `Ring.java`
* C shmem.h: `ring_oshmem_c.c`
* Fortran shmem.fh: `ring_oshmemfh.f90`
## Connectivity Test
Additionally, there's one further example application, but this one
only uses the MPI C bindings to test the connectivity between all
processes:
* C: `connectivity_c.c`
## Makefile
The `Makefile` in this directory will build as many of the examples as
you have language support for (e.g., if you do not have the Fortran `use
mpi` bindings compiled as part of Open MPI, then those examples will be
skipped).
The `Makefile` assumes that the wrapper compilers `mpicc`, `mpic++`, and
`mpifort` are in your path.
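For example, assuming the wrapper compilers are in your `PATH`, a
typical build-and-run session might look like this (the `-np 4` value
is arbitrary):

```
shell$ make
shell$ mpirun -np 4 ./hello_c
shell$ mpirun -np 4 ./ring_c
```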
Although the `Makefile` is tailored for Open MPI (e.g., it checks the
`ompi_info` command to see if you have support for `mpif.h`, the `mpi`
module, and the `use mpi_f08` module), all of the example programs are
pure MPI, and therefore not specific to Open MPI. Hence, you can use
a different MPI implementation to compile and run these programs if
you wish.
Make today an Open MPI day!

ompi/contrib/README.md
@ -0,0 +1,19 @@
This is the OMPI contrib system. It is (far) less functional and
flexible than the OMPI MCA framework/component system.
Each contrib package must have a `configure.m4`. It may optionally also
have an `autogen.subdirs` file.
If it has a `configure.m4` file, it must specify its own relevant
files to `AC_CONFIG_FILES` to create during `AC_OUTPUT` -- just like
MCA components (at a minimum, usually its own `Makefile`). The
`configure.m4` file will be slurped up into the main `configure`
script, just like other MCA components. Note that there is currently
no "no configure" option for contrib packages -- you *must* have a
`configure.m4` (even if all it does it call `$1`). Feel free to fix
this situation if you want -- it probably won't not be too difficult
to extend `autogen.pl` to support this scenario, similar to how it is
done for MCA components. :smile:
If it has an `autogen.subdirs` file, then it needs to be a
subdirectory that is autogen-able.

@ -1,19 +0,0 @@
This is the OMPI contrib system. It is (far) less functional and
flexible than the OMPI MCA framework/component system.
Each contrib package must have a configure.m4. It may optionally also
have an autogen.subdirs file.
If it has a configure.m4 file, it must specify its own relevant files
to AC_CONFIG_FILES to create during AC_OUTPUT -- just like MCA
components (at a minimum, usually its own Makefile). The configure.m4
file will be slurped up into the main configure script, just like
other MCA components. Note that there is currently no "no configure"
option for contrib packages -- you *must* have a configure.m4 (even if
all it does is call $1). Feel free to fix this situation if you want
-- it probably won't be too difficult to extend autogen.pl to
support this scenario, similar to how it is done for MCA components.
:-)
If it has an autogen.subdirs file, then it needs to be a subdirectory
that is autogen-able.

@ -13,7 +13,7 @@
# $HEADER$
#
EXTRA_DIST = profile2mat.pl aggregate_profile.pl
EXTRA_DIST = profile2mat.pl aggregate_profile.pl README.md
sources = common_monitoring.c common_monitoring_coll.c
headers = common_monitoring.h common_monitoring_coll.h

@ -1,181 +0,0 @@
Copyright (c) 2013-2015 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2013-2015 Inria. All rights reserved.
$COPYRIGHT$
Additional copyrights may follow
$HEADER$
===========================================================================
Low level communication monitoring interface in Open MPI
Introduction
------------
This interface traces and monitors all messages sent by MPI before they go to the
communication channels. At that level, all communications are point-to-point:
collectives are already decomposed into send and receive calls.
The monitoring is stored internally by each process and output on stderr at the end of the
application (during MPI_Finalize()).
Enabling the monitoring
-----------------------
To enable the monitoring add --mca pml_monitoring_enable x to the mpirun command line.
If x = 1 it monitors internal and external tags indifferently and aggregates everything.
If x = 2 it monitors internal tags and external tags separately.
If x = 0 the monitoring is disabled.
Other values of x are not supported.
Internal tags are tags < 0. They are used to tag sends and receives coming from
collective operations or from protocol communications.
External tags are tags >= 0. They are used by the application in point-to-point communication.
Therefore, distinguishing external and internal tags helps to distinguish between point-to-point
and other communication (mainly collectives).
Output format
-------------
The output of the monitoring looks like (with --mca pml_monitoring_enable 2):
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
Where:
- the first column distinguishes internal (I) and external (E) tags.
- the second column is the sender rank
- the third column is the receiver rank
- the fourth column is the number of bytes sent
- the last column is the number of messages.
In this example, process 0 has sent 27 messages to process 1 using point-to-point calls
(108 bytes in total) and 30 messages with collective- and protocol-related communication
(1012 bytes in total).
If the monitoring was called with --mca pml_monitoring_enable 1, everything is aggregated
under the internal tags. With the above example, you have:
I 0 1 1120 bytes 57 msgs sent
I 0 2 23052 bytes 61 msgs sent
I 1 0 860 bytes 24 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 2760 bytes 108 msgs sent
I 2 0 22804 bytes 49 msgs sent
I 2 3 964 bytes 50 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 2508 bytes 95 msgs sent
I 3 2 860 bytes 24 msgs sent
Monitoring phases
-----------------
If one wants to monitor phases of the application, it is possible to flush the monitoring
at the application level. In this case all the monitoring since the last flush is stored
by every process in a file.
An example of how to flush such monitoring is given in test/monitoring/monitoring_test.c
Moreover, all the different flushed phases are aggregated at runtime and output at the end
of the application as described above.
Example
-------
A working example is given in test/monitoring/monitoring_test.c.
It features MPI_COMM_WORLD monitoring, sub-communicator monitoring, collective and
point-to-point communication monitoring, and phase monitoring.
To compile:
> make monitoring_test
Helper scripts
--------------
Two perl scripts are provided in test/monitoring:
- aggregate_profile.pl aggregates the monitoring phases of different processes.
This script aggregates the profiles generated by the flush_monitoring function.
The files need to be in a given format: name_<phase_id>_<process_id>
They are then aggregated by phases.
If you need the profile of all the phases, you can concatenate the different files,
or use the output of the monitoring system produced at MPI_Finalize.
In the example it should be called as:
./aggregate_profile.pl prof/phase to generate
prof/phase_1.prof
prof/phase_2.prof
- profile2mat.pl transforms the monitoring output into a communication matrix.
It takes a profile file and aggregates all the recorded communicators into matrices.
It generates a matrix for the number of messages (msg),
for the total bytes transmitted (size), and
for the average number of bytes per message (avg).
The output matrix is symmetric.
Do not forget to set the execute permission on these scripts.
For instance, the provided examples store phases output in ./prof
If you type:
> mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
you should have the following output
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
you can parse the phases with:
> ./aggregate_profile.pl prof/phase
Building prof/phase_1.prof
Building prof/phase_2.prof
And you can build the different communication matrices of phase 1 with:
> ./profile2mat.pl prof/phase_1.prof
prof/phase_1.prof -> all
prof/phase_1_size_all.mat
prof/phase_1_msg_all.mat
prof/phase_1_avg_all.mat
prof/phase_1.prof -> external
prof/phase_1_size_external.mat
prof/phase_1_msg_external.mat
prof/phase_1_avg_external.mat
prof/phase_1.prof -> internal
prof/phase_1_size_internal.mat
prof/phase_1_msg_internal.mat
prof/phase_1_avg_internal.mat
Credit
------
Designed by George Bosilca <bosilca@icl.utk.edu> and
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>

ompi/mca/common/monitoring/README.md
@ -0,0 +1,209 @@
# Open MPI common monitoring module
Copyright (c) 2013-2015 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2013-2015 Inria. All rights reserved.
Low level communication monitoring interface in Open MPI
## Introduction
This interface traces and monitors all messages sent by MPI before
they go to the communication channels. At that level, all
communications are point-to-point: collectives are
already decomposed into send and receive calls.
The monitoring is stored internally by each process and output on
stderr at the end of the application (during `MPI_Finalize()`).
## Enabling the monitoring
To enable the monitoring add `--mca pml_monitoring_enable x` to the
`mpirun` command line:
* If x = 1, it monitors internal and external tags indifferently and aggregates everything.
* If x = 2, it monitors internal tags and external tags separately.
* If x = 0, the monitoring is disabled.
* Other values of x are not supported.
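For example (`./my_mpi_app` is a placeholder for your own application):

```
shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./my_mpi_app
```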
Internal tags are tags < 0. They are used to tag sends and receives
coming from collective operations or from protocol communications.
External tags are tags >= 0. They are used by the application in
point-to-point communication.
Therefore, distinguishing external and internal tags helps to
distinguish between point-to-point and other communication (mainly
collectives).
## Output format
The output of the monitoring looks like (with `--mca
pml_monitoring_enable 2`):
```
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
```
Where:
1. the first column distinguishes internal (I) and external (E) tags.
1. the second column is the sender rank
1. the third column is the receiver rank
1. the fourth column is the number of bytes sent
1. the last column is the number of messages.
In this example, process 0 has sent 27 messages to process 1 using
point-to-point calls (108 bytes in total) and 30 messages with
collective- and protocol-related communication (1012 bytes in total).
If the monitoring was called with `--mca pml_monitoring_enable 1`,
everything is aggregated under the internal tags. With the above
example, you have:
```
I 0 1 1120 bytes 57 msgs sent
I 0 2 23052 bytes 61 msgs sent
I 1 0 860 bytes 24 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 2760 bytes 108 msgs sent
I 2 0 22804 bytes 49 msgs sent
I 2 3 964 bytes 50 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 2508 bytes 95 msgs sent
I 3 2 860 bytes 24 msgs sent
```
## Monitoring phases
If one wants to monitor phases of the application, it is possible to
flush the monitoring at the application level. In this case all the
monitoring since the last flush is stored by every process in a file.
An example of how to flush such monitoring is given in
`test/monitoring/monitoring_test.c`.
Moreover, all the different flushed phases are aggregated at runtime
and output at the end of the application as described above.
## Example
A working example is given in `test/monitoring/monitoring_test.c`. It
features `MPI_COMM_WORLD` monitoring, sub-communicator monitoring,
collective and point-to-point communication monitoring, and phase
monitoring.
To compile:
```
shell$ make monitoring_test
```
## Helper scripts
Two perl scripts are provided in `test/monitoring`:
1. `aggregate_profile.pl` aggregates the monitoring phases of
different processes. This script aggregates the profiles generated by
the `flush_monitoring` function.
The files need to be in a given format: `name_<phase_id>_<process_id>`.
They are then aggregated by phases.
If you need the profile of all the phases, you can concatenate the
different files, or use the output of the monitoring system produced
at `MPI_Finalize`. In the example it should be called as:
```
./aggregate_profile.pl prof/phase
```
to generate `prof/phase_1.prof` and `prof/phase_2.prof`.
1. `profile2mat.pl` transforms the monitoring output into a
communication matrix. It takes a profile file and aggregates all the
recorded communicators into matrices. It generates a matrix for
the number of messages (msg), for the total bytes transmitted
(size), and for the average number of bytes per message (avg).
The output matrix is symmetric.
For instance, the provided examples store phases output in `./prof`:
```
shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
```
This should produce the following output:
```
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
```
You can then parse the phases with:
```
shell$ ./aggregate_profile.pl prof/phase
Building prof/phase_1.prof
Building prof/phase_2.prof
```
And you can build the different communication matrices of phase 1
with:
```
shell$ ./profile2mat.pl prof/phase_1.prof
prof/phase_1.prof -> all
prof/phase_1_size_all.mat
prof/phase_1_msg_all.mat
prof/phase_1_avg_all.mat
prof/phase_1.prof -> external
prof/phase_1_size_external.mat
prof/phase_1_msg_external.mat
prof/phase_1_avg_external.mat
prof/phase_1.prof -> internal
prof/phase_1_size_internal.mat
prof/phase_1_msg_internal.mat
prof/phase_1_avg_internal.mat
```
## Authors
Designed by George Bosilca <bosilca@icl.utk.edu> and
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>

@ -1,340 +0,0 @@
OFI MTL:
--------
The OFI MTL supports Libfabric (a.k.a. Open Fabrics Interfaces OFI,
https://ofiwg.github.io/libfabric/) tagged APIs (fi_tagged(3)). At
initialization time, the MTL queries libfabric for providers supporting tag matching
(fi_getinfo(3)). Libfabric will return a list of providers that satisfy the requested
capabilities, having the most performant one at the top of the list.
The user may modify the OFI provider selection with mca parameters
mtl_ofi_provider_include or mtl_ofi_provider_exclude.
PROGRESS:
---------
The MTL registers a progress function to opal_progress. There is currently
no support for asynchronous progress. The progress function reads multiple events
from the OFI provider Completion Queue (CQ) per iteration (defaults to 100, can be
modified with the mca mtl_ofi_progress_event_cnt) and iterates until the
completion queue is drained.
COMPLETIONS:
------------
Each operation uses a request type ompi_mtl_ofi_request_t which includes a reference
to an operation specific completion callback, an MPI request, and a context. The
context (fi_context) is used to map completion events with MPI_requests when reading the
CQ.
OFI TAG:
--------
MPI needs to send 96 bits of information per message (32 bits communicator id,
32 bits source rank, 32 bits MPI tag) but OFI only offers 64 bits tags. In
addition, the OFI MTL uses 2 bits of the OFI tag for the synchronous send protocol.
Therefore, there are only 62 bits available in the OFI tag for message usage. The
OFI MTL offers the mtl_ofi_tag_mode mca parameter with 4 modes to address this:
"auto" (Default):
After the OFI provider is selected, a runtime check is performed to assess
FI_REMOTE_CQ_DATA and FI_DIRECTED_RECV support (see fi_tagged(3), fi_msg(2)
and fi_getinfo(3)). If supported, "ofi_tag_full" is used. If not supported,
fall back to "ofi_tag_1".
"ofi_tag_1":
For providers that do not support FI_REMOTE_CQ_DATA, the OFI MTL will
trim the fields (Communicator ID, Source Rank, MPI tag) to make them fit the 62
bits available in the OFI tag. There are two options available with different
number of bits for the Communicator ID and MPI tag fields. This tag distribution
offers: 12 bits for Communicator ID (max Communicator ID 4,095) subject to
provider reserved bits (see mem_tag_format below), 18 bits for Source Rank (max
Source Rank 262,143), 32 bits for MPI tag (max MPI tag is INT_MAX).
"ofi_tag_2":
Same as 2 "ofi_tag_1" but offering a different OFI tag distribution for
applications that may require a greater number of supported Communicators at the
expense of fewer MPI tag bits. This tag distribution offers: 24 bits for
Communicator ID (max Communicator ED 16,777,215. See mem_tag_format below), 18
bits for Source Rank (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
524,287).
"ofi_tag_full":
For executions that cannot accept trimming source rank or MPI tag, this mode sends
source rank for each message in the CQ DATA. The Source Rank is made available at
the remote process CQ (FI_CQ_FORMAT_TAGGED is used, see fi_cq(3)) at the completion
of the matching receive operation. Since the minimum size for FI_REMOTE_CQ_DATA
is 32 bits, the Source Rank fits with no limitations. The OFI tag is used for the
Communicator id (28 bits, max Communicator ID 268,435,455. See mem_tag_format below),
and the MPI tag (max MPI tag is INT_MAX). If this mode is selected by the user
and FI_REMOTE_CQ_DATA or FI_DIRECTED_RECV are not supported, the execution will abort.
mem_tag_format (fi_endpoint(3))
Some providers can reserve the higher order bits from the OFI tag for internal purposes.
This is signaled in mem_tag_format (see fi_endpoint(3)) by setting higher order bits
to zero. In such cases, the OFI MTL will reduce the number of communicator ids supported
by reducing the bits available for the communicator ID field in the OFI tag.
SCALABLE ENDPOINTS:
-------------------
OFI MTL supports OFI Scalable Endpoints (SEP) feature as a means to improve
multi-threaded application throughput and message rate. Currently the feature
is designed to utilize multiple TX/RX contexts exposed by the OFI provider in
conjunction with a multi-communicator MPI application model. Therefore, new OFI
contexts are created as and when communicators are duplicated in a lazy fashion
instead of creating them all at once during init time and this approach also
favours only creating as many contexts as needed.
1. Multi-communicator model:
With this approach, the MPI application is required to first duplicate
the communicators it wants to use with MPI operations (ideally creating
as many communicators as the number of threads it wants to use to call
into MPI). The duplicated communicators are then used by the
corresponding threads to perform MPI operations. A possible usage
scenario could be in an MPI + OMP application as follows
(example limited to 2 ranks):
MPI_Comm dup_comm[n];
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
for (i = 0; i < n; i++) {
MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
}
if (rank == 0) {
#pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
for (i = 0; i < n ; i++) {
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
1, MSG_TAG, dup_comm[i]);
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
1, MSG_TAG, dup_comm[i], &status);
}
} else if (rank == 1) {
#pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
for (i = 0; i < n ; i++) {
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
0, MSG_TAG, dup_comm[i], &status);
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
0, MSG_TAG, dup_comm[i]);
}
}
2. MCA variables:
To utilize the feature, the following MCA variables need to be set:
mtl_ofi_enable_sep:
This MCA variable needs to be set to enable the use of Scalable Endpoints (SEP)
feature in the OFI MTL. The underlying provider is also checked to ensure the
feature is supported. If the provider chosen does not support it, user needs
to either set this variable to 0 or select a different provider which supports
the feature.
For single-threaded applications one OFI context is sufficient, so OFI SEPs
may not add benefit.
Note that mtl_ofi_thread_grouping (see below) needs to be enabled to use the
different OFI SEP contexts. Otherwise, only one context (ctxt 0) will be used.
Default: 0
Command-line syntax:
"-mca mtl_ofi_enable_sep 1"
mtl_ofi_thread_grouping:
Turn Thread Grouping feature on. This is needed to use the Multi-communicator
model explained above. This means that the OFI MTL will use the communicator
ID to decide the SEP contexts to be used by the thread. In this way, each
thread will have direct access to different OFI resources. If disabled,
only context 0 will be used.
Requires mtl_ofi_enable_sep to be set to 1.
Default: 0
It is not recommended to set the MCA variable for:
- Multi-threaded MPI applications not following multi-communicator approach.
- Applications that have multiple threads using a single communicator as
it may degrade performance.
Command-line syntax:
"-mca mtl_ofi_thread_grouping 1"
mtl_ofi_num_ctxts:
This MCA variable allows user to set the number of OFI SEP contexts the
application expects to use. For multi-threaded applications using Thread
Grouping feature, this number should be set to the number of user threads
that will call into MPI. This variable will only have effect if
mtl_ofi_enable_sep is set to 1.
Default: 1
Command-line syntax:
"-mca mtl_ofi_num_ctxts N" [ N: number of OFI contexts required by
application ]
3. Notes on performance:
- OFI MTL will create as many TX/RX contexts as set by MCA mtl_ofi_num_ctxts.
The number of contexts that can be created is also limited by the underlying
provider as each provider may have different thresholds. Once the threshold
is exceeded, contexts are used in a round-robin fashion which leads to
resource sharing among threads. Therefore locks are required to guard
against race conditions. For performance, it is recommended to have
Number of threads = Number of communicators = Number of contexts
For example, when using PSM2 provider, the number of contexts is dictated
by the Intel Omni-Path HFI1 driver module.
- OPAL layer allows for multiple threads to enter progress simultaneously. To
enable this feature, user needs to set MCA variable
"max_thread_in_progress". When using Thread Grouping feature, it is
recommended to set this MCA parameter to the number of threads expected to
call into MPI as it provides performance benefits.
Command-line syntax:
"-mca opal_max_thread_in_progress N" [ N: number of threads expected to
make MPI calls ]
Default: 1
- For applications using a single thread with multiple communicators and MCA
variable "mtl_ofi_thread_grouping" set to 1, the MTL will use multiple
contexts, but the benefits may be negligible as only one thread is driving
progress.
SPECIALIZED FUNCTIONS:
-------------------
To improve performance when calling message passing APIs in the OFI mtl
specialized functions are generated at compile time that eliminate all the
if conditionals that can be determined at init and don't need to be
queried again during the critical path. These functions are generated by
perl scripts during make which generate functions and symbols for every
combination of flags for each function.
1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
To add a new flag to an existing specialized function for handling cases
where different OFI providers may or may not support a particular feature,
then you must follow these steps:
1) Update the "_generic" function in mtl_ofi.h with the new flag and
the if conditionals to read the new value.
2) Update the *.pm file corresponding to the function with the new flag in:
gen_funcs(), gen_*_function(), & gen_*_sym_init()
3) Update mtl_ofi_opt.h with:
The new flag as #define NEW_FLAG_TYPES #NUMBER_OF_STATES
example: #define OFI_CQ_DATA 2 (only has TRUE/FALSE states)
Update the function's types with:
#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]
2. ADDING A NEW FUNCTION FOR SPECIALIZATION:
To add a new function to be specialized you must
follow these steps:
1) Create a new mtl_ofi_"function_name"_opt.pm based off opt_common/mtl_ofi_opt.pm.template
2) Add new .pm file to generated_source_modules in Makefile.am
3) Add .c file to generated_sources in Makefile.am named the same as the corresponding .pm file
4) Update existing or create function in mtl_ofi.h to _generic with new flags.
5) Update mtl_ofi_opt.h with:
a) New function types: #define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES]
b) Add new function to the struct ompi_mtl_ofi_symtable:
struct ompi_mtl_ofi_symtable {
...
int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
}
c) Add new symbol table init function definition:
void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
6) Add calls to init the new function in the symbol table and assign the function
pointer to be used based off the flags in mtl_ofi_component.c:
ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);
ompi_mtl_ofi.base.mtl_FUNCTION =
ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];
3. EXAMPLE SPECIALIZED FILE:
The code below is an example of what is generated by the specialization
scripts for use in the OFI mtl. This code specializes the blocking
send functionality based on FI_REMOTE_CQ_DATA & OFI Scalable Endpoint support
provided by an OFI Provider. Only one function and symbol is used during
runtime based on if FI_REMOTE_CQ_DATA is supported and/or if OFI Scalable
Endpoint support is enabled.
/*
* Copyright (c) 2013-2018 Intel, Inc. All rights reserved
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "mtl_ofi.h"
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = false;
const bool OFI_SCEP_EPS = false;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = false;
const bool OFI_SCEP_EPS = true;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = true;
const bool OFI_SCEP_EPS = false;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = true;
const bool OFI_SCEP_EPS = true;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
sym_table->ompi_mtl_ofi_send[false][false]
= ompi_mtl_ofi_send_false_false;
sym_table->ompi_mtl_ofi_send[false][true]
= ompi_mtl_ofi_send_false_true;
sym_table->ompi_mtl_ofi_send[true][false]
= ompi_mtl_ofi_send_true_false;
sym_table->ompi_mtl_ofi_send[true][true]
= ompi_mtl_ofi_send_true_true;
}
###

ompi/mca/mtl/ofi/README.md
@ -0,0 +1,368 @@
# Open MPI OFI MTL
The OFI MTL supports Libfabric (a.k.a., [Open Fabrics Interfaces
OFI](https://ofiwg.github.io/libfabric/)) tagged APIs
(`fi_tagged(3)`). At initialization time, the MTL queries libfabric
for providers supporting tag matching (`fi_getinfo(3)`). Libfabric
will return a list of providers that satisfy the requested
capabilities, having the most performant one at the top of the list.
The user may modify the OFI provider selection with mca parameters
`mtl_ofi_provider_include` or `mtl_ofi_provider_exclude`.
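For example, a sketch of explicitly selecting the OFI MTL (via the
`cm` PML) and restricting it to a single provider might look like the
following; `psm2` and `./my_mpi_app` are placeholders, so use whatever
provider your fabric actually exposes:

```
shell$ mpirun --mca pml cm --mca mtl ofi \
    --mca mtl_ofi_provider_include psm2 \
    -np 4 ./my_mpi_app
```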
## PROGRESS
The MTL registers a progress function to `opal_progress`. There is
currently no support for asynchronous progress. The progress function
reads multiple events from the OFI provider Completion Queue (CQ) per
iteration (defaults to 100, can be modified with the mca
`mtl_ofi_progress_event_cnt`) and iterates until the completion queue is
drained.
## COMPLETIONS
Each operation uses a request type `ompi_mtl_ofi_request_t` which
includes a reference to an operation specific completion callback, an
MPI request, and a context. The context (`fi_context`) is used to map
completion events with `MPI_requests` when reading the CQ.
## OFI TAG
MPI needs to send 96 bits of information per message (32 bits
communicator id, 32 bits source rank, 32 bits MPI tag) but OFI only
offers 64 bits tags. In addition, the OFI MTL uses 2 bits of the OFI
tag for the synchronous send protocol. Therefore, there are only 62
bits available in the OFI tag for message usage. The OFI MTL offers
the `mtl_ofi_tag_mode` mca parameter with 4 modes to address this:
* `auto` (Default):
After the OFI provider is selected, a runtime check is performed to
assess `FI_REMOTE_CQ_DATA` and `FI_DIRECTED_RECV` support (see
`fi_tagged(3)`, `fi_msg(2)` and `fi_getinfo(3)`). If supported,
`ofi_tag_full` is used. If not supported, fall back to `ofi_tag_1`.
* `ofi_tag_1`:
For providers that do not support `FI_REMOTE_CQ_DATA`, the OFI MTL
will trim the fields (Communicator ID, Source Rank, MPI tag) to make
them fit within the 62 bits available in the OFI tag. There are two
options available with different number of bits for the Communicator
ID and MPI tag fields. This tag distribution offers: 12 bits for
Communicator ID (max Communicator ID 4,095) subject to provider
reserved bits (see `mem_tag_format` below), 18 bits for Source Rank
(max Source Rank 262,143), 32 bits for MPI tag (max MPI tag is
`INT_MAX`).
* `ofi_tag_2`:
Same as `ofi_tag_1` but offering a different OFI tag distribution
for applications that may require a greater number of supported
Communicators at the expense of fewer MPI tag bits. This tag
distribution offers: 24 bits for Communicator ID (max Communicator
ID 16,777,215. See `mem_tag_format` below), 18 bits for Source Rank
(max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
524,287).
* `ofi_tag_full`:
For executions that cannot accept trimming source rank or MPI tag,
this mode sends source rank for each message in the CQ DATA. The
Source Rank is made available at the remote process CQ
(`FI_CQ_FORMAT_TAGGED` is used, see `fi_cq(3)`) at the completion of
the matching receive operation. Since the minimum size for
`FI_REMOTE_CQ_DATA` is 32 bits, the Source Rank fits with no
limitations. The OFI tag is used for the Communicator id (28 bits,
max Communicator ID 268,435,455. See `mem_tag_format` below), and
the MPI tag (max MPI tag is `INT_MAX`). If this mode is selected by
the user and `FI_REMOTE_CQ_DATA` or `FI_DIRECTED_RECV` are not
supported, the execution will abort.
* `mem_tag_format` (`fi_endpoint(3)`)
Some providers can reserve the higher order bits from the OFI tag
for internal purposes. This is signaled in `mem_tag_format` (see
`fi_endpoint(3)`) by setting higher order bits to zero. In such
cases, the OFI MTL will reduce the number of communicator ids
supported by reducing the bits available for the communicator ID
field in the OFI tag.
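For instance, to force a specific tag mode instead of the default
`auto` (a sketch only; `./my_mpi_app` is a placeholder):

```
shell$ mpirun --mca mtl_ofi_tag_mode ofi_tag_full -np 4 ./my_mpi_app
```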
## SCALABLE ENDPOINTS
OFI MTL supports OFI Scalable Endpoints (SEP) feature as a means to
improve multi-threaded application throughput and message
rate. Currently the feature is designed to utilize multiple TX/RX
contexts exposed by the OFI provider in conjunction with a
multi-communicator MPI application model. Therefore, new OFI contexts
are created as and when communicators are duplicated in a lazy fashion
instead of creating them all at once during init time and this
approach also favours only creating as many contexts as needed.
1. Multi-communicator model:
With this approach, the MPI application is required to first duplicate
the communicators it wants to use with MPI operations (ideally creating
as many communicators as the number of threads it wants to use to call
into MPI). The duplicated communicators are then used by the
corresponding threads to perform MPI operations. A possible usage
scenario could be in an MPI + OMP application as follows
(example limited to 2 ranks):
```c
MPI_Comm dup_comm[n];
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
for (i = 0; i < n; i++) {
MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
}
if (rank == 0) {
#pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
for (i = 0; i < n ; i++) {
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
1, MSG_TAG, dup_comm[i]);
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
1, MSG_TAG, dup_comm[i], &status);
}
} else if (rank == 1) {
#pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
for (i = 0; i < n ; i++) {
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
0, MSG_TAG, dup_comm[i], &status);
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
0, MSG_TAG, dup_comm[i]);
}
}
```
2. MCA variables:
To utilize the feature, the following MCA variables need to be set:
* `mtl_ofi_enable_sep`:
This MCA variable needs to be set to enable the use of Scalable
Endpoints (SEP) feature in the OFI MTL. The underlying provider
is also checked to ensure the feature is supported. If the
provider chosen does not support it, user needs to either set
this variable to 0 or select a different provider which supports
the feature. For single-threaded applications one OFI context is
sufficient, so OFI SEPs may not add benefit. Note that
`mtl_ofi_thread_grouping` (see below) needs to be enabled to use
the different OFI SEP contexts. Otherwise, only one context (ctxt
0) will be used.
Default: 0
Command-line syntax: `--mca mtl_ofi_enable_sep 1`
* `mtl_ofi_thread_grouping`:
Turn Thread Grouping feature on. This is needed to use the
Multi-communicator model explained above. This means that the OFI
MTL will use the communicator ID to decide the SEP contexts to be
used by the thread. In this way, each thread will have direct
access to different OFI resources. If disabled, only context 0
will be used. Requires `mtl_ofi_enable_sep` to be set to 1.
Default: 0
It is not recommended to set the MCA variable for:
* Multi-threaded MPI applications not following multi-communicator
approach.
* Applications that have multiple threads using a single
communicator as it may degrade performance.
Command-line syntax: `--mca mtl_ofi_thread_grouping 1`
* `mtl_ofi_num_ctxts`:
This MCA variable allows user to set the number of OFI SEP
contexts the application expects to use. For multi-threaded
applications using Thread Grouping feature, this number should be
set to the number of user threads that will call into MPI. This
variable will only have effect if `mtl_ofi_enable_sep` is set to 1.
Default: 1
Command-line syntax: `--mca mtl_ofi_num_ctxts N` (`N`: number of OFI contexts required by application)
3. Notes on performance:
* OFI MTL will create as many TX/RX contexts as set by MCA
mtl_ofi_num_ctxts. The number of contexts that can be created is
also limited by the underlying provider as each provider may have
different thresholds. Once the threshold is exceeded, contexts are
used in a round-robin fashion which leads to resource sharing
among threads. Therefore locks are required to guard against race
conditions. For performance, it is recommended to have
Number of threads = Number of communicators = Number of contexts
For example, when using PSM2 provider, the number of contexts is
dictated by the Intel Omni-Path HFI1 driver module.
* OPAL layer allows for multiple threads to enter progress
simultaneously. To enable this feature, user needs to set MCA
variable `max_thread_in_progress`. When using Thread Grouping
feature, it is recommended to set this MCA parameter to the number
of threads expected to call into MPI as it provides performance
benefits.
Default: 1
Command-line syntax: `--mca opal_max_thread_in_progress N` (`N`: number of threads expected to make MPI calls )
* For applications using a single thread with multiple communicators
and MCA variable `mtl_ofi_thread_grouping` set to 1, the MTL will
use multiple contexts, but the benefits may be negligible as only
one thread is driving progress.
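Putting the above together, a sketch of a command line for a
multi-threaded run that enables SEP and Thread Grouping with 8
contexts (the application name and all counts are placeholders) might
be:

```
shell$ mpirun -np 2 \
    --mca mtl_ofi_enable_sep 1 \
    --mca mtl_ofi_thread_grouping 1 \
    --mca mtl_ofi_num_ctxts 8 \
    --mca opal_max_thread_in_progress 8 \
    ./my_threaded_app
```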
## SPECIALIZED FUNCTIONS
To improve performance when calling message passing APIs in the OFI
MTL, specialized functions are generated at compile time that eliminate
all the `if` conditionals that can be determined at init time and don't
need to be queried again during the critical path. These functions are
generated by perl scripts during `make`, which generate functions and
symbols for every combination of flags for each function.
1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
To add a new flag to an existing specialized function for handling
cases where different OFI providers may or may not support a
particular feature, then you must follow these steps:
1. Update the `_generic` function in `mtl_ofi.h` with the new flag
and the if conditionals to read the new value.
1. Update the `*.pm` file corresponding to the function with the
new flag in: `gen_funcs()`, `gen_*_function()`, &
`gen_*_sym_init()`
1. Update `mtl_ofi_opt.h` with:
* The new flag as `#define NEW_FLAG_TYPES #NUMBER_OF_STATES`.
Example: `#define OFI_CQ_DATA 2` (only has TRUE/FALSE states)
* Update the function's types with:
`#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]`
1. ADDING A NEW FUNCTION FOR SPECIALIZATION:
To add a new function to be specialized you must
follow these steps:
1. Create a new `mtl_ofi_<function_name>_opt.pm` based off
`opt_common/mtl_ofi_opt.pm.template`
1. Add new `.pm` file to `generated_source_modules` in `Makefile.am`
1. Add `.c` file to `generated_sources` in `Makefile.am` named the
same as the corresponding `.pm` file
1. Update existing or create function in `mtl_ofi.h` to `_generic`
with new flags.
1. Update `mtl_ofi_opt.h` with:
1. New function types: `#define OMPI_MTL_OFI_FUNCTION_TYPES` `[FLAG_TYPES]`
1. Add new function to the `struct ompi_mtl_ofi_symtable`:
```c
struct ompi_mtl_ofi_symtable {
...
int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
}
```
1. Add new symbol table init function definition:
```c
void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
```
1. Add calls to init the new function in the symbol table and
assign the function pointer to be used based off the flags in
`mtl_ofi_component.c`:
* `ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);`
* `ompi_mtl_ofi.base.mtl_FUNCTION = ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];`
## EXAMPLE SPECIALIZED FILE
The code below is an example of what is generated by the
specialization scripts for use in the OFI mtl. This code specializes
the blocking send functionality based on `FI_REMOTE_CQ_DATA` & OFI
Scalable Endpoint support provided by an OFI Provider. Only one
function and symbol is used during runtime based on if
`FI_REMOTE_CQ_DATA` is supported and/or if OFI Scalable Endpoint support
is enabled.
```c
/*
* Copyright (c) 2013-2018 Intel, Inc. All rights reserved
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "mtl_ofi.h"
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = false;
const bool OFI_SCEP_EPS = false;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = false;
const bool OFI_SCEP_EPS = true;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = true;
const bool OFI_SCEP_EPS = false;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = true;
const bool OFI_SCEP_EPS = true;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
sym_table->ompi_mtl_ofi_send[false][false]
= ompi_mtl_ofi_send_false_false;
sym_table->ompi_mtl_ofi_send[false][true]
= ompi_mtl_ofi_send_false_true;
sym_table->ompi_mtl_ofi_send[true][false]
= ompi_mtl_ofi_send_true_false;
sym_table->ompi_mtl_ofi_send[true][true]
= ompi_mtl_ofi_send_true_true;
}
```

@ -1,5 +1,3 @@
Copyright 2009 Cisco Systems, Inc. All rights reserved.
This is a simple example op component meant to be a template /
springboard for people to write their own op components. There are
many different ways to write components and modules; this is but one
@ -13,28 +11,26 @@ same end effect. Feel free to customize / simplify / strip out what
you don't need from this example.
This example component supports a fictitious set of hardware that
provides acceleration for the MPI_MAX and MPI_BXOR MPI_Ops. The
provides acceleration for the `MPI_MAX` and `MPI_BXOR` `MPI_Ops`. The
fictitious hardware has multiple versions, too: some versions only
support single precision floating point types for MAX and single
precision integer types for BXOR, whereas later versions support both
single and double precision floating point types for MAX and both
single and double precision integer types for BXOR. Hence, this
example walks through setting up particular MPI_Op function pointers
based on:
support single precision floating point types for `MAX` and single
precision integer types for `BXOR`, whereas later versions support
both single and double precision floating point types for `MAX` and
both single and double precision integer types for `BXOR`. Hence,
this example walks through setting up particular `MPI_Op` function
pointers based on:
a) hardware availability (e.g., does the node where this MPI process
1. hardware availability (e.g., does the node where this MPI process
is running have the relevant hardware/resources?)
b) MPI_Op (e.g., in this example, only MPI_MAX and MPI_BXOR are
1. `MPI_Op` (e.g., in this example, only `MPI_MAX` and `MPI_BXOR` are
supported)
c) datatype (e.g., single/double precision floating point for MAX and
single/double precision integer for BXOR)
1. datatype (e.g., single/double precision floating point for `MAX`
and single/double precision integer for `BXOR`)
Additionally, there are other considerations that should be factored
in at run time. Hardware accelerators are great, but they do induce
overhead -- for example, some accelerator hardware require registered
memory. So even if a particular MPI_Op and datatype are supported, it
memory. So even if a particular `MPI_Op` and datatype are supported, it
may not be worthwhile to use the hardware unless the amount of data to
be processed is "big enough" (meaning that the cost of the
registration and/or copy-in/copy-out is ameliorated) or the memory to
@ -47,57 +43,65 @@ failover strategy is well-supported by the op framework; during the
query process, a component can "stack" itself similar to how POSIX
signal handlers can be stacked. Specifically, op components can cache
other implementations of operation functions for use in the case of
failover. The MAX and BXOR module implementations show one way of
failover. The `MAX` and `BXOR` module implementations show one way of
using this method.
Here's a listing of the files in the example component and what they
do:
- configure.m4: Tests that get slurped into OMPI's top-level configure
script to determine whether this component will be built or not.
- Makefile.am: Automake makefile that builds this component.
- op_example_component.c: The main "component" source file.
- op_example_module.c: The main "module" source file.
- op_example.h: information that is shared between the .c files.
- .ompi_ignore: the presence of this file causes OMPI's autogen.pl to
skip this component in the configure/build/install process (see
- `configure.m4`: Tests that get slurped into OMPI's top-level
`configure` script to determine whether this component will be built
or not.
- `Makefile.am`: Automake makefile that builds this component.
- `op_example_component.c`: The main "component" source file.
- `op_example_module.c`: The main "module" source file.
- `op_example.h`: information that is shared between the `.c` files.
- `.ompi_ignore`: the presence of this file causes OMPI's `autogen.pl`
to skip this component in the configure/build/install process (see
below).
To use this example as a template for your component (assume your new
component is named "foo"):
component is named `foo`):
```
shell$ cd (top_ompi_dir)/ompi/mca/op
shell$ cp -r example foo
shell$ cd foo
```
Remove the .ompi_ignore file (which makes the component "visible" to
all developers) *OR* add an .ompi_unignore file with one username per
line (as reported by `whoami`). OMPI's autogen.pl will skip any
component with a .ompi_ignore file *unless* there is also an
Remove the `.ompi_ignore` file (which makes the component "visible" to
all developers) *OR* add an `.ompi_unignore` file with one username per
line (as reported by `whoami`). OMPI's `autogen.pl` will skip any
component with a `.ompi_ignore` file *unless* there is also an
.ompi_unignore file containing your user ID in it. This is a handy
mechanism to have a component in the tree but have it not built / used
by most other developers:
```
shell$ rm .ompi_ignore
*OR*
shell$ whoami > .ompi_unignore
```
Now rename any file that contains "example" in the filename to have
"foo", instead. For example:
Now rename any file that contains `example` in the filename to have
`foo`, instead. For example:
```
shell$ mv op_example_component.c op_foo_component.c
#...etc.
```
Now edit all the files and s/example/foo/gi. Specifically, replace
all instances of "example" with "foo" in all function names, type
names, header #defines, strings, and global variables.
Now edit all the files and `s/example/foo/gi`. Specifically, replace
all instances of `example` with `foo` in all function names, type
names, header `#defines`, strings, and global variables.
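A minimal sketch of doing that rename from the shell (assuming GNU
`sed`; adjust the file list as needed and double-check the results by
hand):

```
shell$ sed -i 's/example/foo/g; s/EXAMPLE/FOO/g' \
    op_foo.h op_foo_component.c op_foo_module.c Makefile.am configure.m4
```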
Now your component should be fully functional (although entirely
renamed as "foo" instead of "example"). You can go to the top-level
OMPI directory and run "autogen.pl" (which will find your component
and att it to the configure/build process) and then "configure ..."
and "make ..." as normal.
renamed as `foo` instead of `example`). You can go to the top-level
OMPI directory and run `autogen.pl` (which will find your component
and add it to the configure/build process) and then `configure ...`
and `make ...` as normal.
```
shell$ cd (top_ompi_dir)
shell$ ./autogen.pl
# ...lots of output...
@ -107,19 +111,21 @@ shell$ make -j 4 all
# ...lots of output...
shell$ make install
# ...lots of output...
```
After you have installed Open MPI, running `ompi_info` should show
your `foo` component in the output.
```
shell$ ompi_info | grep op:
                 MCA op: foo (MCA v2.0, API v1.0, Component v1.4)
shell$
```
If you do not see your `foo` component, check the above steps, and
check the output of `autogen.pl`, `configure`, and `make` to ensure
that `foo` was found, configured, and built successfully.
Once `ompi_info` sees your component, start editing the `foo`
component files in a meaningful way.

@ -10,3 +10,5 @@
#
SUBDIRS = java c
EXTRA_DIST = README.md

@ -1,26 +1,27 @@
# Open MPI Java bindings
The Java bindings in this directory are not part of the MPI
specification, as noted in the README.JAVA.md file in the root
directory.  That file also contains some information regarding the
installation and use of the Java bindings.  Further details can be
found in the paper [1].
We originally took the code from the mpiJava project [2] as starting point
for our developments, but we have pretty much rewritten 100% of it. The
original copyrights and license terms of mpiJava are listed below.
1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
   implementation of Java bindings in Open MPI". Parallel Comput.
   59: 1-20 (2016).
1. M. Baker et al. "mpiJava: An object-oriented Java interface to
   MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
   pp. 748-762, Springer (1999).
## Original citation
```
mpiJava - A Java Interface to MPI
---------------------------------
Copyright 2003
(Bugfixes/Additions, CMake based configure/build)
Blasius Czink
HLRS, University of Stuttgart
```
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@ -1,4 +1,5 @@
# Symbol conventions for Open MPI extensions
Last updated: January 2015
This README provides some rule-of-thumb guidance for how to name
@ -15,26 +16,22 @@ Generally speaking, there are usually three kinds of extensions:
3. Functionality that is strongly expected to be in an upcoming
version of the MPI specification.
## Case 1
The `OMPI_Paffinity_str()` extension is a good example of this type:
it is solely intended to be for Open MPI.  It will likely never be
pushed to other MPI implementations, and it will likely never be
pushed to the MPI Forum.
It's Open MPI-specific functionality, through and through.
Public symbols of this type of functionality should be named with an
`OMPI_` prefix to emphasize its Open MPI-specific nature.  To be
clear: the `OMPI_` prefix clearly identifies parts of user code that
are relying on Open MPI (and likely need to be surrounded with #if
`OPEN_MPI` blocks, etc.).
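For illustration, here is a minimal sketch of such a guard.  Only the
`OPEN_MPI` macro (defined by Open MPI's `mpi.h`) is assumed to exist;
`OMPI_Foo()` stands in for a hypothetical Open MPI-specific extension
function:
```c
#include <mpi.h>

void maybe_call_ompi_specific_code(void)
{
#if defined(OPEN_MPI)
    /* OMPI_Foo() is a hypothetical OMPI_-prefixed extension call; this
       branch is only compiled when building against Open MPI. */
    OMPI_Foo();
#else
    /* Portable MPI-only fallback goes here. */
#endif
}
```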
## Case 2
The MPI extensions mechanism in Open MPI was designed to help MPI
Forum members prototype new functionality that is intended for the
@ -43,23 +40,21 @@ functionality is not only to be included in the MPI spec, but possibly
also be included in another MPI implementation.
As such, it seems reasonable to prefix public symbols in this type of
functionality with `MPIX_`.  This commonly-used prefix allows the same
symbols to be available in multiple MPI implementations, and therefore
allows user code to easily check for it.  E.g., user apps can check
for the presence of `MPIX_Foo` to know if both Open MPI and Other MPI
support the proposed `MPIX_Foo` functionality.
Of course, when using the `MPIX_` namespace, there is the possibility
of symbol name collisions.  E.g., what if Open MPI has an `MPIX_Foo`
and Other MPI has a *different* `MPIX_Foo`?
While we technically can't prevent such collisions from happening, we
encourage extension authors to avoid such symbol clashes whenever
possible.
## Case 3
It is well-known that the MPI specification (intentionally) takes a
long time to publish. MPI implementers can typically know, with a
@ -72,13 +67,13 @@ functionality early (i.e., before the actual publication of the
corresponding MPI specification document).
Case in point: the non-blocking collective operations that were
included in MPI-3.0 (e.g., `MPI_Ibarrier()`).  It was known for a year
or two before MPI-3.0 was published that these functions would be
included in MPI-3.0.
There is a continual debate among the developer community: when
implementing such functionality, should the symbols be in the MPIX_
namespace or in the `MPI_` namespace?  On one hand, the symbols are not
yet officially standardized -- *they could change* before publication.
On the other hand, developers who participate in the Forum typically
have a good sense for whether symbols are going to change before
@ -89,35 +84,31 @@ before the MPI specification is published. ...and so on.
After much debate: for functionality that has a high degree of
confidence that it will be included in an upcoming spec (e.g., it has
passed at least one vote in the MPI Forum), our conclusion is that it
is OK to use the `MPI_` namespace.
Case in point: Open MPI released non-blocking collectives with the
`MPI_` prefix (not the `MPIX_` prefix) before the MPI-3.0
specification officially standardized these functions.
The rationale was threefold:
1. Let users use the functionality as soon as possible.
1. If OMPI initially creates `MPIX_Foo`, but eventually renames it to
   `MPI_Foo` when the MPI specification is published, then users will
   have to modify their codes to match.  This is an artificial change
   inserted just to be "pure" to the MPI spec (i.e., it's a "lawyer's
   answer").  But since the `MPIX_Foo` -> `MPI_Foo` change is
   inevitable, it just ends up annoying users.
1. Once OMPI introduces `MPIX_` symbols, if we want to *not* annoy
   users, we'll likely have weak symbols / aliased versions of both
   `MPIX_Foo` and `MPI_Foo` once the Foo functionality is included in
   a published MPI specification.  However, when can we delete the
   `MPIX_Foo` symbol?  It becomes a continuing annoyance of backwards
   compatibility that we have to keep carrying forward.
For all these reasons, we believe that it's better to put
expected-upcoming official MPI functionality in the `MPI_` namespace,
not the `MPIX_` namespace.
All that being said, these are rules of thumb. They are not an
official mandate. There may well be cases where there are reasons to

@ -2,7 +2,7 @@
# Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2010-2020 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@ -20,4 +20,4 @@
SUBDIRS = c
EXTRA_DIST = README.txt
EXTRA_DIST = README.md

ompi/mpiext/affinity/README.md (new file)
@ -0,0 +1,30 @@
# Open MPI extension: Affinity
## Copyrights
```
Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
```
## Authors
* Jeff Squyres, 19 April 2010, and 16 April 2012
* Terry Dontje, 18 November 2010
## Description
This extension provides a single new function, `OMPI_Affinity_str()`,
that takes a format value and then provides 3 prettyprint strings as
output:
* `fmt_type`: is an enum that tells `OMPI_Affinity_str()` whether to
use a resource description string or layout string format for
`ompi_bound` and `currently_bound` output strings.
* `ompi_bound`: describes what sockets/cores Open MPI bound this process
to (or indicates that Open MPI did not bind this process).
* `currently_bound`: describes what sockets/cores this process is
currently bound to (or indicates that it is unbound).
* `exists`: describes what processors are available in the current host.
See `OMPI_Affinity_str(3)` for more details.
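Below is a minimal usage sketch based on the description above.  The
exact prototype and the `OMPI_AFFINITY_RSRC_STRING_FMT` /
`OMPI_AFFINITY_STRING_MAX` names are assumed from the
`OMPI_Affinity_str(3)` man page; consult that man page for the
authoritative signature:
```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>    /* declares OMPI_Affinity_str() when this extension is built */

int main(int argc, char **argv)
{
    char ompi_bound[OMPI_AFFINITY_STRING_MAX];
    char currently_bound[OMPI_AFFINITY_STRING_MAX];
    char exists[OMPI_AFFINITY_STRING_MAX];

    MPI_Init(&argc, &argv);

    /* Ask for the resource description string output format */
    OMPI_Affinity_str(OMPI_AFFINITY_RSRC_STRING_FMT,
                      ompi_bound, currently_bound, exists);
    printf("ompi_bound: %s\ncurrently_bound: %s\nexists: %s\n",
           ompi_bound, currently_bound, exists);

    MPI_Finalize();
    return 0;
}
```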

@ -1,29 +0,0 @@
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
$COPYRIGHT$
Jeff Squyres
19 April 2010, and
16 April 2012
Terry Dontje
18 November 2010
This extension provides a single new function, OMPI_Affinity_str(),
that takes a format value and then provides 3 prettyprint strings as
output:
fmt_type: is an enum that tells OMPI_Affinity_str() whether to use a
resource description string or layout string format for ompi_bound and
currently_bound output strings.
ompi_bound: describes what sockets/cores Open MPI bound this process
to (or indicates that Open MPI did not bind this process).
currently_bound: describes what sockets/cores this process is
currently bound to (or indicates that it is unbound).
exists: describes what processors are available in the current host.
See OMPI_Affinity_str(3) for more details.

@ -21,4 +21,4 @@
SUBDIRS = c
EXTRA_DIST = README.txt
EXTRA_DIST = README.md

ompi/mpiext/cuda/README.md (new file)
@ -0,0 +1,11 @@
# Open MPI extension: Cuda
Copyright (c) 2015 NVIDIA, Inc. All rights reserved.
Author: Rolf vandeVaart
This extension provides a macro for a compile-time check of CUDA-aware
support.  It also provides a function for a run-time check of
CUDA-aware support.
See `MPIX_Query_cuda_support(3)` for more details.
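For example, a user application might combine the compile-time macro
and the run-time function roughly as follows (a sketch; see
`MPIX_Query_cuda_support(3)` for the authoritative API):
```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>    /* needed for the MPIX_CUDA_AWARE_SUPPORT macro */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time check: built with CUDA-aware support\n");
#else
    printf("Compile-time check: NOT built with CUDA-aware support\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
    /* The library loaded at run time may differ from the one seen at
       compile time, so check again at run time. */
    printf("Run-time check: %s\n",
           MPIX_Query_cuda_support() ? "CUDA-aware" : "not CUDA-aware");
#endif

    MPI_Finalize();
    return 0;
}
```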

@ -1,11 +0,0 @@
# Copyright (c) 2015 NVIDIA, Inc. All rights reserved.
$COPYRIGHT$
Rolf vandeVaart
This extension provides a macro for compile time check of CUDA aware support.
It also provides a function for runtime check of CUDA aware support.
See MPIX_Query_cuda_support(3) for more details.

@ -1,5 +1,5 @@
#
# Copyright (c) 2012 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2020 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@ -17,4 +17,4 @@
SUBDIRS = c mpif-h use-mpi use-mpi-f08
EXTRA_DIST = README.txt
EXTRA_DIST = README.md

ompi/mpiext/example/README.md (new file)
@ -0,0 +1,148 @@
# Open MPI extension: Example
## Overview
This example MPI extension shows how to make an MPI extension for Open
MPI.
An MPI extension provides new top-level APIs in Open MPI that are
available to user-level applications (vs. adding new code/APIs that is
wholly internal to Open MPI). MPI extensions are generally used to
prototype new MPI APIs, or provide Open MPI-specific APIs to
applications. This example MPI extension provides a new top-level MPI
API named `OMPI_Progress` that is callable in both C and Fortran.
MPI extensions are similar to Open MPI components, but due to
complex ordering requirements for the Fortran-based MPI bindings,
their build order is a little different.
Note that MPI has 4 different sets of bindings (C, Fortran `mpif.h`,
the Fortran `mpi` module, and the Fortran `mpi_f08` module), and Open
MPI extensions allow adding API calls to all 4 of them. Prototypes
for the user-accessible functions/subroutines/constants are included
in the following publicly-available mechanisms:
* C: `mpi-ext.h`
* Fortran mpif.h: `mpif-ext.h`
* Fortran "use mpi": `use mpi_ext`
* Fortran "use mpi_f08": `use mpi_f08_ext`
This example extension defines a new top-level API named
`OMPI_Progress()` in all four binding types, and provides test programs
to call this API in each of the four binding types. Code (and
comments) is worth 1,000 words -- see the code in this example
extension to understand how it works and how the build system builds
and inserts each piece into the publicly-available mechanisms (e.g.,
`mpi-ext.h` and the `mpi_f08_ext` module).
## Comparison to General Open MPI MCA Components
Here's the ways that MPI extensions are similar to Open MPI
components:
1. Extensions have a top-level `configure.m4` with a well-known m4 macro
that is run during Open MPI's configure that determines whether the
component wants to build or not.
Note, however, that unlike components, extensions *must* have a
`configure.m4`. No other method of configuration is supported.
1. Extensions must adhere to normal Automake-based targets. We
strongly suggest that you use `Makefile.am`'s and have the
extension's `configure.m4` `AC_CONFIG_FILE` each `Makefile.am` in
the extension.  Using other build systems may work, but is
untested and unsupported.
1. Extensions create specifically-named libtool convenience archives
(i.e., `*.la` files) that the build system slurps into higher-level
libraries.
Unlike components, however, extensions:
1. Have a bit more rigid directory and file naming scheme.
1. Have up to four different, specifically-named subdirectories (one
for each MPI binding type).
1. Also install some specifically-named header files (for C and the
Fortran `mpif.h` bindings).
Similar to components, an MPI extension's name is determined by its
directory name: `ompi/mpiext/EXTENSION_NAME`
## Extension requirements
### Required: C API
Under this top-level directory, the extension *must* have a directory
named `c` (for the C bindings) that:
1. contains a file named `mpiext_EXTENSION_NAME_c.h`
1. installs `mpiext_EXTENSION_NAME_c.h` to
`$includedir/openmpi/mpiext/EXTENSION_NAME/c`
1. builds a Libtool convenience library named
`libmpiext_EXTENSION_NAME_c.la`
### Optional: `mpif.h` bindings
Optionally, the extension may have a directory named `mpif-h` (for the
Fortran `mpif.h` bindings) that:
1. contains a file named `mpiext_EXTENSION_NAME_mpifh.h`
1. installs `mpiext_EXTENSION_NAME_mpifh.h` to
`$includedir/openmpi/mpiext/EXTENSION_NAME/mpif-h`
1. builds a Libtool convenience library named
`libmpiext_EXTENSION_NAME_mpifh.la`
### Optional: `mpi` module bindings
Optionally, the extension may have a directory named `use-mpi` (for the
Fortran `mpi` module) that:
1. contains a file named `mpiext_EXTENSION_NAME_usempi.h`
***NOTE:*** The MPI extension system does NOT support building an
additional library in the `use-mpi` extension directory. It is
assumed that the `use-mpi` bindings will use the same back-end symbols
as the `mpif.h` bindings, and that the only output product of the
`use-mpi` directory is a file to be included in the `mpi-ext` module
(i.e., strong Fortran prototypes for the functions/global variables in
this extension).
### Optional: `mpi_f08` module bindings
Optionally, the extension may have a directory named `use-mpi-f08` (for
the Fortran `mpi_f08` module) that:
1. contains a file named `mpiext_EXTENSION_NAME_usempif08.h`
1. builds a Libtool convenience library named
`libmpiext_EXTENSION_NAME_usempif08.la`
See the comments in all the header and source files in this tree to
see what each file is for and what should be in each.
## Notes
Note that the build order of MPI extensions is a bit strange.  The
directories in an MPI extension are NOT traversed top-down in
sequential order.  Instead, due to ordering requirements when building
the Fortran module-based interfaces, each subdirectory in an extension
is traversed individually at a different time in the overall Open MPI
build.
As such, `ompi/mpiext/EXTENSION_NAME/Makefile.am` is not traversed
during a normal top-level `make all` target. This `Makefile.am`
exists for two reasons, however:
1. For the convenience of the developer, so that you can issue normal
`make` commands at the top of your extension tree (e.g., `make all`
will still build all bindings in an extension).
1. During a top-level `make dist`, extension directories *are*
traversed top-down in sequence order. Having a top-level
`Makefile.am` in an extension allows `EXTRA_DIST`ing of files, such
as this `README.md` file.
There are reasons for this strange ordering, but suffice it to say
that `make dist` doesn't have the same ordering requirements as
`make all`, and it is therefore able to use a "normal", Automake-usual
top-down sequential directory traversal.
Enjoy!

@ -1,138 +0,0 @@
Copyright (C) 2012 Cisco Systems, Inc. All rights reserved.
$COPYRIGHT$
This example MPI extension shows how to make an MPI extension for Open
MPI.
An MPI extension provides new top-level APIs in Open MPI that are
available to user-level applications (vs. adding new code/APIs that is
wholly internal to Open MPI). MPI extensions are generally used to
prototype new MPI APIs, or provide Open MPI-specific APIs to
applications. This example MPI extension provides a new top-level MPI
API named "OMPI_Progress" that is callable in both C and Fortran.
MPI extensions are similar to Open MPI components, but due to
complex ordering requirements for the Fortran-based MPI bindings,
their build order is a little different.
Note that MPI has 4 different sets of bindings (C, Fortran mpif.h,
Fortran "use mpi", and Fortran "use mpi_f08"), and Open MPI extensions
allow adding API calls to all 4 of them. Prototypes for the
user-accessible functions/subroutines/constants are included in the
following publicly-available mechanisms:
- C: mpi-ext.h
- Fortran mpif.h: mpif-ext.h
- Fortran "use mpi": use mpi_ext
- Fortran "use mpi_f08": use mpi_f08_ext
This example extension defines a new top-level API named
"OMPI_Progress" in all four binding types, and provides test programs
to call this API in each of the four binding types. Code (and
comments) is worth 1,000 words -- see the code in this example
extension to understand how it works and how the build system builds
and inserts each piece into the publicly-available mechansisms (e.g.,
mpi-ext.h and the mpi_f08_ext module).
--------------------------------------------------------------------------------
Here's the ways that MPI extensions are similar to Open MPI
components:
- Extensions have a top-level configure.m4 with a well-known m4 macro
that is run during Open MPI's configure that determines whether the
component wants to build or not.
Note, however, that unlike components, extensions *must* have a
configure.m4. No other method of configuration is supported.
- Extensions must adhere to normal Automake-based targets. We
strongly suggest that you use Makefile.am's and have the extension's
configure.m4 AC_CONFIG_FILE each Makefile.am in the extension.
Using other build systems may work, but are untested and
unsupported.
- Extensions create specifically-named libtool convenience archives
(i.e., *.la files) that the build system slurps into higher-level
libraries.
Unlike components, however, extensions:
- Have a bit more rigid directory and file naming scheme.
- Have up to four different, specifically-named subdirectories (one
for each MPI binding type).
- Also install some specifically-named header files (for C and the
Fortran mpif.h bindings).
Similar to components, an MPI extension's name is determined by its
directory name: ompi/mpiext/<extension name>
Under this top-level directory, the extension *must* have a directory
named "c" (for the C bindings) that:
- contains a file named mpiext_<ext_name>_c.h
- installs mpiext_<ext_name>_c.h to
$includedir/openmpi/mpiext/<ext_name>/c
- builds a Libtool convenience library named libmpiext_<ext_name>_c.la
Optionally, the extension may have a director named "mpif-h" (for the
Fortran mpif.h bindings) that:
- contains a file named mpiext_<ext_name>_mpifh.h
- installs mpiext_<ext_name>_mpih.h to
$includedir/openmpi/mpiext/<ext_name>/mpif-h
- builds a Libtool convenience library named libmpiext_<ext_name>_mpifh.la
Optionally, the extension may have a director named "use-mpi" (for the
Fortran "use mpi" bindings) that:
- contains a file named mpiext_<ext_name>_usempi.h
NOTE: The MPI extension system does NOT support building an additional
library in the use-mpi extension directory. It is assumed that the
use-mpi bindings will use the same back-end symbols as the mpif.h
bindings, and that the only output product of the use-mpi directory is
a file to be included in the mpi-ext module (i.e., strong Fortran
prototypes for the functions/global variables in this extension).
Optionally, the extension may have a director named "use-mpi-f08" (for
the Fortran mpi_f08 bindings) that:
- contains a file named mpiext_<ext_name>_usempif08.h
- builds a Libtool convenience library named
libmpiext_<ext_name>_usempif08.la
See the comments in all the header and source files in this tree to
see what each file is for and what should be in each.
--------------------------------------------------------------------------------
Note that the build order of MPI extensions is a bit strange. The
directories in a MPI extensions are NOT traversed top-down in
sequential order. Instead, due to ordering requirements when building
the Fortran module-based interfaces, each subdirectory in extensions
are traversed individually at different times in the overall Open MPI
build.
As such, ompi/mpiext/<ext_name>/Makefile.am is not traversed during a
normal top-level "make all" target. This Makefile.am exists for two
reasons, however:
1. For the conveneince of the developer, so that you can issue normal
"make" commands at the top of your extension tree (e.g., "make all"
will still build all bindings in an extension).
2. During a top-level "make dist", extension directories *are*
traversed top-down in sequence order. Having a top-level Makefile.am
in an extension allows EXTRA_DISTing of files, such as this README
file.
This are reasons for this strange ordering, but suffice it to say that
"make dist" doesn't have the same ordering requiements as "make all",
and is therefore easier to have a "normal" Automake-usual top-down
sequential directory traversal.
Enjoy!

@ -8,3 +8,5 @@
#
SUBDIRS = c mpif-h use-mpi use-mpi-f08
EXTRA_DIST = README.md

ompi/mpiext/pcollreq/README.md (new file)
@ -0,0 +1,14 @@
# Open MPI extension: pcollreq
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
This extension provides the feature of persistent collective
communication operations and persistent neighborhood collective
communication operations, which is planned to be included in the next
MPI Standard after MPI-3.1 as of Nov. 2018.
See `MPIX_Barrier_init(3)` for more details.
The code will be moved to the `ompi/mpi` directory and the `MPIX_`
prefix will be switched to the `MPI_` prefix once the MPI Standard which
includes this feature is published.
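A minimal usage sketch follows.  It assumes `MPIX_Barrier_init()`
takes the same `(comm, info, request)` arguments as the standardized
`MPI_Barrier_init()`; see `MPIX_Barrier_init(3)` for the authoritative
prototype:
```c
#include <mpi.h>
#include <mpi-ext.h>    /* declares the MPIX_ persistent collective prototypes */

int main(int argc, char **argv)
{
    MPI_Request req;

    MPI_Init(&argc, &argv);

    /* Create the persistent barrier once... */
    MPIX_Barrier_init(MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    /* ...then start and complete it as many times as needed. */
    for (int i = 0; i < 10; ++i) {
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}
```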

@ -1,14 +0,0 @@
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
$COPYRIGHT$
This extension provides the feature of persistent collective communication
operations and persistent neighborhood collective communication operations,
which is planned to be included in the next MPI Standard after MPI-3.1 as
of Nov. 2018.
See MPIX_Barrier_init(3) for more details.
The code will be moved to the ompi/mpi directory and the MPIX_ prefix will
be switch to the MPI_ prefix once the MPI Standard which includes this
feature is published.

@ -8,3 +8,5 @@
#
SUBDIRS = c mpif-h use-mpi use-mpi-f08
EXTRA_DIST = README.md

ompi/mpiext/shortfloat/README.md (new file)
@ -0,0 +1,35 @@
# Open MPI extension: shortfloat
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
This extension provides additional MPI datatypes `MPIX_SHORT_FLOAT`,
`MPIX_C_SHORT_FLOAT_COMPLEX`, and `MPIX_CXX_SHORT_FLOAT_COMPLEX`,
which were proposed (with the `MPI_` prefix) in June 2017 for
inclusion in the MPI 4.0 standard.  As of February 2019, they have not
been accepted yet.  See
https://github.com/mpi-forum/mpi-issues/issues/65 for more details.
Each MPI datatype corresponds to the C/C++ type `short float`, the C
type `short float _Complex`, and the C++ type `std::complex<short
float>`, respectively.
In addition, this extension provides a datatype `MPIX_C_FLOAT16` for
the C type `_Float16`, which is defined in ISO/IEC JTC 1/SC 22/WG 14
N1945 (ISO/IEC TS 18661-3:2015).  This name and meaning are the same
as those of MPICH.  See https://github.com/pmodels/mpich/pull/3455.
This extension is enabled only if the C compiler supports `short float`
or `_Float16`, or the `--enable-alt-short-float=TYPE` option is passed
to the Open MPI `configure` script.
NOTE: The Clang 6.0.x and 7.0.x compilers support the `_Float16` type
(via software emulation), but require an additional linker flag to
function properly. If you wish to enable Clang 6.0.x or 7.0.x's
software emulation of `_Float16`, use the following CLI options to Open
MPI configure script:
```
./configure \
LDFLAGS=--rtlib=compiler-rt \
--with-wrapper-ldflags=--rtlib=compiler-rt ...
```
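As a usage sketch (assuming a C compiler that supports `_Float16` and
an Open MPI built with this extension; the `MPIX_C_FLOAT16` name comes
from the description above):
```c
#include <mpi.h>
#include <mpi-ext.h>    /* declares the MPIX_ short-float datatypes */

/* Assumes the C compiler supports _Float16 (see the note above). */
int main(int argc, char **argv)
{
    _Float16 buf[4] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        MPI_Send(buf, 4, MPIX_C_FLOAT16, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        MPI_Recv(buf, 4, MPIX_C_FLOAT16, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```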

@ -1,35 +0,0 @@
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
$COPYRIGHT$
This extension provides additional MPI datatypes MPIX_SHORT_FLOAT,
MPIX_C_SHORT_FLOAT_COMPLEX, and MPIX_CXX_SHORT_FLOAT_COMPLEX, which
are proposed with the MPI_ prefix in June 2017 for proposal in the
MPI 4.0 standard. As of February 2019, it is not accepted yet.
https://github.com/mpi-forum/mpi-issues/issues/65
Each MPI datatype corresponds to the C/C++ type 'short float', the C type
'short float _Complex', and the C++ type 'std::complex<short float>',
respectively.
In addition, this extension provides a datatype MPIX_C_FLOAT16 for
the C type _Float16, which is defined in ISO/IEC JTC 1/SC 22/WG 14
N1945 (ISO/IEC TS 18661-3:2015). This name and meaning are same as
that of MPICH.
https://github.com/pmodels/mpich/pull/3455
This extension is enabled only if the C compiler supports 'short float'
or '_Float16', or the '--enable-alt-short-float=TYPE' option is passed
to the configure script.
NOTE: The Clang 6.0.x and 7.0.x compilers support the "_Float16" type
(via software emulation), but require an additional linker flag to
function properly. If you wish to enable Clang 6.0.x or 7.0.x's
software emulation of _Float16, use the following CLI options to Open
MPI configure script:
./configure \
LDFLAGS=--rtlib=compiler-rt \
--with-wrapper-ldflags=--rtlib=compiler-rt ...

@ -1,110 +0,0 @@
========================================
Design notes on BTL/OFI
========================================
This is the RDMA only btl based on OFI Libfabric. The goal is to enable RDMA
with multiple vendor hardware through one interface. Most of the operations are
managed by upper layer (osc/rdma). This BTL is mostly doing the low level work.
Tested providers: sockets,psm2,ugni
========================================
Component
This BTL is requesting libfabric version 1.5 API and will not support older versions.
The required capabilities of this BTL is FI_ATOMIC and FI_RMA with the endpoint type
of FI_EP_RDM only. This BTL does NOT support libfabric provider that requires local
memory registration (FI_MR_LOCAL).
BTL/OFI will initialize a module with ONLY the first compatible info returned from OFI.
This means it will rely on OFI provider to do load balancing. The support for multiple
device might be added later.
The BTL creates only one endpoint and one CQ.
========================================
Memory Registration
Open MPI has a system in place to exchange remote address and always use the remote
virtual address to refer to a piece of memory. However, some libfabric providers might
not support the use of virtual address and instead will use zero-based offset addressing.
FI_MR_VIRT_ADDR is the flag that determine this behavior. mca_btl_ofi_reg_mem() handles
this by storing the base address in registration handle in case of the provider does not
support FI_MR_VIRT_ADDR. This base address will be used to calculate the offset later in
RDMA/Atomic operations.
The BTL will try to use the address of registration handle as the key. However, if the
provider supports FI_MR_PROV_KEY, it will use provider provided key. Simply does not care.
The BTL does not register local operand or compare. This is why this BTL does not support
FI_MR_LOCAL and will allocate every buffer before registering. This means FI_MR_ALLOCATED
is supported. So to be explicit.
Supported MR mode bits (will work with or without):
enum:
- FI_MR_BASIC
- FI_MR_SCALABLE
mode bits:
- FI_MR_VIRT_ADDR
- FI_MR_ALLOCATED
- FI_MR_PROV_KEY
The BTL does NOT support (will not work with):
- FI_MR_LOCAL
- FI_MR_MMU_NOTIFY
- FI_MR_RMA_EVENT
- FI_MR_ENDPOINT
Just a reminder, in libfabric API 1.5...
FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)
========================================
Completions
Every operation in this BTL is asynchronous. The completion handling will occur in
mca_btl_ofi_component_progress() where we read the CQ with the completion context and
execute the callback functions. The completions are local. No remote completion event is
generated as local completion already guarantee global completion.
The BTL keep tracks of number of outstanding operations and provide flush interface.
========================================
Sockets Provider
Sockets provider is the proof of concept provider for libfabric. It is supposed to support
all the OFI API with emulations. This provider is considered very slow and bound to raise
problems that we might not see from other faster providers.
Known Problems:
- sockets provider uses progress thread and can cause segfault in finalize as we free
the resources while progress thread is still using it. sleep(1) was put in
mca_btl_ofi_componenet_close() for this reason.
- sockets provider deadlock in two-sided mode. Might be something about buffered recv.
(August 2018).
========================================
Scalable Endpoint
This BTL will try to use scalable endpoint to create communication context. This will increase
multithreaded performance for some application. The default number of context created is 1 and
can be tuned VIA MCA parameter "btl_ofi_num_contexts_per_module". It is advised that the number
of context should be equal to number of physical core for optimal performance.
User can disable scalable endpoint by MCA parameter "btl_ofi_disable_sep".
With scalable endpoint disbled, the BTL will alias OFI endpoint to both tx and rx context.
========================================
Two sided communication
Two sided communication is added later on to BTL OFI to enable non tag-matching provider
to be able to use in Open MPI with this BTL. However, the support is only for "functional"
and has not been optimized for performance at this point. (August 2018)

opal/mca/btl/ofi/README.md (new file)
@ -0,0 +1,113 @@
# Design notes on BTL/OFI
This is an RDMA-only BTL based on OFI Libfabric.  The goal is to
enable RDMA with multiple vendors' hardware through one interface.
Most of the operations are managed by the upper layer (osc/rdma); this
BTL mostly does the low-level work.
Tested providers: sockets, psm2, ugni
## Component
This BTL requires the libfabric 1.5 API and does not support older
versions.
The required capabilities of this BTL are `FI_ATOMIC` and `FI_RMA`
with the endpoint type `FI_EP_RDM` only.  This BTL does NOT support
libfabric providers that require local memory registration
(`FI_MR_LOCAL`).
BTL/OFI will initialize a module with ONLY the first compatible info
returned from OFI.  This means it relies on the OFI provider to do
load balancing.  Support for multiple devices might be added later.
The BTL creates only one endpoint and one CQ.
## Memory Registration
Open MPI has a system in place to exchange remote addresses, and it
always uses the remote virtual address to refer to a piece of memory.
However, some libfabric providers might not support the use of virtual
addresses and instead use zero-based offset addressing.
`FI_MR_VIRT_ADDR` is the flag that determines this behavior.
`mca_btl_ofi_reg_mem()` handles this by storing the base address in
the registration handle in case the provider does not support
`FI_MR_VIRT_ADDR`; this base address is used to calculate the offset
later in RDMA/Atomic operations.
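A rough sketch of that address calculation is shown below; the
structure and variable names are illustrative only, not the actual
`mca_btl_ofi` data structures:
```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the remote registration info exchanged by
 * the BTL. */
struct remote_handle {
    uint64_t base;    /* remote base address saved at registration time */
    uint64_t rkey;    /* remote key (provider key if FI_MR_PROV_KEY)    */
};

/* Compute the remote address argument for an RDMA/atomic operation:
 * providers with FI_MR_VIRT_ADDR take the remote virtual address,
 * otherwise they take a zero-based offset into the registered region. */
static uint64_t remote_rdma_addr(const struct remote_handle *h,
                                 uint64_t remote_vaddr,
                                 bool provider_has_virt_addr)
{
    return provider_has_virt_addr ? remote_vaddr : remote_vaddr - h->base;
}
```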
The BTL will try to use the address of the registration handle as the
key.  However, if the provider supports `FI_MR_PROV_KEY`, it will use
the provider-supplied key instead; the BTL does not care which.
The BTL does not register the local operand or compare buffer.  This
is why this BTL does not support `FI_MR_LOCAL` and allocates every
buffer before registering; this means `FI_MR_ALLOCATED` is supported.
To be explicit:
Supported MR mode bits (will work with or without):
* enum:
* `FI_MR_BASIC`
* `FI_MR_SCALABLE`
* mode bits:
* `FI_MR_VIRT_ADDR`
* `FI_MR_ALLOCATED`
* `FI_MR_PROV_KEY`
The BTL does NOT support (will not work with):
* `FI_MR_LOCAL`
* `FI_MR_MMU_NOTIFY`
* `FI_MR_RMA_EVENT`
* `FI_MR_ENDPOINT`
Just a reminder, in libfabric API 1.5...
`FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)`
## Completions
Every operation in this BTL is asynchronous.  The completion handling
occurs in `mca_btl_ofi_component_progress()`, where we read the CQ
with the completion context and execute the callback functions.  The
completions are local; no remote completion event is generated, as
local completion already guarantees global completion.
The BTL keeps track of the number of outstanding operations and
provides a flush interface.
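A simplified sketch of that progress path is shown below.  Only
`fi_cq_read()` and `struct fi_cq_entry` are real libfabric API; the
completion type here is an illustrative stand-in for the BTL's own
context structure:
```c
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Illustrative completion context; the real BTL stores its own type
 * whose callback finishes the operation and may call back into the
 * upper layer. */
struct completion_ctx {
    void (*callback)(struct completion_ctx *ctx);
};

/* Drain the CQ and run the callback attached to each completion. */
static int progress_cq(struct fid_cq *cq)
{
    struct fi_cq_entry entries[16];
    ssize_t n = fi_cq_read(cq, entries, 16);

    if (n > 0) {
        for (ssize_t i = 0; i < n; ++i) {
            struct completion_ctx *ctx = entries[i].op_context;
            ctx->callback(ctx);
        }
        return (int) n;
    }
    return 0;   /* -FI_EAGAIN (empty CQ) and errors are ignored in this sketch */
}
```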
## Sockets Provider
The sockets provider is the proof-of-concept provider for libfabric.
It is supposed to support the entire OFI API via emulation.  This
provider is considered very slow and is bound to raise problems that
we might not see with other, faster providers.
Known problems:
* The sockets provider uses a progress thread and can cause a segfault
  in finalize as we free resources while the progress thread is still
  using them.  A `sleep(1)` was put in `mca_btl_ofi_component_close()`
  for this reason.
* The sockets provider deadlocks in two-sided mode; it might be
  something about buffered recv (August 2018).
## Scalable Endpoint
This BTL will try to use a scalable endpoint to create communication
contexts.  This will increase multithreaded performance for some
applications.  The default number of contexts created is 1 and can be
tuned via the MCA parameter `btl_ofi_num_contexts_per_module`.  For
optimal performance, it is advised that the number of contexts be
equal to the number of physical cores.
The user can disable the scalable endpoint with the MCA parameter
`btl_ofi_disable_sep`.  With the scalable endpoint disabled, the BTL
will alias the OFI endpoint to both the tx and rx contexts.
## Two sided communication
Two-sided communication was added to BTL/OFI later on, to enable
non-tag-matching providers to be used in Open MPI with this BTL.
However, the support is only "functional" and has not been optimized
for performance at this point (August 2018).

@ -1,113 +0,0 @@
Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
August 21, 2013
SMCUDA DESIGN DOCUMENT
This document describes the design and use of the smcuda BTL.
BACKGROUND
The smcuda btl is a copy of the sm btl but with some additional features.
The main extra feature is the ability to make use of the CUDA IPC APIs to
quickly move GPU buffers from one GPU to another. Without this support,
the GPU buffers would all be moved into and then out of host memory.
GENERAL DESIGN
The general design makes use of the large message RDMA RGET support in the
OB1 PML. However, there are some interesting choices to make use of it.
First, we disable any large message RDMA support in the BTL for host
messages. This is done because we need to use the mca_btl_smcuda_get() for
the GPU buffers. This is also done because the upper layers expect there
to be a single mpool but we need one for the GPU memory and one for the
host memory. Since the advantages of using RDMA with host memory is
unclear, we disabled it. This means no KNEM or CMA support built in to the
smcuda BTL.
Also note that we give the smcuda BTL a higher rank than the sm BTL. This
means it will always be selected even if we are doing host only data
transfers. The smcuda BTL is not built if it is not requested via the
--with-cuda flag to the configure line.
Secondly, the smcuda does not make use of the traditional method of
enabling RDMA operations. The traditional method checks for the existence
of an RDMA btl hanging off the endpoint. The smcuda works in conjunction
with the OB1 PML and uses flags that it sends in the BML layer.
OTHER CONSIDERATIONS
CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected
over the IOH. In addition, we want to check for CUDA IPC support lazily,
when the first GPU access occurs, rather than during MPI_Init() time.
This complicates the design.
INITIALIZATION
When the smcuda BTL initializes, it starts with no support for CUDA IPC.
Upon the first access of a GPU buffer, the smcuda checks which GPU device
it has and sends that to the remote side using a smcuda specific control
message. The other rank receives the message, and checks to see if there
is CUDA IPC support between the two GPUs via a call to
cuDeviceCanAccessPeer(). If it is true, then the smcuda BTL piggy backs on
the PML error handler callback to make a call into the PML and let it know
to enable CUDA IPC. We created a new flag so that the error handler does
the right thing. Large message RDMA is enabled by setting a flag in the
bml->btl_flags field. Control returns to the smcuda BTL where a reply
message is sent so the sending side can set its flag.
At that point, the PML layer starts using the large message RDMA support
in the smcuda BTL. This is done in some special CUDA code in the PML layer.
ESTABLISHING CUDA IPC SUPPORT
A check has been added into both the send and sendi path in the smcuda btl
that checks to see if it should send a request for CUDA IPC setup message.
/* Initiate setting up CUDA IPC support. */
if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
}
The first check is to see if the CUDA environment has been initialized. If
not, then presumably we are not sending any GPU buffers yet and there is
nothing to be done. If we are initialized, then check the status of the
CUDA IPC endpoint. If it is in the IPC_INIT stage, then call the function
to send of a control message to the endpoint.
On the receiving side, we first check to see if we are initialized. If
not, then send a message back to the sender saying we are not initialized.
This will cause the sender to reset its state to IPC_INIT so it can try
again on the next send.
I considered putting the receiving side into a new state like IPC_NOTREADY,
and then when it switches to ready, to then sending the ACK to the sender.
The problem with this is that we would need to do these checks during the
progress loop which adds some extra overhead as we would have to check all
endpoints to see if they were ready.
Note that any rank can initiate the setup of CUDA IPC. It is triggered by
whichever side does a send or sendi call of a GPU buffer.
I have the sender attempt 5 times to set up the connection. After that, we
give up. Note that I do not expect many scenarios where the sender has to
resend. It could happen in a race condition where one rank has initialized
its CUDA environment but the other side has not.
There are several states the connections can go through.
IPC_INIT - nothing has happened
IPC_SENT - message has been sent to other side
IPC_ACKING - Received request and figuring out what to send back
IPC_ACKED - IPC ACK sent
IPC_OK - IPC ACK received back
IPC_BAD - Something went wrong, so marking as no IPC support
NOTE ABOUT CUDA IPC AND MEMORY POOLS
The CUDA IPC support works in the following way. A sender makes a call to
cuIpcGetMemHandle() and gets a memory handle for its local memory. The
sender then sends that handle to receiving side. The receiver calls
cuIpcOpenMemHandle() using that handle and gets back an address to the
remote memory. The receiver then calls cuMemcpyAsync() to initiate a
remote read of the GPU data.
The receiver maintains a cache of remote memory that it has handles open on.
This is because a call to cuIpcOpenMemHandle() can be very expensive (90usec) so
we want to avoid it when we can. The cache of remote memory is kept in a memory
pool that is associated with each endpoint. Note that we do not cache the local
memory handles because getting them is very cheap and there is no need.

opal/mca/btl/smcuda/README.md (new file)
@ -0,0 +1,126 @@
# Open MPI SMCUDA design document
Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
August 21, 2013
This document describes the design and use of the `smcuda` BTL.
## BACKGROUND
The `smcuda` btl is a copy of the `sm` btl but with some additional
features. The main extra feature is the ability to make use of the
CUDA IPC APIs to quickly move GPU buffers from one GPU to another.
Without this support, the GPU buffers would all be moved into and then
out of host memory.
## GENERAL DESIGN
The general design makes use of the large message RDMA RGET support in
the OB1 PML. However, there are some interesting choices to make use
of it. First, we disable any large message RDMA support in the BTL
for host messages. This is done because we need to use the
`mca_btl_smcuda_get()` for the GPU buffers. This is also done because
the upper layers expect there to be a single mpool but we need one for
the GPU memory and one for the host memory. Since the advantages of
using RDMA with host memory are unclear, we disabled it.  This means no
KNEM or CMA support built in to the `smcuda` BTL.
Also note that we give the `smcuda` BTL a higher rank than the `sm`
BTL. This means it will always be selected even if we are doing host
only data transfers. The `smcuda` BTL is not built if it is not
requested via the `--with-cuda` flag to the configure line.
Secondly, the `smcuda` does not make use of the traditional method of
enabling RDMA operations. The traditional method checks for the existence
of an RDMA btl hanging off the endpoint. The `smcuda` works in conjunction
with the OB1 PML and uses flags that it sends in the BML layer.
## OTHER CONSIDERATIONS
CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected
over the IOH. In addition, we want to check for CUDA IPC support lazily,
when the first GPU access occurs, rather than during `MPI_Init()` time.
This complicates the design.
## INITIALIZATION
When the `smcuda` BTL initializes, it starts with no support for CUDA IPC.
Upon the first access of a GPU buffer, the `smcuda` checks which GPU device
it has and sends that to the remote side using a `smcuda` specific control
message. The other rank receives the message, and checks to see if there
is CUDA IPC support between the two GPUs via a call to
`cuDeviceCanAccessPeer()`.  If it is true, then the `smcuda` BTL piggybacks on
the PML error handler callback to make a call into the PML and let it know
to enable CUDA IPC. We created a new flag so that the error handler does
the right thing. Large message RDMA is enabled by setting a flag in the
`bml->btl_flags` field. Control returns to the `smcuda` BTL where a reply
message is sent so the sending side can set its flag.
At that point, the PML layer starts using the large message RDMA
support in the `smcuda` BTL. This is done in some special CUDA code
in the PML layer.
## ESTABLISHING CUDA IPC SUPPORT
A check has been added into both the `send` and `sendi` path in the
`smcuda` btl that checks to see if it should send a request for CUDA
IPC setup message.
```c
/* Initiate setting up CUDA IPC support. */
if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
}
```
The first check is to see if the CUDA environment has been
initialized. If not, then presumably we are not sending any GPU
buffers yet and there is nothing to be done. If we are initialized,
then check the status of the CUDA IPC endpoint. If it is in the
IPC_INIT stage, then call the function to send a control message to
the endpoint.
On the receiving side, we first check to see if we are initialized.
If not, then send a message back to the sender saying we are not
initialized. This will cause the sender to reset its state to
IPC_INIT so it can try again on the next send.
I considered putting the receiving side into a new state like
IPC_NOTREADY, and then when it switches to ready, to then sending the
ACK to the sender. The problem with this is that we would need to do
these checks during the progress loop which adds some extra overhead
as we would have to check all endpoints to see if they were ready.
Note that any rank can initiate the setup of CUDA IPC. It is
triggered by whichever side does a send or sendi call of a GPU buffer.
I have the sender attempt 5 times to set up the connection. After
that, we give up. Note that I do not expect many scenarios where the
sender has to resend. It could happen in a race condition where one
rank has initialized its CUDA environment but the other side has not.
There are several states the connections can go through.
1. IPC_INIT - nothing has happened
1. IPC_SENT - message has been sent to other side
1. IPC_ACKING - Received request and figuring out what to send back
1. IPC_ACKED - IPC ACK sent
1. IPC_OK - IPC ACK received back
1. IPC_BAD - Something went wrong, so marking as no IPC support
## NOTE ABOUT CUDA IPC AND MEMORY POOLS
The CUDA IPC support works in the following way. A sender makes a
call to `cuIpcGetMemHandle()` and gets a memory handle for its local
memory.  The sender then sends that handle to the receiving side.  The
receiver calls `cuIpcOpenMemHandle()` using that handle and gets back
an address to the remote memory. The receiver then calls
`cuMemcpyAsync()` to initiate a remote read of the GPU data.
The receiver maintains a cache of remote memory that it has handles
open on. This is because a call to `cuIpcOpenMemHandle()` can be very
expensive (90usec) so we want to avoid it when we can. The cache of
remote memory is kept in a memory pool that is associated with each
endpoint. Note that we do not cache the local memory handles because
getting them is very cheap and there is no need.
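The flow described above boils down to the following sketch (real CUDA
driver API calls, but simplified: error handling, the control-message
exchange, and the per-endpoint handle cache are omitted):
```c
#include <cuda.h>

/* Sender side: produce an IPC handle for a device buffer so it can be
 * shipped to the peer in a control message. */
static CUipcMemHandle export_gpu_buffer(CUdeviceptr dbuf)
{
    CUipcMemHandle handle;
    cuIpcGetMemHandle(&handle, dbuf);            /* error handling omitted */
    return handle;
}

/* Receiver side: map the peer's buffer and read it directly into a
 * local device buffer. */
static void read_peer_buffer(CUipcMemHandle handle, CUdeviceptr local_dst,
                             size_t bytes, CUstream stream)
{
    CUdeviceptr remote_src;

    cuIpcOpenMemHandle(&remote_src, handle,
                       CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
    cuMemcpyAsync(local_dst, remote_src, bytes, stream);
    /* The smcuda BTL caches the opened handle per endpoint rather than
     * calling cuIpcCloseMemHandle() immediately, because opening a
     * handle is expensive. */
}
```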

@ -27,7 +27,7 @@
AM_CPPFLAGS = $(opal_ofi_CPPFLAGS) -DOMPI_LIBMPI_NAME=\"$(OMPI_LIBMPI_NAME)\"
EXTRA_DIST = README.txt README.test
EXTRA_DIST = README.md README.test
dist_opaldata_DATA = \
help-mpi-btl-usnic.txt

opal/mca/btl/usnic/README.md (new file)
@ -0,0 +1,330 @@
# Design notes on usnic BTL
## nomenclature
* fragment - something the PML asks us to send or put, any size
* segment - something we can put on the wire in a single packet
* chunk - a piece of a fragment that fits into one segment
a segment can contain either an entire fragment or a chunk of a fragment
each segment and fragment has an associated descriptor.
Each segment data structure has a block of registered memory associated with
it which matches MTU for that segment
* ACK - acks get special small segments with only enough memory for an ACK
* non-ACK segments always have a parent fragment
* fragments are either large (> MTU) or small (<= MTU)
* a small fragment has a segment descriptor embedded within it since it
always needs exactly one.
* a large fragment has no permanently associated segments, but allocates them
as needed.
## channels
A channel is a queue pair with an associated completion queue
each channel has its own MTU and r/w queue entry counts
There are 2 channels, command and data:
* command queue is generally for higher priority fragments
* data queue is for standard data traffic
* command queue should possibly be called "priority" queue
The command queue is shorter and has a smaller MTU than the data
queue.  This makes the command queue a lot faster than the data queue,
so we hijack it for sending very small fragments (<= tiny_mtu,
currently 768 bytes).
The command queue is used for ACKs and tiny fragments; the data queue
is used for everything else.
PML fragments marked priority should perhaps use command queue
## sending
Normally, all send requests are simply enqueued and then actually posted
to the NIC by the routine `opal_btl_usnic_module_progress_sends()`.
"fastpath" tiny sends are the exception.
Each module maintains a queue of endpoints that are ready to send.
An endpoint is ready to send if all of the following are met:
1. the endpoint has fragments to send
1. the endpoint has send credits
1. the endpoint's send window is "open" (not full of un-ACKed segments)
Each module also maintains a list of segments that need to be retransmitted.
Note that the list of pending retrans is per-module, not per-endpoint.
Send progression first posts any pending retransmissions, always using
the data channel. (reason is that if we start getting heavy
congestion and there are lots of retransmits, it becomes more
important than ever to prioritize ACKs, clogging command channel with
retrans data makes things worse, not better)
Next, progression loops sending segments to the endpoint at the top of
the `endpoints_with_sends` queue. When an endpoint exhausts its send
credits or fills its send window or runs out of segments to send, it
removes itself from the `endpoint_with_sends` list. Any pending ACKs
will be picked up and piggy-backed on these sends.
Finally, any endpoints that still need ACKs whose timer has expired will
be sent explicit ACK packets.
## fragment sending
The middle part of the progression loop handles both small
(single-segment) and large (multi-segment) sends.
For small fragments, the verbs descriptor within the embedded segment
is updated with length, BTL header is updated, then we call
`opal_btl_usnic_endpoint_send_segment()` to send the segment. After
posting, we make a PML callback if needed.
For large fragments, a little more is needed.  Segments from a large
fragment have a slightly larger BTL header which contains a fragment
ID, an offset, and a size.  The fragment ID is allocated when the
first chunk of the fragment is sent.  A segment gets allocated, the
next blob of data is copied into this segment, and the segment is
posted.  If the last chunk of the fragment has been sent, perform the
callback if needed, then remove the fragment from the endpoint send
queue.
## `opal_btl_usnic_endpoint_send_segment()`
This is common posting code for large or small segments. It assigns a
sequence number to a segment, checks for an ACK to piggy-back,
posts the segment to the NIC, and then starts the retransmit timer
by checking the segment into hotel. Send credits are consumed here.
## send dataflow
PML control messages with no user data are sent via:
* `desc = usnic_alloc(size)`
* `usnic_send(desc)`
user messages less than eager limit and 1st part of larger
messages are sent via:
* `desc = usnic_prepare_src(convertor, size)`
* `usnic_send(desc)`
larger msgs:
* `desc = usnic_prepare_src(convertor, size)`
* `usnic_put(desc)`
`usnic_alloc()` currently asserts the length is "small", allocates and
fills in a small fragment. src pointer will point to start of
associated registered mem + sizeof BTL header, and PML will put its
data there.
`usnic_prepare_src()` allocates either a large or small fragment based
on size.  The fragment descriptor is filled in to have 2 SG entries,
the 1st pointing to the place where the PML should construct its
header.  If the data convertor says the data is contiguous, the 2nd SG
entry points to the user buffer; else it is null and sf_convertor is
filled in with the address of the convertor.
### `usnic_send()`
If the fragment being sent is small enough, has contiguous data, and
"very few" command queue send WQEs have been consumed, `usnic_send()`
does a fastpath send. This means it posts the segment immediately to
the NIC with INLINE flag set.
If all of the conditions for fastpath send are not met, and this is a
small fragment, the user data is copied into the associated registered
memory at this time and the SG list in the descriptor is collapsed to
one entry.
After the checks above are done, the fragment is enqueued to be sent
via `opal_btl_usnic_endpoint_enqueue_frag()`
### `usnic_put()`
Do a fast version of what happens in `prepare_src()` (can take shortcuts
because we know it will always be a contiguous buffer / no convertor
needed). PML gives us the destination address, which we save on the
fragment (which is the sentinel value that the underlying engine uses
to know that this is a PUT and not a SEND), and the fragment is
enqueued for processing.
### `opal_btl_usnic_endpoint_enqueue_frag()`
This appends the fragment to the "to be sent" list of the endpoint and
conditionally adds the endpoint to the list of endpoints with data to
send via `opal_btl_usnic_check_rts()`
## receive dataflow
BTL packets has one of 3 types in header: frag, chunk, or ack.
* A frag packet is a full PML fragment.
* A chunk packet is a piece of a fragment that needs to be reassembled.
* An ack packet is header only with a sequence number being ACKed.
* Both frag and chunk packets go through some of the same processing.
* Both may carry piggy-backed ACKs which may need to be processed.
* Both have sequence numbers which must be processed and may result in
dropping the packet and/or queueing an ACK to the sender.
frag packets may be either regular PML fragments or PUT segments. If
the "put_addr" field of the BTL header is set, this is a PUT and the
data is copied directly to the user buffer. If this field is NULL,
the segment is passed up to the PML. The PML is expected to do
everything it needs with this packet in the callback, including
copying data out if needed. Once the callback is complete, the
receive buffer is recycled.
chunk packets are parts of a larger fragment. If an active fragment
receive for the matching fragment ID cannot be found, a new fragment
info descriptor is allocated. If this is not a PUT (`put_addr == NULL`),
we `malloc()` data to reassemble the fragment into. Each
subsequent chunk is copied either into this reassembly buffer or
directly into user memory. When the last chunk of a fragment arrives,
a PML callback is made for non-PUTs, then the fragment info descriptor
is released.
## fast receive optimization
In order to optimize latency of small packets, the component progress
routine implements a fast path for receives. If the first completion
is a receive on the priority queue, then it is handled by a routine
called `opal_btl_usnic_recv_fast()`, which does nothing but validate
that the packet is OK to be received (sequence number OK and not a
DUP) and then deliver it to the PML. The packet is recorded in the
channel structure, and all bookkeeping for the packet is deferred until
the next time `component_progress` is called.
This fast path cannot be taken every time we pass through
`component_progress` because there will be other completions that need
processing, and the receive bookkeeping for one fast receive must be
complete before allowing another fast receive to occur, as only one
recv segment can be saved for deferred processing at a time. This is
handled by maintaining a variable in `opal_btl_usnic_recv_fast()`
called `fastpath_ok`, which is set to false every time the fastpath is
taken. A call into the regular progress routine sets this flag back to
true.
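The gating can be pictured roughly like this (a sketch with stub
helpers; the real flag lives in the module/channel bookkeeping rather
than at file scope):
```c
/* Hypothetical sketch of the receive fastpath gating; stubs stand in
 * for the real completion handling. */
#include <stdbool.h>

static bool first_completion_is_priority_recv(void) { return true; } /* stub */
static int  sketch_recv_fast(void)     { return 1; } /* seq check + PML delivery */
static int  sketch_progress_slow(void) { return 0; } /* full completion handling */

static bool fastpath_ok = true;  /* only one recv may be deferred at a time */

static int sketch_component_progress(void)
{
    if (fastpath_ok && first_completion_is_priority_recv()) {
        fastpath_ok = false;     /* its bookkeeping is deferred...            */
        return sketch_recv_fast();
    }
    fastpath_ok = true;          /* ...and finished on the next, regular pass */
    return sketch_progress_slow();
}
```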
## reliability:
* every packet has a sequence number
* each endpoint has a "send window", currently 4096 entries
* once a segment is sent, it is saved in the window array until an ACK
  is received
* ACKs acknowledge all packets <= the specified sequence number
* the receiver only ACKs a sequence number when all packets up to that
  sequence have arrived
* each packet has a default retransmit timer of 100ms
* a packet is scheduled for retransmission if its timer expires
Once a segment is sent, it always has its retransmit timer started.
This is accomplished by `opal_hotel_checkin()`.
Any time a segment is posted to the NIC for retransmit, it is checked out
of the hotel (timer stopped).
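In code, that bookkeeping looks roughly like the following;
`opal_hotel_checkin()`/`opal_hotel_checkout()` are the OPAL hotel API
(`opal/class/opal_hotel.h`), while the segment type and wrapper
functions here are simplified placeholders:
```c
/* Sketch of retransmit-timer bookkeeping via the OPAL hotel class.  The
 * opal_hotel_* calls are real OPAL API; everything else is simplified. */
#include "opal/class/opal_hotel.h"

struct sketch_segment {
    int hotel_room;   /* room number while awaiting an ACK */
};

/* after (re)posting a segment to the NIC: start its retransmit timer */
static int sketch_start_retrans_timer(opal_hotel_t *hotel,
                                      struct sketch_segment *seg)
{
    return opal_hotel_checkin(hotel, seg, &seg->hotel_room);
}

/* before reposting for retransmit (or on ACK): stop the timer */
static void sketch_stop_retrans_timer(opal_hotel_t *hotel,
                                      struct sketch_segment *seg)
{
    opal_hotel_checkout(hotel, seg->hotel_room);
}
```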
So, a send segment is always in one of 4 states (sketched as an enum
after this list):
* on free list, unallocated
* on endpoint to-send list in the case of segment associated with small fragment
* posted to NIC and in hotel awaiting ACK
* on module re-send list awaiting retransmission
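A hypothetical enum naming those states (the real code does not store
such a field; the state is implicit in which list the segment is on):
```c
/* Hypothetical only: the BTL tracks this implicitly via list membership. */
enum sketch_send_segment_state {
    SKETCH_SEG_FREE,          /* on free list, unallocated                   */
    SKETCH_SEG_TO_SEND,       /* on endpoint to-send list (small fragment)   */
    SKETCH_SEG_AWAITING_ACK,  /* posted to NIC, checked into the hotel       */
    SKETCH_SEG_RETRANS        /* on module re-send list, awaiting retransmit */
};
```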
receiver:
* if a packet with seq >= the expected seq is received, schedule an ACK
  of the largest in-order sequence received, if not already scheduled;
  the default delay is 50us
* if a packet with seq < the expected seq arrives, send an ACK
  immediately, as this indicates a lost ACK
sender:
* a duplicate ACK triggers an immediate retransmit if one is not
  already pending for that segment
## Reordering induced by two queues and piggy-backing:
ACKs can be reordered:
* not an issue at all; old ACKs are simply ignored
Sends can be reordered:
* a small send can jump far ahead of large sends
* a large send followed by lots of small sends could trigger many
  retransmissions of the large send. The small sends would have to be
  paced pretty precisely to keep the command queue empty enough and
  also beat out the large sends. Send credits limit how many large
  sends can be queued on the sender, but there could be many on the
  receiver.
## RDMA emulation
We emulate the RDMA PUT because it's more efficient than regular send:
it allows the receive side to copy directly to the target buffer
(vs. making an intermediate copy out of the bounce buffer).
It would actually be better to morph this PUT into a GET -- GET would
be slightly more efficient. In short, when the target requests the
actual RDMA data, with PUT, the request has to go up to the PML, which
will then invoke PUT on the source's BTL module. With GET, the target
issues the GET, and the source BTL module can reply without needing to
go up the stack to the PML.
Once we start supporting RDMA in hardware:
* we need to provide `module.btl_register_mem` and
`module.btl_deregister_mem` functions (see openib for an example)
* we need to put something meaningful in
`btl_usnic_frag.h:mca_btl_base_registration_handle_t`.
* we need to set `module.btl_registration_handle_size` to
  `sizeof(struct mca_btl_base_registration_handle_t)`.
* `module.btl_put` / `module.btl_get` will receive the
`mca_btl_base_registration_handle_t` from the peer as a cookie.
Also, `module.btl_put` / `module.btl_get` do not need to make
descriptors (this was an optimization added in BTL 3.0). They are now
called with enough information to do whatever they need to do.
`module.btl_put` still makes a descriptor and submits it to the usnic
sending engine so as to utilize a common infrastructure for send and
put.
But it doesn't necessarily have to be that way -- we could optimize
out the use of the descriptors. Have not investigated how easy/hard
that would be.
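Purely as a hypothetical illustration of that handle (the notes above
only say that something meaningful needs to go there; the actual
contents are undecided):
```c
/* Hypothetical sketch only; the real contents of the usnic registration
 * handle in btl_usnic_frag.h are TBD. */
#include <stdint.h>

struct sketch_usnic_registration_handle {
    uint64_t remote_addr;  /* base address of the registered region */
    uint64_t rkey;         /* remote protection key                 */
    uint64_t len;          /* length of the registered region       */
};
/* module.btl_registration_handle_size would then be
 * sizeof(struct sketch_usnic_registration_handle). */
```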
## libfabric abstractions:
* `fi_fabric`: corresponds to a VIC PF
* `fi_domain`: corresponds to a VIC VF
* `fi_endpoint`: resources inside the VIC VF (basically a QP)
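A sketch of how those three objects get opened with standard libfabric
calls; error handling is trimmed, and the API version and provider-name
filter shown here are assumptions for illustration:
```c
/* Sketch: open fabric (VIC PF), domain (VIC VF), and endpoint (QP-like
 * resources in the VF).  Standard libfabric calls; version/provider
 * values are illustrative. */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <string.h>

static int sketch_open_usnic(struct fid_fabric **fabric,  /* VIC PF       */
                             struct fid_domain **domain,  /* VIC VF       */
                             struct fid_ep **ep)          /* VF resources */
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    int ret;

    hints->fabric_attr->prov_name = strdup("usnic");
    ret = fi_getinfo(FI_VERSION(1, 4), NULL, NULL, 0, hints, &info);
    if (0 == ret) ret = fi_fabric(info->fabric_attr, fabric, NULL);
    if (0 == ret) ret = fi_domain(*fabric, info, domain, NULL);
    if (0 == ret) ret = fi_endpoint(*domain, info, ep, NULL);

    fi_freeinfo(hints);
    if (NULL != info) {
        fi_freeinfo(info);
    }
    return ret;
}
```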
## `MPI_THREAD_MULTIPLE` support
In order to make the usnic BTL thread-safe, a mutex lock is used to
protect the critical path, i.e., libfabric routines, bookkeeping, etc.
That lock is `btl_usnic_lock`. It is a RECURSIVE lock, meaning that
the same thread can take the lock again even if it already holds it;
this allows a callback function to post another segment right away
when we know that the current segment completed inline (so we can call
send within send without deadlocking).
These two functions take care of hotel checkin/checkout, so we take
the mutex lock before entering them:
* `opal_btl_usnic_check_rts()`
* `opal_btl_usnic_handle_ack()`
The calls into libfabric routines also have to be protected:
* `opal_btl_usnic_endpoint_send_segment()` (`fi_send`)
* `opal_btl_usnic_recv_call()` (`fi_recvmsg`)
The connectivity client's connection checking
(`opal_btl_usnic_connectivity_ping`) also has to be protected. This
happens only at startup, but the cclient communicates with the cagent
through `opal_fd_read/write()`, and if two or more clients do
`opal_fd_write()` at the same time, the data might be corrupted.
Accordingly, many functions in btl/usnic that call the routines listed
above are protected by the `OPAL_THREAD_LOCK` macro, which is only
active if the user invokes `MPI_Init_thread()` requesting
`MPI_THREAD_MULTIPLE` support.
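A sketch of that locking pattern; `OPAL_THREAD_LOCK`/`OPAL_THREAD_UNLOCK`
and the recursive mutex class are the real OPAL primitives, but the
header paths vary between Open MPI versions and the function body here
is a placeholder:
```c
/* Sketch of the MPI_THREAD_MULTIPLE locking pattern.  Header locations
 * differ across versions (e.g. opal/mca/threads/ on newer trees). */
#include "opal/threads/mutex.h"
#include "opal/threads/thread_usage.h"

static opal_recursive_mutex_t sketch_btl_usnic_lock;

static void sketch_init_lock(void)
{
    /* recursive, so a completion callback may call send again */
    OBJ_CONSTRUCT(&sketch_btl_usnic_lock, opal_recursive_mutex_t);
}

static void sketch_handle_ack(void)
{
    /* these macros are no-ops unless threads are actually in use */
    OPAL_THREAD_LOCK(&sketch_btl_usnic_lock);
    /* ... hotel checkout and send-window bookkeeping would go here ... */
    OPAL_THREAD_UNLOCK(&sketch_btl_usnic_lock);
}
```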

oshmem/mca/memheap/README.md (new file)
@@ -0,0 +1,71 @@
# MEMHEAP infrastructure documentation
Copyright (c) 2013 Mellanox Technologies, Inc.
All rights reserved
The MEMHEAP infrastructure is responsible for managing the symmetric
heap. The framework currently has the following components: buddy and
ptmalloc. The buddy component uses a buddy allocator to manage memory
allocations on the symmetric heap; ptmalloc is an adaptation of
ptmalloc3.
Additional components may be added easily to the framework by defining
the component's and the module's base and extended structures, and
their functionalities.
The buddy allocator has the following data structures:
1. Base component - of type struct mca_memheap_base_component_2_0_0_t
2. Base module - of type struct mca_memheap_base_module_t
3. Buddy component - of type struct mca_memheap_base_component_2_0_0_t
4. Buddy module - of type struct mca_memheap_buddy_module_t extending
the base module (struct mca_memheap_base_module_t)
Each data structure includes the following fields:
1. Base component - memheap_version, memheap_data and memheap_init
2. Base module - Holds pointers to the base component and to the
functions: alloc, free and finalize
3. Buddy component - is a base component.
4. Buddy module - Extends the base module and holds additional data:
the component's priority, the buddy allocator, the maximal order of
the symmetric heap, the symmetric heap, a pointer to the symmetric
heap, and a hashtable maintaining the size of each allocated address
(see the struct sketch after this list).
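A hypothetical sketch of that "buddy module extends base module"
layout; the real definitions live under oshmem/mca/memheap and differ
in detail:
```c
/* Hypothetical sketch of the module-extension pattern; not the actual
 * oshmem/mca/memheap definitions. */
#include <stddef.h>
#include <stdint.h>

typedef struct sketch_memheap_base_module {
    void *(*alloc)(size_t size);
    int   (*free)(void *ptr);
    int   (*finalize)(void);
} sketch_memheap_base_module_t;

typedef struct sketch_memheap_buddy_module {
    sketch_memheap_base_module_t super; /* base module comes first          */
    int      priority;                  /* used for component selection     */
    uint32_t max_order;                 /* log2 of the largest allocation   */
    void    *symmetric_heap;            /* reserved range, same on all PEs  */
    void    *size_hashtable;            /* allocated address -> size        */
} sketch_memheap_buddy_module_t;
```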
If the user implements additional components, the MEMHEAP
infrastructure chooses the component with the maximal priority.
Handling of component opening is done under the base directory, in
three stages:
1. Open all available components. Implemented by memheap_base_open.c
and called from shmem_init.
2. Select the maximal-priority component. This involves initializing
all components and then finalizing all but the chosen component. It is
implemented by memheap_base_select.c and called from shmem_init (see
the selection sketch below).
3. Close the maximal-priority active component. Implemented by
memheap_base_close.c and called from shmem_finalize.
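The selection stage boils down to a max-priority scan, roughly like
this sketch (simplified names; the init/finalize of the losing
components is omitted):
```c
/* Hypothetical sketch of max-priority component selection. */
#include <stddef.h>

struct sketch_component { const char *name; int priority; };

static const struct sketch_component *
sketch_select(const struct sketch_component *avail, size_t n)
{
    const struct sketch_component *best = NULL;
    for (size_t i = 0; i < n; ++i) {
        /* each component would be initialized and queried here; all but
         * the winner are finalized again afterwards */
        if (NULL == best || avail[i].priority > best->priority) {
            best = &avail[i];
        }
    }
    return best;
}
```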
## Buddy Component/Module
Responsible for handling all activities of the symmetric heap. The
supported activities are:
1. buddy_init (initialization)
1. buddy_alloc (allocates a variable on the symmetric heap)
1. buddy_free (frees a variable previously allocated on the symmetric heap)
1. buddy_finalize (finalization)
Data members of the buddy module:
1. priority: the module's priority.
1. buddy allocator: bits, num_free, lock and the maximal order (log2
of the maximal size) of a variable on the symmetric heap. Buddy
Allocator gives the offset in the symmetric heap where a variable
should be allocated.
1. symmetric_heap: a range of reserved addresses (equal in all
executing PE's) dedicated to "shared memory" allocation.
1. symmetric_heap_hashtable (holds the size of each variable allocated
on the symmetric heap; used to free an allocated variable on the
symmetric heap)
test/runtime/README.md (new file)
@@ -0,0 +1,20 @@
The functions in this directory are all intended to test registry
operations against a persistent seed. Thus, they perform a system
init/finalize. The functions in the directory above this one should be
used to test basic registry operations within the replica - they will
isolate the replica so as to avoid the communications issues and the
init/finalize problems in other subsystems that may cause problems
here.
To run these tests, you need to first start a persistent daemon. This
can be done using the command:
```
orted --seed --scope public --persistent
```
The daemon will "daemonize" itself and establish the registry (as well
as other central services) replica, and then return a system
prompt. You can then run any of these functions. If desired, you can
utilize gdb and/or debug options on the persistent orted to
watch/debug replica operations as well.