Convert all README files to Markdown
A mindless task for a lazy weekend: convert all the README and README.txt files to Markdown. Paired with the slow conversion of all of our man pages to Markdown, this gives a uniform language to the Open MPI docs.

This commit moved a bunch of copyright headers out of the top-level README.txt file, so I updated the relevant copyright header years in the top-level LICENSE file to match what was removed from README.txt.

Additionally, this commit did (very) little to update the actual content of the README files. A very small number of updates were made for topics that I found blatantly obvious while Markdown-izing the content, but in general, I did not update content during this commit. For example, there's still quite a bit of text about ORTE that was not meaningfully updated.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Co-authored-by: Josh Hursey <jhursey@us.ibm.com>
This commit is contained in:
parent 686c2142e2
commit c960d292ec
HACKING (272 lines deleted)
@@ -1,272 +0,0 @@

Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
                        University Research and Technology
                        Corporation.  All rights reserved.
Copyright (c) 2004-2005 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
                        University of Stuttgart.  All rights reserved.
Copyright (c) 2004-2005 The Regents of the University of California.
                        All rights reserved.
Copyright (c) 2008-2020 Cisco Systems, Inc.  All rights reserved.
Copyright (c) 2013      Intel, Inc.  All rights reserved.
$COPYRIGHT$

Additional copyrights may follow

$HEADER$

Overview
========

This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).


Developer Builds: Compiler Pickyness by Default
===============================================

If you are building Open MPI from a Git clone (i.e., there is a ".git"
directory in your build tree), the default build includes extra
compiler pickyness, which will result in more compiler warnings than
in non-developer builds.  Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.

Developers can disable this picky-by-default behavior by using the
--disable-picky configure option.  Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
".git" is in your source tree, but not in your build tree).

Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if ".git" was found
in your build tree.  This is no longer true.  You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:

    --enable-debug
    --enable-mem-debug
    --enable-mem-profile

NOTE: These options are really only relevant to those who are
developing Open MPI itself.  They are not generally helpful for
debugging general MPI applications.


Use of GNU Autoconf, Automake, and Libtool (and m4)
===================================================

You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree).  If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.

If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements).  The specific versions
required depend on if you are using the Git master branch or a release
branch (and which release branch you are using).  The specific
versions can be found here:

    https://www.open-mpi.org/source/building.php

You can check what versions of the autotools you have installed with
the following:

    shell$ m4 --version
    shell$ autoconf --version
    shell$ automake --version
    shell$ libtoolize --version

Required version levels for all the OMPI releases can be found here:

    https://www.open-mpi.org/source/building.php

To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools.  There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example).  You *WILL* have problems if you
do not use recent versions of the GNU tools.

If you need newer versions, you are *strongly* encouraged to heed the
following advice:

NOTE: On MacOS/X, the default "libtool" program is different than the
      GNU libtool.  You must download and install the GNU version
      (e.g., via MacPorts, Homebrew, or some other mechanism).

1. Unless your OS distribution has easy-to-use binary installations,
   the sources can be can be downloaded from:

       ftp://ftp.gnu.org/gnu/autoconf/
       ftp://ftp.gnu.org/gnu/automake/
       ftp://ftp.gnu.org/gnu/libtool/
       and if you need it:
       ftp://ftp.gnu.org/gnu/m4/

   NOTE: It is certainly easiest to download/build/install all four of
   these tools together.  But note that Open MPI has no specific m4
   requirements; it is only listed here because Autoconf requires
   minimum versions of GNU m4.  Hence, you may or may not *need* to
   actually install a new version of GNU m4.  That being said, if you
   are confused or don't know, just install the latest GNU m4 with the
   rest of the GNU Autotools and everything will work out fine.

2. Build and install the tools in the following order:

   2a. m4
   2b. Autoconf
   2c. Automake
   2d. Libtool

3. You MUST install the last three tools (Autoconf, Automake, Libtool)
   into the same prefix directory.  These three tools are somewhat
   inter-related, and if they're going to be used together, they MUST
   share a common installation prefix.

   You can install m4 anywhere as long as it can be found in the path;
   it may be convenient to install it in the same prefix as the other
   three.  Or you can use any recent-enough m4 that is in your path.

   3a. It is *strongly* encouraged that you do not install your new
       versions over the OS-installed versions.  This could cause
       other things on your system to break.  Instead, install into
       $HOME/local, or /usr/local, or wherever else you tend to
       install "local" kinds of software.
   3b. In doing so, be sure to prefix your $path with the directory
       where they are installed.  For example, if you install into
       $HOME/local, you may want to edit your shell startup file
       (.bashrc, .cshrc, .tcshrc, etc.) to have something like:

           # For bash/sh:
           export PATH=$HOME/local/bin:$PATH
           # For csh/tcsh:
           set path = ($HOME/local/bin $path)

   3c. Ensure to set your $path *BEFORE* you configure/build/install
       the four packages.

4. All four packages require two simple commands to build and
   install (where PREFIX is the prefix discussed in 3, above).

       shell$ cd <m4 directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <autoconf directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <automake directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <libtool directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

   m4, Autoconf and Automake build and install very quickly; Libtool will
   take a minute or two.

5. You can now run OMPI's top-level "autogen.pl" script.  This script
   will invoke the GNU Autoconf, Automake, and Libtool commands in the
   proper order and setup to run OMPI's top-level "configure" script.

   Running autogen.pl may take a few minutes, depending on your
   system.  It's not very exciting to watch.  :-)

   If you have a multi-processor system, enabling the multi-threaded
   behavior in Automake 1.11 (or newer) can result in autogen.pl
   running faster.  Do this by setting the AUTOMAKE_JOBS environment
   variable to the number of processors (threads) that you want it to
   use before invoking autogen.pl.  For example (you can again put
   this in your shell startup files):

       # For bash/sh:
       export AUTOMAKE_JOBS=4
       # For csh/tcsh:
       set AUTOMAKE_JOBS 4

   5a. You generally need to run autogen.pl whenever the top-level
       file "configure.ac" changes, or any files in the config/ or
       <project>/config/ directories change (these directories are
       where a lot of "include" files for OMPI's configure script
       live).

   5b. You do *NOT* need to re-run autogen.pl if you modify a
       Makefile.am.


Use of Flex
===========

Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs).  Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.

Note that no testing has been performed to see what the minimum
version of Flex is required by Open MPI.  We suggest that you use
v2.5.35 at the earliest.

*** NOTE: Windows developer builds of Open MPI *require* Flex version
2.5.35.  Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
flex version is on Windows; we suggest that you use 2.5.35 at the
earliest.  It is for this reason that the
contrib/dist/make_dist_tarball script checks for a Windows-friendly
version of flex before continuing.

For now, Open MPI will allow developer builds with Flex 2.5.4.  This
is primarily motivated by the fact that RedHat/Centos 5 ships with
Flex 2.5.4.  It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.

Note that the flex-generated code generates some compiler warnings on
some platforms, but the warnings do not seem to be consistent or
uniform on all platforms, compilers, and flex versions.  As such, we
have done little to try to remove those warnings.

If you do not have Flex installed, it can be downloaded from the
following URL:

    https://github.com/westes/flex


Use of Pandoc
=============

Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree).  If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.

The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).

You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree.  If configure cannot find Pandoc >=v1.12, it will abort.

If you need to install Pandoc, check your operating system-provided
packages (to include MacOS Homebrew and MacPorts).  The Pandoc project
itself also offers binaries for their releases:

    https://pandoc.org/

HACKING.md (new file, 258 lines)
@@ -0,0 +1,258 @@

# Open MPI Hacking / Developer's Guide

## Overview

This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).

## Developer Builds: Compiler Pickiness by Default

If you are building Open MPI from a Git clone (i.e., there is a `.git`
directory in your build tree), the default build includes extra
compiler pickiness, which will result in more compiler warnings than
in non-developer builds.  Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.

Developers can disable this picky-by-default behavior by using the
`--disable-picky` configure option.  Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
`.git` is in your source tree, but not in your build tree).

Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if `.git` was found
in your build tree.  This is no longer true.  You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:

* `--enable-debug`
* `--enable-mem-debug`
* `--enable-mem-profile`

***NOTE:*** These options are really only relevant to those who are
developing Open MPI itself.  They are not generally helpful for
debugging general MPI applications.
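
For example, a debug-enabled developer build might be configured along
the following lines; this is only an illustrative sketch, and the
installation prefix shown here is an arbitrary example:

```
# Example only: enable Open MPI's internal debugging features
shell$ ./configure --prefix=$HOME/ompi-debug --enable-debug --enable-mem-debug
shell$ make all install
```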

## Use of GNU Autoconf, Automake, and Libtool (and m4)

You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree).  If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.

If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements).  The specific versions
required depend on whether you are using the Git master branch or a
release branch (and which release branch you are using).  [The specific
versions can be found
here](https://www.open-mpi.org/source/building.php).

You can check what versions of the autotools you have installed with
the following:

```
shell$ m4 --version
shell$ autoconf --version
shell$ automake --version
shell$ libtoolize --version
```

[Required version levels for all the OMPI releases can be found
here](https://www.open-mpi.org/source/building.php).

To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools.  There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example).  You *WILL* have problems if you
do not use recent versions of the GNU tools.

***NOTE:*** On MacOS/X, the default `libtool` program is different
than the GNU libtool.  You must download and install the GNU version
(e.g., via MacPorts, Homebrew, or some other mechanism).

If you need newer versions, you are *strongly* encouraged to heed the
following advice:

1. Unless your OS distribution has easy-to-use binary installations,
   the sources can be downloaded from:
   * https://ftp.gnu.org/gnu/autoconf/
   * https://ftp.gnu.org/gnu/automake/
   * https://ftp.gnu.org/gnu/libtool/
   * And if you need it: https://ftp.gnu.org/gnu/m4/

   ***NOTE:*** It is certainly easiest to download/build/install all
   four of these tools together.  But note that Open MPI has no
   specific m4 requirements; it is only listed here because Autoconf
   requires minimum versions of GNU m4.  Hence, you may or may not
   *need* to actually install a new version of GNU m4.  That being
   said, if you are confused or don't know, just install the latest
   GNU m4 with the rest of the GNU Autotools and everything will work
   out fine.

1. Build and install the tools in the following order:
   1. m4
   1. Autoconf
   1. Automake
   1. Libtool

1. You MUST install the last three tools (Autoconf, Automake, Libtool)
   into the same prefix directory.  These three tools are somewhat
   inter-related, and if they're going to be used together, they MUST
   share a common installation prefix.

   You can install m4 anywhere as long as it can be found in the path;
   it may be convenient to install it in the same prefix as the other
   three.  Or you can use any recent-enough m4 that is in your path.

   1. It is *strongly* encouraged that you do not install your new
      versions over the OS-installed versions.  This could cause
      other things on your system to break.  Instead, install into
      `$HOME/local`, or `/usr/local`, or wherever else you tend to
      install "local" kinds of software.
   1. In doing so, be sure to prefix your `$PATH` with the directory
      where they are installed.  For example, if you install into
      `$HOME/local`, you may want to edit your shell startup file
      (`.bashrc`, `.cshrc`, `.tcshrc`, etc.) to have something like:

      ```sh
      # For bash/sh:
      export PATH=$HOME/local/bin:$PATH
      # For csh/tcsh:
      set path = ($HOME/local/bin $path)
      ```

   1. Be sure to set your `$PATH` *BEFORE* you configure/build/install
      the four packages.

1. All four packages require two simple commands to build and
   install (where PREFIX is the prefix discussed in 3, above).

   ```
   shell$ cd <m4 directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <autoconf directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <automake directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <libtool directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   m4, Autoconf and Automake build and install very quickly; Libtool
   will take a minute or two.

1. You can now run OMPI's top-level `autogen.pl` script.  This script
   will invoke the GNU Autoconf, Automake, and Libtool commands in the
   proper order and setup to run OMPI's top-level `configure` script
   (see the build sketch just after this list).

   Running `autogen.pl` may take a few minutes, depending on your
   system.  It's not very exciting to watch.  :smile:

   If you have a multi-processor system, enabling the multi-threaded
   behavior in Automake 1.11 (or newer) can result in `autogen.pl`
   running faster.  Do this by setting the `AUTOMAKE_JOBS` environment
   variable to the number of processors (threads) that you want it to
   use before invoking `autogen.pl`.  For example (you can again put
   this in your shell startup files):

   ```sh
   # For bash/sh:
   export AUTOMAKE_JOBS=4
   # For csh/tcsh:
   set AUTOMAKE_JOBS 4
   ```

   1. You generally need to run `autogen.pl` whenever the top-level file
      `configure.ac` changes, or any files in the `config/` or
      `<project>/config/` directories change (these directories are
      where a lot of "include" files for Open MPI's `configure` script
      live).
   1. You do *NOT* need to re-run `autogen.pl` if you modify a
      `Makefile.am`.
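
Putting the steps above together, a typical developer-tree build might
look something like the following sketch; the installation prefix and
parallel job count are only examples:

```
# Run from the top of the Git clone
shell$ ./autogen.pl
shell$ ./configure --prefix=$HOME/ompi-install
shell$ make -j 4 all
shell$ make install
```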

## Use of Flex

Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs).  Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.

Note that no testing has been performed to determine the minimum
version of Flex required by Open MPI.  We suggest that you use
v2.5.35 at the earliest.

***NOTE:*** Windows developer builds of Open MPI *require* Flex version
2.5.35.  Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
Flex version is on Windows; we suggest that you use 2.5.35 at the
earliest.  It is for this reason that the
`contrib/dist/make_dist_tarball` script checks for a Windows-friendly
version of Flex before continuing.

For now, Open MPI will allow developer builds with Flex 2.5.4.  This
is primarily motivated by the fact that RedHat/CentOS 5 ships with
Flex 2.5.4.  It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.

Note that the `flex`-generated code generates some compiler warnings
on some platforms, but the warnings do not seem to be consistent or
uniform across all platforms, compilers, and Flex versions.  As such, we
have done little to try to remove those warnings.

If you do not have Flex installed, see [the Flex GitHub
repository](https://github.com/westes/flex).

## Use of Pandoc

Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree).  If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.

The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).

You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree.  If configure cannot find Pandoc >=v1.12, it will abort.

If you need to install Pandoc, check your operating system-provided
packages (including MacOS Homebrew and MacPorts).  [The Pandoc
project web site](https://pandoc.org/) itself also offers binaries for
their releases.
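
As a rough illustration of the kind of conversion the build system
arranges, converting a single Markdown man page by hand might look
something like this; the file names are placeholders, not actual build
targets:

```
# Example only: standalone Markdown -> nroff man page conversion
shell$ pandoc -s --from=markdown --to=man my_manpage.3.md -o my_manpage.3
```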

LICENSE (11 lines changed)
@@ -15,9 +15,9 @@ Copyright (c) 2004-2010 High Performance Computing Center Stuttgart,
                         University of Stuttgart. All rights reserved.
 Copyright (c) 2004-2008 The Regents of the University of California.
                         All rights reserved.
-Copyright (c) 2006-2017 Los Alamos National Security, LLC. All rights
+Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
                         reserved.
-Copyright (c) 2006-2017 Cisco Systems, Inc. All rights reserved.
+Copyright (c) 2006-2020 Cisco Systems, Inc. All rights reserved.
 Copyright (c) 2006-2010 Voltaire, Inc. All rights reserved.
 Copyright (c) 2006-2017 Sandia National Laboratories. All rights reserved.
 Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved.
@@ -25,7 +25,7 @@ Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved.
 Copyright (c) 2006-2017 The University of Houston. All rights reserved.
 Copyright (c) 2006-2009 Myricom, Inc. All rights reserved.
 Copyright (c) 2007-2017 UT-Battelle, LLC. All rights reserved.
-Copyright (c) 2007-2017 IBM Corporation. All rights reserved.
+Copyright (c) 2007-2020 IBM Corporation. All rights reserved.
 Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing
                         Centre, Federal Republic of Germany
 Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
@@ -45,7 +45,7 @@ Copyright (c) 2016 ARM, Inc. All rights reserved.
 Copyright (c) 2010-2011 Alex Brick <bricka@ccs.neu.edu>. All rights reserved.
 Copyright (c) 2012      The University of Wisconsin-La Crosse. All rights
                         reserved.
-Copyright (c) 2013-2016 Intel, Inc. All rights reserved.
+Copyright (c) 2013-2020 Intel, Inc. All rights reserved.
 Copyright (c) 2011-2017 NVIDIA Corporation. All rights reserved.
 Copyright (c) 2016      Broadcom Limited. All rights reserved.
 Copyright (c) 2011-2017 Fujitsu Limited. All rights reserved.
@@ -56,7 +56,8 @@ Copyright (c) 2013-2017 Research Organization for Information Science (RIST).
 Copyright (c) 2017-2020 Amazon.com, Inc. or its affiliates. All Rights
                         reserved.
 Copyright (c) 2018      DataDirect Networks. All rights reserved.
-Copyright (c) 2018-2019 Triad National Security, LLC. All rights reserved.
+Copyright (c) 2018-2020 Triad National Security, LLC. All rights reserved.
+Copyright (c) 2020      Google, LLC. All rights reserved.
 
 $COPYRIGHT$
 

@@ -24,7 +24,7 @@
 
 SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_SUBDIRS) test
 DIST_SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_DIST_SUBDIRS) test
-EXTRA_DIST = README INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.txt AUTHORS
+EXTRA_DIST = README.md INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.md AUTHORS
 
 include examples/Makefile.include
 
README (2243 lines deleted)
The diff for this file is not shown because of its large size.

README.JAVA.md (new file, 281 lines)
@@ -0,0 +1,281 @@

# Open MPI Java Bindings

## Important note

JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS.  THUS, INCLUSION OF
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD.  CONTINUED INCLUSION OF
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
CONTINUED DEVELOPER SUPPORT.

## Overview

This version of Open MPI provides support for Java-based
MPI applications.

The rest of this document provides step-by-step instructions on
building OMPI with Java bindings, and compiling and running Java-based
MPI applications.  Also, part of the functionality is explained with
examples.  Further details about the design, implementation and usage
of Java bindings in Open MPI can be found in [1].  The bindings follow
a JNI approach, that is, we do not provide a pure Java implementation
of MPI primitives, but a thin layer on top of the C
implementation.  This is the same approach as in mpiJava [2]; in fact,
mpiJava was taken as a starting point for Open MPI Java bindings, but
they were later totally rewritten.

1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
   implementation of Java bindings in Open MPI". Parallel Comput.
   59: 1-20 (2016).
2. M. Baker et al. "mpiJava: An object-oriented Java interface to
   MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
   pp. 748-762, Springer (1999).

## Building Java Bindings

If this software was obtained as a developer-level checkout as opposed
to a tarball, you will need to start your build by running
`./autogen.pl`.  This will also require that you have a fairly recent
version of GNU Autotools on your system - see the HACKING.md file for
details.

Java support requires that Open MPI be built at least with shared libraries
(i.e., `--enable-shared`) - any additional options are fine and will not
conflict.  Note that this is the default for Open MPI, so you don't
have to explicitly add the option.  The Java bindings will build only
if `--enable-mpi-java` is specified, and a JDK is found in a typical
system default location.

If the JDK is not in a place where we automatically find it, you can
specify the location.  For example, this is required on the Mac
platform as the JDK headers are located in a non-typical location.  Two
options are available for this purpose:

1. `--with-jdk-bindir=<foo>`: the location of `javac` and `javah`
1. `--with-jdk-headers=<bar>`: the directory containing `jni.h`

For simplicity, typical configurations are provided in platform files
under `contrib/platform/hadoop`.  These will meet the needs of most
users, or at least provide a starting point for your own custom
configuration.

In summary, therefore, you can configure the system using the
following Java-related options:

```
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform> ...
```

or

```
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo> --with-jdk-headers=<bar> ...
```

or simply

```
$ ./configure --enable-mpi-java ...
```

if the JDK is in a "standard" place that we automatically find.

## Running Java Applications

For convenience, the `mpijavac` wrapper compiler has been provided for
compiling Java-based MPI applications.  It ensures that all required MPI
libraries and class paths are defined.  You can see the actual command
line using the `--showme` option, if you are interested.
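
For example, compiling a simple program might look like the following
sketch; `Hello.java` is just a placeholder source file name:

```
# Compile an MPI-enabled Java source file with the wrapper compiler
shell$ mpijavac Hello.java
# Show the underlying javac command line without running it
shell$ mpijavac Hello.java --showme
```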

Once your application has been compiled, you can run it with the
standard `mpirun` command line:

```
$ mpirun <options> java <your-java-options> <my-app>
```

For convenience, `mpirun` has been updated to detect the `java` command
and ensure that the required MPI libraries and class paths are defined
to support execution.  You therefore do _NOT_ need to specify the Java
library path to the MPI installation, nor the MPI classpath.  Any class
path definitions required for your application should be specified
either on the command line or via the `CLASSPATH` environment
variable.  Note that the local directory will be added to the class
path if nothing is specified.

As always, the `java` executable, all required libraries, and your
application classes must be available on all nodes.

## Basic usage of Java bindings

There is an MPI package that contains all classes of the MPI Java
bindings: `Comm`, `Datatype`, `Request`, etc.  These classes have a
direct correspondence with classes defined by the MPI standard.  MPI
primitives are just methods included in these classes.  The convention
used for naming Java methods and classes is the usual camel-case
convention, e.g., the equivalent of `MPI_File_set_info(fh,info)` is
`fh.setInfo(info)`, where `fh` is an object of the class `File`.

Apart from classes, the MPI package contains predefined public
attributes under a convenience class `MPI`.  Examples are the
predefined communicator `MPI.COMM_WORLD` or predefined datatypes such
as `MPI.DOUBLE`.  Also, MPI initialization and finalization are methods
of the `MPI` class and must be invoked by all MPI Java
applications.  The following example illustrates these concepts:

```java
import mpi.*;

class ComputePi {

    public static void main(String args[]) throws MPIException {

        MPI.Init(args);

        int rank = MPI.COMM_WORLD.getRank(),
            size = MPI.COMM_WORLD.getSize(),
            nint = 100; // Intervals.
        double h = 1.0/(double)nint, sum = 0.0;

        for(int i=rank+1; i<=nint; i+=size) {
            double x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x * x));
        }

        double sBuf[] = { h * sum },
               rBuf[] = new double[1];

        MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);

        if(rank == 0) System.out.println("PI: " + rBuf[0]);
        MPI.Finalize();
    }
}
```

## Exception handling

Java bindings in Open MPI support exception handling.  By default, errors
are fatal, but this behavior can be changed.  The Java API will throw
exceptions if the `MPI.ERRORS_RETURN` error handler is set:

```java
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
```

If you add this statement to your program, it will show the line
where it breaks, instead of just crashing in case of an error.
Error-handling code can be separated from main application code by
means of try-catch blocks, for instance:

```java
try
{
    File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
}
catch(MPIException ex)
{
    System.err.println("Error Message: "+ ex.getMessage());
    System.err.println("  Error Class: "+ ex.getErrorClass());
    ex.printStackTrace();
    System.exit(-1);
}
```

## How to specify buffers

In MPI primitives that require a buffer (either send or receive) the
Java API admits a Java array.  Since Java arrays can be relocated by
the Java runtime environment, the MPI Java bindings need to make a
copy of the contents of the array to a temporary buffer, then pass the
pointer to this buffer to the underlying C implementation.  From the
practical point of view, this implies an overhead associated with all
buffers that are represented by Java arrays.  The overhead is small
for small buffers but increases for large arrays.

There is a pool of temporary buffers with a default capacity of 64K.
If a temporary buffer of 64K or less is needed, then the buffer will
be obtained from the pool.  But if the buffer is larger, then it will
be necessary to allocate the buffer and free it later.

The default capacity of pool buffers can be modified with an Open MPI
MCA parameter:

```
shell$ mpirun --mca mpi_java_eager size ...
```

where `size` is the number of bytes, or kilobytes if it ends with 'k',
or megabytes if it ends with 'm'.
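
For instance, a hypothetical run that raises the pool buffer capacity
to 4 megabytes might look like this; the process count and application
name are placeholders:

```
# Example only: larger temporary-buffer pool for a Java MPI job
shell$ mpirun --mca mpi_java_eager 4m -np 2 java MyMPIApp
```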

An alternative is to use "direct buffers" provided by standard classes
available in the Java SDK such as `ByteBuffer`.  For convenience we
provide a few static methods `new[Type]Buffer` in the `MPI` class to
create direct buffers for a number of basic datatypes.  Elements of the
direct buffer can be accessed with methods `put()` and `get()`, and
the number of elements in the buffer can be obtained with the method
`capacity()`.  This example illustrates its use:

```java
int myself = MPI.COMM_WORLD.getRank();
int tasks = MPI.COMM_WORLD.getSize();

IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
          out = MPI.newIntBuffer(MAXLEN);

for(int i = 0; i < MAXLEN; i++)
    out.put(i, myself); // fill the buffer with the rank

Request request = MPI.COMM_WORLD.iAllGather(
                  out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
request.waitFor();
request.free();

for(int i = 0; i < tasks; i++)
{
    for(int k = 0; k < MAXLEN; k++)
    {
        if(in.get(k + i * MAXLEN) != i)
            throw new AssertionError("Unexpected value");
    }
}
```

Direct buffers are available for: `BYTE`, `CHAR`, `SHORT`, `INT`,
`LONG`, `FLOAT`, and `DOUBLE`.  There is no direct buffer for booleans.

Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays.  In some
cases arrays will be a better choice.  You can easily convert a
buffer into an array and vice versa.

All non-blocking methods must use direct buffers and only
blocking methods can choose between arrays and direct buffers.

The above example also illustrates that it is necessary to call
the `free()` method on objects whose class implements the `Freeable`
interface.  Otherwise a memory leak is produced.

## Specifying offsets in buffers

In a C program, it is common to specify an offset in an array with
`&array[i]` or `array+i`, for instance to send data starting from
a given position in the array.  The equivalent form in the Java bindings
is to `slice()` the buffer to start at an offset.  Making a `slice()`
on a buffer is only necessary when the offset is not zero.  Slices
work for both arrays and direct buffers.

```java
import static mpi.MPI.slice;
// ...
int numbers[] = new int[SIZE];
// ...
MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);
```

## Questions? Problems?

If you have any problems, or find any bugs, please feel free to report
them to the [Open MPI user's mailing
list](https://www.open-mpi.org/community/lists/ompi.php).

README.JAVA.txt (275 lines deleted)
@@ -1,275 +0,0 @@
***************************************************************************
|
|
||||||
IMPORTANT NOTE
|
|
||||||
|
|
||||||
JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
|
|
||||||
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
|
|
||||||
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
|
|
||||||
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
|
|
||||||
CONTINUED DEVELOPER SUPPORT.
|
|
||||||
|
|
||||||
***************************************************************************
|
|
||||||
|
|
||||||
This version of Open MPI provides support for Java-based
|
|
||||||
MPI applications.
|
|
||||||
|
|
||||||
The rest of this document provides step-by-step instructions on
|
|
||||||
building OMPI with Java bindings, and compiling and running
|
|
||||||
Java-based MPI applications. Also, part of the functionality is
|
|
||||||
explained with examples. Further details about the design,
|
|
||||||
implementation and usage of Java bindings in Open MPI can be found
|
|
||||||
in [1]. The bindings follow a JNI approach, that is, we do not
|
|
||||||
provide a pure Java implementation of MPI primitives, but a thin
|
|
||||||
layer on top of the C implementation. This is the same approach
|
|
||||||
as in mpiJava [2]; in fact, mpiJava was taken as a starting point
|
|
||||||
for Open MPI Java bindings, but they were later totally rewritten.
|
|
||||||
|
|
||||||
[1] O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
|
|
||||||
implementation of Java bindings in Open MPI". Parallel Comput.
|
|
||||||
59: 1-20 (2016).
|
|
||||||
|
|
||||||
[2] M. Baker et al. "mpiJava: An object-oriented Java interface to
|
|
||||||
MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
|
|
||||||
pp. 748-762, Springer (1999).
|
|
||||||
|
|
||||||
============================================================================
|
|
||||||
|
|
||||||
Building Java Bindings
|
|
||||||
|
|
||||||
If this software was obtained as a developer-level
|
|
||||||
checkout as opposed to a tarball, you will need to start your build by
|
|
||||||
running ./autogen.pl. This will also require that you have a fairly
|
|
||||||
recent version of autotools on your system - see the HACKING file for
|
|
||||||
details.
|
|
||||||
|
|
||||||
Java support requires that Open MPI be built at least with shared libraries
|
|
||||||
(i.e., --enable-shared) - any additional options are fine and will not
|
|
||||||
conflict. Note that this is the default for Open MPI, so you don't
|
|
||||||
have to explicitly add the option. The Java bindings will build only
|
|
||||||
if --enable-mpi-java is specified, and a JDK is found in a typical
|
|
||||||
system default location.
|
|
||||||
|
|
||||||
If the JDK is not in a place where we automatically find it, you can
|
|
||||||
specify the location. For example, this is required on the Mac
|
|
||||||
platform as the JDK headers are located in a non-typical location. Two
|
|
||||||
options are available for this purpose:
|
|
||||||
|
|
||||||
--with-jdk-bindir=<foo> - the location of javac and javah
|
|
||||||
--with-jdk-headers=<bar> - the directory containing jni.h
|
|
||||||
|
|
||||||
For simplicity, typical configurations are provided in platform files
|
|
||||||
under contrib/platform/hadoop. These will meet the needs of most
|
|
||||||
users, or at least provide a starting point for your own custom
|
|
||||||
configuration.
|
|
||||||
|
|
||||||
In summary, therefore, you can configure the system using the
|
|
||||||
following Java-related options:
|
|
||||||
|
|
||||||
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform>
|
|
||||||
...
|
|
||||||
|
|
||||||
or
|
|
||||||
|
|
||||||
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo>
|
|
||||||
--with-jdk-headers=<bar> ...
|
|
||||||
|
|
||||||
or simply
|
|
||||||
|
|
||||||
$ ./configure --enable-mpi-java ...
|
|
||||||
|
|
||||||
if JDK is in a "standard" place that we automatically find.
|
|
||||||
|
|
||||||
----------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Running Java Applications
|
|
||||||
|
|
||||||
For convenience, the "mpijavac" wrapper compiler has been provided for
|
|
||||||
compiling Java-based MPI applications. It ensures that all required MPI
|
|
||||||
libraries and class paths are defined. You can see the actual command
|
|
||||||
line using the --showme option, if you are interested.
|
|
||||||
|
|
||||||
Once your application has been compiled, you can run it with the
|
|
||||||
standard "mpirun" command line:
|
|
||||||
|
|
||||||
$ mpirun <options> java <your-java-options> <my-app>
|
|
||||||
|
|
||||||
For convenience, mpirun has been updated to detect the "java" command
|
|
||||||
and ensure that the required MPI libraries and class paths are defined
|
|
||||||
to support execution. You therefore do NOT need to specify the Java
|
|
||||||
library path to the MPI installation, nor the MPI classpath. Any class
|
|
||||||
path definitions required for your application should be specified
|
|
||||||
either on the command line or via the CLASSPATH environmental
|
|
||||||
variable. Note that the local directory will be added to the class
|
|
||||||
path if nothing is specified.
|
|
||||||
|
|
||||||
As always, the "java" executable, all required libraries, and your
|
|
||||||
application classes must be available on all nodes.
|
|
||||||
|
|
||||||
----------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Basic usage of Java bindings
|
|
||||||
|
|
||||||
There is an MPI package that contains all classes of the MPI Java
|
|
||||||
bindings: Comm, Datatype, Request, etc. These classes have a direct
|
|
||||||
correspondence with classes defined by the MPI standard. MPI primitives
|
|
||||||
are just methods included in these classes. The convention used for
|
|
||||||
naming Java methods and classes is the usual camel-case convention,
|
|
||||||
e.g., the equivalent of MPI_File_set_info(fh,info) is fh.setInfo(info),
|
|
||||||
where fh is an object of the class File.
|
|
||||||
|
|
||||||
Apart from classes, the MPI package contains predefined public attributes
|
|
||||||
under a convenience class MPI. Examples are the predefined communicator
|
|
||||||
MPI.COMM_WORLD or predefined datatypes such as MPI.DOUBLE. Also, MPI
|
|
||||||
initialization and finalization are methods of the MPI class and must
|
|
||||||
be invoked by all MPI Java applications. The following example illustrates
|
|
||||||
these concepts:
|
|
||||||
|
|
||||||
import mpi.*;
|
|
||||||
|
|
||||||
class ComputePi {
|
|
||||||
|
|
||||||
public static void main(String args[]) throws MPIException {
|
|
||||||
|
|
||||||
MPI.Init(args);
|
|
||||||
|
|
||||||
int rank = MPI.COMM_WORLD.getRank(),
|
|
||||||
size = MPI.COMM_WORLD.getSize(),
|
|
||||||
nint = 100; // Intervals.
|
|
||||||
double h = 1.0/(double)nint, sum = 0.0;
|
|
||||||
|
|
||||||
for(int i=rank+1; i<=nint; i+=size) {
|
|
||||||
double x = h * ((double)i - 0.5);
|
|
||||||
sum += (4.0 / (1.0 + x * x));
|
|
||||||
}
|
|
||||||
|
|
||||||
double sBuf[] = { h * sum },
|
|
||||||
rBuf[] = new double[1];
|
|
||||||
|
|
||||||
MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);
|
|
||||||
|
|
||||||
if(rank == 0) System.out.println("PI: " + rBuf[0]);
|
|
||||||
MPI.Finalize();
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
----------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Exception handling
|
|
||||||
|
|
||||||
Java bindings in Open MPI support exception handling. By default, errors
|
|
||||||
are fatal, but this behavior can be changed. The Java API will throw
|
|
||||||
exceptions if the MPI.ERRORS_RETURN error handler is set:
|
|
||||||
|
|
||||||
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
|
|
||||||
|
|
||||||
If you add this statement to your program, it will show the line
|
|
||||||
where it breaks, instead of just crashing in case of an error.
|
|
||||||
Error-handling code can be separated from main application code by
|
|
||||||
means of try-catch blocks, for instance:
|
|
||||||
|
|
||||||
try
|
|
||||||
{
|
|
||||||
File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
|
|
||||||
}
|
|
||||||
catch(MPIException ex)
|
|
||||||
{
|
|
||||||
System.err.println("Error Message: "+ ex.getMessage());
|
|
||||||
System.err.println(" Error Class: "+ ex.getErrorClass());
|
|
||||||
ex.printStackTrace();
|
|
||||||
System.exit(-1);
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
----------------------------------------------------------------------------
|
|
||||||
|
|
||||||
How to specify buffers
|
|
||||||
|
|
||||||
In MPI primitives that require a buffer (either send or receive) the
|
|
||||||
Java API admits a Java array. Since Java arrays can be relocated by
|
|
||||||
the Java runtime environment, the MPI Java bindings need to make a
|
|
||||||
copy of the contents of the array to a temporary buffer, then pass the
|
|
||||||
pointer to this buffer to the underlying C implementation. From the
|
|
||||||
practical point of view, this implies an overhead associated to all
|
|
||||||
buffers that are represented by Java arrays. The overhead is small
|
|
||||||
for small buffers but increases for large arrays.
|
|
||||||
|
|
||||||
There is a pool of temporary buffers with a default capacity of 64K.
|
|
||||||
If a temporary buffer of 64K or less is needed, then the buffer will
|
|
||||||
be obtained from the pool. But if the buffer is larger, then it will
|
|
||||||
be necessary to allocate the buffer and free it later.
|
|
||||||
|
|
||||||
The default capacity of pool buffers can be modified with an 'mca'
|
|
||||||
parameter:
|
|
||||||
|
|
||||||
mpirun --mca mpi_java_eager size ...
|
|
||||||
|
|
||||||
Where 'size' is the number of bytes, or kilobytes if it ends with 'k',
|
|
||||||
or megabytes if it ends with 'm'.
|
|
||||||
|
|
||||||
An alternative is to use "direct buffers" provided by standard
classes available in the Java SDK such as ByteBuffer.  For convenience
we provide a few static methods "new[Type]Buffer" in the MPI class
to create direct buffers for a number of basic datatypes.  Elements
of the direct buffer can be accessed with methods put() and get(),
and the number of elements in the buffer can be obtained with the
method capacity().  This example illustrates its use:

    int myself = MPI.COMM_WORLD.getRank();
    int tasks = MPI.COMM_WORLD.getSize();

    IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
              out = MPI.newIntBuffer(MAXLEN);

    for(int i = 0; i < MAXLEN; i++)
        out.put(i, myself);  // fill the buffer with the rank

    Request request = MPI.COMM_WORLD.iAllGather(
                      out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
    request.waitFor();
    request.free();

    for(int i = 0; i < tasks; i++)
    {
        for(int k = 0; k < MAXLEN; k++)
        {
            if(in.get(k + i * MAXLEN) != i)
                throw new AssertionError("Unexpected value");
        }
    }

Direct buffers are available for: BYTE, CHAR, SHORT, INT, LONG,
FLOAT, and DOUBLE.  There is no direct buffer for booleans.

Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays.  In some
cases arrays will be a better choice.  You can easily convert a
buffer into an array and vice versa.

All non-blocking methods must use direct buffers; only blocking
methods can choose between arrays and direct buffers.

The above example also illustrates that it is necessary to call
the free() method on objects whose class implements the Freeable
interface.  Otherwise a memory leak is produced.

----------------------------------------------------------------------------

Specifying offsets in buffers

In a C program, it is common to specify an offset in an array with
"&array[i]" or "array+i", for instance to send data starting from
a given position in the array.  The equivalent form in the Java
bindings is to "slice()" the buffer to start at an offset.  Making a
"slice()" on a buffer is only necessary when the offset is not zero.
Slices work for both arrays and direct buffers.

    import static mpi.MPI.slice;
    ...
    int numbers[] = new int[SIZE];
    ...
    MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);

----------------------------------------------------------------------------

If you have any problems, or find any bugs, please feel free to report
them to the Open MPI user's mailing list (see
https://www.open-mpi.org/community/lists/ompi.php).

README.md: new file, 2191 lines (file diff not shown because of its size).

@@ -64,7 +64,7 @@ EXTRA_DIST = \
 	platform/lanl/cray_xc_cle5.2/optimized-common \
 	platform/lanl/cray_xc_cle5.2/optimized-lustre \
 	platform/lanl/cray_xc_cle5.2/optimized-lustre.conf \
-	platform/lanl/toss/README \
+	platform/lanl/toss/README.md \
 	platform/lanl/toss/common \
 	platform/lanl/toss/common-optimized \
 	platform/lanl/toss/cray-lustre-optimized \
@@ -1,121 +1,108 @@
+# Description

 2 Feb 2011

-Description
-===========
-
-This sample "tcp2" BTL component is a simple example of how to build
+This sample `tcp2` BTL component is a simple example of how to build
 an Open MPI MCA component from outside of the Open MPI source tree.
 This is a valuable technique for 3rd parties who want to provide their
 own components for Open MPI, but do not want to be in the mainstream
 distribution (i.e., their code is not part of the main Open MPI code
 base).

-NOTE: We do recommend that 3rd party developers investigate using a
-      DVCS such as Mercurial or Git to keep up with Open MPI
-      development.  Using a DVCS allows you to host your component in
-      your own copy of the Open MPI source tree, and yet still keep up
-      with development changes, stable releases, etc.
-
 Previous colloquial knowledge held that building a component from
 outside of the Open MPI source tree required configuring Open MPI
---with-devel-headers, and then building and installing it.  This
-configure switch installs all of OMPI's internal .h files under
-$prefix/include/openmpi, and therefore allows 3rd party code to be
+`--with-devel-headers`, and then building and installing it.  This
+configure switch installs all of OMPI's internal `.h` files under
+`$prefix/include/openmpi`, and therefore allows 3rd party code to be
 compiled outside of the Open MPI tree.

 This method definitely works, but is annoying:

 * You have to ask users to use this special configure switch.
 * Not all users install from source; many get binary packages (e.g.,
   RPMs).

 This example package shows two ways to build an Open MPI MCA component
 from outside the Open MPI source tree:

-1. Using the above --with-devel-headers technique
+1. Using the above `--with-devel-headers` technique
 2. Compiling against the Open MPI source tree itself (vs. the
    installation tree)

 The user still has to have a source tree, but at least they don't have
-to be required to use --with-devel-headers (which most users don't) --
+to be required to use `--with-devel-headers` (which most users don't) --
 they can likely build off the source tree that they already used.

-Example project contents
-========================
+# Example project contents

-The "tcp2" component is a direct copy of the TCP BTL as of January
+The `tcp2` component is a direct copy of the TCP BTL as of January
 2011 -- it has just been renamed so that it can be built separately
 and installed alongside the real TCP BTL component.

 Most of the mojo for both methods is handled in the example
-components' configure.ac, but the same techniques are applicable
+components' `configure.ac`, but the same techniques are applicable
 outside of the GNU Auto toolchain.

-This sample "tcp2" component has an autogen.sh script that requires
+This sample `tcp2` component has an `autogen.sh` script that requires
 the normal Autoconf, Automake, and Libtool.  It also adds the
 following two configure switches:

---with-openmpi-install=DIR
-
-   If provided, DIR is an Open MPI installation tree that was
-   installed --with-devel-headers.
-
-   This switch uses the installed mpicc --showme:<foo> functionality
-   to extract the relevant CPPFLAGS, LDFLAGS, and LIBS.
-
---with-openmpi-source=DIR
-
-   If provided, DIR is the source of a configured and built Open MPI
+1. `--with-openmpi-install=DIR`:
+   If provided, `DIR` is an Open MPI installation tree that was
+   installed `--with-devel-headers`.
+
+   This switch uses the installed `mpicc --showme:<foo>` functionality
+   to extract the relevant `CPPFLAGS`, `LDFLAGS`, and `LIBS`.
+
+1. `--with-openmpi-source=DIR`:
+   If provided, `DIR` is the source of a configured and built Open MPI
    source tree (corresponding to the version expected by the example
   component).  The source tree is not required to have been
-   configured --with-devel-headers.
+   configured `--with-devel-headers`.

-   This switch uses the source tree's config.status script to extract
-   the relevant CPPFLAGS and CFLAGS.
+   This switch uses the source tree's `config.status` script to
+   extract the relevant `CPPFLAGS` and `CFLAGS`.

 Either one of these two switches must be provided, or appropriate
-CPPFLAGS, CFLAGS, LDFLAGS, and/or LIBS must be provided such that
-valid Open MPI header and library files can be found and compiled /
-linked against, respectively.
+`CPPFLAGS`, `CFLAGS`, `LDFLAGS`, and/or `LIBS` must be provided such
+that valid Open MPI header and library files can be found and compiled
+/ linked against, respectively.

-Example use
-===========
+# Example use

 First, download, build, and install Open MPI:

------
+```
 $ cd $HOME
-$ wget \
-    https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
-[lots of output]
+$ wget https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
+[...lots of output...]
 $ tar jxf openmpi-X.Y.Z.tar.bz2
 $ cd openmpi-X.Y.Z
 $ ./configure --prefix=/opt/openmpi ...
-[lots of output]
+[...lots of output...]
 $ make -j 4 install
-[lots of output]
+[...lots of output...]
 $ /opt/openmpi/bin/ompi_info | grep btl
          MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
          MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
          MCA btl: tcp (MCA vA.B, API vM.N, Component vX.Y.Z)
 [where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
 $
------
+```

-Notice the installed BTLs from ompi_info.
+Notice the installed BTLs from `ompi_info`.

-Now cd into this example project and build it, pointing it to the
+Now `cd` into this example project and build it, pointing it to the
 source directory of the Open MPI that you just built.  Note that we
-use the same --prefix as when installing Open MPI (so that the built
+use the same `--prefix` as when installing Open MPI (so that the built
 component will be installed into the Right place):

------
+```
 $ cd /path/to/this/sample
 $ ./autogen.sh
 $ ./configure --prefix=/opt/openmpi --with-openmpi-source=$HOME/openmpi-X.Y.Z
-[lots of output]
+[...lots of output...]
 $ make -j 4 install
-[lots of output]
+[...lots of output...]
 $ /opt/openmpi/bin/ompi_info | grep btl
          MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
          MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
@@ -123,12 +110,11 @@ $ /opt/openmpi/bin/ompi_info | grep btl
          MCA btl: tcp2 (MCA vA.B, API vM.N, Component vX.Y.Z)
 [where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
 $
------
+```

-Notice that the "tcp2" BTL is now installed.
+Notice that the `tcp2` BTL is now installed.

-Random notes
-============
+# Random notes

 The component in this project is just an example; I whipped it up in
 the span of several hours.  Your component may be a bit more complex
@@ -139,17 +125,15 @@ what you need.
 Changes required to the component to make it build in a standalone
 mode:

-1. Write your own configure script.  This component is just a sample.
-   You basically need to build against an OMPI install that was
-   installed --with-devel-headers or a built OMPI source tree.  See
-   ./configure --help for details.
-
-2. I also provided a bogus btl_tcp2_config.h (generated by configure).
-   This file is not included anywhere, but it does provide protection
-   against re-defined PACKAGE_* macros when running configure, which
-   is quite annoying.
-
-3. Modify Makefile.am to only build DSOs.  I.e., you can optionally
+1. Write your own `configure` script.  This component is just a
+   sample.  You basically need to build against an OMPI install that
+   was installed `--with-devel-headers` or a built OMPI source tree.
+   See `./configure --help` for details.
+1. I also provided a bogus `btl_tcp2_config.h` (generated by
+   `configure`).  This file is not included anywhere, but it does
+   provide protection against re-defined `PACKAGE_*` macros when
+   running `configure`, which is quite annoying.
+1. Modify `Makefile.am` to only build DSOs.  I.e., you can optionally
    take the static option out since the component can *only* build in
    DSO mode when building standalone.  That being said, it doesn't
    hurt to leave the static builds in -- this would (hypothetically)
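The `mpicc --showme:<foo>` functionality referred to in the configure
switches above can also be exercised by hand.  A minimal sketch,
assuming an Open MPI installation that was built `--with-devel-headers`
and whose `mpicc` is in `$PATH` (this is an illustration, not the
example component's actual `configure.ac` logic):

```
# Ask the Open MPI wrapper compiler which flags it would use, then
# pass them to a standalone component's configure script.
CPPFLAGS="$(mpicc --showme:compile)"
LIBS="$(mpicc --showme:link)"
./configure --prefix=/opt/openmpi CPPFLAGS="$CPPFLAGS" LIBS="$LIBS"
```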
contrib/dist/linux/README (vendored, 105 lines): deleted
@@ -1,105 +0,0 @@
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
                        University Research and Technology
                        Corporation.  All rights reserved.
Copyright (c) 2004-2006 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2004-2006 High Performance Computing Center Stuttgart,
                        University of Stuttgart.  All rights reserved.
Copyright (c) 2004-2006 The Regents of the University of California.
                        All rights reserved.
Copyright (c) 2006-2016 Cisco Systems, Inc.  All rights reserved.
$COPYRIGHT$

Additional copyrights may follow

$HEADER$

===========================================================================

Note that you probably want to download the latest release of the SRPM
for any given Open MPI version.  The SRPM release number is the
version after the dash in the SRPM filename.  For example,
"openmpi-1.6.3-2.src.rpm" is the 2nd release of the SRPM for Open MPI
v1.6.3.  Subsequent releases of SRPMs typically contain bug fixes for
the RPM packaging, but not Open MPI itself.

The buildrpm.sh script takes a single mandatory argument -- a filename
pointing to an Open MPI tarball (may be either .gz or .bz2).  It will
create one or more RPMs from this tarball:

1. Source RPM
2. "All in one" RPM, where all of Open MPI is put into a single RPM.
3. "Multiple" RPM, where Open MPI is split into several sub-package
   RPMs:
   - openmpi-runtime
   - openmpi-devel
   - openmpi-docs

The following arguments can be used to affect the script's behaviour.
Please do NOT set the same settings with both parameters and config vars.

-b
   If you specify this option, only the all-in-one binary RPM will
   be built.  By default, only the source RPM (SRPM) is built.  Other
   parameters that affect the all-in-one binary RPM will be ignored
   unless this option is specified.

-n name
   This option will change the name of the produced RPM to "name".
   It is useful together with the "-o" and "-m" options if you want to
   have multiple Open MPI versions installed simultaneously in the
   same environment.  Requires use of option "-b".

-o
   With this option the install path of the binary RPM will be changed
   to /opt/_NAME_/_VERSION_.  Requires use of option "-b".

-m
   This option causes the RPM to also install modulefiles to the
   location specified in the specfile.  Requires use of option "-b".

-i
   Also build a debuginfo RPM.  By default, the debuginfo RPM is not
   built.  Requires use of option "-b".

-f lf_location
   Include support for Libfabric.  "lf_location" is the Libfabric
   install path.  Requires use of option "-b".

-t tm_location
   Include support for Torque/PBS Pro.  "tm_location" is the path of
   the Torque/PBS Pro header files.  Requires use of option "-b".

-d
   Build with debugging support.  By default, the RPM is built without
   debugging support.

-c parameter
   Add a custom configure parameter.

-r parameter
   Add a custom RPM build parameter.

-s
   If specified, the script will try to unpack the openmpi.spec
   file from the tarball specified on the command line.  By default,
   the script will look for the specfile in the current directory.

-R directory
   Specifies the top-level RPM build directory.

-h
   Prints script usage information.


Target architecture is currently hard-coded at the beginning
of the buildrpm.sh script.

Alternatively, you can build directly from the openmpi.spec spec file
or from the SRPM.  Many options can be passed to the build process
via rpmbuild's --define option (there are older versions of rpmbuild
that do not seem to handle --define'd values properly in all cases,
but we generally don't care about those old versions of rpmbuild...).
The available options are described in the comments at the beginning
of the spec file in this directory.
contrib/dist/linux/README.md (vendored, new file, 88 lines)
@@ -0,0 +1,88 @@
# Open MPI Linux distribution helpers

Note that you probably want to download the latest release of the SRPM
for any given Open MPI version.  The SRPM release number is the
version after the dash in the SRPM filename.  For example,
`openmpi-1.6.3-2.src.rpm` is the 2nd release of the SRPM for Open MPI
v1.6.3.  Subsequent releases of SRPMs typically contain bug fixes for
the RPM packaging, but not Open MPI itself.

The `buildrpm.sh` script takes a single mandatory argument -- a
filename pointing to an Open MPI tarball (may be either `.gz` or
`.bz2`).  It will create one or more RPMs from this tarball:

1. Source RPM
1. "All in one" RPM, where all of Open MPI is put into a single RPM.
1. "Multiple" RPM, where Open MPI is split into several sub-package
   RPMs:
   * `openmpi-runtime`
   * `openmpi-devel`
   * `openmpi-docs`

The following arguments can be used to affect the script's behaviour
(a combined example is shown after this list).  Please do NOT set the
same settings with both parameters and config vars.

* `-b`:
  If you specify this option, only the all-in-one binary RPM will
  be built.  By default, only the source RPM (SRPM) is built.  Other
  parameters that affect the all-in-one binary RPM will be ignored
  unless this option is specified.

* `-n name`:
  This option will change the name of the produced RPM to "name".
  It is useful together with the `-o` and `-m` options if you want to
  have multiple Open MPI versions installed simultaneously in the same
  environment.  Requires use of option `-b`.

* `-o`:
  With this option the install path of the binary RPM will be changed
  to `/opt/_NAME_/_VERSION_`.  Requires use of option `-b`.

* `-m`:
  This option causes the RPM to also install modulefiles to the
  location specified in the specfile.  Requires use of option `-b`.

* `-i`:
  Also build a debuginfo RPM.  By default, the debuginfo RPM is not
  built.  Requires use of option `-b`.

* `-f lf_location`:
  Include support for Libfabric.  "lf_location" is the Libfabric
  install path.  Requires use of option `-b`.

* `-t tm_location`:
  Include support for Torque/PBS Pro.  "tm_location" is the path of
  the Torque/PBS Pro header files.  Requires use of option `-b`.

* `-d`:
  Build with debugging support.  By default, the RPM is built without
  debugging support.

* `-c parameter`:
  Add a custom configure parameter.

* `-r parameter`:
  Add a custom RPM build parameter.

* `-s`:
  If specified, the script will try to unpack the `openmpi.spec`
  file from the tarball specified on the command line.  By default,
  the script will look for the specfile in the current directory.

* `-R directory`:
  Specifies the top-level RPM build directory.

* `-h`:
  Prints script usage information.

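For illustration only (this exact command is not taken from the
script's own documentation, and the tarball name is a placeholder), a
run that combines several of the options above to build an all-in-one
binary RPM with modulefiles under an `/opt` prefix might look like:

```
shell$ ./buildrpm.sh -b -o -m -n openmpi-custom openmpi-X.Y.Z.tar.bz2
```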

Target architecture is currently hard-coded at the beginning
of the `buildrpm.sh` script.

Alternatively, you can build directly from the `openmpi.spec` spec
file or from the SRPM.  Many options can be passed to the build
process via `rpmbuild`'s `--define` option (there are older versions
of `rpmbuild` that do not seem to handle `--define`'d values properly
in all cases, but we generally don't care about those old versions of
`rpmbuild`...).  The available options are described in the comments
at the beginning of the spec file in this directory.
@@ -61,7 +61,7 @@ created.
 - copy of toss3-hfi-optimized.conf with the following changes:
 - change: comment "Add the interface for out-of-band communication and set
   it up" to "Set up the interface for out-of-band communication"
 - remove: oob_tcp_if_exclude = ib0
 - remove: btl (let Open MPI figure out what best to use for ethernet-
   connected hardware)
 - remove: btl_openib_want_fork_support (no infiniband)
@@ -33,7 +33,7 @@
 # Automake).

 EXTRA_DIST += \
-	examples/README \
+	examples/README.md \
 	examples/Makefile \
 	examples/hello_c.c \
 	examples/hello_mpifh.f \
@@ -1,67 +0,0 @@
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
                        University Research and Technology
                        Corporation.  All rights reserved.
Copyright (c) 2006-2012 Cisco Systems, Inc.  All rights reserved.
Copyright (c) 2007-2009 Sun Microsystems, Inc.  All rights reserved.
Copyright (c) 2010      Oracle and/or its affiliates.  All rights reserved.
Copyright (c) 2013      Mellanox Technologies, Inc.  All rights reserved.

$COPYRIGHT$

The files in this directory are sample MPI applications provided both
as a trivial primer to MPI as well as simple tests to ensure that your
Open MPI installation is working properly.

If you are looking for a comprehensive MPI tutorial, these samples are
not enough.  Excellent MPI tutorials are available here:

    http://www.citutor.org/login.php

Get a free account and login; you can then browse to the list of
available courses.  Look for the ones with "MPI" in the title.

There are two MPI examples in this directory, each using one of six
different MPI interfaces:

- Hello world
    C:                   hello_c.c
    C++:                 hello_cxx.cc
    Fortran mpif.h:      hello_mpifh.f
    Fortran use mpi:     hello_usempi.f90
    Fortran use mpi_f08: hello_usempif08.f90
    Java:                Hello.java
    C shmem.h:           hello_oshmem_c.c
    Fortran shmem.fh:    hello_oshmemfh.f90

- Send a trivial message around in a ring
    C:                   ring_c.c
    C++:                 ring_cxx.cc
    Fortran mpif.h:      ring_mpifh.f
    Fortran use mpi:     ring_usempi.f90
    Fortran use mpi_f08: ring_usempif08.f90
    Java:                Ring.java
    C shmem.h:           ring_oshmem_c.c
    Fortran shmem.fh:    ring_oshmemfh.f90

Additionally, there's one further example application, but this one
only uses the MPI C bindings:

- Test the connectivity between all processes
    C:                   connectivity_c.c

The Makefile in this directory will build as many of the examples as
you have language support for (e.g., if you do not have the Fortran
"use mpi" bindings compiled as part of Open MPI, then those examples
will be skipped).

The Makefile assumes that the wrapper compilers mpicc, mpic++, and
mpifort are in your path.

Although the Makefile is tailored for Open MPI (e.g., it checks the
"ompi_info" command to see if you have support for C++, mpif.h, "use
mpi", and "use mpi_f08" F90), all of the example programs are pure
MPI, and therefore not specific to Open MPI.  Hence, you can use a
different MPI implementation to compile and run these programs if you
wish.

Make today an Open MPI day!
examples/README.md (new file, 66 lines)
@@ -0,0 +1,66 @@
The files in this directory are sample MPI applications provided both
as a trivial primer to MPI as well as simple tests to ensure that your
Open MPI installation is working properly.

If you are looking for a comprehensive MPI tutorial, these samples are
not enough.  [Excellent MPI tutorials are available
here](http://www.citutor.org/login.php).

Get a free account and login; you can then browse to the list of
available courses.  Look for the ones with "MPI" in the title.

There are two MPI examples in this directory, each using one of six
different MPI interfaces:

## Hello world

The MPI version of the canonical "hello world" program:

* C: `hello_c.c`
* C++: `hello_cxx.cc`
* Fortran mpif.h: `hello_mpifh.f`
* Fortran use mpi: `hello_usempi.f90`
* Fortran use mpi_f08: `hello_usempif08.f90`
* Java: `Hello.java`
* C shmem.h: `hello_oshmem_c.c`
* Fortran shmem.fh: `hello_oshmemfh.f90`

## Ring

Send a trivial message around in a ring:

* C: `ring_c.c`
* C++: `ring_cxx.cc`
* Fortran mpif.h: `ring_mpifh.f`
* Fortran use mpi: `ring_usempi.f90`
* Fortran use mpi_f08: `ring_usempif08.f90`
* Java: `Ring.java`
* C shmem.h: `ring_oshmem_c.c`
* Fortran shmem.fh: `ring_oshmemfh.f90`

## Connectivity Test

Additionally, there's one further example application, but this one
only uses the MPI C bindings to test the connectivity between all
processes:

* C: `connectivity_c.c`

## Makefile

The `Makefile` in this directory will build as many of the examples as
you have language support for (e.g., if you do not have the Fortran
`use mpi` bindings compiled as part of Open MPI, then those examples
will be skipped).

The `Makefile` assumes that the wrapper compilers `mpicc`, `mpic++`,
and `mpifort` are in your path.

Although the `Makefile` is tailored for Open MPI (e.g., it checks the
|
||||||
|
`ompi_info` command to see if you have support for `mpif.h`, the `mpi`
|
||||||
|
module, and the `use mpi_f08` module), all of the example programs are
|
||||||
|
pure MPI, and therefore not specific to Open MPI. Hence, you can use
|
||||||
|
a different MPI implementation to compile and run these programs if
|
||||||
|
you wish.
|
||||||
|
|
||||||
|
Make today an Open MPI day!
|
ompi/contrib/README.md (new file, 19 lines)
@@ -0,0 +1,19 @@
This is the OMPI contrib system.  It is (far) less functional and
flexible than the OMPI MCA framework/component system.

Each contrib package must have a `configure.m4`.  It may optionally
also have an `autogen.subdirs` file.

If it has a `configure.m4` file, it must specify its own relevant
files to `AC_CONFIG_FILES` to create during `AC_OUTPUT` -- just like
MCA components (at a minimum, usually its own `Makefile`).  The
`configure.m4` file will be slurped up into the main `configure`
script, just like other MCA components.  Note that there is currently
no "no configure" option for contrib packages -- you *must* have a
`configure.m4` (even if all it does is call `$1`).  Feel free to fix
this situation if you want -- it probably won't be too difficult to
extend `autogen.pl` to support this scenario, similar to how it is
done for MCA components.  :smile:

If it has an `autogen.subdirs` file, then it needs to be a
subdirectory that is autogen-able.
@@ -1,19 +0,0 @@
This is the OMPI contrib system.  It is (far) less functional and
flexible than the OMPI MCA framework/component system.

Each contrib package must have a configure.m4.  It may optionally also
have an autogen.subdirs file.

If it has a configure.m4 file, it must specify its own relevant files
to AC_CONFIG_FILES to create during AC_OUTPUT -- just like MCA
components (at a minimum, usually its own Makefile).  The configure.m4
file will be slurped up into the main configure script, just like
other MCA components.  Note that there is currently no "no configure"
option for contrib packages -- you *must* have a configure.m4 (even if
all it does is call $1).  Feel free to fix this situation if you want
-- it probably won't be too difficult to extend autogen.pl to support
this scenario, similar to how it is done for MCA components.
:-)

If it has an autogen.subdirs file, then it needs to be a subdirectory
that is autogen-able.
@@ -13,7 +13,7 @@
 # $HEADER$
 #

-EXTRA_DIST = profile2mat.pl aggregate_profile.pl
+EXTRA_DIST = profile2mat.pl aggregate_profile.pl README.md

 sources = common_monitoring.c common_monitoring_coll.c
 headers = common_monitoring.h common_monitoring_coll.h
@@ -1,181 +0,0 @@

Copyright (c) 2013-2015 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2013-2015 Inria.  All rights reserved.
$COPYRIGHT$

Additional copyrights may follow

$HEADER$

===========================================================================

Low level communication monitoring interface in Open MPI

Introduction
------------
This interface traces and monitors all messages sent by MPI before
they go to the communication channels.  At that level all
communications are point-to-point communications: collectives are
already decomposed into send and receive calls.

The monitoring is stored internally by each process and output on
stderr at the end of the application (during MPI_Finalize()).


Enabling the monitoring
-----------------------
To enable the monitoring, add --mca pml_monitoring_enable x to the
mpirun command line.
If x = 1 it monitors internal and external tags indifferently and
aggregates everything.
If x = 2 it monitors internal tags and external tags separately.
If x = 0 the monitoring is disabled.
Other values of x are not supported.

Internal tags are tags < 0.  They are used to tag sends and receives
coming from collective operations or from protocol communications.

External tags are tags >= 0.  They are used by the application in
point-to-point communication.

Therefore, distinguishing external and internal tags helps to
distinguish between point-to-point and other communication (mainly
collectives).

Output format
-------------
The output of the monitoring looks like (with --mca pml_monitoring_enable 2):
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent

Where:
- the first column distinguishes internal (I) and external (E) tags.
- the second column is the sender rank
- the third column is the receiver rank
- the fourth column is the number of bytes sent
- the last column is the number of messages.

In this example process 0 has sent 27 messages to process 1 using
point-to-point calls for 108 bytes, and 30 messages with collectives
and protocol-related communication for 1012 bytes to process 1.

If the monitoring was called with --mca pml_monitoring_enable 1,
everything is aggregated under the internal tags.  With the above
example, you have:
I 0 1 1120 bytes 57 msgs sent
I 0 2 23052 bytes 61 msgs sent
I 1 0 860 bytes 24 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 2760 bytes 108 msgs sent
I 2 0 22804 bytes 49 msgs sent
I 2 3 964 bytes 50 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 2508 bytes 95 msgs sent
I 3 2 860 bytes 24 msgs sent

Monitoring phases
-----------------
If one wants to monitor phases of the application, it is possible to
flush the monitoring at the application level.  In this case all the
monitoring since the last flush is stored by every process in a file.

An example of how to flush such monitoring is given in
test/monitoring/monitoring_test.c.

Moreover, all the different flushed phases are aggregated at runtime
and output at the end of the application as described above.

Example
-------
A working example is given in test/monitoring/monitoring_test.c.
It features MPI_COMM_WORLD monitoring, sub-communicator monitoring,
collective and point-to-point communication monitoring, and phase
monitoring.

To compile:
> make monitoring_test

Helper scripts
--------------
Two perl scripts are provided in test/monitoring:

- aggregate_profile.pl aggregates monitoring phases of different
  processes.  This script aggregates the profiles generated by the
  flush_monitoring function.
  The files need to be in the given format: name_<phase_id>_<process_id>
  They are then aggregated by phases.
  If one needs the profile of all the phases, he can concatenate the
  different files, or use the output of the monitoring system done at
  MPI_Finalize.
  In the example it should be called as:
  ./aggregate_profile.pl prof/phase to generate
     prof/phase_1.prof
     prof/phase_2.prof

- profile2mat.pl transforms the monitoring output into a communication
  matrix.  It takes a profile file and aggregates all the recorded
  communicators into matrices.  It generates matrices for the number
  of messages (msg), for the total bytes transmitted (size), and for
  the average number of bytes per message (avg).

  The output matrix is symmetric.

Do not forget to set the execute permission on these scripts.

For instance, the provided examples store phases output in ./prof

If you type:
> mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
you should have the following output:
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent

You can parse the phases with:
> ./aggregate_profile.pl prof/phase
Building prof/phase_1.prof
Building prof/phase_2.prof

And you can build the different communication matrices of phase 1 with:
> ./profile2mat.pl prof/phase_1.prof
prof/phase_1.prof -> all
prof/phase_1_size_all.mat
prof/phase_1_msg_all.mat
prof/phase_1_avg_all.mat

prof/phase_1.prof -> external
prof/phase_1_size_external.mat
prof/phase_1_msg_external.mat
prof/phase_1_avg_external.mat

prof/phase_1.prof -> internal
prof/phase_1_size_internal.mat
prof/phase_1_msg_internal.mat
prof/phase_1_avg_internal.mat

Credit
------
Designed by George Bosilca <bosilca@icl.utk.edu> and
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>
ompi/mca/common/monitoring/README.md (new file, 209 lines)
@@ -0,0 +1,209 @@
# Open MPI common monitoring module

Copyright (c) 2013-2015 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2013-2015 Inria.  All rights reserved.

Low level communication monitoring interface in Open MPI

## Introduction

This interface traces and monitors all messages sent by MPI before
they go to the communication channels.  At that level all
communications are point-to-point communications: collectives are
already decomposed into send and receive calls.

The monitoring is stored internally by each process and output on
stderr at the end of the application (during `MPI_Finalize()`).


## Enabling the monitoring

To enable the monitoring, add `--mca pml_monitoring_enable x` to the
`mpirun` command line (a complete command line is shown after this
list):

* If x = 1 it monitors internal and external tags indifferently and
  aggregates everything.
* If x = 2 it monitors internal tags and external tags separately.
* If x = 0 the monitoring is disabled.
* Other values of x are not supported.

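For example, a run of a hypothetical application `./my_app` (the name
is a placeholder) with aggregated monitoring would be launched as:

```
shell$ mpirun -np 4 --mca pml_monitoring_enable 1 ./my_app
```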
Internal tags are tags < 0.  They are used to tag sends and receives
coming from collective operations or from protocol communications.

External tags are tags >= 0.  They are used by the application in
point-to-point communication.

Therefore, distinguishing external and internal tags helps to
distinguish between point-to-point and other communication (mainly
collectives).

## Output format

The output of the monitoring looks like (with `--mca
pml_monitoring_enable 2`):

```
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
```

Where:

1. the first column distinguishes internal (I) and external (E) tags.
1. the second column is the sender rank
1. the third column is the receiver rank
1. the fourth column is the number of bytes sent
1. the last column is the number of messages.

In this example process 0 has sent 27 messages to process 1 using
point-to-point calls for 108 bytes, and 30 messages with collectives
and protocol-related communication for 1012 bytes to process 1.

If the monitoring was called with `--mca pml_monitoring_enable 1`,
everything is aggregated under the internal tags.  With the above
example, you have:

```
I 0 1 1120 bytes 57 msgs sent
I 0 2 23052 bytes 61 msgs sent
I 1 0 860 bytes 24 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 2760 bytes 108 msgs sent
I 2 0 22804 bytes 49 msgs sent
I 2 3 964 bytes 50 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 2508 bytes 95 msgs sent
I 3 2 860 bytes 24 msgs sent
```

## Monitoring phases

If one wants to monitor phases of the application, it is possible to
flush the monitoring at the application level.  In this case all the
monitoring since the last flush is stored by every process in a file.

An example of how to flush such monitoring is given in
`test/monitoring/monitoring_test.c`.

Moreover, all the different flushed phases are aggregated at runtime
and output at the end of the application as described above.

## Example

A working example is given in `test/monitoring/monitoring_test.c`.  It
features `MPI_COMM_WORLD` monitoring, sub-communicator monitoring,
collective and point-to-point communication monitoring, and phase
monitoring.

To compile:

```
shell$ make monitoring_test
```

## Helper scripts

Two perl scripts are provided in test/monitoring:

1. `aggregate_profile.pl` is for aggregating monitoring phases of
   different processes.  This script aggregates the profiles generated
   by the `flush_monitoring` function.

   The files need to be in the given format: `name_<phase_id>_<process_id>`
   They are then aggregated by phases.
   If one needs the profile of all the phases, he can concatenate the
   different files, or use the output of the monitoring system done at
   `MPI_Finalize`.  In the example it should be called as:
   ```
   ./aggregate_profile.pl prof/phase to generate
      prof/phase_1.prof
      prof/phase_2.prof
   ```

1. `profile2mat.pl` is for transforming the monitoring output into a
   communication matrix.  It takes a profile file and aggregates all
   the recorded communicators into matrices.  It generates matrices
   for the number of messages (msg), for the total bytes transmitted
   (size), and for the average number of bytes per message (avg).

   The output matrix is symmetric.

For instance, the provided examples store phases output in `./prof`:

```
shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
```

This should produce the following output:

```
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
```

You can then parse the phases with:

```
shell$ ./aggregate_profile.pl prof/phase
Building prof/phase_1.prof
Building prof/phase_2.prof
```

And you can build the different communication matrices of phase 1
with:

```
shell$ ./profile2mat.pl prof/phase_1.prof
prof/phase_1.prof -> all
prof/phase_1_size_all.mat
prof/phase_1_msg_all.mat
prof/phase_1_avg_all.mat

prof/phase_1.prof -> external
prof/phase_1_size_external.mat
prof/phase_1_msg_external.mat
prof/phase_1_avg_external.mat

prof/phase_1.prof -> internal
prof/phase_1_size_internal.mat
prof/phase_1_msg_internal.mat
prof/phase_1_avg_internal.mat
```

## Authors

Designed by George Bosilca <bosilca@icl.utk.edu> and
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>
@@ -1,340 +0,0 @@
OFI MTL:
--------
The OFI MTL supports Libfabric (a.k.a. Open Fabrics Interfaces OFI,
https://ofiwg.github.io/libfabric/) tagged APIs (fi_tagged(3)).  At
initialization time, the MTL queries libfabric for providers
supporting tag matching (fi_getinfo(3)).  Libfabric will return a list
of providers that satisfy the requested capabilities, having the most
performant one at the top of the list.  The user may modify the OFI
provider selection with the MCA parameters mtl_ofi_provider_include or
mtl_ofi_provider_exclude.

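As an illustration (the provider name is just an example and depends
on what your libfabric installation actually offers), provider
selection can be restricted on the command line:

    mpirun --mca mtl ofi --mca mtl_ofi_provider_include psm2 ...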
PROGRESS:
|
|
||||||
---------
|
|
||||||
The MTL registers a progress function to opal_progress. There is currently
|
|
||||||
no support for asynchronous progress. The progress function reads multiple events
|
|
||||||
from the OFI provider Completion Queue (CQ) per iteration (defaults to 100, can be
|
|
||||||
modified with the mca mtl_ofi_progress_event_cnt) and iterates until the
|
|
||||||
completion queue is drained.
|
|
||||||
|
|
||||||
COMPLETIONS:
|
|
||||||
------------
|
|
||||||
Each operation uses a request type ompi_mtl_ofi_request_t which includes a reference
|
|
||||||
to an operation specific completion callback, an MPI request, and a context. The
|
|
||||||
context (fi_context) is used to map completion events with MPI_requests when reading the
|
|
||||||
CQ.
|
|
||||||
|
|
||||||
OFI TAG:
|
|
||||||
--------
|
|
||||||
MPI needs to send 96 bits of information per message (32 bits communicator id,
|
|
||||||
32 bits source rank, 32 bits MPI tag) but OFI only offers 64 bits tags. In
|
|
||||||
addition, the OFI MTL uses 2 bits of the OFI tag for the synchronous send protocol.
|
|
||||||
Therefore, there are only 62 bits available in the OFI tag for message usage. The
|
|
||||||
OFI MTL offers the mtl_ofi_tag_mode mca parameter with 4 modes to address this:
|
|
||||||
|
|
||||||
"auto" (Default):
|
|
||||||
After the OFI provider is selected, a runtime check is performed to assess
|
|
||||||
FI_REMOTE_CQ_DATA and FI_DIRECTED_RECV support (see fi_tagged(3), fi_msg(2)
|
|
||||||
and fi_getinfo(3)). If supported, "ofi_tag_full" is used. If not supported,
|
|
||||||
fall back to "ofi_tag_1".
|
|
||||||
|
|
||||||
"ofi_tag_1":
|
|
||||||
For providers that do not support FI_REMOTE_CQ_DATA, the OFI MTL will
|
|
||||||
trim the fields (Communicator ID, Source Rank, MPI tag) to make them fit the 62
|
|
||||||
bits available bit in the OFI tag. There are two options available with different
|
|
||||||
number of bits for the Communicator ID and MPI tag fields. This tag distribution
|
|
||||||
offers: 12 bits for Communicator ID (max Communicator ID 4,095) subject to
|
|
||||||
provider reserved bits (see mem_tag_format below), 18 bits for Source Rank (max
|
|
||||||
Source Rank 262,143), 32 bits for MPI tag (max MPI tag is INT_MAX).
|
|
||||||
|
|
||||||
"ofi_tag_2":
|
|
||||||
Same as 2 "ofi_tag_1" but offering a different OFI tag distribution for
|
|
||||||
applications that may require a greater number of supported Communicators at the
|
|
||||||
expense of fewer MPI tag bits. This tag distribution offers: 24 bits for
|
|
||||||
Communicator ID (max Communicator ED 16,777,215. See mem_tag_format below), 18
|
|
||||||
bits for Source Rank (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
|
|
||||||
524,287).
|
|
||||||
|
|
||||||
"ofi_tag_full":
|
|
||||||
For executions that cannot accept trimming source rank or MPI tag, this mode sends
|
|
||||||
source rank for each message in the CQ DATA. The Source Rank is made available at
|
|
||||||
the remote process CQ (FI_CQ_FORMAT_TAGGED is used, see fi_cq(3)) at the completion
|
|
||||||
of the matching receive operation. Since the minimum size for FI_REMOTE_CQ_DATA
|
|
||||||
is 32 bits, the Source Rank fits with no limitations. The OFI tag is used for the
|
|
||||||
Communicator id (28 bits, max Communicator ID 268,435,455. See mem_tag_format below),
|
|
||||||
and the MPI tag (max MPI tag is INT_MAX). If this mode is selected by the user
|
|
||||||
and FI_REMOTE_CQ_DATA or FI_DIRECTED_RECV are not supported, the execution will abort.
|
|
||||||
|
|
||||||
mem_tag_format (fi_endpoint(3))
|
|
||||||
Some providers can reserve the higher order bits from the OFI tag for internal purposes.
|
|
||||||
This is signaled in mem_tag_format (see fi_endpoint(3)) by setting higher order bits
|
|
||||||
to zero. In such cases, the OFI MTL will reduce the number of communicator ids supported
|
|
||||||
by reducing the bits available for the communicator ID field in the OFI tag.
|
|
||||||
|
|
||||||
SCALABLE ENDPOINTS:
-------------------

The OFI MTL supports the OFI Scalable Endpoints (SEP) feature as a means to improve
multi-threaded application throughput and message rate. Currently the feature
is designed to utilize multiple TX/RX contexts exposed by the OFI provider in
conjunction with a multi-communicator MPI application model. Therefore, new OFI
contexts are created lazily, as and when communicators are duplicated, instead
of all at once during init time; this approach also favours creating only as
many contexts as needed.

1. Multi-communicator model:
   With this approach, the MPI application is required to first duplicate
   the communicators it wants to use with MPI operations (ideally creating
   as many communicators as the number of threads it wants to use to call
   into MPI). The duplicated communicators are then used by the
   corresponding threads to perform MPI operations. A possible usage
   scenario could be an MPI + OMP application as follows
   (example limited to 2 ranks):

     MPI_Comm dup_comm[n];
     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
     for (i = 0; i < n; i++) {
         MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
     }
     if (rank == 0) {
     #pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
         for (i = 0; i < n ; i++) {
             MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                      1, MSG_TAG, dup_comm[i]);
             MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                      1, MSG_TAG, dup_comm[i], &status);
         }
     } else if (rank == 1) {
     #pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
         for (i = 0; i < n ; i++) {
             MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                      0, MSG_TAG, dup_comm[i], &status);
             MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                      0, MSG_TAG, dup_comm[i]);
         }
     }
2. MCA variables:
   To utilize the feature, the following MCA variables need to be set:

   mtl_ofi_enable_sep:
   This MCA variable needs to be set to enable the use of Scalable Endpoints (SEP)
   feature in the OFI MTL. The underlying provider is also checked to ensure the
   feature is supported. If the provider chosen does not support it, user needs
   to either set this variable to 0 or select a different provider which supports
   the feature.
   For single-threaded applications one OFI context is sufficient, so OFI SEPs
   may not add benefit.
   Note that mtl_ofi_thread_grouping (see below) needs to be enabled to use the
   different OFI SEP contexts. Otherwise, only one context (ctxt 0) will be used.

   Default: 0

   Command-line syntax:
   "-mca mtl_ofi_enable_sep 1"

   mtl_ofi_thread_grouping:
   Turn Thread Grouping feature on. This is needed to use the Multi-communicator
   model explained above. This means that the OFI MTL will use the communicator
   ID to decide the SEP contexts to be used by the thread. In this way, each
   thread will have direct access to different OFI resources. If disabled,
   only context 0 will be used.
   Requires mtl_ofi_enable_sep to be set to 1.

   Default: 0

   It is not recommended to set the MCA variable for:
   - Multi-threaded MPI applications not following multi-communicator approach.
   - Applications that have multiple threads using a single communicator as
     it may degrade performance.

   Command-line syntax:
   "-mca mtl_ofi_thread_grouping 1"

   mtl_ofi_num_ctxts:
   This MCA variable allows user to set the number of OFI SEP contexts the
   application expects to use. For multi-threaded applications using Thread
   Grouping feature, this number should be set to the number of user threads
   that will call into MPI. This variable will only have effect if
   mtl_ofi_enable_sep is set to 1.

   Default: 1

   Command-line syntax:
   "-mca mtl_ofi_num_ctxts N" [ N: number of OFI contexts required by application ]

3. Notes on performance:
   - OFI MTL will create as many TX/RX contexts as set by MCA mtl_ofi_num_ctxts.
     The number of contexts that can be created is also limited by the underlying
     provider as each provider may have different thresholds. Once the threshold
     is exceeded, contexts are used in a round-robin fashion which leads to
     resource sharing among threads. Therefore locks are required to guard
     against race conditions. For performance, it is recommended to have

         Number of threads = Number of communicators = Number of contexts

     For example, when using PSM2 provider, the number of contexts is dictated
     by the Intel Omni-Path HFI1 driver module.

   - OPAL layer allows for multiple threads to enter progress simultaneously. To
     enable this feature, user needs to set MCA variable
     "max_thread_in_progress". When using Thread Grouping feature, it is
     recommended to set this MCA parameter to the number of threads expected to
     call into MPI as it provides performance benefits.

     Command-line syntax:
     "-mca opal_max_thread_in_progress N" [ N: number of threads expected to
     make MPI calls ]

     Default: 1

   - For applications using a single thread with multiple communicators and MCA
     variable "mtl_ofi_thread_grouping" set to 1, the MTL will use multiple
     contexts, but the benefits may be negligible as only one thread is driving
     progress.
SPECIALIZED FUNCTIONS:
----------------------

To improve performance when calling message passing APIs in the OFI mtl
specialized functions are generated at compile time that eliminate all the
if conditionals that can be determined at init and don't need to be
queried again during the critical path. These functions are generated by
perl scripts during make which generate functions and symbols for every
combination of flags for each function.

1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
   To add a new flag to an existing specialized function for handling cases
   where different OFI providers may or may not support a particular feature,
   then you must follow these steps:
   1) Update the "_generic" function in mtl_ofi.h with the new flag and
      the if conditionals to read the new value.
   2) Update the *.pm file corresponding to the function with the new flag in:
      gen_funcs(), gen_*_function(), & gen_*_sym_init()
   3) Update mtl_ofi_opt.h with:
      The new flag as #define NEW_FLAG_TYPES #NUMBER_OF_STATES
      example: #define OFI_CQ_DATA 2 (only has TRUE/FALSE states)
      Update the function's types with:
      #define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]

2. ADDING A NEW FUNCTION FOR SPECIALIZATION:
   To add a new function to be specialized you must follow these steps:
   1) Create a new mtl_ofi_"function_name"_opt.pm based off opt_common/mtl_ofi_opt.pm.template
   2) Add new .pm file to generated_source_modules in Makefile.am
   3) Add .c file to generated_sources in Makefile.am named the same as the corresponding .pm file
   4) Update existing or create function in mtl_ofi.h to _generic with new flags.
   5) Update mtl_ofi_opt.h with:
      a) New function types: #define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES]
      b) Add new function to the struct ompi_mtl_ofi_symtable:
             struct ompi_mtl_ofi_symtable {
                 ...
                 int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
             }
      c) Add new symbol table init function definition:
             void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
   6) Add calls to init the new function in the symbol table and assign the function
      pointer to be used based off the flags in mtl_ofi_component.c:
          ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);
          ompi_mtl_ofi.base.mtl_FUNCTION =
              ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];

3. EXAMPLE SPECIALIZED FILE:
   The code below is an example of what is generated by the specialization
   scripts for use in the OFI mtl. This code specializes the blocking
   send functionality based on FI_REMOTE_CQ_DATA & OFI Scalable Endpoint support
   provided by an OFI Provider. Only one function and symbol is used during
   runtime based on if FI_REMOTE_CQ_DATA is supported and/or if OFI Scalable
   Endpoint support is enabled.

/*
 * Copyright (c) 2013-2018 Intel, Inc. All rights reserved
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "mtl_ofi.h"

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
                              struct ompi_communicator_t *comm,
                              int dest,
                              int tag,
                              struct opal_convertor_t *convertor,
                              mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
                            struct ompi_communicator_t *comm,
                            int dest,
                            int tag,
                            struct opal_convertor_t *convertor,
                            mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
    sym_table->ompi_mtl_ofi_send[false][false]
        = ompi_mtl_ofi_send_false_false;

    sym_table->ompi_mtl_ofi_send[false][true]
        = ompi_mtl_ofi_send_false_true;

    sym_table->ompi_mtl_ofi_send[true][false]
        = ompi_mtl_ofi_send_true_false;

    sym_table->ompi_mtl_ofi_send[true][true]
        = ompi_mtl_ofi_send_true_true;
}

###
368 ompi/mca/mtl/ofi/README.md (new regular file)
@@ -0,0 +1,368 @@
# Open MPI OFI MTL

The OFI MTL supports Libfabric (a.k.a., [Open Fabrics Interfaces
OFI](https://ofiwg.github.io/libfabric/)) tagged APIs
(`fi_tagged(3)`). At initialization time, the MTL queries libfabric
for providers supporting tag matching (`fi_getinfo(3)`). Libfabric
will return a list of providers that satisfy the requested
capabilities, with the most performant one at the top of the list.
The user may modify the OFI provider selection with the MCA parameters
`mtl_ofi_provider_include` or `mtl_ofi_provider_exclude`.

## PROGRESS

The MTL registers a progress function to `opal_progress`. There is
currently no support for asynchronous progress. The progress function
reads multiple events from the OFI provider Completion Queue (CQ) per
iteration (defaults to 100, can be modified with the MCA parameter
`mtl_ofi_progress_event_cnt`) and iterates until the completion queue is
drained.

## COMPLETIONS

Each operation uses a request type `ompi_mtl_ofi_request_t` which
includes a reference to an operation-specific completion callback, an
MPI request, and a context. The context (`fi_context`) is used to map
completion events with `MPI_requests` when reading the CQ.

## OFI TAG

MPI needs to send 96 bits of information per message (32 bits
communicator ID, 32 bits source rank, 32 bits MPI tag) but OFI only
offers 64-bit tags. In addition, the OFI MTL uses 2 bits of the OFI
tag for the synchronous send protocol. Therefore, there are only 62
bits available in the OFI tag for message usage. The OFI MTL offers
the `mtl_ofi_tag_mode` MCA parameter with 4 modes to address this:

* `auto` (Default):
  After the OFI provider is selected, a runtime check is performed to
  assess `FI_REMOTE_CQ_DATA` and `FI_DIRECTED_RECV` support (see
  `fi_tagged(3)`, `fi_msg(2)` and `fi_getinfo(3)`). If supported,
  `ofi_tag_full` is used. If not supported, fall back to `ofi_tag_1`.

* `ofi_tag_1`:
  For providers that do not support `FI_REMOTE_CQ_DATA`, the OFI MTL
  will trim the fields (Communicator ID, Source Rank, MPI tag) to make
  them fit the 62 available bits in the OFI tag. There are two
  options available with different numbers of bits for the Communicator
  ID and MPI tag fields. This tag distribution offers: 12 bits for
  Communicator ID (max Communicator ID 4,095) subject to provider
  reserved bits (see `mem_tag_format` below), 18 bits for Source Rank
  (max Source Rank 262,143), 32 bits for MPI tag (max MPI tag is
  `INT_MAX`).

* `ofi_tag_2`:
  Same as `ofi_tag_1` but offering a different OFI tag distribution
  for applications that may require a greater number of supported
  Communicators at the expense of fewer MPI tag bits. This tag
  distribution offers: 24 bits for Communicator ID (max Communicator
  ID 16,777,215. See `mem_tag_format` below), 18 bits for Source Rank
  (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
  524,287).

* `ofi_tag_full`:
  For executions that cannot accept trimming source rank or MPI tag,
  this mode sends the source rank for each message in the CQ DATA. The
  Source Rank is made available at the remote process CQ
  (`FI_CQ_FORMAT_TAGGED` is used, see `fi_cq(3)`) at the completion of
  the matching receive operation. Since the minimum size for
  `FI_REMOTE_CQ_DATA` is 32 bits, the Source Rank fits with no
  limitations. The OFI tag is used for the Communicator ID (28 bits,
  max Communicator ID 268,435,455. See `mem_tag_format` below), and
  the MPI tag (max MPI tag is `INT_MAX`). If this mode is selected by
  the user and `FI_REMOTE_CQ_DATA` or `FI_DIRECTED_RECV` are not
  supported, the execution will abort.

* `mem_tag_format` (`fi_endpoint(3)`):
  Some providers can reserve the higher order bits from the OFI tag
  for internal purposes. This is signaled in `mem_tag_format` (see
  `fi_endpoint(3)`) by setting higher order bits to zero. In such
  cases, the OFI MTL will reduce the number of communicator IDs
  supported by reducing the bits available for the communicator ID
  field in the OFI tag.
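
As a concrete illustration of the `ofi_tag_1` distribution above, the
sketch below packs the three MPI fields into a 64-bit OFI tag. It is
only for following along with the bit counts in this section: the
shift positions and the helper name `pack_ofi_tag_1` are illustrative
assumptions, not the MTL's actual internal macros.

```c
#include <stdint.h>

/* Illustrative only: 2 bits left for the sync-send protocol, then
 * 12 bits communicator ID, 18 bits source rank, 32 bits MPI tag. */
static inline uint64_t pack_ofi_tag_1(uint32_t cid, uint32_t src, uint32_t mpi_tag)
{
    return (((uint64_t)(cid & 0xFFF))   << 50) |  /* bits 50-61: communicator ID */
           (((uint64_t)(src & 0x3FFFF)) << 32) |  /* bits 32-49: source rank     */
            ((uint64_t)mpi_tag);                  /* bits  0-31: MPI tag         */
}
```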
## SCALABLE ENDPOINTS

The OFI MTL supports the OFI Scalable Endpoints (SEP) feature as a means
to improve multi-threaded application throughput and message
rate. Currently the feature is designed to utilize multiple TX/RX
contexts exposed by the OFI provider in conjunction with a
multi-communicator MPI application model. New OFI contexts are
therefore created lazily, as and when communicators are duplicated,
instead of all at once during init time; this approach also favours
creating only as many contexts as needed.

1. Multi-communicator model:
   With this approach, the MPI application is required to first duplicate
   the communicators it wants to use with MPI operations (ideally creating
   as many communicators as the number of threads it wants to use to call
   into MPI). The duplicated communicators are then used by the
   corresponding threads to perform MPI operations. A possible usage
   scenario could be an MPI + OMP application as follows
   (example limited to 2 ranks):

   ```c
   MPI_Comm dup_comm[n];
   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
   for (i = 0; i < n; i++) {
       MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
   }
   if (rank == 0) {
   #pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
       for (i = 0; i < n ; i++) {
           MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                    1, MSG_TAG, dup_comm[i]);
           MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                    1, MSG_TAG, dup_comm[i], &status);
       }
   } else if (rank == 1) {
   #pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
       for (i = 0; i < n ; i++) {
           MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                    0, MSG_TAG, dup_comm[i], &status);
           MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                    0, MSG_TAG, dup_comm[i]);
       }
   }
   ```
2. MCA variables:
   To utilize the feature, the following MCA variables need to be set:

   * `mtl_ofi_enable_sep`:
     This MCA variable needs to be set to enable the use of Scalable
     Endpoints (SEP) feature in the OFI MTL. The underlying provider
     is also checked to ensure the feature is supported. If the
     provider chosen does not support it, user needs to either set
     this variable to 0 or select a different provider which supports
     the feature. For single-threaded applications one OFI context is
     sufficient, so OFI SEPs may not add benefit. Note that
     `mtl_ofi_thread_grouping` (see below) needs to be enabled to use
     the different OFI SEP contexts. Otherwise, only one context (ctxt
     0) will be used.

     Default: 0

     Command-line syntax: `--mca mtl_ofi_enable_sep 1`

   * `mtl_ofi_thread_grouping`:
     Turn Thread Grouping feature on. This is needed to use the
     Multi-communicator model explained above. This means that the OFI
     MTL will use the communicator ID to decide the SEP contexts to be
     used by the thread. In this way, each thread will have direct
     access to different OFI resources. If disabled, only context 0
     will be used. Requires `mtl_ofi_enable_sep` to be set to 1.

     Default: 0

     It is not recommended to set the MCA variable for:

     * Multi-threaded MPI applications not following multi-communicator
       approach.
     * Applications that have multiple threads using a single
       communicator as it may degrade performance.

     Command-line syntax: `--mca mtl_ofi_thread_grouping 1`

   * `mtl_ofi_num_ctxts`:
     This MCA variable allows user to set the number of OFI SEP
     contexts the application expects to use. For multi-threaded
     applications using Thread Grouping feature, this number should be
     set to the number of user threads that will call into MPI. This
     variable will only have effect if `mtl_ofi_enable_sep` is set to 1.

     Default: 1

     Command-line syntax: `--mca mtl_ofi_num_ctxts N` (`N`: number of
     OFI contexts required by application)

3. Notes on performance:
   * OFI MTL will create as many TX/RX contexts as set by MCA
     `mtl_ofi_num_ctxts`. The number of contexts that can be created is
     also limited by the underlying provider as each provider may have
     different thresholds. Once the threshold is exceeded, contexts are
     used in a round-robin fashion which leads to resource sharing
     among threads. Therefore locks are required to guard against race
     conditions. For performance, it is recommended to have

         Number of threads = Number of communicators = Number of contexts

     For example, when using PSM2 provider, the number of contexts is
     dictated by the Intel Omni-Path HFI1 driver module.

   * OPAL layer allows for multiple threads to enter progress
     simultaneously. To enable this feature, user needs to set MCA
     variable `max_thread_in_progress`. When using Thread Grouping
     feature, it is recommended to set this MCA parameter to the number
     of threads expected to call into MPI as it provides performance
     benefits.

     Default: 1

     Command-line syntax: `--mca opal_max_thread_in_progress N` (`N`:
     number of threads expected to make MPI calls)

   * For applications using a single thread with multiple communicators
     and MCA variable `mtl_ofi_thread_grouping` set to 1, the MTL will
     use multiple contexts, but the benefits may be negligible as only
     one thread is driving progress.
## SPECIALIZED FUNCTIONS

To improve performance when calling message passing APIs in the OFI
MTL, specialized functions are generated at compile time that eliminate
all the if conditionals that can be determined at init and don't need
to be queried again during the critical path. These functions are
generated by perl scripts during make, which generate functions and
symbols for every combination of flags for each function.

1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
   To add a new flag to an existing specialized function for handling
   cases where different OFI providers may or may not support a
   particular feature, you must follow these steps:

   1. Update the `_generic` function in `mtl_ofi.h` with the new flag
      and the if conditionals to read the new value.
   1. Update the `*.pm` file corresponding to the function with the
      new flag in: `gen_funcs()`, `gen_*_function()`, &
      `gen_*_sym_init()`
   1. Update `mtl_ofi_opt.h` with:
      * The new flag as `#define NEW_FLAG_TYPES #NUMBER_OF_STATES`.
        Example: `#define OFI_CQ_DATA 2` (only has TRUE/FALSE states)
      * Update the function's types with:
        `#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]`

1. ADDING A NEW FUNCTION FOR SPECIALIZATION:
   To add a new function to be specialized, you must follow these
   steps:

   1. Create a new `mtl_ofi_<function_name>_opt.pm` based off
      `opt_common/mtl_ofi_opt.pm.template`
   1. Add the new `.pm` file to `generated_source_modules` in `Makefile.am`
   1. Add a `.c` file to `generated_sources` in `Makefile.am` named the
      same as the corresponding `.pm` file
   1. Update the existing (or create a new) `_generic` function in
      `mtl_ofi.h` with the new flags.
   1. Update `mtl_ofi_opt.h` with:
      1. New function types: `#define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES]`
      1. Add the new function to the `struct ompi_mtl_ofi_symtable`:

         ```c
         struct ompi_mtl_ofi_symtable {
             ...
             int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
         }
         ```
      1. Add the new symbol table init function definition:

         ```c
         void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
         ```
   1. Add calls to init the new function in the symbol table and
      assign the function pointer to be used based off the flags in
      `mtl_ofi_component.c`:
      * `ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);`
      * `ompi_mtl_ofi.base.mtl_FUNCTION = ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];`
## EXAMPLE SPECIALIZED FILE

The code below is an example of what is generated by the
specialization scripts for use in the OFI mtl. This code specializes
the blocking send functionality based on `FI_REMOTE_CQ_DATA` & OFI
Scalable Endpoint support provided by an OFI Provider. Only one
function and symbol is used during runtime based on if
`FI_REMOTE_CQ_DATA` is supported and/or if OFI Scalable Endpoint support
is enabled.

```c
/*
 * Copyright (c) 2013-2018 Intel, Inc. All rights reserved
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "mtl_ofi.h"

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
                              struct ompi_communicator_t *comm,
                              int dest,
                              int tag,
                              struct opal_convertor_t *convertor,
                              mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
                            struct ompi_communicator_t *comm,
                            int dest,
                            int tag,
                            struct opal_convertor_t *convertor,
                            mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
    sym_table->ompi_mtl_ofi_send[false][false]
        = ompi_mtl_ofi_send_false_false;

    sym_table->ompi_mtl_ofi_send[false][true]
        = ompi_mtl_ofi_send_false_true;

    sym_table->ompi_mtl_ofi_send[true][false]
        = ompi_mtl_ofi_send_true_false;

    sym_table->ompi_mtl_ofi_send[true][true]
        = ompi_mtl_ofi_send_true_true;
}
```
@@ -1,5 +1,3 @@
This is a simple example op component meant to be a template /
springboard for people to write their own op components. There are
many different ways to write components and modules; this is but one
@@ -13,28 +11,26 @@ same end effect. Feel free to customize / simplify / strip out what
you don't need from this example.

This example component supports a fictitious set of hardware that
provides acceleration for the `MPI_MAX` and `MPI_BXOR` `MPI_Ops`. The
fictitious hardware has multiple versions, too: some versions only
support single precision floating point types for `MAX` and single
precision integer types for `BXOR`, whereas later versions support
both single and double precision floating point types for `MAX` and
both single and double precision integer types for `BXOR`. Hence,
this example walks through setting up particular `MPI_Op` function
pointers based on:

1. hardware availability (e.g., does the node where this MPI process
   is running have the relevant hardware/resources?)
1. `MPI_Op` (e.g., in this example, only `MPI_MAX` and `MPI_BXOR` are
   supported)
1. datatype (e.g., single/double precision floating point for `MAX`
   and single/double precision integer for `BXOR`)

Additionally, there are other considerations that should be factored
in at run time. Hardware accelerators are great, but they do induce
overhead -- for example, some accelerator hardware requires registered
memory. So even if a particular `MPI_Op` and datatype are supported, it
may not be worthwhile to use the hardware unless the amount of data to
be processed is "big enough" (meaning that the cost of the
registration and/or copy-in/copy-out is ameliorated) or the memory to
@@ -47,57 +43,65 @@ failover strategy is well-supported by the op framework; during the
query process, a component can "stack" itself similar to how POSIX
signal handlers can be stacked. Specifically, op components can cache
other implementations of operation functions for use in the case of
failover. The `MAX` and `BXOR` module implementations show one way of
using this method.
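
To make the caching/failover idea concrete, here is a minimal,
self-contained sketch. Every name in it (the struct, the
function-pointer type, `example_hw_available()`) is invented for
illustration and is not the real op framework API; the actual types and
registration flow live in the component/module sources themselves.

```c
/*
 * Hypothetical sketch of the "stack and fail over" idea described above.
 * All names here are invented for illustration; they are NOT the real
 * op framework API -- consult the example component/module sources for
 * the actual types and registration flow.
 */
typedef void (example_max_fn_t)(const float *in, float *inout, int count);

/* Module state: the implementation cached at query time to fall back on. */
struct example_float_max_module {
    example_max_fn_t *fallback_max;
};

/* Stand-in for "is the (fictitious) accelerator usable right now?" */
static int example_hw_available(void) { return 0; }

static void example_hw_float_max(struct example_float_max_module *mod,
                                 const float *in, float *inout, int count)
{
    if (example_hw_available()) {
        /* ...drive the fictitious accelerator here... */
    } else {
        /* Fail over to the implementation cached during the query phase. */
        mod->fallback_max(in, inout, count);
    }
}
```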
Here's a listing of the files in the example component and what they
do:

- `configure.m4`: Tests that get slurped into OMPI's top-level
  `configure` script to determine whether this component will be built
  or not.
- `Makefile.am`: Automake makefile that builds this component.
- `op_example_component.c`: The main "component" source file.
- `op_example_module.c`: The main "module" source file.
- `op_example.h`: information that is shared between the `.c` files.
- `.ompi_ignore`: the presence of this file causes OMPI's `autogen.pl`
  to skip this component in the configure/build/install process (see
  below).

To use this example as a template for your component (assume your new
component is named `foo`):

```
shell$ cd (top_ompi_dir)/ompi/mca/op
shell$ cp -r example foo
shell$ cd foo
```

Remove the `.ompi_ignore` file (which makes the component "visible" to
all developers) *OR* add an `.ompi_unignore` file with one username per
line (as reported by `whoami`). OMPI's `autogen.pl` will skip any
component with a `.ompi_ignore` file *unless* there is also an
`.ompi_unignore` file containing your user ID in it. This is a handy
mechanism to have a component in the tree but have it not built / used
by most other developers:

```
shell$ rm .ompi_ignore
*OR*
shell$ whoami > .ompi_unignore
```

Now rename any file that contains `example` in the filename to have
`foo`, instead. For example:

```
shell$ mv op_example_component.c op_foo_component.c
#...etc.
```

Now edit all the files and `s/example/foo/gi`. Specifically, replace
all instances of `example` with `foo` in all function names, type
names, header `#defines`, strings, and global variables.

Now your component should be fully functional (although entirely
renamed as `foo` instead of `example`). You can go to the top-level
OMPI directory and run `autogen.pl` (which will find your component
and add it to the configure/build process) and then `configure ...`
and `make ...` as normal.

```
shell$ cd (top_ompi_dir)
shell$ ./autogen.pl
# ...lots of output...
@@ -107,19 +111,21 @@ shell$ make -j 4 all
# ...lots of output...
shell$ make install
# ...lots of output...
```

After you have installed Open MPI, running `ompi_info` should show
your `foo` component in the output.

```
shell$ ompi_info | grep op:
     MCA op: example (MCA v2.0, API v1.0, Component v1.4)
     MCA op: foo (MCA v2.0, API v1.0, Component v1.4)
shell$
```

If you do not see your `foo` component, check the above steps, and
check the output of `autogen.pl`, `configure`, and `make` to ensure
that `foo` was found, configured, and built successfully.

Once `ompi_info` sees your component, start editing the `foo`
component files in a meaningful way.
@@ -10,3 +10,5 @@
#

SUBDIRS = java c

EXTRA_DIST = README.md
@@ -1,26 +1,27 @@
# Open MPI Java bindings

Note about the Open MPI Java bindings

The Java bindings in this directory are not part of the MPI
specification, as noted in the README.JAVA.md file in the root
directory. That file also contains some information regarding the
installation and use of the Java bindings. Further details can be
found in the paper [1].

We originally took the code from the mpiJava project [2] as a starting
point for our developments, but we have pretty much rewritten 100% of
it. The original copyrights and license terms of mpiJava are listed
below.

1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
   implementation of Java bindings in Open MPI". Parallel Comput.
   59: 1-20 (2016).
1. M. Baker et al. "mpiJava: An object-oriented Java interface to
   MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
   pp. 748-762, Springer (1999).

## Original citation

```
mpiJava - A Java Interface to MPI
---------------------------------
Copyright 2003
@@ -39,6 +40,7 @@ original copyrights and license terms of mpiJava are listed below.
(Bugfixes/Additions, CMake based configure/build)
Blasius Czink
HLRS, University of Stuttgart
```

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -1,4 +1,5 @@
# Symbol conventions for Open MPI extensions

Last updated: January 2015

This README provides some rule-of-thumb guidance for how to name
@@ -15,26 +16,22 @@ Generally speaking, there are usually three kinds of extensions:
3. Functionality that is strongly expected to be in an upcoming
   version of the MPI specification.

## Case 1

The `OMPI_Paffinity_str()` extension is a good example of this type:
it is solely intended to be for Open MPI. It will likely never be
pushed to other MPI implementations, and it will likely never be
pushed to the MPI Forum.

It's Open MPI-specific functionality, through and through.

Public symbols of this type of functionality should be named with an
`OMPI_` prefix to emphasize its Open MPI-specific nature. To be
clear: the `OMPI_` prefix clearly identifies parts of user code that
are relying on Open MPI (and likely need to be surrounded with
`#if OPEN_MPI` blocks, etc.).
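
A minimal sketch of what such a guarded block looks like in user code
follows. The function name and printed strings are made up for
illustration, and the actual `OMPI_`-prefixed call is elided rather
than invented:

```c
/*
 * Minimal sketch of guarding Open MPI-specific code in a user
 * application.  Open MPI's <mpi.h> defines OPEN_MPI, and extension
 * prototypes are made available through <mpi-ext.h>.  The function
 * below is illustrative only.
 */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

void report_open_mpi_specific(void)
{
#if defined(OPEN_MPI) && OPEN_MPI
    /* Safe to call OMPI_-prefixed extension APIs inside this block. */
    printf("Built against Open MPI; OMPI_ extensions may be used.\n");
#else
    printf("Not Open MPI; taking the portable code path.\n");
#endif
}
```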
## Case 2

The MPI extensions mechanism in Open MPI was designed to help MPI
Forum members prototype new functionality that is intended for the
@@ -43,23 +40,21 @@ functionality is not only to be included in the MPI spec, but possibly
also be included in another MPI implementation.

As such, it seems reasonable to prefix public symbols in this type of
functionality with `MPIX_`. This commonly-used prefix allows the same
symbols to be available in multiple MPI implementations, and therefore
allows user code to easily check for it. E.g., user apps can check
for the presence of `MPIX_Foo` to know if both Open MPI and Other MPI
support the proposed `MPIX_Foo` functionality.

Of course, when using the `MPIX_` namespace, there is the possibility of
symbol name collisions. E.g., what if Open MPI has an `MPIX_Foo` and
Other MPI has a *different* `MPIX_Foo`?

While we technically can't prevent such collisions from happening, we
encourage extension authors to avoid such symbol clashes whenever
possible.

## Case 3

It is well-known that the MPI specification (intentionally) takes a
long time to publish. MPI implementers can typically know, with a
@@ -72,13 +67,13 @@ functionality early (i.e., before the actual publication of the
corresponding MPI specification document).

Case in point: the non-blocking collective operations that were
included in MPI-3.0 (e.g., `MPI_Ibarrier()`). It was known for a year
or two before MPI-3.0 was published that these functions would be
included in MPI-3.0.

There is a continual debate among the developer community: when
implementing such functionality, should the symbols be in the MPIX_
namespace or in the `MPI_` namespace? On one hand, the symbols are not
yet officially standardized -- *they could change* before publication.
On the other hand, developers who participate in the Forum typically
have a good sense for whether symbols are going to change before
@@ -89,35 +84,31 @@ before the MPI specification is published. ...and so on.
After much debate: for functionality that has a high degree of
confidence that it will be included in an upcoming spec (e.g., it has
passed at least one vote in the MPI Forum), our conclusion is that it
is OK to use the `MPI_` namespace.

Case in point: Open MPI released non-blocking collectives with the
`MPI_` prefix (not the `MPIX_` prefix) before the MPI-3.0
specification officially standardized these functions.

The rationale was threefold:

1. Let users use the functionality as soon as possible.
1. If OMPI initially creates `MPIX_Foo`, but eventually renames it to
   `MPI_Foo` when the MPI specification is published, then users will
   have to modify their codes to match. This is an artificial change
   inserted just to be "pure" to the MPI spec (i.e., it's a "lawyer's
   answer"). But since the `MPIX_Foo` -> `MPI_Foo` change is
   inevitable, it just ends up annoying users.
1. Once OMPI introduces `MPIX_` symbols, if we want to *not* annoy
   users, we'll likely have weak symbols / aliased versions of both
   `MPIX_Foo` and `MPI_Foo` once the Foo functionality is included in
   a published MPI specification. However, when can we delete the
   `MPIX_Foo` symbol? It becomes a continuing annoyance of backwards
   compatibility that we have to keep carrying forward.

For all these reasons, we believe that it's better to put
expected-upcoming official MPI functionality in the `MPI_` namespace,
not the `MPIX_` namespace.

All that being said, these are rules of thumb. They are not an
official mandate. There may well be cases where there are reasons to
@@ -2,7 +2,7 @@
# Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation. All rights reserved.
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2010-2020 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@@ -20,4 +20,4 @@

SUBDIRS = c

EXTRA_DIST = README.txt
EXTRA_DIST = README.md
30 ompi/mpiext/affinity/README.md (new regular file)
@@ -0,0 +1,30 @@
# Open MPI extension: Affinity

## Copyrights

```
Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
```

## Authors

* Jeff Squyres, 19 April 2010, and 16 April 2012
* Terry Dontje, 18 November 2010

## Description

This extension provides a single new function, `OMPI_Affinity_str()`,
that takes a format value and then provides 3 prettyprint strings as
output:

* `fmt_type`: an enum that tells `OMPI_Affinity_str()` whether to
  use a resource description string or layout string format for the
  `ompi_bound` and `currently_bound` output strings.
* `ompi_bound`: describes what sockets/cores Open MPI bound this process
  to (or indicates that Open MPI did not bind this process).
* `currently_bound`: describes what sockets/cores this process is
  currently bound to (or indicates that it is unbound).
* `exists`: describes what processors are available in the current host.

See `OMPI_Affinity_str(3)` for more details.
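
A usage sketch follows. The constant `OMPI_AFFINITY_RSRC_STRING_FMT`
and the buffer size `OMPI_AFFINITY_STRING_MAX` used below are recalled
from the `OMPI_Affinity_str(3)` man page and should be verified there;
treat this as a sketch, not authoritative usage.

```c
/*
 * Sketch only: constant and buffer-size names should be checked against
 * OMPI_Affinity_str(3); the three output strings match the description
 * above (ompi_bound, currently_bound, exists).
 */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
#if defined(OPEN_MPI) && OPEN_MPI
    char ompi_bound[OMPI_AFFINITY_STRING_MAX];
    char currently_bound[OMPI_AFFINITY_STRING_MAX];
    char exists[OMPI_AFFINITY_STRING_MAX];

    OMPI_Affinity_str(OMPI_AFFINITY_RSRC_STRING_FMT,
                      ompi_bound, currently_bound, exists);
    printf("ompi_bound: %s\ncurrently_bound: %s\nexists: %s\n",
           ompi_bound, currently_bound, exists);
#endif
    MPI_Finalize();
    return 0;
}
```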
@@ -1,29 +0,0 @@
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.

$COPYRIGHT$

Jeff Squyres
19 April 2010, and
16 April 2012

Terry Dontje
18 November 2010

This extension provides a single new function, OMPI_Affinity_str(),
that takes a format value and then provides 3 prettyprint strings as
output:

fmt_type: is an enum that tells OMPI_Affinity_str() whether to use a
resource description string or layout string format for ompi_bound and
currently_bound output strings.

ompi_bound: describes what sockets/cores Open MPI bound this process
to (or indicates that Open MPI did not bind this process).

currently_bound: describes what sockets/cores this process is
currently bound to (or indicates that it is unbound).

exists: describes what processors are available in the current host.

See OMPI_Affinity_str(3) for more details.
@@ -21,4 +21,4 @@

SUBDIRS = c

EXTRA_DIST = README.txt
EXTRA_DIST = README.md
11 ompi/mpiext/cuda/README.md (new regular file)
@@ -0,0 +1,11 @@
# Open MPI extension: Cuda

Copyright (c) 2015 NVIDIA, Inc. All rights reserved.

Author: Rolf vandeVaart

This extension provides a macro for compile time check of CUDA aware
support. It also provides a function for runtime check of CUDA aware
support.

See `MPIX_Query_cuda_support(3)` for more details.
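
A short sketch of the usual check pattern, combining the compile-time
macro and the runtime query (guarded so it also builds against non-Open
MPI implementations):

```c
/*
 * Sketch of combining the compile-time and runtime CUDA-aware checks.
 * <mpi-ext.h> provides the MPIX_CUDA_AWARE_SUPPORT macro and the
 * MPIX_Query_cuda_support() prototype when building with Open MPI.
 */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time: this Open MPI was built with CUDA-aware support.\n");
    printf("Runtime: CUDA-aware support is %s.\n",
           MPIX_Query_cuda_support() ? "available" : "not available");
#else
    printf("No CUDA-aware support detected at compile time.\n");
#endif

    MPI_Finalize();
    return 0;
}
```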
@@ -1,11 +0,0 @@
# Copyright (c) 2015 NVIDIA, Inc. All rights reserved.

$COPYRIGHT$

Rolf vandeVaart

This extension provides a macro for compile time check of CUDA aware support.
It also provides a function for runtime check of CUDA aware support.

See MPIX_Query_cuda_support(3) for more details.

@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2012 Cisco Systems, Inc. All rights reserved.
+# Copyright (c) 2020 Cisco Systems, Inc. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -17,4 +17,4 @@

 SUBDIRS = c mpif-h use-mpi use-mpi-f08

-EXTRA_DIST = README.txt
+EXTRA_DIST = README.md

148  ompi/mpiext/example/README.md  Normal file
@@ -0,0 +1,148 @@
# Open MPI extension: Example

## Overview

This example MPI extension shows how to make an MPI extension for Open
MPI.

An MPI extension provides new top-level APIs in Open MPI that are
available to user-level applications (vs. adding new code/APIs that are
wholly internal to Open MPI).  MPI extensions are generally used to
prototype new MPI APIs, or to provide Open MPI-specific APIs to
applications.  This example MPI extension provides a new top-level MPI
API named `OMPI_Progress` that is callable in both C and Fortran.

MPI extensions are similar to Open MPI components, but due to
complex ordering requirements for the Fortran-based MPI bindings,
their build order is a little different.

Note that MPI has 4 different sets of bindings (C, Fortran `mpif.h`,
the Fortran `mpi` module, and the Fortran `mpi_f08` module), and Open
MPI extensions allow adding API calls to all 4 of them.  Prototypes
for the user-accessible functions/subroutines/constants are included
in the following publicly-available mechanisms:

* C: `mpi-ext.h`
* Fortran mpif.h: `mpif-ext.h`
* Fortran "use mpi": `use mpi_ext`
* Fortran "use mpi_f08": `use mpi_f08_ext`

This example extension defines a new top-level API named
`OMPI_Progress()` in all four binding types, and provides test programs
to call this API in each of the four binding types.  Code (and
comments) is worth 1,000 words -- see the code in this example
extension to understand how it works and how the build system builds
and inserts each piece into the publicly-available mechanisms (e.g.,
`mpi-ext.h` and the `mpi_f08_ext` module).
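
For illustration only, a sketch of how an application typically guards
its use of an extension.  The `OMPI_HAVE_MPI_EXT_EXAMPLE` macro name
assumes the usual `OMPI_HAVE_MPI_EXT_<NAME>` convention used by
`mpi-ext.h`; verify the exact name in your installed header.

```c
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* prototypes for all built extensions */
#endif

void report_example_extension(void)
{
    /* mpi-ext.h defines one OMPI_HAVE_MPI_EXT_<NAME> macro per built
     * extension; guarding keeps the code portable to builds (and MPI
     * implementations) that lack the extension. */
#if defined(OMPI_HAVE_MPI_EXT_EXAMPLE)
    printf("example extension available (OMPI_Progress is declared in mpi-ext.h)\n");
#else
    printf("example extension not available in this build\n");
#endif
}
```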
## Comparison to General Open MPI MCA Components

Here are the ways that MPI extensions are similar to Open MPI
components:

1. Extensions have a top-level `configure.m4` with a well-known m4 macro
   that is run during Open MPI's configure that determines whether the
   component wants to build or not.

   Note, however, that unlike components, extensions *must* have a
   `configure.m4`.  No other method of configuration is supported.

1. Extensions must adhere to normal Automake-based targets.  We
   strongly suggest that you use `Makefile.am`'s and have the
   extension's `configure.m4` `AC_CONFIG_FILE` each `Makefile.am` in
   the extension.  Using other build systems may work, but they are
   untested and unsupported.

1. Extensions create specifically-named libtool convenience archives
   (i.e., `*.la` files) that the build system slurps into higher-level
   libraries.

Unlike components, however, extensions:

1. Have a somewhat more rigid directory and file naming scheme.
1. Have up to four different, specifically-named subdirectories (one
   for each MPI binding type).
1. Also install some specifically-named header files (for C and the
   Fortran `mpif.h` bindings).

Similar to components, an MPI extension's name is determined by its
directory name: `ompi/mpiext/EXTENSION_NAME`

## Extension requirements

### Required: C API

Under this top-level directory, the extension *must* have a directory
named `c` (for the C bindings) that:

1. contains a file named `mpiext_EXTENSION_NAME_c.h` (a sketch of such
   a header appears after this list)
1. installs `mpiext_EXTENSION_NAME_c.h` to
   `$includedir/openmpi/mpiext/EXTENSION_NAME/c`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_c.la`
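
To make the naming rules above concrete, here is a hypothetical
`mpiext_example_c.h` for an extension named `example`; the prototype
shown is illustrative only (each real extension declares whatever API
it actually provides).

```c
/* ompi/mpiext/example/c/mpiext_example_c.h
 *
 * Hypothetical public header for an extension named "example".
 * It is installed to $includedir/openmpi/mpiext/example/c and is
 * pulled into user code via mpi-ext.h.
 */
#ifndef MPIEXT_EXAMPLE_C_H
#define MPIEXT_EXAMPLE_C_H

/* Illustrative prototype only; the real example extension defines
 * its own OMPI_Progress() signature. */
int OMPI_Progress(void);

#endif /* MPIEXT_EXAMPLE_C_H */
```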
### Optional: `mpif.h` bindings

Optionally, the extension may have a directory named `mpif-h` (for the
Fortran `mpif.h` bindings) that:

1. contains a file named `mpiext_EXTENSION_NAME_mpifh.h`
1. installs `mpiext_EXTENSION_NAME_mpifh.h` to
   `$includedir/openmpi/mpiext/EXTENSION_NAME/mpif-h`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_mpifh.la`

### Optional: `mpi` module bindings

Optionally, the extension may have a directory named `use-mpi` (for the
Fortran `mpi` module) that:

1. contains a file named `mpiext_EXTENSION_NAME_usempi.h`

***NOTE:*** The MPI extension system does NOT support building an
additional library in the `use-mpi` extension directory.  It is
assumed that the `use-mpi` bindings will use the same back-end symbols
as the `mpif.h` bindings, and that the only output product of the
`use-mpi` directory is a file to be included in the `mpi-ext` module
(i.e., strong Fortran prototypes for the functions/global variables in
this extension).

### Optional: `mpi_f08` module bindings

Optionally, the extension may have a directory named `use-mpi-f08` (for
the Fortran `mpi_f08` module) that:

1. contains a file named `mpiext_EXTENSION_NAME_usempif08.h`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_usempif08.la`

See the comments in all the header and source files in this tree to
see what each file is for and what should be in each.

## Notes

Note that the build order of MPI extensions is a bit strange.  The
directories in an MPI extension are NOT traversed top-down in
sequential order.  Instead, due to ordering requirements when building
the Fortran module-based interfaces, each subdirectory in an extension
is traversed individually at different times in the overall Open MPI
build.

As such, `ompi/mpiext/EXTENSION_NAME/Makefile.am` is not traversed
during a normal top-level `make all` target.  This `Makefile.am`
exists for two reasons, however:

1. For the convenience of the developer, so that you can issue normal
   `make` commands at the top of your extension tree (e.g., `make all`
   will still build all bindings in an extension).
1. During a top-level `make dist`, extension directories *are*
   traversed top-down in sequence order.  Having a top-level
   `Makefile.am` in an extension allows `EXTRA_DIST`ing of files, such
   as this `README.md` file.

There are reasons for this strange ordering, but suffice it to say that
`make dist` doesn't have the same ordering requirements as `make all`,
and it is therefore easier to have a "normal" Automake-usual top-down
sequential directory traversal.

Enjoy!

@@ -8,3 +8,5 @@
 #

 SUBDIRS = c mpif-h use-mpi use-mpi-f08
+
+EXTRA_DIST = README.md

14  ompi/mpiext/pcollreq/README.md  Normal file
@@ -0,0 +1,14 @@
# Open MPI extension: pcollreq

Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.

This extension provides persistent collective communication operations
and persistent neighborhood collective communication operations, which
are planned to be included in the next MPI Standard after MPI-3.1 (as
of Nov. 2018).

See `MPIX_Barrier_init(3)` for more details.

The code will be moved to the `ompi/mpi` directory and the `MPIX_`
prefix will be switched to the `MPI_` prefix once the MPI Standard that
includes this feature is published.
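
For illustration, a minimal sketch of a persistent barrier following
the `MPIX_Barrier_init(3)` pattern: initialize once, then start/wait
the same request repeatedly.

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_* prototypes for built extensions */

/* Run 'iters' iterations, each ending in a persistent barrier. */
int run_iterations(MPI_Comm comm, int iters)
{
    MPI_Request req;
    MPIX_Barrier_init(comm, MPI_INFO_NULL, &req);

    for (int i = 0; i < iters; ++i) {
        /* ... per-iteration work ... */
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Request_free(&req);
    return 0;
}
```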

@@ -8,3 +8,5 @@
 #

 SUBDIRS = c mpif-h use-mpi use-mpi-f08
+
+EXTRA_DIST = README.md

35  ompi/mpiext/shortfloat/README.md  Normal file
@@ -0,0 +1,35 @@
# Open MPI extension: shortfloat

Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.

This extension provides additional MPI datatypes `MPIX_SHORT_FLOAT`,
`MPIX_C_SHORT_FLOAT_COMPLEX`, and `MPIX_CXX_SHORT_FLOAT_COMPLEX`,
which were proposed (with the `MPI_` prefix) in June 2017 for inclusion
in the MPI 4.0 standard.  As of February 2019, the proposal has not yet
been accepted.

See https://github.com/mpi-forum/mpi-issues/issues/65 for more details.

Each MPI datatype corresponds to the C/C++ type `short float`, the C
type `short float _Complex`, and the C++ type `std::complex<short
float>`, respectively.

In addition, this extension provides a datatype `MPIX_C_FLOAT16` for
the C type `_Float16`, which is defined in ISO/IEC JTC 1/SC 22/WG 14
N1945 (ISO/IEC TS 18661-3:2015).  This name and meaning are the same as
those of MPICH.  See https://github.com/pmodels/mpich/pull/3455.

This extension is enabled only if the C compiler supports `short float`
or `_Float16`, or the `--enable-alt-short-float=TYPE` option is passed
to the Open MPI `configure` script.

NOTE: The Clang 6.0.x and 7.0.x compilers support the `_Float16` type
(via software emulation), but require an additional linker flag to
function properly.  If you wish to enable Clang 6.0.x or 7.0.x's
software emulation of `_Float16`, use the following CLI options to the
Open MPI configure script:

```
./configure \
    LDFLAGS=--rtlib=compiler-rt \
    --with-wrapper-ldflags=--rtlib=compiler-rt ...
```
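
For illustration, a minimal sketch that uses the `MPIX_C_FLOAT16`
datatype; it assumes the compiler supports `_Float16` and that the
extension was enabled at configure time.

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_C_FLOAT16 / MPIX_SHORT_FLOAT, when enabled */

/* Exchange a small _Float16 buffer between ranks 0 and 1. */
void exchange_halves(int rank, MPI_Comm comm)
{
    _Float16 buf[4] = {0};

    if (rank == 0) {
        MPI_Send(buf, 4, MPIX_C_FLOAT16, 1, 0, comm);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPIX_C_FLOAT16, 0, 0, comm, MPI_STATUS_IGNORE);
    }
}
```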

113  opal/mca/btl/ofi/README.md  Normal file
@@ -0,0 +1,113 @@
# Design notes on BTL/OFI

This is the RDMA-only btl based on OFI Libfabric.  The goal is to
enable RDMA with multiple vendor hardware through one interface.  Most
of the operations are managed by the upper layer (osc/rdma).  This BTL
mostly does the low-level work.

Tested providers: sockets, psm2, ugni

## Component

This BTL requests the libfabric version 1.5 API and will not support
older versions.

The required capabilities of this BTL are `FI_ATOMIC` and `FI_RMA` with
the endpoint type `FI_EP_RDM` only.  This BTL does NOT support
libfabric providers that require local memory registration
(`FI_MR_LOCAL`).

BTL/OFI will initialize a module with ONLY the first compatible info
returned from OFI.  This means it will rely on the OFI provider to do
load balancing.  Support for multiple devices might be added later.

The BTL creates only one endpoint and one CQ.

## Memory Registration

Open MPI has a system in place to exchange remote addresses and always
uses the remote virtual address to refer to a piece of memory.  However,
some libfabric providers might not support the use of virtual addresses
and instead will use zero-based offset addressing.

`FI_MR_VIRT_ADDR` is the flag that determines this
behavior.  `mca_btl_ofi_reg_mem()` handles this by storing the base
address in the registration handle in case the provider does not
support `FI_MR_VIRT_ADDR`.  This base address will be used to calculate
the offset later in RDMA/Atomic operations.

The BTL will try to use the address of the registration handle as the
key.  However, if the provider supports `FI_MR_PROV_KEY`, it will use
the provider-provided key; the BTL simply does not care which.

The BTL does not register local operand or compare buffers.  This is
why this BTL does not support `FI_MR_LOCAL` and will allocate every
buffer before registering.  This means `FI_MR_ALLOCATED` is supported.
So, to be explicit:

Supported MR mode bits (will work with or without):

* enum:
  * `FI_MR_BASIC`
  * `FI_MR_SCALABLE`
* mode bits:
  * `FI_MR_VIRT_ADDR`
  * `FI_MR_ALLOCATED`
  * `FI_MR_PROV_KEY`

The BTL does NOT support (will not work with):

* `FI_MR_LOCAL`
* `FI_MR_MMU_NOTIFY`
* `FI_MR_RMA_EVENT`
* `FI_MR_ENDPOINT`

Just a reminder, in libfabric API 1.5:
`FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)`
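
A schematic sketch (not the actual BTL code) of how the stored base
address is used: with `FI_MR_VIRT_ADDR` the peer's virtual address is
passed through unchanged, otherwise it is converted into a zero-based
offset.  The structure and variable names are invented for the sketch.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical registration-handle layout for the sketch. */
struct reg_handle {
    uint64_t base;   /* virtual address at registration time */
    uint64_t rkey;   /* remote key (provider key if FI_MR_PROV_KEY) */
};

/* Address to hand to fi_read()/fi_write() for a remote buffer. */
static uint64_t remote_rdma_addr(bool provider_uses_virt_addr,
                                 uint64_t remote_vaddr,
                                 const struct reg_handle *h)
{
    /* FI_MR_VIRT_ADDR: provider addresses RMA by virtual address.
     * Otherwise: provider expects an offset from the registered base. */
    return provider_uses_virt_addr ? remote_vaddr
                                   : remote_vaddr - h->base;
}
```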

## Completions

Every operation in this BTL is asynchronous.  The completion handling
occurs in `mca_btl_ofi_component_progress()`, where we read the CQ
with the completion context and execute the callback functions.  The
completions are local.  No remote completion event is generated, as
local completion already guarantees global completion.

The BTL keeps track of the number of outstanding operations and
provides a flush interface.
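
A schematic sketch of the progress pattern described above, using the
standard libfabric CQ read call; the callback type and context layout
are invented for the sketch (the real logic lives in
`mca_btl_ofi_component_progress()`).

```c
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Hypothetical completion context carried in each posted operation. */
struct ofi_ctx {
    void (*cb)(struct ofi_ctx *ctx);   /* completion callback */
};

/* Drain up to 'max' completions from one CQ and run their callbacks. */
static int progress_cq(struct fid_cq *cq, int max)
{
    struct fi_cq_entry entries[8];
    int handled = 0;

    while (handled < max) {
        ssize_t n = fi_cq_read(cq, entries, 8);
        if (n == -FI_EAGAIN || n <= 0) {
            break;                      /* nothing (more) completed */
        }
        for (ssize_t i = 0; i < n; ++i) {
            struct ofi_ctx *ctx = (struct ofi_ctx *)entries[i].op_context;
            ctx->cb(ctx);               /* local completion == global completion */
        }
        handled += (int)n;
    }
    return handled;
}
```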

## Sockets Provider

The sockets provider is the proof-of-concept provider for libfabric.
It is supposed to support all of the OFI API with emulation.  This
provider is considered very slow and bound to raise problems that we
might not see with other, faster providers.

Known problems:

* The sockets provider uses a progress thread and can cause a segfault
  in finalize, as we free the resources while the progress thread is
  still using them.  `sleep(1)` was put in
  `mca_btl_ofi_component_close()` for this reason.
* The sockets provider deadlocks in two-sided mode.  Might be something
  about buffered recv.  (August 2018)

## Scalable Endpoint

This BTL will try to use a scalable endpoint to create communication
contexts.  This will increase multithreaded performance for some
applications.  The default number of contexts created is 1 and can be
tuned via the MCA parameter `btl_ofi_num_contexts_per_module`.  It is
advised that the number of contexts be equal to the number of physical
cores for optimal performance.

Users can disable the scalable endpoint with the MCA parameter
`btl_ofi_disable_sep`.  With the scalable endpoint disabled, the BTL
will alias the OFI endpoint to both the tx and rx contexts.

## Two-sided communication

Two-sided communication was added to BTL/OFI later on, to allow
non-tag-matching providers to be used in Open MPI with this BTL.
However, the support is only "functional" and has not been optimized
for performance at this point.  (August 2018)

126  opal/mca/btl/smcuda/README.md  Normal file
@@ -0,0 +1,126 @@
# Open MPI SMCUDA design document

Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
August 21, 2013

This document describes the design and use of the `smcuda` BTL.

## BACKGROUND

The `smcuda` btl is a copy of the `sm` btl but with some additional
features.  The main extra feature is the ability to make use of the
CUDA IPC APIs to quickly move GPU buffers from one GPU to another.
Without this support, the GPU buffers would all be moved into and then
out of host memory.

## GENERAL DESIGN

The general design makes use of the large message RDMA RGET support in
the OB1 PML.  However, there are some interesting choices to make use
of it.  First, we disable any large message RDMA support in the BTL
for host messages.  This is done because we need to use
`mca_btl_smcuda_get()` for the GPU buffers.  This is also done because
the upper layers expect there to be a single mpool but we need one for
the GPU memory and one for the host memory.  Since the advantages of
using RDMA with host memory are unclear, we disabled it.  This means no
KNEM or CMA support is built in to the `smcuda` BTL.

Also note that we give the `smcuda` BTL a higher rank than the `sm`
BTL.  This means it will always be selected even if we are doing
host-only data transfers.  The `smcuda` BTL is not built if it is not
requested via the `--with-cuda` flag to the configure line.

Secondly, the `smcuda` does not make use of the traditional method of
enabling RDMA operations.  The traditional method checks for the existence
of an RDMA btl hanging off the endpoint.  The `smcuda` works in conjunction
with the OB1 PML and uses flags that it sends in the BML layer.

## OTHER CONSIDERATIONS

CUDA IPC is not necessarily supported by all GPUs on a node.  In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected
over the IOH.  In addition, we want to check for CUDA IPC support lazily,
when the first GPU access occurs, rather than during `MPI_Init()` time.
This complicates the design.

## INITIALIZATION

When the `smcuda` BTL initializes, it starts with no support for CUDA IPC.
Upon the first access of a GPU buffer, the `smcuda` checks which GPU device
it has and sends that to the remote side using an `smcuda`-specific control
message.  The other rank receives the message, and checks to see if there
is CUDA IPC support between the two GPUs via a call to
`cuDeviceCanAccessPeer()`.  If it is true, then the `smcuda` BTL piggybacks on
the PML error handler callback to make a call into the PML and let it know
to enable CUDA IPC.  We created a new flag so that the error handler does
the right thing.  Large message RDMA is enabled by setting a flag in the
`bml->btl_flags` field.  Control returns to the `smcuda` BTL where a reply
message is sent so the sending side can set its flag.

At that point, the PML layer starts using the large message RDMA
support in the `smcuda` BTL.  This is done in some special CUDA code
in the PML layer.

## ESTABLISHING CUDA IPC SUPPORT

A check has been added into both the `send` and `sendi` paths in the
`smcuda` btl that checks to see if it should send a CUDA IPC setup
request message.

```c
/* Initiate setting up CUDA IPC support. */
if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
    mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
}
```

The first check is to see if the CUDA environment has been
initialized.  If not, then presumably we are not sending any GPU
buffers yet and there is nothing to be done.  If we are initialized,
then check the status of the CUDA IPC endpoint.  If it is in the
IPC_INIT stage, then call the function to send a control message to
the endpoint.

On the receiving side, we first check to see if we are initialized.
If not, then send a message back to the sender saying we are not
initialized.  This will cause the sender to reset its state to
IPC_INIT so it can try again on the next send.

I considered putting the receiving side into a new state like
IPC_NOTREADY, and then when it switches to ready, sending the
ACK to the sender.  The problem with this is that we would need to do
these checks during the progress loop, which adds some extra overhead,
as we would have to check all endpoints to see if they were ready.

Note that any rank can initiate the setup of CUDA IPC.  It is
triggered by whichever side does a send or sendi call of a GPU buffer.

I have the sender attempt 5 times to set up the connection.  After
that, we give up.  Note that I do not expect many scenarios where the
sender has to resend.  It could happen in a race condition where one
rank has initialized its CUDA environment but the other side has not.

There are several states the connections can go through (a schematic
enum follows the list):

1. IPC_INIT - nothing has happened
1. IPC_SENT - message has been sent to other side
1. IPC_ACKING - Received request and figuring out what to send back
1. IPC_ACKED - IPC ACK sent
1. IPC_OK - IPC ACK received back
1. IPC_BAD - Something went wrong, so marking as no IPC support
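
For reference, the progression above maps naturally onto a small enum;
the names below mirror the list, though the actual `smcuda` source may
spell them differently.

```c
/* Illustrative only: per-endpoint CUDA IPC negotiation state. */
enum ipc_state {
    IPC_INIT,    /* nothing has happened yet */
    IPC_SENT,    /* request message sent to the other side */
    IPC_ACKING,  /* request received; deciding what to send back */
    IPC_ACKED,   /* IPC ACK sent */
    IPC_OK,      /* IPC ACK received back; CUDA IPC can be used */
    IPC_BAD      /* something went wrong; no IPC support */
};
```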

## NOTE ABOUT CUDA IPC AND MEMORY POOLS

The CUDA IPC support works in the following way.  A sender makes a
call to `cuIpcGetMemHandle()` and gets a memory handle for its local
memory.  The sender then sends that handle to the receiving side.  The
receiver calls `cuIpcOpenMemHandle()` using that handle and gets back
an address to the remote memory.  The receiver then calls
`cuMemcpyAsync()` to initiate a remote read of the GPU data.

The receiver maintains a cache of remote memory that it has handles
open on.  This is because a call to `cuIpcOpenMemHandle()` can be very
expensive (90 usec), so we want to avoid it when we can.  The cache of
remote memory is kept in a memory pool that is associated with each
endpoint.  Note that we do not cache the local memory handles because
getting them is very cheap and there is no need.
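
The driver-API sequence described above, reduced to a sketch; error
handling and the network exchange of the handle are omitted, and the
per-endpoint cache is represented only by a comment.

```c
#include <cuda.h>

/* Sender side: produce an IPC handle for a local GPU buffer.
 * The handle is then shipped to the peer in a control message. */
static CUresult export_gpu_buffer(CUdeviceptr local_buf, CUipcMemHandle *handle)
{
    return cuIpcGetMemHandle(handle, local_buf);   /* cheap; not cached */
}

/* Receiver side: map the peer's buffer and read from it asynchronously.
 * In the real BTL the cuIpcOpenMemHandle() result is cached per endpoint
 * because the call costs on the order of 90 usec. */
static CUresult read_gpu_buffer(CUipcMemHandle handle, CUdeviceptr local_dst,
                                size_t bytes, CUstream stream)
{
    CUdeviceptr remote_src;
    CUresult rc = cuIpcOpenMemHandle(&remote_src, handle,
                                     CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
    if (rc != CUDA_SUCCESS) {
        return rc;
    }
    return cuMemcpyAsync(local_dst, remote_src, bytes, stream);
}
```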

@@ -27,7 +27,7 @@

 AM_CPPFLAGS = $(opal_ofi_CPPFLAGS) -DOMPI_LIBMPI_NAME=\"$(OMPI_LIBMPI_NAME)\"

-EXTRA_DIST = README.txt README.test
+EXTRA_DIST = README.md README.test

 dist_opaldata_DATA = \
        help-mpi-btl-usnic.txt

330  opal/mca/btl/usnic/README.md  Normal file
@@ -0,0 +1,330 @@
# Design notes on usnic BTL

## nomenclature

* fragment - something the PML asks us to send or put, any size
* segment - something we can put on the wire in a single packet
* chunk - a piece of a fragment that fits into one segment

A segment can contain either an entire fragment or a chunk of a fragment.

Each segment and fragment has an associated descriptor.

Each segment data structure has a block of registered memory associated
with it which matches the MTU for that segment.

* ACK - ACKs get special small segments with only enough memory for an ACK
* non-ACK segments always have a parent fragment

* fragments are either large (> MTU) or small (<= MTU)
* a small fragment has a segment descriptor embedded within it since it
  always needs exactly one.
* a large fragment has no permanently associated segments, but allocates
  them as needed.

## channels

A channel is a queue pair with an associated completion queue.
Each channel has its own MTU and r/w queue entry counts.

There are 2 channels, command and data:
* the command queue is generally for higher priority fragments
* the data queue is for standard data traffic
* the command queue should possibly be called the "priority" queue

The command queue is shorter and has a smaller MTU than the data queue.
This makes the command queue a lot faster than the data queue, so we
hijack it for sending very small fragments (<= tiny_mtu, currently 768
bytes).

The command queue is used for ACKs and tiny fragments.
The data queue is used for everything else.

PML fragments marked priority should perhaps use the command queue.

## sending

Normally, all send requests are simply enqueued and then actually posted
to the NIC by the routine `opal_btl_usnic_module_progress_sends()`.
"Fastpath" tiny sends are the exception.

Each module maintains a queue of endpoints that are ready to send.
An endpoint is ready to send if all of the following are met:

1. the endpoint has fragments to send
1. the endpoint has send credits
1. the endpoint's send window is "open" (not full of un-ACKed segments)

Each module also maintains a list of segments that need to be retransmitted.
Note that the list of pending retransmissions is per-module, not per-endpoint.

Send progression first posts any pending retransmissions, always using
the data channel.  (The reason is that if we start getting heavy
congestion and there are lots of retransmits, it becomes more
important than ever to prioritize ACKs; clogging the command channel
with retransmitted data makes things worse, not better.)

Next, progression loops sending segments to the endpoint at the top of
the `endpoints_with_sends` queue.  When an endpoint exhausts its send
credits or fills its send window or runs out of segments to send, it
removes itself from the `endpoint_with_sends` list.  Any pending ACKs
will be picked up and piggy-backed on these sends.

Finally, any endpoints that still need ACKs and whose timer has expired
will be sent explicit ACK packets.

## fragment sending

The middle part of the progression loop handles both small
(single-segment) and large (multi-segment) sends.

For small fragments, the verbs descriptor within the embedded segment
is updated with the length, the BTL header is updated, then we call
`opal_btl_usnic_endpoint_send_segment()` to send the segment.  After
posting, we make a PML callback if needed.

For large fragments, a little more is needed.  Segments from a large
fragment have a slightly larger BTL header which contains a fragment
ID, an offset, and a size.  The fragment ID is allocated when the
first chunk of the fragment is sent.  A segment gets allocated, the
next blob of data is copied into this segment, and the segment is
posted.  If the last chunk of the fragment was sent, we perform the
callback if needed, then remove the fragment from the endpoint send
queue.

## `opal_btl_usnic_endpoint_send_segment()`

This is common posting code for large or small segments.  It assigns a
sequence number to a segment, checks for an ACK to piggy-back,
posts the segment to the NIC, and then starts the retransmit timer
by checking the segment into the hotel.  Send credits are consumed here.

## send dataflow

PML control messages with no user data are sent via:

* `desc = usnic_alloc(size)`
* `usnic_send(desc)`

User messages shorter than the eager limit and the 1st part of larger
messages are sent via:

* `desc = usnic_prepare_src(convertor, size)`
* `usnic_send(desc)`

Larger messages:

* `desc = usnic_prepare_src(convertor, size)`
* `usnic_put(desc)`

`usnic_alloc()` currently asserts the length is "small", then allocates
and fills in a small fragment.  The src pointer will point to the start
of the associated registered memory + sizeof(BTL header), and the PML
will put its data there.

`usnic_prepare_src()` allocates either a large or small fragment based
on size.  The fragment descriptor is filled in to have 2 SG entries,
the 1st pointing to the place where the PML should construct its
header.  If the data convertor says the data is contiguous, the 2nd SG
entry points to the user buffer, else it is null and sf_convertor is
filled in with the address of the convertor.

### `usnic_send()`

If the fragment being sent is small enough, has contiguous data, and
"very few" command queue send WQEs have been consumed, `usnic_send()`
does a fastpath send.  This means it posts the segment immediately to
the NIC with the INLINE flag set.

If all of the conditions for a fastpath send are not met, and this is a
small fragment, the user data is copied into the associated registered
memory at this time and the SG list in the descriptor is collapsed to
one entry.

After the checks above are done, the fragment is enqueued to be sent
via `opal_btl_usnic_endpoint_enqueue_frag()`.

### `usnic_put()`

Do a fast version of what happens in `prepare_src()` (we can take
shortcuts because we know it will always be a contiguous buffer / no
convertor needed).  The PML gives us the destination address, which we
save on the fragment (which is the sentinel value that the underlying
engine uses to know that this is a PUT and not a SEND), and the
fragment is enqueued for processing.

### `opal_btl_usnic_endpoint_enqueue_frag()`

This appends the fragment to the "to be sent" list of the endpoint and
conditionally adds the endpoint to the list of endpoints with data to
send via `opal_btl_usnic_check_rts()`.
## receive dataflow
|
||||||
|
|
||||||
|
BTL packets has one of 3 types in header: frag, chunk, or ack.
|
||||||
|
|
||||||
|
* A frag packet is a full PML fragment.
|
||||||
|
* A chunk packet is a piece of a fragment that needs to be reassembled.
|
||||||
|
* An ack packet is header only with a sequence number being ACKed.
|
||||||
|
|
||||||
|
* Both frag and chunk packets go through some of the same processing.
|
||||||
|
* Both may carry piggy-backed ACKs which may need to be processed.
|
||||||
|
* Both have sequence numbers which must be processed and may result in
|
||||||
|
dropping the packet and/or queueing an ACK to the sender.
|
||||||
|
|
||||||
|
frag packets may be either regular PML fragments or PUT segments. If
|
||||||
|
the "put_addr" field of the BTL header is set, this is a PUT and the
|
||||||
|
data is copied directly to the user buffer. If this field is NULL,
|
||||||
|
the segment is passed up to the PML. The PML is expected to do
|
||||||
|
everything it needs with this packet in the callback, including
|
||||||
|
copying data out if needed. Once the callback is complete, the
|
||||||
|
receive buffer is recycled.
|
||||||
|
|
||||||
|
chunk packets are parts of a larger fragment. If an active fragment
|
||||||
|
receive for the matching fragment ID cannot be found, and new fragment
|
||||||
|
info descriptor is allocated. If this is not a PUT (`put_addr == NULL`),
|
||||||
|
we `malloc()` data to reassemble the fragment into. Each
|
||||||
|
subsequent chunk is copied either into this reassembly buffer or
|
||||||
|
directly into user memory. When the last chunk of a fragment arrives,
|
||||||
|
a PML callback is made for non-PUTs, then the fragment info descriptor
|
||||||
|
is released.
|
||||||
|
|
||||||
|
## fast receive optimization

In order to optimize latency of small packets, the component progress
routine implements a fast path for receives.  If the first completion
is a receive on the priority queue, it is handled by a routine called
`opal_btl_usnic_recv_fast()`, which does nothing but validate that the
packet is OK to be received (sequence number OK and not a DUP) and
then deliver it to the PML.  The packet is recorded in the channel
structure, and all bookkeeping for it is deferred until the next time
`component_progress` is called.
This fast path cannot be taken every time we pass through
`component_progress` because there will be other completions that need
processing, and the receive bookkeeping for one fast receive must be
complete before allowing another fast receive to occur, as only one
recv segment can be saved for deferred processing at a time.  This is
handled by maintaining a variable in `opal_btl_usnic_recv_fast()`
called `fastpath_ok`, which is set to false every time the fastpath is
taken.  A call into the regular progress routine sets this flag back
to true.
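A heavily simplified sketch of that gating, assuming hypothetical
helper functions for the completion check and the full progress path:

```c
#include <stdbool.h>

/* Hypothetical helpers, assumed to exist elsewhere for this sketch. */
bool first_completion_is_priority_recv(void);
int  full_progress(void);

static bool fastpath_ok = true;

static int component_progress_sketch(void)
{
    if (fastpath_ok && first_completion_is_priority_recv()) {
        /* Validate (seq OK, not a DUP), deliver to the PML, and remember
         * the segment; its bookkeeping is deferred to the next call. */
        fastpath_ok = false;
        return 1;
    }

    /* Regular path: finish any deferred bookkeeping, drain completions,
     * and re-arm the fast path. */
    fastpath_ok = true;
    return full_progress();
}
```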
## reliability

* Every packet has a sequence number.
* Each endpoint has a "send window", currently 4096 entries.
* Once a segment is sent, it is saved in the window array until an ACK
  is received.
* ACKs acknowledge all packets <= the specified sequence number.
* The receiver only ACKs a sequence number when all packets up to that
  sequence have arrived.

* Each packet has a default retransmission timer of 100ms.
* A packet will be scheduled for retransmission if its timer expires.
Once a segment is sent, it always has its retransmit timer started;
this is accomplished by `opal_hotel_checkin()`.  Any time a segment is
posted to the NIC for retransmit, it is checked out of the hotel
(timer stopped).  A small sketch of this checkin/checkout pattern
appears after the list below.

So, a send segment is always in one of 4 states:

* on the free list, unallocated
* on the endpoint to-send list (for a segment associated with a small
  fragment)
* posted to the NIC and in the hotel awaiting an ACK
* on the module re-send list awaiting retransmission
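A sketch of how that maps onto the OPAL hotel; the checkin/checkout
signatures below are quoted from memory of `opal/class/opal_hotel.h`
and should be treated as an assumption, and the hotel/segment
variables are stand-ins for the real usnic structures:

```c
#include "opal/constants.h"
#include "opal/class/opal_hotel.h"

/* Sketch only: arm/stop the retransmit timer via the hotel. */
static void arm_retrans_timer(opal_hotel_t *hotel, void *segment, int *room)
{
    /* Checking in starts the eviction (retransmit) timer for this segment. */
    if (OPAL_SUCCESS != opal_hotel_checkin(hotel, segment, room)) {
        /* Hotel full: the segment would have to be handled some other
         * way, e.g. queued for immediate retransmission. */
    }
}

static void stop_retrans_timer(opal_hotel_t *hotel, int room)
{
    /* An ACK arrived, or the segment is being re-posted to the NIC. */
    opal_hotel_checkout(hotel, room);
}
```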
Receiver:

* If a packet with seq >= the expected seq is received, schedule an
  ACK of the largest in-order sequence received, if one is not already
  scheduled.  The default delay is 50us.
* If a packet with seq < the expected seq arrives, we send an ACK
  immediately, as this indicates a lost ACK.

Sender:

* A duplicate ACK triggers an immediate retransmission if one is not
  already pending for that segment.
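The receiver policy above can be summarized in a small sketch
(hypothetical structure and names; sequence-number wraparound is
ignored for brevity):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-endpoint receive state, not the real usnic struct. */
struct rx_endpoint_state {
    uint32_t expected_seq;      /* next in-order sequence we want */
    uint32_t highest_in_order;  /* largest in-order sequence received */
    bool     ack_scheduled;     /* a delayed (~50us) ACK is already pending */
};

/* Returns true if an ACK should be sent immediately. */
static bool note_arrival(struct rx_endpoint_state *ep, uint32_t seq)
{
    if (seq < ep->expected_seq) {
        /* Old packet: our previous ACK was probably lost -- ACK right away. */
        return true;
    }
    if (seq == ep->expected_seq) {
        ep->highest_in_order = seq;
        ep->expected_seq = seq + 1;
    }
    if (!ep->ack_scheduled) {
        ep->ack_scheduled = true;   /* schedule a delayed ACK of highest_in_order */
    }
    return false;
}
```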
## Reordering induced by two queues and piggy-backing

ACKs can be reordered:

* Not an issue at all; old ACKs are simply ignored.

Sends can be reordered:

* A small send can jump far ahead of large sends.
* A large send followed by lots of small sends could trigger many
  retransmissions of the large sends.  The smalls would have to be
  paced pretty precisely to keep the command queue empty enough and
  also beat out the large sends.  Send credits limit how many larges
  can be queued on the sender, but there could be many on the
  receiver.
## RDMA emulation

We emulate the RDMA PUT because it's more efficient than a regular
send: it allows the receiver to copy directly to the target buffer
(vs. making an intermediate copy out of the bounce buffer).

It would actually be better to morph this PUT into a GET -- GET would
be slightly more efficient.  In short, when the target requests the
actual RDMA data, with PUT, the request has to go up to the PML, which
will then invoke PUT on the source's BTL module.  With GET, the target
issues the GET, and the source BTL module can reply without needing to
go up the stack to the PML.
Once we start supporting RDMA in hardware (a hypothetical sketch of
the registration pieces follows this list):

* We need to provide `module.btl_register_mem` and
  `module.btl_deregister_mem` functions (see openib for an example).
* We need to put something meaningful in
  `btl_usnic_frag.h:mca_btl_base_registration_handle_t`.
* We need to set `module.btl_registration_handle_size` to
  `sizeof(struct mca_btl_base_registration_handle_t)`.
* `module.btl_put` / `module.btl_get` will receive the
  `mca_btl_base_registration_handle_t` from the peer as a cookie.
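As an illustration only (the handle's contents below are hypothetical;
the BTL v3.0 interface lets each BTL define them), the registration
pieces could look roughly like this:

```c
#include <stdint.h>

/* Hypothetical contents: what the registration handle could hold once
 * hardware RDMA is supported.  The BTL itself defines this struct and
 * advertises its size via module.btl_registration_handle_size; the
 * field names here are made up. */
struct mca_btl_base_registration_handle_t {
    uint32_t lkey;   /* key the local VIC needs to access the registered region */
    uint32_t rkey;   /* key handed to the peer; arrives back as the put/get cookie */
};

/* At module init (pseudo-steps, since the real module type is not shown
 * here):
 *   module.btl_registration_handle_size =
 *       sizeof(struct mca_btl_base_registration_handle_t);
 *   module.btl_register_mem / module.btl_deregister_mem = usnic-provided
 *       registration functions (see the openib BTL for the expected shape).
 */
```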
Also, `module.btl_put` / `module.btl_get` do not need to make
descriptors (this was an optimization added in BTL 3.0); they are now
called with enough information to do whatever they need to do.
`module.btl_put` still makes a descriptor and submits it to the usnic
sending engine so as to utilize a common infrastructure for send and
put.

But it doesn't necessarily have to be that way -- we could optimize
out the use of the descriptors.  We have not investigated how
easy/hard that would be.
## libfabric abstractions

* `fi_fabric`: corresponds to a VIC PF
* `fi_domain`: corresponds to a VIC VF
* `fi_endpoint`: resources inside the VIC VF (basically a QP)
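For reference, a minimal sketch of how these three objects are opened
with the libfabric API when asking for the usnic provider; this is not
the BTL's actual initialization code, and error handling and most
attributes are omitted:

```c
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static int open_usnic_objects(struct fid_fabric **fabric,
                              struct fid_domain **domain,
                              struct fid_ep **ep)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int ret;

    hints->fabric_attr->prov_name = strdup("usnic");
    ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info);

    if (0 == ret) ret = fi_fabric(info->fabric_attr, fabric, NULL); /* VIC PF  */
    if (0 == ret) ret = fi_domain(*fabric, info, domain, NULL);     /* VIC VF  */
    if (0 == ret) ret = fi_endpoint(*domain, info, ep, NULL);       /* QP-like */

    if (NULL != info) {
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return ret;
}
```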
## `MPI_THREAD_MULTIPLE` support

In order to make the usnic BTL thread-safe, mutex locks are used to
protect the critical path, i.e., libfabric routines, bookkeeping, etc.

The lock in question is `btl_usnic_lock`.  It is a RECURSIVE lock,
meaning that the same thread can take the lock again even if it
already holds it; this allows a callback function to post another
segment right away if we know that the current segment completed
inline (so we can call send within send without deadlocking).
These two functions take care of hotel checkin/checkout, and we have
to protect that part, so we take the mutex lock before entering them:

* `opal_btl_usnic_check_rts()`
* `opal_btl_usnic_handle_ack()`

The calls into libfabric routines have to be protected as well:

* `opal_btl_usnic_endpoint_send_segment()` (`fi_send`)
* `opal_btl_usnic_recv_call()` (`fi_recvmsg`)
The cclient connectivity checking (`opal_btl_usnic_connectivity_ping`)
also has to be protected.  This happens only at the beginning, but the
cclient communicates with the cagent through `opal_fd_read/write()`,
and if two or more clients do `opal_fd_write()` at the same time, the
data might be corrupted.

With this scheme, many functions in btl/usnic that call the functions
listed above are protected by the `OPAL_THREAD_LOCK` macro, which is
only active if the user invoked `MPI_Init_thread()` with
`MPI_THREAD_MULTIPLE` support.
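A sketch of the resulting pattern; the header path and where
`btl_usnic_lock` is declared vary across Open MPI versions, so treat
those details as assumptions:

```c
#include "opal/threads/mutex.h"   /* OPAL_THREAD_LOCK / OPAL_THREAD_UNLOCK */

extern opal_mutex_t btl_usnic_lock;   /* assumed: constructed as a recursive mutex */

static void locked_send_example(void)
{
    OPAL_THREAD_LOCK(&btl_usnic_lock);
    /* ... opal_btl_usnic_endpoint_send_segment(), hotel bookkeeping, ...
     * Because the lock is recursive, a completion callback that fires
     * inline can re-enter and post another segment without deadlocking. */
    OPAL_THREAD_UNLOCK(&btl_usnic_lock);
}
```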
@ -1,383 +0,0 @@
======================================

November 2014 / SC 2014
Update February 2015

The usnic BTL code has been unified across master and the v1.8
branches.

  NOTE: As of May 2018, this is no longer true.  This was generally
  only necessary back when the BTLs were moved from the OMPI layer to
  the OPAL layer.  Now that the BTLs have been down in OPAL for
  several years, this tomfoolery is no longer necessary.  This note
  is kept for historical purposes, just in case someone needs to go
  back and look at the v1.8 series.

That is, you can copy the code from v1.8:ompi/mca/btl/usnic/* to
master:opal/mca/btl/usnic*, and then only have to make 3 changes in
the resulting code in master:

1. Edit Makefile.am: s/ompi/opal/gi
2. Edit configure.m4: s/ompi/opal/gi
   --> EXCEPT for:
       - opal_common_libfabric_* (which will eventually be removed,
         when the embedded libfabric goes away)
       - OPAL_BTL_USNIC_FI_EXT_USNIC_H (which will eventually be
         removed, when the embedded libfabric goes away)
       - OPAL_VAR_SCOPE_*
3. Edit Makefile.am: change -DBTL_IN_OPAL=0 to -DBTL_IN_OPAL=1

*** Note: the BTL_IN_OPAL preprocessor macro is set in Makefile.am
    rather than in btl_usnic_compat.h to avoid all kinds of include
    file dependency issues (i.e., btl_usnic_compat.h would need to be
    included first, but it requires some data structures to be
    defined, which means it either can't be first or we have to
    declare various structs first... just put BTL_IN_OPAL in
    Makefile.am and be happy).

*** Note 2: CARE MUST BE TAKEN WHEN COPYING THE OTHER DIRECTION!  It
    is *not* as simple as a simple s/opal/ompi/gi in configure.m4 and
    Makefile.am.  It certainly can be done, but there are a few
    strings that need to stay "opal" or "OPAL" (e.g., OPAL_HAVE_FOO).
    Hence, the string replace will likely need to be done via manual
    inspection.

Things still to do:

- VF/PF sanity checks in component.c:check_usnic_config() use
  usnic-specific fi_provider info.  The exact mechanism might change
  as provider-specific info is still being discussed upstream.

- component.c:usnic_handle_cq_error is using a USD_* constant from
  usnic_direct.  Need to get that value through libfabric somehow.
oshmem/mca/memheap/README.md (new file)
@ -0,0 +1,71 @@
# MEMHEAP infrastructure documentation

Copyright (c) 2013 Mellanox Technologies, Inc.
All rights reserved

The MEMHEAP infrastructure is responsible for managing the symmetric
heap.  The framework currently has the following components: buddy and
ptmalloc.  Buddy uses a buddy allocator in order to manage the memory
allocations on the symmetric heap; ptmalloc is an adaptation of
ptmalloc3.

Additional components may be added easily to the framework by defining
the component's and the module's base and extended structures, and
their functionalities.
The buddy allocator has the following data structures:

1. Base component - of type `struct mca_memheap_base_component_2_0_0_t`
2. Base module - of type `struct mca_memheap_base_module_t`
3. Buddy component - of type `struct mca_memheap_base_component_2_0_0_t`
4. Buddy module - of type `struct mca_memheap_buddy_module_t`, extending
   the base module (`struct mca_memheap_base_module_t`)
Each data structure includes the following fields:

1. Base component - memheap_version, memheap_data and memheap_init
2. Base module - holds pointers to the base component and to the
   functions: alloc, free and finalize
3. Buddy component - is a base component.
4. Buddy module - extends the base module and holds additional data on
   the component's priority, the buddy allocator, the maximal order of
   the symmetric heap, the symmetric heap itself, a pointer to the
   symmetric heap, and a hashtable maintaining the size of each
   allocated address.
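Purely as an illustration of that shape (member names and signatures
below are hypothetical; see the real definitions under
oshmem/mca/memheap):

```c
#include <stddef.h>

/* Hypothetical sketch of a base module holding its component plus the
 * alloc/free/finalize entry points described above. */
struct mca_memheap_base_component_2_0_0_t;   /* defined by the framework */

typedef struct mca_memheap_base_module_t {
    struct mca_memheap_base_component_2_0_0_t *memheap_component;
    int (*memheap_alloc)(size_t size, void **ptr);
    int (*memheap_free)(void *ptr);
    int (*memheap_finalize)(void);
} mca_memheap_base_module_t;
```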
In the case that the user decides to implement additional components,
the MEMHEAP infrastructure chooses the component with the maximal
priority.  Handling the component opening is done under the base
directory, in three stages:

1. Open all available components.  Implemented by memheap_base_open.c
   and called from shmem_init.
2. Select the maximal priority component.  This procedure involves the
   initialization of all components and then their finalization,
   except for the chosen component.  It is implemented by
   memheap_base_select.c and called from shmem_init.
3. Close the max priority active component.  Implemented by
   memheap_base_close.c and called from shmem_finalize.
## Buddy Component/Module

Responsible for handling the entire set of activities of the symmetric
heap.  The supported activities are:

1. buddy_init (initialization)
1. buddy_alloc (allocates a variable on the symmetric heap)
1. buddy_free (frees a variable previously allocated on the symmetric heap)
1. buddy_finalize (finalization)
Data members of the buddy module:

1. priority: the module's priority.
1. buddy allocator: bits, num_free, lock, and the maximal order (log2
   of the maximal size) of a variable on the symmetric heap.  The
   buddy allocator gives the offset in the symmetric heap where a
   variable should be allocated.
1. symmetric_heap: a range of reserved addresses (equal in all
   executing PEs) dedicated to "shared memory" allocation.
1. symmetric_heap_hashtable: holds the size of an allocated variable
   on the symmetric heap; used to free an allocated variable on the
   symmetric heap.
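A sketch of how those pieces fit together on the allocation path,
using hypothetical helper names for the buddy allocator and the
hashtable:

```c
#include <stddef.h>

/* Hypothetical helpers standing in for the buddy allocator and the
 * hashtable described above. */
size_t buddy_take_block(unsigned int order);
void   hashtable_put(void *addr, size_t size);

/* log2 of the smallest power-of-two block that fits the request */
static unsigned int order_for(size_t size)
{
    unsigned int order = 0;
    while (((size_t)1 << order) < size) {
        order++;
    }
    return order;
}

static void *buddy_alloc_sketch(char *symmetric_heap_base, size_t size)
{
    size_t offset = buddy_take_block(order_for(size)); /* offset into the heap */
    void *addr = symmetric_heap_base + offset;
    hashtable_put(addr, size);   /* remembered so free can recover the size */
    return addr;
}
```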
test/runtime/README.md (new file)
@ -0,0 +1,20 @@
The functions in this directory are all intended to test registry
operations against a persistent seed.  Thus, they perform a system
init/finalize.  The functions in the directory above this one should
be used to test basic registry operations within the replica - they
will isolate the replica so as to avoid the communications issues and
the init/finalize problems in other subsystems that may cause problems
here.

To run these tests, you need to first start a persistent daemon.  This
can be done using the command:

```
orted --seed --scope public --persistent
```
The daemon will "daemonize" itself and establish the registry (as well
as other central services) replica, and then return a system prompt.
You can then run any of these functions.  If desired, you can utilize
gdb and/or debug options on the persistent orted to watch/debug
replica operations as well.