openmpi/ompi/mca/pml/bfo/README

Copyright (c) 2010 Oracle and/or its affiliates.  All rights reserved.

BFO DESIGN DOCUMENT
This document describes the use and design of the bfo.  In addition,
there is a section at the end explaining why this functionality was
not merged into the ob1 PML.

1. GENERAL USAGE
First, one has to configure the failover code into the openib BTL so
that bfo will work correctly.  To do this:
configure --enable-btl-openib-failover.

Then, when running one needs to select the bfo PML explicitly.
mpirun --mca pml bfo

Note that one needs to both configure with --enable-btl-openib-failover
and run with --mca pml bfo to get the failover support.  If one of
these two steps is skipped, then the MPI job will just abort in the
case of an error like it normally does with the ob1 PML.

2. GENERAL FUNCTION
The bfo failover feature requires two or more openib BTLs in use.  In
normal operation, it will stripe the communication over the multiple
BTLs.  When an error is detected, it will stop using the BTL that
incurred the error and continue the communication over the remaining
BTL.  Once a BTL has been mapped out, it cannot be used by the job
again, even if the underlying fabric becomes functional again.  Only
new jobs started after the fabric comes back up will use both BTLs.

The bfo works in conjunction with changes that were made in the openib
BTL.  As noted above, those changes need to be configured into the
BTL for everything to work properly.

The bfo only fails over between openib BTLs.  It cannot failover from
an openib BTL to TCP, for example.

3. GENERAL DESIGN
The bfo (Btl FailOver) PML was designed to work in clusters that have
multiple openib BTLs.  It was designed to be lightweight so as to
avoid any adverse effects on latency.  To that end, there is no
tracking of fragments or messages in the bfo PML.  Rather, it depends
on the underlying BTL to notify it of each fragment that has an error.
The bfo then decides what needs to be done based on the type of
fragment that gets an error.

No additional sequence numbers were introduced in the bfo.  Instead,
it makes use of the sequence numbers that exist in the MATCH, RNDV and
RGET fragment header.  In that way, duplicate fragments that have
MATCH information in them can be detected.  Other fragments, like PUT
and ACK, are never retransmitted so it does not matter that they do
not have sequence numbers.  The FIN header was a special case in that
it was changed to include the MATCH header so that the tag, source,
and context fields could be used to check for duplicate FINs.

Note that the assumption is that the underlying BTL will always issue
a callback with an error flag when it thinks a fragment has an error.
This means that even after an error is detected on a BTL, the BTL
continues to be checked for any other messages that may also complete
with an error.  This is potentially a unique characteristic of the
openib BTL when running over RC connections that allows the BFO to
work properly.

One scenario that is particularly difficult to handle is the case
where a fragment has an error but the message actually makes it to the
other side.  It is because of this that all fragments need to be
checked to make sure they are not a duplicate.  This scenario also
complicates some of the rendezvous protocols as the two sides may not
agree where the problem occurred.  For example, one can imagine a
sender getting an error on a final FIN message, but the FIN message
actually arrives at the other side.  The receiver thinks the
communication is done and moves on.  The sender thinks there was a
problem, and that the communication needs to restart.

It is also important to note that a message cannot signal a successful
completion and *not* make it to the receiver.  This would probably cause
the bfo to hang.

4. ERRORS
Errors are detected in the openib BTL layer and propagated to the PML
layer.  Typically, the errors occur while polling the completion
queue, but can happen in other areas as well.  When an error occurs,
an additional callback is called so the PML can map out the connection
for future sending.  Then the callback associated with the fragment is
called, but with the error field set to OMPI_ERROR.  This way, the PML
knows that this fragment may not have made it to the remote side.

The first callback into the PML is via the mca_pml_bfo_error_handler()
callback and the PML uses this to remove a connection for future
sending.  If the error_proc_t field is NULL, then the entire BTL is
removed for any future communication.  If the error_proc_t is not
NULL, then the BTL is only removed for the connection associated with
the error_proc_t.

The second callback is the standard one for a completion event, and
this can trigger various activities in the PML.  The regular callback
function is called but the status is set to OMPI_ERROR.  The PML layer
detects this and calls some failover specific routines depending on
the type of fragment that got the error.


5. RECOVERY OF MATCH FRAGMENTS
Note: For a general description of how the various fragments interact,
see Appendix 1 at the end of this document.

In the case of a MATCH fragment, the fragment is simply resent.  Care
has to be taken with a MATCH fragment that is sent via the standard
interface and one that is sent via the sendi interface.  In the
standard send, the send request is still available and is therefore
reset reused to send the MATCH fragment.  In the case of the sendi
fragment, the send request is gone, so the fragment is regenerated
from the information contained within the fragment.

6. RECOVERY OF RNDV or LARGE MESSAGE RDMA
In the case of a large message RDMA transfer or a RNDV transfer where
the message consists of several fragments, the restart is a little
more complicated.  This includes fragments like RNDV, PUT, RGET, FRAG,
FIN, and RDMA write and RDMA read completions. In most cases, the
requests associated with these fragments are reset and restarted.

First, it should be pointed out that a new variable was added to the
send and receive requests.  This variable tracks outstanding send
events that have not yet received their completion events.  This new
variable is used so that a request is not restarted until all the
outstanding events have completed.  If one does not wait for the
outstanding events to complete, then one may restart a request and
then a completion event will happen on the wrong request.

There is a second variable added to each request and that is one that
shows whether the request is already in an error state.  When a request
reaches the state that it has an error flagged on it and the outstanding
completion events are down to zero, it can start the restart dance
as described below.

7. SPECIAL CASE FOR FIN FRAGMENT
Like the MATCH fragment, the FIN message is also simply resent.  Like
the sendi MATCH fragment, there may be no request associated with the
FIN message when it gets an error, so the fragment is recreated from
the information in the fragment.  The FIN fragment was modified to
have additional information like what is in a MATCH fragment including
the context, source, and tag.  In this way, we can figure out if the
FIN message is a duplicate on the receiving side.

8. RESTART DANCE
When the bfo determines that there are no outstanding completion events,
a restart dance is initiated.  There are four new PML message types that
have been created to participate in the dance.
  1. RNDVRESTARTNOTIFY
  2. RECVERRNOTIFY
  3. RNDVRESTARTACK
  4. RNDVRESTARTNACK

When the send request is in an error state and the outstanding
completion events is zero, RNDVRESTARTNOTIFY is sent from the sender
to the receiver to let it know that the communication needs to be
restarted.  Upon receipt of the RNDVRESTARTNOTIFY, the receiver first
checks to make sure that it is still pointing to a valid receiver
request.  If so, it marks the receive request in error.  It then
checks to see if there are any outstanding completion events on the
receiver.  If there are no outstanding completion events, the receiver
sends the RNDVRESTARTACK.  If there are outstanding completion events,
then the RNDVRESTARTACK gets sent later when a completion event occurs
that brings the outstanding event count to zero.

In the case that the receiver determines that it is no longer looking
at a valid receive request, which means the request is complete, the
receiver responds with a RNDVRESTARTNACK.  While rare, this case can
happen for example, when a final FRAG message triggers an error on the
sender, but actually makes it to the receiver.

The RECVERRNOTIFY fragment is used so the receiver can let the sender
sender know that it had an error.  The sender then waits for all of
its completion events, and then sends a RNDVRESTARTNOTIFY.

All the handling of these new messages is contained in the
pml_bfo_failover files.

9. BTL SUPPORT
The openib BTL also supplies a lot of support for the bfo PML.  First,
fragments can be stored in the BTL during normal operation if
resources become scarce.  This means that when an error is detected in
the BTL, it needs to scour its internal queues for fragments that are
destined for the BTL and error them out.  The function
error_out_all_pending_frags() takes care of this functionality.  And
some of the fragments stored can be coalesced, so care has to be taken
to tease out each message from a coalesced fragment.

There is also some special code in the BTL to handle some strange
occurrences that were observed in the BTL.  First, there are times
where only one half of the connection gets an error.  This can result
in a mismatch between what the PML thinks is available to it and can
cause hangs.  Therefore, when a BTL detects an error, it sends a
special message down the working BTL connection to tell the remote
side that it needs to be brought down as well.

Secondly, it has been observed that a message can get stuck in the
eager RDMA connection between two BTLs.  In this case, an error is
detected on one side, but the other side never sees the message.
Therefore, a special message is sent to the other side telling it to
move along in the eager RDMA connection.  This is all somewhat
confusing.  See the code in the btl_openib_failover.c file for the
details.

10. MERGING
Every effort was made to try and merge the bfo PML into the ob1 PML.
The idea was that any upgrades to the ob1 PML would automatically make
it into the bfo PML and this would enhance maintainability of all the
code.  However, it was deemed that this merging would cause more
problems than it would solve.  What was attempted and why the
conclusion was made are documented here.

One can look at the bfo and easily see the differences between it and
ob1.  All the bfo specific code is surrounded by #if PML_BFO.  In
addition, there are two additional files in the bfo,
pml_bfo_failover.c and pml_bfo_failover.h.

To merge them, the following was attempted.  First, add all the code
in #if regions into the ob1 PML.  As of this writing, there are 73
#ifs that would have to be added into ob1.

Secondly, remove almost all the pml_bfo files and replace them with
links to the ob1 files.

Third, create a new header file that did name shifting of all the
functions so that ob1 and bfo could live together.  This also included
having to create macros for the names of header files as well.  To
help illustrate the name shifting issue, here is what the file might
look like in the bfo directory.

/* Need macros for the header files as they are different in the
 * different PMLs */
#define PML "bfo"
#define PML_OB1_H "pml_bfo.h"
#define PML_OB1_COMM_H "pml_bfo_comm.h"
#define PML_OB1_COMPONENT_H "pml_bfo_component.h"
#define PML_OB1_HDR_H "pml_bfo_hdr.h"
#define PML_OB1_RDMA_H "pml_bfo_rdma.h"
#define PML_OB1_RDMAFRAG_H "pml_bfo_rdmafrag.h"
#define PML_OB1_RECVFRAG_H "pml_bfo_recvfrag.h"
#define PML_OB1_RECVREQ_H "pml_bfo_recvreq.h"
#define PML_OB1_SENDREQ_H "pml_bfo_sendreq.h"

/* Name shifting of functions from ob1 to bfo (incomplete list) */
#define mca_pml_ob1 mca_pml_bfo
#define mca_pml_ob1_t mca_pml_bfo_t
#define mca_pml_ob1_component mca_pml_bfo_component
#define mca_pml_ob1_add_procs mca_pml_bfo_add_procs
#define mca_pml_ob1_del_procs mca_pml_bfo_del_procs
#define mca_pml_ob1_enable mca_pml_bfo_enable
#define mca_pml_ob1_progress mca_pml_bfo_progress
#define mca_pml_ob1_add_comm mca_pml_bfo_add_comm
#define mca_pml_ob1_del_comm mca_pml_bfo_del_comm
#define mca_pml_ob1_irecv_init mca_pml_bfo_irecv_init
#define mca_pml_ob1_irecv mca_pml_bfo_irecv
#define mca_pml_ob1_recv mca_pml_bfo_recv
#define mca_pml_ob1_isend_init mca_pml_bfo_isend_init
#define mca_pml_ob1_isend mca_pml_bfo_isend
#define mca_pml_ob1_send mca_pml_bfo_send
#define mca_pml_ob1_iprobe mca_pml_bfo_iprobe
[...and much more ...]

The pml_bfo_hdr.h file was not a link because the changes in it were
so extensive.  Also the Makefile was kept separate so it could include
the additional failover files as well as add a compile directive that
would force the files to be compiled as bfo instead of ob1.

After these changes were made, several independent developers reviewed
the results and concluded that making these changes would have too
much of a negative impact on ob1 maintenance.  First, the code became
much harder to read with all the additional #ifdefs.  Secondly, the
possibility of adding other features, like csum, to ob1 would only
make this issue even worse.  Therefore, it was decided to keep the bfo
PML separate from ob1.

11. UTILITIES
In an ideal world, any bug fixes that are made in the ob1 PML would
also be made in the csum and the bfo PMLs.  However, that does not
always happen.  Therefore, there are two new utilities added to the
contrib directory.

check-ob1-revision.pl
check-ob1-pml-diffs.pl

The first one can be run to see if ob1 has changed from its last known
state.  Here is an example.

 machine =>check-ob1-revision.pl
Running svn diff -r24138 ../ompi/mca/pml/ob1
No new changes detected in ob1.  Everything is fine.

If there are differences, then one needs to review them and potentially
add them to the bfo (and csum also if one feels like it).
After that, bump up the value in the script to the latest value.

The second script allows one to see the differences between the ob1
and bfo PML.  Here is an example.

 machine =>check-ob1-pml-diffs.pl

Starting script to check differences between bfo and ob1...
Files Compared: pml_ob1.c and pml_bfo.c
No differences encountered
Files Compared: pml_ob1.h and pml_bfo.h
[...snip...]
Files Compared: pml_ob1_start.c and pml_bfo_start.c
No differences encountered

There is a lot more in the script that tells how it is used.


Appendix 1: SIMPLE OVERVIEW OF COMMUNICATION PROTOCOLS
The drawings below attempt to describe some of the general flow of
fragments in the various protocols that are supported in the PMLs.
The "read" and "write" are actual RDMA actions and do not pertain to
fragments that are sent.  As can be inferred, they use FIN messages to
indicate their completion.


MATCH PROTOCOL
sender >->->-> MATCH >->->-> receiver

SEND WITH MULTIPLE FRAGMENTS
sender >->->-> RNDV  >->->-> receiver
       <-<-<-<  ACK  <-<-<-<
       >->->-> FRAG  >->->->
       >->->-> FRAG  >->->->
       >->->-> FRAG  >->->->

RDMA PUT
sender >->->-> RNDV  >->->-> receiver
       <-<-<-<  PUT  <-<-<-<
       <-<-<-<  PUT  <-<-<-<
       >->->-> write >->->->
       >->->->  FIN  >->->->
       >->->-> write >->->->
       >->->->  FIN  >->->->

RMA GET
sender >->->-> RGET  >->->-> receiver
       <-<-<-< read  <-<-<-<
       <-<-<-<  FIN  <-<-<-<