Brief description of bfo functionality.

This commit was SVN r24186.
2010-12-17 19:12:00 +00:00 · 2010-12-17 19:12:00 +00:00 · a57e5587f6
--- a/ompi/mca/pml/bfo/README
+++ b/ompi/mca/pml/bfo/README
@ -0,0 +1,340 @@
+Copyright (c) 2010 Oracle and/or its affiliates.  All rights reserved.
+
+BFO DESIGN DOCUMENT
+This document describes the use and design of the bfo.  In addition,
+there is a section at the end explaining why this functionality was
+not merged into the ob1 PML.
+
+1. GENERAL USAGE
+First, one has to configure the failover code into the openib BTL so
+that bfo will work correctly.  To do this:
+configure --enable-openib-failover.
+
+Then, when running one needs to select the bfo PML explicitly.
+mpirun --mca pml bfo
+
+Note that one needs to both configure with --enable-openib-failover
+and run with --mca pml bfo to get the failover support.  If one of
+these two steps is skipped, then the MPI job will just abort in the
+case of an error like it normally does with the ob1 PML.
+
+2. GENERAL FUNCTION
+The bfo failover feature requires two or more openib BTLs in use.  In
+normal operation, it will stripe the communication over the multiple
+BTLs.  When an error is detected, it will stop using the BTL that
+incurred the error and continue the communication over the remaining
+BTL.  Once a BTL has been mapped out, it cannot be used by the job
+again, even if the underlying fabric becomes functional again.  Only
+new jobs started after the fabric comes back up will use both BTLs.
+
+The bfo works in conjunction with changes that were made in the openib
+BTL.  As noted above, those changes need to be configured into the
+BTL for everything to work properly.
+
+The bfo only fails over between openib BTLs.  It cannot failover from
+an openib BTL to TCP, for example.
+ 
+3. GENERAL DESIGN
+The bfo (Btl FailOver) PML was designed to work in clusters that have
+multiple openib BTLs.  It was designed to be lightweight so as to
+avoid any adverse effects on latency.  To that end, there is no
+tracking of fragments or messages in the bfo PML.  Rather, it depends
+on the underlying BTL to notify it of each fragment that has an error.
+The bfo then decides what needs to be done based on the type of
+fragment that gets an error.
+
+No additional sequence numbers were introduced in the bfo.  Instead,
+it makes use of the sequence numbers that exist in the MATCH, RNDV and
+RGET fragment header.  In that way, duplicate fragments that have
+MATCH information in them can be detected.  Other fragments, like PUT
+and ACK, are never retransmitted so it does not matter that they do
+not have sequence numbers.  The FIN header was a special case in that
+it was changed to include the MATCH header so that the tag, source,
+and context fields could be used to check for duplicate FINs.
+
+Note that the assumption is that the underlying BTL will always issue
+a callback with an error flag when it thinks a fragment has an error.
+This means that even after an error is detected on a BTL, the BTL
+continues to be checked for any other messages that may also complete
+with an error.  This is potentially a unique characteristic of the
+openib BTL when running over RC connections that allows the BFO to
+work properly.
+
+One scenario that is particularly difficult to handle is the case
+where a fragment has an error but the message actually makes it to the
+other side.  It is because of this that all fragments need to be
+checked to make sure they are not a duplicate.  This scenario also
+complicates some of the rendezvous protocols as the two sides may not
+agree where the problem occurred.  For example, one can imagine a
+sender getting an error on a final FIN message, but the FIN message
+actually arrives at the other side.  The receiver thinks the
+communication is done and moves on.  The sender thinks there was a
+problem, and that the communication needs to restart.
+
+It is also important to note that a message cannot signal a successful
+completion and *not* make it to the receiver.  This would probably cause
+the bfo to hang.
+
+4. ERRORS
+Errors are detected in the openib BTL layer and propagated to the PML
+layer.  Typically, the errors occur while polling the completion
+queue, but can happen in other areas as well.  When an error occurs,
+an additional callback is called so the PML can map out the connection
+for future sending.  Then the callback associated with the fragment is
+called, but with the error field set to OMPI_ERROR.  This way, the PML
+knows that this fragment may not have made it to the remote side.
+
+The first callback into the PML is via the mca_pml_bfo_error_handler()
+callback and the PML uses this to remove a connection for future
+sending.  If the error_proc_t field is NULL, then the entire BTL is
+removed for any future communication.  If the error_proc_t is not
+NULL, then the BTL is only removed for the connection associated with
+the error_proc_t.
+
+The second callback is the standard one for a completion event, and
+this can trigger various activities in the PML.  The regular callback
+function is called but the status is set to OMPI_ERROR.  The PML layer
+detects this and calls some failover specific routines depending on
+the type of fragment that got the error.
+
+
+5. RECOVERY OF MATCH FRAGMENTS
+Note: For a general description of how the various fragments interact,
+see Appendix 1 at the end of this document.
+
+In the case of a MATCH fragment, the fragment is simply resent.  Care
+has to be taken with a MATCH fragment that is sent via the standard
+interface and one that is sent via the sendi interface.  In the
+standard send, the send request is still available and is therefore
+reset reused to send the MATCH fragment.  In the case of the sendi
+fragment, the send request is gone, so the fragment is regenerated
+from the information contained within the fragment.
+
+6. RECOVERY OF RNDV or LARGE MESSAGE RDMA
+In the case of a large message RDMA transfer or a RNDV transfer where
+the message consists of several fragments, the restart is a little
+more complicated.  This includes fragments like RNDV, PUT, RGET, FRAG,
+FIN, and RDMA write and RDMA read completions. In most cases, the
+requests associated with these fragments are reset and restarted.
+
+First, it should be pointed out that a new variable was added to the
+send and receive requests.  This variable tracks outstanding send
+events that have not yet received their completion events.  This new
+variable is used so that a request is not restarted until all the
+outstanding events have completed.  If one does not wait for the
+outstanding events to complete, then one may restart a request and
+then a completion event will happen on the wrong request.
+
+There is a second variable added to each request and that is one that
+shows whether the request is already in an error state.  When a request
+reaches the state that it has an error flagged on it and the outstanding
+completion events are down to zero, it can start the restart dance
+as described below.
+
+7. SPECIAL CASE FOR FIN FRAGMENT
+Like the MATCH fragment, the FIN message is also simply resent.  Like
+the sendi MATCH fragment, there may be no request associated with the
+FIN message when it gets an error, so the fragment is recreated from
+the information in the fragment.  The FIN fragment was modified to
+have additional information like what is in a MATCH fragment including
+the context, source, and tag.  In this way, we can figure out if the
+FIN message is a duplicate on the receiving side.
+
+8. RESTART DANCE
+When the bfo determines that there are no outstanding completion events,
+a restart dance is initiated.  There are four new PML message types that
+have been created to participate in the dance.
+  1. RNDVRESTARTNOTIFY
+  2. RECVERRNOTIFY
+  3. RNDVRESTARTACK
+  4. RNDVRESTARTNACK
+
+When the send request is in an error state and the outstanding
+completion events is zero, RNDVRESTARTNOTIFY is sent from the sender
+to the receiver to let it know that the communication needs to be
+restarted.  Upon receipt of the RNDVRESTARTNOTIFY, the receiver first
+checks to make sure that it is still pointing to a valid receiver
+request.  If so, it marks the receive request in error.  It then
+checks to see if there are any outstanding completion events on the
+receiver.  If there are no outstanding completion events, the receiver
+sends the RNDVRESTARTACK.  If there are outstanding completion events,
+then the RNDVRESTARTACK gets sent later when a completion event occurs
+that brings the outstanding event count to zero.
+
+In the case that the receiver determines that it is no longer looking
+at a valid receive request, which means the request is complete, the
+receiver responds with a RNDVRESTARTNACK.  While rare, this case can
+happen for example, when a final FRAG message triggers an error on the
+sender, but actually makes it to the receiver.
+
+The RECVERRNOTIFY fragment is used so the receiver can let the sender
+sender know that it had an error.  The sender then waits for all of
+its completion events, and then sends a RNDVRESTARTNOTIFY.
+
+All the handling of these new messages is contained in the
+pml_bfo_failover files.
+
+9. BTL SUPPORT
+The openib BTL also supplies a lot of support for the bfo PML.  First,
+fragments can be stored in the BTL during normal operation if
+resources become scarce.  This means that when an error is detected in
+the BTL, it needs to scour its internal queues for fragments that are
+destined for the BTL and error them out.  The function
+error_out_all_pending_frags() takes care of this functionality.  And
+some of the fragments stored can be coalesced, so care has to be taken
+to tease out each message from a coalesced fragment.
+
+There is also some special code in the BTL to handle some strange
+occurrences that were observed in the BTL.  First, there are times
+where only one half of the connection gets an error.  This can result
+in a mismatch between what the PML thinks is available to it and can
+cause hangs.  Therefore, when a BTL detects an error, it sends a
+special message down the working BTL connection to tell the remote
+side that it needs to be brought down as well.
+
+Secondly, it has been observed that a message can get stuck in the
+eager RDMA connection between two BTLs.  In this case, an error is
+detected on one side, but the other side never sees the message.
+Therefore, a special message is sent to the other side telling it to
+move along in the eager RDMA connection.  This is all somewhat
+confusing.  See the code in the btl_openib_failover.c file for the
+details.
+
+10. MERGING
+Every effort was made to try and merge the bfo PML into the ob1 PML.
+The idea was that any upgrades to the ob1 PML would automatically make
+it into the bfo PML and this would enhance maintainability of all the
+code.  However, it was deemed that this merging would cause more
+problems than it would solve.  What was attempted and why the
+conclusion was made are documented here.
+
+One can look at the bfo and easily see the differences between it and
+ob1.  All the bfo specific code is surrounded by #if PML_BFO.  In
+addition, there are two additional files in the bfo,
+pml_bfo_failover.c and pml_bfo_failover.h.
+
+To merge them, the following was attempted.  First, add all the code
+in #if regions into the ob1 PML.  As of this writing, there are 73
+#ifs that would have to be added into ob1.
+
+Secondly, remove almost all the pml_bfo files and replace them with
+links to the ob1 files.
+
+Third, create a new header file that did name shifting of all the
+functions so that ob1 and bfo could live together.  This also included
+having to create macros for the names of header files as well.  To
+help illustrate the name shifting issue, here is what the file might
+look like in the bfo directory.
+
+/* Need macros for the header files as they are different in the 
+ * different PMLs */
+#define PML "bfo"
+#define PML_OB1_H "pml_bfo.h"
+#define PML_OB1_COMM_H "pml_bfo_comm.h"
+#define PML_OB1_COMPONENT_H "pml_bfo_component.h"
+#define PML_OB1_HDR_H "pml_bfo_hdr.h"
+#define PML_OB1_RDMA_H "pml_bfo_rdma.h"
+#define PML_OB1_RDMAFRAG_H "pml_bfo_rdmafrag.h"
+#define PML_OB1_RECVFRAG_H "pml_bfo_recvfrag.h"
+#define PML_OB1_RECVREQ_H "pml_bfo_recvreq.h"
+#define PML_OB1_SENDREQ_H "pml_bfo_sendreq.h"
+
+/* Name shifting of functions from ob1 to bfo (incomplete list) */
+#define mca_pml_ob1 mca_pml_bfo
+#define mca_pml_ob1_t mca_pml_bfo_t 
+#define mca_pml_ob1_component mca_pml_bfo_component
+#define mca_pml_ob1_add_procs mca_pml_bfo_add_procs
+#define mca_pml_ob1_del_procs mca_pml_bfo_del_procs
+#define mca_pml_ob1_enable mca_pml_bfo_enable
+#define mca_pml_ob1_progress mca_pml_bfo_progress
+#define mca_pml_ob1_add_comm mca_pml_bfo_add_comm
+#define mca_pml_ob1_del_comm mca_pml_bfo_del_comm
+#define mca_pml_ob1_irecv_init mca_pml_bfo_irecv_init
+#define mca_pml_ob1_irecv mca_pml_bfo_irecv
+#define mca_pml_ob1_recv mca_pml_bfo_recv
+#define mca_pml_ob1_isend_init mca_pml_bfo_isend_init
+#define mca_pml_ob1_isend mca_pml_bfo_isend
+#define mca_pml_ob1_send mca_pml_bfo_send
+#define mca_pml_ob1_iprobe mca_pml_bfo_iprobe
+[...and much more ...]
+
+The pml_bfo_hdr.h file was not a link because the changes in it were
+so extensive.  Also the Makefile was kept separate so it could include
+the additional failover files as well as add a compile directive that
+would force the files to be compiled as bfo instead of ob1.
+
+After these changes were made, several independent developers reviewed
+the results and concluded that making these changes would have too
+much of a negative impact on ob1 maintenance.  First, the code became
+much harder to read with all the additional #ifdefs.  Secondly, the
+possibility of adding other features, like csum, to ob1 would only
+make this issue even worse.  Therefore, it was decided to keep the bfo
+PML separate from ob1.
+
+11. UTILITIES
+In an ideal world, any bug fixes that are made in the ob1 PML would
+also be made in the csum and the bfo PMLs.  However, that does not
+always happen.  Therefore, there are two new utilities added to the
+contrib directory.
+
+check-ob1-revision.pl
+check-ob1-pml-diffs.pl
+
+The first one can be run to see if ob1 has changed from its last known
+state.  Here is an example.
+
+ machine =>check-ob1-revision.pl
+Running svn diff -r24138 ../ompi/mca/pml/ob1
+No new changes detected in ob1.  Everything is fine.
+
+If there are differences, then one needs to review them and potentially
+add them to the bfo (and csum also if one feels like it).
+After that, bump up the value in the script to the latest value.
+
+The second script allows one to see the differences between the ob1
+and bfo PML.  Here is an example.
+
+ machine =>check-ob1-pml-diffs.pl
+
+Starting script to check differences between bfo and ob1...
+Files Compared: pml_ob1.c and pml_bfo.c
+No differences encountered
+Files Compared: pml_ob1.h and pml_bfo.h
+[...snip...]
+Files Compared: pml_ob1_start.c and pml_bfo_start.c
+No differences encountered
+
+There is a lot more in the script that tells how it is used.
+
+
+Appendix 1: SIMPLE OVERVIEW OF COMMUNICATION PROTOCOLS
+The drawings below attempt to describe some of the general flow of
+fragments in the various protocols that are supported in the PMLs.
+The "read" and "write" are actual RDMA actions and do not pertain to
+fragments that are sent.  As can be inferred, they use FIN messages to
+indicate their completion.
+
+
+MATCH PROTOCOL
+sender >->->-> MATCH >->->-> receiver
+
+SEND WITH MULTIPLE FRAGMENTS
+sender >->->-> RNDV  >->->-> receiver
+       <-<-<-<  ACK  <-<-<-<
+       >->->-> FRAG  >->->->
+       >->->-> FRAG  >->->->
+       >->->-> FRAG  >->->->
+
+RDMA PUT
+sender >->->-> RNDV  >->->-> receiver
+       <-<-<-<  PUT  <-<-<-<
+       <-<-<-<  PUT  <-<-<-<
+       >->->-> write >->->->
+       >->->->  FIN  >->->->
+       >->->-> write >->->->
+       >->->->  FIN  >->->->
+
+RMA GET
+sender >->->-> RGET  >->->-> receiver
+       <-<-<-< read  <-<-<-< 
+       <-<-<-<  FIN  <-<-<-<