Ralph Castain a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.
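
For illustration, here is a minimal sketch of such a macro, assuming opal_progress() is the only call needed to drive the MPI-layer progress loop (the authoritative definition lives in ompi/mca/rte/rte.h):

/* Sketch only: spin the MPI-layer progress engine until the caller's
 * flag is cleared by the relevant completion callback. */
#define OMPI_WAIT_FOR_COMPLETION(flag)  \
    do {                                \
        while ((flag)) {                \
            opal_progress();            \
        }                               \
    } while (0)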

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in https://bitbucket.org/rhc/ompi-oob2


WHAT: Rewrite of ORTE OOB

WHY: Support asynchronous progress and a host of other features

WHEN: Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we have had issues when multiple NICs are available, as the code doesn't "shift" messages between transports - thus, all nodes had to be reachable via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC (a sketch of this check appears after this list).

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC is connected to the next "hop" if that peer cannot be reached via the incoming NIC. If no TCP module can reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules regarding OBJ_RETAIN, as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.
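
As promised in the multi-address NIC item above, here is a hedged sketch of the reachability test. It assumes IPv4 addresses stored as uint32_t values in network byte order, and all names (nic_addr_t, nic_can_reach) are illustrative rather than the actual OOB/TCP data structures:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t addr;   /* one address assigned to the NIC (network byte order) */
    uint32_t mask;   /* the netmask paired with that address */
} nic_addr_t;

/* A peer is deemed reachable from this NIC if its contact address
 * falls within the subnet of any address/mask pair the NIC exposes. */
static bool nic_can_reach(const nic_addr_t *addrs, size_t naddrs,
                          uint32_t peer_addr)
{
    for (size_t i = 0; i < naddrs; ++i) {
        if ((peer_addr & addrs[i].mask) == (addrs[i].addr & addrs[i].mask)) {
            return true;
        }
    }
    return false;
}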


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order delivery in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC (one possible scheme is sketched below)
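
One possible in-order delivery scheme is sketched below. The fixed reorder window and all names are assumptions for illustration; the real RML would size and manage its buffering differently:

#include <stdint.h>
#include <stdio.h>

#define WINDOW 64   /* assumed maximum out-of-order distance we buffer */

typedef struct {
    uint32_t expected;          /* next sequence number to deliver */
    const char *held[WINDOW];   /* early arrivals parked until the gap fills */
} peer_order_t;                 /* zero-initialize one per peer */

static void deliver(const char *payload) { printf("deliver: %s\n", payload); }

static void rml_recv_ordered(peer_order_t *p, uint32_t seq, const char *payload)
{
    if (seq != p->expected) {
        p->held[seq % WINDOW] = payload;   /* arrived early: park it */
        return;
    }
    deliver(payload);
    p->expected++;
    /* drain any held messages that are now in order */
    while (NULL != p->held[p->expected % WINDOW]) {
        deliver(p->held[p->expected % WINDOW]);
        p->held[p->expected % WINDOW] = NULL;
        p->expected++;
    }
}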

This commit was SVN r29058.

Copyright (c) 2010 Oracle and/or its affiliates.  All rights reserved.

BFO DESIGN DOCUMENT
This document describes the use and design of the bfo.  In addition,
there is a section at the end explaining why this functionality was
not merged into the ob1 PML.

1. GENERAL USAGE
First, one has to configure the failover code into the openib BTL so
that bfo will work correctly.  To do this:
configure --enable-btl-openib-failover.

Then, when running one needs to select the bfo PML explicitly.
mpirun --mca pml bfo

Note that one needs to both configure with --enable-btl-openib-failover
and run with --mca pml bfo to get the failover support.  If one of
these two steps is skipped, then the MPI job will just abort in the
case of an error like it normally does with the ob1 PML.

2. GENERAL FUNCTION
The bfo failover feature requires two or more openib BTLs to be in use.  In
normal operation, it will stripe the communication over the multiple
BTLs.  When an error is detected, it will stop using the BTL that
incurred the error and continue the communication over the remaining
BTL.  Once a BTL has been mapped out, it cannot be used by the job
again, even if the underlying fabric becomes functional again.  Only
new jobs started after the fabric comes back up will use both BTLs.

The bfo works in conjunction with changes that were made in the openib
BTL.  As noted above, those changes need to be configured into the
BTL for everything to work properly.

The bfo only fails over between openib BTLs.  It cannot failover from
an openib BTL to TCP, for example.
 
3. GENERAL DESIGN
The bfo (Btl FailOver) PML was designed to work in clusters that have
multiple openib BTLs.  It was designed to be lightweight so as to
avoid any adverse effects on latency.  To that end, there is no
tracking of fragments or messages in the bfo PML.  Rather, it depends
on the underlying BTL to notify it of each fragment that has an error.
The bfo then decides what needs to be done based on the type of
fragment that gets an error.

No additional sequence numbers were introduced in the bfo.  Instead,
it makes use of the sequence numbers that exist in the MATCH, RNDV and
RGET fragment headers.  In that way, duplicate fragments that have
MATCH information in them can be detected.  Other fragments, like PUT
and ACK, are never retransmitted so it does not matter that they do
not have sequence numbers.  The FIN header was a special case in that
it was changed to include the MATCH header so that the tag, source,
and context fields could be used to check for duplicate FINs.
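
To make the duplicate check concrete, here is a hedged sketch using wraparound-safe sequence comparison. The field names and 16-bit width are assumptions, not the actual header layout:

#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a precedes b" for 16-bit sequence numbers. */
static bool seq_before(uint16_t a, uint16_t b)
{
    return (int16_t)(uint16_t)(a - b) < 0;
}

typedef struct {
    uint16_t next_expected;   /* next MATCH-carrying sequence from this peer */
} peer_seq_t;

/* A fragment whose sequence number precedes the next expected one has
 * already been matched once: it is a duplicate from a retransmission
 * and must be dropped rather than matched again. */
static bool is_duplicate(const peer_seq_t *p, uint16_t seq)
{
    return seq_before(seq, p->next_expected);
}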

Note that the assumption is that the underlying BTL will always issue
a callback with an error flag when it thinks a fragment has an error.
This means that even after an error is detected on a BTL, the BTL
continues to be checked for any other messages that may also complete
with an error.  This behavior, which may be unique to the openib BTL
running over RC connections, is what allows the bfo to work properly.

One scenario that is particularly difficult to handle is the case
where a fragment has an error but the message actually makes it to the
other side.  It is because of this that all fragments need to be
checked to make sure they are not a duplicate.  This scenario also
complicates some of the rendezvous protocols as the two sides may not
agree where the problem occurred.  For example, one can imagine a
sender getting an error on a final FIN message, but the FIN message
actually arrives at the other side.  The receiver thinks the
communication is done and moves on.  The sender thinks there was a
problem, and that the communication needs to restart.

It is also important to note that a message cannot signal a successful
completion and *not* make it to the receiver.  This would probably cause
the bfo to hang.

4. ERRORS
Errors are detected in the openib BTL layer and propagated to the PML
layer.  Typically, the errors occur while polling the completion
queue, but can happen in other areas as well.  When an error occurs,
an additional callback is called so the PML can map out the connection
for future sending.  Then the callback associated with the fragment is
called, but with the error field set to OMPI_ERROR.  This way, the PML
knows that this fragment may not have made it to the remote side.

The first callback into the PML is via the mca_pml_bfo_error_handler()
callback and the PML uses this to remove a connection for future
sending.  If the error_proc_t field is NULL, then the entire BTL is
removed for any future communication.  If the error_proc_t is not
NULL, then the BTL is only removed for the connection associated with
the error_proc_t.
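
In sketch form, the decision looks like this (stub types and helper names are illustrative; the real entry point is mca_pml_bfo_error_handler() with the PML/BTL interface types):

#include <stddef.h>

typedef struct btl_module  btl_t;    /* stand-in for the BTL module type */
typedef struct error_proc  proc_t;   /* stand-in for error_proc_t */

/* Illustrative stubs for the PML bookkeeping. */
static void remove_btl_for_all_procs(btl_t *btl)            { (void)btl; }
static void remove_btl_for_one_proc(btl_t *btl, proc_t *ep) { (void)btl; (void)ep; }

static void pml_error_handler_sketch(btl_t *btl, proc_t *errproc)
{
    if (NULL == errproc) {
        /* the whole BTL failed: map it out of every connection */
        remove_btl_for_all_procs(btl);
    } else {
        /* only the connection to this one peer failed */
        remove_btl_for_one_proc(btl, errproc);
    }
}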

The second callback is the standard one for a completion event, and
this can trigger various activities in the PML.  The regular callback
function is called but the status is set to OMPI_ERROR.  The PML layer
detects this and calls some failover specific routines depending on
the type of fragment that got the error.


5. RECOVERY OF MATCH FRAGMENTS
Note: For a general description of how the various fragments interact,
see Appendix 1 at the end of this document.

In the case of a MATCH fragment, the fragment is simply resent.  Care
has to be taken with a MATCH fragment that is sent via the standard
interface and one that is sent via the sendi interface.  In the
standard send, the send request is still available and is therefore
reset and reused to send the MATCH fragment.  In the case of the sendi
fragment, the send request is gone, so the fragment is regenerated
from the information contained within the fragment.

6. RECOVERY OF RNDV or LARGE MESSAGE RDMA
In the case of a large message RDMA transfer or a RNDV transfer where
the message consists of several fragments, the restart is a little
more complicated.  This includes fragments like RNDV, PUT, RGET, FRAG,
FIN, and RDMA write and RDMA read completions. In most cases, the
requests associated with these fragments are reset and restarted.

First, it should be pointed out that a new variable was added to the
send and receive requests.  This variable tracks outstanding send
events that have not yet received their completion events.  This new
variable is used so that a request is not restarted until all the
outstanding events have completed.  If one does not wait for the
outstanding events to complete, then one may restart a request and
then a completion event will happen on the wrong request.

A second variable was also added to each request to show whether the
request is already in an error state.  When a request has an error
flagged on it and its outstanding completion events have drained to
zero, it can start the restart dance as described below.
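
In sketch form (the fields and names are illustrative, not the actual send/receive request structures):

#include <stdbool.h>

typedef struct {
    int  events_outstanding;   /* send events still awaiting completion */
    bool in_error;             /* an error has been flagged on this request */
} req_state_t;

static void start_restart_dance(void) { /* see section 8 */ }

/* Called once per completion event; 'failed' marks an errored event. */
static void on_completion_event(req_state_t *r, bool failed)
{
    if (failed) {
        r->in_error = true;
    }
    if (0 == --r->events_outstanding && r->in_error) {
        /* nothing is left in flight, so it is now safe to restart */
        start_restart_dance();
    }
}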

7. SPECIAL CASE FOR FIN FRAGMENT
Like the MATCH fragment, the FIN message is also simply resent.  Like
the sendi MATCH fragment, there may be no request associated with the
FIN message when it gets an error, so the fragment is recreated from
the information in the fragment.  The FIN fragment was modified to
have additional information like what is in a MATCH fragment including
the context, source, and tag.  In this way, we can figure out if the
FIN message is a duplicate on the receiving side.

8. RESTART DANCE
When the bfo determines that there are no outstanding completion events,
a restart dance is initiated.  There are four new PML message types that
have been created to participate in the dance.
  1. RNDVRESTARTNOTIFY
  2. RECVERRNOTIFY
  3. RNDVRESTARTACK
  4. RNDVRESTARTNACK

When the send request is in an error state and the count of
outstanding completion events reaches zero, RNDVRESTARTNOTIFY is sent
from the sender
to the receiver to let it know that the communication needs to be
restarted.  Upon receipt of the RNDVRESTARTNOTIFY, the receiver first
checks to make sure that it is still pointing to a valid receive
request.  If so, it marks the receive request in error.  It then
checks to see if there are any outstanding completion events on the
receiver.  If there are no outstanding completion events, the receiver
sends the RNDVRESTARTACK.  If there are outstanding completion events,
then the RNDVRESTARTACK gets sent later when a completion event occurs
that brings the outstanding event count to zero.

In the case that the receiver determines that it is no longer looking
at a valid receive request, which means the request is complete, the
receiver responds with a RNDVRESTARTNACK.  While rare, this case can
happen, for example, when a final FRAG message triggers an error on
the sender but actually makes it to the receiver.

The RECVERRNOTIFY fragment is used so the receiver can let the
sender know that it had an error.  The sender then waits for all of
its completion events, and then sends a RNDVRESTARTNOTIFY.
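
A hedged sketch of the receiver's side of the dance follows; the state fields mirror the illustrative ones from section 6, and every name here is an assumption rather than the actual bfo code:

#include <stdbool.h>

typedef enum {
    RNDVRESTARTNOTIFY, RECVERRNOTIFY, RNDVRESTARTACK, RNDVRESTARTNACK
} restart_msg_t;

typedef struct {
    bool valid;                /* still pointing at a live receive request? */
    bool in_error;
    int  events_outstanding;
} recvreq_state_t;

/* Illustrative stub for transmitting a dance message. */
static void send_to_sender(restart_msg_t type) { (void)type; }

static void on_rndvrestartnotify(recvreq_state_t *req)
{
    if (!req->valid) {
        /* the receive already completed: tell the sender not to restart */
        send_to_sender(RNDVRESTARTNACK);
        return;
    }
    req->in_error = true;
    if (0 == req->events_outstanding) {
        send_to_sender(RNDVRESTARTACK);
    }
    /* otherwise the ACK goes out later, from the completion event that
     * brings events_outstanding down to zero */
}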

All the handling of these new messages is contained in the
pml_bfo_failover files.

9. BTL SUPPORT
The openib BTL also supplies a lot of support for the bfo PML.  First,
fragments can be stored in the BTL during normal operation if
resources become scarce.  This means that when an error is detected in
the BTL, it needs to scour its internal queues for fragments that are
destined for the BTL and error them out.  The function
error_out_all_pending_frags() takes care of this functionality.  Some
of the stored fragments may be coalesced, so care has to be taken to
tease out each message from a coalesced fragment.

There is also some special code in the BTL to handle some strange
occurrences that were observed in the BTL.  First, there are times
where only one half of the connection gets an error.  This can result
in a mismatch between what the PML thinks is available to it and can
cause hangs.  Therefore, when a BTL detects an error, it sends a
special message down the working BTL connection to tell the remote
side that it needs to be brought down as well.

Secondly, it has been observed that a message can get stuck in the
eager RDMA connection between two BTLs.  In this case, an error is
detected on one side, but the other side never sees the message.
Therefore, a special message is sent to the other side telling it to
move along in the eager RDMA connection.  This is all somewhat
confusing.  See the code in the btl_openib_failover.c file for the
details.

10. MERGING
Every effort was made to try and merge the bfo PML into the ob1 PML.
The idea was that any upgrades to the ob1 PML would automatically make
it into the bfo PML and this would enhance maintainability of all the
code.  However, it was deemed that this merging would cause more
problems than it would solve.  What was attempted and why the
conclusion was made are documented here.

One can look at the bfo and easily see the differences between it and
ob1.  All the bfo specific code is surrounded by #if PML_BFO.  In
addition, there are two additional files in the bfo,
pml_bfo_failover.c and pml_bfo_failover.h.

To merge them, the following was attempted.  First, add all the code
in #if regions into the ob1 PML.  As of this writing, there are 73
#ifs that would have to be added into ob1.

Secondly, remove almost all the pml_bfo files and replace them with
links to the ob1 files.

Third, create a new header file that did name shifting of all the
functions so that ob1 and bfo could live together.  This also included
having to create macros for the names of header files as well.  To
help illustrate the name shifting issue, here is what the file might
look like in the bfo directory.

/* Need macros for the header files as they are different in the 
 * different PMLs */
#define PML "bfo"
#define PML_OB1_H "pml_bfo.h"
#define PML_OB1_COMM_H "pml_bfo_comm.h"
#define PML_OB1_COMPONENT_H "pml_bfo_component.h"
#define PML_OB1_HDR_H "pml_bfo_hdr.h"
#define PML_OB1_RDMA_H "pml_bfo_rdma.h"
#define PML_OB1_RDMAFRAG_H "pml_bfo_rdmafrag.h"
#define PML_OB1_RECVFRAG_H "pml_bfo_recvfrag.h"
#define PML_OB1_RECVREQ_H "pml_bfo_recvreq.h"
#define PML_OB1_SENDREQ_H "pml_bfo_sendreq.h"

/* Name shifting of functions from ob1 to bfo (incomplete list) */
#define mca_pml_ob1 mca_pml_bfo
#define mca_pml_ob1_t mca_pml_bfo_t 
#define mca_pml_ob1_component mca_pml_bfo_component
#define mca_pml_ob1_add_procs mca_pml_bfo_add_procs
#define mca_pml_ob1_del_procs mca_pml_bfo_del_procs
#define mca_pml_ob1_enable mca_pml_bfo_enable
#define mca_pml_ob1_progress mca_pml_bfo_progress
#define mca_pml_ob1_add_comm mca_pml_bfo_add_comm
#define mca_pml_ob1_del_comm mca_pml_bfo_del_comm
#define mca_pml_ob1_irecv_init mca_pml_bfo_irecv_init
#define mca_pml_ob1_irecv mca_pml_bfo_irecv
#define mca_pml_ob1_recv mca_pml_bfo_recv
#define mca_pml_ob1_isend_init mca_pml_bfo_isend_init
#define mca_pml_ob1_isend mca_pml_bfo_isend
#define mca_pml_ob1_send mca_pml_bfo_send
#define mca_pml_ob1_iprobe mca_pml_bfo_iprobe
[...and much more ...]

The pml_bfo_hdr.h file was not a link because the changes in it were
so extensive.  Also the Makefile was kept separate so it could include
the additional failover files as well as add a compile directive that
would force the files to be compiled as bfo instead of ob1.

After these changes were made, several independent developers reviewed
the results and concluded that making these changes would have too
much of a negative impact on ob1 maintenance.  First, the code became
much harder to read with all the additional #ifdefs.  Secondly, the
possibility of adding other features, like csum, to ob1 would only
make this issue even worse.  Therefore, it was decided to keep the bfo
PML separate from ob1.

11. UTILITIES
In an ideal world, any bug fixes that are made in the ob1 PML would
also be made in the csum and the bfo PMLs.  However, that does not
always happen.  Therefore, there are two new utilities added to the
contrib directory.

check-ob1-revision.pl
check-ob1-pml-diffs.pl

The first one can be run to see if ob1 has changed from its last known
state.  Here is an example.

 machine =>check-ob1-revision.pl
Running svn diff -r24138 ../ompi/mca/pml/ob1
No new changes detected in ob1.  Everything is fine.

If there are differences, then one needs to review them and
potentially add them to the bfo (and to csum as well, if desired).
After that, bump the revision number in the script to the latest value.

The second script allows one to see the differences between the ob1
and bfo PML.  Here is an example.

 machine =>check-ob1-pml-diffs.pl

Starting script to check differences between bfo and ob1...
Files Compared: pml_ob1.c and pml_bfo.c
No differences encountered
Files Compared: pml_ob1.h and pml_bfo.h
[...snip...]
Files Compared: pml_ob1_start.c and pml_bfo_start.c
No differences encountered

There is more information in the script itself describing how it is used.


Appendix 1: SIMPLE OVERVIEW OF COMMUNICATION PROTOCOLS
The drawings below attempt to describe some of the general flow of
fragments in the various protocols that are supported in the PMLs.
The "read" and "write" are actual RDMA actions and do not pertain to
fragments that are sent.  As can be inferred, they use FIN messages to
indicate their completion.


MATCH PROTOCOL
sender >->->-> MATCH >->->-> receiver

SEND WITH MULTIPLE FRAGMENTS
sender >->->-> RNDV  >->->-> receiver
       <-<-<-<  ACK  <-<-<-<
       >->->-> FRAG  >->->->
       >->->-> FRAG  >->->->
       >->->-> FRAG  >->->->

RDMA PUT
sender >->->-> RNDV  >->->-> receiver
       <-<-<-<  PUT  <-<-<-<
       <-<-<-<  PUT  <-<-<-<
       >->->-> write >->->->
       >->->->  FIN  >->->->
       >->->-> write >->->->
       >->->->  FIN  >->->->

RDMA GET
sender >->->-> RGET  >->->-> receiver
       <-<-<-< read  <-<-<-< 
       <-<-<-<  FIN  <-<-<-<