Brief description of bfo functionality.
This commit was SVN r24186.
Parent: a525e70f46
Commit: a57e5587f6
ompi/mca/pml/bfo/README (normal file, 340 lines added)
@@ -0,0 +1,340 @@
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.

BFO DESIGN DOCUMENT
This document describes the use and design of the bfo PML. In addition,
there is a section at the end explaining why this functionality was
not merged into the ob1 PML.

1. GENERAL USAGE
First, one has to configure the failover code into the openib BTL so
that bfo will work correctly. To do this:
    configure --enable-openib-failover

Then, when running, one needs to select the bfo PML explicitly:
    mpirun --mca pml bfo

Note that one needs to both configure with --enable-openib-failover
and run with --mca pml bfo to get the failover support. If either of
these two steps is skipped, then the MPI job will just abort in the
case of an error, as it normally does with the ob1 PML.

2. GENERAL FUNCTION
The bfo failover feature requires two or more openib BTLs in use. In
normal operation, it will stripe the communication over the multiple
BTLs. When an error is detected, it will stop using the BTL that
incurred the error and continue the communication over the remaining
BTLs. Once a BTL has been mapped out, it cannot be used by the job
again, even if the underlying fabric becomes functional again. Only
new jobs started after the fabric comes back up will use all the BTLs.

The bfo works in conjunction with changes that were made in the openib
BTL. As noted above, those changes need to be configured into the
BTL for everything to work properly.

The bfo only fails over between openib BTLs. It cannot fail over from
an openib BTL to TCP, for example.
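
The striping and map-out behavior described in this section can be
pictured with the following minimal sketch. It is an illustration only:
the type btl_set_t, the functions btl_set_init(), btl_set_map_out() and
btl_set_next(), and the MAX_BTLS limit are hypothetical names that do
not come from the bfo sources. It assumes simple round-robin striping
and shows that a mapped-out BTL is never selected again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BTLS 4   /* illustrative limit, not an Open MPI constant */

/* Hypothetical per-peer view of the BTLs that may be used for sending. */
typedef struct {
    bool   usable[MAX_BTLS];  /* false once a BTL has been mapped out */
    size_t count;             /* number of BTLs configured for the peer */
    size_t next;              /* round-robin cursor */
} btl_set_t;

void btl_set_init(btl_set_t *set, size_t count)
{
    assert(count > 0 && count <= MAX_BTLS);
    set->count = count;
    set->next = 0;
    for (size_t i = 0; i < count; i++) {
        set->usable[i] = true;
    }
}

/* Permanently remove a BTL; nothing in this sketch re-enables it. */
void btl_set_map_out(btl_set_t *set, size_t btl)
{
    assert(btl < set->count);
    set->usable[btl] = false;
}

/* Pick the next usable BTL in round-robin order, or -1 if none remain. */
int btl_set_next(btl_set_t *set)
{
    for (size_t tried = 0; tried < set->count; tried++) {
        size_t candidate = set->next;
        set->next = (set->next + 1) % set->count;
        if (set->usable[candidate]) {
            return (int)candidate;
        }
    }
    return -1;
}
```

With two BTLs, successive calls to btl_set_next() alternate between 0
and 1. After btl_set_map_out() is called on BTL 1, every call returns 0.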

3. GENERAL DESIGN
The bfo (Btl FailOver) PML was designed to work in clusters that have
multiple openib BTLs. It was designed to be lightweight so as to
avoid any adverse effects on latency. To that end, there is no
tracking of fragments or messages in the bfo PML. Rather, it depends
on the underlying BTL to notify it of each fragment that has an error.
The bfo then decides what needs to be done based on the type of
fragment that gets an error.

No additional sequence numbers were introduced in the bfo. Instead,
it makes use of the sequence numbers that exist in the MATCH, RNDV and
RGET fragment headers. In that way, duplicate fragments that have
MATCH information in them can be detected. Other fragments, like PUT
and ACK, are never retransmitted so it does not matter that they do
not have sequence numbers. The FIN header was a special case in that
it was changed to include the MATCH header so that the tag, source,
and context fields could be used to check for duplicate FINs.
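
To illustrate how an existing sequence number can expose a duplicate,
here is a minimal sketch. It assumes a 16-bit per-peer sequence counter
with wrap-around and a receiver that tracks the next sequence number it
expects; the names peer_state_t, frag_class_t and classify_fragment()
are hypothetical and are not taken from the bfo code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-peer receive state: the next sequence number the
 * receiver expects to match from this sender. */
typedef struct {
    uint16_t expected_seq;
} peer_state_t;

typedef enum {
    FRAG_DUPLICATE,  /* already matched earlier: drop it */
    FRAG_IN_ORDER,   /* exactly the expected fragment: match it now */
    FRAG_EARLY       /* ahead of the expected one: hold it for later */
} frag_class_t;

/* Classify an arriving fragment by its sequence number. The unsigned
 * subtraction wraps modulo 2^16, so a difference in the upper half of
 * the range means "behind the expected number". */
frag_class_t classify_fragment(peer_state_t *peer, uint16_t seq)
{
    uint16_t diff = (uint16_t)(seq - peer->expected_seq);

    if (diff >= 0x8000u) {
        return FRAG_DUPLICATE;
    }
    if (0 == diff) {
        peer->expected_seq = (uint16_t)(peer->expected_seq + 1);
        return FRAG_IN_ORDER;
    }
    return FRAG_EARLY;
}
```

A fragment that is resent after an error but whose first copy already
arrived carries a sequence number behind the expected one, so it is
classified as FRAG_DUPLICATE and can be dropped.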

Note that the assumption is that the underlying BTL will always issue
a callback with an error flag when it thinks a fragment has an error.
This means that even after an error is detected on a BTL, the BTL
continues to be checked for any other messages that may also complete
with an error. This is potentially a unique characteristic of the
openib BTL when running over RC connections that allows the BFO to
work properly.

One scenario that is particularly difficult to handle is the case
where a fragment has an error but the message actually makes it to the
other side. It is because of this that all fragments need to be
checked to make sure they are not a duplicate. This scenario also
complicates some of the rendezvous protocols as the two sides may not
agree where the problem occurred. For example, one can imagine a
sender getting an error on a final FIN message, but the FIN message
actually arrives at the other side. The receiver thinks the
communication is done and moves on. The sender thinks there was a
problem, and that the communication needs to restart.

It is also important to note that a message cannot signal a successful
completion and *not* make it to the receiver. This would probably cause
the bfo to hang.

4. ERRORS
Errors are detected in the openib BTL layer and propagated to the PML
layer. Typically, the errors occur while polling the completion
queue, but can happen in other areas as well. When an error occurs,
an additional callback is called so the PML can map out the connection
for future sending. Then the callback associated with the fragment is
called, but with the error field set to OMPI_ERROR. This way, the PML
knows that this fragment may not have made it to the remote side.

The first callback into the PML is via the mca_pml_bfo_error_handler()
callback and the PML uses this to remove a connection for future
sending. If the error_proc_t field is NULL, then the entire BTL is
removed for any future communication. If the error_proc_t is not
NULL, then the BTL is only removed for the connection associated with
the error_proc_t.

The second callback is the standard one for a completion event, and
this can trigger various activities in the PML. The regular callback
function is called but the status is set to OMPI_ERROR. The PML layer
detects this and calls some failover specific routines depending on
the type of fragment that got the error.
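
The two-callback sequence described in this section can be sketched as
follows. This is a simplified model and not the real interface: the
types peer_t and pml_state_t, the functions pml_error_handler() and
pml_frag_completion(), and the constants are hypothetical stand-ins
(STATUS_ERROR plays the role of OMPI_ERROR). The sketch only shows the
NULL versus non-NULL distinction in the first callback and the status
check in the second.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BTLS      2
#define NUM_PEERS     3
#define STATUS_OK     0
#define STATUS_ERROR (-1)   /* stands in for OMPI_ERROR */

typedef struct { int rank; } peer_t;   /* hypothetical stand-in for a proc */

typedef struct {
    bool usable[NUM_BTLS][NUM_PEERS];  /* may this BTL reach this peer? */
    int  frags_needing_recovery;       /* fragments completed in error */
} pml_state_t;

void pml_state_init(pml_state_t *pml)
{
    for (int b = 0; b < NUM_BTLS; b++) {
        for (int p = 0; p < NUM_PEERS; p++) {
            pml->usable[b][p] = true;
        }
    }
    pml->frags_needing_recovery = 0;
}

/* First callback: map out the failed path so it is not used for
 * future sends. A NULL peer removes the BTL for every peer. */
void pml_error_handler(pml_state_t *pml, int btl, const peer_t *peer)
{
    assert(btl >= 0 && btl < NUM_BTLS);
    if (NULL == peer) {
        for (int p = 0; p < NUM_PEERS; p++) {
            pml->usable[btl][p] = false;
        }
    } else {
        assert(peer->rank >= 0 && peer->rank < NUM_PEERS);
        pml->usable[btl][peer->rank] = false;
    }
}

/* Second callback: the ordinary completion path, which notices the
 * error status and records that the fragment needs recovery. */
void pml_frag_completion(pml_state_t *pml, int status)
{
    if (STATUS_OK != status) {
        pml->frags_needing_recovery++;
    }
}
```

In the real PML the second callback dispatches to a recovery routine
chosen by fragment type; the counter here merely marks where that
dispatch would happen.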


5. RECOVERY OF MATCH FRAGMENTS
Note: For a general description of how the various fragments interact,
see Appendix 1 at the end of this document.

In the case of a MATCH fragment, the fragment is simply resent. Care
has to be taken to distinguish a MATCH fragment that is sent via the
standard interface from one that is sent via the sendi interface. In
the standard send, the send request is still available and is therefore
reset and reused to send the MATCH fragment. In the case of the sendi
fragment, the send request is gone, so the fragment is regenerated
from the information contained within the fragment.

6. RECOVERY OF RNDV OR LARGE MESSAGE RDMA
In the case of a large message RDMA transfer or a RNDV transfer where
the message consists of several fragments, the restart is a little
more complicated. This includes fragments like RNDV, PUT, RGET, FRAG,
FIN, and RDMA write and RDMA read completions. In most cases, the
requests associated with these fragments are reset and restarted.

First, it should be pointed out that a new variable was added to the
send and receive requests. This variable tracks outstanding send
events that have not yet received their completion events. This new
variable is used so that a request is not restarted until all the
outstanding events have completed. If one does not wait for the
outstanding events to complete, then one may restart a request and
then a completion event will happen on the wrong request.

A second variable was added to each request to show whether the
request is already in an error state. When a request has an error
flagged on it and its count of outstanding completion events is down
to zero, it can start the restart dance as described below.
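
The interplay of the two request variables can be sketched as below.
The field and function names (request_t, outstanding_events, in_error,
request_post_send(), request_on_completion()) are hypothetical and are
chosen for this illustration only; they are not the names used in the
bfo sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of a send or receive request. */
typedef struct {
    int  outstanding_events;  /* sends posted whose completion is pending */
    bool in_error;            /* set once any fragment completes in error */
} request_t;

/* Called whenever a fragment is handed to the BTL. */
void request_post_send(request_t *req)
{
    req->outstanding_events++;
}

/* Called from the completion callback. Returns true exactly when the
 * request is in error and no completion events remain outstanding,
 * which is the point where a restart may safely begin. */
bool request_on_completion(request_t *req, bool failed)
{
    assert(req->outstanding_events > 0);
    req->outstanding_events--;
    if (failed) {
        req->in_error = true;
    }
    return req->in_error && 0 == req->outstanding_events;
}
```

If three fragments are outstanding and the second one fails, the
function returns false for the first two completions and true only for
the third, so no late completion can land on a restarted request.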

7. SPECIAL CASE FOR FIN FRAGMENT
Like the MATCH fragment, the FIN message is also simply resent. Like
the sendi MATCH fragment, there may be no request associated with the
FIN message when it gets an error, so the fragment is recreated from
the information in the fragment. The FIN fragment was modified to
have additional information like what is in a MATCH fragment including
the context, source, and tag. In this way, we can figure out if the
FIN message is a duplicate on the receiving side.
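
One way to picture the duplicate check is the sketch below. It rests on
an assumption that is not stated in this document: that the receiver
compares the context, source, and tag carried in the FIN against the
transfer it is still waiting on, and treats a mismatch or an already
finished transfer as a duplicate. All type and function names
(match_info_t, fin_header_t, pending_transfer_t, fin_is_duplicate())
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the MATCH fields that the FIN header carries. */
typedef struct {
    uint16_t context;  /* communicator context id */
    int32_t  source;   /* sending rank */
    int32_t  tag;      /* message tag */
} match_info_t;

typedef struct {
    match_info_t match;  /* the fields added for failover */
    /* ...the usual FIN contents would follow here... */
} fin_header_t;

/* Hypothetical record of the transfer the receiver is waiting on. */
typedef struct {
    bool         active;  /* false once the transfer has finished */
    match_info_t match;
} pending_transfer_t;

/* Assumed rule: a FIN is a duplicate if the transfer it refers to is
 * no longer active, or if its match fields name a different message. */
bool fin_is_duplicate(const fin_header_t *fin, const pending_transfer_t *xfer)
{
    if (!xfer->active) {
        return true;
    }
    return fin->match.context != xfer->match.context ||
           fin->match.source  != xfer->match.source  ||
           fin->match.tag     != xfer->match.tag;
}
```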

8. RESTART DANCE
When the bfo determines that a request in an error state has no
outstanding completion events, a restart dance is initiated. There
are four new PML message types that have been created to participate
in the dance.
    1. RNDVRESTARTNOTIFY
    2. RECVERRNOTIFY
    3. RNDVRESTARTACK
    4. RNDVRESTARTNACK

When the send request is in an error state and the count of outstanding
completion events is zero, RNDVRESTARTNOTIFY is sent from the sender
to the receiver to let it know that the communication needs to be
restarted. Upon receipt of the RNDVRESTARTNOTIFY, the receiver first
checks to make sure that it is still pointing to a valid receive
request. If so, it marks the receive request in error. It then
checks to see if there are any outstanding completion events on the
receiver. If there are no outstanding completion events, the receiver
sends the RNDVRESTARTACK. If there are outstanding completion events,
then the RNDVRESTARTACK gets sent later when a completion event occurs
that brings the outstanding event count to zero.

In the case that the receiver determines that it is no longer looking
at a valid receive request, which means the request is complete, the
receiver responds with a RNDVRESTARTNACK. While rare, this case can
happen, for example, when a final FRAG message triggers an error on the
sender, but actually makes it to the receiver.

The RECVERRNOTIFY fragment is used so the receiver can let the sender
know that it had an error. The sender then waits for all of its
completion events, and then sends a RNDVRESTARTNOTIFY.

All the handling of these new messages is contained in the
pml_bfo_failover files.
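
The receiver side of the dance described in this section can be
sketched as a small state machine. The names used here (reply_t,
recv_request_t, recv_on_restart_notify(), recv_on_completion()) are
hypothetical, and the sketch models only the ACK, NACK and deferred-ACK
decisions, not the actual message handling in the pml_bfo_failover
files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    REPLY_NONE,             /* nothing to send yet */
    REPLY_RNDVRESTARTACK,   /* receiver is ready for the restart */
    REPLY_RNDVRESTARTNACK   /* request already completed: do not restart */
} reply_t;

/* Hypothetical subset of a receive request. */
typedef struct {
    bool valid;               /* still refers to a live receive request */
    bool in_error;            /* marked when a restart has been requested */
    bool ack_pending;         /* ACK deferred until events drain */
    int  outstanding_events;  /* completion events not yet seen */
} recv_request_t;

/* Receiver handling of an incoming RNDVRESTARTNOTIFY. */
reply_t recv_on_restart_notify(recv_request_t *req)
{
    if (NULL == req || !req->valid) {
        return REPLY_RNDVRESTARTNACK;
    }
    req->in_error = true;
    if (0 == req->outstanding_events) {
        return REPLY_RNDVRESTARTACK;
    }
    req->ack_pending = true;   /* send the ACK once events reach zero */
    return REPLY_NONE;
}

/* Receiver handling of a local completion event. */
reply_t recv_on_completion(recv_request_t *req)
{
    assert(req->outstanding_events > 0);
    req->outstanding_events--;
    if (req->ack_pending && 0 == req->outstanding_events) {
        req->ack_pending = false;
        return REPLY_RNDVRESTARTACK;
    }
    return REPLY_NONE;
}
```

A request with two outstanding events answers a notify with
REPLY_NONE, stays silent after the first completion, and produces the
deferred REPLY_RNDVRESTARTACK on the second.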

9. BTL SUPPORT
The openib BTL also supplies a lot of support for the bfo PML. First,
fragments can be stored in the BTL during normal operation if
resources become scarce. This means that when an error is detected in
the BTL, it needs to scour its internal queues for fragments that are
destined for the BTL and error them out. The function
error_out_all_pending_frags() takes care of this functionality. In
addition, some of the stored fragments can be coalesced, so care has
to be taken to tease out each message from a coalesced fragment.
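
The queue scan described above can be sketched as follows. This is not
the code of error_out_all_pending_frags(); it is a simplified model in
which the queue is a flat array, and the names pending_frag_t,
frag_cb_t, sketch_error_out_pending() and count_errors() are all
hypothetical. It shows that each message inside a coalesced fragment
gets its own error callback.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COALESCED 4     /* illustrative limit */
#define STATUS_ERROR (-1)   /* stands in for OMPI_ERROR */

typedef void (*frag_cb_t)(int msg_id, int status, void *ctx);

/* Hypothetical queued fragment; num_msgs > 1 means it is coalesced. */
typedef struct {
    int dest_btl;
    int num_msgs;
    int msg_ids[MAX_COALESCED];
} pending_frag_t;

/* Remove every queued fragment destined for failed_btl, invoking cb
 * once per message with an error status. Surviving fragments are
 * compacted to the front and *len is updated. Returns the number of
 * messages that were errored out. */
size_t sketch_error_out_pending(pending_frag_t *queue, size_t *len,
                                int failed_btl, frag_cb_t cb, void *ctx)
{
    size_t kept = 0;
    size_t errored = 0;

    for (size_t i = 0; i < *len; i++) {
        if (queue[i].dest_btl != failed_btl) {
            queue[kept++] = queue[i];
            continue;
        }
        assert(queue[i].num_msgs >= 1 && queue[i].num_msgs <= MAX_COALESCED);
        for (int m = 0; m < queue[i].num_msgs; m++) {
            cb(queue[i].msg_ids[m], STATUS_ERROR, ctx);
            errored++;
        }
    }
    *len = kept;
    return errored;
}

/* Example callback: counts error notifications in an int behind ctx. */
void count_errors(int msg_id, int status, void *ctx)
{
    (void)msg_id;
    if (STATUS_ERROR == status) {
        (*(int *)ctx)++;
    }
}
```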

There is also some special code in the BTL to handle some strange
occurrences that were observed in the BTL. First, there are times
where only one half of the connection gets an error. This can result
in a mismatch between what the PML thinks is available to it and can
cause hangs. Therefore, when a BTL detects an error, it sends a
special message down the working BTL connection to tell the remote
side that it needs to be brought down as well.

Secondly, it has been observed that a message can get stuck in the
eager RDMA connection between two BTLs. In this case, an error is
detected on one side, but the other side never sees the message.
Therefore, a special message is sent to the other side telling it to
move along in the eager RDMA connection. This is all somewhat
confusing. See the code in the btl_openib_failover.c file for the
details.

10. MERGING
Every effort was made to try to merge the bfo PML into the ob1 PML.
The idea was that any upgrades to the ob1 PML would automatically make
it into the bfo PML and this would enhance maintainability of all the
code. However, it was deemed that this merging would cause more
problems than it would solve. What was attempted and why the
conclusion was made are documented here.

One can look at the bfo and easily see the differences between it and
ob1. All the bfo specific code is surrounded by #if PML_BFO. In
addition, there are two additional files in the bfo,
pml_bfo_failover.c and pml_bfo_failover.h.

To merge them, the following was attempted. First, add all the code
in #if regions into the ob1 PML. As of this writing, there are 73
#ifs that would have to be added into ob1.

Secondly, remove almost all the pml_bfo files and replace them with
links to the ob1 files.

Third, create a new header file that did name shifting of all the
functions so that ob1 and bfo could live together. This also included
having to create macros for the names of header files as well. To
help illustrate the name shifting issue, here is what the file might
look like in the bfo directory.

/* Need macros for the header files as they are different in the
 * different PMLs */
#define PML "bfo"
#define PML_OB1_H "pml_bfo.h"
#define PML_OB1_COMM_H "pml_bfo_comm.h"
#define PML_OB1_COMPONENT_H "pml_bfo_component.h"
#define PML_OB1_HDR_H "pml_bfo_hdr.h"
#define PML_OB1_RDMA_H "pml_bfo_rdma.h"
#define PML_OB1_RDMAFRAG_H "pml_bfo_rdmafrag.h"
#define PML_OB1_RECVFRAG_H "pml_bfo_recvfrag.h"
#define PML_OB1_RECVREQ_H "pml_bfo_recvreq.h"
#define PML_OB1_SENDREQ_H "pml_bfo_sendreq.h"

/* Name shifting of functions from ob1 to bfo (incomplete list) */
#define mca_pml_ob1 mca_pml_bfo
#define mca_pml_ob1_t mca_pml_bfo_t
#define mca_pml_ob1_component mca_pml_bfo_component
#define mca_pml_ob1_add_procs mca_pml_bfo_add_procs
#define mca_pml_ob1_del_procs mca_pml_bfo_del_procs
#define mca_pml_ob1_enable mca_pml_bfo_enable
#define mca_pml_ob1_progress mca_pml_bfo_progress
#define mca_pml_ob1_add_comm mca_pml_bfo_add_comm
#define mca_pml_ob1_del_comm mca_pml_bfo_del_comm
#define mca_pml_ob1_irecv_init mca_pml_bfo_irecv_init
#define mca_pml_ob1_irecv mca_pml_bfo_irecv
#define mca_pml_ob1_recv mca_pml_bfo_recv
#define mca_pml_ob1_isend_init mca_pml_bfo_isend_init
#define mca_pml_ob1_isend mca_pml_bfo_isend
#define mca_pml_ob1_send mca_pml_bfo_send
#define mca_pml_ob1_iprobe mca_pml_bfo_iprobe
[...and much more ...]

The pml_bfo_hdr.h file was not a link because the changes in it were
so extensive. Also the Makefile was kept separate so it could include
the additional failover files as well as add a compile directive that
would force the files to be compiled as bfo instead of ob1.
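
The effect of name shifting can be seen in a toy example. The two
macros below are taken from the listing above, but the function bodies
and the helper pml_name_is() are invented for illustration and bear no
relation to the real send path. The source text is written once with
the ob1 name, yet after preprocessing the symbol that is actually
defined is mca_pml_bfo_send.

```c
#include <assert.h>
#include <string.h>

/* Two lines from the name-shifting header sketched in this section. */
#define PML "bfo"
#define mca_pml_ob1_send mca_pml_bfo_send

/* Shared source text, written once with the ob1 name. Because of the
 * macro above, the compiler sees a definition of mca_pml_bfo_send.
 * The body is a placeholder invented for this illustration. */
const char *mca_pml_ob1_send(void)
{
    return "sent via " PML;   /* adjacent string literals concatenate */
}

/* Hypothetical helper: reports which PML this file was compiled as. */
int pml_name_is(const char *name)
{
    assert(name != NULL);
    return 0 == strcmp(PML, name);
}
```

Calling mca_pml_bfo_send() returns the string "sent via bfo", and a
call spelled mca_pml_ob1_send() in the same file resolves to the very
same function. This also hints at the maintenance cost noted in this
section: every shared symbol needs its own macro.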

After these changes were made, several independent developers reviewed
the results and concluded that making these changes would have too
much of a negative impact on ob1 maintenance. First, the code became
much harder to read with all the additional #ifdefs. Secondly, the
possibility of adding other features, like csum, to ob1 would only
make this issue even worse. Therefore, it was decided to keep the bfo
PML separate from ob1.

11. UTILITIES
In an ideal world, any bug fixes that are made in the ob1 PML would
also be made in the csum and the bfo PMLs. However, that does not
always happen. Therefore, there are two new utilities added to the
contrib directory.

check-ob1-revision.pl
check-ob1-pml-diffs.pl

The first one can be run to see if ob1 has changed from its last known
state. Here is an example.

machine =>check-ob1-revision.pl
Running svn diff -r24138 ../ompi/mca/pml/ob1
No new changes detected in ob1. Everything is fine.

If there are differences, then one needs to review them and potentially
add them to the bfo (and csum also if one feels like it).
After that, bump up the revision number in the script to the latest
value.

The second script allows one to see the differences between the ob1
and bfo PMLs. Here is an example.

machine =>check-ob1-pml-diffs.pl

Starting script to check differences between bfo and ob1...
Files Compared: pml_ob1.c and pml_bfo.c
No differences encountered
Files Compared: pml_ob1.h and pml_bfo.h
[...snip...]
Files Compared: pml_ob1_start.c and pml_bfo_start.c
No differences encountered

The script itself contains further comments on how it is used.


Appendix 1: SIMPLE OVERVIEW OF COMMUNICATION PROTOCOLS
The drawings below attempt to describe some of the general flow of
fragments in the various protocols that are supported in the PMLs.
The "read" and "write" are actual RDMA actions and do not pertain to
fragments that are sent. As can be inferred, they use FIN messages to
indicate their completion.


MATCH PROTOCOL
sender >->->-> MATCH >->->-> receiver

SEND WITH MULTIPLE FRAGMENTS
sender >->->-> RNDV  >->->-> receiver
       <-<-<-< ACK   <-<-<-<
       >->->-> FRAG  >->->->
       >->->-> FRAG  >->->->
       >->->-> FRAG  >->->->

RDMA PUT
sender >->->-> RNDV  >->->-> receiver
       <-<-<-< PUT   <-<-<-<
       <-<-<-< PUT   <-<-<-<
       >->->-> write >->->->
       >->->-> FIN   >->->->
       >->->-> write >->->->
       >->->-> FIN   >->->->

RDMA GET
sender >->->-> RGET  >->->-> receiver
       <-<-<-< read  <-<-<-<
       <-<-<-< FIN   <-<-<-<