openmpi/ompi/mca/btl/smcuda
Nathan Hjelm a14e0f10d4 Per RFC: Remove des_src and des_dst members from the
mca_btl_base_segment_t and replace them with des_local and des_remote

This change also updates the BTL version to 3.0.0. This commit does
not represent the final version of BTL 3.0.0. More changes are coming.

In making this change I updated all of the BTLs as well as BTL users
to use the new structure members. Please evaluate your component to
ensure the changes are correct.

RFC text:

This is the first of several BTL interface changes I am proposing for
the 1.9/2.0 release series.

What: Change naming of btl descriptor members. I propose we change
des_src and des_dst (and their associated counts) to be des_local and
des_remote. For receive callbacks the des_local member will be used to
communicate the segment information to the callback. The proposed change
will include updating all of the doxygen in btl.h as well as updating
all BTLs and BTL users to use the new naming scheme.

Why: My btl usage makes use of both put and get operations on the same
descriptor. With the current naming scheme I need to ensure that there
is consistency between the segments described in des_src and des_dst
depending on whether a put or get operation is executed. Additionally,
the current naming prevents BTLs that do not require prepare/RMA matched
operations (do not set MCA_BTL_FLAGS_RDMA_MATCHED) from executing
multiple simultaneous put AND get operations. At the moment the
descriptor can only be used with one or the other. The naming change
makes it easier for BTL users to setup/modify descriptors for RMA
operations as the local segment and remote segment are always in the
same member field. The only issue I foresee with this change is that it
will require a little more work to move BTL fixes to the 1.8 release
series.

This commit was SVN r32196.
2014-07-10 16:31:15 +00:00
btl_smcuda_component.c Per RFC: Remove des_src and des_dst members from the 2014-07-10 16:31:15 +00:00
btl_smcuda_endpoint.h Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed. 2014-06-25 20:43:28 +00:00
btl_smcuda_fifo.h New btl that extends sm btl to support GPU transfers within a node. 2012-02-24 02:13:33 +00:00
btl_smcuda_frag.c Per RFC: Remove des_src and des_dst members from the 2014-07-10 16:31:15 +00:00
btl_smcuda_frag.h Various minor changes to bring smcuda up to date with sm. 2013-11-07 19:45:56 +00:00
btl_smcuda.c Per RFC: Remove des_src and des_dst members from the 2014-07-10 16:31:15 +00:00
btl_smcuda.h Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed. 2014-06-25 20:43:28 +00:00
configure.m4 Move CUDA-aware configurary to its own file and other minor changes due to review. 2013-07-17 22:12:29 +00:00
help-mpi-btl-smcuda.txt New btl that extends sm btl to support GPU transfers within a node. 2012-02-24 02:13:33 +00:00
Makefile.am The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature: 2014-05-07 21:48:53 +00:00
README Fix support in smcuda btl so it does not blow up when there is no CUDA IPC support between two GPUs. Also make it so CUDA IPC support is added dynamically. 2013-08-21 21:00:09 +00:00

Copyright (c) 2013      NVIDIA Corporation.  All rights reserved.
August 21, 2013

SMCUDA DESIGN DOCUMENT
This document describes the design and use of the smcuda BTL.

BACKGROUND
The smcuda btl is a copy of the sm btl but with some additional features.
The main extra feature is the ability to make use of the CUDA IPC APIs to
quickly move GPU buffers from one GPU to another.  Without this support,
the GPU buffers would all be moved into and then out of host memory.

GENERAL DESIGN

The general design makes use of the large message RDMA RGET support in the
OB1 PML.  However, there are some interesting choices to make use of it.
First, we disable any large message RDMA support in the BTL for host
messages.  This is done because we need to use the mca_btl_smcuda_get() for
the GPU buffers.  This is also done because the upper layers expect there
to be a single mpool but we need one for the GPU memory and one for the
host memory.  Since the advantages of using RDMA with host memory are
unclear, we disabled it.  This means no KNEM or CMA support is built in to
the smcuda BTL.

Also note that we give the smcuda BTL a higher rank than the sm BTL.  This
means it will always be selected even if we are doing host only data
transfers.  The smcuda BTL is not built if it is not requested via the
--with-cuda flag to the configure line.

Secondly, the smcuda BTL does not make use of the traditional method of
enabling RDMA operations.  The traditional method checks for the existence
of an RDMA btl hanging off the endpoint.  The smcuda BTL works in
conjunction with the OB1 PML and uses flags that it sets in the BML layer.

OTHER CONSIDERATIONS
CUDA IPC is not necessarily supported by all GPUs on a node.  In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected 
over the IOH.  In addition, we want to check for CUDA IPC support lazily,
when the first GPU access occurs, rather than during MPI_Init() time.
This complicates the design.

INITIALIZATION
When the smcuda BTL initializes, it starts with no support for CUDA IPC.
Upon the first access of a GPU buffer, the smcuda checks which GPU device
it has and sends that to the remote side using a smcuda specific control
message.  The other rank receives the message, and checks to see if there
is CUDA IPC support between the two GPUs via a call to
cuDeviceCanAccessPeer().  If it is true, then the smcuda BTL piggybacks on
the PML error handler callback to make a call into the PML and let it know
to enable CUDA IPC. We created a new flag so that the error handler does
the right thing.  Large message RDMA is enabled by setting a flag in the
bml->btl_flags field.  Control returns to the smcuda BTL where a reply
message is sent so the sending side can set its flag.

At that point, the PML layer starts using the large message RDMA support
in the smcuda BTL.  This is done in some special CUDA code in the PML layer.
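
The flag mechanism described above can be pictured with the small sketch
below.  It only models the idea: a per-peer flags word in which a CUDA get
bit starts out cleared and is set once the handshake succeeds.  The names
sketch_bml_btl_t, SKETCH_FLAG_SEND, SKETCH_FLAG_CUDA_GET and the helper
functions are hypothetical stand-ins chosen for illustration, not the
actual Open MPI structures or constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real definitions live in Open MPI's btl.h. */
#define SKETCH_FLAG_SEND     0x0001u
#define SKETCH_FLAG_CUDA_GET 0x0002u

/* Hypothetical stand-in for the per-peer BTL entry kept by the BML. */
typedef struct {
    uint32_t btl_flags;
} sketch_bml_btl_t;

/* At initialization there is no CUDA IPC support: only send is set. */
static void sketch_init(sketch_bml_btl_t *b)
{
    b->btl_flags = SKETCH_FLAG_SEND;
}

/* Called once the handshake has confirmed CUDA IPC works with this peer. */
static void sketch_enable_cuda_ipc(sketch_bml_btl_t *b)
{
    b->btl_flags |= SKETCH_FLAG_CUDA_GET;
}

/* Large message RGET is considered only for GPU buffers, and only once
 * the flag has been set.  Host buffers never take this path. */
static bool sketch_use_rget(const sketch_bml_btl_t *b, bool is_gpu_buffer)
{
    return is_gpu_buffer && (b->btl_flags & SKETCH_FLAG_CUDA_GET) != 0;
}
```

Keeping the decision in a single flag test means the host-only path pays
just one extra branch per large message.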

ESTABLISHING CUDA IPC SUPPORT
A check has been added to both the send and sendi paths in the smcuda BTL
to determine whether it should send a CUDA IPC setup request message.

    /* Initiate setting up CUDA IPC support. */
    if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
        mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
    }

The first check is to see if the CUDA environment has been initialized.  If
not, then presumably we are not sending any GPU buffers yet and there is
nothing to be done.  If we are initialized, then check the status of the
CUDA IPC endpoint.  If it is in the IPC_INIT stage, then call the function
to send off a control message to the endpoint.

On the receiving side, we first check to see if we are initialized.  If
not, then send a message back to the sender saying we are not initialized.
This will cause the sender to reset its state to IPC_INIT so it can try
again on the next send.

I considered putting the receiving side into a new state like IPC_NOTREADY,
and then, when it switches to ready, sending the ACK to the sender.
The problem with this is that we would need to do these checks during the
progress loop which adds some extra overhead as we would have to check all
endpoints to see if they were ready.

Note that any rank can initiate the setup of CUDA IPC.  It is triggered by
whichever side does a send or sendi call of a GPU buffer.

I have the sender attempt 5 times to set up the connection.  After that, we
give up.  Note that I do not expect many scenarios where the sender has to
resend.  It could happen in a race condition where one rank has initialized
its CUDA environment but the other side has not.

There are several states the connections can go through.

IPC_INIT   - nothing has happened
IPC_SENT   - message has been sent to other side
IPC_ACKING - Received request and figuring out what to send back
IPC_ACKED  - IPC ACK sent
IPC_OK     - IPC ACK received back
IPC_BAD    - Something went wrong, so marking as no IPC support
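
The states and the retry rule above can be sketched as a small state
machine.  This is a minimal model, not the code in btl_smcuda.c.  The reply
codes (REPLY_ACK, REPLY_NOTREADY, REPLY_NOIPC), the ipc_endpoint_t type and
the function names are assumptions made for the example.  Moving to IPC_BAD
after the fifth failed attempt is also an assumption, since the text only
says that we give up.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    IPC_INIT, IPC_SENT, IPC_ACKING, IPC_ACKED, IPC_OK, IPC_BAD
} ipc_state_t;

/* Hypothetical reply codes carried by the control message. */
typedef enum { REPLY_ACK, REPLY_NOTREADY, REPLY_NOIPC } ipc_reply_t;

#define IPC_MAX_ATTEMPTS 5

/* Hypothetical per-endpoint bookkeeping. */
typedef struct {
    ipc_state_t ipcstatus;
    int attempts;
} ipc_endpoint_t;

/* Send path: returns true if a setup request should go out now. */
static bool ipc_maybe_send_request(ipc_endpoint_t *ep, bool cuda_enabled)
{
    if (!cuda_enabled || ep->ipcstatus != IPC_INIT) {
        return false;
    }
    ep->attempts++;
    ep->ipcstatus = IPC_SENT;
    return true;
}

/* Receive path: decide what to answer to an incoming request. */
static ipc_reply_t ipc_handle_request(ipc_endpoint_t *ep, bool cuda_enabled,
                                      bool peer_access)
{
    if (!cuda_enabled) {
        return REPLY_NOTREADY;      /* our state is left untouched */
    }
    ep->ipcstatus = IPC_ACKING;
    if (!peer_access) {             /* e.g. the peer-access check failed */
        ep->ipcstatus = IPC_BAD;
        return REPLY_NOIPC;
    }
    ep->ipcstatus = IPC_ACKED;
    return REPLY_ACK;
}

/* Sender side: process the reply to an earlier request. */
static void ipc_handle_reply(ipc_endpoint_t *ep, ipc_reply_t reply)
{
    if (ep->ipcstatus != IPC_SENT) {
        return;
    }
    switch (reply) {
    case REPLY_ACK:
        ep->ipcstatus = IPC_OK;
        break;
    case REPLY_NOTREADY:            /* retry on the next send, up to 5 times */
        ep->ipcstatus = (ep->attempts < IPC_MAX_ATTEMPTS) ? IPC_INIT : IPC_BAD;
        break;
    case REPLY_NOIPC:
        ep->ipcstatus = IPC_BAD;
        break;
    }
}
```

Because a not-ready reply simply drops the sender back to IPC_INIT, the
retry happens on the next send or sendi call and nothing has to be polled
in the progress loop.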

NOTE ABOUT CUDA IPC AND MEMORY POOLS
The CUDA IPC support works in the following way.  A sender makes a call to
cuIpcGetMemHandle() and gets a memory handle for its local memory.  The
sender then sends that handle to the receiving side.  The receiver calls
cuIpcOpenMemHandle() using that handle and gets back an address to the
remote memory.  The receiver then calls cuMemcpyAsync() to initiate a
remote read of the GPU data.

The receiver maintains a cache of remote memory that it has handles open on.
This is because a call to cuIpcOpenMemHandle() can be very expensive (90usec) so
we want to avoid it when we can.  The cache of remote memory is kept in a memory
pool that is associated with each endpoint.  Note that we do not cache the local
memory handles because getting them is very cheap and there is no need.
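
The receiver-side caching idea can be illustrated with the sketch below.
It does not call the CUDA driver: fake_open_mem_handle() is a stand-in for
the expensive cuIpcOpenMemHandle() call, and it counts how often it is
invoked.  The fixed-size array, the round-robin eviction and all type and
function names are assumptions for the example.  The actual BTL keeps this
cache in a memory pool attached to each endpoint.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HANDLE_BYTES 64   /* opaque handle size assumed for this sketch */
#define CACHE_SLOTS  8    /* arbitrary capacity for this sketch */

typedef struct { unsigned char bytes[HANDLE_BYTES]; } sketch_handle_t;

typedef struct {
    int used;
    sketch_handle_t handle;   /* handle received from the sender */
    uintptr_t mapped;         /* address returned when it was opened */
} cache_entry_t;

/* Hypothetical per-endpoint cache of opened remote handles. */
typedef struct {
    cache_entry_t slots[CACHE_SLOTS];
    int next_victim;
    unsigned opens;           /* number of expensive opens performed */
} endpoint_cache_t;

/* Stand-in for cuIpcOpenMemHandle(); returns a made-up address. */
static uintptr_t fake_open_mem_handle(endpoint_cache_t *c,
                                      const sketch_handle_t *h)
{
    c->opens++;
    return (uintptr_t)0x1000 * (h->bytes[0] + 1u);
}

/* Return the mapped address for a handle, opening it only on a miss. */
static uintptr_t cache_lookup_or_open(endpoint_cache_t *c,
                                      const sketch_handle_t *h)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->slots[i].used &&
            memcmp(&c->slots[i].handle, h, sizeof *h) == 0) {
            return c->slots[i].mapped;          /* hit: no open needed */
        }
    }
    /* Miss: reuse a slot round-robin.  Real code would first close the
     * evicted mapping (cuIpcCloseMemHandle) before overwriting it. */
    cache_entry_t *e = &c->slots[c->next_victim];
    c->next_victim = (c->next_victim + 1) % CACHE_SLOTS;
    e->used = 1;
    e->handle = *h;
    e->mapped = fake_open_mem_handle(c, h);
    return e->mapped;
}
```

With a cache like this, repeated transfers from the same remote buffer pay
the open cost once, and later lookups are a short memcmp() scan.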