Fix support in smcuda btl so it does not blow up when there is no CUDA IPC support between two GPUs. Also make it so CUDA IPC support is added dynamically.
Fixes ticket 3531. This commit was SVN r29055.
parent 96fdb060ea
commit 504fa2cda9
@@ -169,6 +169,7 @@ typedef uint8_t mca_btl_base_tag_t;
 */
#define MCA_BTL_TAG_IB (MCA_BTL_TAG_BTL + 0)
#define MCA_BTL_TAG_UDAPL (MCA_BTL_TAG_BTL + 1)
#define MCA_BTL_TAG_SMCUDA (MCA_BTL_TAG_BTL + 2)

/* prefered protocol */
#define MCA_BTL_FLAGS_SEND 0x0001

@@ -218,6 +219,7 @@ typedef uint8_t mca_btl_base_tag_t;
/* error callback flags */
#define MCA_BTL_ERROR_FLAGS_FATAL 0x1
#define MCA_BTL_ERROR_FLAGS_NONFATAL 0x2
#define MCA_BTL_ERROR_FLAGS_ADD_CUDA_IPC 0x4

/**
 * Asynchronous callback function on completion of an operation.
ompi/mca/btl/smcuda/README (new file, 113 lines)
@@ -0,0 +1,113 @@
Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
August 21, 2013

SMCUDA DESIGN DOCUMENT
This document describes the design and use of the smcuda BTL.

BACKGROUND
The smcuda btl is a copy of the sm btl but with some additional features.
The main extra feature is the ability to make use of the CUDA IPC APIs to
quickly move GPU buffers from one GPU to another. Without this support,
the GPU buffers would all be moved into and then out of host memory.

GENERAL DESIGN

The general design makes use of the large message RDMA RGET support in the
OB1 PML. However, there are some interesting choices to make use of it.
First, we disable any large message RDMA support in the BTL for host
messages. This is done because we need to use the mca_btl_smcuda_get() for
the GPU buffers. This is also done because the upper layers expect there
to be a single mpool but we need one for the GPU memory and one for the
host memory. Since the advantages of using RDMA with host memory are
unclear, we disabled it. This means no KNEM or CMA support built into the
smcuda BTL.

Also note that we give the smcuda BTL a higher rank than the sm BTL. This
means it will always be selected even if we are doing host only data
transfers. The smcuda BTL is not built if it is not requested via the
--with-cuda flag to the configure line.

Secondly, the smcuda does not make use of the traditional method of
enabling RDMA operations. The traditional method checks for the existence
of an RDMA btl hanging off the endpoint. The smcuda works in conjunction
with the OB1 PML and uses flags that it sends in the BML layer.

OTHER CONSIDERATIONS
CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected
over the IOH. In addition, we want to check for CUDA IPC support lazily,
when the first GPU access occurs, rather than during MPI_Init() time.
This complicates the design.

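Whether two devices can reach each other can be queried directly from the
CUDA driver API. The standalone sketch below (not part of the BTL; just an
illustration) prints the peer-access matrix for a node; on NUMA machines
some pairs may come back as 0.

    #include <stdio.h>
    #include <cuda.h>

    /* Print which device pairs support peer (and therefore CUDA IPC) access. */
    int main(void)
    {
        int ndev, i, j, access;

        if (CUDA_SUCCESS != cuInit(0)) {
            return 1;
        }
        cuDeviceGetCount(&ndev);

        for (i = 0; i < ndev; i++) {
            for (j = 0; j < ndev; j++) {
                if (i == j) continue;
                cuDeviceCanAccessPeer(&access, (CUdevice)i, (CUdevice)j);
                printf("GPU %d -> GPU %d : %s\n", i, j,
                       access ? "peer access" : "no peer access");
            }
        }
        return 0;
    }
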
INITIALIZATION
When the smcuda BTL initializes, it starts with no support for CUDA IPC.
Upon the first access of a GPU buffer, the smcuda checks which GPU device
it has and sends that to the remote side using an smcuda-specific control
message. The other rank receives the message, and checks to see if there
is CUDA IPC support between the two GPUs via a call to
cuDeviceCanAccessPeer(). If it is true, then the smcuda BTL piggybacks on
the PML error handler callback to make a call into the PML and let it know
to enable CUDA IPC. We created a new flag so that the error handler does
the right thing. Large message RDMA is enabled by setting a flag in the
bml->btl_flags field. Control returns to the smcuda BTL where a reply
message is sent so the sending side can set its flag.

At that point, the PML layer starts using the large message RDMA support
in the smcuda BTL. This is done in some special CUDA code in the PML layer.

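The overall flow can be summarized with the small, self-contained sketch
below. This is an illustration only: the types and helpers here (peer,
handle_req, can_access_peer) are stand-ins and not the actual smcuda code,
which appears later in this commit.

    #include <stdio.h>

    /* Toy states and message tags mirroring the ones this commit introduces. */
    enum ipc_state { ST_INIT, ST_SENT, ST_ACKING, ST_ACKED, ST_BAD };
    enum ipc_msg   { MSG_NONE, MSG_REQ, MSG_ACK, MSG_NOTREADY };

    struct peer { int dev; int cuda_ready; enum ipc_state st; };

    /* Stand-in for cuDeviceCanAccessPeer(): pretend only devices 0 and 1
     * have a peer-to-peer path. */
    static int can_access_peer(int a, int b) {
        return (a == 0 && b == 1) || (a == 1 && b == 0);
    }

    /* Receiver side: decide how to answer an incoming IPC request. */
    static enum ipc_msg handle_req(struct peer *me, int sender_dev)
    {
        if (!me->cuda_ready) {
            return MSG_NOTREADY;       /* sender resets to INIT and retries later */
        }
        me->st = ST_ACKING;
        if (!can_access_peer(me->dev, sender_dev)) {
            me->st = ST_BAD;           /* no P2P path between the GPUs: give up */
            return MSG_NONE;
        }
        me->st = ST_ACKED;   /* the real code also notifies the PML via the error callback here */
        return MSG_ACK;
    }

    int main(void)
    {
        struct peer a = { 0, 1, ST_INIT }, b = { 1, 1, ST_INIT };
        enum ipc_msg reply;

        a.st = ST_SENT;                /* a's first GPU send triggers the request */
        reply = handle_req(&b, a.dev);
        if (MSG_ACK == reply) {
            a.st = ST_ACKED;           /* both sides may now use CUDA IPC */
        } else if (MSG_NOTREADY == reply) {
            a.st = ST_INIT;            /* try again on the next send */
        }

        printf("sender state=%d, receiver state=%d\n", a.st, b.st);
        return 0;
    }
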
ESTABLISHING CUDA IPC SUPPORT
A check has been added into both the send and sendi paths in the smcuda BTL
that checks to see whether it should send a CUDA IPC setup request message.

    /* Initiate setting up CUDA IPC support. */
    if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstate)) {
        mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
    }

The first check is to see if the CUDA environment has been initialized. If
not, then presumably we are not sending any GPU buffers yet and there is
nothing to be done. If we are initialized, then check the status of the
CUDA IPC endpoint. If it is in the IPC_INIT stage, then call the function
to send a control message to the endpoint.

On the receiving side, we first check to see if we are initialized. If
not, then send a message back to the sender saying we are not initialized.
This will cause the sender to reset its state to IPC_INIT so it can try
again on the next send.

I considered putting the receiving side into a new state like IPC_NOTREADY,
and then, when it switches to ready, sending the ACK to the sender.
The problem with this is that we would need to do these checks during the
progress loop, which adds some extra overhead, as we would have to check all
endpoints to see if they were ready.

Note that any rank can initiate the setup of CUDA IPC. It is triggered by
whichever side does a send or sendi call of a GPU buffer.

I have the sender attempt 5 times to set up the connection. After that, we
give up. Note that I do not expect many scenarios where the sender has to
resend. It could happen in a race condition where one rank has initialized
its CUDA environment but the other side has not.

There are several states the connections can go through.

  IPC_INIT   - nothing has happened
  IPC_SENT   - message has been sent to the other side
  IPC_ACKING - received request and figuring out what to send back
  IPC_ACKED  - IPC ACK sent
  IPC_OK     - IPC ACK received back
  IPC_BAD    - something went wrong, so marking as no IPC support

NOTE ABOUT CUDA IPC AND MEMORY POOLS
The CUDA IPC support works in the following way. A sender makes a call to
cuIpcGetMemHandle() and gets a memory handle for its local memory. The
sender then sends that handle to the receiving side. The receiver calls
cuIpcOpenMemHandle() using that handle and gets back an address for the
remote memory. The receiver then calls cuMemcpyAsync() to initiate a
remote read of the GPU data.
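
For reference, a minimal sketch of that CUDA driver API sequence is shown
below. This is not the smcuda code: error handling is reduced to asserts,
and it assumes the exporting side has some way to pass the CUipcMemHandle
to the importing process (the BTL carries it in a control message).

    #include <assert.h>
    #include <stddef.h>
    #include <cuda.h>

    /* Exporting (sending) side: get an IPC handle for a device allocation. */
    static void export_buffer(CUdeviceptr dbuf, CUipcMemHandle *handle_out)
    {
        CUresult res = cuIpcGetMemHandle(handle_out, dbuf);
        assert(CUDA_SUCCESS == res);
        /* handle_out is now shipped to the peer, e.g. inside a control message */
    }

    /* Importing (receiving) side: map the peer's buffer and read from it. */
    static void import_and_read(CUipcMemHandle handle, CUdeviceptr local_dst,
                                size_t bytes, CUstream stream)
    {
        CUdeviceptr remote;
        CUresult res;

        /* Expensive call (~90 usec), which is why the BTL caches the result. */
        res = cuIpcOpenMemHandle(&remote, handle, CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
        assert(CUDA_SUCCESS == res);

        /* Remote read of the GPU data into a local device buffer. */
        res = cuMemcpyAsync(local_dst, remote, bytes, stream);
        assert(CUDA_SUCCESS == res);
        res = cuStreamSynchronize(stream);
        assert(CUDA_SUCCESS == res);

        /* Only done when the cached mapping is finally evicted. */
        res = cuIpcCloseMemHandle(remote);
        assert(CUDA_SUCCESS == res);
    }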

The receiver maintains a cache of remote memory that it has handles open on.
This is because a call to cuIpcOpenMemHandle() can be very expensive (90 usec), so
we want to avoid it when we can. The cache of remote memory is kept in a memory
pool that is associated with each endpoint. Note that we do not cache the local
memory handles because getting them is very cheap and there is no need.
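
To make the caching idea concrete, here is an illustrative sketch of a tiny
per-endpoint lookup table for opened IPC mappings, keyed by the peer's base
address. It is not the Open MPI implementation (the real code keeps these
registrations in an mpool/rcache tied to the endpoint); the names below are
made up for the example.

    #include <stddef.h>
    #include <cuda.h>

    #define IPC_CACHE_SLOTS 64

    struct ipc_cache_entry {
        CUdeviceptr remote_base;   /* peer's address, used as the lookup key  */
        size_t      size;
        CUdeviceptr local_map;     /* what cuIpcOpenMemHandle() handed back   */
        int         valid;
    };

    struct ipc_cache {
        struct ipc_cache_entry slot[IPC_CACHE_SLOTS];
    };

    /* Return the cached mapping for a remote buffer, or 0 if the caller still
     * has to pay the cuIpcOpenMemHandle() cost and then insert the result. */
    static CUdeviceptr ipc_cache_lookup(const struct ipc_cache *c,
                                        CUdeviceptr remote_base, size_t size)
    {
        int i;
        for (i = 0; i < IPC_CACHE_SLOTS; i++) {
            if (c->slot[i].valid &&
                c->slot[i].remote_base == remote_base &&
                c->slot[i].size >= size) {
                return c->slot[i].local_map;
            }
        }
        return 0;
    }
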
@@ -107,6 +107,11 @@ mca_btl_smcuda_t mca_btl_smcuda = {
    }
};

#if OMPI_CUDA_SUPPORT
static void mca_btl_smcuda_send_cuda_ipc_request(struct mca_btl_base_module_t* btl,
                                                 struct mca_btl_base_endpoint_t* endpoint);
#endif /* OMPI_CUDA_SUPPORT */

/*
 * calculate offset of an address from the beginning of a shared memory segment
 */
@@ -538,6 +543,12 @@ int mca_btl_smcuda_add_procs(
            return_code = OMPI_ERROR;
            goto CLEANUP;
        }
#if OMPI_CUDA_SUPPORT
        peers[proc]->proc_ompi = procs[proc];
        peers[proc]->ipcstate = IPC_INIT;
        peers[proc]->ipctries = 0;
#endif /* OMPI_CUDA_SUPPORT */

        n_local_procs++;

        /* add this proc to shared memory accessibility list */
@@ -908,6 +919,13 @@ int mca_btl_smcuda_sendi( struct mca_btl_base_module_t* btl,
        mca_btl_smcuda_component_progress();
    }

#if OMPI_CUDA_SUPPORT
    /* Initiate setting up CUDA IPC support. */
    if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstate)) {
        mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
    }
#endif /* OMPI_CUDA_SUPPORT */

    /* this check should be unnecessary... turn into an assertion? */
    if( length < mca_btl_smcuda_component.eager_limit ) {

@@ -986,6 +1004,11 @@ int mca_btl_smcuda_send( struct mca_btl_base_module_t* btl,
        mca_btl_smcuda_component_progress();
    }

    /* Initiate setting up CUDA IPC support */
    if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstate)) {
        mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
    }

    /* available header space */
    frag->hdr->len = frag->segment.base.seg_len;
    /* type of message, pt-2-pt, one-sided, etc */
@@ -1136,6 +1159,87 @@ int mca_btl_smcuda_get_cuda(struct mca_btl_base_module_t* btl,
    return OMPI_SUCCESS;

}

/**
 * Send a CUDA IPC request message to the peer. This indicates that this rank
 * is interested in establishing CUDA IPC support between this rank's GPU
 * and the remote rank's GPU. This is called when we do a send of some
 * type.
 *
 * @param btl (IN) BTL module
 * @param peer (IN) BTL peer addressing
 */
#define MAXTRIES 5
static void mca_btl_smcuda_send_cuda_ipc_request(struct mca_btl_base_module_t* btl,
                                                 struct mca_btl_base_endpoint_t* endpoint)
{
    mca_btl_smcuda_frag_t* frag;
    int rc, mydevnum, res;
    ctrlhdr_t ctrlhdr;

    /* We need to grab the lock when changing the state from IPC_INIT as multiple
     * threads could be doing sends. */
    OPAL_THREAD_LOCK(&endpoint->endpoint_lock);
    if (endpoint->ipcstate != IPC_INIT) {
        OPAL_THREAD_UNLOCK(&endpoint->endpoint_lock);
        return;
    } else {
        endpoint->ipctries++;
        if (endpoint->ipctries > MAXTRIES) {
            endpoint->ipcstate = IPC_BAD;
            OPAL_THREAD_UNLOCK(&endpoint->endpoint_lock);
            return;
        }
        /* All is good. Set up state and continue. */
        endpoint->ipcstate = IPC_SENT;
        OPAL_THREAD_UNLOCK(&endpoint->endpoint_lock);
    }

    if ( mca_btl_smcuda_component.num_outstanding_frags * 2 > (int) mca_btl_smcuda_component.fifo_size ) {
        mca_btl_smcuda_component_progress();
    }

    if (0 != (res = mca_common_cuda_get_device(&mydevnum))) {
        opal_output(0, "Cannot determine device. IPC cannot be set.");
        endpoint->ipcstate = IPC_BAD;
        return;
    }

    /* allocate a fragment, giving up if we can't get one */
    MCA_BTL_SMCUDA_FRAG_ALLOC_EAGER(frag);
    if( OPAL_UNLIKELY(NULL == frag) ) {
        endpoint->ipcstate = IPC_BAD;
        return;
    }

    /* Fill in fragment fields. */
    frag->hdr->tag = MCA_BTL_TAG_SMCUDA;
    frag->base.des_flags = MCA_BTL_DES_FLAGS_BTL_OWNERSHIP;
    frag->endpoint = endpoint;
    ctrlhdr.ctag = IPC_REQ;
    ctrlhdr.cudev = mydevnum;
    memcpy(frag->segment.base.seg_addr.pval, &ctrlhdr, sizeof(struct ctrlhdr_st));

    MCA_BTL_SMCUDA_TOUCH_DATA_TILL_CACHELINE_BOUNDARY(frag);
    /* write the fragment pointer to the FIFO */
    /*
     * Note that we don't care what the FIFO-write return code is. Even if
     * the return code indicates failure, the write has still "completed" from
     * our point of view: it has been posted to a "pending send" queue.
     */
    OPAL_THREAD_ADD32(&mca_btl_smcuda_component.num_outstanding_frags, +1);
    opal_output_verbose(10, mca_btl_smcuda_component.cuda_ipc_output,
                        "Sending CUDA IPC REQ (try=%d): myrank=%d, mydev=%d, peerrank=%d",
                        endpoint->ipctries,
                        mca_btl_smcuda_component.my_smp_rank,
                        mydevnum, endpoint->peer_smp_rank);

    MCA_BTL_SMCUDA_FIFO_WRITE(endpoint, endpoint->my_smp_rank,
                              endpoint->peer_smp_rank, (void *) VIRTUAL2RELATIVE(frag->hdr), false, true, rc);
    return;

}

#endif /* OMPI_CUDA_SUPPORT */

/**
@@ -202,6 +202,10 @@ struct mca_btl_smcuda_component_t {
    char *sm_mpool_rndv_file_name;
    char *sm_ctl_file_name;
    char *sm_rndv_file_name;
#if OMPI_CUDA_SUPPORT
    int cuda_ipc_verbose;
    int cuda_ipc_output;
#endif /* OMPI_CUDA_SUPPORT */
};
typedef struct mca_btl_smcuda_component_t mca_btl_smcuda_component_t;
OMPI_MODULE_DECLSPEC extern mca_btl_smcuda_component_t mca_btl_smcuda_component;
@@ -489,6 +493,30 @@ extern struct mca_btl_base_descriptor_t* mca_btl_smcuda_prepare_dst(
    size_t reserve,
    size_t* size,
    uint32_t flags);

/* CUDA IPC control message tags */
enum ipcCtrlMsg {
    IPC_REQ = 10,
    IPC_ACK,
    IPC_NOTREADY,
};

/* CUDA IPC control message */
typedef struct ctrlhdr_st {
    enum ipcCtrlMsg ctag;
    int cudev;
} ctrlhdr_t;

/* State of setting up CUDA IPC on an endpoint */
enum ipcState {
    IPC_INIT = 1,
    IPC_SENT,
    IPC_ACKING,
    IPC_ACKED,
    IPC_OK,
    IPC_BAD
};

#endif /* OMPI_CUDA_SUPPORT */

@@ -170,6 +170,9 @@ static int smcuda_register(void)
    } else {
        mca_btl_smcuda.super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_LOW;
    }
    mca_btl_smcuda_param_register_int("cuda_ipc_verbose", 0, &mca_btl_smcuda_component.cuda_ipc_verbose);
    mca_btl_smcuda_component.cuda_ipc_output = opal_output_open(NULL);
    opal_output_set_verbosity(mca_btl_smcuda_component.cuda_ipc_output, mca_btl_smcuda_component.cuda_ipc_verbose);
#else /* OMPI_CUDA_SUPPORT */
    mca_btl_smcuda.super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_HIGH-1;
#endif /* OMPI_CUDA_SUPPORT */

@@ -180,9 +183,6 @@ static int smcuda_register(void)
    mca_btl_smcuda.super.btl_rdma_pipeline_frag_size = 64*1024;
    mca_btl_smcuda.super.btl_min_rdma_pipeline_size = 64*1024;
    mca_btl_smcuda.super.btl_flags = MCA_BTL_FLAGS_SEND;
-#if OMPI_CUDA_SUPPORT
-    mca_btl_smcuda.super.btl_flags |= MCA_BTL_FLAGS_CUDA_GET;
-#endif /* OMPI_CUDA_SUPPORT */
    mca_btl_smcuda.super.btl_seg_size = sizeof (mca_btl_smcuda_segment_t);
    mca_btl_smcuda.super.btl_bandwidth = 9000; /* Mbs */
    mca_btl_smcuda.super.btl_latency = 1; /* Microsecs */
@@ -617,6 +617,192 @@ out:
    return rc;
}

#if OMPI_CUDA_SUPPORT

/**
 * Send a CUDA IPC ACK or NOTREADY message back to the peer.
 *
 * @param btl (IN) BTL module
 * @param peer (IN) BTL peer addressing
 * @param ready (IN) If ready, then send ACK
 */
static void mca_btl_smcuda_send_cuda_ipc_ack(struct mca_btl_base_module_t* btl,
                                             struct mca_btl_base_endpoint_t* endpoint, int ready)
{
    mca_btl_smcuda_frag_t* frag;
    ctrlhdr_t ctrlhdr;
    int rc;

    if ( mca_btl_smcuda_component.num_outstanding_frags * 2 > (int) mca_btl_smcuda_component.fifo_size ) {
        mca_btl_smcuda_component_progress();
    }

    /* allocate a fragment, giving up if we can't get one */
    MCA_BTL_SMCUDA_FRAG_ALLOC_EAGER(frag);
    if( OPAL_UNLIKELY(NULL == frag) ) {
        endpoint->ipcstate = IPC_BAD;
        return;
    }

    if (ready) {
        ctrlhdr.ctag = IPC_ACK;
    } else {
        ctrlhdr.ctag = IPC_NOTREADY;
    }

    /* Fill in fragment fields. */
    frag->hdr->tag = MCA_BTL_TAG_SMCUDA;
    frag->base.des_flags = MCA_BTL_DES_FLAGS_BTL_OWNERSHIP;
    frag->endpoint = endpoint;
    memcpy(frag->segment.base.seg_addr.pval, &ctrlhdr, sizeof(struct ctrlhdr_st));

    /* write the fragment pointer to the FIFO */
    /*
     * Note that we don't care what the FIFO-write return code is. Even if
     * the return code indicates failure, the write has still "completed" from
     * our point of view: it has been posted to a "pending send" queue.
     */
    OPAL_THREAD_ADD32(&mca_btl_smcuda_component.num_outstanding_frags, +1);

    MCA_BTL_SMCUDA_FIFO_WRITE(endpoint, endpoint->my_smp_rank,
                              endpoint->peer_smp_rank, (void *) VIRTUAL2RELATIVE(frag->hdr), false, true, rc);

    /* Set state now that we have sent message */
    if (ready) {
        endpoint->ipcstate = IPC_ACKED;
    } else {
        endpoint->ipcstate = IPC_INIT;
    }

    return;

}
/* This function is utilized to set up CUDA IPC support within the smcuda
 * BTL. It handles smcuda specific control messages that are triggered
 * when GPU memory transfers are initiated. */
static void btl_smcuda_control(mca_btl_base_module_t* btl,
                               mca_btl_base_tag_t tag,
                               mca_btl_base_descriptor_t* des, void* cbdata)
{
    int mydevnum, ipcaccess, res;
    ctrlhdr_t ctrlhdr;
    ompi_proc_t *ep_proc;
    struct mca_btl_base_endpoint_t *endpoint;
    mca_btl_smcuda_t *smcuda_btl = (mca_btl_smcuda_t *)btl;
    mca_btl_smcuda_frag_t *frag = (mca_btl_smcuda_frag_t *)des;
    mca_btl_base_segment_t* segments = des->des_dst;

    /* Use the rank of the peer that sent the data to get to the endpoint
     * structure. This is needed for PML callback. */
    endpoint = mca_btl_smcuda_component.sm_peers[frag->hdr->my_smp_rank];
    ep_proc = endpoint->proc_ompi;

    /* Copy out control message payload to examine it */
    memcpy(&ctrlhdr, segments->seg_addr.pval, sizeof(struct ctrlhdr_st));

    /* Handle an incoming CUDA IPC control message. */
    switch (ctrlhdr.ctag) {
    case IPC_REQ:
        /* Initial request to set up IPC. If the state of IPC
         * initialization is IPC_INIT, then check on the peer to peer
         * access and act accordingly. If we are in the IPC_SENT
         * state, then this means both sides are trying to set up the
         * connection. If my smp rank is higher then check and act
         * accordingly. Otherwise, drop the request and let the other
         * side continue the handshake. */
        OPAL_THREAD_LOCK(&endpoint->endpoint_lock);
        if ((IPC_INIT == endpoint->ipcstate) ||
            ((IPC_SENT == endpoint->ipcstate) && (endpoint->my_smp_rank > endpoint->peer_smp_rank))) {
            endpoint->ipcstate = IPC_ACKING; /* Move into new state to prevent any new connection attempts */
            OPAL_THREAD_UNLOCK(&endpoint->endpoint_lock);

            /* If not yet CUDA ready, send a NOTREADY message back. */
            if (!mca_common_cuda_enabled) {
                opal_output_verbose(10, mca_btl_smcuda_component.cuda_ipc_output,
                                    "Sending CUDA IPC NOTREADY: myrank=%d, peerrank=%d",
                                    mca_btl_smcuda_component.my_smp_rank,
                                    endpoint->peer_smp_rank);
                mca_btl_smcuda_send_cuda_ipc_ack(btl, endpoint, 0);
                return;
            }

            /* Get my current device. If this fails, move this endpoint state into
             * bad state. No need to send a reply. */
            res = mca_common_cuda_get_device(&mydevnum);
            if (0 != res) {
                endpoint->ipcstate = IPC_BAD;
                return;
            }

            /* Check for IPC support between devices. If the CUDA API call fails, then
             * just move endpoint into bad state. No need to send a reply. */
            res = mca_common_cuda_device_can_access_peer(&ipcaccess, mydevnum, ctrlhdr.cudev);
            if (0 != res) {
                endpoint->ipcstate = IPC_BAD;
                return;
            }

            assert(endpoint->peer_smp_rank == frag->hdr->my_smp_rank);
            opal_output_verbose(10, mca_btl_smcuda_component.cuda_ipc_output,
                                "Analyzed CUDA IPC request: myrank=%d, mydev=%d, peerrank=%d, "
                                "peerdev=%d --> ACCESS=%d",
                                endpoint->my_smp_rank, mydevnum, endpoint->peer_smp_rank,
                                ctrlhdr.cudev, ipcaccess);

            if (0 == ipcaccess) {
                /* No CUDA IPC support */
                opal_output_verbose(10, mca_btl_smcuda_component.cuda_ipc_output,
                                    "Not sending CUDA IPC ACK, no P2P support");
                endpoint->ipcstate = IPC_BAD;
            } else {
                /* CUDA IPC works */
                smcuda_btl->error_cb(&smcuda_btl->super, MCA_BTL_ERROR_FLAGS_ADD_CUDA_IPC,
                                     ep_proc, (char *)&mca_btl_smcuda_component.cuda_ipc_output);
                opal_output_verbose(10, mca_btl_smcuda_component.cuda_ipc_output,
                                    "Sending CUDA IPC ACK: myrank=%d, mydev=%d, peerrank=%d, peerdev=%d",
                                    endpoint->my_smp_rank, mydevnum, endpoint->peer_smp_rank,
                                    ctrlhdr.cudev);
                mca_btl_smcuda_send_cuda_ipc_ack(btl, endpoint, 1);
            }
        } else {
            OPAL_THREAD_UNLOCK(&endpoint->endpoint_lock);
            opal_output_verbose(10, mca_btl_smcuda_component.cuda_ipc_output,
                                "Not sending CUDA IPC ACK because request already initiated");
        }
        break;

    case IPC_ACK:
        opal_output_verbose(10, mca_btl_smcuda_component.cuda_ipc_output,
                            "Received CUDA IPC ACK, notifying PML: myrank=%d, peerrank=%d",
                            endpoint->my_smp_rank, endpoint->peer_smp_rank);

        smcuda_btl->error_cb(&smcuda_btl->super, MCA_BTL_ERROR_FLAGS_ADD_CUDA_IPC,
                             ep_proc, (char *)&mca_btl_smcuda_component.cuda_ipc_output);
        assert(endpoint->ipcstate == IPC_SENT);
        endpoint->ipcstate = IPC_ACKED;
        break;

    case IPC_NOTREADY:
        /* The remote side is not ready. Reset state to initialized so next
         * send call will try again to set up connection. */
        opal_output_verbose(10, mca_btl_smcuda_component.cuda_ipc_output,
                            "Received CUDA IPC NOTREADY, reset state to allow another attempt: "
                            "myrank=%d, peerrank=%d",
                            endpoint->my_smp_rank, endpoint->peer_smp_rank);
        OPAL_THREAD_LOCK(&endpoint->endpoint_lock);
        if (IPC_SENT == endpoint->ipcstate) {
            endpoint->ipcstate = IPC_INIT;
        }
        OPAL_THREAD_UNLOCK(&endpoint->endpoint_lock);
        break;

    default:
        opal_output(0, "Received UNKNOWN CUDA IPC control message. This should not happen.");
    }
}

#endif /* OMPI_CUDA_SUPPORT */

/*
 * SM component initialization
 */
@@ -721,7 +907,10 @@ mca_btl_smcuda_component_init(int *num_btls,

#if OMPI_CUDA_SUPPORT
    /* Assume CUDA GET works. */
    mca_btl_smcuda.super.btl_get = mca_btl_smcuda_get_cuda;
    /* Register a smcuda control function to help setup IPC support */
    mca_btl_base_active_message_trigger[MCA_BTL_TAG_SMCUDA].cbfunc = btl_smcuda_control;
    mca_btl_base_active_message_trigger[MCA_BTL_TAG_SMCUDA].cbdata = NULL;
#endif /* OMPI_CUDA_SUPPORT */

@@ -850,6 +1039,9 @@ int mca_btl_smcuda_component_progress(void)
    seg.seg_len = hdr->len;
    Frag.base.des_dst_cnt = 1;
    Frag.base.des_dst = &seg;
#if OMPI_CUDA_SUPPORT
    Frag.hdr = hdr; /* needed for peer rank in control messages */
#endif /* OMPI_CUDA_SUPPORT */
    reg->cbfunc(&mca_btl_smcuda.super, hdr->tag, &(Frag.base),
                reg->cbdata);
    /* return the fragment */
@@ -45,6 +45,11 @@ struct mca_btl_base_endpoint_t {
    /** lock for concurrent access to endpoint state */
    opal_mutex_t endpoint_lock;

#if OMPI_CUDA_SUPPORT
    ompi_proc_t *proc_ompi;   /**< Needed for adding CUDA IPC support dynamically */
    enum ipcState ipcstate;   /**< CUDA IPC connection status */
    int ipctries;             /**< Number of times CUDA IPC connect was sent */
#endif /* OMPI_CUDA_SUPPORT */
};

void btl_smcuda_process_pending_sends(struct mca_btl_base_endpoint_t *ep);
@@ -100,7 +100,7 @@ static bool stage_three_init_complete = false;
static bool common_cuda_initialized = false;
static int mca_common_cuda_verbose;
static int mca_common_cuda_output = 0;
-static bool mca_common_cuda_enabled = false;
+bool mca_common_cuda_enabled = false;
static bool mca_common_cuda_register_memory = true;
static bool mca_common_cuda_warning = false;
static opal_list_t common_cuda_memory_registrations;
@@ -1502,3 +1502,30 @@ static int mca_common_cuda_memmove(void *dest, void *src, size_t size)
    cuFunc.cuMemFree(tmp);
    return 0;
}

int mca_common_cuda_get_device(int *devicenum)
{
    CUdevice cuDev;
    int res;

    res = cuFunc.cuCtxGetDevice(&cuDev);
    if(res != CUDA_SUCCESS){
        opal_output(0, "CUDA: cuCtxGetDevice failed: res=%d",
                    res);
        return res;
    }
    *devicenum = cuDev;
    return 0;
}

int mca_common_cuda_device_can_access_peer(int *access, int dev1, int dev2)
{
    int res;
    res = cuFunc.cuDeviceCanAccessPeer(access, (CUdevice)dev1, (CUdevice)dev2);
    if(res != CUDA_SUCCESS){
        opal_output(0, "CUDA: cuDeviceCanAccessPeer failed: res=%d",
                    res);
        return res;
    }
    return 0;
}
@@ -30,6 +30,7 @@ struct mca_mpool_common_cuda_reg_t {
    uint64_t event;
};
typedef struct mca_mpool_common_cuda_reg_t mca_mpool_common_cuda_reg_t;
extern bool mca_common_cuda_enabled;

OMPI_DECLSPEC int mca_common_cuda_register_mca_variables(void);

@@ -68,6 +69,8 @@ OMPI_DECLSPEC int cuda_ungetmemhandle(void *reg_data, mca_mpool_base_registratio
OMPI_DECLSPEC int cuda_openmemhandle(void *base, size_t size, mca_mpool_base_registration_t *newreg,
                                     mca_mpool_base_registration_t *hdrreg);
OMPI_DECLSPEC int cuda_closememhandle(void *reg_data, mca_mpool_base_registration_t *reg);
OMPI_DECLSPEC int mca_common_cuda_get_device(int *devicenum);
OMPI_DECLSPEC int mca_common_cuda_device_can_access_peer(int *access, int dev1, int dev2);

#endif /* OMPI_MCA_COMMON_CUDA_H */
@@ -79,6 +79,11 @@ mca_pml_ob1_t mca_pml_ob1 = {
    }
};

#if OMPI_CUDA_SUPPORT
void mca_pml_ob1_cuda_add_ipc_support(struct mca_btl_base_module_t* btl,
                                      int32_t flags, ompi_proc_t* errproc,
                                      char* btlinfo);
#endif /* OMPI_CUDA_SUPPORT */

void mca_pml_ob1_error_handler( struct mca_btl_base_module_t* btl,
                                int32_t flags, ompi_proc_t* errproc,
@@ -732,6 +737,12 @@ void mca_pml_ob1_process_pending_rdma(void)
void mca_pml_ob1_error_handler(
    struct mca_btl_base_module_t* btl, int32_t flags,
    ompi_proc_t* errproc, char* btlinfo ) {
#if OMPI_CUDA_SUPPORT
    if (flags & MCA_BTL_ERROR_FLAGS_ADD_CUDA_IPC) {
        mca_pml_ob1_cuda_add_ipc_support(btl, flags, errproc, btlinfo);
        return;
    }
#endif /* OMPI_CUDA_SUPPORT */
    ompi_rte_abort(-1, NULL);
}

@@ -43,6 +43,9 @@ size_t mca_pml_ob1_rdma_cuda_btls(
int mca_pml_ob1_cuda_need_buffers(void * rreq,
                                  mca_btl_base_module_t* btl);

void mca_pml_ob1_cuda_add_ipc_support(struct mca_btl_base_module_t* btl, int32_t flags,
                                      ompi_proc_t* errproc, char* btlinfo);

/**
 * Handle the CUDA buffer.
 */
@@ -146,8 +149,12 @@ int mca_pml_ob1_cuda_need_buffers(void * rreq,
                                  mca_btl_base_module_t* btl)
{
    mca_pml_ob1_recv_request_t* recvreq = (mca_pml_ob1_recv_request_t*)rreq;
    mca_bml_base_endpoint_t* bml_endpoint =
        (mca_bml_base_endpoint_t*)recvreq->req_recv.req_base.req_proc->proc_bml;
    mca_bml_base_btl_t *bml_btl = mca_bml_base_btl_array_find(&bml_endpoint->btl_send, btl);

    if ((recvreq->req_recv.req_base.req_convertor.flags & CONVERTOR_CUDA) &&
-        (btl->btl_flags & MCA_BTL_FLAGS_CUDA_GET)) {
+        (bml_btl->btl_flags & MCA_BTL_FLAGS_CUDA_GET)) {
        recvreq->req_recv.req_base.req_convertor.flags &= ~CONVERTOR_CUDA;
        if(opal_convertor_need_buffers(&recvreq->req_recv.req_base.req_convertor) == true) {
            recvreq->req_recv.req_base.req_convertor.flags |= CONVERTOR_CUDA;

@@ -160,3 +167,36 @@ int mca_pml_ob1_cuda_need_buffers(void * rreq,
    return true;
}

/*
 * This function enables us to start using RDMA get protocol with GPU buffers.
 * We do this by adjusting the flags in the BML structure. This is not the
 * best thing, but this may go away if CUDA IPC is supported everywhere in the
 * future. */
void mca_pml_ob1_cuda_add_ipc_support(struct mca_btl_base_module_t* btl, int32_t flags,
                                      ompi_proc_t* errproc, char* btlinfo)
{
    mca_bml_base_endpoint_t* ep;
    int btl_verbose_stream = 0;
    int i;

    assert(NULL != errproc);
    assert(NULL != errproc->proc_bml);
    if (NULL != btlinfo) {
        btl_verbose_stream = *(int *)btlinfo;
    }
    ep = (mca_bml_base_endpoint_t*)errproc->proc_bml;

    /* Find the corresponding bml and adjust the flag to support CUDA get */
    for( i = 0; i < (int)ep->btl_send.arr_size; i++ ) {
        if( ep->btl_send.bml_btls[i].btl == btl ) {
            ep->btl_send.bml_btls[i].btl_flags |= MCA_BTL_FLAGS_CUDA_GET;
            opal_output_verbose(5, btl_verbose_stream,
                                "BTL %s: rank=%d enabling CUDA IPC "
                                "to rank=%d on node=%s \n",
                                btl->btl_component->btl_version.mca_component_name,
                                OMPI_PROC_MY_NAME->vpid,
                                errproc->proc_name.vpid,
                                errproc->proc_hostname);
        }
    }
}

@@ -654,9 +654,11 @@ void mca_pml_ob1_recv_request_progress_rget( mca_pml_ob1_recv_request_t* recvreq
#if OMPI_CUDA_SUPPORT
    if (OPAL_UNLIKELY(NULL == rdma_bml)) {
        if (recvreq->req_recv.req_base.req_convertor.flags & CONVERTOR_CUDA) {
            mca_bml_base_btl_t *bml_btl;
            bml_btl = mca_bml_base_btl_array_find(&bml_endpoint->btl_send, btl);
            /* Check to see if this is a CUDA get */
-            if (btl->btl_flags & MCA_BTL_FLAGS_CUDA_GET) {
-                rdma_bml = mca_bml_base_btl_array_find(&bml_endpoint->btl_send, btl);
+            if (bml_btl->btl_flags & MCA_BTL_FLAGS_CUDA_GET) {
+                rdma_bml = bml_btl;
            }
        } else {
            /* Just default back to send and receive. Must be mix of GPU and HOST memory. */