1
1
openmpi/opal/mca/btl/ugni/btl_ugni_rdma.h

168 строки
7.1 KiB
C
Исходник Обычный вид История

/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
* Copyright (c) 2011-2017 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2011 UT-Battelle, LLC. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#if !defined(MCA_BTL_UGNI_RDMA_H)
#define MCA_BTL_UGNI_RDMA_H
#include "btl_ugni.h"
#include "btl_ugni_frag.h"
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
#include "btl_ugni_device.h"
int mca_btl_ugni_start_eager_get (mca_btl_base_endpoint_t *ep,
mca_btl_ugni_eager_ex_frag_hdr_t hdr,
mca_btl_ugni_base_frag_t *frag);
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
static inline void init_gni_post_desc (mca_btl_ugni_post_descriptor_t *post_desc,
int order, gni_post_type_t op_type,
uint64_t lcl_addr,
gni_mem_handle_t lcl_mdh,
uint64_t rem_addr,
gni_mem_handle_t rem_mdh,
uint64_t bufsize,
gni_cq_handle_t cq_hndl) {
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
post_desc->desc.type = op_type;
post_desc->desc.cq_mode = GNI_CQMODE_GLOBAL_EVENT;
if (MCA_BTL_NO_ORDER == order) {
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
post_desc->desc.dlvr_mode = GNI_DLVMODE_PERFORMANCE;
} else {
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
post_desc->desc.dlvr_mode = GNI_DLVMODE_NO_ADAPT;
}
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
post_desc->desc.local_addr = (uint64_t) lcl_addr;
post_desc->desc.local_mem_hndl = lcl_mdh;
post_desc->desc.remote_addr = (uint64_t) rem_addr;
post_desc->desc.remote_mem_hndl = rem_mdh;
post_desc->desc.length = bufsize;
post_desc->desc.rdma_mode = 0;
post_desc->desc.src_cq_hndl = cq_hndl;
post_desc->tries = 0;
}
static inline int mca_btl_ugni_post_fma (struct mca_btl_base_endpoint_t *endpoint, gni_post_type_t op_type,
size_t size, void *local_address, uint64_t remote_address,
mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle,
int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
mca_btl_ugni_post_descriptor_t *post_desc;
gni_mem_handle_t local_gni_handle = {0, 0};
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
int rc;
if (local_handle) {
local_gni_handle = local_handle->gni_handle;
}
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
post_desc = mca_btl_ugni_alloc_post_descriptor (endpoint, local_handle, cbfunc, cbcontext, cbdata);
if (OPAL_UNLIKELY(NULL == post_desc)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* Post descriptor (CQ is ignored for FMA transactions) -- The CQ associated with the endpoint
* is used. */
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
init_gni_post_desc (post_desc, order, op_type, (intptr_t) local_address, local_gni_handle,
remote_address, remote_handle->gni_handle, size, 0);
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
rc = mca_btl_ugni_endpoint_post_fma (endpoint, post_desc);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
mca_btl_ugni_return_post_descriptor (post_desc);
}
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
return rc;
}
static inline int mca_btl_ugni_post_bte (mca_btl_base_endpoint_t *endpoint, gni_post_type_t op_type,
size_t size, void *local_address, uint64_t remote_address,
mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle,
int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
mca_btl_ugni_post_descriptor_t *post_desc;
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
int rc;
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
post_desc = mca_btl_ugni_alloc_post_descriptor (endpoint, local_handle, cbfunc, cbcontext, cbdata);
if (OPAL_UNLIKELY(NULL == post_desc)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
/* Post descriptor */
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
init_gni_post_desc (post_desc, order, op_type, (intptr_t) local_address, local_handle->gni_handle,
remote_address, remote_handle->gni_handle, size, 0);
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
rc = mca_btl_ugni_endpoint_post_rdma (endpoint, post_desc);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) {
mca_btl_ugni_return_post_descriptor (post_desc);
}
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
return rc;
}
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
static inline int mca_btl_ugni_post_cqwrite (mca_btl_base_endpoint_t *endpoint, mca_btl_ugni_cq_t *cq,
gni_mem_handle_t irq_mhndl, uint64_t value,
mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
mca_btl_ugni_post_descriptor_t *post_desc;
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
int rc;
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
post_desc = mca_btl_ugni_alloc_post_descriptor (endpoint, NULL, cbfunc, cbcontext, cbdata);
if (OPAL_UNLIKELY(NULL == post_desc)) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
post_desc->desc.type = GNI_POST_CQWRITE;
post_desc->desc.cqwrite_value = value; /* up to 48 bytes here, not used for now */
post_desc->desc.cq_mode = GNI_CQMODE_GLOBAL_EVENT;
post_desc->desc.dlvr_mode = GNI_DLVMODE_IN_ORDER;
post_desc->desc.src_cq_hndl = cq->gni_handle;
post_desc->desc.remote_mem_hndl = irq_mhndl;
post_desc->tries = 0;
post_desc->cq = cq;
rc = mca_btl_ugni_endpoint_post_cqwrite (endpoint, post_desc);
if (OPAL_UNLIKELY(OPAL_SUCCESS != rc)) { /* errors for PostCqWrite treated as non-fatal */
mca_btl_ugni_return_post_descriptor (post_desc);
}
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
return rc;
}
static inline int mca_btl_ugni_post (mca_btl_base_endpoint_t *endpoint, int get, size_t size,
void *local_address, uint64_t remote_address,
mca_btl_base_registration_handle_t *local_handle,
mca_btl_base_registration_handle_t *remote_handle,
int order, mca_btl_base_rdma_completion_fn_t cbfunc,
void *cbcontext, void *cbdata)
{
const gni_post_type_t fma_ops[2] = {GNI_POST_FMA_PUT, GNI_POST_FMA_GET};
const gni_post_type_t rdma_ops[2] = {GNI_POST_RDMA_PUT, GNI_POST_RDMA_GET};
if (size <= mca_btl_ugni_component.ugni_fma_limit) {
return mca_btl_ugni_post_fma (endpoint, fma_ops[get], size, local_address, remote_address,
local_handle, remote_handle, order, cbfunc, cbcontext, cbdata);
}
return mca_btl_ugni_post_bte (endpoint, rdma_ops[get], size, local_address, remote_address,
local_handle, remote_handle, order, cbfunc, cbcontext, cbdata);
}
static inline int mca_btl_ugni_repost (mca_btl_ugni_module_t *ugni_module, mca_btl_ugni_post_descriptor_t *post_desc)
{
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
if (GNI_POST_RDMA_PUT == post_desc->desc.type || GNI_POST_RDMA_GET == post_desc->desc.type) {
return mca_btl_ugni_endpoint_post_rdma (post_desc->endpoint, post_desc);
}
btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary field. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer need the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dyncamically. This is needed to support spreading RMA operations accross multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-12 22:37:35 -06:00
return mca_btl_ugni_endpoint_post_fma (post_desc->endpoint, post_desc);
}
#endif /* MCA_BTL_UGNI_RDMA_H */