2012-07-18 21:29:37 +04:00
|
|
|
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
|
2005-07-01 01:28:35 +04:00
|
|
|
/*
|
2007-03-17 02:11:45 +03:00
|
|
|
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
2005-11-05 22:57:48 +03:00
|
|
|
* University Research and Technology
|
|
|
|
* Corporation. All rights reserved.
|
2009-04-10 20:44:37 +04:00
|
|
|
* Copyright (c) 2004-2009 The University of Tennessee and The University
|
2005-11-05 22:57:48 +03:00
|
|
|
* of Tennessee Research Foundation. All rights
|
|
|
|
* reserved.
|
2008-01-21 15:11:18 +03:00
|
|
|
* Copyright (c) 2004-2007 High Performance Computing Center Stuttgart,
|
2005-07-01 01:28:35 +04:00
|
|
|
* University of Stuttgart. All rights reserved.
|
|
|
|
* Copyright (c) 2004-2005 The Regents of the University of California.
|
|
|
|
* All rights reserved.
|
2011-02-24 17:09:22 +03:00
|
|
|
* Copyright (c) 2006-2011 Cisco Systems, Inc. All rights reserved.
|
2009-03-25 19:53:26 +03:00
|
|
|
* Copyright (c) 2006-2009 Mellanox Technologies. All rights reserved.
|
2015-01-06 18:48:33 +03:00
|
|
|
* Copyright (c) 2006-2015 Los Alamos National Security, LLC. All rights
|
2008-01-21 15:11:18 +03:00
|
|
|
* reserved.
|
2007-09-24 14:11:52 +04:00
|
|
|
* Copyright (c) 2006-2007 Voltaire All rights reserved.
|
2010-05-24 22:57:55 +04:00
|
|
|
* Copyright (c) 2009-2010 Oracle and/or its affiliates. All rights reserved.
|
2014-01-14 19:41:56 +04:00
|
|
|
* Copyright (c) 2013-2014 NVIDIA Corporation. All rights reserved.
|
2014-12-09 12:43:15 +03:00
|
|
|
* Copyright (c) 2014 Bull SAS. All rights reserved.
|
|
|
|
* Copyright (c) 2015 Research Organization for Information Science
|
|
|
|
* and Technology (RIST). All rights reserved.
|
2005-07-01 01:28:35 +04:00
|
|
|
* $COPYRIGHT$
|
2008-01-21 15:11:18 +03:00
|
|
|
*
|
2005-07-01 01:28:35 +04:00
|
|
|
* Additional copyrights may follow
|
2008-01-21 15:11:18 +03:00
|
|
|
*
|
2005-07-01 01:28:35 +04:00
|
|
|
* $HEADER$
|
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When the component opens an HCA, look up whether any corresponding
values were found in the INI files (identified by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
2006-08-14 23:30:37 +04:00
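The INI lookup and MTU negotiation summarized in the commit message above can be illustrated with a minimal sketch. It is not the code from btl_openib_ini.l or from the component itself: the type `hca_ini_entry_t`, the functions `hca_ini_lookup`, `hca_select_mtu`, and `negotiate_mtu`, the byte-valued MTU field, and the warning text are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cached INI entry, keyed by vendor ID and vendor part ID. */
typedef struct {
    uint32_t vendor_id;
    uint32_t vendor_part_id;
    uint32_t mtu; /* bytes; the unit is an assumption of this sketch */
} hca_ini_entry_t;

/* Return the cached entry matching the HCA, or NULL if none was read. */
static const hca_ini_entry_t *hca_ini_lookup(const hca_ini_entry_t *cache,
                                             size_t count,
                                             uint32_t vendor_id,
                                             uint32_t vendor_part_id)
{
    assert(cache != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (cache[i].vendor_id == vendor_id &&
            cache[i].vendor_part_id == vendor_part_id) {
            return &cache[i];
        }
    }
    return NULL;
}

/* A missing entry is not fatal: optionally warn, then use the default. */
static uint32_t hca_select_mtu(const hca_ini_entry_t *cache, size_t count,
                               uint32_t vendor_id, uint32_t vendor_part_id,
                               uint32_t default_mtu, int warn_if_missing)
{
    const hca_ini_entry_t *entry =
        hca_ini_lookup(cache, count, vendor_id, vendor_part_id);
    if (entry == NULL) {
        if (warn_if_missing) {
            fprintf(stderr, "warning: no INI params for HCA 0x%x/%u\n",
                    (unsigned) vendor_id, (unsigned) vendor_part_id);
        }
        return default_mtu;
    }
    return entry->mtu;
}

/* Each peer sends its maximum MTU; both sides use the lesser value. */
static uint32_t negotiate_mtu(uint32_t local_max, uint32_t remote_max)
{
    return (local_max < remote_max) ? local_max : remote_max;
}
```

Passing `warn_if_missing` as a flag mirrors the idea that the warning can be switched off by a parameter; how the real component wires that parameter in is not shown here.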
|
|
|
*
|
2005-07-01 01:28:35 +04:00
|
|
|
* @file
|
|
|
|
*/
|
2006-08-14 23:30:37 +04:00
|
|
|
|
2007-04-27 01:03:38 +04:00
|
|
|
#ifndef MCA_BTL_IB_H
|
|
|
|
#define MCA_BTL_IB_H
|
2005-07-01 01:28:35 +04:00
|
|
|
|
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with a few exceptions (mainly BTLs that I have no way of compiling or testing). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 04:47:28 +04:00
|
|
|
#include "opal_config.h"
|
2005-07-01 01:28:35 +04:00
|
|
|
#include <sys/types.h>
|
|
|
|
#include <string.h>
|
2006-06-28 11:23:08 +04:00
|
|
|
#include <infiniband/verbs.h>
|
2005-07-01 01:28:35 +04:00
|
|
|
|
|
|
|
/* Open MPI includes */
|
2007-12-21 09:02:00 +03:00
|
|
|
#include "opal/class/opal_pointer_array.h"
|
2009-04-29 04:49:23 +04:00
|
|
|
#include "opal/class/opal_hash_table.h"
|
2015-03-04 07:53:05 +03:00
|
|
|
#include "opal/util/arch.h"
|
2009-02-14 05:26:12 +03:00
|
|
|
#include "opal/util/output.h"
|
Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.
Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.
The biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated; then we can remove the earlier version if we so choose. This is a statically built framework à la installdirs, so only one component will build at a time. There is no selection logic: the sole compiled component simply loads its function pointers into the opal_event struct.
I have gone through the code base and converted all the libevent calls I could find. However, I can neither compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:
1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)
2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.
There is also the issue of the new libevent "no recursion" behavior. As I described in a recent email, we will have to discuss this and figure out what, if anything, we need to do.
This commit was SVN r23925.
2010-10-24 22:35:54 +04:00
|
|
|
#include "opal/mca/event/event.h"
|
2009-11-27 17:49:24 +03:00
|
|
|
#include "opal/threads/threads.h"
|
2014-07-26 04:47:28 +04:00
|
|
|
#include "opal/mca/btl/btl.h"
|
|
|
|
#include "opal/mca/mpool/mpool.h"
|
|
|
|
#include "opal/mca/btl/base/btl_base_error.h"
|
|
|
|
#include "opal/mca/btl/base/base.h"
|
2015-10-06 21:47:09 +03:00
|
|
|
#include "opal/runtime/opal_progress_threads.h"
|
2006-09-07 17:05:41 +04:00
|
|
|
|
2008-01-15 02:22:03 +03:00
|
|
|
#include "connect/connect.h"
|
|
|
|
|
2007-06-14 05:59:25 +04:00
|
|
|
BEGIN_C_DECLS
|
2005-07-01 01:28:35 +04:00
|
|
|
|
2015-07-13 03:56:51 +03:00
|
|
|
#define HAVE_XRC (OPAL_HAVE_CONNECTX_XRC || OPAL_HAVE_CONNECTX_XRC_DOMAINS)
|
|
|
|
#define ENABLE_DYNAMIC_SL OPAL_ENABLE_DYNAMIC_SL
|
2007-11-28 10:18:59 +03:00
|
|
|
|
2005-07-01 01:28:35 +04:00
|
|
|
#define MCA_BTL_IB_LEAVE_PINNED 1
|
2006-09-26 16:12:33 +04:00
|
|
|
#define IB_DEFAULT_GID_PREFIX 0xfe80000000000000ll
|
2008-10-08 13:56:43 +04:00
|
|
|
#define MCA_BTL_IB_PKEY_MASK 0x7fff
|
2012-11-05 18:02:37 +04:00
|
|
|
#define MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT (256)
|
2005-07-01 01:28:35 +04:00
|
|
|
|
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
2. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers to
post, and when to replenish the receive queue (low watermark). For SRQ
QPs, the number of outstanding sends can also be specified. The
following is an example of the syntax to describe QPs to the OpenIB BTL
using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
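The queue-pair syntax described in the commit message above can be sketched as a small parser. This is only an illustration under stated assumptions, not the BTL's actual parsing code: the names `qp_spec_t`, `parse_one_qp`, and `parse_receive_queues` are hypothetical, the 128-byte scratch buffer is arbitrary, and accepting two QPs of equal size as "ascending" is a choice made for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { QP_PER_PEER, QP_SHARED } qp_kind_t;

/* Hypothetical parsed form of one QP description. */
typedef struct {
    qp_kind_t kind;
    long size;   /* receive buffer size in bytes */
    long rd_num; /* number of receive buffers */
    long rd_low; /* low watermark for reposting */
    long sd_max; /* outstanding sends; only used for "S" QPs */
} qp_spec_t;

/* Parse "P,size,num,low" or "S,size,num,low,sd_max". Returns 0 on success. */
static int parse_one_qp(const char *text, size_t len, qp_spec_t *out)
{
    char buf[128];
    char type = '\0';
    long size = 0, num = 0, low = 0, sd_max = 0;
    int used = 0;

    if (len == 0 || len >= sizeof(buf)) return -1;
    memcpy(buf, text, len);
    buf[len] = '\0';

    if (sscanf(buf, "%c,%ld,%ld,%ld%n", &type, &size, &num, &low, &used) != 4)
        return -1;
    if (type == 'P') {
        if (buf[used] != '\0') return -1; /* per-peer QPs have four fields */
        out->kind = QP_PER_PEER;
    } else if (type == 'S') {
        int used2 = 0;
        if (sscanf(buf + used, ",%ld%n", &sd_max, &used2) != 1) return -1;
        if (buf[used + used2] != '\0') return -1;
        out->kind = QP_SHARED;
    } else {
        return -1;
    }
    /* Basic sanity checks; these limits are assumptions of the sketch. */
    if (size <= 0 || num <= 0 || low < 0 || sd_max < 0) return -1;

    out->size = size;
    out->rd_num = num;
    out->rd_low = low;
    out->sd_max = sd_max;
    return 0;
}

/* Parse a ';'-delimited list; sizes must not decrease from one QP to the
 * next. Returns the number of QPs parsed, or -1 on any error. */
static int parse_receive_queues(const char *spec, qp_spec_t *out, int max_qps)
{
    int count = 0;
    const char *p = spec;

    assert(spec != NULL && out != NULL);
    while (*p != '\0') {
        size_t len = strcspn(p, ";");
        if (count == max_qps) return -1;
        if (parse_one_qp(p, len, &out[count]) != 0) return -1;
        if (count > 0 && out[count].size < out[count - 1].size) return -1;
        ++count;
        p += len;
        if (*p == ';') ++p;
    }
    return count;
}
```

Applied to the example string from the commit message, the sketch yields four entries: one per-peer QP with 128-byte buffers followed by three shared-receive-queue QPs of increasing buffer size.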
|
|
|
|
2008-05-02 15:52:33 +04:00
|
|
|
/*--------------------------------------------------------------------*/
|
|
|
|
|
2009-05-07 00:11:28 +04:00
|
|
|
#if OPAL_ENABLE_DEBUG
|
2008-05-02 15:52:33 +04:00
|
|
|
#define ATTACH() do { \
|
|
|
|
int i = 0; \
|
2008-06-09 18:53:58 +04:00
|
|
|
opal_output(0, "WAITING TO DEBUG ATTACH"); \
|
2008-05-02 15:52:33 +04:00
|
|
|
while (i == 0) sleep(5); \
|
|
|
|
} while(0)
|
|
|
|
#else
|
|
|
|
#define ATTACH()
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*--------------------------------------------------------------------*/
|
|
|
|
|
2005-07-01 01:28:35 +04:00
|
|
|
/**
|
|
|
|
* InfiniBand (IB) BTL component.
|
|
|
|
*/
|
|
|
|
|
2009-12-15 17:25:07 +03:00
|
|
|
typedef enum {
|
|
|
|
MCA_BTL_OPENIB_TRANSPORT_IB,
|
|
|
|
MCA_BTL_OPENIB_TRANSPORT_IWARP,
|
|
|
|
MCA_BTL_OPENIB_TRANSPORT_RDMAOE,
|
|
|
|
MCA_BTL_OPENIB_TRANSPORT_UNKNOWN,
|
|
|
|
MCA_BTL_OPENIB_TRANSPORT_SIZE
|
|
|
|
} mca_btl_openib_transport_type_t;
|
|
|
|
|
2008-01-21 15:11:18 +03:00
|
|
|
typedef enum {
|
2007-07-18 05:15:59 +04:00
|
|
|
MCA_BTL_OPENIB_PP_QP,
|
2007-11-28 10:18:59 +03:00
|
|
|
MCA_BTL_OPENIB_SRQ_QP,
|
|
|
|
MCA_BTL_OPENIB_XRC_QP
|
2007-07-18 05:15:59 +04:00
|
|
|
} mca_btl_openib_qp_type_t;
|
|
|
|
|
|
|
|
struct mca_btl_openib_pp_qp_info_t {
|
|
|
|
int32_t rd_win;
|
|
|
|
int32_t rd_rsv;
|
|
|
|
}; typedef struct mca_btl_openib_pp_qp_info_t mca_btl_openib_pp_qp_info_t;
|
|
|
|
|
|
|
|
struct mca_btl_openib_srq_qp_info_t {
|
|
|
|
int32_t sd_max;
|
2009-12-15 18:52:10 +03:00
|
|
|
/* The init value for rd_curr_num variables of all SRQs */
|
|
|
|
int32_t rd_init;
|
|
|
|
/* The watermark (threshold): if the number of WQEs in the SRQ drops below this value,
|
|
|
|
the SRQ limit event (IBV_EVENT_SRQ_LIMIT_REACHED) is generated on the corresponding SRQ.
|
|
|
|
As a result, the maximum number of pre-posted WQEs on the SRQ is increased. */
|
|
|
|
int32_t srq_limit;
|
2007-07-18 05:15:59 +04:00
|
|
|
}; typedef struct mca_btl_openib_srq_qp_info_t mca_btl_openib_srq_qp_info_t;
|
|
|
|
|
|
|
|
struct mca_btl_openib_qp_info_t {
|
2007-09-30 20:14:17 +04:00
|
|
|
mca_btl_openib_qp_type_t type;
|
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asychronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artifical changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers; once it is reached, the BTL reposts
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to the 1.3 release.
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
|
|
|
size_t size;
|
|
|
|
int32_t rd_num;
|
|
|
|
int32_t rd_low;
|
2008-01-21 15:11:18 +03:00
|
|
|
union {
|
2007-07-18 05:15:59 +04:00
|
|
|
mca_btl_openib_pp_qp_info_t pp_qp;
|
|
|
|
mca_btl_openib_srq_qp_info_t srq_qp;
|
2008-01-21 15:11:18 +03:00
|
|
|
} u;
|
2007-07-18 05:15:59 +04:00
|
|
|
}; typedef struct mca_btl_openib_qp_info_t mca_btl_openib_qp_info_t;
|
|
|
|
|
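The btl_openib_receive_queues syntax described in the commit message can be sketched as a small parser. This is only an illustration under stated assumptions, not the BTL's actual parsing code: `demo_qp_spec_t`, `demo_parse_qp`, `demo_parse_receive_queues`, and the field name `sd_max` are hypothetical, while `size`, `rd_num`, and `rd_low` mirror the struct fields above. The sketch rejects entries whose receive buffer size decreases; whether equal sizes are allowed is an assumption here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one parsed QP description (not the BTL's type). */
typedef struct {
    char type;    /* 'P' = per-peer, 'S' = shared receive queue */
    long size;    /* receive buffer size in bytes */
    long rd_num;  /* number of receive buffers */
    long rd_low;  /* low watermark for reposting receive buffers */
    long sd_max;  /* outstanding sends allowed; SRQ entries only (assumed name) */
} demo_qp_spec_t;

/* Parse one entry such as "P,128,16,4" or "S,1024,256,128,32".
 * Returns 0 on success, -1 on a malformed entry. */
static int demo_parse_qp(const char *entry, demo_qp_spec_t *out)
{
    char extra;  /* only matched if there is trailing garbage */
    int want, got;

    assert(entry != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    if ('P' == entry[0]) {
        want = 3;
        got = sscanf(entry, "P,%ld,%ld,%ld%c",
                     &out->size, &out->rd_num, &out->rd_low, &extra);
    } else if ('S' == entry[0]) {
        want = 4;
        got = sscanf(entry, "S,%ld,%ld,%ld,%ld%c",
                     &out->size, &out->rd_num, &out->rd_low,
                     &out->sd_max, &extra);
    } else {
        return -1;
    }
    if (got != want) {
        return -1;
    }
    out->type = entry[0];
    return 0;
}

/* Split the full value on ';' and parse each entry.
 * Returns the number of QPs parsed, or -1 on error. */
int demo_parse_receive_queues(const char *spec, demo_qp_spec_t *qps, int max_qps)
{
    char buf[256];
    char *p, *semi;
    long prev_size = 0;
    int n = 0;

    if (strlen(spec) >= sizeof(buf)) {
        return -1;
    }
    strcpy(buf, spec);
    for (p = buf; p != NULL; p = (semi != NULL) ? semi + 1 : NULL) {
        semi = strchr(p, ';');
        if (semi != NULL) {
            *semi = '\0';
        }
        if (n >= max_qps || demo_parse_qp(p, &qps[n]) != 0) {
            return -1;
        }
        /* QPs must be listed in ascending receive buffer size order. */
        if (qps[n].size < prev_size) {
            return -1;
        }
        prev_size = qps[n].size;
        n++;
    }
    return n;
}
```

Passing the example string from the commit message yields four entries, the first of type `P` and the remaining three of type `S`.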
2007-09-30 20:14:17 +04:00
|
|
|
#define BTL_OPENIB_QP_TYPE(Q) (mca_btl_openib_component.qp_infos[(Q)].type)
|
|
|
|
#define BTL_OPENIB_QP_TYPE_PP(Q) \
|
|
|
|
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_PP_QP)
|
|
|
|
#define BTL_OPENIB_QP_TYPE_SRQ(Q) \
|
|
|
|
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_SRQ_QP)
|
2007-11-28 10:18:59 +03:00
|
|
|
#define BTL_OPENIB_QP_TYPE_XRC(Q) \
|
|
|
|
(BTL_OPENIB_QP_TYPE(Q) == MCA_BTL_OPENIB_XRC_QP)
|
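The type-test macros above index a per-component array of QP descriptions and compare the stored type against an enum value. A minimal sketch of that pattern follows; every `demo_`/`DEMO_` name is a hypothetical stand-in for the real types and for `mca_btl_openib_component`, so the real definitions may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QP type enum and per-QP info struct. */
typedef enum { DEMO_PP_QP, DEMO_SRQ_QP, DEMO_XRC_QP } demo_qp_type_t;

typedef struct {
    demo_qp_type_t type;
    size_t size;
} demo_qp_info_t;

/* Stand-in for the component singleton that owns the qp_infos array. */
struct {
    demo_qp_info_t *qp_infos;
    int num_qps;
} demo_component;

/* Same shape as the BTL_OPENIB_QP_TYPE* macros: index, then compare. */
#define DEMO_QP_TYPE(Q)      (demo_component.qp_infos[(Q)].type)
#define DEMO_QP_TYPE_PP(Q)   (DEMO_QP_TYPE(Q) == DEMO_PP_QP)
#define DEMO_QP_TYPE_SRQ(Q)  (DEMO_QP_TYPE(Q) == DEMO_SRQ_QP)
#define DEMO_QP_TYPE_XRC(Q)  (DEMO_QP_TYPE(Q) == DEMO_XRC_QP)

/* Count how many configured QPs are of each kind. */
void demo_count_qps(int *pp, int *srq, int *xrc)
{
    int qp;

    assert(pp != NULL && srq != NULL && xrc != NULL);
    *pp = *srq = *xrc = 0;
    for (qp = 0; qp < demo_component.num_qps; qp++) {
        if (DEMO_QP_TYPE_PP(qp)) {
            (*pp)++;
        } else if (DEMO_QP_TYPE_SRQ(qp)) {
            (*srq)++;
        } else if (DEMO_QP_TYPE_XRC(qp)) {
            (*xrc)++;
        }
    }
}
```

Wrapping the argument as `(Q)` inside the macro keeps an expression such as `i + 1` from being misparsed when it is used as the index.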
2007-07-18 05:15:59 +04:00
|
|
|
|
2008-05-21 01:53:42 +04:00
|
|
|
typedef enum {
|
2014-11-05 00:25:02 +03:00
|
|
|
BTL_OPENIB_RQ_SOURCE_DEVICE_INI = MCA_BASE_VAR_SOURCE_MAX,
|
2008-05-21 01:53:42 +04:00
|
|
|
} btl_openib_receive_queues_source_t;
|
|
|
|
|
2008-10-06 04:46:02 +04:00
|
|
|
typedef enum {
|
|
|
|
BTL_OPENIB_DT_IB,
|
|
|
|
BTL_OPENIB_DT_IWARP,
|
|
|
|
BTL_OPENIB_DT_ALL
|
|
|
|
} btl_openib_device_type_t;
|
|
|
|
|
2010-02-18 12:48:16 +03:00
|
|
|
/* The structure for managing all BTL SRQs */
|
|
|
|
typedef struct mca_btl_openib_srq_manager_t {
|
|
|
|
opal_mutex_t lock;
|
2011-07-04 18:00:41 +04:00
|
|
|
/* The keys of this hash table are addresses of
|
2010-02-18 12:48:16 +03:00
|
|
|
SRQ structures, and the elements are BTL module
|
|
|
|
pointers that are associated with these SRQs */
|
|
|
|
opal_hash_table_t srq_addr_table;
|
|
|
|
} mca_btl_openib_srq_manager_t;
|
|
|
|
|
2005-07-01 01:28:35 +04:00
|
|
|
struct mca_btl_openib_component_t {
|
2015-01-06 18:48:33 +03:00
|
|
|
mca_btl_base_component_3_0_0_t super; /**< base BTL component */
|
2006-03-13 20:03:21 +03:00
|
|
|
|
|
|
|
int ib_max_btls;
|
2008-07-23 04:28:59 +04:00
|
|
|
/**< maximum number of devices available to openib component */
|
2008-01-21 15:11:18 +03:00
|
|
|
|
2006-06-01 06:32:18 +04:00
|
|
|
int ib_num_btls;
|
2008-07-23 04:28:59 +04:00
|
|
|
/**< number of devices available to the openib component */
|
2005-07-01 01:28:35 +04:00
|
|
|
|
2007-05-09 01:47:21 +04:00
|
|
|
struct mca_btl_openib_module_t **openib_btls;
|
2007-04-27 01:03:38 +04:00
|
|
|
/**< array of available BTLs */
|
2005-07-01 01:28:35 +04:00
|
|
|
|
2008-07-23 04:28:59 +04:00
|
|
|
opal_pointer_array_t devices; /**< array of available devices */
|
|
|
|
int devices_count;
|
2007-08-22 13:31:12 +04:00
|
|
|
|
2005-07-01 01:28:35 +04:00
|
|
|
int ib_free_list_num;
|
|
|
|
/**< initial size of free lists */
|
|
|
|
|
|
|
|
int ib_free_list_max;
|
|
|
|
/**< maximum size of free lists */
|
|
|
|
|
|
|
|
int ib_free_list_inc;
|
|
|
|
/**< number of elements to alloc when growing free lists */
|
|
|
|
|
2005-07-03 20:22:16 +04:00
|
|
|
opal_list_t ib_procs;
|
2005-07-01 01:28:35 +04:00
|
|
|
/**< list of ib proc structures */
|
|
|
|
|
2005-07-04 03:09:55 +04:00
|
|
|
opal_event_t ib_send_event;
|
2005-07-01 01:28:35 +04:00
|
|
|
/**< event structure for sends */
|
|
|
|
|
2005-07-04 03:09:55 +04:00
|
|
|
opal_event_t ib_recv_event;
|
2005-07-01 01:28:35 +04:00
|
|
|
/**< event structure for recvs */
|
|
|
|
|
2005-07-04 02:45:48 +04:00
|
|
|
opal_mutex_t ib_lock;
|
2005-07-01 01:28:35 +04:00
|
|
|
/**< lock for accessing module state */
|
|
|
|
|
2008-01-21 15:11:18 +03:00
|
|
|
char* ib_mpool_name;
|
|
|
|
/**< name of ib memory pool */
|
2007-04-27 01:03:38 +04:00
|
|
|
|
2007-07-18 05:15:59 +04:00
|
|
|
uint8_t num_pp_qps; /**< number of pp qp's */
|
|
|
|
uint8_t num_srq_qps; /**< number of srq qp's */
|
2007-11-28 10:18:59 +03:00
|
|
|
uint8_t num_xrc_qps; /**< number of xrc qp's */
|
2007-07-18 05:15:59 +04:00
|
|
|
uint8_t num_qps; /**< total number of qp's */
|
2008-01-21 15:11:18 +03:00
|
|
|
|
2007-11-28 10:18:59 +03:00
|
|
|
opal_hash_table_t ib_addr_table; /**< used only for XRC: hash table that
|
|
|
|
keeps track of all LIDs/subnets */
|
2007-07-18 05:15:59 +04:00
|
|
|
mca_btl_openib_qp_info_t* qp_infos;
|
2008-01-21 15:11:18 +03:00
|
|
|
|
2007-04-27 01:03:38 +04:00
|
|
|
size_t eager_limit; /**< Eager send limit of first fragment, in Bytes */
|
|
|
|
size_t max_send_size; /**< Maximum send size, in Bytes */
|
2011-02-23 18:50:37 +03:00
|
|
|
uint32_t max_hw_msg_size; /**< Maximum message size for RDMA protocols, in Bytes */
|
2007-04-27 01:03:38 +04:00
|
|
|
uint32_t reg_mru_len; /**< Length of the registration cache most recently used list */
|
2008-01-21 15:11:18 +03:00
|
|
|
uint32_t use_srq; /**< Use the Shared Receive Queue (SRQ mode) */
|
|
|
|
|
|
|
|
uint32_t ib_cq_size[2]; /**< Max outstanding CQE on the CQ */
|
|
|
|
|
2013-03-28 01:09:41 +04:00
|
|
|
int ib_max_inline_data; /**< Max size of inline data */
|
|
|
|
unsigned int ib_pkey_val;
|
|
|
|
unsigned int ib_psn;
|
|
|
|
unsigned int ib_qp_ous_rd_atom;
|
2008-01-21 15:11:18 +03:00
|
|
|
uint32_t ib_mtu;
|
2013-03-28 01:09:41 +04:00
|
|
|
unsigned int ib_min_rnr_timer;
|
|
|
|
unsigned int ib_timeout;
|
|
|
|
unsigned int ib_retry_count;
|
|
|
|
unsigned int ib_rnr_retry;
|
|
|
|
unsigned int ib_max_rdma_dst_ops;
|
|
|
|
unsigned int ib_service_level;
|
2011-06-28 18:28:29 +04:00
|
|
|
#if (ENABLE_DYNAMIC_SL)
|
2013-03-28 01:09:41 +04:00
|
|
|
unsigned int ib_path_record_service_level;
|
2011-06-28 18:28:29 +04:00
|
|
|
#endif
|
2013-03-28 01:09:41 +04:00
|
|
|
int use_eager_rdma;
|
|
|
|
int eager_rdma_threshold; /**< After this number of messages, always use RDMA for short messages */
|
|
|
|
int eager_rdma_num;
|
2006-10-31 20:29:25 +03:00
|
|
|
int32_t max_eager_rdma;
|
2013-03-28 01:09:41 +04:00
|
|
|
unsigned int btls_per_lid;
|
|
|
|
unsigned int max_lmc;
|
|
|
|
int apm_lmc;
|
|
|
|
int apm_ports;
|
|
|
|
unsigned int buffer_alignment; /**< Preferred communication buffer alignment in Bytes (must be power of two) */
|
2010-05-24 22:57:55 +04:00
|
|
|
int32_t error_counter; /**< Counts the number of error events received on all devices */
|
2015-10-06 21:47:09 +03:00
|
|
|
opal_event_base_t *async_evbase; /**< Async event base */
|
2013-03-28 01:09:41 +04:00
|
|
|
bool use_async_event_thread; /**< Use the async event handler */
|
2010-02-18 12:48:16 +03:00
|
|
|
mca_btl_openib_srq_manager_t srq_manager; /**< Hash table for all BTL SRQs */
|
2011-01-19 23:58:22 +03:00
|
|
|
#if BTL_OPENIB_FAILOVER_ENABLED
|
2013-03-28 01:09:41 +04:00
|
|
|
bool port_error_failover; /**< Report port errors to speed up failover */
|
2007-05-09 01:47:21 +04:00
|
|
|
#endif
|
2013-03-28 01:09:41 +04:00
|
|
|
/* Declare as an int instead of btl_openib_device_type_t since there is no
|
|
|
|
guarantee about the size of an enum. This value will be registered as an
|
|
|
|
integer with the MCA variable system */
|
|
|
|
int device_type;
|
2007-06-14 05:59:25 +04:00
|
|
|
char *if_include;
|
|
|
|
char **if_include_list;
|
|
|
|
char *if_exclude;
|
|
|
|
char **if_exclude_list;
|
2008-11-17 23:20:24 +03:00
|
|
|
char *ipaddr_include;
|
|
|
|
char *ipaddr_exclude;
|
2005-07-15 19:13:19 +04:00
|
|
|
|
2008-05-21 01:53:42 +04:00
|
|
|
/* MCA param btl_openib_receive_queues */
|
|
|
|
char *receive_queues;
|
|
|
|
/* Whether we got a non-default value of btl_openib_receive_queues */
|
2014-07-30 06:17:46 +04:00
|
|
|
mca_base_var_source_t receive_queues_source;
|
2008-05-21 01:53:42 +04:00
|
|
|
|
2008-07-23 04:28:59 +04:00
|
|
|
/** Colon-delimited list of filenames for device parameters */
|
|
|
|
char *device_params_file_names;
|
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When the component opens an HCA, look up whether any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
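The MTU step of peer negotiation described in the commit message (each side sends its maximum MTU and both use the smaller one) can be sketched as follows. `demo_negotiate_mtu` is a hypothetical helper, and treating the MTU as a plain byte count is an assumption; the real connection code may represent it differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given the local and the peer's advertised maximum
 * MTU (assumed here to be byte counts), return the value both sides use. */
uint32_t demo_negotiate_mtu(uint32_t local_max_mtu, uint32_t remote_max_mtu)
{
    /* A zero MTU would mean the exchange failed; treat it as a caller bug. */
    assert(local_max_mtu > 0 && remote_max_mtu > 0);
    return (local_max_mtu < remote_max_mtu) ? local_max_mtu : remote_max_mtu;
}
```

Because both peers apply the same rule to the same pair of values, they arrive at the same MTU without a further round trip.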
2006-08-14 23:30:37 +04:00
|
|
|
|
|
|
|
/** Whether we're in verbose mode or not */
|
|
|
|
bool verbose;
|
|
|
|
|
2008-07-23 04:28:59 +04:00
|
|
|
/** Whether we want a warning if no device-specific parameters are
|
2006-08-14 23:30:37 +04:00
|
|
|
found in INI files */
|
2008-07-23 04:28:59 +04:00
|
|
|
bool warn_no_device_params_found;
|
2006-09-26 16:12:33 +04:00
|
|
|
/** Whether we want a warning if a non-default GID prefix is not configured
|
|
|
|
on multiport setup */
|
|
|
|
bool warn_default_gid_prefix;
|
2007-06-14 05:59:25 +04:00
|
|
|
/** Whether we want a warning if the user specifies a non-existent
|
2008-07-23 04:28:59 +04:00
|
|
|
device and/or port via btl_openib_if_[in|ex]clude MCA params */
|
2007-06-14 05:59:25 +04:00
|
|
|
bool warn_nonexistent_if;
|
2012-08-25 15:39:06 +04:00
|
|
|
/** Whether we want to abort if there's not enough registered
|
|
|
|
memory available */
|
|
|
|
bool abort_not_enough_reg_mem;
|
|
|
|
|
2007-06-14 05:59:25 +04:00
|
|
|
/** Dummy argv-style list; a copy of names from the
|
|
|
|
if_[in|ex]clude list that we use for error checking (to ensure
|
|
|
|
that they all exist) */
|
|
|
|
char **if_list;
|
2007-12-09 17:05:13 +03:00
|
|
|
bool use_message_coalescing;
|
2013-03-28 01:09:41 +04:00
|
|
|
unsigned int cq_poll_ratio;
|
|
|
|
unsigned int cq_poll_progress;
|
|
|
|
unsigned int cq_poll_batch;
|
|
|
|
unsigned int eager_rdma_poll_ratio;
|
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
|
|
|
int rdma_qp;
|
2007-11-28 10:13:34 +03:00
|
|
|
int credits_qp; /* qp used for software flow control */
|
2008-07-10 19:08:49 +04:00
|
|
|
bool cpc_explicitly_defined;
|
2008-01-09 13:26:21 +03:00
|
|
|
/**< free list of frags only; used for pinning user memory */
|
2015-02-19 23:41:41 +03:00
|
|
|
opal_free_list_t send_user_free;
|
2008-01-09 13:26:21 +03:00
|
|
|
/**< free list of frags only; used for pinning user memory */
|
2015-02-19 23:41:41 +03:00
|
|
|
opal_free_list_t recv_user_free;
|
2008-01-09 13:26:21 +03:00
|
|
|
/**< frags for coalesced messages */
|
2015-02-19 23:41:41 +03:00
|
|
|
opal_free_list_t send_free_coalesced;
|
2009-12-15 17:25:07 +03:00
|
|
|
/** Default receive queues */
|
|
|
|
char* default_recv_qps;
|
2011-02-24 17:09:22 +03:00
|
|
|
/** GID index to use */
|
|
|
|
int gid_index;
|
2009-12-15 18:52:10 +03:00
|
|
|
/** Whether we want a dynamically resizing srq, enabled by default */
|
|
|
|
bool enable_srq_resize;
|
2014-12-23 23:45:56 +03:00
|
|
|
bool allow_max_memory_registration;
|
2013-12-23 21:47:43 +04:00
|
|
|
int memory_registration_verbose_level;
|
2013-09-13 18:28:26 +04:00
|
|
|
int memory_registration_verbose;
|
2014-01-14 19:41:56 +04:00
|
|
|
int ignore_locality;
|
2011-01-19 23:58:22 +03:00
|
|
|
#if BTL_OPENIB_FAILOVER_ENABLED
|
2010-07-14 14:08:19 +04:00
|
|
|
int verbose_failover;
|
|
|
|
#endif
|
2012-05-09 00:42:09 +04:00
|
|
|
#if BTL_OPENIB_MALLOC_HOOKS_ENABLED
|
|
|
|
int use_memalign;
|
|
|
|
size_t memalign_threshold;
|
|
|
|
void* (*previous_malloc_hook)(size_t __size, const void*);
|
|
|
|
#endif
|
2013-11-13 17:22:39 +04:00
|
|
|
#if OPAL_CUDA_SUPPORT
|
2013-03-28 01:09:41 +04:00
|
|
|
bool cuda_async_send;
|
2013-05-15 01:04:15 +04:00
|
|
|
bool cuda_async_recv;
|
2013-11-13 17:22:39 +04:00
|
|
|
bool cuda_have_gdr;
|
2013-12-10 20:04:08 +04:00
|
|
|
bool driver_have_gdr;
|
2013-11-13 17:22:39 +04:00
|
|
|
bool cuda_want_gdr;
|
2013-11-01 16:19:40 +04:00
|
|
|
#endif /* OPAL_CUDA_SUPPORT */
|
2013-10-23 10:08:54 +04:00
|
|
|
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
|
2013-10-29 00:04:49 +04:00
|
|
|
bool rroce_enable;
|
2013-10-23 10:08:54 +04:00
|
|
|
#endif
|
2005-07-01 01:28:35 +04:00
|
|
|
}; typedef struct mca_btl_openib_component_t mca_btl_openib_component_t;
|
|
|
|
|
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with a few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 04:47:28 +04:00
|
|
|
OPAL_MODULE_DECLSPEC extern mca_btl_openib_component_t mca_btl_openib_component;
|
2005-07-01 01:28:35 +04:00
|
|
|
|
2008-01-21 15:11:18 +03:00
|
|
|
typedef mca_btl_base_recv_reg_t mca_btl_openib_recv_reg_t;
|
|
|
|
|
2008-05-02 15:52:33 +04:00
|
|
|
/**
|
|
|
|
* Common information for all ports that is sent in the modex message
|
|
|
|
*/
|
|
|
|
typedef struct mca_btl_openib_modex_message_t {
|
|
|
|
/** The subnet ID of this port */
|
2007-01-13 01:42:20 +03:00
|
|
|
uint64_t subnet_id;
|
2008-05-02 15:52:33 +04:00
|
|
|
/** LID of this port */
|
|
|
|
uint16_t lid;
|
|
|
|
/** APM LID for this port */
|
|
|
|
uint16_t apm_lid;
|
|
|
|
/** The MTU used by this port */
|
|
|
|
uint8_t mtu;
|
2009-12-15 17:25:07 +03:00
|
|
|
/** Vendor ID; together with the part ID it defines the device type and tuning */
|
|
|
|
uint32_t vendor_id;
|
|
|
|
/** Vendor part ID; together with the vendor ID it defines the device type and tuning */
|
|
|
|
uint32_t vendor_part_id;
|
|
|
|
/** Transport type of remote port */
|
|
|
|
uint8_t transport_type;
|
2008-05-02 15:52:33 +04:00
|
|
|
/** Dummy field used to calculate the real length */
|
|
|
|
uint8_t end;
|
|
|
|
} mca_btl_openib_modex_message_t;
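The `end` member above is documented only as a dummy field used to calculate the real length. A minimal sketch of how such a marker is typically used, assuming the length is taken with `offsetof` so that trailing struct padding is not sent; the `example_` names are hypothetical stand-ins, not Open MPI symbols:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in that mirrors the field order of the modex
 * message shown above. */
typedef struct example_modex_message {
    uint64_t subnet_id;
    uint16_t lid;
    uint16_t apm_lid;
    uint8_t  mtu;
    uint32_t vendor_id;
    uint32_t vendor_part_id;
    uint8_t  transport_type;
    uint8_t  end;            /* marker only; never transmitted */
} example_modex_message_t;

/* Number of bytes up to, but not including, the marker. This is at most
 * sizeof(example_modex_message_t) minus the marker byte, and smaller when
 * the compiler adds trailing padding after `end`. */
static size_t example_modex_wire_length(void)
{
    return offsetof(example_modex_message_t, end);
}
```

Using the marker's offset rather than `sizeof` keeps the transmitted length independent of whatever tail padding a given ABI appends to the struct.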
|
|
|
|
|
|
|
|
#define MCA_BTL_OPENIB_MODEX_MSG_NTOH(hdr) \
|
2007-01-13 02:14:45 +03:00
|
|
|
do { \
|
2007-01-13 17:22:42 +03:00
|
|
|
(hdr).subnet_id = ntoh64((hdr).subnet_id); \
|
2008-05-02 15:52:33 +04:00
|
|
|
(hdr).lid = ntohs((hdr).lid); \
|
2007-01-13 02:14:45 +03:00
|
|
|
} while (0)
|
2008-05-02 15:52:33 +04:00
|
|
|
#define MCA_BTL_OPENIB_MODEX_MSG_HTON(hdr) \
|
2007-01-13 02:14:45 +03:00
|
|
|
do { \
|
2007-01-13 17:22:42 +03:00
|
|
|
(hdr).subnet_id = hton64((hdr).subnet_id); \
|
2008-05-02 15:52:33 +04:00
|
|
|
(hdr).lid = htons((hdr).lid); \
|
2007-01-13 02:14:45 +03:00
|
|
|
} while (0)
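The two macros above convert only `subnet_id` and `lid` between host and network byte order. The sketch below shows the same pattern on a reduced, hypothetical header so the round trip can be checked in isolation. `example_hton64` and `example_ntoh64` are illustrative helpers written for this example; they are not the `hton64`/`ntoh64` used by the macros above.

```c
#include <arpa/inet.h>   /* htons, ntohs (POSIX) */
#include <stdint.h>
#include <string.h>

/* Hypothetical header holding only the two fields the macros convert. */
typedef struct example_hdr {
    uint64_t subnet_id;
    uint16_t lid;
} example_hdr_t;

/* Host to network order: store the most significant byte first,
 * regardless of host endianness. */
static uint64_t example_hton64(uint64_t host)
{
    unsigned char bytes[8];
    uint64_t net;
    int i;

    for (i = 0; i < 8; i++) {
        bytes[i] = (unsigned char)(host >> (56 - 8 * i));
    }
    memcpy(&net, bytes, sizeof(net));
    return net;
}

/* Network to host order: read the bytes back, most significant first. */
static uint64_t example_ntoh64(uint64_t net)
{
    unsigned char bytes[8];
    uint64_t host = 0;
    int i;

    memcpy(bytes, &net, sizeof(bytes));
    for (i = 0; i < 8; i++) {
        host = (host << 8) | bytes[i];
    }
    return host;
}

#define EXAMPLE_HDR_HTON(hdr)                                  \
    do {                                                       \
        (hdr).subnet_id = example_hton64((hdr).subnet_id);     \
        (hdr).lid = htons((hdr).lid);                          \
    } while (0)

#define EXAMPLE_HDR_NTOH(hdr)                                  \
    do {                                                       \
        (hdr).subnet_id = example_ntoh64((hdr).subnet_id);     \
        (hdr).lid = ntohs((hdr).lid);                          \
    } while (0)
```

A sender would apply `EXAMPLE_HDR_HTON` just before publishing the header and a receiver would apply `EXAMPLE_HDR_NTOH` right after reading it. The `do { ... } while (0)` wrapper lets each macro be used as a single statement.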
|
|
|
|
|
2008-07-23 04:28:59 +04:00
|
|
|
typedef struct mca_btl_openib_device_qp_t {
|
2015-02-19 23:41:41 +03:00
|
|
|
opal_free_list_t send_free; /**< free lists of send buffer descriptors */
|
|
|
|
opal_free_list_t recv_free; /**< free lists of receive buffer descriptors */
|
2008-07-23 04:28:59 +04:00
|
|
|
} mca_btl_openib_device_qp_t;
|
2008-01-09 13:26:21 +03:00
|
|
|
|
2007-12-23 15:29:34 +03:00
|
|
|
struct mca_btl_base_endpoint_t;
|
|
|
|
|
2008-07-23 04:28:59 +04:00
|
|
|
typedef struct mca_btl_openib_device_t {
|
2007-12-23 15:29:34 +03:00
|
|
|
opal_object_t super;
|
2006-06-28 11:23:08 +04:00
|
|
|
struct ibv_device *ib_dev; /* the ib device */
|
2014-07-26 04:47:28 +04:00
|
|
|
#if OPAL_ENABLE_PROGRESS_THREADS == 1
|
2008-07-23 04:28:59 +04:00
|
|
|
struct ibv_comp_channel *ib_channel; /* Channel event for the device */
|
2006-11-02 19:15:21 +03:00
|
|
|
opal_thread_t thread; /* Progress thread */
|
|
|
|
volatile bool progress; /* Progress status */
|
2014-06-26 00:43:28 +04:00
|
|
|
#endif
|
2008-07-23 04:28:59 +04:00
|
|
|
opal_mutex_t device_lock; /* device level lock */
|
2006-06-28 11:23:08 +04:00
|
|
|
struct ibv_context *ib_dev_context;
|
|
|
|
struct ibv_device_attr ib_dev_attr;
|
|
|
|
struct ibv_pd *ib_pd;
|
2007-08-20 16:28:25 +04:00
|
|
|
struct ibv_cq *ib_cq[2];
|
|
|
|
uint32_t cq_size[2];
|
2006-06-28 11:23:08 +04:00
|
|
|
mca_mpool_base_module_t *mpool;
|
2008-07-23 04:28:59 +04:00
|
|
|
/* MTU for this device */
|
2006-08-14 23:30:37 +04:00
|
|
|
uint32_t mtu;
|
2008-07-23 04:28:59 +04:00
|
|
|
/* Whether this device supports eager RDMA */
|
2006-12-14 18:52:13 +03:00
|
|
|
uint8_t use_eager_rdma;
|
2008-07-23 04:28:59 +04:00
|
|
|
uint8_t btls; /**< number of btls using this device */
|
2007-12-21 09:02:00 +03:00
|
|
|
opal_pointer_array_t *endpoints;
|
2008-07-23 04:28:59 +04:00
|
|
|
opal_pointer_array_t *device_btls;
|
2007-12-23 15:29:34 +03:00
|
|
|
uint16_t hp_cq_polls;
|
|
|
|
uint16_t eager_rdma_polls;
|
|
|
|
bool pollme;
|
2007-07-18 05:15:59 +04:00
|
|
|
volatile bool got_fatal_event;
|
2010-05-24 22:57:55 +04:00
|
|
|
volatile bool got_port_event;
|
2007-11-28 10:18:59 +03:00
|
|
|
#if HAVE_XRC
|
2015-01-07 07:27:25 +03:00
|
|
|
#if OPAL_HAVE_CONNECTX_XRC_DOMAINS
|
2014-12-09 12:43:15 +03:00
|
|
|
struct ibv_xrcd *xrcd;
|
|
|
|
#else
|
2007-11-28 10:18:59 +03:00
|
|
|
struct ibv_xrc_domain *xrc_domain;
|
2014-12-09 12:43:15 +03:00
|
|
|
#endif
|
2007-11-28 10:18:59 +03:00
|
|
|
int xrc_fd;
|
|
|
|
#endif
|
2008-01-21 18:07:39 +03:00
|
|
|
int32_t non_eager_rdma_endpoints;
|
2007-12-23 15:29:34 +03:00
|
|
|
int32_t eager_rdma_buffers_count;
|
|
|
|
struct mca_btl_base_endpoint_t **eager_rdma_buffers;
|
2008-01-09 13:26:21 +03:00
|
|
|
/**< frags for control messages */
|
2015-02-19 23:41:41 +03:00
|
|
|
opal_free_list_t send_free_control;
|
2008-07-23 04:28:59 +04:00
|
|
|
/* QP types and attributes that will be used on this device */
|
|
|
|
mca_btl_openib_device_qp_t *qps;
|
|
|
|
/* Maximum value supported by this device for max_inline_data */
|
2008-06-24 21:18:07 +04:00
|
|
|
uint32_t max_inline_data;
|
2012-07-19 21:52:21 +04:00
|
|
|
/* Registration limit and current count */
|
|
|
|
uint64_t mem_reg_max, mem_reg_active;
|
2014-01-06 23:51:30 +04:00
|
|
|
/* Device is ready for use */
|
|
|
|
bool ready_for_use;
|
2015-10-06 21:47:09 +03:00
|
|
|
/* Async event */
|
|
|
|
opal_event_t async_event;
|
2008-07-23 04:28:59 +04:00
|
|
|
} mca_btl_openib_device_t;
|
|
|
|
OBJ_CLASS_DECLARATION(mca_btl_openib_device_t);
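The r11182 commit message says that peers exchange their maximum MTU when connecting and select the lesser of the two. A minimal sketch of that rule follows. It assumes the MTU values are verbs-style codes (1 for 256 bytes up to 5 for 4096 bytes, as in `enum ibv_mtu`), which the listing above does not state; both function names are hypothetical.

```c
#include <stdint.h>

/* Pick the MTU both sides can handle: the smaller of the two codes. */
static uint32_t example_select_mtu(uint32_t local_mtu, uint32_t remote_mtu)
{
    return (local_mtu < remote_mtu) ? local_mtu : remote_mtu;
}

/* Translate an MTU code into bytes (1 -> 256, 2 -> 512, ... 5 -> 4096).
 * Returns 0 for codes outside that range. */
static uint32_t example_mtu_code_to_bytes(uint32_t code)
{
    if (code < 1 || code > 5) {
        return 0;
    }
    return UINT32_C(128) << code;
}
```

For instance, a local code of 5 and a remote code of 3 select code 3, which corresponds to 1024 bytes.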
|
2007-04-27 01:03:38 +04:00
|
|
|
|
2007-07-18 05:15:59 +04:00
|
|
|
struct mca_btl_openib_module_pp_qp_t {
|
|
|
|
int32_t dummy;
|
|
|
|
}; typedef struct mca_btl_openib_module_pp_qp_t mca_btl_openib_module_pp_qp_t;
|
|
|
|
|
|
|
|
struct mca_btl_openib_module_srq_qp_t {
|
2008-01-21 15:11:18 +03:00
|
|
|
struct ibv_srq *srq;
|
|
|
|
int32_t rd_posted;
|
|
|
|
int32_t sd_credits; /* the max number of outstanding sends on a QP when using SRQ */
|
|
|
|
/* i.e. the number of frags that can be outstanding (down counter) */
|
2007-11-28 10:12:44 +03:00
|
|
|
opal_list_t pending_frags[2]; /**< list of high/low prio frags */
|
2009-12-15 18:52:10 +03:00
|
|
|
/** The number of receive buffers that can currently be posted.
|
|
|
|
The value may be increased in the IBV_EVENT_SRQ_LIMIT_REACHED
|
|
|
|
event handler. The value starts at (rd_num / 4) and is increased up to rd_num */
|
|
|
|
int32_t rd_curr_num;
|
|
|
|
/** We post additional WQEs only if the number of WQEs (in a specific SRQ) is less than this value.
|
|
|
|
The value is increased together with rd_curr_num. The value is unique for every SRQ. */
|
|
|
|
int32_t rd_low_local;
|
2011-07-04 18:00:41 +04:00
|
|
|
/** This flag indicates whether we want to get the
|
2009-12-15 18:52:10 +03:00
|
|
|
IBV_EVENT_SRQ_LIMIT_REACHED events for dynamically resizing SRQ */
|
|
|
|
bool srq_limit_event_flag;
|
2009-12-16 13:23:58 +03:00
|
|
|
/**< Unlike the "--mca enable_srq_resize" parameter, which says whether we want
|
|
|
|
to start with a small number of pre-posted receive buffers (rd_curr_num) and to increase this number as needed
|
2011-07-04 17:39:38 +04:00
|
|
|
(the max of this value is rd_num * the whole size of SRQ), the "srq_limit_event_flag" says if we want to get limit event
|
2009-12-16 13:23:58 +03:00
|
|
|
from the device when the defined SRQ limit is reached (signal to the main thread); we clear this flag once rd_curr_num
|
|
|
|
was increased up to rd_num.
|
|
|
|
To avoid lock/unlock operations in the critical path, we only set
|
|
|
|
the srq_limit_event_flag in the asynchronous thread; this way we post receive buffers
|
|
|
|
in the main thread only, and only after posting do we set (if srq_limit_event_flag is true)
|
|
|
|
the limit for the IBV_EVENT_SRQ_LIMIT_REACHED event. */
|
2007-07-18 05:15:59 +04:00
|
|
|
}; typedef struct mca_btl_openib_module_srq_qp_t mca_btl_openib_module_srq_qp_t;
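The comments on `rd_curr_num` and `srq_limit_event_flag` describe a counter that starts at `rd_num / 4`, grows when an `IBV_EVENT_SRQ_LIMIT_REACHED` event arrives, and stops asking for events once it reaches `rd_num`. The sketch below models only that bookkeeping. The comments do not say how large each step is, so the doubling used here is an assumption. The `example_` types and functions are hypothetical and do not touch any verbs API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the resize state kept per SRQ. */
typedef struct example_srq_state {
    int32_t rd_num;        /* configured maximum number of receive buffers */
    int32_t rd_curr_num;   /* number of buffers that may currently be posted */
    bool    limit_event;   /* still interested in SRQ limit events? */
} example_srq_state_t;

/* Start with a quarter of the configured buffers (at least one). */
static void example_srq_init(example_srq_state_t *s, int32_t rd_num)
{
    s->rd_num = rd_num;
    s->rd_curr_num = rd_num / 4;
    if (s->rd_curr_num < 1) {
        s->rd_curr_num = 1;
    }
    s->limit_event = (s->rd_curr_num < s->rd_num);
}

/* React to a limit event: grow the window, capped at rd_num.
 * ASSUMPTION: growth by doubling; the source only gives start and cap. */
static void example_srq_on_limit_reached(example_srq_state_t *s)
{
    if (s->rd_curr_num < s->rd_num) {
        if (s->rd_curr_num > s->rd_num / 2) {
            s->rd_curr_num = s->rd_num;
        } else {
            s->rd_curr_num *= 2;
        }
    }
    /* Once the cap is reached there is nothing left to grow, so stop
     * requesting further limit events. */
    s->limit_event = (s->rd_curr_num < s->rd_num);
}
```

With `rd_num` set to 256, the window in this model goes 64, 128, 256 and the flag is cleared after the second event.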
|
|
|
|
|
|
|
|
struct mca_btl_openib_module_qp_t {
|
|
|
|
union {
|
|
|
|
mca_btl_openib_module_pp_qp_t pp_qp;
|
|
|
|
mca_btl_openib_module_srq_qp_t srq_qp;
|
|
|
|
} u;
|
|
|
|
}; typedef struct mca_btl_openib_module_qp_t mca_btl_openib_module_qp_t;
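The r15474 commit message describes the `btl_openib_receive_queues` syntax: QP descriptions separated by ";", fields separated by ",", a leading "P" or "S", and a fifth field for SRQ QPs only. The following is a minimal sketch of a parser for exactly that description. It is not the BTL's own parser, it handles no fields beyond those listed in the commit message, and every `example_` name is hypothetical.

```c
#include <stdio.h>

/* Hypothetical record for one parsed QP description. */
typedef struct example_qp_spec {
    char type;     /* 'P' = per-peer, 'S' = shared receive queue */
    long size;     /* receive buffer size in bytes */
    long rd_num;   /* number of receive buffers */
    long rd_low;   /* low watermark that triggers reposting */
    long sd_max;   /* max outstanding sends (SRQ only, otherwise 0) */
} example_qp_spec_t;

/* Parse a string such as "P,128,16,4;S,1024,256,128,32" into out[].
 * Returns the number of QPs parsed, or -1 on malformed input, on too
 * many QPs, or when buffer sizes are not in ascending order. */
static int example_parse_receive_queues(const char *spec,
                                        example_qp_spec_t *out, int max_qps)
{
    const char *p = spec;
    int count = 0;

    while ('\0' != *p) {
        example_qp_spec_t qp = { 0, 0, 0, 0, 0 };
        int consumed = 0;

        if (count == max_qps) {
            return -1;
        }
        /* %n stores how many characters have been consumed so far. */
        if (4 != sscanf(p, "%c,%ld,%ld,%ld%n", &qp.type, &qp.size,
                        &qp.rd_num, &qp.rd_low, &consumed)) {
            return -1;
        }
        p += consumed;

        if ('S' == qp.type) {
            /* SRQ QPs carry a fifth field: outstanding sends. */
            if (1 != sscanf(p, ",%ld%n", &qp.sd_max, &consumed)) {
                return -1;
            }
            p += consumed;
        } else if ('P' != qp.type) {
            return -1;
        }

        /* The commit message requires ascending receive buffer sizes. */
        if (count > 0 && qp.size < out[count - 1].size) {
            return -1;
        }
        out[count++] = qp;

        if (';' == *p) {
            p++;
            if ('\0' == *p) {
                return -1;   /* trailing separator with nothing after it */
            }
        } else if ('\0' != *p) {
            return -1;       /* unexpected character after a description */
        }
    }
    return count;
}
```

Applied to the four-QP string from the commit message, this sketch yields one per-peer entry (128-byte buffers, 16 posted, low watermark 4) followed by three SRQ entries, each allowing 32 outstanding sends.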
|
|
|
|
|
2005-07-01 01:28:35 +04:00
|
|
|
/**
|
2007-04-27 01:03:38 +04:00
|
|
|
* IB BTL Interface
|
2005-07-01 01:28:35 +04:00
|
|
|
*/
|
|
|
|
struct mca_btl_openib_module_t {
|
2008-05-02 15:52:33 +04:00
|
|
|
/* Base BTL module */
|
|
|
|
mca_btl_base_module_t super;
|
|
|
|
|
2008-01-21 15:11:18 +03:00
|
|
|
bool btl_inited;
|
2008-05-02 15:52:33 +04:00
|
|
|
|
|
|
|
/** Common information about all ports */
|
|
|
|
mca_btl_openib_modex_message_t port_info;
|
|
|
|
|
|
|
|
/** Array of CPCs on this port */
|
2014-07-26 04:47:28 +04:00
|
|
|
opal_btl_openib_connect_base_module_t **cpcs;
|
2008-05-02 15:52:33 +04:00
|
|
|
|
|
|
|
/** Number of elements in the cpcs array */
|
|
|
|
uint8_t num_cpcs;
|
|
|
|
|
2008-07-23 04:28:59 +04:00
|
|
|
mca_btl_openib_device_t *device;
|
2008-01-21 15:11:18 +03:00
|
|
|
uint8_t port_num; /**< ID of the PORT */
|
2007-04-22 14:22:12 +04:00
|
|
|
uint16_t pkey_index;
|
2008-01-21 15:11:18 +03:00
|
|
|
struct ibv_port_attr ib_port_attr;
|
2006-06-28 11:23:08 +04:00
|
|
|
uint16_t lid; /**< lid that is actually used (for LMC) */
|
2008-02-20 16:44:05 +03:00
|
|
|
int apm_port; /**< Alternative port that may be used for APM */
|
2006-06-28 11:23:08 +04:00
|
|
|
uint8_t src_path_bits; /**< offset from base lid (for LMC) */
|
2008-01-21 15:11:18 +03:00
|
|
|
|
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
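The receive-queue syntax described above can be sketched as a small parser. This is only an illustration under stated assumptions, not the openib BTL's actual parsing code: the names `qp_kind`, `qp_spec`, and `parse_receive_queues` are hypothetical, the field layout is exactly the one given in the commit message (four fields for "P", five for "S"), and a QP whose buffer size is smaller than the preceding one is treated as an error.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical types for illustration; not the BTL's own structures. */
enum qp_kind { QP_PER_PEER, QP_SHARED };

struct qp_spec {
    enum qp_kind kind;
    long buf_size;       /* receive buffer size in bytes */
    long num_bufs;       /* number of receive buffers to post */
    long low_watermark;  /* repost when this many buffers remain */
    long max_sends;      /* outstanding sends (SRQ only, 0 for per-peer) */
};

/*
 * Parse a string such as "P,128,16,4;S,1024,256,128,32".
 * Returns the number of QPs stored in out[], or -1 on malformed input
 * or when buffer sizes are not in ascending order.
 */
static int parse_receive_queues(const char *spec, struct qp_spec *out,
                                int max_qps)
{
    const char *p = spec;
    int count = 0;

    while (*p != '\0') {
        struct qp_spec qp = { QP_PER_PEER, 0, 0, 0, 0 };
        char type;
        int consumed = 0;

        if (count == max_qps) {
            return -1;
        }
        /* Common prefix: type, buffer size, buffer count, low watermark. */
        if (sscanf(p, "%c,%ld,%ld,%ld%n", &type, &qp.buf_size,
                   &qp.num_bufs, &qp.low_watermark, &consumed) != 4) {
            return -1;
        }
        p += consumed;

        if (type == 'S') {
            /* SRQ QPs carry a fifth field: outstanding sends. */
            qp.kind = QP_SHARED;
            if (sscanf(p, ",%ld%n", &qp.max_sends, &consumed) != 1) {
                return -1;
            }
            p += consumed;
        } else if (type != 'P') {
            return -1;
        }

        if (qp.buf_size <= 0 || qp.num_bufs <= 0 || qp.low_watermark < 0) {
            return -1;
        }
        /* Enforce the ascending receive buffer size rule. */
        if (count > 0 && qp.buf_size < out[count - 1].buf_size) {
            return -1;
        }
        out[count++] = qp;

        if (*p == ';') {
            p++;              /* next QP description follows */
        } else if (*p != '\0') {
            return -1;        /* unexpected trailing characters */
        }
    }
    return count;
}
```

Calling `parse_receive_queues` on the four-QP example string from the commit message should yield one per-peer entry followed by three shared-receive-queue entries.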
2007-07-18 05:15:59 +04:00
|
|
|
int32_t num_peers;
|
2008-01-21 15:11:18 +03:00
|
|
|
|
|
|
|
opal_mutex_t ib_lock; /**< module level lock */
|
|
|
|
|
2007-04-27 01:03:38 +04:00
|
|
|
size_t eager_rdma_frag_size; /**< length of eager frag */
|
2007-12-23 15:29:34 +03:00
|
|
|
volatile int32_t eager_rdma_channels; /**< number of open RDMA channels */
|
2006-08-17 00:21:38 +04:00
|
|
|
|
|
|
|
mca_btl_base_module_error_cb_fn_t error_cb; /**< error handler */
|
2008-01-21 15:11:18 +03:00
|
|
|
|
2007-07-18 05:15:59 +04:00
|
|
|
mca_btl_openib_module_qp_t * qps;
|
2012-07-19 21:52:21 +04:00
|
|
|
|
|
|
|
int local_procs; /**< number of local procs */
|
2015-11-24 02:07:12 +03:00
|
|
|
|
|
|
|
bool atomic_ops_be; /**< atomic result is big endian */
|
2007-04-27 01:03:38 +04:00
|
|
|
};
|
|
|
|
typedef struct mca_btl_openib_module_t mca_btl_openib_module_t;
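The `atomic_ops_be` flag above records that the device returns atomic results in big-endian byte order. A minimal sketch of what a consumer of that flag might do is shown here; `load_be64` and `fixup_atomic_result` are hypothetical helper names, and the assumption that the fix-up happens in place on an 8-byte result buffer is made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Read 8 bytes as a big-endian integer, independent of host byte order. */
static uint64_t load_be64(const void *bytes)
{
    const unsigned char *b = bytes;
    uint64_t value = 0;

    for (int i = 0; i < 8; ++i) {
        value = (value << 8) | b[i];
    }
    return value;
}

/*
 * Hypothetical helper: if the device delivered the atomic result in
 * big-endian order, rewrite the 8-byte buffer in host order.
 * When atomic_ops_be is false the buffer is left untouched.
 */
static void fixup_atomic_result(void *local_address, bool atomic_ops_be)
{
    if (atomic_ops_be) {
        uint64_t host = load_be64(local_address);
        memcpy(local_address, &host, sizeof(host));
    }
}
```

Reading the bytes one at a time avoids any dependence on host endianness, so the same code is a no-op in effect on a big-endian host.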
|
2006-12-17 15:26:41 +03:00
|
|
|
|
2005-07-01 01:28:35 +04:00
|
|
|
extern mca_btl_openib_module_t mca_btl_openib_module;
|
|
|
|
|
2015-01-06 18:48:33 +03:00
|
|
|
struct mca_btl_base_registration_handle_t {
|
|
|
|
uint32_t rkey;
|
|
|
|
uint32_t lkey;
|
|
|
|
};
|
|
|
|
|
2006-12-17 15:26:41 +03:00
|
|
|
struct mca_btl_openib_reg_t {
|
|
|
|
mca_mpool_base_registration_t base;
|
|
|
|
struct ibv_mr *mr;
|
2015-01-06 18:48:33 +03:00
|
|
|
mca_btl_base_registration_handle_t btl_handle;
|
2006-12-17 15:26:41 +03:00
|
|
|
};
|
|
|
|
typedef struct mca_btl_openib_reg_t mca_btl_openib_reg_t;
|
|
|
|
|
2014-07-26 04:47:28 +04:00
|
|
|
#if OPAL_ENABLE_PROGRESS_THREADS == 1
|
2006-11-02 19:15:21 +03:00
|
|
|
extern void* mca_btl_openib_progress_thread(opal_object_t*);
|
2014-06-26 00:43:28 +04:00
|
|
|
#endif
|
2007-04-27 01:03:38 +04:00
|
|
|
|
2006-08-17 00:21:38 +04:00
|
|
|
|
|
|
|
/**
|
|
|
|
* Register a callback function that is called on error.
|
|
|
|
*
|
|
|
|
* @param btl (IN) BTL module
|
|
|
|
* @return Status indicating if the callback was registered successfully
|
|
|
|
*/
|
|
|
|
|
|
|
|
int mca_btl_openib_register_error_cb(
|
|
|
|
struct mca_btl_base_module_t* btl,
|
|
|
|
mca_btl_base_module_error_cb_fn_t cbfunc
|
|
|
|
);
|
2008-01-21 15:11:18 +03:00
|
|
|
|
2005-07-01 01:28:35 +04:00
|
|
|
|
|
|
|
/**
|
|
|
|
* Cleanup any resources held by the BTL.
|
2008-01-21 15:11:18 +03:00
|
|
|
*
|
2005-07-01 01:28:35 +04:00
|
|
|
* @param btl BTL instance.
|
2014-07-26 04:47:28 +04:00
|
|
|
* @return OPAL_SUCCESS or error status on failure.
|
2005-07-01 01:28:35 +04:00
|
|
|
*/
|
|
|
|
|
|
|
|
extern int mca_btl_openib_finalize(
|
|
|
|
struct mca_btl_base_module_t* btl
|
|
|
|
);
|
|
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
* PML->BTL notification of change in the process list.
|
2008-01-21 15:11:18 +03:00
|
|
|
*
|
2007-04-27 01:03:38 +04:00
|
|
|
* @param btl (IN) BTL module
|
|
|
|
* @param nprocs (IN) Number of processes
|
|
|
|
* @param procs (IN) Set of processes
|
|
|
|
* @param peers (OUT) Set of (optional) peer addressing info.
|
|
|
|
* @param reachable (IN/OUT) Set of processes that are reachable via this BTL.
|
2014-07-26 04:47:28 +04:00
|
|
|
* @return OPAL_SUCCESS or error status on failure.
|
2008-01-21 15:11:18 +03:00
|
|
|
*
|
2005-07-01 01:28:35 +04:00
|
|
|
*/
|
|
|
|
|
|
|
|
extern int mca_btl_openib_add_procs(
|
|
|
|
struct mca_btl_base_module_t* btl,
|
|
|
|
size_t nprocs,
|
2014-07-26 04:47:28 +04:00
|
|
|
struct opal_proc_t **procs,
|
2005-07-01 01:28:35 +04:00
|
|
|
struct mca_btl_base_endpoint_t** peers,
|
2009-03-04 01:25:13 +03:00
|
|
|
opal_bitmap_t* reachable
|
2005-07-01 01:28:35 +04:00
|
|
|
);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* PML->BTL notification of change in the process list.
|
|
|
|
*
|
|
|
|
* @param btl (IN) BTL instance
|
|
|
|
* @param nprocs (IN) Number of processes.
|
|
|
|
* @param procs (IN) Set of processes.
|
|
|
|
* @param peers (IN) Set of peer data structures.
|
|
|
|
* @return Status indicating if cleanup was successful
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
extern int mca_btl_openib_del_procs(
|
|
|
|
struct mca_btl_base_module_t* btl,
|
|
|
|
size_t nprocs,
|
2014-07-26 04:47:28 +04:00
|
|
|
struct opal_proc_t **procs,
|
2005-07-01 01:28:35 +04:00
|
|
|
struct mca_btl_base_endpoint_t** peers
|
|
|
|
);
|
|
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
* PML->BTL Initiate a send of the specified size.
|
|
|
|
*
|
|
|
|
* @param btl (IN) BTL instance
|
2007-04-27 01:03:38 +04:00
|
|
|
* @param btl_peer (IN) BTL peer addressing
|
|
|
|
* @param descriptor (IN) Descriptor of data to be transmitted.
|
|
|
|
* @param tag (IN) Tag.
|
2005-07-01 01:28:35 +04:00
|
|
|
*/
|
|
|
|
extern int mca_btl_openib_send(
|
|
|
|
struct mca_btl_base_module_t* btl,
|
|
|
|
struct mca_btl_base_endpoint_t* btl_peer,
|
2008-01-21 15:11:18 +03:00
|
|
|
struct mca_btl_base_descriptor_t* descriptor,
|
2005-07-01 01:28:35 +04:00
|
|
|
mca_btl_base_tag_t tag
|
|
|
|
);
|
|
|
|
|
2009-03-25 19:53:26 +03:00
|
|
|
/**
|
|
|
|
* PML->BTL Initiate an immediate send of the specified size.
|
|
|
|
*
|
|
|
|
* @param btl (IN) BTL instance
|
|
|
|
* @param ep (IN) Endpoint
|
|
|
|
* @param convertor (IN) Datatypes converter
|
|
|
|
* @param header (IN) PML header
|
|
|
|
* @param header_size (IN) PML header size
|
|
|
|
* @param payload_size (IN) Payload size
|
|
|
|
* @param order (IN) Order
|
|
|
|
* @param flags (IN) Flags
|
|
|
|
* @param tag (IN) Tag
|
|
|
|
* @param descriptor (OUT) Message descriptor
|
|
|
|
*/
|
|
|
|
extern int mca_btl_openib_sendi( struct mca_btl_base_module_t* btl,
|
|
|
|
struct mca_btl_base_endpoint_t* ep,
|
- Split the datatype engine into two parts: an MPI specific part in OMPI
  and a language agnostic part in OPAL. The convertor is completely
  moved into OPAL. This offers several benefits as described in RFC
  http://www.open-mpi.org/community/lists/devel/2009/07/6387.php
  namely:
  - Fewer basic types (int* and float* types, boolean and wchar)
  - Fixing naming scheme to ompi-nomenclature.
  - Usability outside of the ompi-layer.
- Due to the fixed nature of simple opal types, their information is
  completely known at compile time and therefore constified
- With fewer datatypes (22), the actual sizes of bit-field types may be
  reduced from 64 to 32 bits, allowing reorganizing the opal_datatype
  structure, eliminating holes and keeping data required in convertor
  (upon send/recv) in one cacheline...
  This has implications to the convertor-datastructure and other parts
  of the code.
- Several performance tests have been run, the netpipe latency does not
  change with this patch on Linux/x86-64 on the smoky cluster.
- Extensive tests have been done to verify correctness (no new
  regressions) using:
  1. mpi_test_suite on linux/x86-64 using clean ompi-trunk and ompi-ddt:
     a. running both trunk and ompi-ddt resulted in no differences
        (except for MPI_SHORT_INT and MPI_TYPE_MIX_LB_UB do now run
        correctly).
     b. with --enable-memchecker and running under valgrind (one buglet
        when run with static found in test-suite, committed)
  2. ibm testsuite on linux/x86-64 using clean ompi-trunk and ompi-ddt:
     all passed (except for the dynamic/ tests failed!! as trunk/MTT)
  3. compilation and usage of HDF5 tests on Jaguar using PGI and
     PathScale compilers.
  4. compilation and usage on Scicortex.
- Please note, that for the heterogeneous case, (-m32 compiled
  binaries/ompi), neither ompi-trunk, nor ompi-ddt branch would
  successfully launch.
This commit was SVN r21641.
2009-07-13 08:56:31 +04:00
|
|
|
struct opal_convertor_t* convertor,
|
2009-03-25 19:53:26 +03:00
|
|
|
void* header,
|
|
|
|
size_t header_size,
|
|
|
|
size_t payload_size,
|
|
|
|
uint8_t order,
|
|
|
|
uint32_t flags,
|
|
|
|
mca_btl_base_tag_t tag,
|
|
|
|
mca_btl_base_descriptor_t** descriptor
|
2011-07-04 18:00:41 +04:00
|
|
|
);
|
2009-03-25 19:53:26 +03:00
|
|
|
|
2015-01-06 18:48:33 +03:00
|
|
|
/* forward declaration for internal put/get */
|
|
|
|
struct mca_btl_openib_put_frag_t;
|
|
|
|
struct mca_btl_openib_get_frag_t;
|
|
|
|
|
2005-08-18 21:08:27 +04:00
|
|
|
/**
|
2015-01-06 18:48:33 +03:00
|
|
|
* @brief Schedule a put fragment with the HCA (internal)
|
2005-08-18 21:08:27 +04:00
|
|
|
*
|
|
|
|
* @param btl (IN) BTL instance
|
2015-01-06 18:48:33 +03:00
|
|
|
* @param ep (IN) BTL endpoint
|
|
|
|
* @param frag (IN) Fragment prepared by mca_btl_openib_put
|
|
|
|
*
|
|
|
|
* If the fragment can not be scheduled due to resource limitations then
|
|
|
|
* the fragment will be put on the pending put fragment list and retried
|
|
|
|
* when another get/put fragment has completed.
|
2014-10-31 01:43:41 +03:00
|
|
|
*/
|
2015-01-06 18:48:33 +03:00
|
|
|
int mca_btl_openib_put_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
|
|
|
|
struct mca_btl_openib_put_frag_t *frag);
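The comment on `mca_btl_openib_put_internal` describes a defer-and-retry policy: a fragment that cannot be scheduled is placed on a pending list and retried when another get/put fragment completes. The sketch below models that policy in isolation. Everything in it (`endpoint_sim`, `frag`, `schedule_frag`, `frag_completed`, the fixed-size FIFO, and the `free_wqes` counter standing in for "resource limitations") is a hypothetical simplification and not the BTL's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical fragment: only tracks whether it reached the "HCA". */
struct frag {
    int id;
    int posted;
};

/* Hypothetical endpoint state: a resource counter plus a FIFO. */
struct endpoint_sim {
    int free_wqes;                      /* work-queue entries available */
    struct frag *pending[MAX_PENDING];  /* deferred fragments, in order */
    size_t head;
    size_t count;
};

/* Post the fragment if a work-queue entry is free; 1 on success. */
static int try_post(struct endpoint_sim *ep, struct frag *f)
{
    if (ep->free_wqes == 0) {
        return 0;
    }
    ep->free_wqes--;
    f->posted = 1;
    return 1;
}

/* Post now if possible, otherwise queue; -1 only if the queue is full. */
static int schedule_frag(struct endpoint_sim *ep, struct frag *f)
{
    /* Do not overtake fragments that are already waiting. */
    if (ep->count == 0 && try_post(ep, f)) {
        return 0;
    }
    if (ep->count == MAX_PENDING) {
        return -1;
    }
    ep->pending[(ep->head + ep->count) % MAX_PENDING] = f;
    ep->count++;
    return 0;
}

/* Completion handler: release one resource, then retry pending frags. */
static void frag_completed(struct endpoint_sim *ep)
{
    ep->free_wqes++;
    while (ep->count > 0 && try_post(ep, ep->pending[ep->head])) {
        ep->head = (ep->head + 1) % MAX_PENDING;
        ep->count--;
    }
}
```

The caller sees success in both the posted and the queued case, which matches the documented behavior that scheduling failure due to resource limits is handled internally rather than reported.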
|
2014-10-31 01:43:41 +03:00
|
|
|
|
|
|
|
/**
|
2015-01-06 18:48:33 +03:00
|
|
|
* @brief Schedule an RDMA write with the HCA
|
2014-10-31 01:43:41 +03:00
|
|
|
*
|
|
|
|
* @param btl (IN) BTL instance
|
2015-01-06 18:48:33 +03:00
|
|
|
* @param ep (IN) BTL endpoint
|
|
|
|
* @param local_address (IN) Source address
|
|
|
|
* @param remote_address (IN) Destination address
|
|
|
|
* @param local_handle (IN) Registration handle for region containing the region {local_address, size}
|
|
|
|
* @param remote_handle (IN) Registration handle for region containing the region {remote_address, size}
|
|
|
|
* @param size (IN) Number of bytes to write
|
|
|
|
* @param flags (IN) Transfer flags
|
|
|
|
* @param order (IN) Ordering
|
|
|
|
* @param cbfunc (IN) Function to call on completion
|
|
|
|
* @param cbcontext (IN) Context for completion callback
|
|
|
|
* @param cbdata (IN) Data for completion callback
|
|
|
|
*
|
|
|
|
* @return OPAL_ERR_BAD_PARAM if a bad parameter was passed
|
|
|
|
* @return OPAL_SUCCESS if the operation was successfully scheduled
|
|
|
|
*
|
|
|
|
* This function will attempt to schedule a put operation with the HCA.
|
2014-11-14 03:26:35 +03:00
|
|
|
*/
|
2015-01-06 18:48:33 +03:00
|
|
|
int mca_btl_openib_put (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint, void *local_address,
|
|
|
|
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
|
|
|
|
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
|
|
|
|
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);
|
2014-11-14 03:26:35 +03:00
|
|
|
|
2015-01-06 18:48:33 +03:00
|
|
|
/**
|
|
|
|
* @brief Schedule a get fragment with the HCA (internal)
|
|
|
|
*
|
|
|
|
* @param btl (IN) BTL instance
|
|
|
|
* @param ep (IN) BTL endpoint
|
|
|
|
* @param qp (IN) ID of queue pair to schedule the get on
|
|
|
|
* @param frag (IN) Fragment prepared by mca_btl_openib_get
|
|
|
|
*
|
|
|
|
* If the fragment can not be scheduled due to resource limitations then
|
|
|
|
* the fragment will be put on the pending get fragment list and retried
|
|
|
|
* when another get/put fragment has completed.
|
|
|
|
*/
|
|
|
|
int mca_btl_openib_get_internal (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *ep,
|
|
|
|
struct mca_btl_openib_get_frag_t *frag);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* @brief Schedule an RDMA read with the HCA
|
|
|
|
*
|
|
|
|
* @param btl (IN) BTL instance
|
|
|
|
* @param ep (IN) BTL endpoint
|
|
|
|
* @param local_address (IN) Destination address
|
|
|
|
* @param remote_address (IN) Source address
|
|
|
|
* @param local_handle (IN) Registration handle for region containing the region {local_address, size}
|
|
|
|
* @param remote_handle (IN) Registration handle for region containing the region {remote_address, size}
|
|
|
|
* @param size (IN) Number of bytes to read
|
|
|
|
* @param flags (IN) Transfer flags
|
|
|
|
* @param order (IN) Ordering
|
|
|
|
* @param cbfunc (IN) Function to call on completion
|
|
|
|
* @param cbcontext (IN) Context for completion callback
|
|
|
|
* @param cbdata (IN) Data for completion callback
|
|
|
|
*
|
|
|
|
* @return OPAL_ERR_BAD_PARAM if a bad parameter was passed
|
|
|
|
* @return OPAL_SUCCESS if the operation was successfully scheduled
|
|
|
|
*
|
|
|
|
* This function will attempt to schedule a get operation with the HCA.
|
|
|
|
*/
|
|
|
|
int mca_btl_openib_get (mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint, void *local_address,
|
|
|
|
uint64_t remote_address, mca_btl_base_registration_handle_t *local_handle,
|
|
|
|
mca_btl_base_registration_handle_t *remote_handle, size_t size, int flags,
|
|
|
|
int order, mca_btl_base_rdma_completion_fn_t cbfunc, void *cbcontext, void *cbdata);
|
2014-11-14 03:26:35 +03:00
|
|
|
|
2015-01-06 00:36:57 +03:00
|
|
|
/**
|
|
|
|
* Initiate an asynchronous fetching atomic operation.
|
|
|
|
* Completion Semantics: if this function returns a 1 then the operation
|
|
|
|
* is complete. A return of OPAL_SUCCESS indicates
|
|
|
|
* the atomic operation has been queued with the
|
|
|
|
* network.
|
|
|
|
*
|
|
|
|
* @param btl (IN) BTL module
|
|
|
|
* @param endpoint (IN) BTL addressing information
|
|
|
|
* @param local_address (OUT) Local address to store the result in
|
|
|
|
* @param remote_address (IN) Remote address to perform the operation on (registered remotely)
|
|
|
|
* @param local_handle (IN) Local registration handle for region containing
|
|
|
|
* (local_address, local_address + 8)
|
|
|
|
* @param remote_handle (IN) Remote registration handle for region containing
|
|
|
|
* (remote_address, remote_address + 8)
|
|
|
|
* @param op (IN) Operation to perform
|
|
|
|
* @param operand (IN) Operand for the operation
|
|
|
|
* @param flags (IN) Flags for this atomic operation
|
|
|
|
* @param order (IN) Ordering
|
|
|
|
* @param cbfunc (IN) Function to call on completion (if queued)
|
|
|
|
* @param cbcontext (IN) Context for the callback
|
|
|
|
* @param cbdata (IN) Data for callback
|
|
|
|
*
|
|
|
|
* @retval OPAL_SUCCESS The operation was successfully queued
|
|
|
|
* @retval 1 The operation is complete
|
|
|
|
* @retval OPAL_ERROR The operation was NOT successfully queued
|
|
|
|
* @retval OPAL_ERR_OUT_OF_RESOURCE Insufficient resources to queue the atomic
|
|
|
|
* operation. Try again later
|
|
|
|
* @retval OPAL_ERR_NOT_AVAILABLE Atomic operation can not be performed due to
|
|
|
|
* alignment restrictions or the operation {op} is not supported
|
|
|
|
* by the hardware.
|
|
|
|
*
|
|
|
|
* After the operation is complete the remote address specified by {remote_address} and
|
|
|
|
* {remote_handle} will be updated with (*remote_address) = (*remote_address) op operand.
|
|
|
|
* {local_address} will be updated with the previous value stored in {remote_address}.
|
|
|
|
* The btl will guarantee consistency of atomic operations performed via the btl. Note,
|
|
|
|
* however, that not all btls will provide consistency between btl atomic operations and
|
|
|
|
* cpu atomics.
|
|
|
|
*/
|
|
|
|
int mca_btl_openib_atomic_fop (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
|
|
|
|
void *local_address, uint64_t remote_address,
|
|
|
|
struct mca_btl_base_registration_handle_t *local_handle,
|
|
|
|
struct mca_btl_base_registration_handle_t *remote_handle, mca_btl_base_atomic_op_t op,
|
|
|
|
uint64_t operand, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
|
|
|
|
void *cbcontext, void *cbdata);
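The result semantics documented for `mca_btl_openib_atomic_fop` (the remote location becomes `old op operand`, the local buffer receives `old`) can be modelled with plain memory. This is a single-threaded sketch of the data flow only: `sim_atomic_op`, `sim_atomic_fop`, and `SIM_ERR_NOT_AVAILABLE` are hypothetical names, the set of operations is an assumption, and no real atomicity or network transfer is involved. The return value 1 mirrors the documented "operation is complete" case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical operation codes; the real BTL defines its own enum. */
enum sim_atomic_op {
    SIM_ATOMIC_ADD,
    SIM_ATOMIC_AND,
    SIM_ATOMIC_OR,
    SIM_ATOMIC_XOR
};

#define SIM_ERR_NOT_AVAILABLE (-1)

/*
 * Model of a fetching atomic:
 *   old = *remote; *remote = old op operand; *local = old.
 * Returns 1 (complete) or SIM_ERR_NOT_AVAILABLE for an unknown op.
 */
static int sim_atomic_fop(uint64_t *local_address, uint64_t *remote_address,
                          enum sim_atomic_op op, uint64_t operand)
{
    uint64_t old = *remote_address;
    uint64_t updated;

    switch (op) {
    case SIM_ATOMIC_ADD: updated = old + operand; break;
    case SIM_ATOMIC_AND: updated = old & operand; break;
    case SIM_ATOMIC_OR:  updated = old | operand; break;
    case SIM_ATOMIC_XOR: updated = old ^ operand; break;
    default:
        return SIM_ERR_NOT_AVAILABLE;  /* op not supported in this model */
    }

    *remote_address = updated;
    *local_address = old;   /* caller receives the previous value */
    return 1;
}
```

A caller of the real function would additionally have to handle the queued case, where the result is only valid once the completion callback has run.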
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Initiate an asynchronous compare and swap operation.
|
|
|
|
* Completion Semantics: if this function returns a 1 then the operation
|
|
|
|
* is complete. A return of OPAL_SUCCESS indicates
|
|
|
|
* the atomic operation has been queued with the
|
|
|
|
* network.
|
|
|
|
*
|
|
|
|
* @param btl (IN) BTL module
|
|
|
|
* @param endpoint (IN) BTL addressing information
|
|
|
|
* @param local_address (OUT) Local address to store the result in
|
|
|
|
* @param remote_address (IN) Remote address to perform the operation on (registered remotely)
|
|
|
|
* @param local_handle (IN) Local registration handle for region containing
|
|
|
|
* (local_address, local_address + 8)
|
|
|
|
* @param remote_handle (IN) Remote registration handle for region containing
|
|
|
|
* (remote_address, remote_address + 8)
|
|
|
|
* @param compare (IN) Value to compare against the contents of remote_address
|
|
|
|
* @param value (IN) Value to store on success
|
|
|
|
* @param flags (IN) Flags for this atomic operation
|
|
|
|
* @param order (IN) Ordering
|
|
|
|
* @param cbfunc (IN) Function to call on completion (if queued)
|
|
|
|
* @param cbcontext (IN) Context for the callback
|
|
|
|
* @param cbdata (IN) Data for callback
|
|
|
|
*
|
|
|
|
* @retval OPAL_SUCCESS The operation was successfully queued
|
|
|
|
* @retval 1 The operation is complete
|
|
|
|
* @retval OPAL_ERROR The operation was NOT successfully queued
|
|
|
|
* @retval OPAL_ERR_OUT_OF_RESOURCE Insufficient resources to queue the atomic
|
|
|
|
* operation. Try again later
|
|
|
|
* @retval OPAL_ERR_NOT_AVAILABLE Atomic operation can not be performed due to
|
|
|
|
* alignment restrictions or the operation {op} is not supported
|
|
|
|
* by the hardware.
|
|
|
|
*
|
|
|
|
* After the operation is complete the remote address specified by {remote_address} and
|
|
|
|
* {remote_handle} will be updated with {value} if *remote_address == compare.
|
|
|
|
* {local_address} will be updated with the previous value stored in {remote_address}.
|
|
|
|
* The btl will guarantee consistency of atomic operations performed via the btl. Note,
|
|
|
|
* however, that not all btls will provide consistency between btl atomic operations and
|
|
|
|
* cpu atomics.
|
|
|
|
*/
|
|
|
|
int mca_btl_openib_atomic_cswap (struct mca_btl_base_module_t *btl, struct mca_btl_base_endpoint_t *endpoint,
|
|
|
|
void *local_address, uint64_t remote_address,
|
|
|
|
struct mca_btl_base_registration_handle_t *local_handle,
|
|
|
|
struct mca_btl_base_registration_handle_t *remote_handle, uint64_t compare,
|
|
|
|
uint64_t value, int flags, int order, mca_btl_base_rdma_completion_fn_t cbfunc,
|
|
|
|
void *cbcontext, void *cbdata);
|
|
|
|
|
2005-07-01 01:28:35 +04:00
|
|
|
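/*
 * Illustrative sketch only, not part of the openib BTL. It models the
 * result documented for mca_btl_openib_atomic_cswap on ordinary local
 * memory, single-threaded: the previous remote value is always written to
 * the local location, and the remote location is overwritten only when it
 * equals the compare value. The name model_cswap is hypothetical; the real
 * operation is carried out by the network hardware and may complete
 * asynchronously.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: applies the documented compare-and-swap result to
 * two plain 64-bit locations and returns the previous remote value. */
static uint64_t model_cswap(uint64_t *remote, uint64_t compare, uint64_t value,
                            uint64_t *local)
{
    assert(remote != NULL && local != NULL);

    *local = *remote;            /* local_address receives the previous value */
    if (*remote == compare) {
        *remote = value;         /* stored only when the comparison succeeds */
    }
    return *local;
}
```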
/**
 * Allocate a descriptor.
 *
 * @param btl (IN)       BTL module
 * @param endpoint (IN)  BTL addressing information
 * @param order (IN)     Ordering
 * @param size (IN)      Requested descriptor size.
 * @param flags (IN)     Descriptor flags
 */
extern mca_btl_base_descriptor_t* mca_btl_openib_alloc(
        struct mca_btl_base_module_t* btl,
        struct mca_btl_base_endpoint_t* endpoint,
        uint8_t order,
        size_t size,
        uint32_t flags);

/**
 * Return a descriptor allocated by this BTL.
 *
 * @param btl (IN)  BTL module
 * @param des (IN)  Allocated descriptor.
 */
extern int mca_btl_openib_free(
        struct mca_btl_base_module_t* btl,
        mca_btl_base_descriptor_t* des);

/**
 * Pack data and return a descriptor that can be
 * used for send/put.
 *
 * @param btl (IN)      BTL module
 * @param peer (IN)     BTL peer addressing
 */
mca_btl_base_descriptor_t* mca_btl_openib_prepare_src(
        struct mca_btl_base_module_t* btl,
        struct mca_btl_base_endpoint_t* peer,
        struct opal_convertor_t* convertor,
        uint8_t order,
        size_t reserve,
        size_t* size,
        uint32_t flags
        );

extern void mca_btl_openib_frag_progress_pending_put_get(
        struct mca_btl_base_endpoint_t*, const int);

/**
 * Fault Tolerance Event Notification Function
 *
 * @param state (IN)  Checkpoint State
 * @return OPAL_SUCCESS or failure status
 */
extern int mca_btl_openib_ft_event(int state);

/**
 * Show an error during init, particularly when running out of
 * registered memory.
 */
void mca_btl_openib_show_init_error(const char *file, int line,
                                    const char *func, const char *dev);

/*
 * This commit brings in two major things:
 *
 * 1. Galen's fine-grain control of queue pair resources in the openib
 *    BTL.
 * 2. Pasha's new implementation of asynchronous HCA event handling.
 *
 * Pasha's new implementation doesn't take much explanation, but the new
 * "multifrag" stuff does.
 *
 * Note that "svn merge" was not used to bring this new code from the
 * /tmp/ib_multifrag branch -- something Bad happened in the periodic
 * trunk pulls on that branch making an actual merge back to the trunk
 * effectively impossible (i.e., lots and lots of arbitrary conflicts and
 * artificial changes). :-(
 *
 * == Fine-grain control of queue pair resources ==
 *
 * This adds Galen's fine-grain control of queue pair resources to the
 * OpenIB BTL (thanks to Gleb for fixing broken code and providing
 * additional functionality, Pasha for finding broken code, and Jeff for
 * doing all the svn work and regression testing).
 *
 * Prior to this commit, the OpenIB BTL created two queue pairs: one for
 * eager size fragments and one for max send size fragments. When the
 * use of the shared receive queue (SRQ) was specified (via "-mca
 * btl_openib_use_srq 1"), these QPs would use a shared receive queue for
 * receive buffers instead of the default per-peer (PP) receive queues
 * and buffers. One consequence of this design is that receive buffer
 * utilization (the size of the data received as a percentage of the
 * receive buffer used for the data) was quite poor for a number of
 * applications.
 *
 * The new design allows multiple QPs to be specified at runtime. Each
 * QP can be set up to use PP or SRQ receive buffers, with fine-grained
 * control over receive buffer size, number of receive buffers to post,
 * and when to replenish the receive queue (low water mark). For SRQ
 * QPs, the number of outstanding sends can also be specified. The
 * following is an example of the syntax to describe QPs to the OpenIB
 * BTL using the new MCA parameter btl_openib_receive_queues:
 *
 * {{{
 * -mca btl_openib_receive_queues \
 *      "P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
 * }}}
 *
 * Each QP description is delimited by ";" (semicolon) with individual
 * fields of the QP description delimited by "," (comma). The above
 * example therefore describes 4 QPs.
 *
 * The first QP is:
 *
 *     P,128,16,4
 *
 * Meaning: per-peer receive buffer QPs are indicated by a starting field
 * of "P"; the first QP (shown above) is therefore a per-peer based QP.
 * The second field indicates the size of the receive buffer in bytes
 * (128 bytes). The third field indicates the number of receive buffers
 * to allocate to the QP (16). The fourth field indicates the low
 * watermark for receive buffers at which time the BTL will repost
 * receive buffers to the QP (4).
 *
 * The second QP is:
 *
 *     S,1024,256,128,32
 *
 * Shared receive queue based QPs are indicated by a starting field of
 * "S"; the second QP (shown above) is therefore a shared receive queue
 * based QP. The second, third and fourth fields are the same as in the
 * per-peer based QP. The fifth field is the number of outstanding sends
 * that are allowed at a given time on the QP (32). This provides a
 * "good enough" mechanism of flow control for some regular communication
 * patterns.
 *
 * QPs MUST be specified in ascending receive buffer size order. This
 * requirement may be removed prior to the 1.3 release.
 *
 * This commit was SVN r15474.
 */

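/*
 * Illustrative sketch only, not the parser used by the openib BTL. It
 * reads a btl_openib_receive_queues string following just the field
 * layout described in the commit message above ("P,size,num,low_mark" and
 * "S,size,num,low_mark,max_sends", separated by ";"). The struct qp_desc
 * type, the parse_receive_queues function and MAX_QPS are hypothetical
 * names introduced for this example. Treating "ascending order" as "sizes
 * must not decrease" is an assumption, as is tolerating a trailing ";".
 */

```c
#include <assert.h>
#include <stdio.h>

#define MAX_QPS 8

/* Hypothetical record for one parsed QP description. */
struct qp_desc {
    char type;           /* 'P' = per-peer, 'S' = shared receive queue */
    unsigned size;       /* receive buffer size in bytes */
    unsigned num;        /* number of receive buffers */
    unsigned low_mark;   /* low watermark that triggers reposting */
    unsigned max_sends;  /* outstanding sends (S entries only, else 0) */
};

/* Returns the number of QPs parsed into out[], or -1 on malformed input,
 * too many entries, or a receive buffer size smaller than the previous one. */
static int parse_receive_queues(const char *spec, struct qp_desc *out, int max)
{
    const char *p = spec;
    int count = 0;

    assert(spec != NULL && out != NULL);

    while (*p != '\0') {
        struct qp_desc d = {0, 0, 0, 0, 0};
        int used = 0;

        if (count == max) return -1;

        if (*p == 'P') {
            if (sscanf(p, "P,%u,%u,%u%n", &d.size, &d.num, &d.low_mark, &used) != 3)
                return -1;
        } else if (*p == 'S') {
            if (sscanf(p, "S,%u,%u,%u,%u%n", &d.size, &d.num, &d.low_mark,
                       &d.max_sends, &used) != 4)
                return -1;
        } else {
            return -1;                      /* unknown QP type */
        }
        d.type = *p;
        p += used;

        if (*p == ';') {
            p++;                            /* step over the separator */
        } else if (*p != '\0') {
            return -1;                      /* unexpected extra field or junk */
        }

        if (count > 0 && d.size < out[count - 1].size)
            return -1;                      /* sizes must not decrease */

        out[count++] = d;
    }
    return count;
}
```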
#define BTL_OPENIB_HP_CQ 0
#define BTL_OPENIB_LP_CQ 1

/**
 * Post receive buffers to the shared receive queue of a queue pair
 *
 * @param openib_btl (IN)  BTL module
 * @param qp (IN)          Index of the queue pair to post to
 * @return OPAL_SUCCESS or failure status
 */
int mca_btl_openib_post_srr(mca_btl_openib_module_t* openib_btl, const int qp);

/**
 * Get the transport name of a BTL from its transport type.
 */
const char* btl_openib_get_transport_name(mca_btl_openib_transport_type_t transport_type);

/**
 * Get an endpoint for a process
 *
 * @param btl (IN)   BTL module
 * @param proc (IN)  opal process object
 *
 * This function returns the existing endpoint if one exists; otherwise it
 * allocates a new endpoint and returns it.
 */
struct mca_btl_base_endpoint_t *mca_btl_openib_get_ep (struct mca_btl_base_module_t *btl,
                                                       struct opal_proc_t *proc);

/**
 * Get the transport type of a BTL.
 */
mca_btl_openib_transport_type_t mca_btl_openib_get_transport_type(mca_btl_openib_module_t* openib_btl);

static inline int qp_cq_prio(const int qp)
{
    if(0 == qp)
        return BTL_OPENIB_HP_CQ; /* smallest qp is always HP */

    /* If the size for this qp is <= the eager limit, make it a
       high priority QP.  Otherwise, make it a low priority QP. */
    return (mca_btl_openib_component.qp_infos[qp].size <=
            mca_btl_openib_component.eager_limit) ?
        BTL_OPENIB_HP_CQ : BTL_OPENIB_LP_CQ;
}

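/*
 * Illustrative sketch only: a stand-alone model of the rule implemented by
 * qp_cq_prio, so the mapping can be tried without the component globals.
 * The sizes[] array stands in for qp_infos[].size and eager_limit for the
 * component's eager limit. The name model_qp_cq_prio and any sizes or
 * limits passed to it are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>

enum { HP_CQ = 0, LP_CQ = 1 };  /* mirror BTL_OPENIB_HP_CQ / BTL_OPENIB_LP_CQ */

/* QP 0 always maps to the high priority CQ; any other QP maps to the high
 * priority CQ only when its receive buffer size fits within the eager limit. */
static int model_qp_cq_prio(const size_t *sizes, size_t eager_limit, int qp)
{
    assert(sizes != NULL && qp >= 0);

    if (0 == qp)
        return HP_CQ;

    return (sizes[qp] <= eager_limit) ? HP_CQ : LP_CQ;
}
```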
#define BTL_OPENIB_RDMA_QP(QP) \
    ((QP) == mca_btl_openib_component.rdma_qp)

/**
 * Run function as part of opal_progress()
 *
 * @param[in] fn     function to run
 * @param[in] arg    function data
 */
int mca_btl_openib_run_in_main (void *(*fn)(void *), void *arg);

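/*
 * Illustrative sketch only: one possible way to defer a function so that
 * it runs from a progress loop, which is the behavior the comment on
 * mca_btl_openib_run_in_main describes. How the openib BTL actually
 * implements this is not shown in this header, so the fixed-size queue and
 * every name below (model_run_in_main, model_progress, model_increment,
 * MAX_PENDING) are assumptions. The model is single-threaded, and a
 * callback must not queue further work while model_progress is running.
 */

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16

struct pending_call {
    void *(*fn)(void *);
    void *arg;
};

static struct pending_call pending[MAX_PENDING];
static int pending_count = 0;

/* Queue fn(arg) for later; returns 0 on success, -1 if the queue is full. */
static int model_run_in_main(void *(*fn)(void *), void *arg)
{
    assert(fn != NULL);

    if (pending_count == MAX_PENDING)
        return -1;

    pending[pending_count].fn = fn;
    pending[pending_count].arg = arg;
    pending_count++;
    return 0;
}

/* Stand-in for the progress loop: runs everything queued so far, empties
 * the queue, and returns the number of calls made. */
static int model_progress(void)
{
    int i, ran = pending_count;

    for (i = 0; i < ran; i++)
        pending[i].fn(pending[i].arg);

    pending_count = 0;
    return ran;
}

/* Sample callback: increments the int that arg points to. */
static void *model_increment(void *arg)
{
    ++*(int *)arg;
    return NULL;
}
```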
END_C_DECLS

#endif /* MCA_BTL_IB_H */