2007-12-21 09:02:00 +03:00
/* -*- Mode: C; c-basic-offset:4 ; -*- */
2005-07-01 01:28:35 +04:00
/*
2007-03-17 02:11:45 +03:00
* Copyright ( c ) 2004 - 2007 The Trustees of Indiana University and Indiana
2005-11-05 22:57:48 +03:00
* University Research and Technology
* Corporation . All rights reserved .
2008-02-18 20:39:30 +03:00
* Copyright ( c ) 2004 - 2008 The University of Tennessee and The University
2005-11-05 22:57:48 +03:00
* of Tennessee Research Foundation . All rights
* reserved .
2008-01-21 15:11:18 +03:00
* Copyright ( c ) 2004 - 2005 High Performance Computing Center Stuttgart ,
2005-07-01 01:28:35 +04:00
* University of Stuttgart . All rights reserved .
* Copyright ( c ) 2004 - 2005 The Regents of the University of California .
* All rights reserved .
2008-03-29 15:54:24 +03:00
* Copyright ( c ) 2006 - 2008 Cisco Systems , Inc . All rights reserved .
2008-06-05 16:20:13 +04:00
* Copyright ( c ) 2006 - 2008 Mellanox Technologies . All rights reserved .
2007-07-25 19:03:34 +04:00
* Copyright ( c ) 2006 - 2007 Los Alamos National Security , LLC . All rights
2008-01-21 15:11:18 +03:00
* reserved .
2007-09-24 14:11:52 +04:00
* Copyright ( c ) 2006 - 2007 Voltaire All rights reserved .
2005-07-01 01:28:35 +04:00
* $ COPYRIGHT $
2008-01-21 15:11:18 +03:00
*
2005-07-01 01:28:35 +04:00
* Additional copyrights may follow
2008-01-21 15:11:18 +03:00
*
2005-07-01 01:28:35 +04:00
* $ HEADER $
*/
# include "ompi_config.h"
2007-08-07 03:40:35 +04:00
2008-01-21 15:11:18 +03:00
# include <infiniband/verbs.h>
2008-05-29 02:05:47 +04:00
# include <infiniband/driver.h>
2008-01-21 15:11:18 +03:00
# include <errno.h>
# include <string.h> /* for strerror()*/
2008-02-14 17:10:27 +03:00
# include <unistd.h>
2008-05-29 18:19:51 +04:00
# include <sys/types.h>
# include <sys/stat.h>
2008-07-25 02:51:26 +04:00
# include <malloc.h>
2007-08-07 03:40:35 +04:00
2006-02-12 04:33:29 +03:00
# include "ompi/constants.h"
2005-07-04 03:09:55 +04:00
# include "opal/event/event.h"
2006-12-19 11:34:48 +03:00
# include "opal/include/opal/align.h"
2005-07-04 05:36:20 +04:00
# include "opal/util/if.h"
2005-07-04 04:13:44 +04:00
# include "opal/util/argv.h"
2006-02-12 04:33:29 +03:00
# include "opal/sys/timer.h"
2006-09-19 17:27:05 +04:00
# include "opal/sys/atomic.h"
2007-06-14 05:59:25 +04:00
# include "opal/util/argv.h"
2008-07-25 02:51:26 +04:00
# include "opal/memoryhooks/memory.h"
2006-02-12 04:33:29 +03:00
# include "opal/mca/base/mca_base_param.h"
2008-02-11 13:34:11 +03:00
# include "opal/mca/carto/carto.h"
# include "opal/mca/carto/base/base.h"
# include "opal/mca/paffinity/base/base.h"
2008-05-21 01:53:42 +04:00
# include "opal/mca/installdirs/installdirs.h"
2008-06-04 18:39:37 +04:00
# include "opal_stdint.h"
2007-08-07 03:40:35 +04:00
2008-07-25 02:51:26 +04:00
# include "orte/util/show_help.h"
2006-02-12 04:33:29 +03:00
# include "orte/mca/errmgr/errmgr.h"
2008-03-24 02:10:15 +03:00
# include "orte/util/proc_info.h"
2008-08-06 18:22:03 +04:00
# include "orte/util/name_fns.h"
2008-02-28 04:57:57 +03:00
# include "orte/runtime/orte_globals.h"
2007-08-07 03:40:35 +04:00
# include "ompi/proc/proc.h"
# include "ompi/mca/pml/pml.h"
# include "ompi/mca/btl/btl.h"
2008-01-21 15:11:18 +03:00
# include "ompi/mca/mpool/base/base.h"
2006-12-17 15:26:41 +03:00
# include "ompi/mca/mpool/rdma/mpool_rdma.h"
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
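The INI lookup and MTU negotiation summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from btl_openib_ini.l or the component itself: the `hca_params_t` type, the `cached_params` table, the functions `lookup_hca_mtu` and `negotiate_mtu`, and the vendor/part IDs are all hypothetical names and made-up values chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached INI entry, keyed by vendor ID and vendor part ID. */
typedef struct {
    uint32_t vendor_id;
    uint32_t vendor_part_id;
    uint32_t mtu;               /* MTU in bytes */
} hca_params_t;

/* Made-up sample values standing in for the parsed INI file contents. */
static const hca_params_t cached_params[] = {
    { 0x1111u, 100u, 1024u },
    { 0x2222u, 200u, 2048u },
};

/*
 * Look up the MTU for an HCA.  Returns 0 and fills *mtu when an entry
 * matches; returns -1 when nothing is found.  Per the commit message, a
 * miss is not fatal: the caller would only print a warning.
 */
static int lookup_hca_mtu(uint32_t vendor_id, uint32_t vendor_part_id,
                          uint32_t *mtu)
{
    size_t i;

    assert(mtu != NULL);
    for (i = 0; i < sizeof(cached_params) / sizeof(cached_params[0]); ++i) {
        if (cached_params[i].vendor_id == vendor_id &&
            cached_params[i].vendor_part_id == vendor_part_id) {
            *mtu = cached_params[i].mtu;
            return 0;
        }
    }
    return -1;
}

/* Peers exchange their maximum MTU and both select the lesser value. */
static uint32_t negotiate_mtu(uint32_t local_max, uint32_t remote_max)
{
    return (local_max < remote_max) ? local_max : remote_max;
}
```

With the sample table, `lookup_hca_mtu(0x2222u, 200u, &mtu)` succeeds and stores 2048, and two peers advertising 1024 and 2048 would both settle on 1024.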
2006-08-14 23:30:37 +04:00
# include "ompi/mca/btl/base/base.h"
2008-01-21 15:11:18 +03:00
# include "ompi/datatype/convertor.h"
# include "ompi/mca/mpool/mpool.h"
2007-08-07 03:40:35 +04:00
# include "ompi/runtime/ompi_module_exchange.h"
2008-08-06 18:22:03 +04:00
# include "ompi/runtime/mpiruntime.h"
2007-08-07 03:40:35 +04:00
2005-07-01 01:28:35 +04:00
# include "btl_openib.h"
# include "btl_openib_frag.h"
2008-01-21 15:11:18 +03:00
# include "btl_openib_endpoint.h"
2006-03-26 12:30:50 +04:00
# include "btl_openib_eager_rdma.h"
2006-06-06 00:02:41 +04:00
# include "btl_openib_proc.h"
2006-08-14 23:30:37 +04:00
# include "btl_openib_ini.h"
# include "btl_openib_mca.h"
2007-11-28 10:18:59 +03:00
# include "btl_openib_xrc.h"
2008-05-02 15:52:33 +04:00
# include "btl_openib_fd.h"
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
2. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers to
post, and when to replenish the receive queue (low watermark). For
SRQ QPs, the number of outstanding sends can also be specified. The
following is an example of the syntax to describe QPs to the OpenIB
BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
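The field layout described above can be illustrated with a small parser. This is only a sketch under stated assumptions, not the BTL's actual parsing code: `qp_spec_t`, `MAX_QPS`, and `parse_receive_queues` are hypothetical names, and treating equal buffer sizes as acceptable "ascending" order is an assumption.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_QPS 8   /* hypothetical upper bound for this sketch */

/* One parsed QP description (hypothetical type). */
typedef struct {
    char     type;       /* 'P' = per-peer, 'S' = shared receive queue */
    unsigned size;       /* receive buffer size in bytes */
    unsigned num;        /* number of receive buffers */
    unsigned low_mark;   /* low watermark that triggers reposting */
    unsigned max_sends;  /* 'S' only: outstanding sends allowed */
} qp_spec_t;

/*
 * Parse a string such as "P,128,16,4;S,1024,256,128,32" into out[].
 * Returns the number of QPs parsed, or -1 on a malformed description,
 * too many QPs, or buffer sizes that are not in ascending order.
 */
static int parse_receive_queues(const char *spec, qp_spec_t *out, int max_out)
{
    const char *p = spec;
    int n = 0;

    assert(spec != NULL && out != NULL);
    while (*p != '\0') {
        qp_spec_t q = { 0, 0, 0, 0, 0 };
        int used = 0;

        /* The first four fields are common to both QP types. */
        if (n >= max_out ||
            sscanf(p, "%c,%u,%u,%u%n",
                   &q.type, &q.size, &q.num, &q.low_mark, &used) != 4) {
            return -1;
        }
        p += used;

        if (q.type == 'S') {
            /* SRQ QPs carry a fifth field: outstanding sends. */
            used = 0;
            if (sscanf(p, ",%u%n", &q.max_sends, &used) != 1) {
                return -1;
            }
            p += used;
        } else if (q.type != 'P') {
            return -1;
        }

        /* Enforce ascending receive buffer size order. */
        if (n > 0 && q.size < out[n - 1].size) {
            return -1;
        }
        out[n++] = q;

        if (*p == ';') {
            ++p;            /* move on to the next QP description */
        } else if (*p != '\0') {
            return -1;      /* unexpected trailing characters */
        }
    }
    return n;
}
```

Applied to the four-QP example string above, this returns 4, with the first entry a per-peer QP (128-byte buffers, 16 buffers, low watermark 4) and the remaining three SRQ QPs each allowing 32 outstanding sends.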
2007-07-18 05:15:59 +04:00
# if OMPI_HAVE_THREADS
# include "btl_openib_async.h"
# endif
2007-08-07 03:40:35 +04:00
# include "connect/base.h"
2008-05-08 18:38:14 +04:00
# include "btl_openib_iwarp.h"
2008-06-05 16:20:13 +04:00
# include "ompi/runtime/params.h"
2005-07-12 17:38:54 +04:00
2006-08-14 23:30:37 +04:00
/*
* Local functions
*/
static int btl_openib_component_open ( void ) ;
static int btl_openib_component_close ( void ) ;
2008-01-09 13:27:15 +03:00
static mca_btl_base_module_t * * btl_openib_component_init ( int * , bool , bool ) ;
2006-08-14 23:30:37 +04:00
static int btl_openib_component_progress ( void ) ;
2008-05-21 01:53:42 +04:00
/*
* Local variables
*/
2008-07-23 04:28:59 +04:00
static mca_btl_openib_device_t * receive_queues_device = NULL ;
2008-05-21 01:53:42 +04:00
2005-07-01 01:28:35 +04:00
mca_btl_openib_component_t mca_btl_openib_component = {
{
/* First, the mca_base_component_t struct containing meta information
about the component itself */
{
2008-07-29 02:40:57 +04:00
MCA_BTL_BASE_VERSION_2_0_0 ,
2005-07-01 01:28:35 +04:00
2005-07-12 23:02:39 +04:00
"openib" , /* MCA component name */
Major simplifications to component versioning:
- After long discussions and ruminations on how we run components in
LAM/MPI, made the decision that, by default, all components included
in Open MPI will use the version number of their parent project
(i.e., OMPI or ORTE). They are certainly free to use a different
number, but this simplification makes the common cases easy:
- components are only released when the parent project is released
- it is easy (trivial?) to distinguish which version of a component
goes with which version of the parent project
- removed all autogen/configure code for templating the version .h
file in components
- made all ORTE components use ORTE_*_VERSION for version numbers
- made all OMPI components use OMPI_*_VERSION for version numbers
- removed all VERSION files from components
- configure now displays OPAL, ORTE, and OMPI version numbers
- ditto for ompi_info
- right now, faking it -- OPAL and ORTE and OMPI will always have the
same version number (i.e., they all come from the same top-level
VERSION file). But this paves the way for the Great Configure
Reorganization, where, among other things, each project will have
its own version number.
So all in all, we went from a boatload of version numbers to
[effectively] three. That's pretty good. :-)
This commit was SVN r6344.
2005-07-05 00:12:36 +04:00
OMPI_MAJOR_VERSION , /* MCA component major version */
OMPI_MINOR_VERSION , /* MCA component minor version */
OMPI_RELEASE_VERSION , /* MCA component release version */
2006-08-14 23:30:37 +04:00
btl_openib_component_open , /* component open */
btl_openib_component_close /* component close */
2005-07-01 01:28:35 +04:00
} ,
{
2007-03-17 02:11:45 +03:00
/* The component is not checkpoint ready */
MCA_BASE_METADATA_PARAM_NONE
2005-07-01 01:28:35 +04:00
} ,
2008-01-21 15:11:18 +03:00
btl_openib_component_init ,
2006-08-14 23:30:37 +04:00
btl_openib_component_progress ,
2005-07-01 01:28:35 +04:00
}
} ;
/*
* Called by the MCA framework to open the component; registers
* component parameters.
*/
2006-08-14 23:30:37 +04:00
int btl_openib_component_open ( void )
2005-07-01 01:28:35 +04:00
{
2006-08-14 23:30:37 +04:00
int ret ;
2006-06-09 22:02:45 +04:00
2005-07-01 01:28:35 +04:00
/* initialize state */
2006-08-14 23:30:37 +04:00
mca_btl_openib_component . ib_num_btls = 0 ;
mca_btl_openib_component . openib_btls = NULL ;
2008-07-23 04:28:59 +04:00
OBJ_CONSTRUCT ( & mca_btl_openib_component . devices , opal_pointer_array_t ) ;
mca_btl_openib_component . devices_count = 0 ;
2008-07-10 19:08:49 +04:00
mca_btl_openib_component . cpc_explicitly_defined = false ;
2007-08-20 16:28:25 +04:00
2008-01-21 15:11:18 +03:00
/* initialize objects */
2005-07-03 20:22:16 +04:00
OBJ_CONSTRUCT ( & mca_btl_openib_component . ib_procs , opal_list_t ) ;
2005-07-01 01:28:35 +04:00
/* register IB component parameters */
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
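Two of the behaviors listed above lend themselves to a short illustration: walking a colon-delimited file list left to right, and choosing the lesser of two exchanged MTU values. The following is a minimal sketch only. The names `split_param_files` and `negotiate_mtu` are hypothetical and are not taken from the openib BTL sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split a colon-delimited list such as
 * "/etc/a.ini:/opt/b.ini" into heap-allocated copies, preserving
 * left-to-right order.  Empty entries ("a::b") are skipped.
 * Returns the number of entries stored in files[]; the caller
 * frees each entry. */
static int split_param_files(const char *list, char **files, int max_files)
{
    int count = 0;
    const char *start = list;

    assert(list != NULL && files != NULL);
    while (*start != '\0' && count < max_files) {
        const char *end = strchr(start, ':');
        size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);

        if (len > 0) {
            char *copy = malloc(len + 1);
            if (copy == NULL) {
                break;              /* out of memory: return what we have */
            }
            memcpy(copy, start, len);
            copy[len] = '\0';
            files[count++] = copy;
        }
        if (end == NULL) {
            break;                  /* last entry consumed */
        }
        start = end + 1;            /* step over the ':' */
    }
    return count;
}

/* Hypothetical helper: each side advertises its maximum MTU and the
 * connection uses the lesser of the two. */
static uint32_t negotiate_mtu(uint32_t local_max, uint32_t remote_max)
{
    return (local_max < remote_max) ? local_max : remote_max;
}
```

Because entries are returned in the order they were written, a caller can read the files in that order; which file wins on a conflicting value is a policy decision that this sketch leaves open.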
2006-08-14 23:30:37 +04:00
ret = btl_openib_register_mca_params ( ) ;
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
2. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
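The queue description syntax above can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not the parser used by the openib BTL: the type `qp_spec_t` and the function `parse_receive_queues` are hypothetical names, and the sketch assumes that equal buffer sizes satisfy the ascending-order rule.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed queue pair description. */
typedef struct {
    char type;          /* 'P' = per-peer, 'S' = shared receive queue */
    long buf_size;      /* receive buffer size in bytes */
    long num_bufs;      /* number of receive buffers to allocate */
    long low_watermark; /* repost receive buffers at this level */
    long max_sends;     /* outstanding sends allowed ('S' only, else 0) */
} qp_spec_t;

/* Parse a specification such as "P,128,16,4;S,1024,256,128,32".
 * Returns the number of QPs parsed, or -1 on a syntax error, on
 * overflow of out[], or if buffer sizes are not in ascending order. */
static int parse_receive_queues(const char *spec, qp_spec_t *out, int max_out)
{
    int count = 0;
    const char *p = spec;

    assert(spec != NULL && out != NULL);
    while (*p != '\0') {
        qp_spec_t qp;
        int consumed = 0;

        if (count >= max_out) {
            return -1;
        }
        memset(&qp, 0, sizeof(qp));
        if (*p == 'P') {
            /* Per-peer: type plus three numeric fields */
            if (sscanf(p, "P,%ld,%ld,%ld%n", &qp.buf_size, &qp.num_bufs,
                       &qp.low_watermark, &consumed) != 3) {
                return -1;
            }
            qp.type = 'P';
        } else if (*p == 'S') {
            /* Shared receive queue: a fifth field limits outstanding sends */
            if (sscanf(p, "S,%ld,%ld,%ld,%ld%n", &qp.buf_size, &qp.num_bufs,
                       &qp.low_watermark, &qp.max_sends, &consumed) != 4) {
                return -1;
            }
            qp.type = 'S';
        } else {
            return -1;
        }
        p += consumed;
        if (*p == ';') {
            ++p;
            if (*p == '\0') {
                return -1;          /* trailing ';' with nothing after it */
            }
        } else if (*p != '\0') {
            return -1;              /* unexpected extra field or character */
        }
        /* Enforce ascending receive buffer size order */
        if (count > 0 && qp.buf_size < out[count - 1].buf_size) {
            return -1;
        }
        out[count++] = qp;
    }
    return count;
}
```

Applied to the example string from the commit message, the sketch yields four entries: one per-peer QP with 128-byte buffers followed by three SRQ QPs with 1024, 4096, and 65536-byte buffers.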
2007-07-18 05:15:59 +04:00
mca_btl_openib_component . max_send_size =
mca_btl_openib_module . super . btl_max_send_size ;
mca_btl_openib_component . eager_limit =
mca_btl_openib_module . super . btl_eager_limit ;
2006-08-14 23:30:37 +04:00
2007-07-18 05:15:59 +04:00
srand48 ( getpid ( ) * time ( NULL ) ) ;
2006-08-14 23:30:37 +04:00
return ret ;
2005-07-01 01:28:35 +04:00
}
/*
* component cleanup - sanity checking of queue lengths
*/
2006-08-14 23:30:37 +04:00
static int btl_openib_component_close ( void )
2005-07-01 01:28:35 +04:00
{
2008-08-07 17:48:38 +04:00
int rc = OMPI_SUCCESS ;
# if OMPI_HAVE_THREADS
/* Tell the async thread to shutdown */
if ( mca_btl_openib_component . use_async_event_thread & &
0 ! = mca_btl_openib_component . async_thread ) {
int async_command = 0 ;
if ( write ( mca_btl_openib_component . async_pipe [ 1 ] , & async_command ,
sizeof ( int ) ) < 0 ) {
BTL_ERROR ( ( " Failed to communicate with async event thread " ) ) ;
rc = OMPI_ERROR ;
} else {
if ( pthread_join ( mca_btl_openib_component . async_thread , NULL ) ) {
BTL_ERROR ( ( " Failed to stop OpenIB async event thread " ) ) ;
rc = OMPI_ERROR ;
}
}
close ( mca_btl_openib_component . async_pipe [ 0 ] ) ;
close ( mca_btl_openib_component . async_pipe [ 1 ] ) ;
close ( mca_btl_openib_component . async_comp_pipe [ 0 ] ) ;
close ( mca_btl_openib_component . async_comp_pipe [ 1 ] ) ;
}
# endif
2008-05-02 15:52:33 +04:00
ompi_btl_openib_connect_base_finalize ( ) ;
ompi_btl_openib_fd_finalize ( ) ;
2006-08-14 23:30:37 +04:00
ompi_btl_openib_ini_finalize ( ) ;
2008-05-21 01:53:42 +04:00
if ( NULL ! = mca_btl_openib_component . receive_queues ) {
free ( mca_btl_openib_component . receive_queues ) ;
}
2008-08-07 17:48:38 +04:00
return rc ;
2005-07-01 01:28:35 +04:00
}
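The shutdown sequence in the function above (write a command of 0 into a pipe, join the thread, close the descriptors) can be sketched as a standalone program fragment. This is an illustration under assumptions: the reader loop in `async_thread_main` is a guess at how an event thread might react to the command and is not the BTL's actual async thread; `async_ctx_t` and `stop_async_thread_demo` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical state shared between the main thread and the helper. */
typedef struct {
    int fds[2];         /* fds[0] = read end, fds[1] = write end */
    int commands_seen;  /* number of commands the helper has read */
} async_ctx_t;

/* Helper thread: block on the pipe and exit when command 0 arrives
 * (or when the write end is closed and read() reports end of file). */
static void *async_thread_main(void *arg)
{
    async_ctx_t *ctx = arg;
    int command;

    while (read(ctx->fds[0], &command, sizeof(command)) ==
           (ssize_t) sizeof(command)) {
        ctx->commands_seen++;
        if (0 == command) {
            break;      /* 0 means "shut down" in this sketch */
        }
    }
    return NULL;
}

/* Start the helper, ask it to stop, and wait for it.  Returns the
 * number of commands the helper processed, or -1 on any failure. */
static int stop_async_thread_demo(void)
{
    async_ctx_t ctx;
    pthread_t thread;
    int stop_command = 0;
    int rc = 0;

    ctx.commands_seen = 0;
    if (pipe(ctx.fds) != 0) {
        return -1;
    }
    if (pthread_create(&thread, NULL, async_thread_main, &ctx) != 0) {
        close(ctx.fds[0]);
        close(ctx.fds[1]);
        return -1;
    }
    if (write(ctx.fds[1], &stop_command, sizeof(stop_command)) !=
        (ssize_t) sizeof(stop_command)) {
        rc = -1;
    }
    /* Closing the write end also unblocks the reader if the write failed;
     * data already in the pipe stays readable. */
    close(ctx.fds[1]);
    if (pthread_join(thread, NULL) != 0) {
        rc = -1;
    }
    close(ctx.fds[0]);
    return (0 == rc) ? ctx.commands_seen : -1;
}
```

Unlike the function above, this sketch closes the write end before joining, so the helper cannot stay blocked in `read()` if the write fails. That is a design choice of the sketch, not a description of the original code.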
2008-05-29 02:05:47 +04:00
static bool check_basics ( void )
{
2008-05-29 18:19:51 +04:00
int rc ;
2008-05-29 02:05:47 +04:00
char * file ;
2008-05-29 18:19:51 +04:00
struct stat s ;
2008-05-29 02:05:47 +04:00
2008-05-29 18:19:51 +04:00
/* Check to see if $sysfsdir/class/infiniband/ exists */
2008-05-29 02:05:47 +04:00
/* asprintf() may leave the pointer undefined on failure , so check its return value */
if ( asprintf ( & file , " %s/class/infiniband " , ibv_get_sysfs_path ( ) ) < 0 ) {
return false ;
}
2008-05-29 18:19:51 +04:00
rc = stat ( file , & s ) ;
2008-05-29 02:05:47 +04:00
free ( file ) ;
2008-05-29 18:19:51 +04:00
if ( 0 ! = rc | | ! S_ISDIR ( s . st_mode ) ) {
2008-05-29 02:05:47 +04:00
return false ;
}
2008-05-29 18:19:51 +04:00
/* It exists and is a directory -- good enough */
return true ;
2008-05-29 02:05:47 +04:00
}
2008-05-02 15:52:33 +04:00
static inline void pack8 ( char * * dest , uint8_t value )
{
/* Copy one character */
* * dest = ( char ) value ;
/* Move the dest pointer ahead by one byte */
+ + * dest ;
}
2005-10-01 02:58:09 +04:00
/*
2008-05-02 15:52:33 +04:00
* Register local openib port information with the modex so that it
* can be shared with all other peers .
2005-10-01 02:58:09 +04:00
*/
2006-08-14 23:30:37 +04:00
static int btl_openib_modex_send ( void )
2005-10-01 02:58:09 +04:00
{
2008-05-02 15:52:33 +04:00
int rc , i , j ;
int modex_message_size ;
mca_btl_openib_modex_message_t dummy ;
2008-01-15 02:22:03 +03:00
char * message , * offset ;
2008-05-02 15:52:33 +04:00
size_t size , msg_size ;
ompi_btl_openib_connect_base_module_t * cpc ;
2008-01-15 02:22:03 +03:00
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " Starting to modex send " ) ;
2008-05-02 15:52:33 +04:00
if ( 0 = = mca_btl_openib_component . ib_num_btls ) {
2008-01-15 02:22:03 +03:00
return 0 ;
}
2008-05-02 15:52:33 +04:00
modex_message_size = ( ( char * ) & ( dummy . end ) ) - ( ( char * ) & dummy ) ;
/* The message is packed into multiple parts:
* 1. a uint8_t indicating the number of modules ( ports ) in the message
* 2. for each module :
* a . the common module data
* b . a uint8_t indicating how many CPCs follow
* c . for each CPC :
* a . a uint8_t indicating the index of the CPC in the all [ ]
* array in btl_openib_connect_base . c
* b . a uint8_t indicating the priority of this CPC
* c . a uint8_t indicating the length of the blob to follow
* d . a blob that is only meaningful to that CPC
*/
msg_size =
/* uint8_t for number of modules in the message */
1 +
/* For each module: */
mca_btl_openib_component . ib_num_btls *
(
/* Common module data */
modex_message_size +
/* uint8_t for how many CPCs follow */
1
) ;
/* For each module, add in the size of the per-CPC data */
for ( i = 0 ; i < mca_btl_openib_component . ib_num_btls ; i + + ) {
for ( j = 0 ;
j < mca_btl_openib_component . openib_btls [ i ] - > num_cpcs ;
+ + j ) {
msg_size + =
/* uint8_t for the index of the CPC */
1 +
/* uint8_t for the CPC's priority */
1 +
/* uint8_t for the blob length */
1 +
/* blob length */
mca_btl_openib_component . openib_btls [ i ] - > cpcs [ j ] - > data . cbm_modex_message_len ;
}
}
2008-01-15 02:22:03 +03:00
message = malloc ( msg_size ) ;
if ( NULL = = message ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc " ) ) ;
2008-01-15 02:22:03 +03:00
return OMPI_ERR_OUT_OF_RESOURCE ;
2008-01-21 15:11:18 +03:00
}
2008-01-15 02:22:03 +03:00
2008-05-02 15:52:33 +04:00
/* Pack the number of modules */
offset = message ;
pack8 ( & offset , mca_btl_openib_component . ib_num_btls ) ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " modex sending %d btls (packed: %d, offset now at %d) " , mca_btl_openib_component . ib_num_btls , * ( ( uint8_t * ) message ) , ( int ) ( offset - message ) ) ;
2008-01-15 02:22:03 +03:00
2008-05-02 15:52:33 +04:00
/* Pack each of the modules */
2008-01-15 02:22:03 +03:00
for ( i = 0 ; i < mca_btl_openib_component . ib_num_btls ; i + + ) {
2008-05-02 15:52:33 +04:00
/* Pack the modex common message struct. */
size = modex_message_size ;
memcpy ( offset ,
& ( mca_btl_openib_component . openib_btls [ i ] - > port_info ) ,
size ) ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " modex packed btl port modex message: 0x% " PRIx64 " , %d, %d (size: %d) " ,
2008-05-02 15:52:33 +04:00
mca_btl_openib_component . openib_btls [ i ] - > port_info . subnet_id ,
mca_btl_openib_component . openib_btls [ i ] - > port_info . mtu ,
mca_btl_openib_component . openib_btls [ i ] - > port_info . lid ,
( int ) size ) ;
2008-01-15 02:22:03 +03:00
# if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
2008-05-02 15:52:33 +04:00
MCA_BTL_OPENIB_MODEX_MSG_HTON ( * ( mca_btl_openib_modex_message_t * ) offset ) ;
2008-01-15 02:22:03 +03:00
# endif
2008-05-02 15:52:33 +04:00
offset + = size ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " modex packed btl %d: modex message, offset now %d " ,
2008-05-02 15:52:33 +04:00
i , ( int ) ( offset - message ) ) ;
/* Pack the number of CPCs that follow */
pack8 ( & offset ,
mca_btl_openib_component . openib_btls [ i ] - > num_cpcs ) ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " modex packed btl %d: to pack %d cpcs (packed: %d, offset now %d) " ,
2008-05-02 15:52:33 +04:00
i , mca_btl_openib_component . openib_btls [ i ] - > num_cpcs ,
* ( ( uint8_t * ) ( offset - 1 ) ) , ( int ) ( offset - message ) ) ;
/* Pack each CPC */
for ( j = 0 ;
j < mca_btl_openib_component . openib_btls [ i ] - > num_cpcs ;
+ + j ) {
uint8_t u8 ;
cpc = mca_btl_openib_component . openib_btls [ i ] - > cpcs [ j ] ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " modex packed btl %d: packing cpc %s " ,
2008-05-02 15:52:33 +04:00
i , cpc - > data . cbm_component - > cbc_name ) ;
/* Pack the CPC index */
u8 = ompi_btl_openib_connect_base_get_cpc_index ( cpc - > data . cbm_component ) ;
pack8 ( & offset , u8 ) ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " packing btl %d: cpc %d: index %d (packed %d, offset now %d) " ,
2008-05-02 15:52:33 +04:00
i , j , u8 , * ( ( uint8_t * ) ( offset - 1 ) ) , ( int ) ( offset - message ) ) ;
/* Pack the CPC priority */
pack8 ( & offset , cpc - > data . cbm_priority ) ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " packing btl %d: cpc %d: priority %d (packed %d, offset now %d) " ,
2008-05-02 15:52:33 +04:00
i , j , cpc - > data . cbm_priority , * ( ( uint8_t * ) ( offset - 1 ) ) , ( int ) ( offset - message ) ) ;
/* Pack the blob length */
u8 = cpc - > data . cbm_modex_message_len ;
pack8 ( & offset , u8 ) ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " packing btl %d: cpc %d: message len %d (packed %d, offset now %d) " ,
2008-05-02 15:52:33 +04:00
i , j , u8 , * ( ( uint8_t * ) ( offset - 1 ) ) , ( int ) ( offset - message ) ) ;
/* If the blob length is > 0, pack the blob */
if ( u8 > 0 ) {
memcpy ( offset , cpc - > data . cbm_modex_message , u8 ) ;
offset + = u8 ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " packing btl %d: cpc %d: blob packed %d %x (offset now %d) " ,
2008-05-02 15:52:33 +04:00
i , j ,
( ( uint32_t * ) cpc - > data . cbm_modex_message ) [ 0 ] ,
( ( uint32_t * ) cpc - > data . cbm_modex_message ) [ 1 ] ,
( int ) ( offset - message ) ) ;
}
2008-01-15 02:22:03 +03:00
2008-05-02 15:52:33 +04:00
/* Sanity check */
assert ( ( size_t ) ( offset - message ) < = msg_size ) ;
}
2005-10-01 02:58:09 +04:00
}
2008-01-15 02:22:03 +03:00
2008-05-02 15:52:33 +04:00
/* All done -- send it! */
2008-01-21 15:11:18 +03:00
rc = ompi_modex_send ( & mca_btl_openib_component . super . btl_version ,
2008-01-15 02:22:03 +03:00
message , msg_size ) ;
2008-06-09 18:53:58 +04:00
opal_output ( - 1 , " Modex sent! %d calculated, %d actual \n " , ( int ) msg_size , ( int ) ( offset - message ) ) ;
2008-01-15 02:22:03 +03:00
/* Free only after the last use of the message pointer above */
free ( message ) ;
2008-01-15 02:22:03 +03:00
2005-10-01 02:58:09 +04:00
return rc ;
}
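The packed layout described in the comment inside the function above (module count, then per module the common data, a CPC count, and per CPC an index, priority, blob length, and blob) can be checked with a round trip. The following is a minimal sketch under assumptions: `module_t`, `cpc_t`, `COMMON_LEN`, and the function names are hypothetical stand-ins, the common data is treated as an opaque byte array, and no byte-order conversion is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COMMON_LEN 4    /* hypothetical size of the common module data */
#define MAX_CPCS   4
#define MAX_BLOB   16

typedef struct {
    uint8_t index, priority, blob_len;
    uint8_t blob[MAX_BLOB];
} cpc_t;

typedef struct {
    uint8_t common[COMMON_LEN];
    uint8_t num_cpcs;
    cpc_t cpcs[MAX_CPCS];
} module_t;

static void pack8(uint8_t **dest, uint8_t value) { **dest = value; ++*dest; }
static uint8_t unpack8(const uint8_t **src) { uint8_t v = **src; ++*src; return v; }

/* Size computation mirrors the packing order: 1 byte for the module
 * count, then per module the common data plus 1 byte for the CPC
 * count, then per CPC 3 header bytes plus the blob. */
static size_t packed_size(const module_t *mods, uint8_t n)
{
    size_t size = 1;
    for (uint8_t i = 0; i < n; ++i) {
        size += COMMON_LEN + 1;
        for (uint8_t j = 0; j < mods[i].num_cpcs; ++j) {
            size += 3 + mods[i].cpcs[j].blob_len;
        }
    }
    return size;
}

/* Pack into buf, which must hold packed_size() bytes; returns bytes written. */
static size_t pack_modules(const module_t *mods, uint8_t n, uint8_t *buf)
{
    uint8_t *p = buf;
    pack8(&p, n);
    for (uint8_t i = 0; i < n; ++i) {
        memcpy(p, mods[i].common, COMMON_LEN);
        p += COMMON_LEN;
        pack8(&p, mods[i].num_cpcs);
        for (uint8_t j = 0; j < mods[i].num_cpcs; ++j) {
            const cpc_t *c = &mods[i].cpcs[j];
            pack8(&p, c->index);
            pack8(&p, c->priority);
            pack8(&p, c->blob_len);
            memcpy(p, c->blob, c->blob_len);
            p += c->blob_len;
        }
    }
    return (size_t)(p - buf);
}

/* Reverse of pack_modules(); returns the number of modules read. */
static uint8_t unpack_modules(const uint8_t *buf, module_t *mods, uint8_t max_mods)
{
    const uint8_t *p = buf;
    uint8_t n = unpack8(&p);
    assert(n <= max_mods);
    for (uint8_t i = 0; i < n; ++i) {
        memcpy(mods[i].common, p, COMMON_LEN);
        p += COMMON_LEN;
        mods[i].num_cpcs = unpack8(&p);
        assert(mods[i].num_cpcs <= MAX_CPCS);
        for (uint8_t j = 0; j < mods[i].num_cpcs; ++j) {
            cpc_t *c = &mods[i].cpcs[j];
            c->index = unpack8(&p);
            c->priority = unpack8(&p);
            c->blob_len = unpack8(&p);
            assert(c->blob_len <= MAX_BLOB);
            memcpy(c->blob, p, c->blob_len);
            p += c->blob_len;
        }
    }
    return n;
}
```

Because every count and length is a single byte, the format needs no alignment or padding, at the cost of limiting modules, CPCs, and blob lengths to 255 each.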
2005-11-10 23:15:02 +03:00
/*
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asychronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artifical changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
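To make the utilization metric concrete, here is a minimal sketch that computes it and compares a single large buffer against a best-fit choice among several buffer sizes. The function names and all sizes are hypothetical and chosen only for illustration; this is not code from the openib BTL.

```c
#include <assert.h>
#include <stddef.h>

// Receive buffer utilization as defined in the text: bytes received,
// expressed as a percentage of the buffer that held them.
static double buffer_utilization(size_t data_len, size_t buf_len)
{
    assert(buf_len > 0 && data_len <= buf_len);
    return 100.0 * (double) data_len / (double) buf_len;
}

// Given buffer sizes sorted in ascending order, return the index of the
// smallest buffer that can hold data_len bytes, or -1 if none can.
static int pick_buffer(const size_t *sizes, int n, size_t data_len)
{
    int i;
    for (i = 0; i < n; i++) {
        if (data_len <= sizes[i]) {
            return i;
        }
    }
    return -1;
}
```

With hypothetical sizes of 128, 1024, 4096 and 65536 bytes, a 256-byte message lands in the 1024-byte buffer (25% utilization) rather than in one large buffer where most of the space would go unused.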
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over receive buffer size, the number of receive buffers to
post, and when to replenish the receive queue (low water mark). For
SRQ QPs, the number of outstanding sends can also be specified. The
following is an example of the syntax to describe QPs to the OpenIB
BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
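As a rough illustration of the receive-queues syntax described above, the following self-contained sketch splits a specification string into QP descriptions and checks the ascending-size rule. It is not the parser used by the openib BTL: the names `qp_spec_t` and `parse_receive_queues` are hypothetical, error handling is reduced to a -1 return, and equal buffer sizes are accepted because the text does not say whether they are allowed.

```c
#include <assert.h>
#include <stdio.h>

// Hypothetical description of one QP, mirroring the fields in the text.
typedef struct {
    char type;          // 'P' (per-peer) or 'S' (shared receive queue)
    long size;          // receive buffer size in bytes
    long num;           // number of receive buffers
    long low_watermark; // repost threshold
    long max_sends;     // outstanding sends, SRQ only (0 for 'P')
} qp_spec_t;

// Parse "P,128,16,4;S,1024,256,128,32;..." into out[0..max-1].
// Returns the number of QPs parsed, or -1 if the string is malformed,
// has too many QPs, or lists buffer sizes in descending order.
static int parse_receive_queues(const char *spec, qp_spec_t *out, int max)
{
    const char *p = spec;
    int n = 0;

    assert(spec != NULL && out != NULL);
    while (*p != '\0') {
        qp_spec_t qp = { 0, 0, 0, 0, 0 };
        int consumed = 0;

        if (n == max) {
            return -1;
        }
        if ('P' == *p) {
            if (3 != sscanf(p, "P,%ld,%ld,%ld%n", &qp.size, &qp.num,
                            &qp.low_watermark, &consumed)) {
                return -1;
            }
        } else if ('S' == *p) {
            if (4 != sscanf(p, "S,%ld,%ld,%ld,%ld%n", &qp.size, &qp.num,
                            &qp.low_watermark, &qp.max_sends, &consumed)) {
                return -1;
            }
        } else {
            return -1;
        }
        qp.type = *p;
        p += consumed;
        if (';' == *p) {
            p++;                      // move on to the next QP description
        } else if ('\0' != *p) {
            return -1;                // trailing junk after the last field
        }
        if (qp.size <= 0 || (n > 0 && qp.size < out[n - 1].size)) {
            return -1;                // sizes must be positive and ascending
        }
        out[n++] = qp;
    }
    return n;
}
```

Applied to the example string above, the sketch yields four entries: one per-peer QP with 128-byte buffers followed by three SRQ QPs.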
2007-07-18 05:15:59 +04:00
* Active Message Callback function on control message .
2005-11-10 23:15:02 +03:00
*/
2007-11-28 10:13:34 +03:00
static void btl_openib_control(mca_btl_base_module_t *btl,
        mca_btl_base_tag_t tag, mca_btl_base_descriptor_t *des,
        void *cbdata)
2005-11-10 23:15:02 +03:00
{
2007-11-28 10:11:14 +03:00
    /* don't return credits used for control messages */
2007-12-09 17:05:13 +03:00
    mca_btl_openib_module_t *obtl = (mca_btl_openib_module_t *) btl;
2007-11-28 10:13:34 +03:00
    mca_btl_openib_endpoint_t *ep = to_com_frag(des)->endpoint;
2007-11-28 10:11:14 +03:00
    mca_btl_openib_control_header_t *ctl_hdr =
        to_base_frag(des)->segment.seg_addr.pval;
2006-03-26 12:30:50 +04:00
    mca_btl_openib_eager_rdma_header_t *rdma_hdr;
2007-12-09 17:05:13 +03:00
    mca_btl_openib_header_coalesced_t *clsc_hdr =
        (mca_btl_openib_header_coalesced_t *) (ctl_hdr + 1);
2008-01-15 08:32:53 +03:00
    mca_btl_active_message_callback_t *reg;
2007-12-09 17:05:13 +03:00
    size_t len = des->des_dst->seg_len - sizeof(*ctl_hdr);
2008-01-21 15:11:18 +03:00
2006-03-26 12:30:50 +04:00
    switch (ctl_hdr->type) {
2006-09-05 20:02:09 +04:00
    case MCA_BTL_OPENIB_CONTROL_CREDITS:
2007-11-28 10:13:34 +03:00
        assert(0); /* Credit message is handled elsewhere */
2007-01-13 02:14:45 +03:00
        break;
2006-03-26 12:30:50 +04:00
    case MCA_BTL_OPENIB_CONTROL_RDMA:
        rdma_hdr = (mca_btl_openib_eager_rdma_header_t *) ctl_hdr;
2008-01-21 15:11:18 +03:00
2008-05-02 15:52:33 +04:00
        /* Both rkey and lval are cast to unsigned long to match %lu */
        BTL_VERBOSE(("prior to NTOH received rkey %lu, rdma_start.lval %lu, pval %p, ival %u",
2008-01-21 15:11:18 +03:00
                    (unsigned long) rdma_hdr->rkey,
2007-01-13 02:14:45 +03:00
                    (unsigned long) rdma_hdr->rdma_start.lval,
                    rdma_hdr->rdma_start.pval,
2007-03-05 17:17:50 +03:00
                    rdma_hdr->rdma_start.ival
2007-01-13 02:14:45 +03:00
                    ));
2008-01-21 15:11:18 +03:00
2007-11-28 10:13:34 +03:00
        if (ep->nbo) {
2007-11-28 10:12:44 +03:00
            BTL_OPENIB_EAGER_RDMA_CONTROL_HEADER_NTOH(*rdma_hdr);
        }
2008-01-21 15:11:18 +03:00
2007-11-28 10:12:44 +03:00
        BTL_VERBOSE(("received rkey %lu, rdma_start.lval %lu, pval %p, "
2008-05-02 15:52:33 +04:00
                    "ival %u", (unsigned long) rdma_hdr->rkey,
2007-11-28 10:12:44 +03:00
                    (unsigned long) rdma_hdr->rdma_start.lval,
                    rdma_hdr->rdma_start.pval, rdma_hdr->rdma_start.ival));
2008-01-21 15:11:18 +03:00
2007-11-28 10:13:34 +03:00
        if (ep->eager_rdma_remote.base.pval) {
2008-01-10 00:54:11 +03:00
            BTL_ERROR(("Got RDMA connect twice!"));
            return;
2006-03-26 12:30:50 +04:00
        }
2007-11-28 10:13:34 +03:00
        ep->eager_rdma_remote.rkey = rdma_hdr->rkey;
        ep->eager_rdma_remote.base.lval = rdma_hdr->rdma_start.lval;
        ep->eager_rdma_remote.tokens = mca_btl_openib_component.eager_rdma_num - 1;
2006-03-26 12:30:50 +04:00
        break;
2007-12-09 17:05:13 +03:00
    case MCA_BTL_OPENIB_CONTROL_COALESCED:
        /* Walk the coalesced sub-messages and dispatch each one */
        while (len > 0) {
2007-12-09 17:10:25 +03:00
            size_t skip;
2007-12-09 17:05:13 +03:00
            mca_btl_base_descriptor_t tmp_des;
            mca_btl_base_segment_t tmp_seg;
            assert(len >= sizeof(*clsc_hdr));
2007-12-09 17:10:25 +03:00
            if (ep->nbo)
                BTL_OPENIB_HEADER_COALESCED_NTOH(*clsc_hdr);
            skip = (sizeof(*clsc_hdr) + clsc_hdr->alloc_size);
2007-12-09 17:05:13 +03:00
            tmp_des.des_dst = &tmp_seg;
            tmp_des.des_dst_cnt = 1;
            tmp_seg.seg_addr.pval = clsc_hdr + 1;
            tmp_seg.seg_len = clsc_hdr->size;
            /* call registered callback */
2008-01-15 08:32:53 +03:00
            reg = mca_btl_base_active_message_trigger + clsc_hdr->tag;
            reg->cbfunc(&obtl->super, clsc_hdr->tag, &tmp_des, reg->cbdata);
2007-12-09 17:05:13 +03:00
            len -= skip;
            clsc_hdr = (mca_btl_openib_header_coalesced_t *)
                (((unsigned char *) clsc_hdr) + skip);
        }
        break;
2006-03-26 12:30:50 +04:00
    default:
2007-01-13 02:14:45 +03:00
        BTL_ERROR(("Unknown message type received by BTL"));
2006-03-26 12:30:50 +04:00
        break;
    }
2005-11-10 23:15:02 +03:00
}
2006-12-17 15:26:41 +03:00
static int openib_reg_mr(void *reg_data, void *base, size_t size,
        mca_mpool_base_registration_t *reg)
{
2008-07-23 04:28:59 +04:00
    mca_btl_openib_device_t *device = (mca_btl_openib_device_t *) reg_data;
2006-12-17 15:26:41 +03:00
    mca_btl_openib_reg_t *openib_reg = (mca_btl_openib_reg_t *) reg;
2008-07-23 04:28:59 +04:00
    openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, IBV_ACCESS_LOCAL_WRITE |
2006-12-17 15:26:41 +03:00
            IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
    if (NULL == openib_reg->mr)
        return OMPI_ERR_OUT_OF_RESOURCE;
    return OMPI_SUCCESS;
}
static int openib_dereg_mr(void *reg_data, mca_mpool_base_registration_t *reg)
{
    mca_btl_openib_reg_t *openib_reg = (mca_btl_openib_reg_t *) reg;
    if (openib_reg->mr != NULL) {
        if (ibv_dereg_mr(openib_reg->mr)) {
2008-05-02 15:52:33 +04:00
            BTL_ERROR(("%s: error unpinning openib memory errno says %s",
2007-10-15 21:53:02 +04:00
                        __func__, strerror(errno)));
2006-12-17 15:26:41 +03:00
            return OMPI_ERROR;
        }
    }
    openib_reg->mr = NULL;
    return OMPI_SUCCESS;
}
2007-06-13 16:47:38 +04:00
static inline int param_register_int(const char *param_name, int default_value)
{
    int param_value = default_value;
    int id = mca_base_param_register_int("btl", "openib", param_name, NULL,
            default_value);
    mca_base_param_lookup_int(id, &param_value);
    return param_value;
}
2006-12-17 15:26:41 +03:00
2007-11-28 10:14:34 +03:00
#if OMPI_HAVE_THREADS
static int start_async_event_thread(void)
{
    /* Set the fatal counter to zero */
    mca_btl_openib_component.fatal_counter = 0;
    /* Create pipe for communication with async event thread */
    if (pipe(mca_btl_openib_component.async_pipe)) {
        BTL_ERROR(("Failed to create pipe for communication with "
                    "async event thread"));
        return OMPI_ERROR;
    }
2008-07-13 20:21:49 +04:00
    if (pipe(mca_btl_openib_component.async_comp_pipe)) {
        BTL_ERROR(("Failed to create comp pipe for communication with "
                    "main thread"));
        return OMPI_ERROR;
    }
2007-11-28 10:14:34 +03:00
    /* Starting async event thread for the component */
    if (pthread_create(&mca_btl_openib_component.async_thread, NULL,
                (void *(*)(void *)) btl_openib_async_thread, NULL)) {
        BTL_ERROR(("Failed to create async event thread"));
        return OMPI_ERROR;
    }
    return OMPI_SUCCESS;
}
#endif
2008-07-23 04:28:59 +04:00
static int init_one_port ( opal_list_t * btl_list , mca_btl_openib_device_t * device ,
2007-04-22 14:22:12 +04:00
uint8_t port_num , uint16_t pkey_index ,
struct ibv_port_attr * ib_port_attr )
2006-06-28 11:23:08 +04:00
{
2008-01-28 13:38:08 +03:00
uint16_t lid , i , lmc , lmc_step ;
2006-06-28 11:23:08 +04:00
mca_btl_openib_module_t * openib_btl ;
mca_btl_base_selected_module_t * ib_selected ;
2006-09-22 14:27:12 +04:00
union ibv_gid gid ;
2007-01-13 01:42:20 +03:00
uint64_t subnet_id ;
2007-04-21 04:15:05 +04:00
2008-05-02 15:52:33 +04:00
/* If we have struct ibv_device.transport_type, then we're >= OFED
2008-05-07 18:51:36 +04:00
v1 .2 , and the transport could be iWarp or IB . If we don ' t have
that member , then we ' re < OFED v1 .2 , and it can only be IB . */
2008-05-02 15:52:33 +04:00
# if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
2008-07-23 04:28:59 +04:00
if ( IBV_TRANSPORT_IWARP = = device - > ib_dev - > transport_type ) {
subnet_id = mca_btl_openib_get_iwarp_subnet_id ( device - > ib_dev ) ;
2008-06-27 00:23:56 +04:00
BTL_VERBOSE ( ( " my iWARP subnet_id is %016 " PRIx64 , subnet_id ) ) ;
2008-05-02 15:52:33 +04:00
} else {
2008-06-27 00:23:56 +04:00
memset ( & gid , 0 , sizeof ( gid ) ) ;
2008-07-23 04:28:59 +04:00
if ( 0 ! = ibv_query_gid ( device - > ib_dev_context , port_num , 0 , & gid ) ) {
2008-06-27 00:23:56 +04:00
BTL_ERROR ( ( " ibv_query_gid failed (%s:%d) \n " ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) , port_num ) ) ;
2008-06-27 00:23:56 +04:00
return OMPI_ERR_NOT_FOUND ;
}
2008-05-02 15:52:33 +04:00
subnet_id = ntoh64 ( gid . global . subnet_prefix ) ;
2008-06-27 00:23:56 +04:00
BTL_VERBOSE ( ( " my IB subnet_id for HCA %s port %d is %016 " PRIx64 ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) , port_num , subnet_id ) ) ;
2008-05-02 15:52:33 +04:00
}
# else
2008-07-23 04:28:59 +04:00
if ( 0 ! = ibv_query_gid ( device - > ib_dev_context , port_num , 0 , & gid ) ) {
2008-06-27 00:23:56 +04:00
BTL_ERROR ( ( " ibv_query_gid failed (%s:%d) \n " ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) , port_num ) ) ;
2008-06-27 00:23:56 +04:00
return OMPI_ERR_NOT_FOUND ;
}
2007-01-13 01:42:20 +03:00
subnet_id = ntoh64 ( gid . global . subnet_prefix ) ;
2008-06-27 00:23:56 +04:00
BTL_VERBOSE ( ( " my IB-only subnet_id for HCA %s port %d is %016 " PRIx64 ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) , port_num , subnet_id ) ) ;
2008-05-02 15:52:33 +04:00
# endif
2006-09-26 16:12:33 +04:00
if ( mca_btl_openib_component . ib_num_btls > 0 & &
2007-01-13 01:42:20 +03:00
IB_DEFAULT_GID_PREFIX = = subnet_id & &
2006-09-26 16:12:33 +04:00
mca_btl_openib_component . warn_default_gid_prefix ) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already did the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
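The duplicate-suppression behavior described above can be sketched roughly as follows. This is only an illustration of the idea (display the first copy, count later copies, report the count periodically); it is not the HNP's implementation, and `help_should_display`, `help_flush`, and the fixed-size table are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 16

// One entry per distinct help message, keyed by "file:topic".
typedef struct {
    char key[128];
    int  suppressed;   // duplicates counted since the last flush
} help_entry_t;

static help_entry_t seen[MAX_TOPICS];
static int num_seen = 0;

// Returns 1 if the message is new and should be displayed now,
// 0 if it is a duplicate that was only counted.
static int help_should_display(const char *file, const char *topic)
{
    char key[128];
    int i;

    snprintf(key, sizeof(key), "%s:%s", file, topic);
    for (i = 0; i < num_seen; i++) {
        if (0 == strcmp(seen[i].key, key)) {
            seen[i].suppressed++;
            return 0;
        }
    }
    if (num_seen < MAX_TOPICS) {
        strcpy(seen[num_seen].key, key);
        seen[num_seen].suppressed = 0;
        num_seen++;
    }
    return 1;
}

// Meant to be called periodically (the text mentions roughly every
// 5 seconds): reports and resets the duplicate counts, returning the total.
static int help_flush(void)
{
    int i, total = 0;

    for (i = 0; i < num_seen; i++) {
        if (seen[i].suppressed > 0) {
            printf("%d more copies of help message \"%s\"\n",
                   seen[i].suppressed, seen[i].key);
            total += seen[i].suppressed;
            seen[i].suppressed = 0;
        }
    }
    return total;
}
```

In this sketch, three processes reporting the same topic produce one displayed message and a later summary line counting the other two.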
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves similarly to opal_output'ing to
stream 0, so leaving a hard-coded "0" as the first argument is
safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done so that configure.m4 searches for an appropriate XML
library, links it in, and uses it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " , " default subnet prefix " ,
2008-03-24 02:10:15 +03:00
true , orte_process_info . nodename ) ;
2006-09-26 16:12:33 +04:00
}
2006-06-28 11:23:08 +04:00
lmc = ( 1 < < ib_port_attr - > lmc ) ;
2008-01-28 13:38:08 +03:00
lmc_step = 1 ;
2006-06-28 11:23:08 +04:00
2008-01-21 15:11:18 +03:00
if ( 0 ! = mca_btl_openib_component . max_lmc & &
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, look up to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
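Two of the mechanisms listed above lend themselves to a short sketch: walking a colon-delimited file list left to right, and choosing the smaller of two exchanged MTU values. The code below is only an illustration under assumptions; `for_each_param_file`, `print_path`, and `negotiate_mtu` are hypothetical names, the paths in the usage note are made up, and plain integers stand in for whatever MTU representation the BTL actually uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Example visitor: print each path on its own line.
static void print_path(const char *path, void *ctx)
{
    (void) ctx;
    printf("%s\n", path);
}

// Visit each non-empty entry of a colon-delimited list, left to right.
// Returns the number of entries visited; visit may be NULL.
static int for_each_param_file(const char *list,
                               void (*visit)(const char *path, void *ctx),
                               void *ctx)
{
    const char *start = list;
    int count = 0;

    assert(list != NULL);
    while (*start != '\0') {
        size_t len = strcspn(start, ":");   // length up to next ':' or end
        if (len > 0) {
            char path[256];
            size_t copy = len < sizeof(path) ? len : sizeof(path) - 1;
            memcpy(path, start, copy);
            path[copy] = '\0';
            if (NULL != visit) {
                visit(path, ctx);
            }
            count++;
        }
        start += len;
        if (':' == *start) {
            start++;                        // skip the delimiter
        }
    }
    return count;
}

// Both peers exchange their maximum MTU; the connection uses the lesser.
static int negotiate_mtu(int local_mtu, int remote_mtu)
{
    return local_mtu < remote_mtu ? local_mtu : remote_mtu;
}
```

For example, `for_each_param_file("/etc/a.ini:/opt/b.ini", print_path, NULL)` visits the two (made-up) paths in the order given, and `negotiate_mtu(2048, 1024)` selects 1024.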
2006-08-14 23:30:37 +04:00
mca_btl_openib_component . max_lmc < lmc ) {
2006-06-28 11:23:08 +04:00
lmc = mca_btl_openib_component . max_lmc ;
2006-08-14 23:30:37 +04:00
}
2006-06-28 11:23:08 +04:00
2008-01-28 19:10:18 +03:00
# if OMPI_HAVE_THREADS
2008-05-02 15:52:33 +04:00
/* APM support -- only meaningful if async event support is
enabled . If async events are not enabled , then there ' s nothing
to listen for the APM event to load the new path , so it ' s not
worth enabling APM . */
2008-01-28 13:38:08 +03:00
if ( lmc > 1 ) {
2008-02-20 16:44:05 +03:00
if ( - 1 = = mca_btl_openib_component . apm_lmc ) {
2008-01-28 13:38:08 +03:00
lmc_step = lmc ;
2008-02-20 16:44:05 +03:00
mca_btl_openib_component . apm_lmc = lmc - 1 ;
} else if ( 0 = = lmc % ( mca_btl_openib_component . apm_lmc + 1 ) ) {
lmc_step = mca_btl_openib_component . apm_lmc + 1 ;
2008-01-28 13:38:08 +03:00
} else {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " , " apm with wrong lmc " , true ,
2008-02-20 16:44:05 +03:00
mca_btl_openib_component . apm_lmc , lmc ) ;
2008-01-28 13:38:08 +03:00
return OMPI_ERROR ;
}
} else {
2008-02-20 16:44:05 +03:00
if ( mca_btl_openib_component . apm_lmc ) {
2008-01-28 13:38:08 +03:00
/* Disable apm and report warning */
2008-02-20 16:44:05 +03:00
mca_btl_openib_component . apm_lmc = 0 ;
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
is needed so that configure.m4 searches for an appropriate XML
library, links it in, and uses it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
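To make the effect of orte_base_help_aggregate concrete, here is a minimal sketch in C of duplicate suppression keyed by help file and topic. It is not the ORTE implementation: `should_show_help`, `struct seen_help`, and the fixed-size table are hypothetical names chosen for illustration, and the real aggregation may work quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_TOPICS 16   /* hypothetical table size for this sketch */

/* Hypothetical record of one (file, topic) help message already shown. */
struct seen_help {
    const char *file;
    const char *topic;
    int         count;   /* number of times the message was requested */
};

static struct seen_help seen[MAX_TOPICS];
static int num_seen = 0;

/* Returns true if the message should be printed now.
 * "aggregate" plays the role of orte_base_help_aggregate: when false,
 * every message is printed; when true, only the first copy is printed
 * and later copies are merely counted.  If the table is full, the
 * message is printed without being recorded. */
static bool should_show_help(bool aggregate, const char *file,
                             const char *topic)
{
    assert(NULL != file && NULL != topic);
    if (!aggregate) {
        return true;
    }
    for (int i = 0; i < num_seen; ++i) {
        if (0 == strcmp(seen[i].file, file) &&
            0 == strcmp(seen[i].topic, topic)) {
            ++seen[i].count;
            return false;
        }
    }
    if (num_seen < MAX_TOPICS) {
        seen[num_seen].file = file;
        seen[num_seen].topic = topic;
        seen[num_seen].count = 1;
        ++num_seen;
    }
    return true;
}
```

With aggregation on, a second request for the same file and topic is counted but not shown; with it off, every request is shown, which matches the "original behavior" wording above.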
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " , " apm without lmc " , true ) ;
2008-01-28 13:38:08 +03:00
}
}
2008-01-28 19:10:18 +03:00
# endif
2008-01-28 13:38:08 +03:00
2006-06-28 11:23:08 +04:00
for ( lid = ib_port_attr - > lid ;
2008-01-28 13:38:08 +03:00
lid < ib_port_attr - > lid + lmc ; lid + = lmc_step ) {
2006-06-28 11:23:08 +04:00
for ( i = 0 ; i < mca_btl_openib_component . btls_per_lid ; i + + ) {
2007-06-13 16:47:38 +04:00
char param [ 40 ] ;
2008-01-15 02:22:03 +03:00
2006-06-28 11:23:08 +04:00
openib_btl = malloc ( sizeof ( mca_btl_openib_module_t ) ) ;
if ( NULL = = openib_btl ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When the component opens an HCA, look up whether any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
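Two items from the summary above lend themselves to a short illustration: the left-to-right scan of the colon-delimited btl_openib_hca_param_files value, and the MTU negotiation that picks the lesser of the two maxima. The following is a sketch only, not the code from the branch: `split_param_files` and `negotiate_mtu` are hypothetical helpers, and MTUs are treated as plain integers where a larger value means a larger MTU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pick the MTU both peers can use: the smaller of the two maxima. */
static int negotiate_mtu(int local_max_mtu, int remote_max_mtu)
{
    return (local_max_mtu < remote_max_mtu) ? local_max_mtu
                                            : remote_max_mtu;
}

/* Split a colon-delimited list in place, left to right.  Stores up to
 * "max" pointers into "out" and returns how many were stored.  Empty
 * entries (as in "a::b") are skipped.  The input buffer is modified:
 * each ':' that is visited is overwritten with a NUL byte. */
static size_t split_param_files(char *list, char **out, size_t max)
{
    size_t n = 0;
    char *p = list;

    assert(NULL != list && NULL != out);
    while (n < max && '\0' != *p) {
        char *colon = strchr(p, ':');
        if (NULL != colon) {
            *colon = '\0';
        }
        if ('\0' != *p) {
            out[n++] = p;
        }
        if (NULL == colon) {
            break;
        }
        p = colon + 1;
    }
    return n;
}
```

Because entries are returned in their original order, a caller that reads the files one after another gets the "in order, left-to-right" behavior described in the list.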
2006-08-14 23:30:37 +04:00
return OMPI_ERR_OUT_OF_RESOURCE ;
2006-06-28 11:23:08 +04:00
}
memcpy ( openib_btl , & mca_btl_openib_module ,
sizeof ( mca_btl_openib_module ) ) ;
memcpy ( & openib_btl - > ib_port_attr , ib_port_attr ,
sizeof ( struct ibv_port_attr ) ) ;
ib_selected = OBJ_NEW ( mca_btl_base_selected_module_t ) ;
ib_selected - > btl_module = ( mca_btl_base_module_t * ) openib_btl ;
2008-07-23 04:28:59 +04:00
openib_btl - > device = device ;
2006-06-28 11:23:08 +04:00
openib_btl - > port_num = ( uint8_t ) port_num ;
2007-04-22 14:22:12 +04:00
openib_btl - > pkey_index = pkey_index ;
2006-06-28 11:23:08 +04:00
openib_btl - > lid = lid ;
2008-02-20 16:44:05 +03:00
openib_btl - > apm_port = 0 ;
2006-06-28 11:23:08 +04:00
openib_btl - > src_path_bits = lid - ib_port_attr - > lid ;
2008-05-02 15:52:33 +04:00
2007-01-13 01:42:20 +03:00
openib_btl - > port_info . subnet_id = subnet_id ;
2008-07-23 04:28:59 +04:00
openib_btl - > port_info . mtu = device - > mtu ;
2008-02-20 16:44:05 +03:00
openib_btl - > port_info . lid = lid ;
2008-05-02 15:52:33 +04:00
openib_btl - > cpcs = NULL ;
openib_btl - > num_cpcs = 0 ;
2008-01-15 08:32:53 +03:00
mca_btl_base_active_message_trigger [ MCA_BTL_TAG_IB ] . cbfunc = btl_openib_control ;
mca_btl_base_active_message_trigger [ MCA_BTL_TAG_IB ] . cbdata = NULL ;
2007-04-21 04:15:05 +04:00
2008-07-23 04:28:59 +04:00
/* Check bandwidth configured for this device */
sprintf ( param , " bandwidth_%s " , ibv_get_device_name ( device - > ib_dev ) ) ;
2007-06-13 16:47:38 +04:00
openib_btl - > super . btl_bandwidth =
param_register_int ( param , openib_btl - > super . btl_bandwidth ) ;
2008-07-23 04:28:59 +04:00
/* Check bandwidth configured for this device/port */
sprintf ( param , " bandwidth_%s:%d " , ibv_get_device_name ( device - > ib_dev ) ,
2007-06-13 16:47:38 +04:00
port_num ) ;
openib_btl - > super . btl_bandwidth =
param_register_int ( param , openib_btl - > super . btl_bandwidth ) ;
2008-07-23 04:28:59 +04:00
/* Check bandwidth configured for this device/port/LID */
2007-06-13 16:47:38 +04:00
sprintf ( param , " bandwidth_%s:%d:%d " ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) , port_num , lid ) ;
2007-06-13 16:47:38 +04:00
openib_btl - > super . btl_bandwidth =
param_register_int ( param , openib_btl - > super . btl_bandwidth ) ;
2008-07-23 04:28:59 +04:00
/* Check latency configured for this device */
sprintf ( param , " latency_%s " , ibv_get_device_name ( device - > ib_dev ) ) ;
2007-06-13 16:47:38 +04:00
openib_btl - > super . btl_latency =
param_register_int ( param , openib_btl - > super . btl_latency ) ;
2008-07-23 04:28:59 +04:00
/* Check latency configured for this device/port */
sprintf ( param , " latency_%s:%d " , ibv_get_device_name ( device - > ib_dev ) ,
2007-06-13 16:47:38 +04:00
port_num ) ;
openib_btl - > super . btl_latency =
param_register_int ( param , openib_btl - > super . btl_latency ) ;
2008-07-23 04:28:59 +04:00
/* Check latency configured for this device/port/LID */
sprintf ( param , " latency_%s:%d:%d " , ibv_get_device_name ( device - > ib_dev ) ,
2007-06-13 16:47:38 +04:00
port_num , lid ) ;
openib_btl - > super . btl_latency =
param_register_int ( param , openib_btl - > super . btl_latency ) ;
2007-04-21 04:15:05 +04:00
/* Auto-detect the port bandwidth */
if ( 0 = = openib_btl - > super . btl_bandwidth ) {
/* To calculate the bandwidth available on this port,
we have to look up the values corresponding to
port - > active_speed and port - > active_width . These
are enums corresponding to the IB spec . Overall
formula is 80 % of the reported speed ( to get the
true link speed ) times the number of links . */
switch ( ib_port_attr - > active_speed ) {
2008-01-21 15:11:18 +03:00
case 1 :
2007-04-21 04:15:05 +04:00
/* 2.5Gbps * 0.8, in megabits */
openib_btl - > super . btl_bandwidth = 2000 ;
break ;
2008-01-21 15:11:18 +03:00
case 2 :
2007-04-21 04:15:05 +04:00
/* 5.0Gbps * 0.8, in megabits */
openib_btl - > super . btl_bandwidth = 4000 ;
break ;
2008-01-21 15:11:18 +03:00
case 4 :
2007-04-21 04:15:05 +04:00
/* 10.0Gbps * 0.8, in megabits */
openib_btl - > super . btl_bandwidth = 8000 ;
break ;
2008-01-21 15:11:18 +03:00
default :
2008-02-18 23:28:06 +03:00
/* Who knows? Declare this port unreachable (do
* not * return OMPI_ERR_VALUE_OUT_OF_BOUNDS ; that
is reserved for when we exceed the number of
allowable BTLs ) . */
return OMPI_ERR_UNREACH ;
2007-04-21 04:15:05 +04:00
}
switch ( ib_port_attr - > active_width ) {
case 1 :
/* 1x */
/* unity */
break ;
case 2 :
/* 4x */
openib_btl - > super . btl_bandwidth * = 4 ;
break ;
case 4 :
/* 8x */
openib_btl - > super . btl_bandwidth * = 8 ;
break ;
case 8 :
/* 12x */
openib_btl - > super . btl_bandwidth * = 12 ;
break ;
default :
2008-02-18 23:28:06 +03:00
/* Who knows? Declare this port unreachable (do
* not * return OMPI_ERR_VALUE_OUT_OF_BOUNDS ; that
is reserved for when we exceed the number of
allowable BTLs ) . */
return OMPI_ERR_UNREACH ;
2007-04-21 04:15:05 +04:00
}
}
2006-06-28 11:23:08 +04:00
opal_list_append ( btl_list , ( opal_list_item_t * ) ib_selected ) ;
2008-07-23 04:28:59 +04:00
opal_pointer_array_add ( device - > device_btls , ( void * ) openib_btl ) ;
+ + device - > btls ;
2006-08-14 23:30:37 +04:00
+ + mca_btl_openib_component . ib_num_btls ;
if ( - 1 ! = mca_btl_openib_component . ib_max_btls & &
mca_btl_openib_component . ib_num_btls > =
mca_btl_openib_component . ib_max_btls ) {
2006-09-25 15:18:20 +04:00
return OMPI_ERR_VALUE_OUT_OF_BOUNDS ;
2006-08-14 23:30:37 +04:00
}
2006-06-28 11:23:08 +04:00
}
}
2006-08-14 23:30:37 +04:00
return OMPI_SUCCESS ;
2006-06-28 11:23:08 +04:00
}
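The bandwidth auto-detection in the function above can be isolated into a small pure function, which makes the arithmetic easier to check. This is a sketch under assumptions: `ib_port_bandwidth_mbps` is a hypothetical helper that is not part of the component, and it returns -1 where the original returns OMPI_ERR_UNREACH. The speed and width codes are the ones handled in the two switch statements above.

```c
#include <assert.h>

/* Returns the usable port bandwidth in megabits per second, or -1 if
 * either code is not one of the values handled by the component.
 * active_speed: 1 = 2.5 Gbps, 2 = 5.0 Gbps, 4 = 10.0 Gbps per lane.
 * active_width: 1 = 1x, 2 = 4x, 4 = 8x, 8 = 12x. */
static int ib_port_bandwidth_mbps(int active_speed, int active_width)
{
    int per_lane;   /* 80 % of the signalling rate, in megabits */
    int lanes;

    switch (active_speed) {
    case 1:  per_lane = 2000; break;   /* 2.5 Gbps * 0.8 */
    case 2:  per_lane = 4000; break;   /* 5.0 Gbps * 0.8 */
    case 4:  per_lane = 8000; break;   /* 10.0 Gbps * 0.8 */
    default: return -1;
    }

    switch (active_width) {
    case 1:  lanes = 1;  break;
    case 2:  lanes = 4;  break;
    case 4:  lanes = 8;  break;
    case 8:  lanes = 12; break;
    default: return -1;
    }

    return per_lane * lanes;
}
```

For example, a 4x port at 5.0 Gbps per lane (speed code 2, width code 2) yields 4000 * 4 = 16000 megabits. In the component this calculation only runs when no bandwidth_* parameter has set a non-zero value.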
2008-07-23 04:28:59 +04:00
static void device_construct ( mca_btl_openib_device_t * device )
2007-12-23 15:29:34 +03:00
{
2008-07-23 04:28:59 +04:00
device - > ib_dev = NULL ;
device - > ib_dev_context = NULL ;
device - > ib_pd = NULL ;
device - > mpool = NULL ;
2008-05-02 15:52:33 +04:00
# if OMPI_ENABLE_PROGRESS_THREADS
2008-07-23 04:28:59 +04:00
device - > ib_channel = NULL ;
2008-01-09 13:26:21 +03:00
# endif
2008-07-23 04:28:59 +04:00
device - > btls = 0 ;
device - > ib_cq [ BTL_OPENIB_HP_CQ ] = NULL ;
device - > ib_cq [ BTL_OPENIB_LP_CQ ] = NULL ;
device - > cq_size [ BTL_OPENIB_HP_CQ ] = 0 ;
device - > cq_size [ BTL_OPENIB_LP_CQ ] = 0 ;
device - > non_eager_rdma_endpoints = 0 ;
device - > hp_cq_polls = mca_btl_openib_component . cq_poll_ratio ;
device - > eager_rdma_polls = mca_btl_openib_component . eager_rdma_poll_ratio ;
device - > pollme = true ;
device - > eager_rdma_buffers_count = 0 ;
device - > eager_rdma_buffers = NULL ;
2008-01-09 13:26:21 +03:00
# if HAVE_XRC
2008-07-23 04:28:59 +04:00
device - > xrc_fd = - 1 ;
2008-01-09 13:26:21 +03:00
# endif
2008-07-23 04:28:59 +04:00
device - > qps = NULL ;
2008-06-20 23:30:51 +04:00
# if OMPI_HAVE_THREADS
mca_btl_openib_component . async_pipe [ 0 ] =
mca_btl_openib_component . async_pipe [ 1 ] = - 1 ;
2008-07-13 20:21:49 +04:00
mca_btl_openib_component . async_comp_pipe [ 0 ] =
mca_btl_openib_component . async_comp_pipe [ 1 ] = - 1 ;
2008-06-20 23:30:51 +04:00
# endif
2008-07-23 04:28:59 +04:00
OBJ_CONSTRUCT ( & device - > device_lock , opal_mutex_t ) ;
OBJ_CONSTRUCT ( & device - > send_free_control , ompi_free_list_t ) ;
device - > max_inline_data = 0 ;
2007-12-23 15:29:34 +03:00
}
2008-07-23 04:28:59 +04:00
static void device_destruct ( mca_btl_openib_device_t * device )
2007-12-23 15:29:34 +03:00
{
2008-01-09 13:26:21 +03:00
int i ;
2008-06-05 16:20:13 +04:00
# if OMPI_HAVE_THREADS
# if OMPI_ENABLE_PROGRESS_THREADS
2008-07-23 04:28:59 +04:00
if ( device - > progress ) {
device - > progress = false ;
if ( pthread_cancel ( device - > thread . t_handle ) ) {
2008-06-05 16:20:13 +04:00
BTL_ERROR ( ( " Failed to cancel OpenIB progress thread " ) ) ;
2008-07-23 04:28:59 +04:00
goto device_error ;
2008-06-05 16:20:13 +04:00
}
2008-07-23 04:28:59 +04:00
opal_thread_join ( & device - > thread , NULL ) ;
2008-06-05 16:20:13 +04:00
}
2008-07-23 04:28:59 +04:00
if ( ibv_destroy_comp_channel ( device - > ib_channel ) ) {
2008-06-05 16:20:13 +04:00
BTL_VERBOSE ( ( " Failed to close comp_channel " ) ) ;
2008-07-23 04:28:59 +04:00
goto device_error ;
2008-06-05 16:20:13 +04:00
}
# endif
2008-07-23 04:28:59 +04:00
/* signal the async thread to stop polling this device */
2008-06-20 23:30:51 +04:00
if ( mca_btl_openib_component . use_async_event_thread & &
- 1 ! = mca_btl_openib_component . async_pipe [ 1 ] ) {
2008-07-23 04:28:59 +04:00
int device_to_remove ;
device_to_remove = - ( device - > ib_dev_context - > async_fd ) ;
if ( write ( mca_btl_openib_component . async_pipe [ 1 ] , & device_to_remove ,
2008-06-05 16:20:13 +04:00
sizeof ( int ) ) < 0 ) {
BTL_ERROR ( ( " Failed to write to pipe " ) ) ;
2008-07-23 04:28:59 +04:00
goto device_error ;
2008-06-05 16:20:13 +04:00
}
2008-07-13 20:21:49 +04:00
/* wait for ok from thread */
2008-07-23 04:28:59 +04:00
if ( OMPI_SUCCESS ! = btl_openib_async_command_done ( device_to_remove ) ) {
goto device_error ;
2008-07-13 20:21:49 +04:00
}
2008-06-05 16:20:13 +04:00
}
# endif
2008-07-23 04:28:59 +04:00
if ( device - > eager_rdma_buffers ) {
2007-12-23 15:29:34 +03:00
int i ;
2008-07-23 04:28:59 +04:00
for ( i = 0 ; i < device - > eager_rdma_buffers_count ; i + + )
if ( device - > eager_rdma_buffers [ i ] )
OBJ_RELEASE ( device - > eager_rdma_buffers [ i ] ) ;
free ( device - > eager_rdma_buffers ) ;
2007-12-23 15:29:34 +03:00
}
2008-06-05 16:20:13 +04:00
2008-07-23 04:28:59 +04:00
if ( NULL ! = device - > qps ) {
2008-05-21 01:53:42 +04:00
for ( i = 0 ; i < mca_btl_openib_component . num_qps ; i + + ) {
2008-07-23 04:28:59 +04:00
OBJ_DESTRUCT ( & device - > qps [ i ] . send_free ) ;
OBJ_DESTRUCT ( & device - > qps [ i ] . recv_free ) ;
2008-05-21 01:53:42 +04:00
}
2008-07-23 04:28:59 +04:00
free ( device - > qps ) ;
2008-05-21 01:53:42 +04:00
}
2008-06-05 16:20:13 +04:00
2008-07-23 04:28:59 +04:00
OBJ_DESTRUCT ( & device - > send_free_control ) ;
2008-06-05 16:20:13 +04:00
/* Release CQs */
2008-07-23 04:28:59 +04:00
if ( device - > ib_cq [ BTL_OPENIB_HP_CQ ] ! = NULL ) {
if ( ibv_destroy_cq ( device - > ib_cq [ BTL_OPENIB_HP_CQ ] ) ) {
2008-06-05 16:20:13 +04:00
BTL_VERBOSE ( ( " Failed to close HP CQ " ) ) ;
2008-07-23 04:28:59 +04:00
goto device_error ;
2008-06-05 16:20:13 +04:00
}
}
2008-07-23 04:28:59 +04:00
if ( device - > ib_cq [ BTL_OPENIB_LP_CQ ] ! = NULL ) {
if ( ibv_destroy_cq ( device - > ib_cq [ BTL_OPENIB_LP_CQ ] ) ) {
2008-06-05 16:20:13 +04:00
BTL_VERBOSE ( ( " Failed to close LP CQ " ) ) ;
2008-07-23 04:28:59 +04:00
goto device_error ;
2008-06-05 16:20:13 +04:00
}
}
2008-07-23 04:28:59 +04:00
if ( OMPI_SUCCESS ! = mca_mpool_base_module_destroy ( device - > mpool ) ) {
2008-06-05 16:20:13 +04:00
BTL_VERBOSE ( ( " Failed to release mpool " ) ) ;
2008-07-23 04:28:59 +04:00
goto device_error ;
2008-06-05 16:20:13 +04:00
}
# if HAVE_XRC
if ( MCA_BTL_XRC_ENABLED ) {
2008-07-23 04:28:59 +04:00
if ( OMPI_SUCCESS ! = mca_btl_openib_close_xrc_domain ( device ) ) {
2008-06-05 16:20:13 +04:00
BTL_VERBOSE ( ( " XRC Internal error. Failed to close xrc domain " ) ) ;
2008-07-23 04:28:59 +04:00
goto device_error ;
2008-06-05 16:20:13 +04:00
}
}
# endif
2008-07-23 04:28:59 +04:00
if ( ibv_dealloc_pd ( device - > ib_pd ) ) {
2008-06-05 16:20:13 +04:00
BTL_VERBOSE ( ( " Warning! Failed to release PD " ) ) ;
2008-07-23 04:28:59 +04:00
goto device_error ;
2008-06-05 16:20:13 +04:00
}
2008-07-23 04:28:59 +04:00
OBJ_DESTRUCT ( & device - > device_lock ) ;
2008-06-05 16:20:13 +04:00
2008-07-23 04:28:59 +04:00
if ( ibv_close_device ( device - > ib_dev_context ) ) {
2008-09-17 02:06:14 +04:00
if ( 1 = = ompi_mpi_leave_pinned | | ompi_mpi_leave_pinned_pipeline ) {
2008-07-23 04:28:59 +04:00
BTL_VERBOSE ( ( " Warning! Failed to close device " ) ) ;
goto device_error ;
2008-06-05 16:20:13 +04:00
} else {
2008-07-23 04:28:59 +04:00
BTL_ERROR ( ( " Error! Failed to close device " ) ) ;
goto device_error ;
2008-06-05 16:20:13 +04:00
}
}
2008-07-23 04:28:59 +04:00
BTL_VERBOSE ( ( " device was successfully released " ) ) ;
2008-06-05 16:20:13 +04:00
return ;
2008-07-23 04:28:59 +04:00
device_error :
BTL_VERBOSE ( ( " Failed to destroy device resources " ) ) ;
2007-12-23 15:29:34 +03:00
}
2008-07-23 04:28:59 +04:00
OBJ_CLASS_INSTANCE ( mca_btl_openib_device_t , opal_object_t , device_construct ,
device_destruct ) ;
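The destructor above tells the async event thread to stop polling a device by writing the negated async_fd to a pipe, while prepare_device_for_use writes the fd itself to start polling. A minimal sketch of that sign-based encoding follows. The helper names `send_poll_command` and `recv_poll_command` are hypothetical, and the acknowledgement step (btl_openib_async_command_done in the original) is left out.

```c
#include <assert.h>
#include <unistd.h>

/* Ask the reader to start polling "fd" (add != 0) or to stop polling
 * it (add == 0).  The request is a single int: fd to add, -fd to
 * remove.  Returns 0 on success, -1 on a failed or short write. */
static int send_poll_command(int pipe_wr, int fd, int add)
{
    int cmd = add ? fd : -fd;

    assert(fd > 0);   /* fd 0 cannot be encoded: -0 == 0 */
    return (write(pipe_wr, &cmd, sizeof(cmd)) == (ssize_t) sizeof(cmd))
               ? 0 : -1;
}

/* Read one command from the pipe.  On success, stores the descriptor
 * in *fd and 1 (add) or 0 (remove) in *add, and returns 0. */
static int recv_poll_command(int pipe_rd, int *fd, int *add)
{
    int cmd;

    if (read(pipe_rd, &cmd, sizeof(cmd)) != (ssize_t) sizeof(cmd)) {
        return -1;
    }
    *add = (cmd > 0);
    *fd  = (cmd > 0) ? cmd : -cmd;
    return 0;
}
```

One consequence of this encoding is that descriptor 0 cannot be distinguished between "add" and "remove"; the sketch asserts against it.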
2007-12-23 15:29:34 +03:00
2008-07-23 04:28:59 +04:00
static int prepare_device_for_use ( mca_btl_openib_device_t * device )
2008-01-09 13:26:21 +03:00
{
mca_btl_openib_frag_init_data_t * init_data ;
int qp , length ;
# if OMPI_HAVE_THREADS
if ( mca_btl_openib_component . use_async_event_thread ) {
if ( 0 = = mca_btl_openib_component . async_thread ) {
/* async thread is not yet started, so start it here */
if ( start_async_event_thread ( ) ! = OMPI_SUCCESS )
return OMPI_ERROR ;
}
2008-07-23 04:28:59 +04:00
device - > got_fatal_event = false ;
2008-01-09 13:26:21 +03:00
if ( write ( mca_btl_openib_component . async_pipe [ 1 ] ,
2008-07-23 04:28:59 +04:00
& device - > ib_dev_context - > async_fd , sizeof ( int ) ) < 0 ) {
2008-01-09 13:26:21 +03:00
BTL_ERROR ( ( " Failed to write to pipe [%d] " , errno ) ) ;
return OMPI_ERROR ;
2008-01-21 15:11:18 +03:00
}
2008-07-13 20:21:49 +04:00
/* wait for ok from thread */
if ( OMPI_SUCCESS ! =
2008-07-23 04:28:59 +04:00
btl_openib_async_command_done ( device - > ib_dev_context - > async_fd ) ) {
2008-07-13 20:21:49 +04:00
return OMPI_ERROR ;
}
2008-01-09 13:26:21 +03:00
}
# if OMPI_ENABLE_PROGRESS_THREADS == 1
/* Prepare data for thread, but not starting it */
2008-07-23 04:28:59 +04:00
OBJ_CONSTRUCT ( & device - > thread , opal_thread_t ) ;
device - > thread . t_run = mca_btl_openib_progress_thread ;
device - > thread . t_arg = device ;
device - > progress = false ;
2008-01-09 13:26:21 +03:00
# endif
# endif
2008-05-28 15:31:38 +04:00
# if HAVE_XRC
/* if user configured to run with XRC qp and the device doesn't
2008-07-23 04:28:59 +04:00
* support it - we should ignore this device . Maybe we have another
2008-05-28 15:31:38 +04:00
* one that has XRC support
*/
2008-07-23 04:28:59 +04:00
if ( ! ( device - > ib_dev_attr . device_cap_flags & IBV_DEVICE_XRC ) & &
2008-05-28 15:31:38 +04:00
MCA_BTL_XRC_ENABLED ) {
orte_show_help ( " help-mpi-btl-openib.txt " ,
" XRC on device without XRC support " , true ,
mca_btl_openib_component . num_xrc_qps ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) ,
2008-05-28 15:31:38 +04:00
orte_process_info . nodename ) ;
return OMPI_ERROR ;
}
if ( MCA_BTL_XRC_ENABLED ) {
2008-07-23 04:28:59 +04:00
if ( OMPI_SUCCESS ! = mca_btl_openib_open_xrc_domain ( device ) ) {
2008-05-28 15:31:38 +04:00
BTL_ERROR ( ( " XRC Internal error. Failed to open xrc domain " ) ) ;
return OMPI_ERROR ;
}
}
# endif
2008-07-23 04:28:59 +04:00
device - > endpoints = OBJ_NEW ( opal_pointer_array_t ) ;
opal_pointer_array_init ( device - > endpoints , 10 , INT_MAX , 10 ) ;
opal_pointer_array_add ( & mca_btl_openib_component . devices , device ) ;
2008-06-24 22:31:46 +04:00
if ( mca_btl_openib_component . max_eager_rdma > 0 & &
2008-07-23 04:28:59 +04:00
device - > use_eager_rdma ) {
device - > eager_rdma_buffers =
calloc ( mca_btl_openib_component . max_eager_rdma * device - > btls ,
2008-01-09 13:26:21 +03:00
sizeof ( mca_btl_openib_endpoint_t * ) ) ;
2008-07-23 04:28:59 +04:00
if ( NULL = = device - > eager_rdma_buffers ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Memory allocation fails " ) ) ;
2008-01-09 13:26:21 +03:00
return OMPI_ERR_OUT_OF_RESOURCE ;
}
}
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
length = sizeof ( mca_btl_openib_header_t ) +
2008-01-21 15:11:18 +03:00
sizeof ( mca_btl_openib_footer_t ) +
2008-01-09 13:26:21 +03:00
sizeof ( mca_btl_openib_eager_rdma_header_t ) ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data - > order = MCA_BTL_NO_ORDER ;
2008-07-23 04:28:59 +04:00
init_data - > list = & device - > send_free_control ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex_new (
2008-07-23 04:28:59 +04:00
& device - > send_free_control ,
2008-01-09 13:26:21 +03:00
sizeof ( mca_btl_openib_send_control_frag_t ) , CACHE_LINE_SIZE ,
OBJ_CLASS ( mca_btl_openib_send_control_frag_t ) , length ,
mca_btl_openib_component . buffer_alignment ,
mca_btl_openib_component . ib_free_list_num , - 1 ,
mca_btl_openib_component . ib_free_list_inc ,
2008-07-23 04:28:59 +04:00
device - > mpool , mca_btl_openib_frag_init ,
2008-01-21 15:11:18 +03:00
init_data ) ) {
2008-01-09 13:26:21 +03:00
return OMPI_ERROR ;
}
2008-01-21 15:11:18 +03:00
/* setup all the qps */
for ( qp = 0 ; qp < mca_btl_openib_component . num_qps ; qp + + ) {
2008-01-09 13:26:21 +03:00
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
2008-01-21 15:11:18 +03:00
/* Initialize pool of send fragments */
2008-01-09 13:26:21 +03:00
length = sizeof ( mca_btl_openib_header_t ) +
sizeof ( mca_btl_openib_header_coalesced_t ) +
sizeof ( mca_btl_openib_control_header_t ) +
sizeof ( mca_btl_openib_footer_t ) +
mca_btl_openib_component . qp_infos [ qp ] . size ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data - > order = qp ;
2008-07-23 04:28:59 +04:00
init_data - > list = & device - > qps [ qp ] . send_free ;
2008-01-09 13:26:21 +03:00
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex_new ( init_data - > list ,
sizeof ( mca_btl_openib_send_frag_t ) , CACHE_LINE_SIZE ,
OBJ_CLASS ( mca_btl_openib_send_frag_t ) , length ,
mca_btl_openib_component . buffer_alignment ,
mca_btl_openib_component . ib_free_list_num ,
mca_btl_openib_component . ib_free_list_max ,
mca_btl_openib_component . ib_free_list_inc ,
2008-07-23 04:28:59 +04:00
device - > mpool , mca_btl_openib_frag_init ,
2008-01-21 15:11:18 +03:00
init_data ) ) {
2008-01-09 13:26:21 +03:00
return OMPI_ERROR ;
}
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
length = sizeof ( mca_btl_openib_header_t ) +
sizeof ( mca_btl_openib_header_coalesced_t ) +
sizeof ( mca_btl_openib_control_header_t ) +
sizeof ( mca_btl_openib_footer_t ) +
mca_btl_openib_component . qp_infos [ qp ] . size ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data - > order = qp ;
2008-07-23 04:28:59 +04:00
init_data - > list = & device - > qps [ qp ] . recv_free ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex_new ( init_data - > list ,
sizeof ( mca_btl_openib_recv_frag_t ) , CACHE_LINE_SIZE ,
OBJ_CLASS ( mca_btl_openib_recv_frag_t ) ,
length , mca_btl_openib_component . buffer_alignment ,
mca_btl_openib_component . ib_free_list_num ,
mca_btl_openib_component . ib_free_list_max ,
mca_btl_openib_component . ib_free_list_inc ,
2008-07-23 04:28:59 +04:00
device - > mpool , mca_btl_openib_frag_init ,
2008-01-21 15:11:18 +03:00
init_data ) ) {
2008-01-09 13:26:21 +03:00
return OMPI_ERROR ;
}
}
2008-07-23 04:28:59 +04:00
mca_btl_openib_component . devices_count + + ;
2008-01-09 13:26:21 +03:00
return OMPI_SUCCESS ;
}
2008-01-09 13:27:15 +03:00
static int
2008-07-23 04:28:59 +04:00
get_port_list ( mca_btl_openib_device_t * device , int * allowed_ports )
2008-01-09 13:27:15 +03:00
{
int i , j , k , num_ports = 0 ;
const char * dev_name ;
char * name ;
2008-07-23 04:28:59 +04:00
dev_name = ibv_get_device_name ( device - > ib_dev ) ;
2008-01-09 13:27:15 +03:00
name = ( char * ) malloc ( strlen ( dev_name ) + 5 ) ; /* ':' + up to 3 port digits + NUL */
if ( NULL = = name ) {
return 0 ;
}
/* Assume that all ports are allowed. num_ports will be adjusted
below to reflect whether this is true or not . */
2008-07-23 04:28:59 +04:00
for ( i = 1 ; i < = device - > ib_dev_attr . phys_port_cnt ; + + i ) {
2008-01-09 13:27:15 +03:00
allowed_ports [ num_ports + + ] = i ;
}
num_ports = 0 ;
if ( NULL ! = mca_btl_openib_component . if_include_list ) {
2008-07-23 04:28:59 +04:00
/* If only the device name is given (e.g., mtdevice0,mtdevice1) use all
2008-01-09 13:27:15 +03:00
ports */
i = 0 ;
while ( mca_btl_openib_component . if_include_list [ i ] ) {
2008-01-21 15:11:18 +03:00
if ( 0 = = strcmp ( dev_name ,
2008-01-09 13:27:15 +03:00
mca_btl_openib_component . if_include_list [ i ] ) ) {
2008-07-23 04:28:59 +04:00
num_ports = device - > ib_dev_attr . phys_port_cnt ;
2008-01-09 13:27:15 +03:00
goto done ;
}
+ + i ;
}
2008-07-23 04:28:59 +04:00
/* Include only requested ports on the device */
for ( i = 1 ; i < = device - > ib_dev_attr . phys_port_cnt ; + + i ) {
2008-01-09 13:27:15 +03:00
sprintf ( name , " %s:%d " , dev_name , i ) ;
2008-01-21 15:11:18 +03:00
for ( j = 0 ;
2008-01-09 13:27:15 +03:00
NULL ! = mca_btl_openib_component . if_include_list [ j ] ; + + j ) {
2008-01-21 15:11:18 +03:00
if ( 0 = = strcmp ( name ,
2008-01-09 13:27:15 +03:00
mca_btl_openib_component . if_include_list [ j ] ) ) {
allowed_ports [ num_ports + + ] = i ;
break ;
}
}
}
} else if ( NULL ! = mca_btl_openib_component . if_exclude_list ) {
2008-07-23 04:28:59 +04:00
/* If only the device name is given (e.g., mtdevice0,mtdevice1) exclude
2008-01-09 13:27:15 +03:00
all ports */
i = 0 ;
while ( mca_btl_openib_component . if_exclude_list [ i ] ) {
2008-01-21 15:11:18 +03:00
if ( 0 = = strcmp ( dev_name ,
2008-01-09 13:27:15 +03:00
mca_btl_openib_component . if_exclude_list [ i ] ) ) {
num_ports = 0 ;
goto done ;
}
+ + i ;
}
2008-07-23 04:28:59 +04:00
/* Exclude the specified ports on this device */
for ( i = 1 ; i < = device - > ib_dev_attr . phys_port_cnt ; + + i ) {
2008-01-09 13:27:15 +03:00
sprintf ( name , " %s:%d " , dev_name , i ) ;
2008-01-21 15:11:18 +03:00
for ( j = 0 ;
2008-01-09 13:27:15 +03:00
NULL ! = mca_btl_openib_component . if_exclude_list [ j ] ; + + j ) {
2008-01-21 15:11:18 +03:00
if ( 0 = = strcmp ( name ,
2008-01-09 13:27:15 +03:00
mca_btl_openib_component . if_exclude_list [ j ] ) ) {
/* If found, set a sentinel value */
j = - 1 ;
break ;
}
}
/* If we didn't find it, it's ok to include in the list */
if ( - 1 ! = j ) {
allowed_ports [ num_ports + + ] = i ;
}
}
} else {
2008-07-23 04:28:59 +04:00
num_ports = device - > ib_dev_attr . phys_port_cnt ;
2008-01-09 13:27:15 +03:00
}
done:
    /* Remove the following from the error-checking if_list:
       - bare device name
       - device name suffixed with port number */
    if (NULL != mca_btl_openib_component.if_list) {
        for (i = 0; NULL != mca_btl_openib_component.if_list[i]; ++i) {
            /* Look for raw device name */
            if (0 == strcmp(mca_btl_openib_component.if_list[i], dev_name)) {
                j = opal_argv_count(mca_btl_openib_component.if_list);
                opal_argv_delete(&j, &(mca_btl_openib_component.if_list),
                                 i, 1);
                --i;
            }
        }
        for (i = 1; i <= device->ib_dev_attr.phys_port_cnt; ++i) {
            sprintf(name, "%s:%d", dev_name, i);
            for (j = 0; NULL != mca_btl_openib_component.if_list[j]; ++j) {
                if (0 == strcmp(mca_btl_openib_component.if_list[j], name)) {
                    k = opal_argv_count(mca_btl_openib_component.if_list);
                    opal_argv_delete(&k, &(mca_btl_openib_component.if_list),
                                     j, 1);
                    --j;
                    break;
                }
            }
        }
    }

    free(name);
    return num_ports;
}
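
/*
 * Illustrative sketch only (not part of this component): the include/exclude
 * matching done by the port-list code above, reduced to plain C with no
 * Open MPI dependencies.  list_matches() and filter_ports() are hypothetical
 * names, and the 64-byte name buffer is an assumption of the sketch;
 * snprintf() is used so that a long device name cannot overrun it.  As in
 * the code above, an include list takes precedence over an exclude list.
 */

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if either "dev" or "dev:port" appears in
 * the NULL-terminated list, 0 otherwise. */
static int list_matches(const char *const *list, const char *dev, int port)
{
    char name[64];
    int i;

    assert(NULL != list && NULL != dev);
    /* Bounded write; the result is truncated if the name is too long */
    snprintf(name, sizeof(name), "%s:%d", dev, port);
    for (i = 0; NULL != list[i]; ++i) {
        if (0 == strcmp(list[i], dev) || 0 == strcmp(list[i], name)) {
            return 1;
        }
    }
    return 0;
}

/* Fill allowed[] with the 1-based port numbers that pass the filter and
 * return how many were stored.  Both lists NULL means "all ports". */
static int filter_ports(const char *dev, int phys_port_cnt,
                        const char *const *include,
                        const char *const *exclude, int *allowed)
{
    int port, n = 0;

    for (port = 1; port <= phys_port_cnt; ++port) {
        int ok;
        if (NULL != include) {
            ok = list_matches(include, dev, port);
        } else if (NULL != exclude) {
            ok = !list_matches(exclude, dev, port);
        } else {
            ok = 1;
        }
        if (ok) {
            allowed[n++] = port;
        }
    }
    return n;
}
```

/*
 * With an include list of { "mthca0:2", NULL } and a two-port device named
 * "mthca0" (a made-up name), filter_ports() stores only port 2; an exclude
 * list holding the bare device name rejects every port.
 */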

/*
 * Prefer values that are already in the target
 */
static void merge_values(ompi_btl_openib_ini_values_t *target,
                         ompi_btl_openib_ini_values_t *src)
{
    if (!target->mtu_set && src->mtu_set) {
        target->mtu = src->mtu;
        target->mtu_set = true;
    }

    if (!target->use_eager_rdma_set && src->use_eager_rdma_set) {
        target->use_eager_rdma = src->use_eager_rdma;
        target->use_eager_rdma_set = true;
    }

    if (NULL == target->receive_queues && NULL != src->receive_queues) {
        target->receive_queues = strdup(src->receive_queues);
    }

    if (!target->max_inline_data_set && src->max_inline_data_set) {
        target->max_inline_data = src->max_inline_data;
        target->max_inline_data_set = true;
    }
}
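
/*
 * Illustrative sketch only: the "prefer the target" merge used by
 * merge_values() above, on a reduced, hypothetical structure (ini_values_t
 * here is a stand-in with two fields, not the real
 * ompi_btl_openib_ini_values_t).  A field is copied from src only when the
 * target has not set it; strings are deep-copied so the two structures can
 * be released independently.  copy_string() is a hypothetical replacement
 * for strdup() that keeps the sketch within standard C.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int mtu;
    bool mtu_set;
    char *receive_queues;
} ini_values_t;

/* Heap copy of a NUL-terminated string; returns NULL if malloc fails */
static char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (NULL != copy) {
        memcpy(copy, s, len);
    }
    return copy;
}

/* Values already present in target win; src only fills the gaps */
static void merge(ini_values_t *target, const ini_values_t *src)
{
    assert(NULL != target && NULL != src);
    if (!target->mtu_set && src->mtu_set) {
        target->mtu = src->mtu;
        target->mtu_set = true;
    }
    if (NULL == target->receive_queues && NULL != src->receive_queues) {
        target->receive_queues = copy_string(src->receive_queues);
    }
}
```

/*
 * Merging device-specific values with defaults in this order means a
 * device section that sets an MTU keeps it, while a device section that is
 * silent inherits the default.
 */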

static inline bool is_credit_message(const mca_btl_openib_recv_frag_t *frag)
{
    mca_btl_openib_control_header_t *chdr =
        to_base_frag(frag)->segment.seg_addr.pval;
    return (MCA_BTL_TAG_BTL == frag->hdr->tag) &&
        (MCA_BTL_OPENIB_CONTROL_CREDITS == chdr->type);
}

/* Convert a receive_queues field to an integer.  A missing or empty field
   yields the default; a default of 0 is bumped to 1. */
static int32_t atoi_param(char *param, int32_t dflt)
{
    if (NULL == param || '\0' == param[0]) {
        return dflt ? dflt : 1;
    }

    return atoi(param);
}

static void init_apm_port(mca_btl_openib_device_t *device, int port, uint16_t lid)
{
    int index;
    struct mca_btl_openib_module_t *btl;

    for (index = 0; index < device->btls; index++) {
        btl = opal_pointer_array_get_item(device->device_btls, index);
        /* Ok, we already have a btl for the first port;
         * the second one will be used for APM */
        btl->apm_port = port;
        btl->port_info.apm_lid = lid + btl->src_path_bits;
        mca_btl_openib_component.apm_ports++;
        BTL_VERBOSE(("APM-PORT: Setting alternative port - %d, lid - %d",
                     port, lid));
    }
}

static int setup_qps(void)
{
    char **queues, **params = NULL;
    int num_xrc_qps = 0, num_pp_qps = 0, num_srq_qps = 0, qp = 0;
    uint32_t max_qp_size, max_size_needed;
    int32_t min_freelist_size = 0;
    int smallest_pp_qp = 0, ret = OMPI_ERROR;

    queues = opal_argv_split(mca_btl_openib_component.receive_queues, ':');
    if (0 == opal_argv_count(queues)) {
        orte_show_help("help-mpi-btl-openib.txt",
                       "no qps in receive_queues", true,
                       orte_process_info.nodename,
                       mca_btl_openib_component.receive_queues);
        ret = OMPI_ERROR;
        goto error;
    }

    while (queues[qp] != NULL) {
        if (0 == strncmp("P,", queues[qp], 2)) {
            num_pp_qps++;
            /* NOTE: smallest_pp_qp starts at 0 and qp is never negative,
               so this test never succeeds and smallest_pp_qp stays 0 */
            if (smallest_pp_qp > qp) {
                smallest_pp_qp = qp;
            }
        } else if (0 == strncmp("S,", queues[qp], 2)) {
            num_srq_qps++;
        } else if (0 == strncmp("X,", queues[qp], 2)) {
#if HAVE_XRC
            num_xrc_qps++;
#else
            orte_show_help("help-mpi-btl-openib.txt", "No XRC support", true,
                           orte_process_info.nodename,
                           mca_btl_openib_component.receive_queues);
            ret = OMPI_ERR_NOT_AVAILABLE;
            goto error;
#endif
        } else {
            orte_show_help("help-mpi-btl-openib.txt",
                           "invalid qp type in receive_queues", true,
                           orte_process_info.nodename,
                           mca_btl_openib_component.receive_queues,
                           queues[qp]);
            ret = OMPI_ERR_BAD_PARAM;
            goto error;
        }
        qp++;
    }
    /* The current XRC implementation can't be used with other QP types
       (PP and SRQ) */
    if (num_xrc_qps > 0 && (num_pp_qps > 0 || num_srq_qps > 0)) {
        orte_show_help("help-mpi-btl-openib.txt", "XRC with PP or SRQ", true,
                       orte_process_info.nodename,
                       mca_btl_openib_component.receive_queues);
        ret = OMPI_ERR_BAD_PARAM;
        goto error;
    }

    /* The current XRC implementation can't be used with btls_per_lid > 1 */
    if (num_xrc_qps > 0 && mca_btl_openib_component.btls_per_lid > 1) {
        orte_show_help("help-mpi-btl-openib.txt", "XRC with BTLs per LID",
                       true, orte_process_info.nodename,
                       mca_btl_openib_component.receive_queues, num_xrc_qps);
        ret = OMPI_ERR_BAD_PARAM;
        goto error;
    }
    mca_btl_openib_component.num_pp_qps = num_pp_qps;
    mca_btl_openib_component.num_srq_qps = num_srq_qps;
    mca_btl_openib_component.num_xrc_qps = num_xrc_qps;
    mca_btl_openib_component.num_qps = num_pp_qps + num_srq_qps + num_xrc_qps;

    mca_btl_openib_component.qp_infos = (mca_btl_openib_qp_info_t*)
        malloc(sizeof(mca_btl_openib_qp_info_t) *
               mca_btl_openib_component.num_qps);
    if (NULL == mca_btl_openib_component.qp_infos) {
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto error;
    }

    qp = 0;
#define P(N) (((N) > count) ? NULL : params[(N)])
    while (queues[qp] != NULL) {
        int count;
        int32_t rd_low, rd_num;
        params = opal_argv_split_with_empty(queues[qp], ',');
        count = opal_argv_count(params);

        if ('P' == params[0][0]) {
            int32_t rd_win, rd_rsv;
            if (count < 3 || count > 6) {
                orte_show_help("help-mpi-btl-openib.txt",
                               "invalid pp qp specification", true,
                               orte_process_info.nodename, queues[qp]);
                ret = OMPI_ERR_BAD_PARAM;
                goto error;
            }
            mca_btl_openib_component.qp_infos[qp].type = MCA_BTL_OPENIB_PP_QP;
            mca_btl_openib_component.qp_infos[qp].size = atoi_param(P(1), 0);
            rd_num = atoi_param(P(2), 256);
            /* by default set rd_low to be 3/4 of rd_num */
            rd_low = atoi_param(P(3), rd_num - (rd_num / 4));
            rd_win = atoi_param(P(4), (rd_num - rd_low) * 2);
            rd_rsv = atoi_param(P(5), (rd_num * 2) / rd_win);

            BTL_VERBOSE(("pp: rd_num is %d rd_low is %d rd_win %d rd_rsv %d",
                         rd_num, rd_low, rd_win, rd_rsv));

            /* Calculate the smallest freelist size that can be allowed */
            if (rd_num + rd_rsv > min_freelist_size) {
                min_freelist_size = rd_num + rd_rsv;
            }

            mca_btl_openib_component.qp_infos[qp].u.pp_qp.rd_win = rd_win;
            mca_btl_openib_component.qp_infos[qp].u.pp_qp.rd_rsv = rd_rsv;
            if ((rd_num - rd_low) > rd_win) {
                orte_show_help("help-mpi-btl-openib.txt", "non optimal rd_win",
                               true, rd_win, rd_num - rd_low);
            }
        } else {
            int32_t sd_max;
            if (count < 3 || count > 5) {
                orte_show_help("help-mpi-btl-openib.txt",
                               "invalid srq specification", true,
                               orte_process_info.nodename, queues[qp]);
                ret = OMPI_ERR_BAD_PARAM;
                goto error;
            }
            mca_btl_openib_component.qp_infos[qp].type = (params[0][0] == 'X') ?
                MCA_BTL_OPENIB_XRC_QP : MCA_BTL_OPENIB_SRQ_QP;
            mca_btl_openib_component.qp_infos[qp].size = atoi_param(P(1), 0);
            rd_num = atoi_param(P(2), 256);
            /* by default set rd_low to be 3/4 of rd_num */
            rd_low = atoi_param(P(3), rd_num - (rd_num / 4));
            sd_max = atoi_param(P(4), rd_low / 4);
            BTL_VERBOSE(("srq: rd_num is %d rd_low is %d sd_max is %d",
                         rd_num, rd_low, sd_max));

            /* Calculate the smallest freelist size that can be allowed */
            if (rd_num > min_freelist_size) {
                min_freelist_size = rd_num;
            }

            mca_btl_openib_component.qp_infos[qp].u.srq_qp.sd_max = sd_max;
        }

        if (rd_num <= rd_low) {
            orte_show_help("help-mpi-btl-openib.txt", "rd_num must be > rd_low",
                           true, orte_process_info.nodename, queues[qp]);
            ret = OMPI_ERR_BAD_PARAM;
            goto error;
        }
        mca_btl_openib_component.qp_infos[qp].rd_num = rd_num;
        mca_btl_openib_component.qp_infos[qp].rd_low = rd_low;
        opal_argv_free(params);
        qp++;
    }
    params = NULL;

    /* Sanity check some sizes */
    max_qp_size = mca_btl_openib_component.qp_infos[mca_btl_openib_component.num_qps - 1].size;
    max_size_needed = (mca_btl_openib_module.super.btl_eager_limit >
                       mca_btl_openib_module.super.btl_max_send_size) ?
        mca_btl_openib_module.super.btl_eager_limit :
        mca_btl_openib_module.super.btl_max_send_size;
    if (max_qp_size < max_size_needed) {
        orte_show_help("help-mpi-btl-openib.txt",
                       "biggest qp size is too small", true,
                       orte_process_info.nodename, max_qp_size,
                       max_size_needed);
        ret = OMPI_ERR_BAD_PARAM;
        goto error;
    } else if (max_qp_size > max_size_needed) {
        orte_show_help("help-mpi-btl-openib.txt",
                       "biggest qp size is too big", true,
                       orte_process_info.nodename, max_qp_size,
                       max_size_needed);
    }

    if (mca_btl_openib_component.ib_free_list_max > 0 &&
        min_freelist_size > mca_btl_openib_component.ib_free_list_max) {
        orte_show_help("help-mpi-btl-openib.txt", "freelist too small", true,
                       orte_process_info.nodename,
                       mca_btl_openib_component.ib_free_list_max,
                       min_freelist_size);
        ret = OMPI_ERR_BAD_PARAM;
        goto error;
    }

    mca_btl_openib_component.rdma_qp = mca_btl_openib_component.num_qps - 1;
    mca_btl_openib_component.credits_qp = smallest_pp_qp;

    ret = OMPI_SUCCESS;
error:
    if (NULL != params) {
        opal_argv_free(params);
    }

    if (NULL != queues) {
        opal_argv_free(queues);
    }

    return ret;
}
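
/*
 * Illustrative sketch only: how the per-peer ("P") defaults in setup_qps()
 * above cascade from one field to the next.  pp_spec_t, field_or_default()
 * and pp_defaults() are hypothetical names, and the fields[] array is
 * assumed to hold the comma-separated pieces that follow the leading "P"
 * (size, rd_num, rd_low, rd_win, rd_rsv).  The guard against rd_win == 0 is
 * an addition of this sketch; it is not present in the code above.
 */

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int32_t size, rd_num, rd_low, rd_win, rd_rsv;
} pp_spec_t;

/* Same rule as atoi_param(): a missing or empty field yields the default,
 * and a default of 0 is bumped to 1. */
static int32_t field_or_default(const char *field, int32_t dflt)
{
    if (NULL == field || '\0' == field[0]) {
        return dflt ? dflt : 1;
    }
    return atoi(field);
}

/* Fill in a per-peer spec; each default depends on the values before it */
static pp_spec_t pp_defaults(const char *const *fields, int count)
{
    pp_spec_t s;
    int32_t rsv_default;

    assert(NULL != fields && count >= 0);
#define F(N) (((N) < count) ? fields[(N)] : NULL)
    s.size = field_or_default(F(0), 0);
    s.rd_num = field_or_default(F(1), 256);
    /* rd_low defaults to 3/4 of rd_num */
    s.rd_low = field_or_default(F(2), s.rd_num - (s.rd_num / 4));
    /* rd_win defaults to twice the gap between rd_num and rd_low */
    s.rd_win = field_or_default(F(3), (s.rd_num - s.rd_low) * 2);
    /* Sketch-only guard: avoid dividing by an explicit rd_win of 0 */
    rsv_default = (s.rd_win > 0) ? (s.rd_num * 2) / s.rd_win : 1;
    s.rd_rsv = field_or_default(F(4), rsv_default);
#undef F
    return s;
}
```

/*
 * For a specification that gives only a size of 128 and rd_num of 256, the
 * cascade yields rd_low = 192, rd_win = 128 and rd_rsv = 4.
 */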

static int init_one_device(opal_list_t *btl_list, struct ibv_device* ib_dev)
{
    struct mca_mpool_base_resources_t mpool_resources;
    mca_btl_openib_device_t *device;
    uint8_t i, k = 0;
    int ret = -1, port_cnt;
    ompi_btl_openib_ini_values_t values, default_values;
    int *allowed_ports = NULL;
    bool need_search;

    device = OBJ_NEW(mca_btl_openib_device_t);
    if (NULL == device) {
        BTL_ERROR(("Failed malloc: %s:%d", __FILE__, __LINE__));
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
        return OMPI_ERR_OUT_OF_RESOURCE;
    }

    device->ib_dev = ib_dev;
    device->ib_dev_context = ibv_open_device(ib_dev);
    device->ib_pd = NULL;
    device->device_btls = OBJ_NEW(opal_pointer_array_t);
    if (OPAL_SUCCESS != opal_pointer_array_init(device->device_btls, 2, INT_MAX, 2)) {
        BTL_ERROR(("Failed to initialize device_btls array: %s:%d", __FILE__, __LINE__));
        return OMPI_ERR_OUT_OF_RESOURCE;
    }

    if (NULL == device->ib_dev_context) {
        BTL_ERROR(("error obtaining device context for %s errno says %s",
                   ibv_get_device_name(device->ib_dev), strerror(errno)));
        goto error;
    }

    if (ibv_query_device(device->ib_dev_context, &device->ib_dev_attr)) {
        BTL_ERROR(("error obtaining device attributes for %s errno says %s",
                   ibv_get_device_name(device->ib_dev), strerror(errno)));
        goto error;
    }
    /* If mca_btl_if_include/exclude were specified, get usable ports */
    allowed_ports = (int*) malloc(device->ib_dev_attr.phys_port_cnt * sizeof(int));
    if (NULL == allowed_ports) {
        /* get_port_list() writes to this array unconditionally */
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto error;
    }
    port_cnt = get_port_list(device, allowed_ports);
    if (0 == port_cnt) {
        free(allowed_ports);
        ret = OMPI_SUCCESS;
        goto error;
    }

    /* Load in vendor/part-specific device parameters.  Note that even if
       we don't find values for this vendor/part, "values" will be set
       indicating that it does not have good values */
    ret = ompi_btl_openib_ini_query(device->ib_dev_attr.vendor_id,
                                    device->ib_dev_attr.vendor_part_id,
                                    &values);
    if (OMPI_SUCCESS != ret && OMPI_ERR_NOT_FOUND != ret) {
        /* If we get a serious error, propagate it upwards */
        goto error;
    }
    if (OMPI_ERR_NOT_FOUND == ret) {
        /* If we didn't find a matching device in the INI files, output a
           warning that we're using default values (unless the user has
           asked not to see these warnings) */
        if (mca_btl_openib_component.warn_no_device_params_found) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not their opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
            orte_show_help("help-mpi-btl-openib.txt",
                           "no device params found", true,
                           orte_process_info.nodename,
                           ibv_get_device_name(device->ib_dev),
                           device->ib_dev_attr.vendor_id,
                           device->ib_dev_attr.vendor_part_id);
}
2006-08-31 00:21:47 +04:00
}
/* Note that even if we don't find default values, "values" will
be set indicating that it does not have good values */
ret = ompi_btl_openib_ini_query ( 0 , 0 , & default_values ) ;
if ( OMPI_SUCCESS ! = ret & & OMPI_ERR_NOT_FOUND ! = ret ) {
/* If we get a serious error, propagate it upwards */
2008-01-09 13:26:21 +03:00
goto error ;
2006-08-31 00:21:47 +04:00
}
2008-07-23 04:28:59 +04:00
/* If we did find values for this device (or in the defaults
2006-08-31 00:21:47 +04:00
section), handle them */
merge_values(&values, &default_values);
if (values.mtu_set) {
switch (values.mtu) {
case 256:
2008-07-23 04:28:59 +04:00
device - > mtu = IBV_MTU_256 ;
2006-08-31 00:21:47 +04:00
break ;
case 512 :
2008-07-23 04:28:59 +04:00
device - > mtu = IBV_MTU_512 ;
2006-08-31 00:21:47 +04:00
break ;
case 1024 :
2008-07-23 04:28:59 +04:00
device - > mtu = IBV_MTU_1024 ;
2006-08-31 00:21:47 +04:00
break ;
case 2048 :
2008-07-23 04:28:59 +04:00
device - > mtu = IBV_MTU_2048 ;
2006-08-31 00:21:47 +04:00
break ;
case 4096 :
2008-07-23 04:28:59 +04:00
device - > mtu = IBV_MTU_4096 ;
2006-08-31 00:21:47 +04:00
break ;
default :
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " invalid MTU value specified in INI file (%d); ignored " , values . mtu ) ) ;
2008-07-23 04:28:59 +04:00
device - > mtu = mca_btl_openib_component . ib_mtu ;
2006-08-31 00:21:47 +04:00
break ;
2006-08-14 23:30:37 +04:00
}
2006-08-31 00:21:47 +04:00
} else {
2008-07-23 04:28:59 +04:00
device - > mtu = mca_btl_openib_component . ib_mtu ;
2006-08-14 23:30:37 +04:00
}
2008-07-23 04:28:59 +04:00
/* Allocate the protection domain for the device */
device - > ib_pd = ibv_alloc_pd ( device - > ib_dev_context ) ;
if ( NULL = = device - > ib_pd ) {
2008-06-24 21:18:07 +04:00
BTL_ERROR ( ( " error allocating protection domain for %s errno says %s " ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) , strerror ( errno ) ) ) ;
2008-06-24 21:18:07 +04:00
goto error ;
}
/* Figure out what the max_inline_data value should be for all
2008-07-23 04:28:59 +04:00
ports and QPs on this device */
2008-06-24 21:18:07 +04:00
need_search = false;
if (0 == mca_btl_openib_component.ib_max_inline_data) {
need_search = true;
} else if (mca_btl_openib_component.ib_max_inline_data > 0) {
2008-07-23 04:28:59 +04:00
device - > max_inline_data = mca_btl_openib_component . ib_max_inline_data ;
2008-06-24 21:18:07 +04:00
} else if (values.max_inline_data_set) {
if (0 == values.max_inline_data) {
need_search = true;
} else if (values.max_inline_data > 0) {
2008-07-23 04:28:59 +04:00
device - > max_inline_data = values . max_inline_data ;
2008-06-24 21:18:07 +04:00
}
}
/* Horrible. :-( Per the thread starting here:
http://lists.openfabrics.org/pipermail/general/2008-June/051822.html,
we can't rely on the value reported by the device to determine
the maximum max_inline_data value. So we have to search by
looping over max_inline_data values and trying to make dummy
QPs. Yuck! */
if (need_search) {
struct ibv_qp *qp;
struct ibv_cq *cq;
struct ibv_qp_init_attr init_attr;
uint32_t max_inline_data;
/* Make a dummy CQ */
#if OMPI_IBV_CREATE_CQ_ARGS == 3
2008-07-23 04:28:59 +04:00
cq = ibv_create_cq ( device - > ib_dev_context , 1 , NULL ) ;
2008-06-24 21:18:07 +04:00
# else
2008-07-23 04:28:59 +04:00
cq = ibv_create_cq ( device - > ib_dev_context , 1 , NULL , NULL , 0 ) ;
2008-06-24 21:18:07 +04:00
# endif
if (NULL == cq) {
orte_show_help("help-mpi-btl-openib.txt", "init-fail-create-q",
true, orte_process_info.nodename,
__FILE__, __LINE__, "ibv_create_cq",
strerror(errno), errno,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) ) ;
2008-06-24 21:18:07 +04:00
ret = OMPI_ERR_NOT_AVAILABLE;
goto error;
}
/* Setup the QP attributes */
memset(&init_attr, 0, sizeof(init_attr));
init_attr.qp_type = IBV_QPT_RC;
init_attr.send_cq = cq;
init_attr.recv_cq = cq;
init_attr.srq = 0;
init_attr.cap.max_send_sge = 1;
init_attr.cap.max_recv_sge = 1;
init_attr.cap.max_recv_wr = 1;
/* Loop over max_inline_data values; just check powers of 2 --
that's good enough */
init_attr.cap.max_inline_data = max_inline_data = 1 << 20;
while (max_inline_data > 0) {
2008-07-23 04:28:59 +04:00
qp = ibv_create_qp ( device - > ib_pd , & init_attr ) ;
2008-06-24 21:18:07 +04:00
if (NULL != qp) {
break;
}
max_inline_data >>= 1;
init_attr.cap.max_inline_data = max_inline_data;
}
/* Did we find it? */
if (NULL != qp) {
2008-07-23 04:28:59 +04:00
device - > max_inline_data = max_inline_data ;
2008-06-24 21:18:07 +04:00
ibv_destroy_qp ( qp ) ;
} else {
2008-07-23 04:28:59 +04:00
device - > max_inline_data = 0 ;
2008-06-24 21:18:07 +04:00
}
/* Destroy the temp CQ */
ibv_destroy_cq ( cq ) ;
}
2008-05-21 01:53:42 +04:00
/* If the user specified btl_openib_receive_queues MCA param, it
2008-07-23 04:28:59 +04:00
overrides all device INI params */
2008-05-21 01:53:42 +04:00
if (BTL_OPENIB_RQ_SOURCE_MCA !=
mca_btl_openib_component.receive_queues_source &&
NULL != values.receive_queues) {
2008-07-23 04:28:59 +04:00
/* If a prior device's INI values set a different value for
2008-05-21 01:53:42 +04:00
receive_queues, this is unsupported (see
https://svn.open-mpi.org/trac/ompi/ticket/1285) */
2008-07-23 04:28:59 +04:00
if ( BTL_OPENIB_RQ_SOURCE_DEVICE_INI = =
2008-05-21 01:53:42 +04:00
mca_btl_openib_component.receive_queues_source) {
if (0 != strcmp(values.receive_queues,
mca_btl_openib_component.receive_queues)) {
orte_show_help("help-mpi-btl-openib.txt",
"conflicting receive_queues", true,
orte_process_info.nodename,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) ,
device - > ib_dev_attr . vendor_id ,
device - > ib_dev_attr . vendor_part_id ,
2008-05-21 01:53:42 +04:00
values . receive_queues ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( receive_queues_device - > ib_dev ) ,
receive_queues_device - > ib_dev_attr . vendor_id ,
receive_queues_device - > ib_dev_attr . vendor_part_id ,
2008-05-21 01:53:42 +04:00
mca_btl_openib_component.receive_queues,
opal_install_dirs.pkgdatadir);
ret = OMPI_ERR_RESOURCE_BUSY;
goto error;
}
} else {
if (NULL != mca_btl_openib_component.receive_queues) {
free(mca_btl_openib_component.receive_queues);
}
2008-07-23 04:28:59 +04:00
receive_queues_device = device ;
2008-05-21 01:53:42 +04:00
mca_btl_openib_component . receive_queues =
strdup ( values . receive_queues ) ;
mca_btl_openib_component . receive_queues_source =
2008-07-23 04:28:59 +04:00
BTL_OPENIB_RQ_SOURCE_DEVICE_INI ;
2008-05-21 01:53:42 +04:00
}
}
2008-06-24 22:31:46 +04:00
/* Should we use RDMA for short / eager messages? First check MCA
param, then check INI file values. */
if (mca_btl_openib_component.use_eager_rdma >= 0) {
2008-07-23 04:28:59 +04:00
device - > use_eager_rdma = mca_btl_openib_component . use_eager_rdma ;
2008-06-24 22:31:46 +04:00
} else if ( values . use_eager_rdma_set ) {
2008-07-23 04:28:59 +04:00
device - > use_eager_rdma = values . use_eager_rdma ;
2006-12-14 18:52:13 +03:00
}
2008-06-24 22:31:46 +04:00
/* Eager RDMA is not currently supported with progress threads */
2008-07-23 04:28:59 +04:00
if ( device - > use_eager_rdma & & OMPI_ENABLE_PROGRESS_THREADS ) {
device - > use_eager_rdma = 0 ;
2008-06-24 22:31:46 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " ,
" eager RDMA and progress threads " , true ) ;
}
2006-12-14 18:52:13 +03:00
2008-05-21 01:53:42 +04:00
# if HAVE_XRC
/* if user configured to run with XRC qp and the device doesn't
2008-07-23 04:28:59 +04:00
* support it - we should ignore this device . Maybe we have another
2008-05-21 01:53:42 +04:00
* one that has XRC support
*/
2008-07-23 04:28:59 +04:00
if ( ! ( device - > ib_dev_attr . device_cap_flags & IBV_DEVICE_XRC ) & &
2008-05-21 01:53:42 +04:00
mca_btl_openib_component . num_xrc_qps > 0 ) {
orte_show_help ( " help-mpi-btl-openib.txt " ,
" XRC on device without XRC support " , true ,
mca_btl_openib_component . num_xrc_qps ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) ,
2008-05-21 01:53:42 +04:00
orte_process_info . nodename ) ;
ret = OMPI_SUCCESS ;
goto error ;
}
2005-11-10 23:15:02 +03:00
2007-11-28 10:18:59 +03:00
if ( MCA_BTL_XRC_ENABLED ) {
2008-07-23 04:28:59 +04:00
if ( OMPI_SUCCESS ! = mca_btl_openib_open_xrc_domain ( device ) ) {
2007-11-28 10:18:59 +03:00
BTL_ERROR ( ( " XRC Internal error. Failed to open xrc domain " ) ) ;
2008-01-09 13:26:21 +03:00
goto error ;
2007-11-28 10:18:59 +03:00
}
}
2008-04-02 23:52:03 +04:00
# endif
2007-11-28 10:18:59 +03:00
2008-07-23 04:28:59 +04:00
mpool_resources . reg_data = ( void * ) device ;
2006-12-17 15:26:41 +03:00
mpool_resources . sizeof_reg = sizeof ( mca_btl_openib_reg_t ) ;
mpool_resources . register_mem = openib_reg_mr ;
mpool_resources . deregister_mem = openib_dereg_mr ;
2008-07-23 04:28:59 +04:00
device - > mpool =
2006-06-28 11:23:08 +04:00
mca_mpool_base_module_create ( mca_btl_openib_component . ib_mpool_name ,
2008-07-23 04:28:59 +04:00
device , & mpool_resources ) ;
if ( NULL = = device - > mpool ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " error creating IB memory pool for %s errno says %s " ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) , strerror ( errno ) ) ) ;
2008-01-09 13:26:21 +03:00
goto error ;
2006-06-28 11:23:08 +04:00
}
2008-01-21 15:11:18 +03:00
2008-05-02 15:52:33 +04:00
# if OMPI_ENABLE_PROGRESS_THREADS
2008-07-23 04:28:59 +04:00
device - > ib_channel = ibv_create_comp_channel ( device - > ib_dev_context ) ;
if ( NULL = = device - > ib_channel ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " error creating channel for %s errno says %s " ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) ,
2006-11-02 19:15:21 +03:00
strerror ( errno ) ) ) ;
2008-01-09 13:26:21 +03:00
goto error ;
2006-11-02 19:15:21 +03:00
}
# endif
2008-01-21 15:11:18 +03:00
ret = OMPI_SUCCESS ;
2006-11-02 19:15:21 +03:00
2007-06-14 05:59:25 +04:00
/* Note ports are 1 based (i >= 1) */
for ( k = 0 ; k < port_cnt ; k + + ) {
2006-06-28 11:23:08 +04:00
struct ibv_port_attr ib_port_attr ;
2007-06-14 05:59:25 +04:00
i = allowed_ports [ k ] ;
2008-07-23 04:28:59 +04:00
if ( ibv_query_port ( device - > ib_dev_context , i , & ib_port_attr ) ) {
2006-06-28 11:23:08 +04:00
BTL_ERROR ( ( " error getting port attributes for device %s "
" port number %d errno says %s " ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) , i , strerror ( errno ) ) ) ;
2008-01-21 15:11:18 +03:00
break ;
2006-06-28 11:23:08 +04:00
}
2008-02-20 16:44:05 +03:00
if ( IBV_PORT_ACTIVE = = ib_port_attr . state ) {
2008-07-23 04:28:59 +04:00
if ( mca_btl_openib_component . apm_ports & & device - > btls > 0 ) {
init_apm_port ( device , i , ib_port_attr . lid ) ;
2008-02-20 16:44:05 +03:00
break ;
}
2007-04-22 14:22:12 +04:00
if ( 0 = = mca_btl_openib_component . ib_pkey_val ) {
2008-07-23 04:28:59 +04:00
ret = init_one_port ( btl_list , device , i , mca_btl_openib_component . ib_pkey_ix ,
2007-04-22 14:22:12 +04:00
& ib_port_attr ) ;
2008-02-20 16:44:05 +03:00
} else {
2007-04-22 14:22:12 +04:00
uint16_t pkey , j ;
2008-07-23 04:28:59 +04:00
for ( j = 0 ; j < device - > ib_dev_attr . max_pkeys ; j + + ) {
ibv_query_pkey ( device - > ib_dev_context , i , j , & pkey ) ;
2007-04-22 14:22:12 +04:00
pkey = ntohs ( pkey ) ;
if ( pkey = = mca_btl_openib_component . ib_pkey_val ) {
2008-07-23 04:28:59 +04:00
ret = init_one_port ( btl_list , device , i , j , & ib_port_attr ) ;
2007-04-22 14:22:12 +04:00
break ;
}
}
}
2006-08-14 23:30:37 +04:00
if ( OMPI_SUCCESS ! = ret ) {
2008-01-21 15:11:18 +03:00
/* Out of bounds error indicates that we hit max btl number
2006-09-25 15:18:20 +04:00
* don't propagate the error to the caller */
2008-05-16 07:28:20 +04:00
if ( OMPI_ERR_VALUE_OUT_OF_BOUNDS = = ret ) {
2006-09-25 15:18:20 +04:00
ret = OMPI_SUCCESS ;
2008-05-16 07:28:20 +04:00
}
2006-06-28 11:23:08 +04:00
break ;
2006-08-14 23:30:37 +04:00
}
2006-06-28 11:23:08 +04:00
}
}
2008-05-16 07:27:42 +04:00
free ( allowed_ports ) ;
2006-06-28 11:23:08 +04:00
2008-05-16 07:28:20 +04:00
/* If we made a BTL, check APM status and return. Otherwise, fall
through and destroy everything */
2008-07-23 04:28:59 +04:00
if ( device - > btls > 0 ) {
2008-02-20 16:44:05 +03:00
/* if apm was enabled it should be > 1 */
if ( 1 = = mca_btl_openib_component . apm_ports ) {
2008-05-16 07:28:20 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " ,
" apm not enough ports " , true ) ;
2008-02-20 16:44:05 +03:00
mca_btl_openib_component . apm_ports = 0 ;
}
2008-05-21 01:53:42 +04:00
return OMPI_SUCCESS ;
If you have an HCA with no active ports, we still create an mpool.
This mpool will have no btl module owner, because no btl was created for
the HCA with no ports, but it will still be tracked in the mpool
framework (i.e., it's available).
If MPI_ALLOC_MEM is called by the app, one of two things will happen:
1. if there's an HCA on the host with some active ports, the openib
btl component will still be in the process space, and therefore
the "mpool with no btl" (MWNB) module will still be able to call
the reg/dereg functions, and all will be fine. However, if
MPI_FREE_MEM is never invoked to free the memory, bad things will
happen during MPI_FINALIZE. The pml is finalized, which finalizes
all the btls. The btls finalize all their mpools and all is fine.
But later we close down the mpool framework which then finalizes
any left over mpool modules, such as MWNB. However, the openib
BTL module functions that the MWNB was registered with are no
longer in the process space, and it segv's while trying to deregister
the memory.
2. if there are *no* HCAs on the host with active ports, then the
openib btl will have been unloaded, and when the MWNB tries to
register the memory, the functions it tries to call (in the openib
btl) are no longer there, and we segv.
This commit was SVN r15735.
2007-08-02 00:53:34 +04:00
}
2006-06-28 11:23:08 +04:00
2008-01-09 13:26:21 +03:00
error :
2008-05-02 15:52:33 +04:00
# if OMPI_ENABLE_PROGRESS_THREADS
2008-07-23 04:28:59 +04:00
if ( device - > ib_channel ) {
ibv_destroy_comp_channel ( device - > ib_channel ) ;
2008-04-02 23:52:03 +04:00
}
2006-11-06 15:34:56 +03:00
# endif
2008-07-23 04:28:59 +04:00
if ( device - > mpool ) {
mca_mpool_base_module_destroy ( device - > mpool ) ;
2008-04-02 23:52:03 +04:00
}
# if HAVE_XRC
2007-11-28 10:18:59 +03:00
if ( MCA_BTL_XRC_ENABLED ) {
2008-07-23 04:28:59 +04:00
if ( OMPI_SUCCESS ! = mca_btl_openib_close_xrc_domain ( device ) ) {
2007-11-28 10:18:59 +03:00
BTL_ERROR ( ( " XRC Internal error. Failed to close xrc domain " ) ) ;
}
}
2008-04-02 23:52:03 +04:00
# endif
2008-07-23 04:28:59 +04:00
if ( device - > ib_pd ) {
ibv_dealloc_pd ( device - > ib_pd ) ;
2008-04-02 23:52:03 +04:00
}
2008-07-23 04:28:59 +04:00
if ( device - > ib_dev_context ) {
ibv_close_device ( device - > ib_dev_context ) ;
2008-04-02 23:52:03 +04:00
}
2008-07-23 04:28:59 +04:00
OBJ_RELEASE ( device ) ;
2006-06-28 11:23:08 +04:00
return ret ;
}
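The max_inline_data probing done in the function above can be sketched in isolation. This is only a minimal sketch under stated assumptions, not the component's code: probe_fn and probe_with_limit are hypothetical stand-ins for the ibv_create_qp() attempt, and search_max_inline is an illustrative name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical probe: returns nonzero if a QP could be created with the
   requested inline size.  Stands in for ibv_create_qp() in this sketch. */
typedef int (*probe_fn)(uint32_t max_inline_data, void *ctx);

/* Halve the candidate size, starting at `start`, until the probe accepts
   one.  Returns the accepted size, or 0 if every candidate is rejected. */
static uint32_t search_max_inline(uint32_t start, probe_fn probe, void *ctx)
{
    uint32_t v;

    for (v = start; v > 0; v >>= 1) {
        if (probe(v, ctx)) {
            return v;
        }
    }
    return 0;
}

/* Example probe (an assumption for illustration): accepts any size that
   does not exceed the limit stored in *ctx. */
static int probe_with_limit(uint32_t max_inline_data, void *ctx)
{
    uint32_t limit = *(const uint32_t *) ctx;

    return max_inline_data <= limit;
}
```

Starting from a power of two such as 1 << 20 means only powers of two are tried, so the result is the largest accepted power of two rather than the exact device limit.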
2006-12-17 15:26:41 +03:00
2007-10-23 16:57:45 +04:00
static int finish_btl_init ( mca_btl_openib_module_t * openib_btl )
{
2008-01-09 13:26:21 +03:00
int qp ;
2007-10-23 16:57:45 +04:00
openib_btl - > num_peers = 0 ;
/* Initialize module state */
OBJ_CONSTRUCT ( & openib_btl - > ib_lock , opal_mutex_t ) ;
2008-01-21 15:11:18 +03:00
2007-10-23 16:57:45 +04:00
/* setup the qp structure */
openib_btl - > qps = ( mca_btl_openib_module_qp_t * )
2008-01-09 13:26:21 +03:00
calloc ( mca_btl_openib_component . num_qps ,
sizeof ( mca_btl_openib_module_qp_t ) ) ;
2008-01-21 15:11:18 +03:00
/* setup all the qps */
2008-05-16 07:28:20 +04:00
for ( qp = 0 ; qp < mca_btl_openib_component . num_qps ; qp + + ) {
if ( ! BTL_OPENIB_QP_TYPE_PP ( qp ) ) {
2007-11-28 10:12:44 +03:00
OBJ_CONSTRUCT ( & openib_btl - > qps [ qp ] . u . srq_qp . pending_frags [ 0 ] ,
opal_list_t ) ;
OBJ_CONSTRUCT ( & openib_btl - > qps [ qp ] . u . srq_qp . pending_frags [ 1 ] ,
opal_list_t ) ;
2008-01-21 15:11:18 +03:00
openib_btl - > qps [ qp ] . u . srq_qp . sd_credits =
2007-10-23 16:57:45 +04:00
mca_btl_openib_component . qp_infos [ qp ] . u . srq_qp . sd_max ;
}
}
2008-07-23 04:28:59 +04:00
/* initialize the memory pool using the device */
openib_btl - > super . btl_mpool = openib_btl - > device - > mpool ;
2008-01-21 15:11:18 +03:00
2007-12-23 15:29:34 +03:00
openib_btl - > eager_rdma_channels = 0 ;
2007-10-23 16:57:45 +04:00
openib_btl - > eager_rdma_frag_size = OPAL_ALIGN (
sizeof ( mca_btl_openib_header_t ) +
2007-12-09 17:05:13 +03:00
sizeof ( mca_btl_openib_header_coalesced_t ) +
sizeof ( mca_btl_openib_control_header_t ) +
2007-10-23 16:57:45 +04:00
sizeof ( mca_btl_openib_footer_t ) +
openib_btl - > super . btl_eager_limit ,
mca_btl_openib_component . buffer_alignment , size_t ) ;
return OMPI_SUCCESS ;
}
2008-02-03 18:16:24 +03:00
static struct ibv_device **ibv_get_device_list_compat(int *num_devs)
{
struct ibv_device **ib_devs;
#ifdef HAVE_IBV_GET_DEVICE_LIST
ib_devs = ibv_get_device_list(num_devs);
#else
struct dlist *dev_list;
struct ibv_device *ib_dev;
*num_devs = 0;
2008-07-23 04:28:59 +04:00
/* Determine the number of devices available on the host */
2008-02-03 18:16:24 +03:00
dev_list = ibv_get_devices();
if (NULL == dev_list)
return NULL;
dlist_start(dev_list);
dlist_for_each_data(dev_list, ib_dev, struct ibv_device)
(*num_devs)++;
/* Allocate space for the ib devices */
ib_devs = (struct ibv_device **) malloc(*num_devs * sizeof(struct ibv_device *));
if (NULL == ib_devs) {
*num_devs = 0;
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2008-02-03 18:16:24 +03:00
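return NULL;
}
dlist_start(dev_list);
{
/* Fill the array by index so that ib_devs keeps pointing at its
first element; pre-incrementing ib_devs would skip slot 0 and
return a pointer that cannot be passed to free() */
int n = 0;
dlist_for_each_data(dev_list, ib_dev, struct ibv_device)
ib_devs[n++] = ib_dev;
}
#endif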
2008-02-04 17:05:01 +03:00
return ib_devs ;
2008-02-03 18:16:24 +03:00
}
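The count-then-copy pattern used by the fallback branch above can be sketched on its own. This is a minimal sketch under stated assumptions: struct node and list_to_array are hypothetical names for illustration and are not part of libibverbs or libsysfs.

```c
#include <stdlib.h>

/* Hypothetical singly linked list node, standing in for the dlist
   entries walked by the compat function. */
struct node {
    void *data;
    struct node *next;
};

/* Copy the list payloads into a newly allocated array.  On return,
   *count holds the number of elements.  Returns NULL (with *count == 0)
   for an empty list or on allocation failure.  The caller frees the
   array with free(). */
static void **list_to_array(const struct node *head, int *count)
{
    const struct node *n;
    void **arr;
    int i = 0;

    /* First pass: count the elements */
    *count = 0;
    for (n = head; NULL != n; n = n->next) {
        (*count)++;
    }
    if (0 == *count) {
        return NULL;
    }

    arr = malloc((size_t) *count * sizeof(*arr));
    if (NULL == arr) {
        *count = 0;
        return NULL;
    }

    /* Second pass: fill by index so that arr still points at slot 0 */
    for (n = head; NULL != n; n = n->next) {
        arr[i++] = n->data;
    }
    return arr;
}
```

Indexing with a separate counter keeps the base pointer intact, so the returned array can be handed straight to free().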
static void ibv_free_device_list_compat(struct ibv_device **ib_devs)
{
#ifdef HAVE_IBV_GET_DEVICE_LIST
ibv_free_device_list(ib_devs);
#else
free(ib_devs);
#endif
}
2008-02-11 13:34:11 +03:00
static opal_carto_graph_t * host_topo ;
static int get_ib_dev_distance ( struct ibv_device * dev )
{
opal_paffinity_base_cpu_set_t cpus ;
2008-07-23 04:28:59 +04:00
opal_carto_base_node_t * device_node ;
2008-08-21 23:21:28 +04:00
int min_distance = - 1 , i , num_processors ;
2008-07-23 04:28:59 +04:00
const char * device = ibv_get_device_name ( dev ) ;
2008-02-11 13:34:11 +03:00
2008-08-21 23:21:28 +04:00
if ( opal_paffinity_base_get_processor_info ( & num_processors ) ! = OMPI_SUCCESS ) {
num_processors = 100 ; /* Choose something big enough */
}
2008-02-11 13:34:11 +03:00
2008-07-23 04:28:59 +04:00
device_node = opal_carto_base_find_node ( host_topo , device ) ;
2008-02-11 13:34:11 +03:00
2008-07-23 04:28:59 +04:00
/* no topology info for device found. Assume that it is close */
if ( NULL = = device_node )
2008-02-11 13:34:11 +03:00
return 0 ;
OPAL_PAFFINITY_CPU_ZERO ( cpus ) ;
opal_paffinity_base_get ( & cpus ) ;
2008-08-21 23:21:28 +04:00
for ( i = 0 ; i < num_processors ; i + + ) {
2008-02-11 13:34:11 +03:00
opal_carto_base_node_t * slot_node ;
int distance , socket , core ;
char * slot ;
if ( ! OPAL_PAFFINITY_CPU_ISSET ( i , cpus ) )
continue ;
2008-08-21 23:21:28 +04:00
opal_paffinity_base_get_map_to_socket_core ( i , & socket , & core ) ;
2008-02-11 13:34:11 +03:00
asprintf ( & slot , " slot%d " , socket ) ;
2008-05-07 23:33:49 +04:00
slot_node = opal_carto_base_find_node ( host_topo , slot ) ;
2008-02-11 13:34:11 +03:00
free ( slot ) ;
if ( NULL = = slot_node )
return 0 ;
2008-07-23 04:28:59 +04:00
distance = opal_carto_base_spf ( host_topo , slot_node , device_node ) ;
2008-02-11 13:34:11 +03:00
if (distance < 0)
return 0;
/* Keep the smallest distance seen over all CPUs we are bound to */
if (min_distance < 0 || min_distance > distance)
min_distance = distance;
}
return min_distance ;
}
struct dev_distance {
struct ibv_device *ib_dev;
int distance;
};
static int compare_distance(const void *p1, const void *p2)
{
const struct dev_distance *d1 = p1;
const struct dev_distance *d2 = p2;
return d1->distance - d2->distance;
}
static struct dev_distance *
sort_devs_by_distance(struct ibv_device **ib_devs, int count)
{
int i;
struct dev_distance *devs = malloc(count * sizeof(struct dev_distance));
2008-05-07 23:33:49 +04:00
opal_carto_base_get_host_graph ( & host_topo , " Infiniband " ) ;
2008-02-11 13:34:11 +03:00
2008-05-16 07:28:20 +04:00
for ( i = 0 ; i < count ; i + + ) {
2008-02-11 13:34:11 +03:00
devs[i].ib_dev = ib_devs[i];
devs[i].distance = get_ib_dev_distance(ib_devs[i]);
}
qsort(devs, count, sizeof(struct dev_distance), compare_distance);
2008-05-07 23:33:49 +04:00
opal_carto_base_free_graph ( host_topo ) ;
2008-02-11 13:34:11 +03:00
return devs ;
}
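Sorting a small array of records by an integer key with qsort(), as the function above does, can be sketched as follows. This is a minimal sketch: struct dev_entry, by_distance, and sort_by_distance are hypothetical names for illustration. The comparator here uses two comparisons instead of a subtraction; that choice is an assumption made for the sketch, so that it stays well defined even for extreme key values.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct dev_distance */
struct dev_entry {
    const char *name;
    int distance;
};

/* qsort() comparator: orders entries by ascending distance.
   Yields -1, 0, or 1 without relying on integer subtraction. */
static int by_distance(const void *p1, const void *p2)
{
    const struct dev_entry *a = p1;
    const struct dev_entry *b = p2;

    return (a->distance > b->distance) - (a->distance < b->distance);
}

/* Sort the array in place, closest device first */
static void sort_by_distance(struct dev_entry *devs, size_t count)
{
    qsort(devs, count, sizeof(devs[0]), by_distance);
}
```

Note that qsort() is not guaranteed to be stable, so entries with equal distance may come out in any relative order.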
2008-08-06 18:22:03 +04:00
2005-07-01 01:28:35 +04:00
/*
* IB component initialization:
* (1) read interface list from kernel and compare against component parameters
* then create a BTL instance for selected interfaces
* (2) setup IB listen socket for incoming connection attempts
* (3) register BTL parameters with the MCA
*/
2008-01-21 15:11:18 +03:00
static mca_btl_base_module_t * *
btl_openib_component_init ( int * num_btl_modules ,
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the retun status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
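The MTU negotiation in the last bullet can be sketched as follows.
This is a minimal illustration, not the code from the branch: the
function name negotiate_mtu is hypothetical, and MTUs are shown as
plain byte counts for readability (the verbs API represents them with
enum ibv_mtu).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: each peer advertises the largest MTU (in bytes)
 * that it supports, and both sides then use the smaller of the two. */
uint32_t negotiate_mtu(uint32_t local_max_mtu, uint32_t remote_max_mtu)
{
    assert(local_max_mtu > 0 && remote_max_mtu > 0);
    return (local_max_mtu < remote_max_mtu) ? local_max_mtu
                                            : remote_max_mtu;
}
```

Because both peers apply the same rule to the same pair of values,
they arrive at the same MTU without a further round trip.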
This commit was SVN r11182.
2006-08-14 23:30:37 +04:00
bool enable_progress_threads ,
bool enable_mpi_threads )
2005-07-01 01:28:35 +04:00
{
2008-01-21 15:11:18 +03:00
struct ibv_device * * ib_devs ;
2005-07-01 01:28:35 +04:00
mca_btl_base_module_t * * btls ;
2008-01-09 13:26:21 +03:00
int i , ret , num_devs , length ;
2008-01-21 15:11:18 +03:00
opal_list_t btl_list ;
mca_btl_openib_module_t * openib_btl ;
mca_btl_base_selected_module_t * ib_selected ;
opal_list_item_t * item ;
2006-01-13 02:42:44 +03:00
unsigned short seedv [ 3 ] ;
2008-01-09 13:26:21 +03:00
mca_btl_openib_frag_init_data_t * init_data ;
2008-02-11 13:34:11 +03:00
struct dev_distance * dev_sorted ;
int distance ;
2008-07-25 02:51:26 +04:00
int index , value ;
mca_base_param_source_t source ;
2005-07-20 01:04:22 +04:00
2005-07-01 01:28:35 +04:00
/* initialization */
* num_btl_modules = 0 ;
2008-01-21 15:11:18 +03:00
num_devs = 0 ;
2005-07-01 01:28:35 +04:00
2008-05-29 02:05:47 +04:00
/* Per https://svn.open-mpi.org/trac/ompi/ticket/1305, check to
see if $ sysfsdir / class / infiniband exists . If it does not ,
assume that the RDMA hardware drivers are not loaded , and
therefore we don ' t want OpenFabrics verbs support in this OMPI
job . No need to print a warning . */
if ( ! check_basics ( ) ) {
return NULL ;
}
2008-02-28 04:57:57 +03:00
seedv [ 0 ] = ORTE_PROC_MY_NAME - > vpid ;
2006-01-13 02:42:44 +03:00
seedv [ 1 ] = opal_sys_timer_get_cycles ( ) ;
seedv [ 2 ] = opal_sys_timer_get_cycles ( ) ;
seed48 ( seedv ) ;
2006-01-17 19:23:35 +03:00
2008-07-23 04:28:59 +04:00
/* Read in INI files with device-specific parameters */
2006-08-14 23:30:37 +04:00
if ( OMPI_SUCCESS ! = ( ret = ompi_btl_openib_ini_init ( ) ) ) {
2007-06-14 05:59:25 +04:00
goto no_btls ;
2006-08-14 23:30:37 +04:00
}
2008-05-07 02:43:52 +04:00
/* Initialize FD listening */
if ( OMPI_SUCCESS ! = ompi_btl_openib_fd_init ( ) ) {
goto no_btls ;
}
2008-05-02 15:52:33 +04:00
/* Init CPC components */
if ( OMPI_SUCCESS ! = ( ret = ompi_btl_openib_connect_base_init ( ) ) ) {
goto no_btls ;
}
2006-08-14 23:30:37 +04:00
2008-09-17 02:06:14 +04:00
/* If we have a memory manager available, and
mpi_leave_pinned = = - 1 , then unless the user explicitly set
2008-07-25 02:51:26 +04:00
mpi_leave_pinned_pipeline = = 0 , then set mpi_leave_pinned to 1.
We have a memory manager if :
- we have both FREE and MUNMAP support
- we have MUNMAP support and the linux mallopt */
value = opal_mem_hooks_support_level ( ) ;
if ( ( ( OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT ) = =
( ( OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT ) & value ) ) | |
( 0 ! = ( OPAL_MEMORY_MUNMAP_SUPPORT & value ) & &
OMPI_MPOOL_BASE_HAVE_LINUX_MALLOPT ) ) {
ret = 0 ;
index = mca_base_param_find ( " mpi " , NULL , " leave_pinned " ) ;
if ( index > = 0 ) {
if ( OPAL_SUCCESS = = mca_base_param_lookup_int ( index , & value ) & &
2008-09-17 02:06:14 +04:00
- 1 = = value ) {
+ + ret ;
2008-07-25 02:51:26 +04:00
}
}
index = mca_base_param_find ( " mpi " , NULL , " leave_pinned_pipeline " ) ;
if ( index > = 0 ) {
if ( OPAL_SUCCESS = = mca_base_param_lookup_int ( index , & value ) & &
2008-08-01 01:56:20 +04:00
OPAL_SUCCESS = = mca_base_param_lookup_source ( index , & source ,
NULL ) ) {
2008-07-25 02:51:26 +04:00
if ( 0 = = value & & MCA_BASE_PARAM_SOURCE_DEFAULT = = source ) {
+ + ret ;
}
}
}
/* If we were good on both parameters, then set leave_pinned=1 */
if ( 2 = = ret ) {
ompi_mpi_leave_pinned = 1 ;
ompi_mpi_leave_pinned_pipeline = 0 ;
}
}
2008-01-09 13:26:21 +03:00
OBJ_CONSTRUCT ( & mca_btl_openib_component . send_free_coalesced , ompi_free_list_t ) ;
OBJ_CONSTRUCT ( & mca_btl_openib_component . send_user_free , ompi_free_list_t ) ;
OBJ_CONSTRUCT ( & mca_btl_openib_component . recv_user_free , ompi_free_list_t ) ;
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
if ( NULL = = init_data ) {
goto no_btls ;
}
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data - > order = mca_btl_openib_component . rdma_qp ;
init_data - > list = & mca_btl_openib_component . send_user_free ;
2008-01-21 15:11:18 +03:00
2008-05-16 07:28:20 +04:00
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex_new (
2008-01-09 13:26:21 +03:00
& mca_btl_openib_component . send_user_free ,
sizeof ( mca_btl_openib_put_frag_t ) , 2 ,
OBJ_CLASS ( mca_btl_openib_put_frag_t ) ,
0 , 0 ,
mca_btl_openib_component . ib_free_list_num ,
mca_btl_openib_component . ib_free_list_max ,
mca_btl_openib_component . ib_free_list_inc ,
2008-01-21 15:11:18 +03:00
NULL , mca_btl_openib_frag_init , init_data ) ) {
2008-01-09 13:26:21 +03:00
goto no_btls ;
}
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
if ( NULL = = init_data ) {
goto no_btls ;
}
init_data - > order = mca_btl_openib_component . rdma_qp ;
init_data - > list = & mca_btl_openib_component . recv_user_free ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex_new (
& mca_btl_openib_component . recv_user_free ,
sizeof ( mca_btl_openib_get_frag_t ) , 2 ,
OBJ_CLASS ( mca_btl_openib_get_frag_t ) ,
0 , 0 ,
mca_btl_openib_component . ib_free_list_num ,
mca_btl_openib_component . ib_free_list_max ,
mca_btl_openib_component . ib_free_list_inc ,
2008-01-21 15:11:18 +03:00
NULL , mca_btl_openib_frag_init , init_data ) ) {
2008-01-09 13:26:21 +03:00
goto no_btls ;
}
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
if ( NULL = = init_data ) {
goto no_btls ;
}
length = sizeof ( mca_btl_openib_coalesced_frag_t ) ;
init_data - > list = & mca_btl_openib_component . send_free_coalesced ;
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex (
& mca_btl_openib_component . send_free_coalesced ,
length , 2 , OBJ_CLASS ( mca_btl_openib_coalesced_frag_t ) ,
mca_btl_openib_component . ib_free_list_num ,
mca_btl_openib_component . ib_free_list_max ,
mca_btl_openib_component . ib_free_list_inc ,
NULL , mca_btl_openib_frag_init , init_data ) ) {
goto no_btls ;
}
2007-04-21 04:15:05 +04:00
/* If we want fork support, try to enable it */
# ifdef HAVE_IBV_FORK_INIT
if ( 0 ! = mca_btl_openib_component . want_fork_support ) {
if ( 0 ! = ibv_fork_init ( ) ) {
/* If the want_fork_support MCA parameter is >0, then the
user was specifically asking for fork support and we
couldn ' t provide it . So print an error and deactivate
this BTL . */
if ( mca_btl_openib_component . want_fork_support > 0 ) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not their opal_* functions.
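The duplicate detection described for orte_show_help() can be
sketched as a small lookup table keyed by help file and topic. This
is only an illustration of the idea, not the ORTE implementation: all
names below (help_table_t, help_should_display, help_take_pending)
are hypothetical, and the roughly 5 second timer that would trigger
the summary message is not modeled.

```c
#include <assert.h>
#include <string.h>

#define HELP_MAX_TOPICS 32

/* One entry per distinct (file, topic) pair seen so far.  The strings
 * are not copied, so the caller must keep them alive (string literals
 * are fine). */
typedef struct {
    const char *file;
    const char *topic;
    int pending;            /* duplicates seen since the last summary */
} help_entry_t;

typedef struct {
    help_entry_t entries[HELP_MAX_TOPICS];
    int num_entries;
} help_table_t;

/* Return 1 if the message should be displayed now (first time it is
 * seen), or 0 if it is a duplicate that was only counted. */
int help_should_display(help_table_t *table, const char *file,
                        const char *topic)
{
    int i;

    assert(table != NULL && file != NULL && topic != NULL);
    for (i = 0; i < table->num_entries; i++) {
        help_entry_t *e = &table->entries[i];
        if (0 == strcmp(e->file, file) && 0 == strcmp(e->topic, topic)) {
            e->pending++;
            return 0;
        }
    }
    /* New message: remember it if there is room, and display it. */
    if (table->num_entries < HELP_MAX_TOPICS) {
        help_entry_t *e = &table->entries[table->num_entries++];
        e->file = file;
        e->topic = topic;
        e->pending = 0;
    }
    return 1;
}

/* Return the number of duplicates counted since the previous call and
 * reset the counters; a periodic timer would call this and print
 * "I got X new copies..." when the result is non-zero. */
int help_take_pending(help_table_t *table)
{
    int i, total = 0;

    assert(table != NULL);
    for (i = 0; i < table->num_entries; i++) {
        total += table->entries[i].pending;
        table->entries[i].pending = 0;
    }
    return total;
}
```

A zero-initialized help_table_t is a valid empty table.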
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " ,
2007-04-21 04:15:05 +04:00
" ibv_fork_init fail " , true ,
2008-03-24 02:10:15 +03:00
orte_process_info . nodename ) ;
2007-06-14 05:59:25 +04:00
goto no_btls ;
2008-01-21 15:11:18 +03:00
}
2007-04-21 04:15:05 +04:00
}
}
# endif
2007-06-14 05:59:25 +04:00
/* Parse the include and exclude lists, checking for errors */
mca_btl_openib_component . if_include_list =
2008-01-21 15:11:18 +03:00
mca_btl_openib_component . if_exclude_list =
2007-06-14 05:59:25 +04:00
mca_btl_openib_component . if_list = NULL ;
if ( NULL ! = mca_btl_openib_component . if_include & &
NULL ! = mca_btl_openib_component . if_exclude ) {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " ,
2007-06-14 05:59:25 +04:00
" specified include and exclude " , true ,
mca_btl_openib_component . if_include ,
mca_btl_openib_component . if_exclude , NULL ) ;
goto no_btls ;
} else if ( NULL ! = mca_btl_openib_component . if_include ) {
2008-01-21 15:11:18 +03:00
mca_btl_openib_component . if_include_list =
2007-06-14 05:59:25 +04:00
opal_argv_split ( mca_btl_openib_component . if_include , ' , ' ) ;
2008-01-21 15:11:18 +03:00
mca_btl_openib_component . if_list =
2007-06-14 05:59:25 +04:00
opal_argv_copy ( mca_btl_openib_component . if_include_list ) ;
} else if ( NULL ! = mca_btl_openib_component . if_exclude ) {
2008-01-21 15:11:18 +03:00
mca_btl_openib_component . if_exclude_list =
2007-06-14 05:59:25 +04:00
opal_argv_split ( mca_btl_openib_component . if_exclude , ' , ' ) ;
2008-01-21 15:11:18 +03:00
mca_btl_openib_component . if_list =
2007-06-14 05:59:25 +04:00
opal_argv_copy ( mca_btl_openib_component . if_exclude_list ) ;
}
2008-02-03 18:16:24 +03:00
ib_devs = ibv_get_device_list_compat ( & num_devs ) ;
2005-07-01 01:28:35 +04:00
2008-02-03 18:16:24 +03:00
if ( 0 = = num_devs | | NULL = = ib_devs ) {
2008-07-23 04:28:59 +04:00
mca_btl_base_error_no_nics ( " OpenFabrics (openib) " , " device " ) ;
2008-02-03 18:16:24 +03:00
goto no_btls ;
2005-07-01 01:28:35 +04:00
}
2008-02-11 13:34:11 +03:00
dev_sorted = sort_devs_by_distance ( ib_devs , num_devs ) ;
2008-01-21 15:11:18 +03:00
OBJ_CONSTRUCT ( & btl_list , opal_list_t ) ;
2005-07-04 02:45:48 +04:00
OBJ_CONSTRUCT ( & mca_btl_openib_component . ib_lock , opal_mutex_t ) ;
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
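The btl_openib_receive_queues syntax described above can be parsed
along the following lines. This is a sketch under stated assumptions
rather than the BTL's actual parser: qp_spec_t and
parse_receive_queues are hypothetical names, entries are limited to
63 characters, and equal buffer sizes in adjacent entries are
accepted (the text only says "ascending").

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative type; not the BTL's real data structure. */
typedef struct {
    char type;           /* 'P' = per-peer, 'S' = shared receive queue */
    int  size;           /* receive buffer size in bytes */
    int  num_buffers;    /* receive buffers to allocate */
    int  low_watermark;  /* repost buffers when this few remain */
    int  max_sends;      /* outstanding sends ('S' only, else 0) */
} qp_spec_t;

/* Parse a ';'-delimited list of ','-delimited QP descriptions into
 * out[].  Returns the number of QPs parsed, or -1 if the string is
 * malformed, has more than max_qps entries, or the buffer sizes are
 * not in ascending order. */
int parse_receive_queues(const char *spec, qp_spec_t *out, int max_qps)
{
    const char *p = spec;
    int count = 0;
    int prev_size = 0;

    assert(spec != NULL && out != NULL);

    while (*p != '\0') {
        char entry[64];
        qp_spec_t qp = { 0, 0, 0, 0, 0 };
        const char *end = strchr(p, ';');
        size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);
        int fields;

        if (len == 0 || len >= sizeof(entry) || count == max_qps) {
            return -1;
        }
        memcpy(entry, p, len);
        entry[len] = '\0';

        /* 'P' entries carry 4 fields, 'S' entries carry 5. */
        fields = sscanf(entry, "%c,%d,%d,%d,%d", &qp.type, &qp.size,
                        &qp.num_buffers, &qp.low_watermark,
                        &qp.max_sends);
        if (!((qp.type == 'P' && fields == 4) ||
              (qp.type == 'S' && fields == 5))) {
            return -1;
        }
        if (qp.size < prev_size) {
            return -1;  /* sizes must be in ascending order */
        }
        prev_size = qp.size;
        out[count++] = qp;

        p += len;
        if (*p == ';') {
            p++;
        }
    }
    return count;
}
```

Applied to the example string above, the function fills in four
entries: one per-peer QP with 128-byte buffers followed by three SRQ
QPs with 1024, 4096 and 65536-byte buffers.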
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
# if OMPI_HAVE_THREADS
2007-11-28 10:14:34 +03:00
mca_btl_openib_component . async_thread = 0 ;
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asychronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artifical changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
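The low-watermark behavior described above can be sketched as follows. This is a minimal illustration, not the BTL's actual code: the type recv_queue_t and the function on_receive() are hypothetical names, and the exact comparison (less-than-or-equal versus less-than) is an assumption.

```c
/* Hypothetical bookkeeping for one per-peer QP such as "P,128,16,4". */
typedef struct {
    int num_bufs;      /* third field: receive buffers allocated to the QP */
    int low_watermark; /* fourth field: repost threshold */
    int posted;        /* receive buffers currently posted */
} recv_queue_t;

/* Account for one consumed receive buffer.  When the number of posted
 * buffers drops to the low watermark (assumed: <=), refill the queue
 * back up to num_bufs.  Returns how many buffers were reposted. */
static int on_receive(recv_queue_t *q)
{
    int reposted = 0;

    if (q->posted > 0) {
        q->posted--;
    }
    if (q->posted <= q->low_watermark) {
        reposted = q->num_bufs - q->posted;
        q->posted = q->num_bufs;
    }
    return reposted;
}
```

With the values from the first QP (16 buffers, low watermark 4), this sketch reposts nothing for the first 11 receives and refills 12 buffers on the 12th.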
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
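To make the syntax concrete, here is a small stand-alone sketch of how such a string could be parsed and checked. It is not the parser used by the OpenIB BTL: qp_desc_t and parse_receive_queues() are hypothetical names, and treating equal buffer sizes as acceptable "ascending" order is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical illustration type; not the BTL's real data structure. */
typedef struct {
    char type;          /* 'P' = per-peer, 'S' = shared receive queue */
    long size;          /* receive buffer size in bytes */
    long num_bufs;      /* number of receive buffers */
    long low_watermark; /* repost threshold */
    long max_sends;     /* outstanding sends (SRQ only); 0 for PP */
} qp_desc_t;

/* Parse a ';'-delimited list of ','-delimited QP descriptions.
 * Returns the number of QPs parsed, or -1 on any syntax error or if
 * the buffer sizes are not in ascending order. */
static int parse_receive_queues(const char *spec, qp_desc_t *out, int max_qps)
{
    int n = 0;
    long prev_size = 0;
    const char *p = spec;

    while ('\0' != *p) {
        char field[64];
        char extra;
        const char *end = strchr(p, ';');
        size_t len = (NULL != end) ? (size_t)(end - p) : strlen(p);
        qp_desc_t qp = { 0, 0, 0, 0, 0 };
        int got;

        if (0 == len || len >= sizeof(field) || n >= max_qps) {
            return -1;
        }
        memcpy(field, p, len);
        field[len] = '\0';

        /* The trailing %c only matches if there is unexpected extra text. */
        got = sscanf(field, "%c,%ld,%ld,%ld,%ld%c", &qp.type, &qp.size,
                     &qp.num_bufs, &qp.low_watermark, &qp.max_sends, &extra);
        if (!(('P' == qp.type && 4 == got) || ('S' == qp.type && 5 == got))) {
            return -1;
        }
        if (qp.size <= 0 || qp.size < prev_size) {
            return -1;   /* sizes must be given in ascending order */
        }
        prev_size = qp.size;
        out[n++] = qp;

        p += len;
        if (';' == *p) {
            p++;
        }
    }
    return n;
}
```

Passing the example string shown earlier yields 4 descriptions; a string that lists a 4096-byte QP before a 128-byte QP is rejected.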
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
# endif
2008-06-05 23:08:08 +04:00
distance = dev_sorted [ 0 ] . distance ;
2008-05-16 07:28:20 +04:00
for ( i = 0 ; i < num_devs & & ( - 1 = = mca_btl_openib_component . ib_max_btls | |
2007-11-28 10:14:34 +03:00
mca_btl_openib_component . ib_num_btls <
mca_btl_openib_component . ib_max_btls ) ; i + + ) {
2008-06-05 23:08:08 +04:00
if ( distance ! = dev_sorted [ i ] . distance ) {
2008-02-11 13:34:11 +03:00
break ;
2008-05-16 07:28:20 +04:00
}
2008-02-11 13:34:11 +03:00
2008-05-16 07:28:20 +04:00
if ( OMPI_SUCCESS ! =
2008-07-23 04:28:59 +04:00
( ret = init_one_device ( & btl_list , dev_sorted [ i ] . ib_dev ) ) )
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
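The MTU selection rule in the last bullet amounts to taking the minimum of the two exchanged values. A minimal sketch, assuming the MTUs are compared as plain byte counts (negotiate_mtu() is a hypothetical helper, not a function from the BTL):

```c
#include <stdint.h>

/* Each peer advertises its maximum MTU; both sides then use the
 * smaller of the two values so that neither is exceeded. */
static uint32_t negotiate_mtu(uint32_t local_max_mtu, uint32_t remote_max_mtu)
{
    return (local_max_mtu < remote_max_mtu) ? local_max_mtu : remote_max_mtu;
}
```

Because both peers apply the same rule to the same pair of values, they arrive at the same MTU without a further round trip.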
This commit was SVN r11182.
2006-08-14 23:30:37 +04:00
break ;
2006-09-19 12:56:32 +04:00
}
2008-05-16 07:28:20 +04:00
if ( OMPI_SUCCESS ! = ret ) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace
on the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
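The duplicate detection described for orte_show_help() can be pictured with the following stand-alone sketch. It is not the HNP's implementation: help_entry_t, help_aggregate(), and help_flush() are hypothetical names, the fixed-size table is an assumption, and the periodic (~5 second) timer is replaced by an explicit flush call.

```c
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 32

/* Hypothetical record of one help message that was already displayed. */
typedef struct {
    char key[128];  /* "filename:topic" */
    int  pending;   /* duplicates received since the last report */
} help_entry_t;

static help_entry_t seen[MAX_TOPICS];
static int num_seen = 0;

/* Returns 1 if the message should be displayed now (first time seen),
 * or 0 if it is a duplicate that was only counted.  If the table is
 * full, new messages are always displayed. */
static int help_aggregate(const char *filename, const char *topic)
{
    char key[128];
    int i;

    snprintf(key, sizeof(key), "%s:%s", filename, topic);
    for (i = 0; i < num_seen; i++) {
        if (0 == strcmp(seen[i].key, key)) {
            seen[i].pending++;
            return 0;
        }
    }
    if (num_seen < MAX_TOPICS) {
        strcpy(seen[num_seen].key, key);
        seen[num_seen].pending = 0;
        num_seen++;
    }
    return 1;
}

/* Stand-in for the periodic report: print and reset the duplicate
 * counts.  Returns the total number of duplicates reported. */
static int help_flush(void)
{
    int i, total = 0;

    for (i = 0; i < num_seen; i++) {
        if (seen[i].pending > 0) {
            printf("%d new copies of help message %s\n",
                   seen[i].pending, seen[i].key);
            total += seen[i].pending;
            seen[i].pending = 0;
        }
    }
    return total;
}
```

In this sketch, three processes reporting the same topic produce one displayed message plus a later summary line counting two duplicates.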
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves similarly to what
opal_output'ing to stream 0 did, so leaving a hard-coded "0" as the
first argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done in configure.m4 to search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " ,
2008-07-23 04:28:59 +04:00
" error in device init " , true ,
orte_process_info . nodename ,
2008-05-21 01:53:42 +04:00
ibv_get_device_name ( dev_sorted [ i ] . ib_dev ) ) ;
2008-05-16 07:36:11 +04:00
return NULL ;
2006-09-19 12:56:32 +04:00
}
2007-06-14 05:59:25 +04:00
2008-02-11 13:34:11 +03:00
free ( dev_sorted ) ;
2008-07-23 04:28:59 +04:00
/* If we got back from checking all the devices and find that
there are still items in the component . if_list , that means that
they didn ' t exist . Show an appropriate warning if the warning
was not disabled . */
2007-06-14 05:59:25 +04:00
if ( 0 ! = opal_argv_count ( mca_btl_openib_component . if_list ) & &
mca_btl_openib_component . warn_nonexistent_if ) {
char * str = opal_argv_join ( mca_btl_openib_component . if_list , ' , ' ) ;
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " , " nonexistent port " ,
2008-03-24 02:10:15 +03:00
true , orte_process_info . nodename ,
2008-01-21 15:11:18 +03:00
( ( NULL ! = mca_btl_openib_component . if_include ) ?
2007-06-14 05:59:25 +04:00
" in " : " ex " ) , str ) ;
free ( str ) ;
}
2008-01-21 15:11:18 +03:00
2006-09-19 12:56:32 +04:00
if ( 0 = = mca_btl_openib_component . ib_num_btls ) {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " ,
2008-03-24 02:10:15 +03:00
" no active ports found " , true , orte_process_info . nodename ) ;
2006-09-19 12:56:32 +04:00
return NULL ;
}
2008-05-21 01:53:42 +04:00
/* Setup the BSRQ QP's based on the final value of
mca_btl_openib_component . receive_queues . */
2008-05-28 15:36:38 +04:00
if ( OMPI_SUCCESS ! = setup_qps ( ) ) {
return NULL ;
}
2008-05-21 01:53:42 +04:00
2008-05-28 15:31:38 +04:00
/* For XRC:
* from this point we know if MCA_BTL_XRC_ENABLED is true or false */
/* Init XRC IB Addr hash table */
if ( MCA_BTL_XRC_ENABLED ) {
OBJ_CONSTRUCT ( & mca_btl_openib_component . ib_addr_table ,
opal_hash_table_t ) ;
}
2008-05-21 01:53:42 +04:00
/* Loop through all the btl modules that we made and find every
2008-07-23 04:28:59 +04:00
base device that doesn ' t have device - > qps setup on it yet ( remember
that some modules may share the same device , so when going through
the loop , we may hit a device that was already set up earlier in
2008-05-21 01:53:42 +04:00
the loop ) . */
for ( item = opal_list_get_first ( & btl_list ) ;
opal_list_get_end ( & btl_list ) ! = item ;
item = opal_list_get_next ( item ) ) {
mca_btl_base_selected_module_t * m =
( mca_btl_base_selected_module_t * ) item ;
2008-07-23 04:28:59 +04:00
mca_btl_openib_device_t * device =
( ( mca_btl_openib_module_t * ) m - > btl_module ) - > device ;
if ( NULL = = device - > qps ) {
2008-05-21 01:53:42 +04:00
2008-07-23 04:28:59 +04:00
/* Setup the device qps info */
device - > qps = ( mca_btl_openib_device_qp_t * )
2008-05-21 01:53:42 +04:00
calloc ( mca_btl_openib_component . num_qps ,
2008-07-23 04:28:59 +04:00
sizeof ( mca_btl_openib_device_qp_t ) ) ;
2008-05-21 01:53:42 +04:00
for ( i = 0 ; i < mca_btl_openib_component . num_qps ; i + + ) {
2008-07-23 04:28:59 +04:00
OBJ_CONSTRUCT ( & device - > qps [ i ] . send_free , ompi_free_list_t ) ;
OBJ_CONSTRUCT ( & device - > qps [ i ] . recv_free , ompi_free_list_t ) ;
2008-05-21 01:53:42 +04:00
}
2008-07-23 04:28:59 +04:00
/* Do final init on device */
ret = prepare_device_for_use ( device ) ;
2008-05-21 01:53:42 +04:00
if ( OMPI_SUCCESS ! = ret ) {
orte_show_help ( " help-mpi-btl-openib.txt " ,
2008-07-23 04:28:59 +04:00
" error in device init " , true ,
2008-05-21 01:53:42 +04:00
orte_process_info . nodename ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( device - > ib_dev ) ) ;
2008-05-21 01:53:42 +04:00
return NULL ;
}
}
}
2005-07-01 01:28:35 +04:00
/* Allocate space for btl modules */
2006-06-28 11:23:08 +04:00
mca_btl_openib_component . openib_btls =
2007-05-09 01:47:21 +04:00
malloc ( sizeof ( mca_btl_openib_module_t * ) *
2006-06-28 11:23:08 +04:00
mca_btl_openib_component . ib_num_btls ) ;
2005-07-12 17:38:54 +04:00
if ( NULL = = mca_btl_openib_component . openib_btls ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2005-07-01 01:28:35 +04:00
return NULL ;
}
2008-02-03 18:16:24 +03:00
btls = ( struct mca_btl_base_module_t * * )
malloc ( mca_btl_openib_component . ib_num_btls *
sizeof ( struct mca_btl_base_module_t * ) ) ;
2005-07-01 01:28:35 +04:00
if ( NULL = = btls ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2005-07-01 01:28:35 +04:00
return NULL ;
}
2006-08-14 23:30:37 +04:00
/* Copy the btl module structs into a contiguous array and fully
initialize them */
2005-07-01 01:28:35 +04:00
for ( i = 0 ; i < mca_btl_openib_component . ib_num_btls ; i + + ) {
2008-01-21 15:11:18 +03:00
item = opal_list_remove_first ( & btl_list ) ;
ib_selected = ( mca_btl_base_selected_module_t * ) item ;
2008-05-28 15:31:38 +04:00
openib_btl = ( mca_btl_openib_module_t * ) ib_selected - > btl_module ;
/* Do we have at least one CPC that can handle this
port ? */
ret =
ompi_btl_openib_connect_base_select_for_local_port ( openib_btl ) ;
if ( OMPI_SUCCESS ! = ret ) {
orte_show_help ( " help-mpi-btl-openib.txt " ,
" failed load cpc " , true ,
orte_process_info . nodename ,
2008-07-23 04:28:59 +04:00
ibv_get_device_name ( openib_btl - > device - > ib_dev ) ) ;
2008-05-28 15:31:38 +04:00
return NULL ;
}
mca_btl_openib_component . openib_btls [ i ] = openib_btl ;
2008-01-21 15:11:18 +03:00
OBJ_RELEASE ( ib_selected ) ;
2007-10-23 15:10:52 +04:00
btls [ i ] = & openib_btl - > super ;
2007-10-23 16:57:45 +04:00
if ( finish_btl_init ( openib_btl ) ! = OMPI_SUCCESS )
return NULL ;
2007-10-23 15:10:52 +04:00
}
2005-07-01 01:28:35 +04:00
2006-08-14 23:30:37 +04:00
btl_openib_modex_send ( ) ;
2005-07-01 01:28:35 +04:00
* num_btl_modules = mca_btl_openib_component . ib_num_btls ;
2008-02-03 18:16:24 +03:00
ibv_free_device_list_compat ( ib_devs ) ;
2007-06-14 05:59:25 +04:00
if ( NULL ! = mca_btl_openib_component . if_include_list ) {
opal_argv_free ( mca_btl_openib_component . if_include_list ) ;
mca_btl_openib_component . if_include_list = NULL ;
}
if ( NULL ! = mca_btl_openib_component . if_exclude_list ) {
opal_argv_free ( mca_btl_openib_component . if_exclude_list ) ;
mca_btl_openib_component . if_exclude_list = NULL ;
}
2008-08-06 18:22:03 +04:00
/* setup the fork warning message as we are sensitive
* to memory corruption issues when fork is called
*/
ompi_warn_fork ( ) ;
2005-07-01 01:28:35 +04:00
return btls ;
2007-06-14 05:59:25 +04:00
no_btls :
/* If we fail early enough in the setup, we just modex around that
there are no openib BTL ' s in this process and return NULL . */
2008-05-02 15:52:33 +04:00
/* Be sure to shut down the fd listener */
ompi_btl_openib_fd_finalize ( ) ;
2007-11-28 10:18:59 +03:00
2007-06-14 05:59:25 +04:00
mca_btl_openib_component . ib_num_btls = 0 ;
btl_openib_modex_send ( ) ;
return NULL ;
2005-07-01 01:28:35 +04:00
}
2008-01-09 13:27:15 +03:00
static void progress_pending_eager_rdma ( mca_btl_base_endpoint_t * ep )
2006-08-31 00:21:47 +04:00
{
2008-01-09 13:27:15 +03:00
int qp ;
opal_list_item_t * frag ;
2006-12-14 18:52:13 +03:00
2008-01-09 13:27:15 +03:00
/* Go over all QPs and try to send high prio packets over eager rdma
* channel */
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
for ( qp = 0 ; qp < mca_btl_openib_component . num_qps ; qp + + ) {
while ( ep - > qps [ qp ] . qp - > sd_wqe > 0 & & ep - > eager_rdma_remote . tokens > 0 ) {
frag = opal_list_remove_first ( & ep - > qps [ qp ] . pending_frags [ 0 ] ) ;
if ( NULL = = frag )
break ;
mca_btl_openib_endpoint_post_send ( ep , to_send_frag ( frag ) ) ;
}
if ( ep - > eager_rdma_remote . tokens = = 0 )
break ;
2006-12-14 18:52:13 +03:00
}
2008-01-09 13:27:15 +03:00
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
2006-08-31 00:21:47 +04:00
}
2008-01-09 13:27:15 +03:00
static inline int
get_enpoint_credits ( mca_btl_base_endpoint_t * ep , const int qp )
2007-11-28 10:13:34 +03:00
{
2008-01-09 13:27:15 +03:00
return BTL_OPENIB_QP_TYPE_PP ( qp ) ? ep - > qps [ qp ] . u . pp_qp . sd_credits : 1 ;
}
static void progress_pending_frags_pp ( mca_btl_base_endpoint_t * ep , const int qp )
{
int i ;
opal_list_item_t * frag ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:27:15 +03:00
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
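/* i = 0: high-priority list, which may also be sent using eager RDMA
* tokens; i = 1: low-priority list, which needs send credits */
for ( i = 0 ; i < 2 ; i + + ) {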
while ( ( get_enpoint_credits ( ep , qp ) +
( 1 - i ) * ep - > eager_rdma_remote . tokens ) > 0 ) {
frag = opal_list_remove_first ( & ep - > qps [ qp ] . pending_frags [ i ] ) ;
if ( NULL = = frag )
break ;
mca_btl_openib_endpoint_post_send ( ep , to_send_frag ( frag ) ) ;
}
}
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
}
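The two-pass drain in progress_pending_frags_pp can be sketched in isolation as follows. This is a minimal illustration, not the BTL's code: `struct sketch_qp` and `drain_pending` are hypothetical names, the pending lists are reduced to counters, and the rule that a high-priority fragment consumes an eager RDMA token before falling back to a send credit is an assumption about what the real post-send path does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one endpoint/QP pair (not an Open MPI type). */
struct sketch_qp {
    int sd_credits;   /* per-peer send credits */
    int rdma_tokens;  /* eager RDMA tokens */
    int pending[2];   /* queued frags: [0] = high priority, [1] = low priority */
};

/* Drain both pending lists in priority order; returns the number sent. */
static int drain_pending(struct sketch_qp *q)
{
    int sent = 0;

    assert(q != NULL);
    for (int i = 0; i < 2; i++) {
        /* Tokens only enlarge the budget for the high-priority list (i == 0),
         * mirroring the (1 - i) * tokens term in the original loop. */
        while (q->sd_credits + (1 - i) * q->rdma_tokens > 0) {
            if (q->pending[i] == 0)
                break;                 /* list is empty */
            q->pending[i]--;
            /* Assumption: prefer a token for high-priority frags,
             * otherwise spend a send credit. */
            if (i == 0 && q->rdma_tokens > 0)
                q->rdma_tokens--;
            else
                q->sd_credits--;
            sent++;
        }
    }
    return sent;
}
```

With two credits, one token, two high-priority and three low-priority fragments queued, the sketch sends both high-priority fragments and one low-priority fragment, then stops because the budget is exhausted.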
void mca_btl_openib_frag_progress_pending_put_get ( mca_btl_base_endpoint_t * ep ,
const int qp )
2008-01-21 15:11:18 +03:00
{
2008-01-09 13:27:15 +03:00
mca_btl_openib_module_t * openib_btl = ep - > endpoint_btl ;
opal_list_item_t * frag ;
size_t i , len = opal_list_get_size ( & ep - > pending_get_frags ) ;
for ( i = 0 ; i < len & & ep - > qps [ qp ] . qp - > sd_wqe > 0 & & ep - > get_tokens > 0 ; i + + )
{
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
frag = opal_list_remove_first ( & ( ep - > pending_get_frags ) ) ;
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
if ( NULL = = frag )
break ;
if ( mca_btl_openib_get ( ( mca_btl_base_module_t * ) openib_btl , ep ,
& to_base_frag ( frag ) - > base ) = = OMPI_ERR_OUT_OF_RESOURCE )
break ;
}
2008-01-21 15:11:18 +03:00
2008-01-09 13:27:15 +03:00
len = opal_list_get_size ( & ep - > pending_put_frags ) ;
for ( i = 0 ; i < len & & ep - > qps [ qp ] . qp - > sd_wqe > 0 ; i + + ) {
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
frag = opal_list_remove_first ( & ( ep - > pending_put_frags ) ) ;
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
if ( NULL = = frag )
break ;
if ( mca_btl_openib_put ( ( mca_btl_base_module_t * ) openib_btl , ep ,
& to_base_frag ( frag ) - > base ) = = OMPI_ERR_OUT_OF_RESOURCE )
break ;
}
2007-11-28 10:13:34 +03:00
}
2006-08-31 00:21:47 +04:00
2006-09-12 13:17:59 +04:00
static int btl_openib_handle_incoming ( mca_btl_openib_module_t * openib_btl ,
2007-11-28 10:13:34 +03:00
mca_btl_openib_endpoint_t * ep ,
2008-01-21 15:11:18 +03:00
mca_btl_openib_recv_frag_t * frag ,
2007-11-28 10:12:44 +03:00
size_t byte_len )
2006-03-26 12:30:50 +04:00
{
2007-11-28 10:11:14 +03:00
mca_btl_base_descriptor_t * des = & to_base_frag ( frag ) - > base ;
mca_btl_openib_header_t * hdr = frag - > hdr ;
2007-11-28 10:13:34 +03:00
int rqp = to_base_frag ( frag ) - > base . order , cqp ;
uint16_t rcredits = 0 , credits ;
bool is_credit_msg ;
2007-11-28 10:11:14 +03:00
2007-11-28 10:13:34 +03:00
if ( ep - > nbo ) {
2007-11-28 10:11:14 +03:00
BTL_OPENIB_HEADER_NTOH ( * hdr ) ;
2007-01-13 02:14:45 +03:00
}
2007-03-14 17:36:03 +03:00
2006-09-12 13:17:59 +04:00
/* advance the segment address past the header and subtract from the
2007-11-28 10:13:34 +03:00
* length . */
2007-11-28 10:11:14 +03:00
des - > des_dst - > seg_len = byte_len - sizeof ( mca_btl_openib_header_t ) ;
2006-03-26 12:30:50 +04:00
2007-11-28 10:13:34 +03:00
if ( OPAL_LIKELY ( ! ( is_credit_msg = is_credit_message ( frag ) ) ) ) {
/* call registered callback */
2008-01-15 08:32:53 +03:00
mca_btl_active_message_callback_t * reg ;
reg = mca_btl_base_active_message_trigger + hdr - > tag ;
reg - > cbfunc ( & openib_btl - > super , hdr - > tag , des , reg - > cbdata ) ;
2007-12-09 16:56:13 +03:00
if ( MCA_BTL_OPENIB_RDMA_FRAG ( frag ) ) {
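/* for eager RDMA frags, bits 11-14 of the credits field carry the QP index */
cqp = ( hdr - > credits > > 11 ) & 0x0f ;
/* clear those bits, keeping bit 15 and the low 11 bits */
hdr - > credits & = 0x87ff ;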
} else {
cqp = rqp ;
}
2007-11-28 10:13:34 +03:00
if ( BTL_OPENIB_IS_RDMA_CREDITS ( hdr - > credits ) ) {
rcredits = BTL_OPENIB_CREDITS ( hdr - > credits ) ;
hdr - > credits = 0 ;
2007-11-28 10:12:44 +03:00
}
2007-07-24 17:23:08 +04:00
} else {
2007-11-28 10:13:34 +03:00
mca_btl_openib_rdma_credits_header_t * chdr = des - > des_dst - > seg_addr . pval ;
if ( ep - > nbo ) {
BTL_OPENIB_RDMA_CREDITS_HEADER_NTOH ( * chdr ) ;
}
cqp = chdr - > qpn ;
rcredits = chdr - > rdma_credits ;
}
credits = hdr - > credits ;
2007-11-28 10:12:44 +03:00
2007-11-28 10:13:34 +03:00
if ( hdr - > cm_seen )
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . cm_sent , - hdr - > cm_seen ) ;
/* Now return fragment. Don't touch hdr after this point! */
if ( MCA_BTL_OPENIB_RDMA_FRAG ( frag ) ) {
mca_btl_openib_eager_rdma_local_t * erl = & ep - > eager_rdma_local ;
OPAL_THREAD_LOCK ( & erl - > lock ) ;
MCA_BTL_OPENIB_RDMA_MAKE_REMOTE ( frag - > ftr ) ;
while ( erl - > tail ! = erl - > head ) {
mca_btl_openib_recv_frag_t * tf ;
tf = MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG ( ep , erl - > tail ) ;
if ( MCA_BTL_OPENIB_RDMA_FRAG_LOCAL ( tf ) )
break ;
OPAL_THREAD_ADD32 ( & erl - > credits , 1 ) ;
MCA_BTL_OPENIB_RDMA_NEXT_INDEX ( erl - > tail ) ;
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
2. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers to
post, and when to replenish the receive queue (low watermark). For SRQ
QPs, the number of outstanding sends can also be specified. The
following is an example of the syntax to describe QPs to the OpenIB BTL
using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
}
2007-11-28 10:13:34 +03:00
OPAL_THREAD_UNLOCK ( & erl - > lock ) ;
} else {
MCA_BTL_IB_FRAG_RETURN ( frag ) ;
2007-11-28 10:20:26 +03:00
if ( BTL_OPENIB_QP_TYPE_PP ( rqp ) ) {
2007-11-28 10:13:34 +03:00
if ( OPAL_UNLIKELY ( is_credit_msg ) )
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . cm_received , 1 ) ;
else
OPAL_THREAD_ADD32 ( & ep - > qps [ rqp ] . u . pp_qp . rd_posted , - 1 ) ;
mca_btl_openib_endpoint_post_rr ( ep , cqp ) ;
2007-11-28 10:20:26 +03:00
} else {
mca_btl_openib_module_t * btl = ep - > endpoint_btl ;
OPAL_THREAD_ADD32 ( & btl - > qps [ rqp ] . u . srq_qp . rd_posted , - 1 ) ;
2007-11-28 17:52:31 +03:00
mca_btl_openib_post_srr ( btl , rqp ) ;
2007-11-28 10:13:34 +03:00
}
}
if ( rcredits > 0 ) {
OPAL_THREAD_ADD32 ( & ep - > eager_rdma_remote . tokens , rcredits ) ;
progress_pending_eager_rdma ( ep ) ;
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
2006-08-14 23:30:37 +04:00
}
2006-03-26 12:30:50 +04:00
2007-11-28 10:13:34 +03:00
assert ( ( cqp ! = MCA_BTL_NO_ORDER & & BTL_OPENIB_QP_TYPE_PP ( cqp ) ) | | ! credits ) ;
if ( credits ) {
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . sd_credits , credits ) ;
progress_pending_frags_pp ( ep , cqp ) ;
}
2007-12-09 16:56:13 +03:00
send_credits ( ep , cqp ) ;
2007-11-28 10:13:34 +03:00
2006-03-26 12:30:50 +04:00
return OMPI_SUCCESS ;
}
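The btl_openib_receive_queues syntax described in the r15474 commit message (";" between QP descriptions, "," between fields, "P" with three numeric fields, "S" with four, ascending buffer size) can be illustrated with a small parser. This is only a sketch under those stated rules and is not the parser used by the openib BTL; `struct qp_spec` and `parse_receive_queues` are hypothetical names, and accepting equal buffer sizes for consecutive QPs is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one parsed QP description (not an Open MPI type). */
struct qp_spec {
    char type;       /* 'P' = per-peer, 'S' = shared receive queue */
    long buf_size;   /* receive buffer size in bytes */
    long num_bufs;   /* number of receive buffers */
    long low_mark;   /* low watermark that triggers reposting */
    long max_sends;  /* outstanding sends allowed (SRQ only, otherwise 0) */
};

/* Parse spec into out[]; returns the number of QPs, or -1 if malformed. */
static int parse_receive_queues(const char *spec, struct qp_spec *out, int max)
{
    int n = 0;
    const char *p = spec;

    assert(spec != NULL && out != NULL);
    while (*p != '\0') {
        char tok[64];
        struct qp_spec q = { 0, 0, 0, 0, 0 };
        int used = 0, fields;
        size_t len = strcspn(p, ";");   /* length of one QP description */

        if (len == 0 || len >= sizeof(tok) || n == max)
            return -1;
        memcpy(tok, p, len);
        tok[len] = '\0';

        if (tok[0] == 'P') {
            fields = sscanf(tok, "P,%ld,%ld,%ld%n",
                            &q.buf_size, &q.num_bufs, &q.low_mark, &used);
            if (fields != 3)
                return -1;
        } else if (tok[0] == 'S') {
            fields = sscanf(tok, "S,%ld,%ld,%ld,%ld%n",
                            &q.buf_size, &q.num_bufs, &q.low_mark,
                            &q.max_sends, &used);
            if (fields != 4)
                return -1;
        } else {
            return -1;                  /* unknown QP type */
        }
        if (tok[used] != '\0')
            return -1;                  /* trailing characters after last field */
        q.type = tok[0];

        /* QPs must come in ascending receive buffer size order. */
        if (n > 0 && q.buf_size < out[n - 1].buf_size)
            return -1;
        out[n++] = q;

        p += len;
        if (*p == ';')
            p++;
    }
    return n;
}
```

Passing the example string from the commit message yields four entries, the first being a per-peer QP with 128-byte buffers, 16 buffers and a low watermark of 4.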
2006-08-14 23:30:37 +04:00
static char * btl_openib_component_status_to_string ( enum ibv_wc_status status )
2008-01-21 15:11:18 +03:00
{
switch ( status ) {
2006-08-14 23:30:37 +04:00
case IBV_WC_SUCCESS :
2008-01-21 15:11:18 +03:00
return " SUCCESS " ;
2006-08-14 23:30:37 +04:00
break ;
case IBV_WC_LOC_LEN_ERR :
2008-01-21 15:11:18 +03:00
return " LOCAL LENGTH ERROR " ;
2006-08-14 23:30:37 +04:00
break ;
case IBV_WC_LOC_QP_OP_ERR :
return " LOCAL QP OPERATION ERROR " ;
break ;
case IBV_WC_LOC_EEC_OP_ERR :
return " LOCAL EEC OPERATION ERROR " ;
break ;
case IBV_WC_LOC_PROT_ERR :
return " LOCAL PROTOCOL ERROR " ;
break ;
case IBV_WC_WR_FLUSH_ERR :
return " WORK REQUEST FLUSHED ERROR " ;
break ;
case IBV_WC_MW_BIND_ERR :
return " MEMORY WINDOW BIND ERROR " ;
break ;
case IBV_WC_BAD_RESP_ERR :
return " BAD RESPONSE ERROR " ;
break ;
case IBV_WC_LOC_ACCESS_ERR :
return " LOCAL ACCESS ERROR " ;
break ;
case IBV_WC_REM_INV_REQ_ERR :
return " INVALID REQUEST ERROR " ;
break ;
case IBV_WC_REM_ACCESS_ERR :
return " REMOTE ACCESS ERROR " ;
break ;
case IBV_WC_REM_OP_ERR :
return " REMOTE OPERATION ERROR " ;
break ;
case IBV_WC_RETRY_EXC_ERR :
return " RETRY EXCEEDED ERROR " ;
break ;
case IBV_WC_RNR_RETRY_EXC_ERR :
2007-04-27 01:03:38 +04:00
return " RECEIVER NOT READY RETRY EXCEEDED ERROR " ;
2006-08-14 23:30:37 +04:00
break ;
case IBV_WC_LOC_RDD_VIOL_ERR :
return " LOCAL RDD VIOLATION ERROR " ;
break ;
case IBV_WC_REM_INV_RD_REQ_ERR :
return " INVALID READ REQUEST ERROR " ;
break ;
case IBV_WC_REM_ABORT_ERR :
return " REMOTE ABORT ERROR " ;
break ;
case IBV_WC_INV_EECN_ERR :
return " INVALID EECN ERROR " ;
break ;
case IBV_WC_INV_EEC_STATE_ERR :
return " INVALID EEC STATE ERROR " ;
break ;
2008-01-21 15:11:18 +03:00
case IBV_WC_FATAL_ERR :
2006-08-14 23:30:37 +04:00
return " FATAL ERROR " ;
break ;
case IBV_WC_RESP_TIMEOUT_ERR :
return " RESPONSE TIMEOUT ERROR " ;
break ;
case IBV_WC_GENERAL_ERR :
return " GENERAL ERROR " ;
break ;
default :
return " STATUS UNDEFINED " ;
break ;
}
2006-06-06 00:02:41 +04:00
}
2006-08-14 23:30:37 +04:00
2008-01-09 17:46:41 +03:00
static void
progress_pending_frags_wqe ( mca_btl_base_endpoint_t * ep , const int qpn )
2007-11-28 10:15:20 +03:00
{
int i ;
opal_list_item_t * frag ;
2008-01-09 17:46:41 +03:00
mca_btl_openib_qp_t * qp = ep - > qps [ qpn ] . qp ;
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
2007-11-28 10:15:20 +03:00
for ( i = 0 ; i < 2 ; i + + ) {
while ( qp - > sd_wqe > 0 ) {
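/* intentionally shadows the parameter: each pending frag names its
* own endpoint, which is looked up below */
mca_btl_base_endpoint_t * ep ;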
2008-01-09 17:46:41 +03:00
OPAL_THREAD_LOCK ( & qp - > lock ) ;
2007-11-28 10:15:20 +03:00
frag = opal_list_remove_first ( & qp - > pending_frags [ i ] ) ;
2008-01-09 17:46:41 +03:00
OPAL_THREAD_UNLOCK ( & qp - > lock ) ;
2007-11-28 10:15:20 +03:00
if ( NULL = = frag )
break ;
ep = to_com_frag ( frag ) - > endpoint ;
mca_btl_openib_endpoint_post_send ( ep , to_send_frag ( frag ) ) ;
}
}
2008-01-09 17:46:41 +03:00
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
2007-11-28 10:15:20 +03:00
}
2008-01-09 13:27:15 +03:00
static void progress_pending_frags_srq ( mca_btl_openib_module_t * openib_btl ,
const int qp )
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asychronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artifical changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
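To make the utilization metric concrete, here is a small sketch of the arithmetic. The helper name is hypothetical and the message and buffer sizes used below are illustrative values, not measurements from any application.

```c
#include <assert.h>
#include <stddef.h>

/* Utilization as defined above: bytes received as a percentage of the
 * receive buffer that held them. Hypothetical helper, not part of the BTL. */
static double sketch_recv_utilization_pct(size_t data_len, size_t buf_len)
{
    assert(buf_len > 0 && data_len <= buf_len);
    return 100.0 * (double)data_len / (double)buf_len;
}
```

For instance, a 64-byte message that lands in a 4096-byte receive buffer uses 1.5625% of that buffer, while the same message in a 128-byte buffer uses 50%.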
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers
to post, and when to replenish the receive queue (low water mark);
for SRQ QPs, the number of outstanding sends can also be specified.
The following is an example of the syntax to describe QPs to the
OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
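The syntax described above can be checked with a short parser. This is a sketch under stated assumptions, not the BTL's own parsing code: the `sketch_` names and the struct are hypothetical, "P" entries are taken to have exactly four fields and "S" entries exactly five (as in the example), and "ascending" is read as strictly ascending.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_QPS 8

/* Hypothetical record for one parsed QP description (not the BTL's type). */
typedef struct {
    char type;    /* 'P' = per-peer, 'S' = shared receive queue */
    long size;    /* receive buffer size in bytes */
    long rd_num;  /* number of receive buffers */
    long rd_low;  /* low watermark that triggers a repost */
    long sd_max;  /* 'S' only: outstanding sends allowed; 0 for 'P' */
} sketch_qp_spec_t;

/* Parse one "P,size,num,low" or "S,size,num,low,sends" description.
 * Returns 0 on success, -1 if the text does not match either form. */
static int sketch_parse_one(const char *desc, sketch_qp_spec_t *out)
{
    int end = 0;

    memset(out, 0, sizeof(*out));
    if (3 == sscanf(desc, "P,%ld,%ld,%ld%n",
                    &out->size, &out->rd_num, &out->rd_low, &end) &&
        '\0' == desc[end]) {
        out->type = 'P';
        return 0;
    }
    end = 0;
    if (4 == sscanf(desc, "S,%ld,%ld,%ld,%ld%n", &out->size, &out->rd_num,
                    &out->rd_low, &out->sd_max, &end) &&
        '\0' == desc[end]) {
        out->type = 'S';
        return 0;
    }
    return -1;
}

/* Parse a ';'-delimited list into specs[]. Returns the number of QPs, or -1
 * on a malformed entry, too many QPs, or buffer sizes that do not ascend
 * (assumption: equal sizes are rejected as well). */
static int sketch_parse_receive_queues(const char *value,
                                       sketch_qp_spec_t *specs, int max_qps)
{
    char buf[256];
    char *cur, *sep;
    int count = 0;

    assert(NULL != value && NULL != specs);
    if (strlen(value) >= sizeof(buf))
        return -1;
    strcpy(buf, value);

    for (cur = buf; NULL != cur; cur = sep) {
        sep = strchr(cur, ';');
        if (NULL != sep)
            *sep++ = '\0';      /* terminate this entry, step to the next */
        if (count == max_qps || 0 != sketch_parse_one(cur, &specs[count]))
            return -1;
        if (count > 0 && specs[count].size <= specs[count - 1].size)
            return -1;
        count++;
    }
    return count;
}
```

Applied to the example string above, the sketch reports four QPs; swapping two entries so that the sizes no longer ascend makes it return -1.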
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
{
2007-11-28 10:11:14 +03:00
    opal_list_item_t *frag;
2007-11-28 10:12:44 +03:00
    int i;
2008-01-21 15:11:18 +03:00
2007-11-28 10:18:59 +03:00
    assert(BTL_OPENIB_QP_TYPE_SRQ(qp) || BTL_OPENIB_QP_TYPE_XRC(qp));
2008-01-21 15:11:18 +03:00
2007-11-28 10:12:44 +03:00
    for (i = 0; i < 2; i++) {
2007-11-28 10:20:26 +03:00
        while (openib_btl->qps[qp].u.srq_qp.sd_credits > 0) {
2007-11-28 10:12:44 +03:00
            OPAL_THREAD_LOCK(&openib_btl->ib_lock);
2007-11-28 10:20:26 +03:00
            frag = opal_list_remove_first(
                    &openib_btl->qps[qp].u.srq_qp.pending_frags[i]);
2007-11-28 10:12:44 +03:00
            OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
            if (NULL == frag)
                break;
            mca_btl_openib_endpoint_send(to_com_frag(frag)->endpoint,
                    to_send_frag(frag));
        }
2007-05-09 01:47:21 +04:00
}
}
2007-12-23 15:29:34 +03:00
static char *cq_name[] = {"HP CQ", "LP CQ"};
2008-07-23 04:28:59 +04:00
static void handle_wc(mca_btl_openib_device_t *device, const uint32_t cq,
2007-12-23 15:29:34 +03:00
        struct ibv_wc *wc)
2006-11-02 19:15:21 +03:00
{
2007-12-23 15:29:34 +03:00
    static int flush_err_printed[] = {0, 0};
2007-11-28 10:11:14 +03:00
    mca_btl_openib_com_frag_t *frag;
    mca_btl_base_descriptor_t *des;
2007-12-23 15:29:34 +03:00
    mca_btl_openib_endpoint_t *endpoint;
2007-08-20 16:28:25 +04:00
    mca_btl_openib_module_t *openib_btl = NULL;
2007-12-23 15:29:34 +03:00
    ompi_proc_t *remote_proc = NULL;
2008-02-18 20:39:30 +03:00
    int qp, btl_ownership;
2006-11-02 19:15:21 +03:00
2007-12-23 15:29:34 +03:00
    des = (mca_btl_base_descriptor_t *)(uintptr_t)wc->wr_id;
    frag = to_com_frag(des);
2007-12-02 17:46:37 +03:00
2007-12-23 15:29:34 +03:00
    /* For receive fragments "order" contains the QP index the fragment was
     * posted to. For send fragments "order" contains the QP index the
     * fragment was sent through. */
    qp = des->order;
    endpoint = frag->endpoint;
2007-11-28 10:11:14 +03:00
2007-12-23 15:29:34 +03:00
    if (endpoint)
        openib_btl = endpoint->endpoint_btl;
2007-11-28 10:11:14 +03:00
2007-12-23 15:29:34 +03:00
    if (wc->status != IBV_WC_SUCCESS)
        goto error;
2007-08-20 16:28:25 +04:00
2007-12-23 15:29:34 +03:00
    /* Handle work completions */
    switch (wc->opcode) {
    case IBV_WC_RDMA_READ:
        OPAL_THREAD_ADD32(&endpoint->get_tokens, 1);
        /* fall through */
    case IBV_WC_RDMA_WRITE:
    case IBV_WC_SEND:
        if (openib_frag_type(des) == MCA_BTL_OPENIB_FRAG_SEND) {
            opal_list_item_t *i;
2008-02-18 20:39:30 +03:00
            while ((i = opal_list_remove_first(&to_send_frag(des)->coalesced_frags))) {
                btl_ownership = (to_base_frag(i)->base.des_flags & MCA_BTL_DES_FLAGS_BTL_OWNERSHIP);
2007-12-23 15:29:34 +03:00
                to_base_frag(i)->base.des_cbfunc(&openib_btl->super, endpoint,
                        &to_base_frag(i)->base, OMPI_SUCCESS);
2008-02-18 20:39:30 +03:00
                if (btl_ownership) {
                    mca_btl_openib_free(&openib_btl->super, &to_base_frag(i)->base);
                }
            }
2007-12-23 15:29:34 +03:00
        }
        /* Process a completed send/put/get */
2008-02-18 20:39:30 +03:00
        btl_ownership = (des->des_flags & MCA_BTL_DES_FLAGS_BTL_OWNERSHIP);
2007-12-23 15:29:34 +03:00
        des->des_cbfunc(&openib_btl->super, endpoint, des, OMPI_SUCCESS);
2008-02-18 20:39:30 +03:00
        if (btl_ownership) {
            mca_btl_openib_free(&openib_btl->super, des);
        }
2007-11-28 10:11:14 +03:00
2007-12-23 15:29:34 +03:00
        /* return send wqe */
        qp_put_wqe(endpoint, qp);
2005-11-10 23:15:02 +03:00
2007-12-23 15:29:34 +03:00
        if (IBV_WC_SEND == wc->opcode && !BTL_OPENIB_QP_TYPE_PP(qp)) {
            OPAL_THREAD_ADD32(&openib_btl->qps[qp].u.srq_qp.sd_credits, 1);
2007-07-18 05:15:59 +04:00
2007-12-23 15:29:34 +03:00
            /* new SRQ credit available. Try to progress pending frags */
            progress_pending_frags_srq(openib_btl, qp);
        }
        /* new wqe and/or get token available. Try to progress pending frags */
2008-01-09 17:46:41 +03:00
        progress_pending_frags_wqe(endpoint, qp);
2007-12-23 15:29:34 +03:00
        mca_btl_openib_frag_progress_pending_put_get(endpoint, qp);
        break;
    case IBV_WC_RECV:
        if (wc->wc_flags & IBV_WC_WITH_IMM) {
            endpoint = (mca_btl_openib_endpoint_t *)
2008-07-23 04:28:59 +04:00
                opal_pointer_array_get_item(device->endpoints, wc->imm_data);
2007-12-23 15:29:34 +03:00
            frag->endpoint = endpoint;
            openib_btl = endpoint->endpoint_btl;
        }
2007-11-28 10:18:59 +03:00
2007-12-23 15:29:34 +03:00
        /* Process a RECV */
        if (btl_openib_handle_incoming(openib_btl, endpoint, to_recv_frag(frag),
                    wc->byte_len) != OMPI_SUCCESS) {
            openib_btl->error_cb(&openib_btl->super, MCA_BTL_ERROR_FLAGS_FATAL);
2005-11-10 23:15:02 +03:00
            break;
2007-12-23 15:29:34 +03:00
        }
2007-07-24 17:23:08 +04:00
2007-12-23 15:29:34 +03:00
        /* decide if it is time to set up an eager rdma channel */
        if (!endpoint->eager_rdma_local.base.pval && endpoint->use_eager_rdma &&
                wc->byte_len < mca_btl_openib_component.eager_limit &&
                openib_btl->eager_rdma_channels <
                mca_btl_openib_component.max_eager_rdma &&
                OPAL_THREAD_ADD32(&endpoint->eager_recv_count, 1) ==
                mca_btl_openib_component.eager_rdma_threshold) {
            mca_btl_openib_endpoint_connect_eager_rdma(endpoint);
        }
        break;
    default:
        BTL_ERROR(("Unhandled work completion opcode is %d", wc->opcode));
        if (openib_btl)
            openib_btl->error_cb(&openib_btl->super, MCA_BTL_ERROR_FLAGS_FATAL);
        break;
    }
2007-07-26 17:56:07 +04:00
2007-12-23 15:29:34 +03:00
    return;
error:
    if (endpoint && endpoint->endpoint_proc && endpoint->endpoint_proc->proc_ompi)
        remote_proc = endpoint->endpoint_proc->proc_ompi;
2008-05-07 20:53:05 +04:00
    /* For iWARP, the TCP connection is tied to the QP once the QP is
     * in RTS. And destroying the QP is thus tied to connection
     * teardown for iWARP. To destroy the connection in iWARP you
     * must move the QP out of RTS, either into CLOSING for a nice
     * graceful close (e.g., via rdma_disconnect()), or to ERROR if
     * you want to be rude (e.g., just destroying the QP without
     * disconnecting first). In both cases, all pending non-completed
     * SQ and RQ WRs will automatically be flushed.
2008-05-07 01:57:40 +04:00
     */
2008-05-12 15:56:14 +04:00
#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
2008-05-12 16:05:16 +04:00
    if (IBV_WC_WR_FLUSH_ERR == wc->status &&
2008-07-23 04:28:59 +04:00
        IBV_TRANSPORT_IWARP == device->ib_dev->transport_type) {
2008-05-07 01:57:40 +04:00
        return;
    }
2008-05-12 15:56:14 +04:00
#endif
2008-05-07 01:57:40 +04:00
2008-05-12 16:05:16 +04:00
    if (IBV_WC_WR_FLUSH_ERR != wc->status || !flush_err_printed[cq]++) {
2007-12-23 15:29:34 +03:00
        BTL_PEER_ERROR(remote_proc, ("error polling %s with status %s "
                    "status number %d for wr_id %llu opcode %d qp_idx %d",
                    cq_name[cq], btl_openib_component_status_to_string(wc->status),
                    wc->status, wc->wr_id, wc->opcode, qp));
    }
2008-06-02 15:03:48 +04:00
    if (IBV_WC_RNR_RETRY_EXC_ERR == wc->status ||
        IBV_WC_RETRY_EXC_ERR == wc->status) {
        char *peer_hostname =
            (NULL != endpoint->endpoint_proc->proc_ompi->proc_hostname) ?
            endpoint->endpoint_proc->proc_ompi->proc_hostname :
            "<unknown -- please run with mpi_keep_peer_hostnames=1>";
        const char *device_name =
            ibv_get_device_name(endpoint->qps[qp].qp->lcl_qp->context->device);
        if (IBV_WC_RNR_RETRY_EXC_ERR == wc->status) {
            orte_show_help("help-mpi-btl-openib.txt",
                           BTL_OPENIB_QP_TYPE_PP(qp) ?
                           "pp rnr retry exceeded" :
                           "srq rnr retry exceeded", true,
                           orte_process_info.nodename, device_name,
                           peer_hostname);
        } else if (IBV_WC_RETRY_EXC_ERR == wc->status) {
            orte_show_help("help-mpi-btl-openib.txt",
                           "pp retry exceeded", true,
                           orte_process_info.nodename,
                           device_name, peer_hostname);
        }
    }
2007-12-23 15:29:34 +03:00
    if (openib_btl)
        openib_btl->error_cb(&openib_btl->super, MCA_BTL_ERROR_FLAGS_FATAL);
}
2008-07-23 04:28:59 +04:00
static int poll_device(mca_btl_openib_device_t *device, int count)
2007-12-23 15:29:34 +03:00
{
2008-01-21 15:11:18 +03:00
    int ne = 0, cq;
2007-12-23 15:29:34 +03:00
    uint32_t hp_iter = 0;
    struct ibv_wc wc;
2008-07-23 04:28:59 +04:00
    device->pollme = false;
2007-12-23 15:29:34 +03:00
    for (cq = 0; cq < 2 && hp_iter < mca_btl_openib_component.cq_poll_progress;)
    {
2008-07-23 04:28:59 +04:00
        ne = ibv_poll_cq(device->ib_cq[cq], 1, &wc);
2007-12-23 15:29:34 +03:00
        if (0 == ne) {
            /* don't check the low-priority CQ if there was something in the
             * high-priority CQ, but poll the LP CQ once for every
             * cq_poll_ratio HP CQ polls */
2008-07-23 04:28:59 +04:00
            if (count && device->hp_cq_polls)
2005-07-01 01:28:35 +04:00
                break;
2007-12-23 15:29:34 +03:00
            cq++;
2008-07-23 04:28:59 +04:00
            device->hp_cq_polls = mca_btl_openib_component.cq_poll_ratio;
2007-12-23 15:29:34 +03:00
            continue;
2005-07-01 01:28:35 +04:00
        }
2007-12-23 15:29:34 +03:00
        if (ne < 0)
            goto error;
        count++;
        if (BTL_OPENIB_HP_CQ == cq) {
2008-07-23 04:28:59 +04:00
            device->pollme = true;
2007-12-23 15:29:34 +03:00
            hp_iter++;
2008-07-23 04:28:59 +04:00
            device->hp_cq_polls--;
2007-12-23 15:29:34 +03:00
        }
2008-01-21 15:11:18 +03:00
2008-07-23 04:28:59 +04:00
        handle_wc(device, cq, &wc);
2005-07-01 01:28:35 +04:00
    }
2008-01-21 15:11:18 +03:00
2005-07-01 01:28:35 +04:00
    return count;
2006-09-12 13:17:59 +04:00
error:
2008-05-02 15:52:33 +04:00
    BTL_ERROR(("error polling %s with %d errno says %s", cq_name[cq], ne,
2008-01-21 15:11:18 +03:00
                strerror(errno)));
2006-09-05 19:59:02 +04:00
    return count;
2005-07-01 01:28:35 +04:00
}
2007-06-14 05:59:25 +04:00
2008-05-02 15:52:33 +04:00
#if OMPI_ENABLE_PROGRESS_THREADS
2008-01-09 13:27:15 +03:00
void *mca_btl_openib_progress_thread(opal_object_t *arg)
2007-06-14 05:59:25 +04:00
{
2008-01-09 13:27:15 +03:00
    opal_thread_t *thread = (opal_thread_t *)arg;
2008-07-23 04:28:59 +04:00
    mca_btl_openib_device_t *device = thread->t_arg;
2008-01-09 13:27:15 +03:00
    struct ibv_cq *ev_cq;
    void *ev_ctx;
2007-06-14 05:59:25 +04:00
2008-01-09 13:27:15 +03:00
    /* This thread enters a cancel-enabled state */
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
2007-06-14 05:59:25 +04:00
2008-06-09 18:53:58 +04:00
    opal_output(-1, "WARNING: the openib btl progress thread code *does not yet work*. Your run is likely to hang, crash, break the kitchen sink, and/or eat your cat. You have been warned.");
2008-01-09 13:27:15 +03:00
2008-07-23 04:28:59 +04:00
    while (device->progress) {
2008-01-09 13:27:15 +03:00
        while (opal_progress_threads()) {
            while (opal_progress_threads())
                sched_yield();
            usleep(100); /* give app a chance to re-enter library */
2007-06-14 05:59:25 +04:00
        }
2008-01-09 13:27:15 +03:00
2008-07-23 04:28:59 +04:00
        if (ibv_get_cq_event(device->ib_channel, &ev_cq, &ev_ctx))
2008-01-09 13:27:15 +03:00
            BTL_ERROR(("Failed to get CQ event with error %s",
                        strerror(errno)));
        if (ibv_req_notify_cq(ev_cq, 0)) {
            BTL_ERROR(("Couldn't request CQ notification with error %s",
                        strerror(errno)));
2007-06-14 05:59:25 +04:00
        }
2008-01-09 13:27:15 +03:00
        ibv_ack_cq_events(ev_cq, 1);
2008-07-23 04:28:59 +04:00
        while (poll_device(device, 0));
2007-06-14 05:59:25 +04:00
    }
2008-01-09 13:27:15 +03:00
    return PTHREAD_CANCELED;
}
#endif
2007-06-14 05:59:25 +04:00
2008-07-23 04:28:59 +04:00
static int progress_one_device(mca_btl_openib_device_t *device)
2008-01-09 13:27:15 +03:00
{
    int i, c, count = 0, ret;
    mca_btl_openib_recv_frag_t *frag;
    mca_btl_openib_endpoint_t *endpoint;
    uint32_t non_eager_rdma_endpoints = 0;
2007-06-14 05:59:25 +04:00
2008-07-23 04:28:59 +04:00
    c = device->eager_rdma_buffers_count;
    non_eager_rdma_endpoints += (device->non_eager_rdma_endpoints + device->pollme);
2008-01-09 13:27:15 +03:00
    for (i = 0; i < c; i++) {
2008-07-23 04:28:59 +04:00
        endpoint = device->eager_rdma_buffers[i];
2008-01-09 13:27:15 +03:00
        if (!endpoint)
            continue;
        OPAL_THREAD_LOCK(&endpoint->eager_rdma_local.lock);
        frag = MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG(endpoint,
                endpoint->eager_rdma_local.head);
        if (MCA_BTL_OPENIB_RDMA_FRAG_LOCAL(frag)) {
            uint32_t size;
            mca_btl_openib_module_t *btl = endpoint->endpoint_btl;
            opal_atomic_rmb();
            if (endpoint->nbo) {
                BTL_OPENIB_FOOTER_NTOH(*frag->ftr);
2007-06-14 05:59:25 +04:00
            }
2008-01-09 13:27:15 +03:00
            size = MCA_BTL_OPENIB_RDMA_FRAG_GET_SIZE(frag->ftr);
#if OMPI_ENABLE_DEBUG
            if (frag->ftr->seq != endpoint->eager_rdma_local.seq)
                BTL_ERROR(("Eager RDMA wrong SEQ: received %d expected %d",
                           frag->ftr->seq,
                           endpoint->eager_rdma_local.seq));
            endpoint->eager_rdma_local.seq++;
#endif
            MCA_BTL_OPENIB_RDMA_NEXT_INDEX(endpoint->eager_rdma_local.head);
            OPAL_THREAD_UNLOCK(&endpoint->eager_rdma_local.lock);
            frag->hdr = (mca_btl_openib_header_t *)(((char *)frag->ftr) -
                    size + sizeof(mca_btl_openib_footer_t));
            to_base_frag(frag)->segment.seg_addr.pval =
                ((unsigned char *)frag->hdr) + sizeof(mca_btl_openib_header_t);
            ret = btl_openib_handle_incoming(btl, to_com_frag(frag)->endpoint,
                    frag, size - sizeof(mca_btl_openib_footer_t));
            if (ret != OMPI_SUCCESS) {
                btl->error_cb(&btl->super, MCA_BTL_ERROR_FLAGS_FATAL);
                return 0;
2007-06-14 05:59:25 +04:00
            }
2008-01-09 13:27:15 +03:00
            count++;
        } else
            OPAL_THREAD_UNLOCK(&endpoint->eager_rdma_local.lock);
2007-06-14 05:59:25 +04:00
    }
2008-07-23 04:28:59 +04:00
    device->eager_rdma_polls--;
2007-06-14 05:59:25 +04:00
2008-07-23 04:28:59 +04:00
    if (0 == count || non_eager_rdma_endpoints != 0 || !device->eager_rdma_polls) {
        count += poll_device(device, count);
        device->eager_rdma_polls = mca_btl_openib_component.eager_rdma_poll_ratio;
2008-01-09 13:27:15 +03:00
    }
    return count;
}
/*
 * IB component progress.
 */
static int btl_openib_component_progress(void)
{
    int i;
    int count = 0;
#if OMPI_HAVE_THREADS
    if (OPAL_UNLIKELY(mca_btl_openib_component.use_async_event_thread &&
            mca_btl_openib_component.fatal_counter)) {
        goto error;
    }
#endif
2008-07-23 04:28:59 +04:00
    for (i = 0; i < mca_btl_openib_component.devices_count; i++) {
        mca_btl_openib_device_t *device =
            opal_pointer_array_get_item(&mca_btl_openib_component.devices, i);
        count += progress_one_device(device);
2008-01-09 13:27:15 +03:00
    }
    return count;
#if OMPI_HAVE_THREADS
error:
    /* Set the fatal counter to zero */
    mca_btl_openib_component.fatal_counter = 0;
    /* Let's find all fatal events */
    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
        mca_btl_openib_module_t *openib_btl =
            mca_btl_openib_component.openib_btls[i];
2008-07-23 04:28:59 +04:00
        if (openib_btl->device->got_fatal_event) {
2008-01-09 13:27:15 +03:00
            openib_btl->error_cb(&openib_btl->super, MCA_BTL_ERROR_FLAGS_FATAL);
        }
    }
    return count;
#endif
2007-06-14 05:59:25 +04:00
}
2007-11-28 17:52:31 +03:00
int mca_btl_openib_post_srr(mca_btl_openib_module_t *openib_btl, const int qp)
{
2007-11-28 17:57:15 +03:00
    int rd_low = mca_btl_openib_component.qp_infos[qp].rd_low;
    int rd_num = mca_btl_openib_component.qp_infos[qp].rd_num;
    int num_post, i, rc;
    struct ibv_recv_wr *bad_wr, *wr_list = NULL, *wr = NULL;
2007-11-28 17:52:31 +03:00
    assert(!BTL_OPENIB_QP_TYPE_PP(qp));
    OPAL_THREAD_LOCK(&openib_btl->ib_lock);
2007-11-28 17:57:15 +03:00
    if (openib_btl->qps[qp].u.srq_qp.rd_posted > rd_low) {
        OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
        return OMPI_SUCCESS;
    }
    num_post = rd_num - openib_btl->qps[qp].u.srq_qp.rd_posted;
    for (i = 0; i < num_post; i++) {
        ompi_free_list_item_t *item;
2008-07-23 04:28:59 +04:00
        OMPI_FREE_LIST_WAIT(&openib_btl->device->qps[qp].recv_free, item, rc);
2007-11-28 17:57:15 +03:00
        to_base_frag(item)->base.order = qp;
        to_com_frag(item)->endpoint = NULL;
        if (NULL == wr)
            wr = wr_list = &to_recv_frag(item)->rd_desc;
        else
            wr = wr->next = &to_recv_frag(item)->rd_desc;
    }
    wr->next = NULL;
    rc = ibv_post_srq_recv(openib_btl->qps[qp].u.srq_qp.srq, wr_list, &bad_wr);
    if (OPAL_LIKELY(0 == rc)) {
2007-11-28 17:52:31 +03:00
        OPAL_THREAD_ADD32(&openib_btl->qps[qp].u.srq_qp.rd_posted, num_post);
2007-11-28 17:57:15 +03:00
        OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
        return OMPI_SUCCESS;
2007-11-28 17:52:31 +03:00
    }
2007-11-28 17:57:15 +03:00
    for (i = 0; wr_list && wr_list != bad_wr; i++, wr_list = wr_list->next);
    BTL_ERROR(("error posting receive descriptors to shared receive "
                "queue %d (%d from %d)", qp, i, num_post));
    OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
    return OMPI_ERROR;
2007-11-28 17:52:31 +03:00
}