2007-12-21 09:02:00 +03:00
/* -*- Mode: C; c-basic-offset:4 ; -*- */
2005-07-01 01:28:35 +04:00
/*
2007-03-17 02:11:45 +03:00
* Copyright ( c ) 2004 - 2007 The Trustees of Indiana University and Indiana
2005-11-05 22:57:48 +03:00
* University Research and Technology
* Corporation . All rights reserved .
2008-02-18 20:39:30 +03:00
* Copyright ( c ) 2004 - 2008 The University of Tennessee and The University
2005-11-05 22:57:48 +03:00
* of Tennessee Research Foundation . All rights
* reserved .
2008-01-21 15:11:18 +03:00
* Copyright ( c ) 2004 - 2005 High Performance Computing Center Stuttgart ,
2005-07-01 01:28:35 +04:00
* University of Stuttgart . All rights reserved .
* Copyright ( c ) 2004 - 2005 The Regents of the University of California .
* All rights reserved .
2008-03-29 15:54:24 +03:00
* Copyright ( c ) 2006 - 2008 Cisco Systems , Inc . All rights reserved .
2007-02-15 21:03:20 +03:00
* Copyright ( c ) 2006 - 2007 Mellanox Technologies . All rights reserved .
2007-07-25 19:03:34 +04:00
* Copyright ( c ) 2006 - 2007 Los Alamos National Security , LLC . All rights
2008-01-21 15:11:18 +03:00
* reserved .
2007-09-24 14:11:52 +04:00
* Copyright ( c ) 2006 - 2007 Voltaire All rights reserved .
2005-07-01 01:28:35 +04:00
* $ COPYRIGHT $
2008-01-21 15:11:18 +03:00
*
2005-07-01 01:28:35 +04:00
* Additional copyrights may follow
2008-01-21 15:11:18 +03:00
*
2005-07-01 01:28:35 +04:00
* $ HEADER $
*/
# include "ompi_config.h"
2007-08-07 03:40:35 +04:00
2008-01-21 15:11:18 +03:00
# include <infiniband/verbs.h>
# include <errno.h>
# include <string.h> /* for strerror()*/
2008-02-14 17:10:27 +03:00
# include <unistd.h>
2007-08-07 03:40:35 +04:00
2006-02-12 04:33:29 +03:00
# include "ompi/constants.h"
2005-07-04 03:09:55 +04:00
# include "opal/event/event.h"
2006-12-19 11:34:48 +03:00
# include "opal/include/opal/align.h"
2005-07-04 05:36:20 +04:00
# include "opal/util/if.h"
2005-07-04 04:13:44 +04:00
# include "opal/util/argv.h"
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (the search-and-replace has already been done
on the existing ORTE / OMPI code):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not their opal_* counterparts.
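The duplicate detection described above can be pictured with a small, self-contained sketch. This is not the ORTE implementation: the table layout, the fixed sizes, and the names `help_record()` and `help_flush()` are all hypothetical, chosen only to illustrate "show the first copy, count the rest, report the count periodically".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define HELP_MAX_ENTRIES 64   /* hypothetical fixed table size */
#define HELP_KEY_LEN     128  /* hypothetical max length of "file:topic" */

/* One remembered help message, keyed by "filename:topic". */
typedef struct {
    char key[HELP_KEY_LEN];
    int  pending;             /* duplicates seen since the last flush */
} help_entry_t;

typedef struct {
    help_entry_t entries[HELP_MAX_ENTRIES];
    size_t       count;
} help_table_t;

/* Returns 1 if the message should be displayed now (first sighting, or
 * the table is full and we cannot track it), 0 if it is a duplicate
 * that was only counted. */
int help_record(help_table_t *t, const char *filename, const char *topic)
{
    char key[HELP_KEY_LEN];
    size_t i;

    assert(t != NULL && filename != NULL && topic != NULL);
    snprintf(key, sizeof(key), "%s:%s", filename, topic);

    for (i = 0; i < t->count; ++i) {
        if (strcmp(t->entries[i].key, key) == 0) {
            t->entries[i].pending++;
            return 0;
        }
    }
    if (t->count == HELP_MAX_ENTRIES) {
        return 1;  /* table full: show the message rather than drop it */
    }
    strcpy(t->entries[t->count].key, key);
    t->entries[t->count].pending = 0;
    t->count++;
    return 1;
}

/* Meant to be called from a periodic timer (the text mentions ~5
 * seconds). Returns the number of duplicates counted since the last
 * call and resets the counters. */
int help_flush(help_table_t *t)
{
    int total = 0;
    size_t i;

    assert(t != NULL);
    for (i = 0; i < t->count; ++i) {
        total += t->entries[i].pending;
        t->entries[i].pending = 0;
    }
    return total;
}
```

For simplicity the sketch reports one total across all messages; a per-message summary, as the "I got X new copies" wording suggests, would return the per-entry counts instead.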
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves much like opal_output'ing to
stream 0 did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component written for the filter framework
creates an XML stream, segregating all the messages into different
XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to make configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
# include "orte/util/output.h"
2006-02-12 04:33:29 +03:00
# include "opal/sys/timer.h"
2006-09-19 17:27:05 +04:00
# include "opal/sys/atomic.h"
2007-06-14 05:59:25 +04:00
# include "opal/util/argv.h"
2006-02-12 04:33:29 +03:00
# include "opal/mca/base/mca_base_param.h"
2008-02-11 13:34:11 +03:00
# include "opal/mca/carto/carto.h"
# include "opal/mca/carto/base/base.h"
# include "opal/mca/paffinity/base/base.h"
2007-08-07 03:40:35 +04:00
2006-02-12 04:33:29 +03:00
# include "orte/mca/errmgr/errmgr.h"
2008-03-24 02:10:15 +03:00
# include "orte/util/proc_info.h"
2008-02-28 04:57:57 +03:00
# include "orte/runtime/orte_globals.h"
2007-08-07 03:40:35 +04:00
# include "ompi/proc/proc.h"
# include "ompi/mca/pml/pml.h"
# include "ompi/mca/btl/btl.h"
2008-01-21 15:11:18 +03:00
# include "ompi/mca/mpool/base/base.h"
2006-12-17 15:26:41 +03:00
# include "ompi/mca/mpool/rdma/mpool_rdma.h"
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
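The INI lookup and MTU negotiation listed above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the struct `hca_params_t` and the functions `hca_params_lookup()`, `hca_local_mtu()`, and `negotiate_mtu()` are hypothetical names, not the ones used in btl_openib_ini.l or the endpoint code, and the fallback MTU is supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached INI entry: values are keyed by the HCA's
 * vendor_id and vendor_part_id. */
typedef struct {
    uint32_t vendor_id;
    uint32_t vendor_part_id;
    uint32_t mtu;             /* bytes */
} hca_params_t;

/* Returns the cached entry for this HCA, or NULL when none of the INI
 * files had values for it. A NULL result is not a fatal error. */
const hca_params_t *hca_params_lookup(const hca_params_t *table, size_t n,
                                      uint32_t vendor_id,
                                      uint32_t vendor_part_id)
{
    size_t i;

    assert(table != NULL || n == 0);
    for (i = 0; i < n; ++i) {
        if (table[i].vendor_id == vendor_id &&
            table[i].vendor_part_id == vendor_part_id) {
            return &table[i];
        }
    }
    return NULL;
}

/* Chooses the local MTU: the INI value when an entry was found,
 * otherwise the caller's fallback. *warn is set to 1 when no entry was
 * found, so the caller can decide whether to print a warning. */
uint32_t hca_local_mtu(const hca_params_t *entry, uint32_t fallback_mtu,
                       int *warn)
{
    assert(warn != NULL);
    *warn = (entry == NULL);
    return (entry != NULL) ? entry->mtu : fallback_mtu;
}

/* Peer negotiation: both sides exchange their maximum MTU and the
 * connection uses the lesser of the two. */
uint32_t negotiate_mtu(uint32_t local_max, uint32_t remote_max)
{
    return (local_max < remote_max) ? local_max : remote_max;
}
```

Keeping the "not found" case as a flag rather than an error code mirrors the point that missing HCA params only produce a warning, which the caller may suppress.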
This commit was SVN r11182.
2006-08-14 23:30:37 +04:00
# include "ompi/mca/btl/base/base.h"
2008-01-21 15:11:18 +03:00
# include "ompi/datatype/convertor.h"
# include "ompi/mca/mpool/mpool.h"
2007-08-07 03:40:35 +04:00
# include "ompi/runtime/ompi_module_exchange.h"
2005-07-01 01:28:35 +04:00
# include "btl_openib.h"
# include "btl_openib_frag.h"
2008-01-21 15:11:18 +03:00
# include "btl_openib_endpoint.h"
2006-03-26 12:30:50 +04:00
# include "btl_openib_eager_rdma.h"
2006-06-06 00:02:41 +04:00
# include "btl_openib_proc.h"
2006-08-14 23:30:37 +04:00
# include "btl_openib_ini.h"
# include "btl_openib_mca.h"
2007-11-28 10:18:59 +03:00
# include "btl_openib_xrc.h"
2008-05-02 15:52:33 +04:00
# include "btl_openib_fd.h"
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
This commit adds Galen's fine-grained control of queue pair resources
to the OpenIB BTL (thanks to Gleb for fixing broken code and providing
additional functionality, Pasha for finding broken code, and Jeff for
doing all the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
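As an illustration of the syntax described above, here is a minimal parser sketch for a btl_openib_receive_queues value. It is not the BTL's actual parsing code: the struct `qp_spec_t`, the functions `parse_one_qp()` and `parse_receive_queues()`, and the `QP_MAX` bound are hypothetical. The ascending-size rule is checked as "size must not decrease", which is an assumption about how ties are treated.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define QP_MAX 8  /* hypothetical upper bound for this sketch */

typedef struct {
    char type;           /* 'P' = per-peer, 'S' = shared receive queue */
    long size;           /* receive buffer size in bytes */
    long num_bufs;       /* number of receive buffers to allocate */
    long low_watermark;  /* repost receive buffers at this level */
    long max_sends;      /* outstanding sends allowed (SRQ only, else 0) */
} qp_spec_t;

/* Parses one "P,size,num,low" or "S,size,num,low,sends" description.
 * Returns 0 on success, -1 if the description is malformed. */
static int parse_one_qp(const char *s, qp_spec_t *qp)
{
    int used = 0;

    memset(qp, 0, sizeof(*qp));
    if (s[0] == 'P') {
        if (sscanf(s, "P,%ld,%ld,%ld%n", &qp->size, &qp->num_bufs,
                   &qp->low_watermark, &used) != 3) {
            return -1;
        }
    } else if (s[0] == 'S') {
        if (sscanf(s, "S,%ld,%ld,%ld,%ld%n", &qp->size, &qp->num_bufs,
                   &qp->low_watermark, &qp->max_sends, &used) != 4) {
            return -1;
        }
    } else {
        return -1;
    }
    qp->type = s[0];
    /* Reject trailing characters and non-positive buffer sizes. */
    return (s[used] == '\0' && qp->size > 0) ? 0 : -1;
}

/* Parses a semicolon-delimited list of QP descriptions into out[].
 * Returns the number of QPs, or -1 if a description is malformed,
 * there are more than max entries, or the sizes decrease.
 * Uses strtok(), so it is not reentrant. */
int parse_receive_queues(const char *spec, qp_spec_t *out, int max)
{
    char *copy, *tok;
    int n = 0;

    assert(spec != NULL && out != NULL);
    copy = malloc(strlen(spec) + 1);
    if (copy == NULL) {
        return -1;
    }
    strcpy(copy, spec);

    for (tok = strtok(copy, ";"); tok != NULL; tok = strtok(NULL, ";")) {
        if (n == max || parse_one_qp(tok, &out[n]) != 0 ||
            (n > 0 && out[n].size < out[n - 1].size)) {
            free(copy);
            return -1;
        }
        n++;
    }
    free(copy);
    return n;
}
```

Fed the example string above, the sketch yields four entries, the first a per-peer QP with 128-byte buffers and the remaining three SRQ QPs with 32 outstanding sends each.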
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
# if OMPI_HAVE_THREADS
# include "btl_openib_async.h"
# endif
2007-08-07 03:40:35 +04:00
# include "connect/base.h"
2008-05-08 18:38:14 +04:00
# include "btl_openib_iwarp.h"
2005-07-12 17:38:54 +04:00
2006-08-14 23:30:37 +04:00
/*
* Local functions
*/
static int btl_openib_component_open ( void ) ;
static int btl_openib_component_close ( void ) ;
2008-01-09 13:27:15 +03:00
static mca_btl_base_module_t * * btl_openib_component_init ( int * , bool , bool ) ;
2006-08-14 23:30:37 +04:00
static int btl_openib_component_progress ( void ) ;
2005-07-01 01:28:35 +04:00
mca_btl_openib_component_t mca_btl_openib_component = {
{
/* First, the mca_base_component_t struct containing meta information
about the component itself */
{
/* Indicate that we are a btl v1.0.1 component (which also implies a
specific MCA version) */
2006-08-18 02:02:01 +04:00
MCA_BTL_BASE_VERSION_1_0_1 ,
2005-07-01 01:28:35 +04:00
2005-07-12 23:02:39 +04:00
"openib" , /* MCA component name */
Major simplifications to component versioning:
- After long discussions and ruminations on how we run components in
LAM/MPI, made the decision that, by default, all components included
in Open MPI will use the version number of their parent project
(i.e., OMPI or ORTE). They are certainly free to use a different
number, but this simplification makes the common cases easy:
- components are only released when the parent project is released
- it is easy (trivial?) to distinguish which version of a component
goes with which version of the parent project
- removed all autogen/configure code for templating the version .h
file in components
- made all ORTE components use ORTE_*_VERSION for version numbers
- made all OMPI components use OMPI_*_VERSION for version numbers
- removed all VERSION files from components
- configure now displays OPAL, ORTE, and OMPI version numbers
- ditto for ompi_info
- right now, faking it -- OPAL and ORTE and OMPI will always have the
same version number (i.e., they all come from the same top-level
VERSION file). But this paves the way for the Great Configure
Reorganization, where, among other things, each project will have
its own version number.
So all in all, we went from a boatload of version numbers to
[effectively] three. That's pretty good. :-)
This commit was SVN r6344.
2005-07-05 00:12:36 +04:00
OMPI_MAJOR_VERSION , /* MCA component major version */
OMPI_MINOR_VERSION , /* MCA component minor version */
OMPI_RELEASE_VERSION , /* MCA component release version */
2006-08-14 23:30:37 +04:00
btl_openib_component_open , /* component open */
btl_openib_component_close /* component close */
2005-07-01 01:28:35 +04:00
} ,
/* Next the MCA v1.0.0 component meta data */
{
2007-03-17 02:11:45 +03:00
/* The component is not checkpoint ready */
MCA_BASE_METADATA_PARAM_NONE
2005-07-01 01:28:35 +04:00
} ,
2008-01-21 15:11:18 +03:00
btl_openib_component_init ,
2006-08-14 23:30:37 +04:00
btl_openib_component_progress ,
2005-07-01 01:28:35 +04:00
}
} ;
/*
* Called by the MCA framework to open the component; registers
* component parameters.
*/
2006-08-14 23:30:37 +04:00
int btl_openib_component_open ( void )
2005-07-01 01:28:35 +04:00
{
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
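The MTU negotiation described in the last two bullets can be sketched as follows. This is a minimal illustration, not the BTL's actual code: `negotiate_mtu` is a hypothetical helper, and MTUs are shown as plain byte counts rather than verbs enum values.

```c
#include <assert.h>

/* Hypothetical helper: each peer advertises the largest MTU it supports
 * (in bytes here, as an assumption). Both sides apply the same rule and
 * pick the smaller value, so they agree without another round trip. */
static unsigned int negotiate_mtu(unsigned int local_max,
                                  unsigned int remote_max)
{
    assert(local_max > 0 && remote_max > 0);
    return (local_max < remote_max) ? local_max : remote_max;
}
```

Because the rule is symmetric, a peer limited to 1024 bytes talking to a peer capable of 2048 bytes ends up with 1024 on both ends.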
This commit was SVN r11182.
2006-08-14 23:30:37 +04:00
int ret ;
2006-06-09 22:02:45 +04:00
2005-07-01 01:28:35 +04:00
/* initialize state */
2006-08-14 23:30:37 +04:00
mca_btl_openib_component . ib_num_btls = 0 ;
mca_btl_openib_component . openib_btls = NULL ;
2007-12-21 09:02:00 +03:00
OBJ_CONSTRUCT ( & mca_btl_openib_component . hcas , opal_pointer_array_t ) ;
2007-08-22 13:31:12 +04:00
mca_btl_openib_component . hcas_count = 0 ;
2007-08-20 16:28:25 +04:00
2008-01-21 15:11:18 +03:00
/* initialize objects */
2005-07-03 20:22:16 +04:00
OBJ_CONSTRUCT ( & mca_btl_openib_component . ib_procs , opal_list_t ) ;
2005-07-01 01:28:35 +04:00
/* register IB component parameters */
2006-08-14 23:30:37 +04:00
ret = btl_openib_register_mca_params ( ) ;
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
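The syntax above can be sketched as a small parser. This is only an illustration under stated assumptions: `qp_desc_t` and `parse_receive_queues` are hypothetical names, not the BTL's own parsing code, and the ascending-order check here rejects only a size smaller than its predecessor.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one parsed QP description. */
typedef struct {
    char type;          /* 'P' = per-peer, 'S' = shared receive queue */
    long buf_size;      /* receive buffer size in bytes */
    long num_bufs;      /* number of receive buffers to allocate */
    long low_watermark; /* repost receive buffers at this level */
    long max_sends;     /* outstanding sends; SRQ only, 0 for per-peer */
} qp_desc_t;

/* Parse a spec such as "P,128,16,4;S,1024,256,128,32" into out[].
 * Returns the number of QPs parsed, or -1 if the spec is malformed,
 * has more than max_qps entries, or lists a buffer size smaller than
 * the one before it. */
static int parse_receive_queues(const char *spec, qp_desc_t *out, int max_qps)
{
    const char *p = spec;
    int count = 0;

    assert(spec != NULL && out != NULL);
    while (*p != '\0') {
        qp_desc_t qp = { 0, 0, 0, 0, 0 };
        int consumed = 0;

        if (count == max_qps) {
            return -1;
        }
        if (*p == 'P') {
            /* Per-peer: three numeric fields follow the type. */
            if (3 != sscanf(p, "P,%ld,%ld,%ld%n", &qp.buf_size,
                            &qp.num_bufs, &qp.low_watermark, &consumed)) {
                return -1;
            }
        } else if (*p == 'S') {
            /* SRQ: a fourth numeric field gives the outstanding sends. */
            if (4 != sscanf(p, "S,%ld,%ld,%ld,%ld%n", &qp.buf_size,
                            &qp.num_bufs, &qp.low_watermark,
                            &qp.max_sends, &consumed)) {
                return -1;
            }
        } else {
            return -1;
        }
        qp.type = *p;
        p += consumed;

        /* Each description ends at ';' or at the end of the string. */
        if (*p == ';') {
            ++p;
        } else if (*p != '\0') {
            return -1;
        }
        if (qp.buf_size <= 0 ||
            (count > 0 && qp.buf_size < out[count - 1].buf_size)) {
            return -1;
        }
        out[count++] = qp;
    }
    return count;
}
```

Fed the four-QP example string above, this sketch returns 4, with the first entry marked 'P' and the remaining three marked 'S'.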
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
mca_btl_openib_component . max_send_size =
mca_btl_openib_module . super . btl_max_send_size ;
mca_btl_openib_component . eager_limit =
mca_btl_openib_module . super . btl_eager_limit ;
2006-08-14 23:30:37 +04:00
2007-07-18 05:15:59 +04:00
srand48 ( getpid ( ) * time ( NULL ) ) ;
2006-08-14 23:30:37 +04:00
return ret ;
2005-07-01 01:28:35 +04:00
}
/*
* component cleanup - sanity checking of queue lengths
*/
2006-08-14 23:30:37 +04:00
static int btl_openib_component_close ( void )
2005-07-01 01:28:35 +04:00
{
2008-05-02 15:52:33 +04:00
ompi_btl_openib_connect_base_finalize ( ) ;
ompi_btl_openib_fd_finalize ( ) ;
2006-08-14 23:30:37 +04:00
ompi_btl_openib_ini_finalize ( ) ;
2005-07-01 01:28:35 +04:00
return OMPI_SUCCESS ;
}
2008-05-02 15:52:33 +04:00
static inline void pack8(char **dest, uint8_t value)
{
    /* Copy one character */
    **dest = (char) value;
    /* Move the dest ahead one */
    ++*dest;
}
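The cursor-advancing style of pack8 extends naturally to the per-CPC records that the modex message carries (index, priority, blob length, blob). The following is a minimal sketch of that idea; `put8`, `get8`, and `pack_cpc` are hypothetical names chosen for illustration, and the receiving side is an assumption, since it is not shown in this part of the file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same shape as pack8: store one byte, then advance the write cursor. */
static void put8(char **cursor, uint8_t value)
{
    **cursor = (char) value;
    ++*cursor;
}

/* Hypothetical mirror image: read one byte, then advance the read cursor. */
static uint8_t get8(const char **cursor)
{
    uint8_t value = (uint8_t) **cursor;
    ++*cursor;
    return value;
}

/* Pack one CPC record: index, priority, blob length, then the blob.
 * Returns the number of bytes written, which is 3 + blob_len. */
static size_t pack_cpc(char **cursor, uint8_t index, uint8_t priority,
                       const void *blob, uint8_t blob_len)
{
    char *start = *cursor;

    assert(blob != NULL || blob_len == 0);
    put8(cursor, index);
    put8(cursor, priority);
    put8(cursor, blob_len);
    if (blob_len > 0) {
        memcpy(*cursor, blob, blob_len);
        *cursor += blob_len;
    }
    return (size_t) (*cursor - start);
}
```

Passing the cursor by address keeps every call site free of manual offset arithmetic: after each call the cursor already points at the next free byte, which is why the three one-byte headers plus the blob length account for the whole record size.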
2005-10-01 02:58:09 +04:00
/*
2008-05-02 15:52:33 +04:00
* Register local openib port information with the modex so that it
* can be shared with all other peers .
2005-10-01 02:58:09 +04:00
*/
2006-08-14 23:30:37 +04:00
static int btl_openib_modex_send ( void )
2005-10-01 02:58:09 +04:00
{
2008-05-02 15:52:33 +04:00
int rc , i , j ;
int modex_message_size ;
mca_btl_openib_modex_message_t dummy ;
2008-01-15 02:22:03 +03:00
char * message , * offset ;
2008-05-02 15:52:33 +04:00
size_t size , msg_size ;
ompi_btl_openib_connect_base_module_t * cpc ;
2008-01-15 02:22:03 +03:00
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace
across the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not their opal_* functions.
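The duplicate-suppression behavior described for orte_show_help() can be sketched roughly as follows. This is an assumption-laden illustration, not the HNP's implementation: `help_table_t`, `help_record`, and `help_flush` are hypothetical names, messages are keyed by a plain string, and the ~5 second timer is left to the caller.

```c
#include <assert.h>
#include <string.h>

#define MAX_TOPICS 32   /* hypothetical table capacity */
#define MAX_KEY    64   /* hypothetical key length, including the NUL */

typedef struct {
    char key[MAX_KEY];
    int pending;        /* duplicates counted since the last report */
} help_entry_t;

typedef struct {
    help_entry_t entries[MAX_TOPICS];
    int count;
} help_table_t;

/* Record one arriving help message. Returns 1 if it should be displayed
 * now (first time this key is seen), 0 if it was only counted. */
static int help_record(help_table_t *t, const char *key)
{
    int i;

    assert(t != NULL && key != NULL);
    for (i = 0; i < t->count; ++i) {
        if (0 == strncmp(t->entries[i].key, key, MAX_KEY - 1)) {
            ++t->entries[i].pending;
            return 0;
        }
    }
    if (t->count == MAX_TOPICS) {
        return 1;       /* table full: fall back to always displaying */
    }
    strncpy(t->entries[t->count].key, key, MAX_KEY - 1);
    t->entries[t->count].key[MAX_KEY - 1] = '\0';
    t->entries[t->count].pending = 0;
    ++t->count;
    return 1;
}

/* Meant to be called from a periodic timer. Returns how many new copies
 * of the message arrived since the last call and resets the counter. */
static int help_flush(help_table_t *t, const char *key)
{
    int i;

    assert(t != NULL && key != NULL);
    for (i = 0; i < t->count; ++i) {
        if (0 == strncmp(t->entries[i].key, key, MAX_KEY - 1)) {
            int n = t->entries[i].pending;
            t->entries[i].pending = 0;
            return n;
        }
    }
    return 0;
}
```

With N processes emitting the same message, the first call displays it and the remaining N-1 are folded into a single "X new copies" report at the next flush.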
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves much like opal_output'ing to
stream 0 did, so leaving a hard-coded "0" as the first argument is
safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done in configure.m4 to search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_output(-1, "Starting to modex send");
2008-05-02 15:52:33 +04:00
if (0 == mca_btl_openib_component.ib_num_btls) {
2008-01-15 02:22:03 +03:00
    return 0;
}
2008-05-02 15:52:33 +04:00
modex_message_size = ((char *) &(dummy.end)) - ((char *) &dummy);
/* The message is packed into multiple parts:
 * 1. a uint8_t indicating the number of modules (ports) in the message
 * 2. for each module:
 *    a. the common module data
 *    b. a uint8_t indicating how many CPCs follow
 *    c. for each CPC:
 *       a. a uint8_t indicating the index of the CPC in the all[]
 *          array in btl_openib_connect_base.c
 *       b. a uint8_t indicating the priority of this CPC
 *       c. a uint8_t indicating the length of the blob to follow
 *       d. a blob that is only meaningful to that CPC
 */
msg_size =
    /* uint8_t for number of modules in the message */
    1 +
    /* For each module: */
    mca_btl_openib_component.ib_num_btls *
    (
     /* Common module data */
     modex_message_size +
     /* uint8_t for how many CPCs follow */
     1
     );
/* For each module, add in the size of the per-CPC data */
for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
    for (j = 0;
         j < mca_btl_openib_component.openib_btls[i]->num_cpcs;
         ++j) {
        msg_size +=
            /* uint8_t for the index of the CPC */
            1 +
            /* uint8_t for the CPC's priority */
            1 +
            /* uint8_t for the blob length */
            1 +
            /* blob length */
            mca_btl_openib_component.openib_btls[i]->cpcs[j]->data.cbm_modex_message_len;
    }
}
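The size calculation above can be checked in isolation. The following is a minimal sketch, not the component's code: `example_module_t` and `example_modex_size()` are hypothetical stand-ins that keep only the fields the arithmetic needs (the number of CPCs and each CPC's blob length), and `common_size` plays the role of `modex_message_size`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one module (port): only the per-CPC blob
 * lengths matter for the size calculation. */
typedef struct {
    size_t num_cpcs;
    const uint8_t *cpc_blob_lens;   /* one blob length per CPC */
} example_module_t;

/* Total bytes = 1 (module count)
 *             + per module: common_size + 1 (CPC count)
 *             + per CPC:    1 (index) + 1 (priority) + 1 (blob length) + blob */
static size_t example_modex_size(const example_module_t *mods,
                                 size_t num_mods, size_t common_size)
{
    size_t total = 1 + num_mods * (common_size + 1);
    size_t i, j;

    assert(num_mods == 0 || mods != NULL);
    for (i = 0; i < num_mods; ++i) {
        for (j = 0; j < mods[i].num_cpcs; ++j) {
            total += 3 + mods[i].cpc_blob_lens[j];
        }
    }
    return total;
}
```

For example, with two modules, a 16-byte common struct, and CPC blobs of 4, 0, and 10 bytes, the total is 1 + 2 * 17 + (7 + 3 + 13) = 58 bytes.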
2008-01-15 02:22:03 +03:00
message = malloc(msg_size);
if (NULL == message) {
2008-05-02 15:52:33 +04:00
    BTL_ERROR(("Failed malloc"));
2008-01-15 02:22:03 +03:00
    return OMPI_ERR_OUT_OF_RESOURCE;
2008-01-21 15:11:18 +03:00
}
2008-01-15 02:22:03 +03:00
2008-05-02 15:52:33 +04:00
/* Pack the number of modules */
offset = message;
pack8(&offset, mca_btl_openib_component.ib_num_btls);
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
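The duplicate suppression described for orte_show_help() can be pictured with a small sketch. This is not the HNP's implementation: the names `example_help_state_t`, `example_help_submit()`, and `example_help_flush()` are hypothetical, messages are identified by a plain string key, and a single counter covers all duplicates rather than one counter per message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EXAMPLE_MAX_TOPICS 16

typedef struct {
    /* Keys of help messages already displayed; the caller must keep
     * the strings alive for as long as the state is in use. */
    const char *topics[EXAMPLE_MAX_TOPICS];
    size_t num_topics;
    unsigned pending_dups;   /* duplicates seen since the last flush */
} example_help_state_t;

/* Returns 1 if the message should be displayed now, 0 if it was only
 * counted as a duplicate of an earlier one. */
static int example_help_submit(example_help_state_t *st, const char *topic)
{
    size_t i;

    assert(st != NULL && topic != NULL);
    for (i = 0; i < st->num_topics; ++i) {
        if (0 == strcmp(st->topics[i], topic)) {
            ++st->pending_dups;
            return 0;
        }
    }
    if (st->num_topics < EXAMPLE_MAX_TOPICS) {
        st->topics[st->num_topics++] = topic;
    }
    return 1;
}

/* Intended to be called from a periodic timer (about every 5 seconds
 * in the description above): returns how many new duplicates to
 * report and resets the counter. */
static unsigned example_help_flush(example_help_state_t *st)
{
    unsigned n;

    assert(st != NULL);
    n = st->pending_dups;
    st->pending_dups = 0;
    return n;
}
```

A caller would display the message when `example_help_submit()` returns 1, and print a "got N new copies" summary whenever `example_help_flush()` returns a non-zero count.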
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* Calling orte_output() on stream 0 behaves much like calling
opal_output() on stream 0 did, so leaving a hard-coded "0" as the
first argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to make configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_output(-1, "modex sending %d btls (packed: %d, offset now at %d)", mca_btl_openib_component.ib_num_btls, *((uint8_t *) message), (int) (offset - message));
2008-01-15 02:22:03 +03:00
2008-05-02 15:52:33 +04:00
/* Pack each of the modules */
2008-01-15 02:22:03 +03:00
for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
2008-05-02 15:52:33 +04:00
    /* Pack the modex common message struct. */
    size = modex_message_size;
    memcpy(offset,
           &(mca_btl_openib_component.openib_btls[i]->port_info),
           size);
2008-05-14 00:00:55 +04:00
    orte_output(-1, "modex packed btl port modex message: %lx, %d, %d (size: %d)",
2008-05-02 15:52:33 +04:00
                mca_btl_openib_component.openib_btls[i]->port_info.subnet_id,
                mca_btl_openib_component.openib_btls[i]->port_info.mtu,
                mca_btl_openib_component.openib_btls[i]->port_info.lid,
                (int) size);
2008-01-15 02:22:03 +03:00
#if !defined(WORDS_BIGENDIAN) && OMPI_ENABLE_HETEROGENEOUS_SUPPORT
2008-05-02 15:52:33 +04:00
    MCA_BTL_OPENIB_MODEX_MSG_HTON(*(mca_btl_openib_modex_message_t *) offset);
2008-01-15 02:22:03 +03:00
#endif
2008-05-02 15:52:33 +04:00
    offset += size;
2008-05-14 00:00:55 +04:00
    orte_output(-1, "modex packed btl %d: modex message, offset now %d",
2008-05-02 15:52:33 +04:00
                i, (int) (offset - message));
    /* Pack the number of CPCs that follow */
    pack8(&offset,
          mca_btl_openib_component.openib_btls[i]->num_cpcs);
2008-05-14 00:00:55 +04:00
    orte_output(-1, "modex packed btl %d: to pack %d cpcs (packed: %d, offset now %d)",
2008-05-02 15:52:33 +04:00
                i, mca_btl_openib_component.openib_btls[i]->num_cpcs,
                *((uint8_t *) (offset - 1)), (int) (offset - message));
    /* Pack each CPC */
    for (j = 0;
         j < mca_btl_openib_component.openib_btls[i]->num_cpcs;
         ++j) {
        uint8_t u8;
        cpc = mca_btl_openib_component.openib_btls[i]->cpcs[j];
2008-05-14 00:00:55 +04:00
        orte_output(-1, "modex packed btl %d: packing cpc %s",
2008-05-02 15:52:33 +04:00
                    i, cpc->data.cbm_component->cbc_name);
        /* Pack the CPC index */
        u8 = ompi_btl_openib_connect_base_get_cpc_index(cpc->data.cbm_component);
        pack8(&offset, u8);
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do something similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to make configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
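The recursion problem above (an output call triggers a send, and the send path calls the output function again) can be sketched with a simple re-entrancy guard. This is only an illustration of the general idea, not the workaround actually committed: `guarded_output()`, `fake_send()`, and the two counters are hypothetical names, and the static flag is not thread safe.

```c
#include <assert.h>
#include <stdio.h>

static int in_output = 0;        // nonzero while guarded_output() is active
static int local_fallbacks = 0;  // times the guard fell back to local output

// Stand-in for a transport send routine that itself tries to log.
static void fake_send(const char *msg);

// Forwards a message through fake_send(), unless we are already inside
// an output call, in which case it prints locally instead of recursing.
static void guarded_output(const char *msg)
{
    assert(msg != NULL);

    if (in_output) {
        local_fallbacks++;
        fprintf(stderr, "[local] %s\n", msg);
        return;
    }
    in_output = 1;
    fake_send(msg);
    in_output = 0;
}

static void fake_send(const char *msg)
{
    // Logging from inside the send path is what would create the loop.
    guarded_output("send: forwarding a message");
    printf("[forwarded] %s\n", msg);
}
```

A call such as `guarded_output("hello")` forwards the message once, and the nested call made from inside `fake_send()` is printed locally rather than re-entering the send path.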
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_output(-1, "packing btl %d: cpc %d: index %d (packed %d, offset now %d)",
2008-05-02 15:52:33 +04:00
i, j, u8, *((uint8_t*) (offset - 1)), (int) (offset - message));
/* Pack the CPC priority */
pack8(&offset, cpc->data.cbm_priority);
2008-05-14 00:00:55 +04:00
orte_output(-1, "packing btl %d: cpc %d: priority %d (packed %d, offset now %d)",
2008-05-02 15:52:33 +04:00
i, j, cpc->data.cbm_priority, *((uint8_t*) (offset - 1)), (int) (offset - message));
/* Pack the blob length */
u8 = cpc->data.cbm_modex_message_len;
pack8(&offset, u8);
2008-05-14 00:00:55 +04:00
orte_output(-1, "packing btl %d: cpc %d: message len %d (packed %d, offset now %d)",
2008-05-02 15:52:33 +04:00
i, j, u8, *((uint8_t*) (offset - 1)), (int) (offset - message));
/* If the blob length is > 0, pack the blob */
if (u8 > 0) {
memcpy(offset, cpc->data.cbm_modex_message, u8);
offset += u8;
2008-05-14 00:00:55 +04:00
orte_output(-1, "packing btl %d: cpc %d: blob packed %d %x (offset now %d)",
2008-05-02 15:52:33 +04:00
i, j,
((uint32_t*) cpc->data.cbm_modex_message)[0],
((uint32_t*) cpc->data.cbm_modex_message)[1],
(int) (offset - message));
}
2008-01-15 02:22:03 +03:00
2008-05-02 15:52:33 +04:00
/* Sanity check */
assert((size_t) (offset - message) <= msg_size);
}
2005-10-01 02:58:09 +04:00
}
2008-01-15 02:22:03 +03:00
2008-05-02 15:52:33 +04:00
/* All done -- send it! */
2008-01-21 15:11:18 +03:00
rc = ompi_modex_send(&mca_btl_openib_component.super.btl_version,
2008-01-15 02:22:03 +03:00
message, msg_size);
free(message);
2008-05-14 00:00:55 +04:00
orte_output(-1, "Modex sent! %d calculated, %d actual\n", (int) msg_size, (int) (offset - message));
2008-01-15 02:22:03 +03:00
2005-10-01 02:58:09 +04:00
return rc;
}
2005-11-10 23:15:02 +03:00
/*
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
This commit adds Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
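The receive-queue syntax described above can be sketched as a small parser. This is a minimal illustration under stated assumptions, not the BTL's own parsing code: `struct qp_spec`, `parse_receive_queues()`, and `MAX_QPS` are hypothetical names, "P" entries are assumed to take exactly four fields and "S" entries exactly five (as in the example string), and "ascending order" is read here as rejecting any QP whose buffer size is smaller than the previous one.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_QPS 8  // hypothetical limit for this sketch

struct qp_spec {
    char type;       // 'P' = per-peer, 'S' = shared receive queue
    long size;       // receive buffer size in bytes
    long num_bufs;   // number of receive buffers to allocate
    long low_water;  // low watermark at which buffers are reposted
    long max_sends;  // outstanding sends allowed ('S' only, else -1)
};

// Parses a string such as "P,128,16,4;S,1024,256,128,32".
// Returns the number of QPs parsed, or -1 on malformed input.
static int parse_receive_queues(const char *spec, struct qp_spec *out, int max)
{
    const char *p = spec;
    int n = 0;

    assert(spec != NULL && out != NULL);

    while (*p != '\0') {
        struct qp_spec q = { 0, 0, 0, 0, -1 };
        int used = 0;
        int fields;

        if (n == max) {
            return -1;
        }
        // %n records how many characters were consumed; it is set after
        // the fourth field and again after the optional fifth field.
        fields = sscanf(p, "%c,%ld,%ld,%ld%n,%ld%n", &q.type, &q.size,
                        &q.num_bufs, &q.low_water, &used,
                        &q.max_sends, &used);
        if (!((q.type == 'P' && fields == 4) ||
              (q.type == 'S' && fields == 5))) {
            return -1;
        }
        // Reject a QP whose buffer size is smaller than the previous one.
        if (n > 0 && q.size < out[n - 1].size) {
            return -1;
        }
        out[n++] = q;

        p += used;
        if (*p == ';') {
            p++;             // move on to the next QP description
        } else if (*p != '\0') {
            return -1;       // trailing garbage after a description
        }
    }
    return n;
}
```

Applied to the four-QP example string above, this returns 4, with the first entry being the per-peer QP (128-byte buffers, 16 buffers, low watermark 4) and the remaining three being SRQ QPs that each allow 32 outstanding sends.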
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
* Active Message Callback function on control message.
2005-11-10 23:15:02 +03:00
*/
2007-11-28 10:13:34 +03:00
static void btl_openib_control ( mca_btl_base_module_t * btl ,
mca_btl_base_tag_t tag , mca_btl_base_descriptor_t * des ,
void * cbdata )
2005-11-10 23:15:02 +03:00
{
2007-11-28 10:11:14 +03:00
/* don't return credits used for control messages */
2007-12-09 17:05:13 +03:00
mca_btl_openib_module_t * obtl = ( mca_btl_openib_module_t * ) btl ;
2007-11-28 10:13:34 +03:00
mca_btl_openib_endpoint_t * ep = to_com_frag ( des ) - > endpoint ;
2007-11-28 10:11:14 +03:00
mca_btl_openib_control_header_t * ctl_hdr =
to_base_frag ( des ) - > segment . seg_addr . pval ;
2006-03-26 12:30:50 +04:00
mca_btl_openib_eager_rdma_header_t * rdma_hdr ;
2007-12-09 17:05:13 +03:00
mca_btl_openib_header_coalesced_t * clsc_hdr =
( mca_btl_openib_header_coalesced_t * ) ( ctl_hdr + 1 ) ;
2008-01-15 08:32:53 +03:00
mca_btl_active_message_callback_t * reg ;
2007-12-09 17:05:13 +03:00
size_t len = des - > des_dst - > seg_len - sizeof ( * ctl_hdr ) ;
2008-01-21 15:11:18 +03:00
2006-03-26 12:30:50 +04:00
switch ( ctl_hdr - > type ) {
2006-09-05 20:02:09 +04:00
case MCA_BTL_OPENIB_CONTROL_CREDITS :
2007-11-28 10:13:34 +03:00
assert ( 0 ) ; /* Credit message is handled elsewhere */
2007-01-13 02:14:45 +03:00
break ;
2006-03-26 12:30:50 +04:00
case MCA_BTL_OPENIB_CONTROL_RDMA :
rdma_hdr = ( mca_btl_openib_eager_rdma_header_t * ) ctl_hdr ;
2008-01-21 15:11:18 +03:00
2008-05-02 15:52:33 +04:00
BTL_VERBOSE ( ( " prior to NTOH received rkey %lu, rdma_start.lval %llu, pval %p, ival %u " ,
2008-01-21 15:11:18 +03:00
rdma_hdr - > rkey ,
2007-01-13 02:14:45 +03:00
( unsigned long ) rdma_hdr - > rdma_start . lval ,
rdma_hdr - > rdma_start . pval ,
2007-03-05 17:17:50 +03:00
rdma_hdr - > rdma_start . ival
2007-01-13 02:14:45 +03:00
) ) ;
2008-01-21 15:11:18 +03:00
2007-11-28 10:13:34 +03:00
if ( ep - > nbo ) {
2007-11-28 10:12:44 +03:00
BTL_OPENIB_EAGER_RDMA_CONTROL_HEADER_NTOH ( * rdma_hdr ) ;
}
2008-01-21 15:11:18 +03:00
2007-11-28 10:12:44 +03:00
BTL_VERBOSE ( ( " received rkey %lu, rdma_start.lval %llu, pval %p, "
2008-05-02 15:52:33 +04:00
" ival %u " , rdma_hdr - > rkey ,
2007-11-28 10:12:44 +03:00
( unsigned long ) rdma_hdr - > rdma_start . lval ,
rdma_hdr - > rdma_start . pval , rdma_hdr - > rdma_start . ival ) ) ;
2008-01-21 15:11:18 +03:00
2007-11-28 10:13:34 +03:00
if ( ep - > eager_rdma_remote . base . pval ) {
2008-01-10 00:54:11 +03:00
BTL_ERROR ( ( " Got RDMA connect twice! " ) ) ;
return ;
2006-03-26 12:30:50 +04:00
}
2007-11-28 10:13:34 +03:00
ep - > eager_rdma_remote . rkey = rdma_hdr - > rkey ;
ep - > eager_rdma_remote . base . lval = rdma_hdr - > rdma_start . lval ;
ep - > eager_rdma_remote . tokens = mca_btl_openib_component . eager_rdma_num - 1 ;
2006-03-26 12:30:50 +04:00
break ;
2007-12-09 17:05:13 +03:00
    case MCA_BTL_OPENIB_CONTROL_COALESCED:
        while (len > 0) {
2007-12-09 17:10:25 +03:00
            size_t skip;
2007-12-09 17:05:13 +03:00
            mca_btl_base_descriptor_t tmp_des;
            mca_btl_base_segment_t tmp_seg;
            assert(len >= sizeof(*clsc_hdr));
2007-12-09 17:10:25 +03:00
            if (ep->nbo) {
                BTL_OPENIB_HEADER_COALESCED_NTOH(*clsc_hdr);
            }
            skip = (sizeof(*clsc_hdr) + clsc_hdr->alloc_size);
2007-12-09 17:05:13 +03:00
            tmp_des.des_dst = &tmp_seg;
            tmp_des.des_dst_cnt = 1;
            tmp_seg.seg_addr.pval = clsc_hdr + 1;
            tmp_seg.seg_len = clsc_hdr->size;
            /* call registered callback */
2008-01-15 08:32:53 +03:00
            reg = mca_btl_base_active_message_trigger + clsc_hdr->tag;
            reg->cbfunc(&obtl->super, clsc_hdr->tag, &tmp_des, reg->cbdata);
2007-12-09 17:05:13 +03:00
            len -= skip;
            clsc_hdr = (mca_btl_openib_header_coalesced_t *)
                (((unsigned char *) clsc_hdr) + skip);
        }
        break;
2006-03-26 12:30:50 +04:00
default :
2007-01-13 02:14:45 +03:00
BTL_ERROR ( ( " Unknown message type received by BTL " ) ) ;
2006-03-26 12:30:50 +04:00
break ;
}
2005-11-10 23:15:02 +03:00
}
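The coalesced case above walks a receive buffer that holds several sub-messages back to back, each prefixed by a small header that carries a tag, the used payload size, and the reserved payload size. The following is a minimal standalone sketch of that walk, not the BTL's actual code: `frame_hdr_t`, `frame_append`, `frame_walk`, and `sum_sizes` are hypothetical names, byte-order conversion is left out, and the sketch adds a bounds check where the original relies on an `assert`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for mca_btl_openib_header_coalesced_t. */
typedef struct {
    uint8_t  tag;        /* selects the receive callback */
    uint32_t size;       /* payload bytes in use */
    uint32_t alloc_size; /* payload bytes reserved, >= size */
} frame_hdr_t;

typedef void (*frame_cb_t)(uint8_t tag, const void *payload, uint32_t size,
                           void *cbdata);

/* Append one frame to buf at offset off and return the new offset.
   The caller must make sure buf has room for header + alloc_size. */
size_t frame_append(unsigned char *buf, size_t off, uint8_t tag,
                    const void *payload, uint32_t size, uint32_t alloc_size)
{
    frame_hdr_t hdr;

    assert(size <= alloc_size);
    memset(&hdr, 0, sizeof(hdr));
    hdr.tag = tag;
    hdr.size = size;
    hdr.alloc_size = alloc_size;
    memcpy(buf + off, &hdr, sizeof(hdr));
    memcpy(buf + off + sizeof(hdr), payload, size);
    return off + sizeof(hdr) + alloc_size;
}

/* Deliver every frame in buf[0..len) to cb.  Returns the number of
   frames delivered, or -1 if a header claims more bytes than remain. */
int frame_walk(const unsigned char *buf, size_t len, frame_cb_t cb,
               void *cbdata)
{
    int count = 0;

    while (len > 0) {
        frame_hdr_t hdr;
        size_t skip;

        if (len < sizeof(hdr)) {
            return -1;
        }
        memcpy(&hdr, buf, sizeof(hdr));      /* avoids unaligned reads */
        skip = sizeof(hdr) + hdr.alloc_size; /* header + reserved payload */
        if (hdr.size > hdr.alloc_size || skip > len) {
            return -1;
        }
        cb(hdr.tag, buf + sizeof(hdr), hdr.size, cbdata);
        buf += skip;
        len -= skip;
        count++;
    }
    return count;
}

/* Example callback: adds up the payload sizes it sees. */
void sum_sizes(uint8_t tag, const void *payload, uint32_t size, void *cbdata)
{
    (void) tag;
    (void) payload;
    *(uint32_t *) cbdata += size;
}
```

Advancing by `alloc_size` rather than `size` mirrors the `skip` computation in the loop above: the sender may reserve more room per sub-message than it ends up using.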
2006-12-17 15:26:41 +03:00
/* Register [base, base + size) with the HCA's protection domain so the
   region can be written locally and read or written remotely. */
static int openib_reg_mr(void *reg_data, void *base, size_t size,
                         mca_mpool_base_registration_t *reg)
{
    mca_btl_openib_hca_t *hca = (mca_btl_openib_hca_t *) reg_data;
    mca_btl_openib_reg_t *openib_reg = (mca_btl_openib_reg_t *) reg;

    openib_reg->mr = ibv_reg_mr(hca->ib_pd, base, size, IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
    if (NULL == openib_reg->mr) {
        return OMPI_ERR_OUT_OF_RESOURCE;
    }
    return OMPI_SUCCESS;
}
static int openib_dereg_mr(void *reg_data, mca_mpool_base_registration_t *reg)
{
    mca_btl_openib_reg_t *openib_reg = (mca_btl_openib_reg_t *) reg;

    if (openib_reg->mr != NULL) {
        if (ibv_dereg_mr(openib_reg->mr)) {
2008-05-02 15:52:33 +04:00
            BTL_ERROR(("%s: error unpinning openib memory errno says %s",
2007-10-15 21:53:02 +04:00
                       __func__, strerror(errno)));
2006-12-17 15:26:41 +03:00
            return OMPI_ERROR;
        }
    }
    openib_reg->mr = NULL;
    return OMPI_SUCCESS;
}
2007-06-13 16:47:38 +04:00
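/* Register an integer MCA parameter under the "btl" framework and the
   "openib" component with the given default, then return its current
   value (the default unless the user has overridden it). */
static inline int param_register_int(const char *param_name, int default_value)
{
    int param_value = default_value;
    int id = mca_base_param_register_int("btl", "openib", param_name, NULL,
                                         default_value);
    mca_base_param_lookup_int(id, &param_value);
    return param_value;
}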
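Later in this file, param_register_int() is called with progressively more specific names ("bandwidth_<device>", then "bandwidth_<device>:<port>", then "bandwidth_<device>:<port>:<lid>"), each time passing the previous result as the new default, so the most specific setting wins. The sketch below illustrates that cascade in isolation. It assumes a hypothetical in-memory table (`struct setting`, `lookup_int`, `resolve_bandwidth`) in place of the real MCA parameter system, and the device name used with it is only an example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the MCA parameter store: name/value pairs
   that the user has set explicitly. */
struct setting {
    const char *name;
    int value;
};

/* Return the configured value for name, or fallback if it is unset. */
int lookup_int(const struct setting *tbl, size_t n, const char *name,
               int fallback)
{
    for (size_t i = 0; i < n; i++) {
        if (0 == strcmp(tbl[i].name, name)) {
            return tbl[i].value;
        }
    }
    return fallback;
}

/* Resolve the bandwidth for one device/port/LID.  Each lookup uses the
   previous result as its fallback, so the most specific name wins. */
int resolve_bandwidth(const struct setting *tbl, size_t n, const char *dev,
                      int port, int lid, int dflt)
{
    char name[64];
    int bw = dflt;

    assert(NULL != dev);
    snprintf(name, sizeof(name), "bandwidth_%s", dev);
    bw = lookup_int(tbl, n, name, bw);
    snprintf(name, sizeof(name), "bandwidth_%s:%d", dev, port);
    bw = lookup_int(tbl, n, name, bw);
    snprintf(name, sizeof(name), "bandwidth_%s:%d:%d", dev, port, lid);
    return lookup_int(tbl, n, name, bw);
}
```

With only "bandwidth_mthca0" and "bandwidth_mthca0:1:5" set, port 1 / LID 5 picks up the LID-level value, any other port on that device falls back to the device-level value, and other devices keep the built-in default.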
2006-12-17 15:26:41 +03:00
2007-11-28 10:14:34 +03:00
#if OMPI_HAVE_THREADS
static int start_async_event_thread(void)
{
    /* Set the fatal counter to zero */
    mca_btl_openib_component.fatal_counter = 0;

    /* Create pipe for communication with async event thread */
    if (pipe(mca_btl_openib_component.async_pipe)) {
        BTL_ERROR(("Failed to create pipe for communication with "
                   "async event thread"));
        return OMPI_ERROR;
    }

    /* Starting async event thread for the component */
    if (pthread_create(&mca_btl_openib_component.async_thread, NULL,
                       (void *(*)(void *)) btl_openib_async_thread, NULL)) {
        BTL_ERROR(("Failed to create async event thread"));
        return OMPI_ERROR;
    }
    return OMPI_SUCCESS;
}
#endif
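start_async_event_thread() pairs a pipe with a helper thread: the main thread keeps the write end and the helper blocks on the read end. A minimal standalone sketch of that pattern follows. The one-byte command protocol ('q' to quit, anything else counted as a command) and the names `async_ctx`, `async_start`, and `async_stop` are assumptions for illustration; the real btl_openib_async_thread is not shown in this section and may behave differently. Unlike the function above, the sketch also closes the pipe if thread creation fails.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

struct async_ctx {
    int pipe_fd[2];   /* [0] read end (worker), [1] write end (main) */
    pthread_t thread;
    int handled;      /* commands processed by the worker */
};

/* Worker: block on the pipe until a 'q' byte arrives or the pipe closes. */
static void *async_worker(void *arg)
{
    struct async_ctx *ctx = arg;
    char cmd;

    while (1 == read(ctx->pipe_fd[0], &cmd, 1)) {
        if ('q' == cmd) {
            break;
        }
        ctx->handled++;
    }
    return NULL;
}

/* Create the pipe, then the thread.  Returns 0 on success, -1 on error. */
int async_start(struct async_ctx *ctx)
{
    assert(NULL != ctx);
    ctx->handled = 0;
    if (pipe(ctx->pipe_fd)) {
        return -1;
    }
    if (pthread_create(&ctx->thread, NULL, async_worker, ctx)) {
        /* Do not leak the descriptors if the thread never started. */
        close(ctx->pipe_fd[0]);
        close(ctx->pipe_fd[1]);
        return -1;
    }
    return 0;
}

/* Ask the worker to quit, wait for it, and release the pipe. */
int async_stop(struct async_ctx *ctx)
{
    char quit = 'q';

    if (1 != write(ctx->pipe_fd[1], &quit, 1)) {
        return -1;
    }
    if (pthread_join(ctx->thread, NULL)) {
        return -1;
    }
    close(ctx->pipe_fd[0]);
    close(ctx->pipe_fd[1]);
    return 0;
}
```

Reading `handled` after async_stop() returns is safe because pthread_join() orders the worker's writes before the caller's subsequent reads.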
2006-06-28 11:23:08 +04:00
static int init_one_port ( opal_list_t * btl_list , mca_btl_openib_hca_t * hca ,
2007-04-22 14:22:12 +04:00
uint8_t port_num , uint16_t pkey_index ,
struct ibv_port_attr * ib_port_attr )
2006-06-28 11:23:08 +04:00
{
2008-01-28 13:38:08 +03:00
uint16_t lid , i , lmc , lmc_step ;
2006-06-28 11:23:08 +04:00
mca_btl_openib_module_t * openib_btl ;
mca_btl_base_selected_module_t * ib_selected ;
2006-09-22 14:27:12 +04:00
union ibv_gid gid ;
2007-01-13 01:42:20 +03:00
uint64_t subnet_id ;
2007-04-21 04:15:05 +04:00
2006-09-26 16:12:33 +04:00
ibv_query_gid ( hca - > ib_dev_context , port_num , 0 , & gid ) ;
2008-05-02 15:52:33 +04:00
/* If we have struct ibv_device.transport_type, then we're >= OFED
2008-05-07 18:51:36 +04:00
v1 .2 , and the transport could be iWarp or IB . If we don ' t have
that member , then we ' re < OFED v1 .2 , and it can only be IB . */
2008-05-02 15:52:33 +04:00
# if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
if ( IBV_TRANSPORT_IWARP = = hca - > ib_dev - > transport_type ) {
2008-05-07 02:43:52 +04:00
subnet_id = get_iwarp_subnet_id ( hca - > ib_dev ) ;
2008-05-02 15:52:33 +04:00
} else {
subnet_id = ntoh64 ( gid . global . subnet_prefix ) ;
}
# else
2007-01-13 01:42:20 +03:00
subnet_id = ntoh64 ( gid . global . subnet_prefix ) ;
2008-05-02 15:52:33 +04:00
# endif
    BTL_VERBOSE(("my subnet_id is %016llx", (unsigned long long) subnet_id));
2008-01-21 15:11:18 +03:00
2006-09-26 16:12:33 +04:00
if ( mca_btl_openib_component . ib_num_btls > 0 & &
2007-01-13 01:42:20 +03:00
IB_DEFAULT_GID_PREFIX = = subnet_id & &
2006-09-26 16:12:33 +04:00
mca_btl_openib_component . warn_default_gid_prefix ) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " , " default subnet prefix " ,
2008-03-24 02:10:15 +03:00
true , orte_process_info . nodename ) ;
2006-09-26 16:12:33 +04:00
}
2006-06-28 11:23:08 +04:00
lmc = ( 1 < < ib_port_attr - > lmc ) ;
2008-01-28 13:38:08 +03:00
lmc_step = 1 ;
2006-06-28 11:23:08 +04:00
2008-01-21 15:11:18 +03:00
if ( 0 ! = mca_btl_openib_component . max_lmc & &
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
2006-08-14 23:30:37 +04:00
mca_btl_openib_component . max_lmc < lmc ) {
2006-06-28 11:23:08 +04:00
lmc = mca_btl_openib_component . max_lmc ;
2006-08-14 23:30:37 +04:00
}
2006-06-28 11:23:08 +04:00
2008-01-28 19:10:18 +03:00
# if OMPI_HAVE_THREADS
2008-05-02 15:52:33 +04:00
/* APM support -- only meaningful if async event support is
enabled . If async events are not enabled , then there ' s nothing
to listen for the APM event to load the new path , so it ' s not
worth enabling APM . */
2008-01-28 13:38:08 +03:00
if ( lmc > 1 ) {
2008-02-20 16:44:05 +03:00
if ( - 1 = = mca_btl_openib_component . apm_lmc ) {
2008-01-28 13:38:08 +03:00
lmc_step = lmc ;
2008-02-20 16:44:05 +03:00
mca_btl_openib_component . apm_lmc = lmc - 1 ;
} else if ( 0 = = lmc % ( mca_btl_openib_component . apm_lmc + 1 ) ) {
lmc_step = mca_btl_openib_component . apm_lmc + 1 ;
2008-01-28 13:38:08 +03:00
} else {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " , " apm with wrong lmc " , true ,
2008-02-20 16:44:05 +03:00
mca_btl_openib_component . apm_lmc , lmc ) ;
2008-01-28 13:38:08 +03:00
return OMPI_ERROR ;
}
} else {
2008-02-20 16:44:05 +03:00
if ( mca_btl_openib_component . apm_lmc ) {
2008-01-28 13:38:08 +03:00
/* Disable apm and report warning */
2008-02-20 16:44:05 +03:00
mca_btl_openib_component . apm_lmc = 0 ;
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " , " apm without lmc " , true ) ;
2008-01-28 13:38:08 +03:00
}
}
2008-01-28 19:10:18 +03:00
# endif
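The LMC/APM logic above decides how far apart the LIDs used for separate BTL modules are: with APM enabled, the LIDs in between are left for alternate paths instead of getting modules of their own. Below is a small sketch of that decision, assuming (from the code above) that an apm_lmc of -1 means "use all extra LIDs as alternate paths" and 0 means APM is off. `lid_stride` and `lids_visited` are hypothetical names, and the warning and error messages are omitted.

```c
#include <assert.h>

/* Pick the LID stride for a port that owns lid_count consecutive LIDs.
   Returns the stride, or -1 if lid_count is not a multiple of
   (*apm_lmc + 1).  *apm_lmc may be updated, as in the code above. */
int lid_stride(int lid_count, int *apm_lmc)
{
    assert(lid_count >= 1 && *apm_lmc >= -1);
    if (lid_count > 1) {
        if (-1 == *apm_lmc) {
            /* One module; every other LID is an alternate path. */
            *apm_lmc = lid_count - 1;
            return lid_count;
        }
        if (0 == lid_count % (*apm_lmc + 1)) {
            return *apm_lmc + 1;
        }
        return -1;  /* requested APM layout does not fit the LID range */
    }
    /* A single LID leaves no room for alternate paths: APM is disabled. */
    *apm_lmc = 0;
    return 1;
}

/* Number of base LIDs the outer loop visits for a given stride. */
int lids_visited(int lid_count, int stride)
{
    int n = 0;

    assert(stride >= 1);
    for (int lid = 0; lid < lid_count; lid += stride) {
        n++;
    }
    return n;
}
```

For example, a port with 4 LIDs and apm_lmc set to 1 yields a stride of 2, so two base LIDs get modules and each has one neighbouring LID available as an alternate path.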
2008-01-28 13:38:08 +03:00
2006-06-28 11:23:08 +04:00
for ( lid = ib_port_attr - > lid ;
2008-01-28 13:38:08 +03:00
lid < ib_port_attr - > lid + lmc ; lid + = lmc_step ) {
2006-06-28 11:23:08 +04:00
for ( i = 0 ; i < mca_btl_openib_component . btls_per_lid ; i + + ) {
2007-06-13 16:47:38 +04:00
char param [ 40 ] ;
2008-01-15 02:22:03 +03:00
int rc ;
2006-06-28 11:23:08 +04:00
openib_btl = malloc ( sizeof ( mca_btl_openib_module_t ) ) ;
if ( NULL = = openib_btl ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2006-08-14 23:30:37 +04:00
return OMPI_ERR_OUT_OF_RESOURCE ;
2006-06-28 11:23:08 +04:00
}
memcpy ( openib_btl , & mca_btl_openib_module ,
sizeof ( mca_btl_openib_module ) ) ;
memcpy ( & openib_btl - > ib_port_attr , ib_port_attr ,
sizeof ( struct ibv_port_attr ) ) ;
ib_selected = OBJ_NEW ( mca_btl_base_selected_module_t ) ;
ib_selected - > btl_module = ( mca_btl_base_module_t * ) openib_btl ;
openib_btl - > hca = hca ;
openib_btl - > port_num = ( uint8_t ) port_num ;
2007-04-22 14:22:12 +04:00
openib_btl - > pkey_index = pkey_index ;
2006-06-28 11:23:08 +04:00
openib_btl - > lid = lid ;
2008-02-20 16:44:05 +03:00
openib_btl - > apm_port = 0 ;
2006-06-28 11:23:08 +04:00
openib_btl - > src_path_bits = lid - ib_port_attr - > lid ;
2008-05-02 15:52:33 +04:00
2007-01-13 01:42:20 +03:00
openib_btl - > port_info . subnet_id = subnet_id ;
2006-08-14 23:30:37 +04:00
openib_btl - > port_info . mtu = hca - > mtu ;
2008-02-20 16:44:05 +03:00
openib_btl - > port_info . lid = lid ;
2008-05-02 15:52:33 +04:00
openib_btl - > cpcs = NULL ;
openib_btl - > num_cpcs = 0 ;
/* Do we have at least one CPC that can handle this
port ? */
rc =
ompi_btl_openib_connect_base_select_for_local_port ( openib_btl ) ;
if ( OMPI_ERR_NOT_SUPPORTED = = rc ) {
2008-01-15 02:22:03 +03:00
continue ;
2008-05-02 15:52:33 +04:00
} else if ( OMPI_SUCCESS ! = rc ) {
return rc ;
2008-01-15 02:22:03 +03:00
}
2008-01-15 08:32:53 +03:00
mca_btl_base_active_message_trigger [ MCA_BTL_TAG_IB ] . cbfunc = btl_openib_control ;
mca_btl_base_active_message_trigger [ MCA_BTL_TAG_IB ] . cbdata = NULL ;
2007-04-21 04:15:05 +04:00
2007-06-13 16:47:38 +04:00
/* Check bandwidth configured for this HCA */
sprintf ( param , " bandwidth_%s " , ibv_get_device_name ( hca - > ib_dev ) ) ;
openib_btl - > super . btl_bandwidth =
param_register_int ( param , openib_btl - > super . btl_bandwidth ) ;
/* Check bandwidth configured for this HCA/port */
sprintf ( param , " bandwidth_%s:%d " , ibv_get_device_name ( hca - > ib_dev ) ,
port_num ) ;
openib_btl - > super . btl_bandwidth =
param_register_int ( param , openib_btl - > super . btl_bandwidth ) ;
/* Check bandwidth configured for this HCA/port/LID */
sprintf ( param , " bandwidth_%s:%d:%d " ,
ibv_get_device_name ( hca - > ib_dev ) , port_num , lid ) ;
openib_btl - > super . btl_bandwidth =
param_register_int ( param , openib_btl - > super . btl_bandwidth ) ;
/* Check latency configured for this HCA */
sprintf ( param , " latency_%s " , ibv_get_device_name ( hca - > ib_dev ) ) ;
openib_btl - > super . btl_latency =
param_register_int ( param , openib_btl - > super . btl_latency ) ;
/* Check latency configured for this HCA/port */
sprintf ( param , " latency_%s:%d " , ibv_get_device_name ( hca - > ib_dev ) ,
port_num ) ;
openib_btl - > super . btl_latency =
param_register_int ( param , openib_btl - > super . btl_latency ) ;
/* Check latency configured for this HCA/port/LID */
sprintf ( param , " latency_%s:%d:%d " , ibv_get_device_name ( hca - > ib_dev ) ,
port_num , lid ) ;
openib_btl - > super . btl_latency =
param_register_int ( param , openib_btl - > super . btl_latency ) ;
2007-04-21 04:15:05 +04:00
/* Auto-detect the port bandwidth */
if ( 0 = = openib_btl - > super . btl_bandwidth ) {
/* To calculate the bandwidth available on this port,
we have to look up the values corresponding to
port - > active_speed and port - > active_width . These
are enums corresponding to the IB spec . Overall
formula is 80 % of the reported speed ( to get the
true link speed ) times the number of links . */
switch ( ib_port_attr - > active_speed ) {
2008-01-21 15:11:18 +03:00
case 1 :
2007-04-21 04:15:05 +04:00
/* 2.5Gbps * 0.8, in megabits */
openib_btl - > super . btl_bandwidth = 2000 ;
break ;
2008-01-21 15:11:18 +03:00
case 2 :
2007-04-21 04:15:05 +04:00
/* 5.0Gbps * 0.8, in megabits */
openib_btl - > super . btl_bandwidth = 4000 ;
break ;
2008-01-21 15:11:18 +03:00
case 4 :
2007-04-21 04:15:05 +04:00
/* 10.0Gbps * 0.8, in megabits */
openib_btl - > super . btl_bandwidth = 8000 ;
break ;
2008-01-21 15:11:18 +03:00
default :
2008-02-18 23:28:06 +03:00
/* Who knows? Declare this port unreachable (do
* not * return OMPI_ERR_VALUE_OUT_OF_BOUNDS ; that
is reserved for when we exceed the number of
allowable BTLs ) . */
return OMPI_ERR_UNREACH ;
2007-04-21 04:15:05 +04:00
}
switch ( ib_port_attr - > active_width ) {
case 1 :
/* 1x */
/* unity */
break ;
case 2 :
/* 4x */
openib_btl - > super . btl_bandwidth * = 4 ;
break ;
case 4 :
/* 8x */
openib_btl - > super . btl_bandwidth * = 8 ;
break ;
case 8 :
/* 12x */
openib_btl - > super . btl_bandwidth * = 12 ;
break ;
default :
2008-02-18 23:28:06 +03:00
/* Who knows? Declare this port unreachable (do
* not * return OMPI_ERR_VALUE_OUT_OF_BOUNDS ; that
is reserved for when we exceed the number of
allowable BTLs ) . */
return OMPI_ERR_UNREACH ;
2007-04-21 04:15:05 +04:00
}
}
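The auto-detection arithmetic above can be restated as a small standalone function. This is only a sketch: `estimate_bandwidth_mbps` is a hypothetical name, and returning -1 stands in for the `OMPI_ERR_UNREACH` return used by the real code. The per-lane figures and width multipliers are the ones in the two switch statements.

```c
/* Hypothetical restatement of the two switch statements above.
   Returns the estimated bandwidth in megabits per second, or -1 when
   either enum value is not recognized. */
static int estimate_bandwidth_mbps(int active_speed, int active_width)
{
    int per_lane_mbps;
    int lanes;

    switch (active_speed) {
    case 1: per_lane_mbps = 2000; break;   /* 2.5 Gbps * 0.8 */
    case 2: per_lane_mbps = 4000; break;   /* 5.0 Gbps * 0.8 */
    case 4: per_lane_mbps = 8000; break;   /* 10.0 Gbps * 0.8 */
    default: return -1;
    }

    switch (active_width) {
    case 1: lanes = 1; break;              /* 1x */
    case 2: lanes = 4; break;              /* 4x */
    case 4: lanes = 8; break;              /* 8x */
    case 8: lanes = 12; break;             /* 12x */
    default: return -1;
    }

    return per_lane_mbps * lanes;
}
```

For example, a 4x link at 5.0 Gbps per lane (speed 2, width 2) comes out as 4000 * 4 = 16000 Mbps.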
2006-06-28 11:23:08 +04:00
opal_list_append ( btl_list , ( opal_list_item_t * ) ib_selected ) ;
2008-02-20 16:44:05 +03:00
opal_pointer_array_add ( hca - > hca_btls , ( void * ) openib_btl ) ;
2006-06-28 11:23:08 +04:00
hca - > btls + + ;
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
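The left-to-right walk over the colon-delimited btl_openib_hca_param_files list might look roughly like the sketch below. `next_param_file` is a hypothetical iterator written for illustration; the real component has its own INI reading functions, which are not shown here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical iterator over a colon-delimited list of file names.
   *cursor starts at the beginning of the list and is advanced on each
   call.  Returns 1 and copies the next entry into buf, 0 when the list
   is exhausted, or -1 if the entry does not fit in buf (in which case
   *cursor is left unchanged). */
static int next_param_file(const char **cursor, char *buf, size_t buflen)
{
    const char *start, *end;
    size_t len;

    if (NULL == *cursor) {
        return 0;
    }
    start = *cursor;
    /* Skip empty entries such as the middle of "a::b" */
    while (':' == *start) {
        ++start;
    }
    if ('\0' == *start) {
        *cursor = NULL;
        return 0;
    }
    end = strchr(start, ':');
    len = (NULL != end) ? (size_t) (end - start) : strlen(start);
    if (len >= buflen) {
        return -1;
    }
    memcpy(buf, start, len);
    buf[len] = '\0';
    *cursor = (NULL != end) ? end + 1 : NULL;
    return 1;
}
```

A caller would loop until the function returns 0, reading each file in turn, so that entries earlier in the list are seen first.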
2006-08-14 23:30:37 +04:00
+ + mca_btl_openib_component . ib_num_btls ;
if ( - 1 ! = mca_btl_openib_component . ib_max_btls & &
mca_btl_openib_component . ib_num_btls > =
mca_btl_openib_component . ib_max_btls ) {
2006-09-25 15:18:20 +04:00
return OMPI_ERR_VALUE_OUT_OF_BOUNDS ;
2006-08-14 23:30:37 +04:00
}
2006-06-28 11:23:08 +04:00
}
}
2006-08-14 23:30:37 +04:00
return OMPI_SUCCESS ;
2006-06-28 11:23:08 +04:00
}
2007-12-23 15:29:34 +03:00
static void hca_construct ( mca_btl_openib_hca_t * hca )
{
2008-01-09 13:26:21 +03:00
int i ;
2007-12-23 15:29:34 +03:00
hca - > ib_dev = NULL ;
hca - > ib_dev_context = NULL ;
2008-03-02 12:11:23 +03:00
hca - > ib_pd = NULL ;
2008-01-09 13:26:21 +03:00
hca - > mpool = NULL ;
2008-05-02 15:52:33 +04:00
# if OMPI_ENABLE_PROGRESS_THREADS
2008-01-09 13:26:21 +03:00
hca - > ib_channel = NULL ;
# endif
2007-12-23 15:29:34 +03:00
hca - > btls = 0 ;
hca - > ib_cq [ BTL_OPENIB_HP_CQ ] = NULL ;
hca - > ib_cq [ BTL_OPENIB_LP_CQ ] = NULL ;
hca - > cq_size [ BTL_OPENIB_HP_CQ ] = 0 ;
hca - > cq_size [ BTL_OPENIB_LP_CQ ] = 0 ;
hca - > non_eager_rdma_endpoints = 0 ;
hca - > hp_cq_polls = mca_btl_openib_component . cq_poll_ratio ;
hca - > eager_rdma_polls = mca_btl_openib_component . eager_rdma_poll_ratio ;
hca - > pollme = true ;
hca - > eager_rdma_buffers_count = 0 ;
hca - > eager_rdma_buffers = NULL ;
2008-01-09 13:26:21 +03:00
# if HAVE_XRC
hca - > xrc_fd = - 1 ;
# endif
hca - > qps = ( mca_btl_openib_hca_qp_t * ) calloc ( mca_btl_openib_component . num_qps ,
sizeof ( mca_btl_openib_hca_qp_t ) ) ;
2008-01-21 15:11:18 +03:00
OBJ_CONSTRUCT ( & hca - > hca_lock , opal_mutex_t ) ;
2008-01-09 13:26:21 +03:00
for ( i = 0 ; i < mca_btl_openib_component . num_qps ; i + + ) {
OBJ_CONSTRUCT ( & hca - > qps [ i ] . send_free , ompi_free_list_t ) ;
OBJ_CONSTRUCT ( & hca - > qps [ i ] . recv_free , ompi_free_list_t ) ;
}
OBJ_CONSTRUCT ( & hca - > send_free_control , ompi_free_list_t ) ;
2007-12-23 15:29:34 +03:00
}
static void hca_destruct ( mca_btl_openib_hca_t * hca )
{
2008-01-09 13:26:21 +03:00
int i ;
2007-12-23 15:29:34 +03:00
if ( hca - > eager_rdma_buffers ) {
int i ;
for ( i = 0 ; i < hca - > eager_rdma_buffers_count ; i + + )
if ( hca - > eager_rdma_buffers [ i ] )
OBJ_RELEASE ( hca - > eager_rdma_buffers [ i ] ) ;
free ( hca - > eager_rdma_buffers ) ;
}
2008-01-21 15:11:18 +03:00
OBJ_DESTRUCT ( & hca - > hca_lock ) ;
2008-01-09 13:26:21 +03:00
for ( i = 0 ; i < mca_btl_openib_component . num_qps ; i + + ) {
OBJ_DESTRUCT ( & hca - > qps [ i ] . send_free ) ;
OBJ_DESTRUCT ( & hca - > qps [ i ] . recv_free ) ;
}
OBJ_DESTRUCT ( & hca - > send_free_control ) ;
if ( hca - > qps )
free ( hca - > qps ) ;
2007-12-23 15:29:34 +03:00
}
OBJ_CLASS_INSTANCE ( mca_btl_openib_hca_t , opal_object_t , hca_construct ,
hca_destruct ) ;
2008-01-09 13:26:21 +03:00
static int prepare_hca_for_use ( mca_btl_openib_hca_t * hca )
{
mca_btl_openib_frag_init_data_t * init_data ;
int qp , length ;
# if OMPI_HAVE_THREADS
if ( mca_btl_openib_component . use_async_event_thread ) {
if ( 0 = = mca_btl_openib_component . async_thread ) {
/* async thread is not yet started, so start it here */
if ( start_async_event_thread ( ) ! = OMPI_SUCCESS )
return OMPI_ERROR ;
}
hca - > got_fatal_event = false ;
if ( write ( mca_btl_openib_component . async_pipe [ 1 ] ,
& hca - > ib_dev_context - > async_fd , sizeof ( int ) ) < 0 ) {
BTL_ERROR ( ( " Failed to write to pipe [%d] " , errno ) ) ;
return OMPI_ERROR ;
2008-01-21 15:11:18 +03:00
}
2008-01-09 13:26:21 +03:00
}
# if OMPI_ENABLE_PROGRESS_THREADS == 1
/* Prepare data for thread, but not starting it */
OBJ_CONSTRUCT ( & hca - > thread , opal_thread_t ) ;
hca - > thread . t_run = mca_btl_openib_progress_thread ;
hca - > thread . t_arg = hca ;
hca - > progress = false ;
# endif
# endif
hca - > endpoints = OBJ_NEW ( opal_pointer_array_t ) ;
opal_pointer_array_init ( hca - > endpoints , 10 , INT_MAX , 10 ) ;
opal_pointer_array_add ( & mca_btl_openib_component . hcas , hca ) ;
if ( mca_btl_openib_component . max_eager_rdma > 0 & &
mca_btl_openib_component . use_eager_rdma & &
hca - > use_eager_rdma ) {
hca - > eager_rdma_buffers =
calloc ( mca_btl_openib_component . max_eager_rdma * hca - > btls ,
sizeof ( mca_btl_openib_endpoint_t * ) ) ;
if ( NULL = = hca - > eager_rdma_buffers ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Memory allocation fails " ) ) ;
2008-01-09 13:26:21 +03:00
return OMPI_ERR_OUT_OF_RESOURCE ;
}
}
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
length = sizeof ( mca_btl_openib_header_t ) +
2008-01-21 15:11:18 +03:00
sizeof ( mca_btl_openib_footer_t ) +
2008-01-09 13:26:21 +03:00
sizeof ( mca_btl_openib_eager_rdma_header_t ) ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data - > order = MCA_BTL_NO_ORDER ;
init_data - > list = & hca - > send_free_control ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex_new (
& hca - > send_free_control ,
sizeof ( mca_btl_openib_send_control_frag_t ) , CACHE_LINE_SIZE ,
OBJ_CLASS ( mca_btl_openib_send_control_frag_t ) , length ,
mca_btl_openib_component . buffer_alignment ,
mca_btl_openib_component . ib_free_list_num , - 1 ,
mca_btl_openib_component . ib_free_list_inc ,
hca - > mpool , mca_btl_openib_frag_init ,
2008-01-21 15:11:18 +03:00
init_data ) ) {
2008-01-09 13:26:21 +03:00
return OMPI_ERROR ;
}
2008-01-21 15:11:18 +03:00
/* setup all the qps */
for ( qp = 0 ; qp < mca_btl_openib_component . num_qps ; qp + + ) {
2008-01-09 13:26:21 +03:00
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
2008-01-21 15:11:18 +03:00
/* Initialize pool of send fragments */
2008-01-09 13:26:21 +03:00
length = sizeof ( mca_btl_openib_header_t ) +
sizeof ( mca_btl_openib_header_coalesced_t ) +
sizeof ( mca_btl_openib_control_header_t ) +
sizeof ( mca_btl_openib_footer_t ) +
mca_btl_openib_component . qp_infos [ qp ] . size ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data - > order = qp ;
init_data - > list = & hca - > qps [ qp ] . send_free ;
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex_new ( init_data - > list ,
sizeof ( mca_btl_openib_send_frag_t ) , CACHE_LINE_SIZE ,
OBJ_CLASS ( mca_btl_openib_send_frag_t ) , length ,
mca_btl_openib_component . buffer_alignment ,
mca_btl_openib_component . ib_free_list_num ,
mca_btl_openib_component . ib_free_list_max ,
mca_btl_openib_component . ib_free_list_inc ,
hca - > mpool , mca_btl_openib_frag_init ,
2008-01-21 15:11:18 +03:00
init_data ) ) {
2008-01-09 13:26:21 +03:00
return OMPI_ERROR ;
}
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data = malloc ( sizeof ( mca_btl_openib_frag_init_data_t ) ) ;
length = sizeof ( mca_btl_openib_header_t ) +
sizeof ( mca_btl_openib_header_coalesced_t ) +
sizeof ( mca_btl_openib_control_header_t ) +
sizeof ( mca_btl_openib_footer_t ) +
mca_btl_openib_component . qp_infos [ qp ] . size ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
init_data - > order = qp ;
init_data - > list = & hca - > qps [ qp ] . recv_free ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
if ( OMPI_SUCCESS ! = ompi_free_list_init_ex_new ( init_data - > list ,
sizeof ( mca_btl_openib_recv_frag_t ) , CACHE_LINE_SIZE ,
OBJ_CLASS ( mca_btl_openib_recv_frag_t ) ,
length , mca_btl_openib_component . buffer_alignment ,
mca_btl_openib_component . ib_free_list_num ,
mca_btl_openib_component . ib_free_list_max ,
mca_btl_openib_component . ib_free_list_inc ,
hca - > mpool , mca_btl_openib_frag_init ,
2008-01-21 15:11:18 +03:00
init_data ) ) {
2008-01-09 13:26:21 +03:00
return OMPI_ERROR ;
}
}
mca_btl_openib_component . hcas_count + + ;
return OMPI_SUCCESS ;
}
2008-01-09 13:27:15 +03:00
static int
get_port_list ( mca_btl_openib_hca_t * hca , int * allowed_ports )
{
int i , j , k , num_ports = 0 ;
const char * dev_name ;
char * name ;
dev_name = ibv_get_device_name ( hca - > ib_dev ) ;
/* Room for ':', up to 3 port digits (phys_port_cnt is a uint8_t), and the NUL */
name = ( char * ) malloc ( strlen ( dev_name ) + 5 ) ;
if ( NULL = = name ) {
return 0 ;
}
/* Assume that all ports are allowed. num_ports will be adjusted
below to reflect whether this is true or not . */
for ( i = 1 ; i < = hca - > ib_dev_attr . phys_port_cnt ; + + i ) {
allowed_ports [ num_ports + + ] = i ;
}
num_ports = 0 ;
if ( NULL ! = mca_btl_openib_component . if_include_list ) {
/* If only the HCA name is given (e.g., mthca0,mthca1), use all
ports */
i = 0 ;
while ( mca_btl_openib_component . if_include_list [ i ] ) {
2008-01-21 15:11:18 +03:00
if ( 0 = = strcmp ( dev_name ,
2008-01-09 13:27:15 +03:00
mca_btl_openib_component . if_include_list [ i ] ) ) {
num_ports = hca - > ib_dev_attr . phys_port_cnt ;
goto done ;
}
+ + i ;
}
/* Include only requested ports on the HCA */
for ( i = 1 ; i < = hca - > ib_dev_attr . phys_port_cnt ; + + i ) {
sprintf ( name , " %s:%d " , dev_name , i ) ;
2008-01-21 15:11:18 +03:00
for ( j = 0 ;
2008-01-09 13:27:15 +03:00
NULL ! = mca_btl_openib_component . if_include_list [ j ] ; + + j ) {
2008-01-21 15:11:18 +03:00
if ( 0 = = strcmp ( name ,
2008-01-09 13:27:15 +03:00
mca_btl_openib_component . if_include_list [ j ] ) ) {
allowed_ports [ num_ports + + ] = i ;
break ;
}
}
}
} else if ( NULL ! = mca_btl_openib_component . if_exclude_list ) {
/* If only the HCA name is given (e.g., mthca0,mthca1), exclude
all ports */
i = 0 ;
while ( mca_btl_openib_component . if_exclude_list [ i ] ) {
2008-01-21 15:11:18 +03:00
if ( 0 = = strcmp ( dev_name ,
2008-01-09 13:27:15 +03:00
mca_btl_openib_component . if_exclude_list [ i ] ) ) {
num_ports = 0 ;
goto done ;
}
+ + i ;
}
/* Exclude the specified ports on this HCA */
for ( i = 1 ; i < = hca - > ib_dev_attr . phys_port_cnt ; + + i ) {
sprintf ( name , " %s:%d " , dev_name , i ) ;
2008-01-21 15:11:18 +03:00
for ( j = 0 ;
2008-01-09 13:27:15 +03:00
NULL ! = mca_btl_openib_component . if_exclude_list [ j ] ; + + j ) {
2008-01-21 15:11:18 +03:00
if ( 0 = = strcmp ( name ,
2008-01-09 13:27:15 +03:00
mca_btl_openib_component . if_exclude_list [ j ] ) ) {
/* If found, set a sentinel value */
j = - 1 ;
break ;
}
}
/* If we didn't find it, it's ok to include in the list */
if ( - 1 ! = j ) {
allowed_ports [ num_ports + + ] = i ;
}
}
} else {
num_ports = hca - > ib_dev_attr . phys_port_cnt ;
}
done :
/* Remove the following from the error-checking if_list:
- bare device name
- device name suffixed with port number */
if ( NULL ! = mca_btl_openib_component . if_list ) {
for ( i = 0 ; NULL ! = mca_btl_openib_component . if_list [ i ] ; + + i ) {
/* Look for raw device name */
if ( 0 = = strcmp ( mca_btl_openib_component . if_list [ i ] , dev_name ) ) {
j = opal_argv_count ( mca_btl_openib_component . if_list ) ;
opal_argv_delete ( & j , & ( mca_btl_openib_component . if_list ) ,
i , 1 ) ;
- - i ;
}
}
for ( i = 1 ; i < = hca - > ib_dev_attr . phys_port_cnt ; + + i ) {
sprintf ( name , " %s:%d " , dev_name , i ) ;
for ( j = 0 ; NULL ! = mca_btl_openib_component . if_list [ j ] ; + + j ) {
if ( 0 = = strcmp ( mca_btl_openib_component . if_list [ j ] , name ) ) {
k = opal_argv_count ( mca_btl_openib_component . if_list ) ;
opal_argv_delete ( & k , & ( mca_btl_openib_component . if_list ) ,
j , 1 ) ;
- - j ;
break ;
}
}
}
}
free ( name ) ;
return num_ports ;
}
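The include/exclude rules applied by get_port_list() can be summarized in a compact, self-contained form. The sketch below is an illustration under simplifying assumptions: `port_matches` and `filter_ports` are hypothetical names, the lists are plain NULL-terminated string arrays, and the if_list bookkeeping done after the `done` label is left out.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if the list names either the bare device ("mthca0") or this
   specific port ("mthca0:1").  The list is NULL-terminated.  Device
   names too long for the local buffer are truncated and will only
   match by bare name. */
static bool port_matches(const char *dev_name, int port,
                         const char *const *list)
{
    char name[64];
    int i;

    snprintf(name, sizeof(name), "%s:%d", dev_name, port);
    for (i = 0; NULL != list[i]; ++i) {
        if (0 == strcmp(list[i], dev_name) || 0 == strcmp(list[i], name)) {
            return true;
        }
    }
    return false;
}

/* Fill allowed[] with the usable port numbers (1-based) and return how
   many there are.  An include list takes precedence over an exclude
   list; with neither, every port is allowed. */
static int filter_ports(const char *dev_name, int port_count,
                        const char *const *include,
                        const char *const *exclude, int *allowed)
{
    int port, n = 0;

    for (port = 1; port <= port_count; ++port) {
        bool ok;
        if (NULL != include) {
            ok = port_matches(dev_name, port, include);
        } else if (NULL != exclude) {
            ok = !port_matches(dev_name, port, exclude);
        } else {
            ok = true;
        }
        if (ok) {
            allowed[n++] = port;
        }
    }
    return n;
}
```

With an include list of "mthca0:2" and a two-port device named mthca0, only port 2 is kept; excluding the bare name "mthca0" removes every port.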
static void merge_values ( ompi_btl_openib_ini_values_t * target ,
ompi_btl_openib_ini_values_t * src )
{
if ( ! target - > mtu_set & & src - > mtu_set ) {
target - > mtu = src - > mtu ;
target - > mtu_set = true ;
}
if ( ! target - > use_eager_rdma_set & & src - > use_eager_rdma_set ) {
target - > use_eager_rdma = src - > use_eager_rdma ;
target - > use_eager_rdma_set = true ;
}
}
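The effect of merge_values() is that a field already set in the target is never overwritten, so merging device-specific values with defaults keeps the device-specific ones. A minimal standalone sketch, assuming a simplified stand-in struct (`ini_values_t` and its field types are hypothetical, not the real ompi_btl_openib_ini_values_t):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the INI values structure (field types assumed) */
typedef struct {
    uint32_t mtu;
    bool mtu_set;
    bool use_eager_rdma;
    bool use_eager_rdma_set;
} ini_values_t;

/* Copy each field from src only when target has not set it yet */
static void merge_ini_values(ini_values_t *target, const ini_values_t *src)
{
    if (!target->mtu_set && src->mtu_set) {
        target->mtu = src->mtu;
        target->mtu_set = true;
    }
    if (!target->use_eager_rdma_set && src->use_eager_rdma_set) {
        target->use_eager_rdma = src->use_eager_rdma;
        target->use_eager_rdma_set = true;
    }
}
```

If the device entry sets only an MTU of 2048 and the defaults set an MTU of 1024 plus use_eager_rdma, the merged result keeps the MTU of 2048 and picks up use_eager_rdma from the defaults.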
static inline bool is_credit_message ( const mca_btl_openib_recv_frag_t * frag )
{
mca_btl_openib_control_header_t * chdr =
to_base_frag ( frag ) - > segment . seg_addr . pval ;
return ( MCA_BTL_TAG_BTL = = frag - > hdr - > tag ) & &
( MCA_BTL_OPENIB_CONTROL_CREDITS = = chdr - > type ) ;
}
2008-02-20 16:44:05 +03:00
static void init_apm_port ( mca_btl_openib_hca_t * hca , int port , uint16_t lid )
{
int index ;
struct mca_btl_openib_module_t * btl ;
for ( index = 0 ; index < hca - > btls ; index + + ) {
btl = opal_pointer_array_get_item ( hca - > hca_btls , index ) ;
/* OK, we already have a BTL for the first port;
* the second one will be used for APM */
btl - > apm_port = port ;
btl - > port_info . apm_lid = lid + btl - > src_path_bits ;
mca_btl_openib_component . apm_ports + + ;
BTL_VERBOSE ( ( " APM-PORT: Setting alternative port - %d, lid - %d "
, port , lid ) ) ;
}
}
2006-06-28 11:23:08 +04:00
static int init_one_hca ( opal_list_t * btl_list , struct ibv_device * ib_dev )
{
struct mca_mpool_base_resources_t mpool_resources ;
mca_btl_openib_hca_t * hca ;
2007-06-14 05:59:25 +04:00
uint8_t i , k = 0 ;
int ret = - 1 , port_cnt ;
2006-08-31 00:21:47 +04:00
ompi_btl_openib_ini_values_t values , default_values ;
2007-06-14 17:59:28 +04:00
int * allowed_ports = NULL ;
2008-01-21 15:11:18 +03:00
2007-12-23 15:29:34 +03:00
hca = OBJ_NEW ( mca_btl_openib_hca_t ) ;
2006-06-28 11:23:08 +04:00
if ( NULL = = hca ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2006-08-14 23:30:37 +04:00
return OMPI_ERR_OUT_OF_RESOURCE ;
2006-06-28 11:23:08 +04:00
}
2008-01-21 15:11:18 +03:00
2006-06-28 11:23:08 +04:00
hca - > ib_dev = ib_dev ;
hca - > ib_dev_context = ibv_open_device ( ib_dev ) ;
2008-05-02 15:52:33 +04:00
hca - > ib_pd = NULL ;
2008-02-20 16:44:05 +03:00
hca - > hca_btls = OBJ_NEW ( opal_pointer_array_t ) ;
if ( OPAL_SUCCESS ! = opal_pointer_array_init ( hca - > hca_btls , 2 , INT_MAX , 2 ) ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed to initialize hca_btls array: %s:%d " , __FILE__ , __LINE__ ) ) ;
2008-02-20 16:44:05 +03:00
return OMPI_ERR_OUT_OF_RESOURCE ;
}
2007-12-23 15:29:34 +03:00
2008-01-21 15:11:18 +03:00
if ( NULL = = hca - > ib_dev_context ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " error obtaining device context for %s errno says %s " ,
2008-01-21 15:11:18 +03:00
ibv_get_device_name ( ib_dev ) , strerror ( errno ) ) ) ;
2008-01-09 13:26:21 +03:00
goto error ;
2008-01-21 15:11:18 +03:00
}
if ( ibv_query_device ( hca - > ib_dev_context , & hca - > ib_dev_attr ) ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " error obtaining device attributes for %s errno says %s " ,
2008-01-21 15:11:18 +03:00
ibv_get_device_name ( ib_dev ) , strerror ( errno ) ) ) ;
2008-01-09 13:26:21 +03:00
goto error ;
2006-06-28 11:23:08 +04:00
}
2007-06-14 05:59:25 +04:00
/* If mca_btl_if_include/exclude were specified, get usable ports */
2007-12-23 15:29:34 +03:00
allowed_ports = ( int * ) malloc ( hca - > ib_dev_attr . phys_port_cnt * sizeof ( int ) ) ;
2007-06-14 05:59:25 +04:00
port_cnt = get_port_list ( hca , allowed_ports ) ;
if ( 0 = = port_cnt ) {
ret = OMPI_SUCCESS ;
2008-01-09 13:26:21 +03:00
goto error ;
2007-06-14 05:59:25 +04:00
}
2007-11-28 10:18:59 +03:00
# if HAVE_XRC
/* If the user configured XRC QPs and the device doesn't support them,
* we should ignore this HCA. Another HCA may have XRC support.
*/
if ( ! ( hca - > ib_dev_attr . device_cap_flags & IBV_DEVICE_XRC ) & &
mca_btl_openib_component . num_xrc_qps > 0 ) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
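The duplicate detection described for orte_show_help() can be pictured with a small table keyed by help file and topic. This is only an illustrative sketch: `help_table_t` and `help_should_display` are hypothetical names, the table is fixed-size, and the periodic "I got X new copies" report is reduced to a counter.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_TOPICS 16

typedef struct {
    const char *file;    /* help file name; caller keeps the string alive */
    const char *topic;   /* topic within the file; same lifetime rule */
    int suppressed;      /* duplicates seen since the first display */
} help_entry_t;

typedef struct {
    help_entry_t entries[MAX_TOPICS];
    int count;
} help_table_t;

/* Returns true the first time a (file, topic) pair is seen, meaning the
   message should be displayed.  Later copies only bump a counter.  If
   the table is full, unseen pairs are always displayed. */
static bool help_should_display(help_table_t *table, const char *file,
                                const char *topic)
{
    int i;

    for (i = 0; i < table->count; ++i) {
        if (0 == strcmp(table->entries[i].file, file) &&
            0 == strcmp(table->entries[i].topic, topic)) {
            ++table->entries[i].suppressed;
            return false;
        }
    }
    if (table->count < MAX_TOPICS) {
        table->entries[table->count].file = file;
        table->entries[table->count].topic = topic;
        table->entries[table->count].suppressed = 0;
        ++table->count;
    }
    return true;
}
```

A timer on the displaying side could then report and reset each `suppressed` counter every few seconds.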
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves much like
opal_output'ing to stream 0 did, so leaving a hard-coded "0" as the
first argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
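As a rough illustration of the aggregation that orte_base_help_aggregate turns on, here is a self-contained C sketch that counts duplicate help messages keyed by file name and topic. It is not the ORTE implementation: help_should_display(), help_flush_pending(), and the fixed-size table are hypothetical names invented for this example, and the real code runs on the HNP with timer-driven reporting.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define HELP_MAX_ENTRIES 32   /* hypothetical fixed capacity for this sketch */

/* One entry per distinct (file, topic) pair seen so far.
 * Names longer than 63 characters are truncated in this sketch. */
struct help_entry {
    char file[64];
    char topic[64];
    int  pending;             /* duplicates received since the last report */
};

static struct help_entry help_table[HELP_MAX_ENTRIES];
static int help_count = 0;

/* Returns true if the message should be displayed now (first sighting,
 * or the table is full), false if it is a duplicate that was only counted. */
static bool help_should_display(const char *file, const char *topic)
{
    for (int i = 0; i < help_count; i++) {
        if (0 == strcmp(help_table[i].file, file) &&
            0 == strcmp(help_table[i].topic, topic)) {
            help_table[i].pending++;
            return false;
        }
    }
    if (help_count < HELP_MAX_ENTRIES) {
        struct help_entry *e = &help_table[help_count++];
        snprintf(e->file, sizeof(e->file), "%s", file);
        snprintf(e->topic, sizeof(e->topic), "%s", topic);
        e->pending = 0;
    }
    return true;
}

/* Meant to be called periodically (the text above mentions roughly every
 * 5 seconds). Reports and resets the duplicate counters; returns the
 * total number of duplicates reported. */
static int help_flush_pending(FILE *out)
{
    int total = 0;
    for (int i = 0; i < help_count; i++) {
        if (help_table[i].pending > 0) {
            fprintf(out, "%d more copies of help message %s / %s\n",
                    help_table[i].pending,
                    help_table[i].file, help_table[i].topic);
            total += help_table[i].pending;
            help_table[i].pending = 0;
        }
    }
    return total;
}
```

Setting the parameter to 0 would correspond to skipping the lookup and always displaying the message.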
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done in configure.m4 to search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_show_help("help-mpi-btl-openib.txt",
2007-11-28 10:18:59 +03:00
" XRC on device without XRC support " , true ,
mca_btl_openib_component.num_xrc_qps,
ibv_get_device_name(ib_dev),
2008-03-24 02:10:15 +03:00
orte_process_info.nodename);
2007-11-28 10:18:59 +03:00
ret = OMPI_SUCCESS;
2008-01-09 13:26:21 +03:00
goto error;
2007-11-28 10:18:59 +03:00
}
#endif
2006-08-31 00:21:47 +04:00
/* Load in vendor/part-specific HCA parameters. Note that even if
   we don't find values for this vendor/part, "values" will be set
   indicating that it does not have good values */
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
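
The MTU handling summarized above (validate the INI value, then take the lesser of the two peers' maximums) can be sketched in a few lines of C. This is an illustration under stated assumptions, not the code in the component: enum sketch_mtu is a local stand-in for the verbs MTU enum so the example builds without libibverbs, and sketch_mtu_from_bytes() and sketch_negotiate_mtu() are hypothetical helper names.

```c
#include <stdbool.h>

/* Local stand-in for the verbs MTU enum (assumption: values only need
 * to be ordered from smallest to largest MTU). */
enum sketch_mtu {
    SKETCH_MTU_256 = 1,
    SKETCH_MTU_512,
    SKETCH_MTU_1024,
    SKETCH_MTU_2048,
    SKETCH_MTU_4096
};

/* Convert an MTU given in bytes (as it would appear in an INI file) to
 * the enum. Returns false for unsupported values so the caller can fall
 * back to its default. */
static bool sketch_mtu_from_bytes(int bytes, enum sketch_mtu *out)
{
    switch (bytes) {
    case 256:  *out = SKETCH_MTU_256;  return true;
    case 512:  *out = SKETCH_MTU_512;  return true;
    case 1024: *out = SKETCH_MTU_1024; return true;
    case 2048: *out = SKETCH_MTU_2048; return true;
    case 4096: *out = SKETCH_MTU_4096; return true;
    default:   return false;
    }
}

/* Each side advertises its maximum MTU; the connection uses the lesser. */
static enum sketch_mtu sketch_negotiate_mtu(enum sketch_mtu local,
                                            enum sketch_mtu remote)
{
    return (local < remote) ? local : remote;
}
```

For example, a local maximum of 2048 and a remote maximum of 1024 would negotiate to the 1024 setting.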
This commit was SVN r11182.
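
For readers unfamiliar with the colon-delimited, left-to-right lookup used by btl_openib_hca_param_files, the following C sketch shows one way to walk such a list. The helper name sketch_path_at() and the sample file names are hypothetical; the component's real parsing lives in its INI code and may differ.

```c
#include <string.h>

/* Copy the index-th colon-separated element of list into buf.
 * Returns 0 on success, -1 if index is out of range or buf is too small.
 * Empty elements (e.g. "a::b") are returned as empty strings. */
static int sketch_path_at(const char *list, int index,
                          char *buf, size_t buflen)
{
    const char *start = list;
    for (;;) {
        const char *end = strchr(start, ':');
        size_t len = (NULL != end) ? (size_t)(end - start) : strlen(start);
        if (0 == index) {
            if (len >= buflen) {
                return -1;
            }
            memcpy(buf, start, len);
            buf[len] = '\0';
            return 0;
        }
        if (NULL == end) {
            return -1;          /* ran off the end of the list */
        }
        start = end + 1;
        index--;
    }
}
```

A caller would request index 0, 1, 2, ... until the function returns -1, reading each file in that order.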
2006-08-14 23:30:37 +04:00
ret = ompi_btl_openib_ini_query(hca->ib_dev_attr.vendor_id,
                                hca->ib_dev_attr.vendor_part_id,
                                &values);
2006-08-31 00:21:47 +04:00
if (OMPI_SUCCESS != ret && OMPI_ERR_NOT_FOUND != ret) {
    /* If we get a serious error, propagate it upwards */
2008-01-09 13:26:21 +03:00
    goto error;
2006-08-31 00:21:47 +04:00
}
2006-08-14 23:30:37 +04:00
if (OMPI_ERR_NOT_FOUND == ret) {
    /* If we didn't find a matching HCA in the INI files, output a
       warning that we're using default values (unless overridden
       that we don't want to see these warnings) */
    if (mca_btl_openib_component.warn_no_hca_params_found) {
2008-05-14 00:00:55 +04:00
orte_show_help("help-mpi-btl-openib.txt",
2006-08-14 23:30:37 +04:00
" no hca params found " , true ,
2008-03-24 02:10:15 +03:00
orte_process_info.nodename,
2006-08-14 23:30:37 +04:00
hca->ib_dev_attr.vendor_id,
hca->ib_dev_attr.vendor_part_id);
}
2006-08-31 00:21:47 +04:00
}
/* Note that even if we don't find default values, "values" will
be set indicating that it does not have good values */
ret = ompi_btl_openib_ini_query(0, 0, &default_values);
if (OMPI_SUCCESS != ret && OMPI_ERR_NOT_FOUND != ret) {
    /* If we get a serious error, propagate it upwards */
2008-01-09 13:26:21 +03:00
    goto error;
2006-08-31 00:21:47 +04:00
}
/* If we did find values for this HCA (or in the defaults
   section), handle them */
merge_values(&values, &default_values);
if (values.mtu_set) {
    switch (values.mtu) {
    case 256:
        hca->mtu = IBV_MTU_256;
        break;
    case 512:
        hca->mtu = IBV_MTU_512;
        break;
    case 1024:
        hca->mtu = IBV_MTU_1024;
        break;
    case 2048:
        hca->mtu = IBV_MTU_2048;
        break;
    case 4096:
        hca->mtu = IBV_MTU_4096;
        break;
    default:
2008-05-02 15:52:33 +04:00
        BTL_ERROR(("invalid MTU value specified in INI file (%d); ignored", values.mtu));
2006-08-31 00:21:47 +04:00
        hca->mtu = mca_btl_openib_component.ib_mtu;
        break;
2006-08-14 23:30:37 +04:00
}
2006-08-31 00:21:47 +04:00
} else {
    hca->mtu = mca_btl_openib_component.ib_mtu;
2006-08-14 23:30:37 +04:00
}
2006-12-14 18:52:13 +03:00
/* If "use eager rdma" was set, then enable it on this HCA */
if (values.use_eager_rdma_set) {
    hca->use_eager_rdma = values.use_eager_rdma;
}
2008-02-12 19:59:59 +03:00
/* Allocate the protection domain for the HCA */
2006-06-28 11:23:08 +04:00
hca->ib_pd = ibv_alloc_pd(hca->ib_dev_context);
if (NULL == hca->ib_pd) {
2008-05-02 15:52:33 +04:00
    BTL_ERROR(("error allocating protection domain for %s errno says %s",
2006-06-28 11:23:08 +04:00
               ibv_get_device_name(ib_dev), strerror(errno)));
2008-01-09 13:26:21 +03:00
    goto error;
2006-06-28 11:23:08 +04:00
}
2005-11-10 23:15:02 +03:00
2008-04-02 23:52:03 +04:00
#if HAVE_XRC
2007-11-28 10:18:59 +03:00
if (MCA_BTL_XRC_ENABLED) {
    if (OMPI_SUCCESS != mca_btl_openib_open_xrc_domain(hca)) {
        BTL_ERROR(("XRC Internal error. Failed to open xrc domain"));
2008-01-09 13:26:21 +03:00
        goto error;
2007-11-28 10:18:59 +03:00
}
}
2008-04-02 23:52:03 +04:00
#endif
2007-11-28 10:18:59 +03:00
2006-12-17 15:26:41 +03:00
mpool_resources.reg_data = (void *) hca;
mpool_resources.sizeof_reg = sizeof(mca_btl_openib_reg_t);
mpool_resources.register_mem = openib_reg_mr;
mpool_resources.deregister_mem = openib_dereg_mr;
2006-06-28 11:23:08 +04:00
hca->mpool =
    mca_mpool_base_module_create(mca_btl_openib_component.ib_mpool_name,
                                 hca, &mpool_resources);
if (NULL == hca->mpool) {
2008-05-02 15:52:33 +04:00
    BTL_ERROR(("error creating IB memory pool for %s errno says %s",
2006-06-28 11:23:08 +04:00
               ibv_get_device_name(ib_dev), strerror(errno)));
2008-01-09 13:26:21 +03:00
    goto error;
2006-06-28 11:23:08 +04:00
}
2008-01-21 15:11:18 +03:00
2008-05-02 15:52:33 +04:00
#if OMPI_ENABLE_PROGRESS_THREADS
2006-11-02 19:15:21 +03:00
hca->ib_channel = ibv_create_comp_channel(hca->ib_dev_context);
if (NULL == hca->ib_channel) {
2008-05-02 15:52:33 +04:00
    BTL_ERROR(("error creating channel for %s errno says %s",
2006-11-02 19:15:21 +03:00
               ibv_get_device_name(hca->ib_dev),
               strerror(errno)));
2008-01-09 13:26:21 +03:00
    goto error;
2006-11-02 19:15:21 +03:00
}
#endif
2008-01-21 15:11:18 +03:00
ret = OMPI_SUCCESS;
2006-11-02 19:15:21 +03:00
2007-06-14 05:59:25 +04:00
/* Note ports are 1 based (i >= 1) */
for (k = 0; k < port_cnt; k++) {
2006-06-28 11:23:08 +04:00
    struct ibv_port_attr ib_port_attr;
2007-06-14 05:59:25 +04:00
    i = allowed_ports[k];
2006-06-28 11:23:08 +04:00
    if (ibv_query_port(hca->ib_dev_context, i, &ib_port_attr)) {
        BTL_ERROR(("error getting port attributes for device %s "
                   "port number %d errno says %s",
                   ibv_get_device_name(ib_dev), i, strerror(errno)));
2008-01-21 15:11:18 +03:00
        break;
2006-06-28 11:23:08 +04:00
}
2008-02-20 16:44:05 +03:00
    if (IBV_PORT_ACTIVE == ib_port_attr.state) {
        if (mca_btl_openib_component.apm_ports && hca->btls > 0) {
            init_apm_port(hca, i, ib_port_attr.lid);
            break;
        }
2007-04-22 14:22:12 +04:00
        if (0 == mca_btl_openib_component.ib_pkey_val) {
            ret = init_one_port(btl_list, hca, i, mca_btl_openib_component.ib_pkey_ix,
                                &ib_port_attr);
2008-02-20 16:44:05 +03:00
} else {
2007-04-22 14:22:12 +04:00
            uint16_t pkey, j;
            for (j = 0; j < hca->ib_dev_attr.max_pkeys; j++) {
                ibv_query_pkey(hca->ib_dev_context, i, j, &pkey);
                pkey = ntohs(pkey);
                if (pkey == mca_btl_openib_component.ib_pkey_val) {
                    ret = init_one_port(btl_list, hca, i, j, &ib_port_attr);
                    break;
                }
            }
        }
2006-08-14 23:30:37 +04:00
        if (OMPI_SUCCESS != ret) {
2008-01-21 15:11:18 +03:00
/* Out of bounds error indicates that we hit max btl number
2006-09-25 15:18:20 +04:00
             * don't propagate the error to the caller */
            if (OMPI_ERR_VALUE_OUT_OF_BOUNDS == ret)
                ret = OMPI_SUCCESS;
2006-06-28 11:23:08 +04:00
            break;
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the retun status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
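The lookup-and-negotiate flow summarized in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from btl_openib_ini.l or the connection path: the `hca_ini_entry_t` type and the `hca_ini_lookup()` and `negotiate_mtu()` helpers are hypothetical names invented for the example, and the table contents are made-up values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached INI record: one per (vendor_id, vendor_part_id). */
typedef struct {
    uint32_t vendor_id;
    uint32_t vendor_part_id;
    uint32_t mtu;               /* MTU in bytes */
} hca_ini_entry_t;

/* Return the cached entry that matches the HCA, or NULL if none was
   found.  The commit message treats "not found" as a warning only. */
static const hca_ini_entry_t *
hca_ini_lookup(const hca_ini_entry_t *table, size_t n,
               uint32_t vendor_id, uint32_t vendor_part_id)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (table[i].vendor_id == vendor_id &&
            table[i].vendor_part_id == vendor_part_id) {
            return &table[i];
        }
    }
    return NULL;
}

/* Each peer sends its maximum MTU; both then use the lesser value. */
static uint32_t negotiate_mtu(uint32_t local_max, uint32_t remote_max)
{
    return (local_max < remote_max) ? local_max : remote_max;
}
```

Keeping the lookup separate from the negotiation means a missing INI entry only affects which local maximum is advertised; the "select the lesser" rule is the same either way.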
2006-08-14 23:30:37 +04:00
}
2006-06-28 11:23:08 +04:00
}
}
2007-08-20 16:28:25 +04:00
    /* If we made a BTL, we're done.  Otherwise, fall through and
       destroy everything */
    if (hca->btls > 0) {
2008-02-20 16:44:05 +03:00
        /* if apm was enabled it should be > 1 */
        if (1 == mca_btl_openib_component.apm_ports) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already made the search-and-replace
across the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
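The duplicate-detection behavior described above for the HNP can be sketched as a small table keyed by help file and topic. This is only an illustration under stated assumptions: `help_table_t`, `help_record()` and `help_flush()` are hypothetical names, the fixed-size table is a simplification, and the periodic timer (roughly every 5 seconds per the text above) is left to the caller.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 16

typedef struct {
    char key[128];      /* "filename:topic" */
    int  pending;       /* duplicates seen since the last summary */
} help_entry_t;

typedef struct {
    help_entry_t entries[MAX_TOPICS];
    int count;
} help_table_t;

/* Returns true if the message should be displayed now (first time it
   is seen, or the table is full and it cannot be tracked); returns
   false if it is a duplicate that was only counted. */
static bool help_record(help_table_t *t, const char *filename,
                        const char *topic)
{
    char key[128];
    int i;

    snprintf(key, sizeof(key), "%s:%s", filename, topic);
    for (i = 0; i < t->count; i++) {
        if (0 == strcmp(t->entries[i].key, key)) {
            t->entries[i].pending++;
            return false;
        }
    }
    if (t->count < MAX_TOPICS) {
        strcpy(t->entries[t->count].key, key);
        t->entries[t->count].pending = 0;
        t->count++;
    }
    return true;
}

/* Called periodically by the owner of the table: returns how many new
   duplicates of entry i arrived since the last call, then resets. */
static int help_flush(help_table_t *t, int i)
{
    int n = t->entries[i].pending;

    t->entries[i].pending = 0;
    return n;
}
```

A non-zero return from `help_flush()` is the point at which a summary line such as "I got X new copies of the help message..." would be printed.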
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will behave similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
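As a rough idea of what a filter that segregates messages into XML tags might do, the sketch below wraps one message in a tag named after its source and escapes the characters that would break the markup. It is an assumption-laden illustration, not the filter component's code: `xml_wrap()` and the tag names used with it are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Write "<tag>msg</tag>" into out, escaping '&', '<' and '>' in msg.
   Returns 0 on success or -1 if out (outlen bytes) is too small. */
static int xml_wrap(const char *tag, const char *msg,
                    char *out, size_t outlen)
{
    size_t used;
    int n = snprintf(out, outlen, "<%s>", tag);

    if (n < 0 || (size_t) n >= outlen) {
        return -1;
    }
    used = (size_t) n;

    for (; '\0' != *msg; msg++) {
        char one[2] = { *msg, '\0' };
        const char *rep;

        switch (*msg) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  rep = one;     break;
        }
        if (used + strlen(rep) >= outlen) {
            return -1;
        }
        strcpy(out + used, rep);
        used += strlen(rep);
    }

    n = snprintf(out + used, outlen - used, "</%s>", tag);
    if (n < 0 || (size_t) n >= outlen - used) {
        return -1;
    }
    return 0;
}
```

A third-party tool reading the HNP's stdout/stderr could then tell a help message from user-process output by the enclosing tag alone.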
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
            orte_show_help("help-mpi-btl-openib.txt", "apm not enough ports", true);
2008-02-20 16:44:05 +03:00
            mca_btl_openib_component.apm_ports = 0;
        }
2008-01-09 13:26:21 +03:00
        ret = prepare_hca_for_use(hca);
2008-04-02 23:52:03 +04:00
        if (OMPI_SUCCESS == ret) {
2008-01-09 13:26:21 +03:00
            return OMPI_SUCCESS;
2008-04-02 23:52:03 +04:00
        }
If you have an HCA with no active ports, we still create an mpool.
This mpool will have no btl module owner because there was no btl created for
the HCA with no ports, but it will still be tracked in the mpool
framework (i.e., it's available).
If MPI_ALLOC_MEM is called by the app, one of two things will happen:
1. if there's an HCA on the host with some active ports, the openib
btl component will still be in the process space, and therefore
the "mpool with no btl" (MWNB) module will still be able to call
the reg/dereg functions, and all will be fine. However, if
MPI_FREE_MEM is never invoked to free the memory, bad things will
happen during MPI_FINALIZE. The pml is finalized, which finalizes
all the btls. The btls finalize all their mpools and all is fine.
But later we close down the mpool framework which then finalizes
any left over mpool modules, such as MWNB. However, the openib
BTL module functions that the MWNB was registered with are no
longer in the process space, and it segv's while trying to deregister
the memory.
2. if there are *no* HCA's on the host with active ports, then the
openib btl will have been unloaded, and when the MWNB tries to
register the memory, the functions it tries to call (in the openib
btl) are no longer there, and we segv.
This commit was SVN r15735.
2007-08-02 00:53:34 +04:00
}
2006-06-28 11:23:08 +04:00
2008-01-09 13:26:21 +03:00
error:
2008-05-02 15:52:33 +04:00
#if OMPI_ENABLE_PROGRESS_THREADS
2008-04-02 23:52:03 +04:00
    if (hca->ib_channel) {
2008-01-09 13:26:21 +03:00
        ibv_destroy_comp_channel(hca->ib_channel);
2008-04-02 23:52:03 +04:00
    }
2006-11-06 15:34:56 +03:00
#endif
2008-04-02 23:52:03 +04:00
    if (hca->mpool) {
2008-01-09 13:26:21 +03:00
        mca_mpool_base_module_destroy(hca->mpool);
2008-04-02 23:52:03 +04:00
    }
#if HAVE_XRC
2007-11-28 10:18:59 +03:00
    if (MCA_BTL_XRC_ENABLED) {
2008-01-09 13:26:21 +03:00
        if (OMPI_SUCCESS != mca_btl_openib_close_xrc_domain(hca)) {
2007-11-28 10:18:59 +03:00
            BTL_ERROR(("XRC Internal error. Failed to close xrc domain"));
        }
    }
2008-04-02 23:52:03 +04:00
#endif
    if (hca->ib_pd) {
2008-01-09 13:26:21 +03:00
        ibv_dealloc_pd(hca->ib_pd);
2008-04-02 23:52:03 +04:00
    }
    if (hca->ib_dev_context) {
2008-01-09 13:26:21 +03:00
        ibv_close_device(hca->ib_dev_context);
2008-04-02 23:52:03 +04:00
    }
2007-12-23 15:29:34 +03:00
    OBJ_RELEASE(hca);
2006-06-28 11:23:08 +04:00
    return ret;
}
2006-12-17 15:26:41 +03:00
2007-10-23 16:57:45 +04:00
static int finish_btl_init(mca_btl_openib_module_t *openib_btl)
{
2008-01-09 13:26:21 +03:00
    int qp;
2007-10-23 16:57:45 +04:00
    openib_btl->num_peers = 0;
    /* Initialize module state */
    OBJ_CONSTRUCT(&openib_btl->ib_lock, opal_mutex_t);
2008-01-21 15:11:18 +03:00
2007-10-23 16:57:45 +04:00
    /* setup the qp structure */
    openib_btl->qps = (mca_btl_openib_module_qp_t *)
2008-01-09 13:26:21 +03:00
        calloc(mca_btl_openib_component.num_qps,
               sizeof(mca_btl_openib_module_qp_t));
2008-01-21 15:11:18 +03:00
    /* setup all the qps */
    for (qp = 0; qp < mca_btl_openib_component.num_qps; qp++) {
        if (!BTL_OPENIB_QP_TYPE_PP(qp)) {
2007-11-28 10:12:44 +03:00
            OBJ_CONSTRUCT(&openib_btl->qps[qp].u.srq_qp.pending_frags[0],
                          opal_list_t);
            OBJ_CONSTRUCT(&openib_btl->qps[qp].u.srq_qp.pending_frags[1],
                          opal_list_t);
2008-01-21 15:11:18 +03:00
            openib_btl->qps[qp].u.srq_qp.sd_credits =
2007-10-23 16:57:45 +04:00
                mca_btl_openib_component.qp_infos[qp].u.srq_qp.sd_max;
        }
    }
2008-01-21 15:11:18 +03:00
    /* initialize the memory pool using the hca */
2008-01-09 13:26:21 +03:00
    openib_btl->super.btl_mpool = openib_btl->hca->mpool;
2008-01-21 15:11:18 +03:00
2007-12-23 15:29:34 +03:00
    openib_btl->eager_rdma_channels = 0;
2007-10-23 16:57:45 +04:00
    openib_btl->eager_rdma_frag_size = OPAL_ALIGN(
            sizeof(mca_btl_openib_header_t) +
2007-12-09 17:05:13 +03:00
            sizeof(mca_btl_openib_header_coalesced_t) +
            sizeof(mca_btl_openib_control_header_t) +
2007-10-23 16:57:45 +04:00
            sizeof(mca_btl_openib_footer_t) +
            openib_btl->super.btl_eager_limit,
            mca_btl_openib_component.buffer_alignment, size_t);
    return OMPI_SUCCESS;
}
2008-02-03 18:16:24 +03:00
static struct ibv_device **ibv_get_device_list_compat(int *num_devs)
{
    struct ibv_device **ib_devs;
#ifdef HAVE_IBV_GET_DEVICE_LIST
    ib_devs = ibv_get_device_list(num_devs);
#else
    struct dlist *dev_list;
    struct ibv_device *ib_dev;
    *num_devs = 0;
    /* Determine the number of hca's available on the host */
    dev_list = ibv_get_devices();
    if (NULL == dev_list)
        return NULL;
    dlist_start(dev_list);
    dlist_for_each_data(dev_list, ib_dev, struct ibv_device)
        (*num_devs)++;
    /* Allocate space for the ib devices */
    ib_devs = (struct ibv_device **) malloc(*num_devs * sizeof(struct ibv_device *));
    if (NULL == ib_devs) {
        *num_devs = 0;
2008-05-02 15:52:33 +04:00
        BTL_ERROR(("Failed malloc: %s:%d", __FILE__, __LINE__));
2008-02-03 18:16:24 +03:00
        return NULL;
    }
    /* Fill the array by index (re-counting as we go) so that ib_devs
       still points at its first element when it is returned */
    *num_devs = 0;
    dlist_start(dev_list);
    dlist_for_each_data(dev_list, ib_dev, struct ibv_device)
        ib_devs[(*num_devs)++] = ib_dev;
#endif
2008-02-04 17:05:01 +03:00
    return ib_devs;
2008-02-03 18:16:24 +03:00
}
static void ibv_free_device_list_compat(struct ibv_device **ib_devs)
{
#ifdef HAVE_IBV_GET_DEVICE_LIST
    ibv_free_device_list(ib_devs);
#else
    free(ib_devs);
#endif
}
2008-02-11 13:34:11 +03:00
static opal_carto_graph_t *host_topo;
static int get_ib_dev_distance(struct ibv_device *dev)
{
    opal_paffinity_base_cpu_set_t cpus;
    opal_carto_base_node_t *hca_node;
2008-03-05 05:45:15 +03:00
    int min_distance = -1, i, max_proc_id, num_processors;
2008-02-11 13:34:11 +03:00
    const char *hca = ibv_get_device_name(dev);
2008-03-05 05:45:15 +03:00
    if (opal_paffinity_base_get_processor_info(&num_processors, &max_proc_id) != OMPI_SUCCESS)
2008-02-11 13:34:11 +03:00
        max_proc_id = 100; /* Choose something big enough */
2008-05-07 23:33:49 +04:00
    hca_node = opal_carto_base_find_node(host_topo, hca);
2008-02-11 13:34:11 +03:00
    /* no topology info for HCA found. Assume that it is close */
    if (NULL == hca_node)
        return 0;
    OPAL_PAFFINITY_CPU_ZERO(cpus);
    opal_paffinity_base_get(&cpus);
    for (i = 0; i < max_proc_id; i++) {
        opal_carto_base_node_t *slot_node;
        int distance, socket, core;
        char *slot;
        if (!OPAL_PAFFINITY_CPU_ISSET(i, cpus))
            continue;
        opal_paffinity_base_map_to_socket_core(i, &socket, &core);
        asprintf(&slot, "slot%d", socket);
2008-05-07 23:33:49 +04:00
        slot_node = opal_carto_base_find_node(host_topo, slot);
2008-02-11 13:34:11 +03:00
        free(slot);
        if (NULL == slot_node)
            return 0;
2008-05-07 23:33:49 +04:00
        distance = opal_carto_base_spf(host_topo, slot_node, hca_node);
2008-02-11 13:34:11 +03:00
        if (distance < 0)
            return 0;
        /* keep the smallest distance over all CPUs we are bound to */
        if (min_distance < 0 || min_distance > distance)
            min_distance = distance;
    }
    return min_distance;
}
struct dev_distance {
    struct ibv_device *ib_dev;
    int distance;
};
static int compare_distance(const void *p1, const void *p2)
{
    const struct dev_distance *d1 = p1;
    const struct dev_distance *d2 = p2;
    return d1->distance - d2->distance;
}
static struct dev_distance *
sort_devs_by_distance(struct ibv_device **ib_devs, int count)
{
    int i;
    struct dev_distance *devs = malloc(count * sizeof(struct dev_distance));
2008-05-07 23:33:49 +04:00
    opal_carto_base_get_host_graph(&host_topo, "Infiniband");
2008-02-11 13:34:11 +03:00
    for (i = 0; i < count; i++) {
        devs[i].ib_dev = ib_devs[i];
        devs[i].distance = get_ib_dev_distance(ib_devs[i]);
    }
    qsort(devs, count, sizeof(struct dev_distance), compare_distance);
2008-05-07 23:33:49 +04:00
    opal_carto_base_free_graph(host_topo);
2008-02-11 13:34:11 +03:00
    return devs;
}
2005-07-01 01:28:35 +04:00
/*
 *  IB component initialization:
 *  (1) read interface list from kernel and compare against component parameters
 *      then create a BTL instance for selected interfaces
 *  (2) setup IB listen socket for incoming connection attempts
 *  (3) register BTL parameters with the MCA
 */
2008-01-21 15:11:18 +03:00
static mca_btl_base_module_t **
btl_openib_component_init(int *num_btl_modules,
2006-08-14 23:30:37 +04:00
                          bool enable_progress_threads,
                          bool enable_mpi_threads)
2005-07-01 01:28:35 +04:00
{
2008-01-21 15:11:18 +03:00
    struct ibv_device **ib_devs;
2005-07-01 01:28:35 +04:00
    mca_btl_base_module_t **btls;
2008-01-09 13:26:21 +03:00
    int i, ret, num_devs, length;
2008-01-21 15:11:18 +03:00
    opal_list_t btl_list;
    mca_btl_openib_module_t *openib_btl;
    mca_btl_base_selected_module_t *ib_selected;
    opal_list_item_t *item;
2006-01-13 02:42:44 +03:00
    unsigned short seedv[3];
2008-01-09 13:26:21 +03:00
    mca_btl_openib_frag_init_data_t *init_data;
2008-02-11 13:34:11 +03:00
    struct dev_distance *dev_sorted;
    int distance;
2005-07-20 01:04:22 +04:00
2005-07-01 01:28:35 +04:00
    /* initialization */
    *num_btl_modules = 0;
2008-01-21 15:11:18 +03:00
    num_devs = 0;
2005-07-01 01:28:35 +04:00
2008-02-28 04:57:57 +03:00
    seedv[0] = ORTE_PROC_MY_NAME->vpid;
2006-01-13 02:42:44 +03:00
    seedv[1] = opal_sys_timer_get_cycles();
    seedv[2] = opal_sys_timer_get_cycles();
    seed48(seedv);
2006-01-17 19:23:35 +03:00
2006-08-14 23:30:37 +04:00
    /* Read in INI files with HCA-specific parameters */
    if (OMPI_SUCCESS != (ret = ompi_btl_openib_ini_init())) {
2007-06-14 05:59:25 +04:00
        goto no_btls;
2006-08-14 23:30:37 +04:00
    }
2008-05-07 02:43:52 +04:00
    /* Initialize FD listening */
    if (OMPI_SUCCESS != ompi_btl_openib_fd_init()) {
        goto no_btls;
    }
2008-05-02 15:52:33 +04:00
    /* Init CPC components */
    if (OMPI_SUCCESS != (ret = ompi_btl_openib_connect_base_init())) {
        goto no_btls;
    }
2006-08-14 23:30:37 +04:00
2008-05-02 15:52:33 +04:00
    if (MCA_BTL_XRC_ENABLED) {
2007-11-28 10:18:59 +03:00
        OBJ_CONSTRUCT(&mca_btl_openib_component.ib_addr_table,
                      opal_hash_table_t);
    }
2008-01-09 13:26:21 +03:00
    OBJ_CONSTRUCT(&mca_btl_openib_component.send_free_coalesced, ompi_free_list_t);
    OBJ_CONSTRUCT(&mca_btl_openib_component.send_user_free, ompi_free_list_t);
    OBJ_CONSTRUCT(&mca_btl_openib_component.recv_user_free, ompi_free_list_t);
    init_data = malloc(sizeof(mca_btl_openib_frag_init_data_t));
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
    init_data->order = mca_btl_openib_component.rdma_qp;
    init_data->list = &mca_btl_openib_component.send_user_free;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
    if (OMPI_SUCCESS != ompi_free_list_init_ex_new(
                &mca_btl_openib_component.send_user_free,
                sizeof(mca_btl_openib_put_frag_t), 2,
                OBJ_CLASS(mca_btl_openib_put_frag_t),
                0, 0,
                mca_btl_openib_component.ib_free_list_num,
                mca_btl_openib_component.ib_free_list_max,
                mca_btl_openib_component.ib_free_list_inc,
2008-01-21 15:11:18 +03:00
                NULL, mca_btl_openib_frag_init, init_data)) {
2008-01-09 13:26:21 +03:00
        goto no_btls;
    }
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
    init_data = malloc(sizeof(mca_btl_openib_frag_init_data_t));
    init_data->order = mca_btl_openib_component.rdma_qp;
    init_data->list = &mca_btl_openib_component.recv_user_free;
2008-01-21 15:11:18 +03:00
2008-01-09 13:26:21 +03:00
    if (OMPI_SUCCESS != ompi_free_list_init_ex_new(
                &mca_btl_openib_component.recv_user_free,
                sizeof(mca_btl_openib_get_frag_t), 2,
                OBJ_CLASS(mca_btl_openib_get_frag_t),
                0, 0,
                mca_btl_openib_component.ib_free_list_num,
                mca_btl_openib_component.ib_free_list_max,
                mca_btl_openib_component.ib_free_list_inc,
2008-01-21 15:11:18 +03:00
                NULL, mca_btl_openib_frag_init, init_data)) {
2008-01-09 13:26:21 +03:00
        goto no_btls;
    }
    init_data = malloc(sizeof(mca_btl_openib_frag_init_data_t));
    length = sizeof(mca_btl_openib_coalesced_frag_t);
    init_data->list = &mca_btl_openib_component.send_free_coalesced;
    if (OMPI_SUCCESS != ompi_free_list_init_ex(
                &mca_btl_openib_component.send_free_coalesced,
                length, 2, OBJ_CLASS(mca_btl_openib_coalesced_frag_t),
                mca_btl_openib_component.ib_free_list_num,
                mca_btl_openib_component.ib_free_list_max,
                mca_btl_openib_component.ib_free_list_inc,
                NULL, mca_btl_openib_frag_init, init_data)) {
        goto no_btls;
    }
2007-04-21 04:15:05 +04:00
    /* If we want fork support, try to enable it */
#ifdef HAVE_IBV_FORK_INIT
    if (0 != mca_btl_openib_component.want_fork_support) {
        if (0 != ibv_fork_init()) {
            /* If the want_fork_support MCA parameter is >0, then the
               user was specifically asking for fork support and we
               couldn't provide it.  So print an error and deactivate
               this BTL. */
            if (mca_btl_openib_component.want_fork_support > 0) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done so that configure.m4 searches for an appropriate XML
library, links it in, and uses it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_show_help("help-mpi-btl-openib.txt",
2007-04-21 04:15:05 +04:00
"ibv_fork_init fail", true,
2008-03-24 02:10:15 +03:00
orte_process_info.nodename);
2007-06-14 05:59:25 +04:00
goto no_btls;
2008-01-21 15:11:18 +03:00
}
2007-04-21 04:15:05 +04:00
}
}
#endif
2007-06-14 05:59:25 +04:00
/* Parse the include and exclude lists, checking for errors */
mca_btl_openib_component.if_include_list =
2008-01-21 15:11:18 +03:00
mca_btl_openib_component.if_exclude_list =
2007-06-14 05:59:25 +04:00
mca_btl_openib_component.if_list = NULL;
if (NULL != mca_btl_openib_component.if_include &&
NULL != mca_btl_openib_component.if_exclude) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves much like opal_output'ing to
stream 0 did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done so that configure.m4 searches for an appropriate XML
library, links it in, and uses it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
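The duplicate-suppression behavior described for orte_show_help() can be illustrated with a small standalone sketch. This is not the ORTE implementation: the names help_entry_t, help_aggregate(), and help_flush_duplicates(), the fixed-size table, and the wording of the summary line are all assumptions made for illustration. The first copy of a (help file, topic) pair is displayed; later copies are only counted until a periodic flush reports them.

```c
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 64   /* hypothetical table size */

/* One (help file, topic) pair and how often it has been seen. */
typedef struct {
    char file[128];
    char topic[128];
    int  total;       /* all copies seen so far */
    int  unreported;  /* duplicates seen since the last summary */
} help_entry_t;

static help_entry_t entries[MAX_TOPICS];
static int num_entries = 0;

/* Returns 1 if this is the first copy (display it now), 0 if it is a
   duplicate that was only counted, -1 if the table is full. */
int help_aggregate(const char *file, const char *topic)
{
    int i;

    for (i = 0; i < num_entries; i++) {
        if (0 == strcmp(entries[i].file, file) &&
            0 == strcmp(entries[i].topic, topic)) {
            entries[i].total++;
            entries[i].unreported++;
            return 0;
        }
    }
    if (num_entries >= MAX_TOPICS) {
        return -1;
    }
    snprintf(entries[num_entries].file,
             sizeof(entries[num_entries].file), "%s", file);
    snprintf(entries[num_entries].topic,
             sizeof(entries[num_entries].topic), "%s", topic);
    entries[num_entries].total = 1;
    entries[num_entries].unreported = 0;
    num_entries++;
    return 1;
}

/* Meant to be called from a periodic timer (the text above says about
   every 5 seconds). Prints one summary line per topic that has new
   duplicates and returns how many duplicates were reported. */
int help_flush_duplicates(void)
{
    int i, reported = 0;

    for (i = 0; i < num_entries; i++) {
        if (entries[i].unreported > 0) {
            printf("%d new copies of help message %s / %s\n",
                   entries[i].unreported, entries[i].file,
                   entries[i].topic);
            reported += entries[i].unreported;
            entries[i].unreported = 0;
        }
    }
    return reported;
}
```

Setting orte_base_help_aggregate to 0 corresponds, in this sketch, to bypassing help_aggregate() and displaying every copy.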
2008-05-14 00:00:55 +04:00
orte_show_help("help-mpi-btl-openib.txt",
2007-06-14 05:59:25 +04:00
"specified include and exclude", true,
mca_btl_openib_component.if_include,
mca_btl_openib_component.if_exclude, NULL);
goto no_btls;
} else if (NULL != mca_btl_openib_component.if_include) {
2008-01-21 15:11:18 +03:00
mca_btl_openib_component.if_include_list =
2007-06-14 05:59:25 +04:00
opal_argv_split(mca_btl_openib_component.if_include, ',');
2008-01-21 15:11:18 +03:00
mca_btl_openib_component.if_list =
2007-06-14 05:59:25 +04:00
opal_argv_copy(mca_btl_openib_component.if_include_list);
} else if (NULL != mca_btl_openib_component.if_exclude) {
2008-01-21 15:11:18 +03:00
mca_btl_openib_component.if_exclude_list =
2007-06-14 05:59:25 +04:00
opal_argv_split(mca_btl_openib_component.if_exclude, ',');
2008-01-21 15:11:18 +03:00
mca_btl_openib_component.if_list =
2007-06-14 05:59:25 +04:00
opal_argv_copy(mca_btl_openib_component.if_exclude_list);
}
2008-02-03 18:16:24 +03:00
ib_devs = ibv_get_device_list_compat(&num_devs);
2005-07-01 01:28:35 +04:00
2008-02-03 18:16:24 +03:00
if (0 == num_devs || NULL == ib_devs) {
2008-03-29 15:54:24 +03:00
mca_btl_base_error_no_nics("OpenFabrics (openib)", "HCA");
2008-02-03 18:16:24 +03:00
goto no_btls;
2005-07-01 01:28:35 +04:00
}
2008-02-11 13:34:11 +03:00
dev_sorted = sort_devs_by_distance(ib_devs, num_devs);
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When the component opens an HCA, look up whether any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
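Two of the rules listed above are small enough to sketch directly: the MTU negotiation ("exchange maximum MTU; select the lesser of the two") and the meaning of -1 for max_btls ("use all available"). The helper names negotiate_mtu() and may_add_btl() are hypothetical and do not appear in the BTL source; this is only a minimal illustration of the two rules.

```c
#include <stdint.h>

/* Hypothetical helper: each side advertises the largest MTU it
   supports; the connection uses the smaller of the two values. */
uint32_t negotiate_mtu(uint32_t local_max_mtu, uint32_t remote_max_mtu)
{
    return (local_max_mtu < remote_max_mtu) ? local_max_mtu
                                            : remote_max_mtu;
}

/* Hypothetical helper: returns 1 if another BTL module may be created.
   A max_btls value of -1 means "no limit". */
int may_add_btl(int max_btls, int num_btls)
{
    return (-1 == max_btls || num_btls < max_btls) ? 1 : 0;
}
```

The second helper mirrors the loop condition used further down in this file when iterating over the sorted device list.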
2006-08-14 23:30:37 +04:00
/* We must loop through all the hca id's, get their handles and
for each hca we query the number of ports on the hca and set up
2008-01-21 15:11:18 +03:00
a distinct btl module for each hca port */
2005-07-01 01:28:35 +04:00
2008-01-21 15:11:18 +03:00
OBJ_CONSTRUCT(&btl_list, opal_list_t);
2005-07-04 02:45:48 +04:00
OBJ_CONSTRUCT(&mca_btl_openib_component.ib_lock, opal_mutex_t);
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
This commit adds Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers to
post, and when to replenish the receive queue (low watermark). For
SRQ QPs, the number of outstanding sends can also be specified. The
following is an example of the syntax to describe QPs to the OpenIB
BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
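The QP description syntax above can be illustrated with a small standalone parser. This is a sketch under stated assumptions, not the BTL's actual parsing code: qp_spec_t and parse_receive_queues() are hypothetical names, only the "P" and "S" forms described above are accepted, and the ascending-size rule is enforced by rejecting any QP whose buffer size is smaller than the previous one.

```c
#include <stdio.h>
#include <string.h>

#define MAX_QPS 8   /* hypothetical upper bound for this sketch */

typedef struct {
    char type;           /* 'P' = per-peer, 'S' = shared receive queue */
    int  buf_size;       /* receive buffer size in bytes */
    int  num_bufs;       /* number of receive buffers */
    int  low_watermark;  /* repost receive buffers at this level */
    int  max_sends;      /* SRQ only: outstanding sends; 0 for 'P' */
} qp_spec_t;

/* Parses a ';'-delimited list of QP descriptions into out[].
   Returns the number of QPs parsed, or -1 on any syntax error. */
int parse_receive_queues(const char *spec, qp_spec_t *out, int max_qps)
{
    const char *p = spec;
    int count = 0;

    while ('\0' != *p) {
        qp_spec_t qp;
        int consumed = 0;

        if (count >= max_qps) {
            return -1;
        }
        memset(&qp, 0, sizeof(qp));

        if ('P' == *p) {
            /* P,<size>,<num buffers>,<low watermark> */
            if (3 != sscanf(p, "P,%d,%d,%d%n", &qp.buf_size,
                            &qp.num_bufs, &qp.low_watermark, &consumed)) {
                return -1;
            }
            qp.type = 'P';
        } else if ('S' == *p) {
            /* S,<size>,<num buffers>,<low watermark>,<max sends> */
            if (4 != sscanf(p, "S,%d,%d,%d,%d%n", &qp.buf_size,
                            &qp.num_bufs, &qp.low_watermark,
                            &qp.max_sends, &consumed)) {
                return -1;
            }
            qp.type = 'S';
        } else {
            return -1;
        }
        p += consumed;

        /* QPs must be listed in ascending receive buffer size order. */
        if (count > 0 && qp.buf_size < out[count - 1].buf_size) {
            return -1;
        }
        out[count++] = qp;

        if (';' == *p) {
            p++;
            if ('\0' == *p) {
                return -1;   /* trailing ';' with nothing after it */
            }
        } else if ('\0' != *p) {
            return -1;       /* unexpected extra field or character */
        }
    }
    return count;
}
```

Applied to the example string above, this sketch yields four entries: one per-peer QP with 128-byte buffers followed by three SRQ QPs with 1024-, 4096-, and 65536-byte buffers.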
2007-07-18 05:15:59 +04:00
#if OMPI_HAVE_THREADS
2007-11-28 10:14:34 +03:00
mca_btl_openib_component.async_thread = 0;
2007-07-18 05:15:59 +04:00
#endif
2007-11-28 10:14:34 +03:00
for (i = 0; i < num_devs && (-1 == mca_btl_openib_component.ib_max_btls ||
mca_btl_openib_component.ib_num_btls <
mca_btl_openib_component.ib_max_btls); i++) {
2008-02-11 13:34:11 +03:00
if (0 == mca_btl_openib_component.ib_num_btls)
distance = dev_sorted[i].distance;
else if (distance != dev_sorted[i].distance)
break;
if (OMPI_SUCCESS !=
(ret = init_one_hca(&btl_list, dev_sorted[i].ib_dev)))
2006-08-14 23:30:37 +04:00
break ;
2006-09-19 12:56:32 +04:00
}
if (ret != OMPI_SUCCESS) {
2008-05-14 00:00:55 +04:00
orte_show_help("help-mpi-btl-openib.txt",
2008-03-24 02:10:15 +03:00
"error in hca init", true, orte_process_info.nodename);
2006-09-19 12:56:32 +04:00
}
2007-06-14 05:59:25 +04:00
2008-02-11 13:34:11 +03:00
free(dev_sorted);
2007-06-14 05:59:25 +04:00
/* If we got back from checking all the HCAs and find that there
are still items in the component.if_list, that means that they
didn't exist. Show an appropriate warning if the warning was
not disabled. */
if (0 != opal_argv_count(mca_btl_openib_component.if_list) &&
mca_btl_openib_component.warn_nonexistent_if) {
char *str = opal_argv_join(mca_btl_openib_component.if_list, ',');
2008-05-14 00:00:55 +04:00
orte_show_help("help-mpi-btl-openib.txt", "nonexistent port",
2008-03-24 02:10:15 +03:00
true, orte_process_info.nodename,
2008-01-21 15:11:18 +03:00
((NULL != mca_btl_openib_component.if_include) ?
2007-06-14 05:59:25 +04:00
"in" : "ex"), str);
free(str);
}
2008-01-21 15:11:18 +03:00
2006-09-19 12:56:32 +04:00
if (0 == mca_btl_openib_component.ib_num_btls) {
2008-05-14 00:00:55 +04:00
orte_show_help ( " help-mpi-btl-openib.txt " ,
2008-03-24 02:10:15 +03:00
" no active ports found " , true , orte_process_info . nodename ) ;
2006-09-19 12:56:32 +04:00
return NULL ;
}
2005-07-01 01:28:35 +04:00
/* Allocate space for btl modules */
2006-06-28 11:23:08 +04:00
mca_btl_openib_component . openib_btls =
2007-05-09 01:47:21 +04:00
malloc ( sizeof ( mca_btl_openib_module_t * ) *
2006-06-28 11:23:08 +04:00
mca_btl_openib_component . ib_num_btls ) ;
2005-07-12 17:38:54 +04:00
if ( NULL = = mca_btl_openib_component . openib_btls ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2005-07-01 01:28:35 +04:00
return NULL ;
}
2008-02-03 18:16:24 +03:00
btls = ( struct mca_btl_base_module_t * * )
malloc ( mca_btl_openib_component . ib_num_btls *
sizeof ( struct mca_btl_base_module_t * ) ) ;
2005-07-01 01:28:35 +04:00
if ( NULL = = btls ) {
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2005-07-01 01:28:35 +04:00
return NULL ;
}
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When the component opens an HCA, look up whether any corresponding
values were found in the INI files (identified by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
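As a rough illustration of the INI lookup and MTU negotiation described in the commit message above, here is a minimal sketch. The struct hca_ini_entry layout and the helper names lookup_ini_mtu and negotiate_mtu are hypothetical and are not taken from btl_openib_ini.l or the component code; MTUs are treated as plain byte counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached INI entry; the real structure is not shown here. */
struct hca_ini_entry {
    uint32_t vendor_id;
    uint32_t vendor_part_id;
    uint32_t mtu;             /* bytes */
};

/* Return the cached MTU for this HCA, or 0 if no entry matches.  A miss
 * is not fatal: the caller would print a warning and use a default. */
static uint32_t lookup_ini_mtu(const struct hca_ini_entry *entries, size_t n,
                               uint32_t vendor_id, uint32_t vendor_part_id)
{
    size_t i;

    assert(NULL != entries || 0 == n);
    for (i = 0; i < n; i++) {
        if (entries[i].vendor_id == vendor_id &&
            entries[i].vendor_part_id == vendor_part_id) {
            return entries[i].mtu;
        }
    }
    return 0;
}

/* Each peer sends its maximum MTU; both sides use the lesser value. */
static uint32_t negotiate_mtu(uint32_t local_max, uint32_t remote_max)
{
    assert(local_max > 0 && remote_max > 0);
    return (local_max < remote_max) ? local_max : remote_max;
}
```

Because both peers take the minimum of the same two values, they arrive at the same MTU without a further round trip.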
2006-08-14 23:30:37 +04:00
/* Copy the btl module structs into a contiguous array and fully
initialize them */
2005-07-01 01:28:35 +04:00
for ( i = 0 ; i < mca_btl_openib_component . ib_num_btls ; i + + ) {
2008-01-21 15:11:18 +03:00
item = opal_list_remove_first ( & btl_list ) ;
ib_selected = ( mca_btl_base_selected_module_t * ) item ;
2007-10-23 16:57:45 +04:00
mca_btl_openib_component . openib_btls [ i ] =
2008-01-21 15:11:18 +03:00
( mca_btl_openib_module_t * ) ib_selected - > btl_module ;
OBJ_RELEASE ( ib_selected ) ;
2007-05-09 01:47:21 +04:00
openib_btl = mca_btl_openib_component . openib_btls [ i ] ;
2007-10-23 15:10:52 +04:00
btls [ i ] = & openib_btl - > super ;
2007-10-23 16:57:45 +04:00
if ( finish_btl_init ( openib_btl ) ! = OMPI_SUCCESS )
return NULL ;
2007-10-23 15:10:52 +04:00
}
2005-07-01 01:28:35 +04:00
2006-08-14 23:30:37 +04:00
btl_openib_modex_send ( ) ;
2005-07-01 01:28:35 +04:00
* num_btl_modules = mca_btl_openib_component . ib_num_btls ;
2008-02-03 18:16:24 +03:00
ibv_free_device_list_compat ( ib_devs ) ;
2007-06-14 05:59:25 +04:00
if ( NULL ! = mca_btl_openib_component . if_include_list ) {
opal_argv_free ( mca_btl_openib_component . if_include_list ) ;
mca_btl_openib_component . if_include_list = NULL ;
}
if ( NULL ! = mca_btl_openib_component . if_exclude_list ) {
opal_argv_free ( mca_btl_openib_component . if_exclude_list ) ;
mca_btl_openib_component . if_exclude_list = NULL ;
}
2005-07-01 01:28:35 +04:00
return btls ;
2007-06-14 05:59:25 +04:00
no_btls :
/* If we fail early enough in the setup, we just modex around that
there are no openib BTL ' s in this process and return NULL . */
2008-05-02 15:52:33 +04:00
if ( MCA_BTL_XRC_ENABLED ) {
2007-11-28 10:18:59 +03:00
OBJ_DESTRUCT ( & mca_btl_openib_component . ib_addr_table ) ;
2008-05-02 15:52:33 +04:00
}
/* Be sure to shut down the fd listener */
ompi_btl_openib_fd_finalize ( ) ;
2007-11-28 10:18:59 +03:00
2007-06-14 05:59:25 +04:00
mca_btl_openib_component . ib_num_btls = 0 ;
btl_openib_modex_send ( ) ;
return NULL ;
2005-07-01 01:28:35 +04:00
}
2008-01-09 13:27:15 +03:00
static void progress_pending_eager_rdma ( mca_btl_base_endpoint_t * ep )
2006-08-31 00:21:47 +04:00
{
2008-01-09 13:27:15 +03:00
int qp ;
opal_list_item_t * frag ;
2006-12-14 18:52:13 +03:00
2008-01-09 13:27:15 +03:00
/* Go over all QPs and try to send high prio packets over eager rdma
* channel */
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
for ( qp = 0 ; qp < mca_btl_openib_component . num_qps ; qp + + ) {
while ( ep - > qps [ qp ] . qp - > sd_wqe > 0 & & ep - > eager_rdma_remote . tokens > 0 ) {
frag = opal_list_remove_first ( & ep - > qps [ qp ] . pending_frags [ 0 ] ) ;
if ( NULL = = frag )
break ;
mca_btl_openib_endpoint_post_send ( ep , to_send_frag ( frag ) ) ;
}
if ( ep - > eager_rdma_remote . tokens = = 0 )
break ;
2006-12-14 18:52:13 +03:00
}
2008-01-09 13:27:15 +03:00
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
2006-08-31 00:21:47 +04:00
}
2008-01-09 13:27:15 +03:00
static inline int
get_enpoint_credits ( mca_btl_base_endpoint_t * ep , const int qp )
2007-11-28 10:13:34 +03:00
{
2008-01-09 13:27:15 +03:00
return BTL_OPENIB_QP_TYPE_PP ( qp ) ? ep - > qps [ qp ] . u . pp_qp . sd_credits : 1 ;
}
static void progress_pending_frags_pp ( mca_btl_base_endpoint_t * ep , const int qp )
{
int i ;
opal_list_item_t * frag ;
2008-01-21 15:11:18 +03:00
2008-01-09 13:27:15 +03:00
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
for ( i = 0 ; i < 2 ; i + + ) {
while ( ( get_enpoint_credits ( ep , qp ) +
( 1 - i ) * ep - > eager_rdma_remote . tokens ) > 0 ) {
frag = opal_list_remove_first ( & ep - > qps [ qp ] . pending_frags [ i ] ) ;
if ( NULL = = frag )
break ;
mca_btl_openib_endpoint_post_send ( ep , to_send_frag ( frag ) ) ;
}
}
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
}
void mca_btl_openib_frag_progress_pending_put_get ( mca_btl_base_endpoint_t * ep ,
const int qp )
2008-01-21 15:11:18 +03:00
{
2008-01-09 13:27:15 +03:00
mca_btl_openib_module_t * openib_btl = ep - > endpoint_btl ;
opal_list_item_t * frag ;
size_t i , len = opal_list_get_size ( & ep - > pending_get_frags ) ;
for ( i = 0 ; i < len & & ep - > qps [ qp ] . qp - > sd_wqe > 0 & & ep - > get_tokens > 0 ; i + + )
{
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
frag = opal_list_remove_first ( & ( ep - > pending_get_frags ) ) ;
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
if ( NULL = = frag )
break ;
if ( mca_btl_openib_get ( ( mca_btl_base_module_t * ) openib_btl , ep ,
& to_base_frag ( frag ) - > base ) = = OMPI_ERR_OUT_OF_RESOURCE )
break ;
}
2008-01-21 15:11:18 +03:00
2008-01-09 13:27:15 +03:00
len = opal_list_get_size ( & ep - > pending_put_frags ) ;
for ( i = 0 ; i < len & & ep - > qps [ qp ] . qp - > sd_wqe > 0 ; i + + ) {
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
frag = opal_list_remove_first ( & ( ep - > pending_put_frags ) ) ;
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
if ( NULL = = frag )
break ;
if ( mca_btl_openib_put ( ( mca_btl_base_module_t * ) openib_btl , ep ,
& to_base_frag ( frag ) - > base ) = = OMPI_ERR_OUT_OF_RESOURCE )
break ;
}
2007-11-28 10:13:34 +03:00
}
2006-08-31 00:21:47 +04:00
2006-09-12 13:17:59 +04:00
static int btl_openib_handle_incoming ( mca_btl_openib_module_t * openib_btl ,
2007-11-28 10:13:34 +03:00
mca_btl_openib_endpoint_t * ep ,
2008-01-21 15:11:18 +03:00
mca_btl_openib_recv_frag_t * frag ,
2007-11-28 10:12:44 +03:00
size_t byte_len )
2006-03-26 12:30:50 +04:00
{
2007-11-28 10:11:14 +03:00
mca_btl_base_descriptor_t * des = & to_base_frag ( frag ) - > base ;
mca_btl_openib_header_t * hdr = frag - > hdr ;
2007-11-28 10:13:34 +03:00
int rqp = to_base_frag ( frag ) - > base . order , cqp ;
uint16_t rcredits = 0 , credits ;
bool is_credit_msg ;
2007-11-28 10:11:14 +03:00
2007-11-28 10:13:34 +03:00
if ( ep - > nbo ) {
2007-11-28 10:11:14 +03:00
BTL_OPENIB_HEADER_NTOH ( * hdr ) ;
2007-01-13 02:14:45 +03:00
}
2007-03-14 17:36:03 +03:00
2006-09-12 13:17:59 +04:00
/* advance the segment address past the header and subtract from the
2007-11-28 10:13:34 +03:00
* length . */
2007-11-28 10:11:14 +03:00
des - > des_dst - > seg_len = byte_len - sizeof ( mca_btl_openib_header_t ) ;
2006-03-26 12:30:50 +04:00
2007-11-28 10:13:34 +03:00
if ( OPAL_LIKELY ( ! ( is_credit_msg = is_credit_message ( frag ) ) ) ) {
/* call registered callback */
2008-01-15 08:32:53 +03:00
mca_btl_active_message_callback_t * reg ;
reg = mca_btl_base_active_message_trigger + hdr - > tag ;
reg - > cbfunc ( & openib_btl - > super , hdr - > tag , des , reg - > cbdata ) ;
2007-12-09 16:56:13 +03:00
if ( MCA_BTL_OPENIB_RDMA_FRAG ( frag ) ) {
cqp = ( hdr - > credits > > 11 ) & 0x0f ;
hdr - > credits & = 0x87ff ;
} else {
cqp = rqp ;
}
2007-11-28 10:13:34 +03:00
if ( BTL_OPENIB_IS_RDMA_CREDITS ( hdr - > credits ) ) {
rcredits = BTL_OPENIB_CREDITS ( hdr - > credits ) ;
hdr - > credits = 0 ;
2007-11-28 10:12:44 +03:00
}
2007-07-24 17:23:08 +04:00
} else {
2007-11-28 10:13:34 +03:00
mca_btl_openib_rdma_credits_header_t * chdr = des - > des_dst - > seg_addr . pval ;
if ( ep - > nbo ) {
BTL_OPENIB_RDMA_CREDITS_HEADER_NTOH ( * chdr ) ;
}
cqp = chdr - > qpn ;
rcredits = chdr - > rdma_credits ;
}
credits = hdr - > credits ;
2007-11-28 10:12:44 +03:00
2007-11-28 10:13:34 +03:00
if ( hdr - > cm_seen )
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . cm_sent , - hdr - > cm_seen ) ;
/* Now return fragment. Don't touch hdr after this point! */
if ( MCA_BTL_OPENIB_RDMA_FRAG ( frag ) ) {
mca_btl_openib_eager_rdma_local_t * erl = & ep - > eager_rdma_local ;
OPAL_THREAD_LOCK ( & erl - > lock ) ;
MCA_BTL_OPENIB_RDMA_MAKE_REMOTE ( frag - > ftr ) ;
while ( erl - > tail ! = erl - > head ) {
mca_btl_openib_recv_frag_t * tf ;
tf = MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG ( ep , erl - > tail ) ;
if ( MCA_BTL_OPENIB_RDMA_FRAG_LOCAL ( tf ) )
break ;
OPAL_THREAD_ADD32 ( & erl - > credits , 1 ) ;
MCA_BTL_OPENIB_RDMA_NEXT_INDEX ( erl - > tail ) ;
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
This adds Galen's fine-grain control of queue pair resources to the
OpenIB BTL (thanks to Gleb for fixing broken code and providing
additional functionality, Pasha for finding broken code, and Jeff for
doing all the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers to
post, and when to replenish the receive queue (low water mark). For
SRQ QPs, the number of outstanding sends can also be specified. The
following is an example of the syntax to describe QPs to the OpenIB
BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
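The syntax described above can be sketched as a small parser. This is an illustration only, not the BTL's actual parsing code: the struct qp_spec layout and the functions parse_one_qp and parse_receive_queues are hypothetical names, and treating a size smaller than the previous one as an error is one reading of the ascending-order rule.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum qp_type { QP_PER_PEER, QP_SRQ };

struct qp_spec {
    enum qp_type type;
    unsigned size;        /* receive buffer size in bytes */
    unsigned num;         /* number of receive buffers */
    unsigned low_mark;    /* low watermark for reposting */
    unsigned max_sends;   /* SRQ only; 0 for per-peer QPs */
};

/* Parse one "P,size,num,low" or "S,size,num,low,sends" description.
 * Returns 0 on success, -1 if the description is malformed. */
static int parse_one_qp(const char *desc, struct qp_spec *out)
{
    unsigned size = 0, num = 0, low = 0, sends = 0;
    char extra;

    assert(NULL != desc && NULL != out);
    if ('P' == desc[0]) {
        /* The trailing %c only converts if junk follows the last field. */
        if (3 != sscanf(desc, "P,%u,%u,%u%c", &size, &num, &low, &extra)) {
            return -1;
        }
        out->type = QP_PER_PEER;
    } else if ('S' == desc[0]) {
        if (4 != sscanf(desc, "S,%u,%u,%u,%u%c",
                        &size, &num, &low, &sends, &extra)) {
            return -1;
        }
        out->type = QP_SRQ;
    } else {
        return -1;
    }
    out->size = size;
    out->num = num;
    out->low_mark = low;
    out->max_sends = sends;
    return 0;
}

/* Split the spec on ';' and parse each QP.  Returns the number of QPs
 * parsed, or -1 on any error (including sizes out of ascending order). */
static int parse_receive_queues(const char *spec, struct qp_spec *qps,
                                int max_qps)
{
    char buf[64];
    const char *p = spec;
    int count = 0;

    assert(NULL != spec && NULL != qps);
    while ('\0' != *p) {
        size_t len = strcspn(p, ";");
        if (0 == len || len >= sizeof(buf) || count >= max_qps) {
            return -1;
        }
        memcpy(buf, p, len);
        buf[len] = '\0';
        if (0 != parse_one_qp(buf, &qps[count])) {
            return -1;
        }
        if (count > 0 && qps[count].size < qps[count - 1].size) {
            return -1;
        }
        count++;
        p += len;
        if (';' == *p) {
            p++;
        }
    }
    return count;
}
```

Feeding the example string from this message to parse_receive_queues() yields four entries: one per-peer QP followed by three SRQ QPs.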
This commit was SVN r15474.
2007-07-18 05:15:59 +04:00
}
2007-11-28 10:13:34 +03:00
OPAL_THREAD_UNLOCK ( & erl - > lock ) ;
} else {
MCA_BTL_IB_FRAG_RETURN ( frag ) ;
2007-11-28 10:20:26 +03:00
if ( BTL_OPENIB_QP_TYPE_PP ( rqp ) ) {
2007-11-28 10:13:34 +03:00
if ( OPAL_UNLIKELY ( is_credit_msg ) )
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . cm_received , 1 ) ;
else
OPAL_THREAD_ADD32 ( & ep - > qps [ rqp ] . u . pp_qp . rd_posted , - 1 ) ;
mca_btl_openib_endpoint_post_rr ( ep , cqp ) ;
2007-11-28 10:20:26 +03:00
} else {
mca_btl_openib_module_t * btl = ep - > endpoint_btl ;
OPAL_THREAD_ADD32 ( & btl - > qps [ rqp ] . u . srq_qp . rd_posted , - 1 ) ;
2007-11-28 17:52:31 +03:00
mca_btl_openib_post_srr ( btl , rqp ) ;
2007-11-28 10:13:34 +03:00
}
}
if ( rcredits > 0 ) {
OPAL_THREAD_ADD32 ( & ep - > eager_rdma_remote . tokens , rcredits ) ;
progress_pending_eager_rdma ( ep ) ;
2006-08-14 23:30:37 +04:00
}
2006-03-26 12:30:50 +04:00
2007-11-28 10:13:34 +03:00
assert ( ( cqp ! = MCA_BTL_NO_ORDER & & BTL_OPENIB_QP_TYPE_PP ( cqp ) ) | | ! credits ) ;
if ( credits ) {
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . sd_credits , credits ) ;
progress_pending_frags_pp ( ep , cqp ) ;
}
2007-12-09 16:56:13 +03:00
send_credits ( ep , cqp ) ;
2007-11-28 10:13:34 +03:00
2006-03-26 12:30:50 +04:00
return OMPI_SUCCESS ;
}
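The shift and masks used on hdr->credits in btl_openib_handle_incoming() above suggest a 16-bit field that carries a credit count in the low 11 bits and a QP index in bits 11 to 14. The sketch below illustrates that layout with hypothetical helper names. Treating bit 15 as the "RDMA credits" flag is an assumption inferred from the 0x87ff mask preserving that bit, not something the code shown here confirms.

```c
#include <assert.h>
#include <stdint.h>

#define CREDITS_QP_SHIFT   11
#define CREDITS_QP_MASK    0x0f     /* 4-bit QP index, as in the code above */
#define CREDITS_KEEP_MASK  0x87ff   /* clears the QP bits, as in the code above */
#define CREDITS_COUNT_MASK 0x07ff   /* low 11 bits */
#define CREDITS_RDMA_FLAG  0x8000   /* assumption: bit 15 marks RDMA credits */

/* Build a credits field from its three parts (hypothetical helper). */
static uint16_t credits_pack(uint16_t count, unsigned qp, int is_rdma)
{
    assert(count <= CREDITS_COUNT_MASK && qp <= CREDITS_QP_MASK);
    return (uint16_t)(count | (qp << CREDITS_QP_SHIFT) |
                      (is_rdma ? CREDITS_RDMA_FLAG : 0));
}

/* Extract the QP index and clear it from the field, mirroring the
 * ">> 11 & 0x0f" and "&= 0x87ff" steps in the receive path. */
static unsigned credits_take_qp(uint16_t *field)
{
    unsigned qp = ((unsigned)*field >> CREDITS_QP_SHIFT) & CREDITS_QP_MASK;

    *field = (uint16_t)(*field & CREDITS_KEEP_MASK);
    return qp;
}

static int credits_is_rdma(uint16_t field)
{
    return 0 != (field & CREDITS_RDMA_FLAG);
}

static uint16_t credits_count(uint16_t field)
{
    return (uint16_t)(field & CREDITS_COUNT_MASK);
}
```

Under this layout a single 16-bit header field can carry the credit count, the QP it applies to, and the kind of credit.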
2006-08-14 23:30:37 +04:00
static char * btl_openib_component_status_to_string ( enum ibv_wc_status status )
2008-01-21 15:11:18 +03:00
{
switch ( status ) {
2006-08-14 23:30:37 +04:00
case IBV_WC_SUCCESS :
2008-01-21 15:11:18 +03:00
return " SUCCESS " ;
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the retun status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
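The MTU negotiation in the last two bullets above ("exchange maximum MTU; select the lesser of the two") can be sketched as follows. This is a minimal illustration, not the code from the openib BTL: `sketch_mtu_t`, `sketch_negotiate_mtu()` and `sketch_mtu_bytes()` are hypothetical names, and the enum is assumed to be ordered like libibverbs' `enum ibv_mtu` (256 bytes = 1 up to 4096 bytes = 5) so that the example compiles without the verbs headers.

```c
#include <assert.h>

/* Hypothetical stand-in for an MTU enum, ordered smallest to largest
 * (assumed to mirror the ordering of libibverbs' enum ibv_mtu). */
typedef enum {
    SKETCH_MTU_256 = 1,
    SKETCH_MTU_512,
    SKETCH_MTU_1024,
    SKETCH_MTU_2048,
    SKETCH_MTU_4096
} sketch_mtu_t;

/* Each peer advertises its maximum MTU; both sides then pick the
 * smaller value, so they agree without a further round trip. */
static sketch_mtu_t sketch_negotiate_mtu(sketch_mtu_t local_max,
                                         sketch_mtu_t remote_max)
{
    assert(local_max >= SKETCH_MTU_256 && local_max <= SKETCH_MTU_4096);
    assert(remote_max >= SKETCH_MTU_256 && remote_max <= SKETCH_MTU_4096);
    return (local_max < remote_max) ? local_max : remote_max;
}

/* Convert the enum to a byte count: 1 -> 256, 2 -> 512, ... 5 -> 4096. */
static int sketch_mtu_bytes(sketch_mtu_t mtu)
{
    return 128 << (int) mtu;
}
```

Because both peers apply the same rule to the same pair of values, each side arrives at the same MTU independently.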
2006-08-14 23:30:37 +04:00
        break;
    case IBV_WC_LOC_LEN_ERR:
2008-01-21 15:11:18 +03:00
        return "LOCAL LENGTH ERROR";
2006-08-14 23:30:37 +04:00
        break;
    case IBV_WC_LOC_QP_OP_ERR:
        return "LOCAL QP OPERATION ERROR";
        break;
    case IBV_WC_LOC_EEC_OP_ERR:
        return "LOCAL EEC OPERATION ERROR";
        break;
    case IBV_WC_LOC_PROT_ERR:
        return "LOCAL PROTOCOL ERROR";
        break;
    case IBV_WC_WR_FLUSH_ERR:
        return "WORK REQUEST FLUSHED ERROR";
        break;
    case IBV_WC_MW_BIND_ERR:
        return "MEMORY WINDOW BIND ERROR";
        break;
    case IBV_WC_BAD_RESP_ERR:
        return "BAD RESPONSE ERROR";
        break;
    case IBV_WC_LOC_ACCESS_ERR:
        return "LOCAL ACCESS ERROR";
        break;
    case IBV_WC_REM_INV_REQ_ERR:
        return "INVALID REQUEST ERROR";
        break;
    case IBV_WC_REM_ACCESS_ERR:
        return "REMOTE ACCESS ERROR";
        break;
    case IBV_WC_REM_OP_ERR:
        return "REMOTE OPERATION ERROR";
        break;
    case IBV_WC_RETRY_EXC_ERR:
        return "RETRY EXCEEDED ERROR";
        break;
    case IBV_WC_RNR_RETRY_EXC_ERR:
2007-04-27 01:03:38 +04:00
        return "RECEIVER NOT READY RETRY EXCEEDED ERROR";
2006-08-14 23:30:37 +04:00
        break;
    case IBV_WC_LOC_RDD_VIOL_ERR:
        return "LOCAL RDD VIOLATION ERROR";
        break;
    case IBV_WC_REM_INV_RD_REQ_ERR:
        return "INVALID READ REQUEST ERROR";
        break;
    case IBV_WC_REM_ABORT_ERR:
        return "REMOTE ABORT ERROR";
        break;
    case IBV_WC_INV_EECN_ERR:
        return "INVALID EECN ERROR";
        break;
    case IBV_WC_INV_EEC_STATE_ERR:
        return "INVALID EEC STATE ERROR";
        break;
2008-01-21 15:11:18 +03:00
    case IBV_WC_FATAL_ERR:
2006-08-14 23:30:37 +04:00
        return "FATAL ERROR";
        break;
    case IBV_WC_RESP_TIMEOUT_ERR:
        return "RESPONSE TIMEOUT ERROR";
        break;
    case IBV_WC_GENERAL_ERR:
        return "GENERAL ERROR";
        break;
    default:
        return "STATUS UNDEFINED";
        break;
    }
2006-06-06 00:02:41 +04:00
}
2006-08-14 23:30:37 +04:00
2008-01-09 17:46:41 +03:00
static void
progress_pending_frags_wqe(mca_btl_base_endpoint_t *ep, const int qpn)
2007-11-28 10:15:20 +03:00
{
    int i;
    opal_list_item_t *frag;
2008-01-09 17:46:41 +03:00
    mca_btl_openib_qp_t *qp = ep->qps[qpn].qp;
    OPAL_THREAD_LOCK(&ep->endpoint_lock);
2007-11-28 10:15:20 +03:00
    for(i = 0; i < 2; i++) {
        while(qp->sd_wqe > 0) {
            mca_btl_base_endpoint_t *ep;
2008-01-09 17:46:41 +03:00
            OPAL_THREAD_LOCK(&qp->lock);
2007-11-28 10:15:20 +03:00
            frag = opal_list_remove_first(&qp->pending_frags[i]);
2008-01-09 17:46:41 +03:00
            OPAL_THREAD_UNLOCK(&qp->lock);
2007-11-28 10:15:20 +03:00
            if(NULL == frag)
                break;
            ep = to_com_frag(frag)->endpoint;
            mca_btl_openib_endpoint_post_send(ep, to_send_frag(frag));
        }
    }
2008-01-09 17:46:41 +03:00
    OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
2007-11-28 10:15:20 +03:00
}
2008-01-09 13:27:15 +03:00
static void progress_pending_frags_srq(mca_btl_openib_module_t *openib_btl,
        const int qp)
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers to
post, and when to replenish the receive queue (low water mark). For
SRQ QPs, the number of outstanding sends can also be specified. The
following is an example of the syntax to describe QPs to the OpenIB
BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
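The btl_openib_receive_queues syntax described above can be parsed roughly as follows. This is a hedged sketch in plain C, not the parser used by the openib BTL: the type `sketch_qp_spec_t`, the function `sketch_parse_receive_queues()` and the `SKETCH_MAX_QPS` limit are hypothetical. It also assumes that "ascending" means strictly increasing buffer sizes, which the commit message does not spell out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_QPS 8   /* hypothetical limit for this sketch */

/* One parsed QP description (hypothetical type, for illustration only). */
typedef struct {
    char type;     /* 'P' = per-peer, 'S' = shared receive queue */
    long size;     /* receive buffer size in bytes */
    long rd_num;   /* number of receive buffers */
    long rd_low;   /* low watermark that triggers reposting */
    long sd_max;   /* SRQ only: outstanding sends allowed; -1 for 'P' */
} sketch_qp_spec_t;

/* Parse "P,128,16,4;S,1024,256,128,32;..." into out[].
 * Returns the number of QPs parsed, or -1 if the string is malformed
 * or the buffer sizes are not in ascending order. */
static int sketch_parse_receive_queues(const char *spec,
                                       sketch_qp_spec_t *out, int max_out)
{
    const char *p = spec;
    int n = 0;

    assert(NULL != spec && NULL != out);
    while ('\0' != *p) {
        char seg[64], extra;
        size_t len = strcspn(p, ";");   /* length of this QP description */
        sketch_qp_spec_t q = { 0, 0, 0, 0, -1 };
        int fields;

        if (0 == len || len >= sizeof(seg) || n >= max_out) {
            return -1;
        }
        memcpy(seg, p, len);
        seg[len] = '\0';

        /* The trailing %c only matches if unexpected text follows. */
        fields = sscanf(seg, "%c,%ld,%ld,%ld,%ld%c", &q.type, &q.size,
                        &q.rd_num, &q.rd_low, &q.sd_max, &extra);
        if (!(('P' == q.type && 4 == fields) ||
              ('S' == q.type && 5 == fields))) {
            return -1;
        }
        /* Assumption: sizes must be positive and strictly increasing. */
        if (q.size <= 0 || (n > 0 && q.size <= out[n - 1].size)) {
            return -1;
        }
        out[n++] = q;

        p += len;
        if (';' == *p) {
            p++;
            if ('\0' == *p) {
                return -1;   /* trailing semicolon with nothing after it */
            }
        }
    }
    return n;
}
```

Applied to the four-QP example string from the commit message, this sketch yields one per-peer entry followed by three SRQ entries, each with a send limit of 32.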
2007-07-18 05:15:59 +04:00
{
2007-11-28 10:11:14 +03:00
    opal_list_item_t *frag;
2007-11-28 10:12:44 +03:00
    int i;
2008-01-21 15:11:18 +03:00
2007-11-28 10:18:59 +03:00
    assert(BTL_OPENIB_QP_TYPE_SRQ(qp) || BTL_OPENIB_QP_TYPE_XRC(qp));
2008-01-21 15:11:18 +03:00
2007-11-28 10:12:44 +03:00
    for(i = 0; i < 2; i++) {
2007-11-28 10:20:26 +03:00
        while(openib_btl->qps[qp].u.srq_qp.sd_credits > 0) {
2007-11-28 10:12:44 +03:00
            OPAL_THREAD_LOCK(&openib_btl->ib_lock);
2007-11-28 10:20:26 +03:00
            frag = opal_list_remove_first(
                    &openib_btl->qps[qp].u.srq_qp.pending_frags[i]);
2007-11-28 10:12:44 +03:00
            OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
            if(NULL == frag)
                break;
            mca_btl_openib_endpoint_send(to_com_frag(frag)->endpoint,
                    to_send_frag(frag));
        }
2007-05-09 01:47:21 +04:00
    }
}
2007-12-23 15:29:34 +03:00
static char *cq_name[] = {"HP CQ", "LP CQ"};
static void handle_wc(mca_btl_openib_hca_t *hca, const uint32_t cq,
        struct ibv_wc *wc)
2006-11-02 19:15:21 +03:00
{
2007-12-23 15:29:34 +03:00
    static int flush_err_printed[] = {0, 0};
2007-11-28 10:11:14 +03:00
    mca_btl_openib_com_frag_t *frag;
    mca_btl_base_descriptor_t *des;
2007-12-23 15:29:34 +03:00
    mca_btl_openib_endpoint_t *endpoint;
2007-08-20 16:28:25 +04:00
    mca_btl_openib_module_t *openib_btl = NULL;
2007-12-23 15:29:34 +03:00
    ompi_proc_t *remote_proc = NULL;
2008-02-18 20:39:30 +03:00
    int qp, btl_ownership;
2006-11-02 19:15:21 +03:00
2007-12-23 15:29:34 +03:00
    des = (mca_btl_base_descriptor_t *)(uintptr_t)wc->wr_id;
    frag = to_com_frag(des);
2007-12-02 17:46:37 +03:00
2007-12-23 15:29:34 +03:00
    /* For receive fragments "order" contains QP idx the fragment was posted
     * to. For send fragments "order" contains QP idx the fragment was sent
     * through */
    qp = des->order;
    endpoint = frag->endpoint;
2007-11-28 10:11:14 +03:00
2007-12-23 15:29:34 +03:00
    if(endpoint)
        openib_btl = endpoint->endpoint_btl;
2007-11-28 10:11:14 +03:00
2007-12-23 15:29:34 +03:00
    if(wc->status != IBV_WC_SUCCESS)
        goto error;
2007-08-20 16:28:25 +04:00
2007-12-23 15:29:34 +03:00
    /* Handle work completions */
    switch(wc->opcode) {
        case IBV_WC_RDMA_READ:
            OPAL_THREAD_ADD32(&endpoint->get_tokens, 1);
            /* fall through */
        case IBV_WC_RDMA_WRITE:
        case IBV_WC_SEND:
            if(openib_frag_type(des) == MCA_BTL_OPENIB_FRAG_SEND) {
                opal_list_item_t *i;
2008-02-18 20:39:30 +03:00
                while((i = opal_list_remove_first(&to_send_frag(des)->coalesced_frags))) {
                    btl_ownership = (to_base_frag(i)->base.des_flags & MCA_BTL_DES_FLAGS_BTL_OWNERSHIP);
2007-12-23 15:29:34 +03:00
                    to_base_frag(i)->base.des_cbfunc(&openib_btl->super, endpoint,
                            &to_base_frag(i)->base, OMPI_SUCCESS);
2008-02-18 20:39:30 +03:00
                    if(btl_ownership) {
                        mca_btl_openib_free(&openib_btl->super, &to_base_frag(i)->base);
                    }
                }
2007-12-23 15:29:34 +03:00
            }
            /* Process a completed send/put/get */
2008-02-18 20:39:30 +03:00
            btl_ownership = (des->des_flags & MCA_BTL_DES_FLAGS_BTL_OWNERSHIP);
2007-12-23 15:29:34 +03:00
            des->des_cbfunc(&openib_btl->super, endpoint, des, OMPI_SUCCESS);
2008-02-18 20:39:30 +03:00
            if(btl_ownership) {
                mca_btl_openib_free(&openib_btl->super, des);
            }
2007-11-28 10:11:14 +03:00
2007-12-23 15:29:34 +03:00
            /* return send wqe */
            qp_put_wqe(endpoint, qp);
2005-11-10 23:15:02 +03:00
2007-12-23 15:29:34 +03:00
            if(IBV_WC_SEND == wc->opcode && !BTL_OPENIB_QP_TYPE_PP(qp)) {
                OPAL_THREAD_ADD32(&openib_btl->qps[qp].u.srq_qp.sd_credits, 1);
2007-07-18 05:15:59 +04:00
2007-12-23 15:29:34 +03:00
                /* new SRQ credit available. Try to progress pending frags */
                progress_pending_frags_srq(openib_btl, qp);
            }
            /* new wqe or/and get token available. Try to progress pending frags */
2008-01-09 17:46:41 +03:00
            progress_pending_frags_wqe(endpoint, qp);
2007-12-23 15:29:34 +03:00
            mca_btl_openib_frag_progress_pending_put_get(endpoint, qp);
            break;
        case IBV_WC_RECV:
            if(wc->wc_flags & IBV_WC_WITH_IMM) {
                endpoint = (mca_btl_openib_endpoint_t *)
                    opal_pointer_array_get_item(hca->endpoints, wc->imm_data);
                frag->endpoint = endpoint;
                openib_btl = endpoint->endpoint_btl;
            }
2007-11-28 10:18:59 +03:00
2007-12-23 15:29:34 +03:00
            /* Process a RECV */
            if(btl_openib_handle_incoming(openib_btl, endpoint, to_recv_frag(frag),
                        wc->byte_len) != OMPI_SUCCESS) {
                openib_btl->error_cb(&openib_btl->super, MCA_BTL_ERROR_FLAGS_FATAL);
2005-11-10 23:15:02 +03:00
                break;
2007-12-23 15:29:34 +03:00
            }
2007-07-24 17:23:08 +04:00
2007-12-23 15:29:34 +03:00
            /* decide if it is time to set up an eager rdma channel */
            if(!endpoint->eager_rdma_local.base.pval && endpoint->use_eager_rdma &&
                    wc->byte_len < mca_btl_openib_component.eager_limit &&
                    openib_btl->eager_rdma_channels <
                    mca_btl_openib_component.max_eager_rdma &&
                    OPAL_THREAD_ADD32(&endpoint->eager_recv_count, 1) ==
                    mca_btl_openib_component.eager_rdma_threshold) {
                mca_btl_openib_endpoint_connect_eager_rdma(endpoint);
            }
            break;
        default:
            BTL_ERROR(("Unhandled work completion opcode is %d", wc->opcode));
            if(openib_btl)
                openib_btl->error_cb(&openib_btl->super, MCA_BTL_ERROR_FLAGS_FATAL);
            break;
    }
2007-07-26 17:56:07 +04:00
2007-12-23 15:29:34 +03:00
    return;
error:
    if(endpoint && endpoint->endpoint_proc && endpoint->endpoint_proc->proc_ompi)
        remote_proc = endpoint->endpoint_proc->proc_ompi;
2008-05-07 20:53:05 +04:00
    /* For iWARP, the TCP connection is tied to the QP once the QP is
     * in RTS. And destroying the QP is thus tied to connection
     * teardown for iWARP. To destroy the connection in iWARP you
     * must move the QP out of RTS, either into CLOSING for a nice
     * graceful close (e.g., via rdma_disconnect()), or to ERROR if
     * you want to be rude (e.g., just destroying the QP without
     * disconnecting first). In both cases, all pending non-completed
     * SQ and RQ WRs will automatically be flushed.
2008-05-07 01:57:40 +04:00
     */
2008-05-12 15:56:14 +04:00
#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
2008-05-12 16:05:16 +04:00
    if(IBV_WC_WR_FLUSH_ERR == wc->status &&
2008-05-12 15:56:14 +04:00
            IBV_TRANSPORT_IWARP == hca->ib_dev->transport_type) {
2008-05-07 01:57:40 +04:00
        return;
    }
2008-05-12 15:56:14 +04:00
#endif
2008-05-07 01:57:40 +04:00
2008-05-12 16:05:16 +04:00
    if(IBV_WC_WR_FLUSH_ERR != wc->status || !flush_err_printed[cq]++) {
2007-12-23 15:29:34 +03:00
        BTL_PEER_ERROR(remote_proc, ("error polling %s with status %s "
                    "status number %d for wr_id %llu opcode %d qp_idx %d",
                    cq_name[cq], btl_openib_component_status_to_string(wc->status),
                    wc->status, wc->wr_id, wc->opcode, qp));
    }
    if(IBV_WC_RETRY_EXC_ERR == wc->status)
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already made the search-and-replace
on the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will behave similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done in configure.m4 to search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
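The duplicate-suppression behaviour described for orte_show_help() above (display a help message once, then only count further copies and report them periodically) can be sketched as follows. This is an illustrative stand-in, not ORTE's implementation: `sketch_help_table_t`, `sketch_help_record()` and `sketch_help_flush()` are hypothetical names, the fixed-size table is an assumption, and the roughly 5 second timer that would call the flush function is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_TOPICS 16    /* hypothetical table size */
#define SKETCH_KEY_LEN    128   /* hypothetical key length limit */

typedef struct {
    char key[SKETCH_KEY_LEN];   /* "filename:topic" */
    int pending;                /* duplicates seen since the last report */
} sketch_help_entry_t;

/* Must be zero-initialized before first use. */
typedef struct {
    sketch_help_entry_t entries[SKETCH_MAX_TOPICS];
    int count;
} sketch_help_table_t;

/* Record one incoming help message.
 * Returns 1 if it is new and should be displayed now, 0 if it is a
 * duplicate that was only counted, -1 if it cannot be tracked. */
static int sketch_help_record(sketch_help_table_t *t,
                              const char *file, const char *topic)
{
    char key[SKETCH_KEY_LEN];
    int i, len;

    assert(NULL != t && NULL != file && NULL != topic);
    len = snprintf(key, sizeof(key), "%s:%s", file, topic);
    if (len < 0 || (size_t) len >= sizeof(key)) {
        return -1;              /* key would be truncated */
    }
    for (i = 0; i < t->count; i++) {
        if (0 == strcmp(t->entries[i].key, key)) {
            t->entries[i].pending++;
            return 0;
        }
    }
    if (t->count >= SKETCH_MAX_TOPICS) {
        return -1;              /* table full */
    }
    strcpy(t->entries[t->count].key, key);
    t->entries[t->count].pending = 0;
    t->count++;
    return 1;
}

/* Intended to be called periodically: report the duplicates seen since
 * the previous call and reset the counters. Returns the total number
 * of duplicates reported. */
static int sketch_help_flush(sketch_help_table_t *t, FILE *out)
{
    int i, total = 0;

    for (i = 0; i < t->count; i++) {
        if (t->entries[i].pending > 0) {
            fprintf(out, "%d new copies of help message %s\n",
                    t->entries[i].pending, t->entries[i].key);
            total += t->entries[i].pending;
            t->entries[i].pending = 0;
        }
    }
    return total;
}
```

Keying on the file name plus topic means that N processes hitting the same error produce one displayed message and a running count, which matches the aggregation the commit message describes.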
2008-05-14 00:00:55 +04:00
        orte_show_help("help-mpi-btl-openib.txt", "btl_openib:retry-exceeded", true);
2007-12-23 15:29:34 +03:00
    if(openib_btl)
        openib_btl->error_cb(&openib_btl->super, MCA_BTL_ERROR_FLAGS_FATAL);
}
static int poll_hca(mca_btl_openib_hca_t *hca, int count)
{
2008-01-21 15:11:18 +03:00
int ne = 0 , cq ;
2007-12-23 15:29:34 +03:00
uint32_t hp_iter = 0 ;
struct ibv_wc wc ;
hca - > pollme = false ;
for ( cq = 0 ; cq < 2 & & hp_iter < mca_btl_openib_component . cq_poll_progress ; )
{
ne = ibv_poll_cq ( hca - > ib_cq [ cq ] , 1 , & wc ) ;
if ( 0 = = ne ) {
/* Don't check the low-priority CQ if there was something in the
 * high-priority CQ, but poll the low-priority CQ once for every
 * cq_poll_ratio high-priority CQ polls */
if ( count & & hca - > hp_cq_polls )
2005-07-01 01:28:35 +04:00
break ;
2007-12-23 15:29:34 +03:00
cq + + ;
hca - > hp_cq_polls = mca_btl_openib_component . cq_poll_ratio ;
continue ;
2005-07-01 01:28:35 +04:00
}
2007-12-23 15:29:34 +03:00
if ( ne < 0 )
goto error ;
count + + ;
if ( BTL_OPENIB_HP_CQ = = cq ) {
hca - > pollme = true ;
hp_iter + + ;
hca - > hp_cq_polls - - ;
}
2008-01-21 15:11:18 +03:00
2007-12-23 15:29:34 +03:00
handle_wc ( hca , cq , & wc ) ;
2005-07-01 01:28:35 +04:00
}
2008-01-21 15:11:18 +03:00
2005-07-01 01:28:35 +04:00
return count ;
2006-09-12 13:17:59 +04:00
error :
2008-05-02 15:52:33 +04:00
BTL_ERROR ( ( " error polling %s with %d errno says %s " , cq_name [ cq ] , ne ,
2008-01-21 15:11:18 +03:00
strerror ( errno ) ) ) ;
2006-09-05 19:59:02 +04:00
return count ;
2005-07-01 01:28:35 +04:00
}
2007-06-14 05:59:25 +04:00
2008-05-02 15:52:33 +04:00
# if OMPI_ENABLE_PROGRESS_THREADS
2008-01-09 13:27:15 +03:00
void * mca_btl_openib_progress_thread ( opal_object_t * arg )
2007-06-14 05:59:25 +04:00
{
2008-01-09 13:27:15 +03:00
opal_thread_t * thread = ( opal_thread_t * ) arg ;
mca_btl_openib_hca_t * hca = thread - > t_arg ;
struct ibv_cq * ev_cq ;
void * ev_ctx ;
2007-06-14 05:59:25 +04:00
2008-01-09 13:27:15 +03:00
/* This thread enters a cancel-enabled state */
pthread_setcancelstate ( PTHREAD_CANCEL_ENABLE , NULL ) ;
pthread_setcanceltype ( PTHREAD_CANCEL_ASYNCHRONOUS , NULL ) ;
2007-06-14 05:59:25 +04:00
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not their opal_* functions.
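The duplicate suppression described for orte_show_help() can be illustrated with a small standalone sketch. This is not the ORTE implementation: the names help_entry_t, help_seen() and help_flush(), the fixed-size table, and the (file, topic) key are all assumptions made for illustration, and the ~5 second timer that would drive the periodic report is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 16   /* hypothetical fixed table size */

/* One entry per distinct (file, topic) help message seen so far. */
typedef struct {
    char file[64];
    char topic[64];
    int  new_copies;    /* duplicates received since the last flush */
} help_entry_t;

static help_entry_t table[MAX_TOPICS];
static int num_entries = 0;

/* Record a help message.  Returns 1 for the first instance (display it
 * now), 0 for a duplicate (only counted), -1 if the table is full. */
static int help_seen(const char *file, const char *topic)
{
    int i;
    assert(NULL != file && NULL != topic);
    for (i = 0; i < num_entries; i++) {
        if (0 == strcmp(table[i].file, file) &&
            0 == strcmp(table[i].topic, topic)) {
            table[i].new_copies++;
            return 0;
        }
    }
    if (MAX_TOPICS == num_entries) {
        return -1;
    }
    snprintf(table[num_entries].file, sizeof(table[num_entries].file),
             "%s", file);
    snprintf(table[num_entries].topic, sizeof(table[num_entries].topic),
             "%s", topic);
    table[num_entries].new_copies = 0;
    num_entries++;
    return 1;
}

/* Meant to be called periodically: report and reset duplicate counts.
 * Returns the total number of duplicates reported. */
static int help_flush(FILE *out)
{
    int i, total = 0;
    for (i = 0; i < num_entries; i++) {
        if (table[i].new_copies > 0) {
            fprintf(out, "%d more copies of help message %s / %s\n",
                    table[i].new_copies, table[i].file, table[i].topic);
            total += table[i].new_copies;
            table[i].new_copies = 0;
        }
    }
    return total;
}
```

In this sketch the caller displays a message only when help_seen() returns 1, and a timer would call help_flush() to emit the "X new copies" summary.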
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves much like opal_output'ing to
stream 0 did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
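As a rough illustration of the idea (a single optional filter that tags each message by its class), here is a minimal standalone sketch. It does not use the real filter framework: msg_type_t, filter_fn_t, xml_filter(), filter_message() and the tag names are all hypothetical, and no XML escaping is attempted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical message classes, based on the kinds of text listed
 * above; the real framework's types may differ. */
typedef enum {
    MSG_HELP, MSG_INFRASTRUCTURE, MSG_USER_STDOUT, MSG_USER_STDERR
} msg_type_t;

/* A filter rewrites one message into `out` (at most `len` bytes, always
 * NUL-terminated) and returns the would-be length, like snprintf(). */
typedef int (*filter_fn_t)(msg_type_t type, const char *msg,
                           char *out, size_t len);

/* Hypothetical tag names; not the output format of the real component. */
static const char *tag_for(msg_type_t type)
{
    switch (type) {
    case MSG_HELP:           return "help";
    case MSG_INFRASTRUCTURE: return "infrastructure";
    case MSG_USER_STDOUT:    return "stdout";
    case MSG_USER_STDERR:    return "stderr";
    }
    return "unknown";
}

/* XML-style filter: wrap the text in a tag naming its class. */
static int xml_filter(msg_type_t type, const char *msg,
                      char *out, size_t len)
{
    const char *tag = tag_for(type);
    assert(NULL != msg && NULL != out);
    return snprintf(out, len, "<%s>%s</%s>", tag, msg, tag);
}

/* At most one filter is active; NULL means filtering is off (default). */
static filter_fn_t active_filter = NULL;

/* Pass a message through the active filter, or copy it unchanged. */
static int filter_message(msg_type_t type, const char *msg,
                          char *out, size_t len)
{
    assert(NULL != msg && NULL != out);
    if (NULL == active_filter) {
        return snprintf(out, len, "%s", msg);
    }
    return active_filter(type, msg, out, len);
}
```

Setting active_filter to xml_filter plays the role of requesting the component on the command line; leaving it NULL corresponds to the default of no filtering.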
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
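The recursion problem can be shown in isolation with a generic re-entrancy guard. This sketch is not the workaround used in the commit, which is not described here; job_output(), fake_send() and the counters are hypothetical names, and the guard is a plain flag that is only safe in single-threaded code.

```c
#include <assert.h>
#include <stdio.h>

static int in_output = 0;       /* non-zero while job_output() is active */
static int forwarded = 0;       /* messages handed to the send path */
static int local_fallbacks = 0; /* nested calls diverted to local stderr */

static void job_output(const char *msg);

/* Stand-in for a messaging-layer send that itself wants to log. */
static void fake_send(const char *msg)
{
    (void) msg;
    forwarded++;
    job_output("send: debugging output from the messaging layer");
}

/* Forward a message through the send path, unless we are already
 * inside job_output(), in which case print locally instead. */
static void job_output(const char *msg)
{
    assert(NULL != msg);
    if (in_output) {
        /* Nested call: do not re-enter the send path. */
        fprintf(stderr, "%s\n", msg);
        local_fallbacks++;
        return;
    }
    in_output = 1;
    fake_send(msg);
    in_output = 0;
}
```

Without the in_output check, job_output() and fake_send() would call each other until the stack is exhausted; with it, the nested message is printed locally and the outer call completes.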
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
orte_output ( - 1 , " WARNING: the openib btl progress thread code *does not yet work*. Your run is likely to hang, crash, break the kitchen sink, and/or eat your cat. You have been warned. " ) ;
2008-01-09 13:27:15 +03:00
while ( hca - > progress ) {
while ( opal_progress_threads ( ) ) {
while ( opal_progress_threads ( ) )
sched_yield ( ) ;
usleep ( 100 ) ; /* give app a chance to re-enter library */
2007-06-14 05:59:25 +04:00
}
2008-01-09 13:27:15 +03:00
if ( ibv_get_cq_event ( hca - > ib_channel , & ev_cq , & ev_ctx ) )
BTL_ERROR ( ( " Failed to get CQ event with error %s " ,
strerror ( errno ) ) ) ;
if ( ibv_req_notify_cq ( ev_cq , 0 ) ) {
BTL_ERROR ( ( " Couldn't request CQ notification with error %s " ,
strerror ( errno ) ) ) ;
2007-06-14 05:59:25 +04:00
}
2008-01-09 13:27:15 +03:00
ibv_ack_cq_events ( ev_cq , 1 ) ;
while ( poll_hca ( hca , 0 ) ) ;
2007-06-14 05:59:25 +04:00
}
2008-01-09 13:27:15 +03:00
return PTHREAD_CANCELED ;
}
# endif
2007-06-14 05:59:25 +04:00
2008-01-09 13:27:15 +03:00
static int progress_one_hca ( mca_btl_openib_hca_t * hca )
{
int i , c , count = 0 , ret ;
mca_btl_openib_recv_frag_t * frag ;
mca_btl_openib_endpoint_t * endpoint ;
uint32_t non_eager_rdma_endpoints = 0 ;
2007-06-14 05:59:25 +04:00
2008-01-09 13:27:15 +03:00
c = hca - > eager_rdma_buffers_count ;
non_eager_rdma_endpoints + = ( hca - > non_eager_rdma_endpoints + hca - > pollme ) ;
for ( i = 0 ; i < c ; i + + ) {
endpoint = hca - > eager_rdma_buffers [ i ] ;
if ( ! endpoint )
continue ;
OPAL_THREAD_LOCK ( & endpoint - > eager_rdma_local . lock ) ;
frag = MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG ( endpoint ,
endpoint - > eager_rdma_local . head ) ;
if ( MCA_BTL_OPENIB_RDMA_FRAG_LOCAL ( frag ) ) {
uint32_t size ;
mca_btl_openib_module_t * btl = endpoint - > endpoint_btl ;
opal_atomic_rmb ( ) ;
if ( endpoint - > nbo ) {
BTL_OPENIB_FOOTER_NTOH ( * frag - > ftr ) ;
2007-06-14 05:59:25 +04:00
}
2008-01-09 13:27:15 +03:00
size = MCA_BTL_OPENIB_RDMA_FRAG_GET_SIZE ( frag - > ftr ) ;
# if OMPI_ENABLE_DEBUG
if ( frag - > ftr - > seq ! = endpoint - > eager_rdma_local . seq )
BTL_ERROR ( ( " Eager RDMA wrong SEQ: received %d expected %d " ,
frag - > ftr - > seq ,
endpoint - > eager_rdma_local . seq ) ) ;
endpoint - > eager_rdma_local . seq + + ;
# endif
MCA_BTL_OPENIB_RDMA_NEXT_INDEX ( endpoint - > eager_rdma_local . head ) ;
OPAL_THREAD_UNLOCK ( & endpoint - > eager_rdma_local . lock ) ;
frag - > hdr = ( mca_btl_openib_header_t * ) ( ( ( char * ) frag - > ftr ) -
size + sizeof ( mca_btl_openib_footer_t ) ) ;
to_base_frag ( frag ) - > segment . seg_addr . pval =
( ( unsigned char * ) frag - > hdr ) + sizeof ( mca_btl_openib_header_t ) ;
ret = btl_openib_handle_incoming ( btl , to_com_frag ( frag ) - > endpoint ,
frag , size - sizeof ( mca_btl_openib_footer_t ) ) ;
if ( ret ! = OMPI_SUCCESS ) {
btl - > error_cb ( & btl - > super , MCA_BTL_ERROR_FLAGS_FATAL ) ;
return 0 ;
2007-06-14 05:59:25 +04:00
}
2008-01-09 13:27:15 +03:00
count + + ;
} else
OPAL_THREAD_UNLOCK ( & endpoint - > eager_rdma_local . lock ) ;
2007-06-14 05:59:25 +04:00
}
2008-01-09 13:27:15 +03:00
hca - > eager_rdma_polls - - ;
2007-06-14 05:59:25 +04:00
2008-01-09 13:27:15 +03:00
if ( 0 = = count | | non_eager_rdma_endpoints ! = 0 | | ! hca - > eager_rdma_polls ) {
count + = poll_hca ( hca , count ) ;
hca - > eager_rdma_polls = mca_btl_openib_component . eager_rdma_poll_ratio ;
}
return count ;
}
/*
* IB component progress .
*/
static int btl_openib_component_progress ( void )
{
int i ;
int count = 0 ;
# if OMPI_HAVE_THREADS
if ( OPAL_UNLIKELY ( mca_btl_openib_component . use_async_event_thread & &
mca_btl_openib_component . fatal_counter ) ) {
goto error ;
}
# endif
for ( i = 0 ; i < mca_btl_openib_component . hcas_count ; i + + ) {
mca_btl_openib_hca_t * hca =
opal_pointer_array_get_item ( & mca_btl_openib_component . hcas , i ) ;
count + = progress_one_hca ( hca ) ;
}
return count ;
# if OMPI_HAVE_THREADS
error :
/* Set the fatal counter to zero */
mca_btl_openib_component . fatal_counter = 0 ;
/* Let's find all fatal events */
for ( i = 0 ; i < mca_btl_openib_component . ib_num_btls ; i + + ) {
mca_btl_openib_module_t * openib_btl =
mca_btl_openib_component . openib_btls [ i ] ;
if ( openib_btl - > hca - > got_fatal_event ) {
openib_btl - > error_cb ( & openib_btl - > super , MCA_BTL_ERROR_FLAGS_FATAL ) ;
}
}
return count ;
# endif
2007-06-14 05:59:25 +04:00
}
2007-11-28 17:52:31 +03:00
int mca_btl_openib_post_srr ( mca_btl_openib_module_t * openib_btl , const int qp )
{
2007-11-28 17:57:15 +03:00
int rd_low = mca_btl_openib_component . qp_infos [ qp ] . rd_low ;
int rd_num = mca_btl_openib_component . qp_infos [ qp ] . rd_num ;
int num_post , i , rc ;
struct ibv_recv_wr * bad_wr , * wr_list = NULL , * wr = NULL ;
2007-11-28 17:52:31 +03:00
assert ( ! BTL_OPENIB_QP_TYPE_PP ( qp ) ) ;
OPAL_THREAD_LOCK ( & openib_btl - > ib_lock ) ;
2007-11-28 17:57:15 +03:00
if ( openib_btl - > qps [ qp ] . u . srq_qp . rd_posted > rd_low ) {
OPAL_THREAD_UNLOCK ( & openib_btl - > ib_lock ) ;
return OMPI_SUCCESS ;
}
num_post = rd_num - openib_btl - > qps [ qp ] . u . srq_qp . rd_posted ;
for ( i = 0 ; i < num_post ; i + + ) {
ompi_free_list_item_t * item ;
2008-01-09 13:26:21 +03:00
OMPI_FREE_LIST_WAIT ( & openib_btl - > hca - > qps [ qp ] . recv_free , item , rc ) ;
2007-11-28 17:57:15 +03:00
to_base_frag ( item ) - > base . order = qp ;
to_com_frag ( item ) - > endpoint = NULL ;
if ( NULL = = wr )
wr = wr_list = & to_recv_frag ( item ) - > rd_desc ;
else
wr = wr - > next = & to_recv_frag ( item ) - > rd_desc ;
}
wr - > next = NULL ;
rc = ibv_post_srq_recv ( openib_btl - > qps [ qp ] . u . srq_qp . srq , wr_list , & bad_wr ) ;
if ( OPAL_LIKELY ( 0 = = rc ) ) {
2007-11-28 17:52:31 +03:00
OPAL_THREAD_ADD32 ( & openib_btl - > qps [ qp ] . u . srq_qp . rd_posted , num_post ) ;
2007-11-28 17:57:15 +03:00
OPAL_THREAD_UNLOCK ( & openib_btl - > ib_lock ) ;
return OMPI_SUCCESS ;
2007-11-28 17:52:31 +03:00
}
2007-11-28 17:57:15 +03:00
for ( i = 0 ; wr_list & & wr_list ! = bad_wr ; i + + , wr_list = wr_list - > next ) ;
BTL_ERROR ( ( " error posting receive descriptors to shared receive "
" queue %d (%d from %d) " , qp , i , num_post ) ) ;
OPAL_THREAD_UNLOCK ( & openib_btl - > ib_lock ) ;
return OMPI_ERROR ;
2007-11-28 17:52:31 +03:00
}