2012-07-18 17:29:37 +00:00
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
2005-06-30 21:28:35 +00:00
/*
2007-03-16 23:11:45 +00:00
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
2005-11-05 19:57:48 +00:00
* University Research and Technology
* Corporation. All rights reserved.
2013-07-04 08:34:37 +00:00
* Copyright (c) 2004-2013 The University of Tennessee and The University
2005-11-05 19:57:48 +00:00
* of Tennessee Research Foundation. All rights
* reserved.
2008-01-21 12:11:18 +00:00
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
2005-06-30 21:28:35 +00:00
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
2016-03-25 14:58:39 -07:00
* Copyright (c) 2006-2016 Cisco Systems, Inc. All rights reserved.
2015-12-30 00:12:19 +06:00
* Copyright (c) 2006-2015 Mellanox Technologies. All rights reserved.
2015-02-19 13:41:41 -07:00
* Copyright (c) 2006-2015 Los Alamos National Security, LLC. All rights
2008-01-21 12:11:18 +00:00
* reserved.
2007-09-24 10:11:52 +00:00
* Copyright (c) 2006-2007 Voltaire All rights reserved.
2012-03-01 17:29:40 +00:00
* Copyright (c) 2009-2012 Oracle and/or its affiliates. All rights reserved.
2015-07-09 12:51:55 -04:00
* Copyright (c) 2011-2015 NVIDIA Corporation. All rights reserved.
2012-07-02 15:20:12 +00:00
* Copyright (c) 2012 Oak Ridge National Laboratory. All rights reserved
2015-06-18 09:53:20 -07:00
* Copyright (c) 2013-2015 Intel, Inc. All rights reserved
2016-02-18 10:33:14 +09:00
* Copyright (c) 2014-2016 Research Organization for Information Science
2014-07-02 07:56:40 +00:00
* and Technology (RIST). All rights reserved.
2014-12-09 18:43:15 +09:00
* Copyright (c) 2014 Bull SAS. All rights reserved.
2005-06-30 21:28:35 +00:00
* $COPYRIGHT$
2008-01-21 12:11:18 +00:00
*
2005-06-30 21:28:35 +00:00
* Additional copyrights may follow
2008-01-21 12:11:18 +00:00
*
2005-06-30 21:28:35 +00:00
* $HEADER$
*/
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
#include "opal_config.h"
2007-08-06 23:40:35 +00:00
2008-01-21 12:11:18 +00:00
#include <infiniband/verbs.h>
#include <errno.h>
2008-10-29 21:45:06 +00:00
#include <string.h>
2010-07-12 16:17:56 +00:00
#ifdef HAVE_UNISTD_H
2008-02-14 14:10:27 +00:00
#include <unistd.h>
2010-07-12 16:17:56 +00:00
#endif
2008-05-29 14:19:51 +00:00
#include <sys/types.h>
#include <sys/stat.h>
2014-12-23 22:45:56 +02:00
#include <fcntl.h>
2011-01-06 18:16:36 +00:00
#include <stdlib.h>
2009-06-23 14:08:04 +00:00
#include <stddef.h>
2007-08-06 23:40:35 +00:00
2016-02-18 10:33:14 +09:00
#include "opal/mca/memory/memory.h"
Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.
Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.
Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.
I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:
1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)
2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.
There is also the issue of the new libevent "no recursion" behavior. As I described in a recent email, we will have to discuss this and figure out what, if anything, we need to do.
This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
#include "opal/mca/event/event.h"
2009-03-12 19:21:11 +00:00
#include "opal/align.h"
2009-02-14 02:26:12 +00:00
#include "opal/util/output.h"
2005-07-04 00:13:44 +00:00
#include "opal/util/argv.h"
2011-09-19 19:52:58 +00:00
#include "opal/mca/timer/base/base.h"
2006-09-19 13:27:05 +00:00
#include "opal/sys/atomic.h"
2014-12-23 22:45:56 +02:00
#include "opal/util/sys_limits.h"
2007-06-14 01:59:25 +00:00
#include "opal/util/argv.h"
2008-07-24 22:51:26 +00:00
#include "opal/memoryhooks/memory.h"
2012-05-09 20:18:31 +00:00
/* Define this before including hwloc.h so that we also get the hwloc
verbs helper header file, too. We have to do this level of
indirection because the hwloc subsystem is a component -- we don't
know its exact path. We have to rely on the framework header files
to find the right hwloc verbs helper file for us. */
#define OPAL_HWLOC_WANT_VERBS_HELPER 1
Per RFC, bring in the following changes:
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
2012-05-07 14:52:54 +00:00
#include "opal/mca/hwloc/hwloc.h"
#include "opal/mca/hwloc/base/base.h"
2008-05-20 21:53:42 +00:00
#include "opal/mca/installdirs/installdirs.h"
2008-06-04 14:39:37 +00:00
#include "opal_stdint.h"
2013-02-13 16:31:59 +00:00
#include "opal/util/show_help.h"
2014-07-26 00:47:28 +00:00
#include "opal/mca/btl/btl.h"
#include "opal/mca/btl/base/base.h"
#include "opal/mca/mpool/base/base.h"
2015-11-02 12:07:08 -07:00
#include "opal/mca/rcache/rcache.h"
#include "opal/mca/rcache/base/base.h"
2014-07-26 00:47:28 +00:00
#include "opal/mca/common/cuda/common_cuda.h"
#include "opal/mca/common/verbs/common_verbs.h"
#include "opal/runtime/opal_params.h"
#include "opal/runtime/opal.h"
Per the PMIx RFC:
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “pmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
#include "opal/mca/pmix/pmix.h"
2014-10-03 14:19:48 -07:00
#include "opal/util/proc.h"
2007-08-06 23:40:35 +00:00
2005-06-30 21:28:35 +00:00
#include "btl_openib.h"
#include "btl_openib_frag.h"
2008-01-21 12:11:18 +00:00
#include "btl_openib_endpoint.h"
2006-03-26 08:30:50 +00:00
#include "btl_openib_eager_rdma.h"
2006-06-05 20:02:41 +00:00
#include "btl_openib_proc.h"
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
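The MTU negotiation described above ("exchange maximum MTU; select the lesser of the two") can be sketched as follows. This is only an illustration, not the BTL's code: `mtu_t`, `negotiate_mtu`, and `mtu_bytes` are hypothetical names, and the enumeration is a local stand-in whose values are assumed to mirror `enum ibv_mtu` from the verbs header.

```c
#include <assert.h>
#include <stdio.h>

/* Local stand-in for the verbs MTU enumeration (assumption: the numeric
   values mirror enum ibv_mtu, where 1 means 256 bytes and 5 means 4096). */
typedef enum {
    MTU_256 = 1, MTU_512 = 2, MTU_1024 = 3, MTU_2048 = 4, MTU_4096 = 5
} mtu_t;

/* Each side sends its maximum MTU; both sides then use the smaller one. */
static mtu_t negotiate_mtu(mtu_t local_max, mtu_t remote_max)
{
    return (local_max < remote_max) ? local_max : remote_max;
}

/* Convert the enumeration value into a size in bytes. */
static int mtu_bytes(mtu_t m)
{
    assert(m >= MTU_256 && m <= MTU_4096);
    return 128 << (int) m;
}
```

Because both peers apply the same rule to the same pair of values, they arrive at the same MTU without a further round trip.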
2006-08-14 19:30:37 +00:00
#include "btl_openib_ini.h"
#include "btl_openib_mca.h"
2007-11-28 07:18:59 +00:00
#include "btl_openib_xrc.h"
2011-01-19 20:58:22 +00:00
#if BTL_OPENIB_FAILOVER_ENABLED
2010-07-14 10:08:19 +00:00
#include "btl_openib_failover.h"
#endif
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
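
As a rough illustration of the syntax described above, the following sketch parses a btl_openib_receive_queues string into an array of QP descriptions. It is not the parser used by the BTL: `qp_desc_t` and `parse_receive_queues` are hypothetical names, only the "P" and "S" forms described in this message are handled, and treating equal buffer sizes as acceptable is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one QP description (not the BTL's real type). */
typedef struct {
    char type;           /* 'P' = per-peer, 'S' = shared receive queue */
    long size;           /* receive buffer size in bytes */
    long num_buffers;    /* number of receive buffers to allocate */
    long low_watermark;  /* repost receive buffers at this level */
    long max_sends;      /* outstanding sends allowed; 'S' only, else 0 */
} qp_desc_t;

/* Parse spec into out[]. Returns the number of QPs parsed, or -1 if the
   spec is malformed, has too many QPs, or sizes are not ascending. */
static int parse_receive_queues(const char *spec, qp_desc_t *out, int max_qps)
{
    assert(NULL != spec && NULL != out);
    int count = 0;
    const char *p = spec;

    while ('\0' != *p) {
        qp_desc_t qp = { 0 };
        int consumed = 0;

        if (count == max_qps) {
            return -1;
        }
        qp.type = *p;
        if ('P' == *p) {
            /* P,<size>,<num buffers>,<low watermark> */
            if (3 != sscanf(p, "P,%ld,%ld,%ld%n", &qp.size,
                            &qp.num_buffers, &qp.low_watermark, &consumed)) {
                return -1;
            }
        } else if ('S' == *p) {
            /* S,<size>,<num buffers>,<low watermark>,<max sends> */
            if (4 != sscanf(p, "S,%ld,%ld,%ld,%ld%n", &qp.size,
                            &qp.num_buffers, &qp.low_watermark,
                            &qp.max_sends, &consumed)) {
                return -1;
            }
        } else {
            return -1;
        }
        /* QPs must appear in ascending receive buffer size order. */
        if (count > 0 && qp.size < out[count - 1].size) {
            return -1;
        }
        out[count++] = qp;

        p += consumed;
        if (';' == *p) {
            ++p;            /* move on to the next QP description */
        } else if ('\0' != *p) {
            return -1;      /* unexpected trailing field or character */
        }
    }
    return count;
}
```

Called on the example string above with room for at least four entries, the function should return 4, with the first entry being the per-peer QP (128-byte buffers, 16 buffers, low watermark 4).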
This commit was SVN r15474.
2007-07-18 01:15:59 +00:00
#include "btl_openib_async.h"
2007-08-06 23:40:35 +00:00
#include "connect/base.h"
2010-02-25 21:04:09 +00:00
#include "btl_openib_ip.h"
2005-07-12 13:38:54 +00:00
2014-07-02 07:56:40 +00:00
#define EPS 1.e-6
2006-08-14 19:30:37 +00:00
/*
* Local functions
*/
2011-08-10 17:24:36 +00:00
static int btl_openib_component_register(void);
2006-08-14 19:30:37 +00:00
static int btl_openib_component_open(void);
static int btl_openib_component_close(void);
2008-01-09 10:27:15 +00:00
static mca_btl_base_module_t **btl_openib_component_init(int *, bool, bool);
2006-08-14 19:30:37 +00:00
static int btl_openib_component_progress(void);
2013-11-01 12:19:40 +00:00
#if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
2013-05-10 21:29:08 +00:00
static void btl_openib_handle_incoming_completion(mca_btl_base_module_t *btl,
                                                  mca_btl_openib_endpoint_t *ep,
                                                  mca_btl_base_descriptor_t *des,
                                                  int status);
2013-11-01 12:19:40 +00:00
#endif /* OPAL_CUDA_SUPPORT */
2008-05-20 21:53:42 +00:00
/*
* Local variables
*/
2008-07-23 00:28:59 +00:00
static mca_btl_openib_device_t *receive_queues_device = NULL;
2013-06-05 12:12:09 +00:00
static int num_devices_intentionally_ignored = 0;
2005-06-30 21:28:35 +00:00
mca_btl_openib_component_t mca_btl_openib_component = {
2014-07-10 16:31:15 +00:00
    .super = {
2005-06-30 21:28:35 +00:00
        /* First, the mca_base_component_t struct containing meta information
           about the component itself */
2014-07-10 16:31:15 +00:00
        .btl_version = {
            MCA_BTL_DEFAULT_VERSION("openib"),
            .mca_open_component = btl_openib_component_open,
            .mca_close_component = btl_openib_component_close,
            .mca_register_component_params = btl_openib_component_register,
2005-06-30 21:28:35 +00:00
        },
2014-07-10 16:31:15 +00:00
        .btl_data = {
2008-10-16 15:09:00 +00:00
            /* The component is checkpoint ready */
2014-07-10 16:31:15 +00:00
            .param_field = MCA_BASE_METADATA_PARAM_CHECKPOINT
2005-06-30 21:28:35 +00:00
        },
2014-07-10 16:31:15 +00:00
        .btl_init = btl_openib_component_init,
        .btl_progress = btl_openib_component_progress,
2005-06-30 21:28:35 +00:00
    }
};
2011-08-10 17:24:36 +00:00
static int btl_openib_component_register(void)
2005-06-30 21:28:35 +00:00
{
2006-08-14 19:30:37 +00:00
    int ret;
2006-06-09 18:02:45 +00:00
2011-08-10 17:24:36 +00:00
    /* register IB component parameters */
2014-07-26 00:47:28 +00:00
    if (OPAL_SUCCESS != (ret = btl_openib_register_mca_params())) {
2013-06-05 11:53:18 +00:00
        return ret;
    }
2011-08-10 17:24:36 +00:00
    mca_btl_openib_component.max_send_size =
        mca_btl_openib_module.super.btl_max_send_size;
    mca_btl_openib_component.eager_limit =
        mca_btl_openib_module.super.btl_eager_limit;
    /* if_include and if_exclude need to be mutually exclusive */
2015-06-23 20:59:57 -07:00
    if (OPAL_SUCCESS !=
2013-03-27 21:09:41 +00:00
        mca_base_var_check_exclusive("ompi",
2011-08-10 17:24:36 +00:00
            mca_btl_openib_component.super.btl_version.mca_type_name,
            mca_btl_openib_component.super.btl_version.mca_component_name,
            "if_include",
            mca_btl_openib_component.super.btl_version.mca_type_name,
            mca_btl_openib_component.super.btl_version.mca_component_name,
            "if_exclude")) {
        /* Return ERR_NOT_AVAILABLE so that a warning message about
           "open" failing is not printed */
2014-07-26 00:47:28 +00:00
        return OPAL_ERR_NOT_AVAILABLE;
2011-08-10 17:24:36 +00:00
    }
2015-07-09 12:51:55 -04:00
#if OPAL_CUDA_SUPPORT
    mca_common_cuda_register_mca_variables();
#endif
2014-07-26 00:47:28 +00:00
    return OPAL_SUCCESS;
2011-08-10 17:24:36 +00:00
}
/*
 * Called by MCA framework to open the component
 */
static int btl_openib_component_open(void)
{
2010-02-18 09:48:16 +00:00
    opal_mutex_t *lock = &mca_btl_openib_component.srq_manager.lock;
    opal_hash_table_t *srq_addr_table = &mca_btl_openib_component.srq_manager.srq_addr_table;
    /* Construct hash table that stores pointers to SRQs */
    OBJ_CONSTRUCT(lock, opal_mutex_t);
    OBJ_CONSTRUCT(srq_addr_table, opal_hash_table_t);
2012-05-08 20:42:09 +00:00
2005-06-30 21:28:35 +00:00
    /* initialize state */
2006-08-14 19:30:37 +00:00
    mca_btl_openib_component.ib_num_btls = 0;
2016-02-15 14:36:59 +09:00
    mca_btl_openib_component.num_default_gid_btls = 0;
2006-08-14 19:30:37 +00:00
    mca_btl_openib_component.openib_btls = NULL;
2008-07-23 00:28:59 +00:00
    OBJ_CONSTRUCT(&mca_btl_openib_component.devices, opal_pointer_array_t);
    mca_btl_openib_component.devices_count = 0;
2008-07-10 15:08:49 +00:00
    mca_btl_openib_component.cpc_explicitly_defined = false;
2007-08-20 12:28:25 +00:00
2008-01-21 12:11:18 +00:00
    /* initialize objects */
2005-07-03 16:22:16 +00:00
    OBJ_CONSTRUCT(&mca_btl_openib_component.ib_procs, opal_list_t);
2013-12-23 17:47:43 +00:00
    mca_btl_openib_component.memory_registration_verbose = -1;
2005-06-30 21:28:35 +00:00
2014-08-12 19:41:46 +00:00
#if OPAL_CUDA_SUPPORT
    mca_common_cuda_stage_one_init();
#endif /* OPAL_CUDA_SUPPORT */
2014-07-26 00:47:28 +00:00
    return OPAL_SUCCESS;
2005-06-30 21:28:35 +00:00
}
/*
 * component cleanup - sanity checking of queue lengths
 */
2006-08-14 19:30:37 +00:00
static int btl_openib_component_close(void)
2005-06-30 21:28:35 +00:00
{
2014-07-26 00:47:28 +00:00
    int rc = OPAL_SUCCESS;
2008-08-07 13:48:38 +00:00
2015-10-06 12:47:09 -06:00
    /* remove the async event from the event base */
    mca_btl_openib_async_fini();
2010-02-18 09:48:16 +00:00
    OBJ_DESTRUCT(&mca_btl_openib_component.srq_manager.lock);
    OBJ_DESTRUCT(&mca_btl_openib_component.srq_manager.srq_addr_table);
2008-08-07 13:48:38 +00:00
2014-07-26 00:47:28 +00:00
    opal_btl_openib_connect_base_finalize();
    opal_btl_openib_ini_finalize();
2008-08-07 13:48:38 +00:00
2009-12-15 14:25:07 +00:00
    if (NULL != mca_btl_openib_component.default_recv_qps) {
        free(mca_btl_openib_component.default_recv_qps);
    }
2015-06-23 20:59:57 -07:00
2013-12-23 17:47:43 +00:00
    /* close memory registration debugging output */
    opal_output_close(mca_btl_openib_component.memory_registration_verbose);
2014-08-12 19:41:46 +00:00
#if OPAL_CUDA_SUPPORT
    mca_common_cuda_fini();
#endif /* OPAL_CUDA_SUPPORT */
2008-08-07 13:48:38 +00:00
    return rc;
2005-06-30 21:28:35 +00:00
}
2008-05-02 11:52:33 +00:00
static inline void pack8(char **dest, uint8_t value)
{
    /* Copy one character */
    **dest = (char) value;
    /* Move the dest ahead one */
    ++*dest;
}
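The following is a minimal, standalone sketch of how a one-byte packer like pack8 can be paired with an inverse reader to round-trip a single per-CPC record (index, priority, blob length, blob), following the layout described in the comment inside btl_openib_modex_send. It is an illustration under stated assumptions, not the receive path this component uses: every name prefixed with sketch_ is hypothetical and does not exist in Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one CPC's modex entry; not an Open MPI type. */
typedef struct {
    uint8_t index;      /* index of the CPC */
    uint8_t priority;   /* priority of the CPC */
    uint8_t blob_len;   /* length of the opaque blob that follows */
    const char *blob;   /* opaque bytes, meaningful only to that CPC */
} sketch_cpc_t;

/* Write one byte and advance the write cursor (same idea as pack8). */
static inline void sketch_pack8(char **dest, uint8_t value)
{
    **dest = (char) value;
    ++*dest;
}

/* Read one byte and advance the read cursor (assumed inverse of pack8). */
static inline uint8_t sketch_unpack8(const char **src)
{
    uint8_t value = (uint8_t) **src;
    ++*src;
    return value;
}

/* Bytes needed for one CPC record: index + priority + length + blob. */
static size_t sketch_cpc_size(const sketch_cpc_t *cpc)
{
    return 3 + (size_t) cpc->blob_len;
}

/* Pack one CPC record at *dest and advance the cursor past it. */
static void sketch_pack_cpc(char **dest, const sketch_cpc_t *cpc)
{
    sketch_pack8(dest, cpc->index);
    sketch_pack8(dest, cpc->priority);
    sketch_pack8(dest, cpc->blob_len);
    if (cpc->blob_len > 0) {
        memcpy(*dest, cpc->blob, cpc->blob_len);
        *dest += cpc->blob_len;
    }
}

/* Unpack one CPC record; out->blob points into the source buffer. */
static void sketch_unpack_cpc(const char **src, sketch_cpc_t *out)
{
    out->index = sketch_unpack8(src);
    out->priority = sketch_unpack8(src);
    out->blob_len = sketch_unpack8(src);
    out->blob = *src;
    *src += out->blob_len;
}

/* Pack a record, unpack it again, and return 1 if both sides agree. */
static int sketch_round_trip(void)
{
    const sketch_cpc_t in = { 2, 50, 4, "abcd" };
    char buf[16];
    char *w = buf;
    const char *r = buf;
    sketch_cpc_t out;

    /* The buffer must be sized before packing, as the sender does. */
    assert(sketch_cpc_size(&in) <= sizeof(buf));
    sketch_pack_cpc(&w, &in);
    sketch_unpack_cpc(&r, &out);

    return (size_t) (w - buf) == sketch_cpc_size(&in)
        && r == w
        && out.index == in.index
        && out.priority == in.priority
        && out.blob_len == in.blob_len
        && 0 == memcmp(out.blob, in.blob, in.blob_len);
}
```

Passing the cursor by address keeps the running offset in one place, so the size computed up front can be compared against the distance the cursor actually travelled.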
2005-09-30 22:58:09 +00:00
/*
2008-05-02 11:52:33 +00:00
 * Register local openib port information with the modex so that it
 * can be shared with all other peers.
2005-09-30 22:58:09 +00:00
 */
2006-08-14 19:30:37 +00:00
static int btl_openib_modex_send(void)
2005-09-30 22:58:09 +00:00
{
2008-05-02 11:52:33 +00:00
    int rc, i, j;
    int modex_message_size;
2008-01-14 23:22:03 +00:00
    char *message, *offset;
2008-05-02 11:52:33 +00:00
    size_t size, msg_size;
2014-07-26 00:47:28 +00:00
    opal_btl_openib_connect_base_module_t *cpc;
2008-01-14 23:22:03 +00:00
2008-06-09 14:53:58 +00:00
    opal_output(-1, "Starting to modex send");
2008-05-02 11:52:33 +00:00
    if (0 == mca_btl_openib_component.ib_num_btls) {
2008-01-14 23:22:03 +00:00
        return 0;
    }
2009-06-23 14:08:04 +00:00
    modex_message_size = offsetof(mca_btl_openib_modex_message_t, end);
2008-05-02 11:52:33 +00:00
    /* The message is packed into multiple parts:
     * 1. a uint8_t indicating the number of modules (ports) in the message
     * 2. for each module:
     *    a. the common module data
     *    b. a uint8_t indicating how many CPCs follow
     *    c. for each CPC:
     *       a. a uint8_t indicating the index of the CPC in the all[]
     *          array in btl_openib_connect_base.c
     *       b. a uint8_t indicating the priority of this CPC
     *       c. a uint8_t indicating the length of the blob to follow
     *       d. a blob that is only meaningful to that CPC
     */
2011-07-04 14:00:41 +00:00
    msg_size =
2008-05-02 11:52:33 +00:00
        /* uint8_t for number of modules in the message */
        1 +
        /* For each module: */
2011-07-04 14:00:41 +00:00
        mca_btl_openib_component.ib_num_btls *
2008-05-02 11:52:33 +00:00
        (
         /* Common module data */
2011-07-04 14:00:41 +00:00
         modex_message_size +
2008-05-02 11:52:33 +00:00
         /* uint8_t for how many CPCs follow */
         1
         );
    /* For each module, add in the size of the per-CPC data */
    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
2011-07-04 14:00:41 +00:00
        for (j = 0;
2008-05-02 11:52:33 +00:00
             j < mca_btl_openib_component.openib_btls[i]->num_cpcs;
             ++j) {
2011-07-04 14:00:41 +00:00
            msg_size +=
2008-05-02 11:52:33 +00:00
                /* uint8_t for the index of the CPC */
                1 +
                /* uint8_t for the CPC's priority */
2011-07-04 14:00:41 +00:00
                1 +
2008-05-02 11:52:33 +00:00
                /* uint8_t for the blob length */
                1 +
                /* blob length */
                mca_btl_openib_component.openib_btls[i]->cpcs[j]->data.cbm_modex_message_len;
        }
    }
2010-07-12 16:17:56 +00:00
    message = (char *) malloc(msg_size);
2008-01-14 23:22:03 +00:00
    if (NULL == message) {
2008-05-02 11:52:33 +00:00
        BTL_ERROR(("Failed malloc"));
2014-07-26 00:47:28 +00:00
        return OPAL_ERR_OUT_OF_RESOURCE;
2008-01-21 12:11:18 +00:00
    }
2008-01-14 23:22:03 +00:00
2008-05-02 11:52:33 +00:00
    /* Pack the number of modules */
    offset = message;
    pack8(&offset, mca_btl_openib_component.ib_num_btls);
2008-06-09 14:53:58 +00:00
    opal_output(-1, "modex sending %d btls (packed: %d, offset now at %d)", mca_btl_openib_component.ib_num_btls, *((uint8_t *) message), (int) (offset - message));
2008-01-14 23:22:03 +00:00
2008-05-02 11:52:33 +00:00
    /* Pack each of the modules */
2008-01-14 23:22:03 +00:00
    for (i = 0; i < mca_btl_openib_component.ib_num_btls; i++) {
2008-05-02 11:52:33 +00:00
        /* Pack the modex common message struct. */
        size = modex_message_size;
2009-12-15 14:25:07 +00:00
        (mca_btl_openib_component.openib_btls[i]->port_info).vendor_id =
            (mca_btl_openib_component.openib_btls[i]->device->ib_dev_attr).vendor_id;
        (mca_btl_openib_component.openib_btls[i]->port_info).vendor_part_id =
            (mca_btl_openib_component.openib_btls[i]->device->ib_dev_attr).vendor_part_id;
        (mca_btl_openib_component.openib_btls[i]->port_info).transport_type =
            mca_btl_openib_get_transport_type(mca_btl_openib_component.openib_btls[i]);
2011-07-04 14:00:41 +00:00
        memcpy(offset,
               &(mca_btl_openib_component.openib_btls[i]->port_info),
2008-05-02 11:52:33 +00:00
               size);
2008-06-09 14:53:58 +00:00
        opal_output(-1, "modex packed btl port modex message: 0x%" PRIx64 ", %d, %d (size: %d)",
2008-05-02 11:52:33 +00:00
                    mca_btl_openib_component.openib_btls[i]->port_info.subnet_id,
                    mca_btl_openib_component.openib_btls[i]->port_info.mtu,
                    mca_btl_openib_component.openib_btls[i]->port_info.lid,
                    (int) size);
2011-07-04 14:00:41 +00:00
2009-05-06 20:11:28 +00:00
#if !defined(WORDS_BIGENDIAN) && OPAL_ENABLE_HETEROGENEOUS_SUPPORT
2008-05-02 11:52:33 +00:00
        MCA_BTL_OPENIB_MODEX_MSG_HTON(*(mca_btl_openib_modex_message_t *) offset);
2008-01-14 23:22:03 +00:00
#endif
2008-05-02 11:52:33 +00:00
        offset += size;
2008-06-09 14:53:58 +00:00
        opal_output(-1, "modex packed btl %d: modex message, offset now %d",
2008-05-02 11:52:33 +00:00
                    i, (int) (offset - message));
        /* Pack the number of CPCs that follow */
2011-07-04 14:00:41 +00:00
        pack8(&offset,
2008-05-02 11:52:33 +00:00
              mca_btl_openib_component.openib_btls[i]->num_cpcs);
2008-06-09 14:53:58 +00:00
        opal_output(-1, "modex packed btl %d: to pack %d cpcs (packed: %d, offset now %d)",
2008-05-02 11:52:33 +00:00
                    i, mca_btl_openib_component.openib_btls[i]->num_cpcs,
                    *((uint8_t *) (offset - 1)), (int) (offset - message));
        /* Pack each CPC */
2011-07-04 14:00:41 +00:00
        for (j = 0;
2008-05-02 11:52:33 +00:00
             j < mca_btl_openib_component.openib_btls[i]->num_cpcs;
+ + j ) {
uint8_t u8 ;
cpc = mca_btl_openib_component . openib_btls [ i ] - > cpcs [ j ] ;
2011-07-04 14:00:41 +00:00
opal_output ( - 1 , " modex packed btl %d: packing cpc %s " ,
2008-05-02 11:52:33 +00:00
i , cpc - > data . cbm_component - > cbc_name ) ;
/* Pack the CPC index */
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with a few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
u8 = opal_btl_openib_connect_base_get_cpc_index ( cpc - > data . cbm_component ) ;
2008-05-02 11:52:33 +00:00
pack8 ( & offset , u8 ) ;
2008-06-09 14:53:58 +00:00
opal_output ( - 1 , " packing btl %d: cpc %d: index %d (packed %d, offset now %d) " ,
2008-05-02 11:52:33 +00:00
i , j , u8 , * ( ( uint8_t * ) ( offset - 1 ) ) , ( int ) ( offset - message ) ) ;
/* Pack the CPC priority */
pack8 ( & offset , cpc - > data . cbm_priority ) ;
2008-06-09 14:53:58 +00:00
opal_output ( - 1 , " packing btl %d: cpc %d: priority %d (packed %d, offset now %d) " ,
2008-05-02 11:52:33 +00:00
i , j , cpc - > data . cbm_priority , * ( ( uint8_t * ) ( offset - 1 ) ) , ( int ) ( offset - message ) ) ;
/* Pack the blob length */
u8 = cpc - > data . cbm_modex_message_len ;
pack8 ( & offset , u8 ) ;
2008-06-09 14:53:58 +00:00
opal_output ( - 1 , " packing btl %d: cpc %d: message len %d (packed %d, offset now %d) " ,
2008-05-02 11:52:33 +00:00
i , j , u8 , * ( ( uint8_t * ) ( offset - 1 ) ) , ( int ) ( offset - message ) ) ;
/* If the blob length is > 0, pack the blob */
if ( u8 > 0 ) {
memcpy ( offset , cpc - > data . cbm_modex_message , u8 ) ;
offset + = u8 ;
2008-06-09 14:53:58 +00:00
opal_output ( - 1 , " packing btl %d: cpc %d: blob packed %d %x (offset now %d) " ,
2008-05-02 11:52:33 +00:00
i , j ,
( ( uint32_t * ) cpc - > data . cbm_modex_message ) [ 0 ] ,
( ( uint32_t * ) cpc - > data . cbm_modex_message ) [ 1 ] ,
( int ) ( offset - message ) ) ;
}
2008-01-14 23:22:03 +00:00
2008-05-02 11:52:33 +00:00
/* Sanity check */
assert ( ( size_t ) ( offset - message ) < = msg_size ) ;
}
2005-09-30 22:58:09 +00:00
}
2008-01-14 23:22:03 +00:00
2008-05-02 11:52:33 +00:00
/* All done -- send it! */
2015-06-18 09:53:20 -07:00
OPAL_MODEX_SEND ( rc , OPAL_PMIX_GLOBAL ,
Per the PMIx RFC:
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “pmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
& mca_btl_openib_component . super . btl_version ,
message , msg_size ) ;
2008-01-14 23:22:03 +00:00
free ( message ) ;
2008-06-09 14:53:58 +00:00
opal_output ( - 1 , " Modex sent! %d calculated, %d actual \n " , ( int ) msg_size , ( int ) ( offset - message ) ) ;
2008-01-14 23:22:03 +00:00
2005-09-30 22:58:09 +00:00
return rc ;
}
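The packing loop above produces a flat byte stream: one byte with the module count, then for each module the port-info struct, one byte with the CPC count, and for each CPC an index byte, a priority byte, a blob-length byte, and the blob itself. The following is a minimal sketch of that layout, not the BTL's actual code. All `demo_` names and struct fields are hypothetical stand-ins, and byte-order conversion for heterogeneous runs is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the openib CPC and module types. */
typedef struct {
    uint8_t index;        /* CPC index */
    uint8_t priority;     /* CPC priority */
    uint8_t blob_len;     /* length of the opaque CPC blob */
    const uint8_t *blob;  /* may be NULL when blob_len == 0 */
} demo_cpc_t;

typedef struct {
    uint64_t subnet_id;
    uint16_t mtu;
    uint16_t lid;
} demo_port_info_t;

typedef struct {
    demo_port_info_t port_info;
    uint8_t num_cpcs;
    const demo_cpc_t *cpcs;
} demo_module_t;

/* Write one byte and advance the cursor, in the spirit of pack8(). */
static void demo_pack8(uint8_t **dest, uint8_t value)
{
    **dest = value;
    ++*dest;
}

/* Number of bytes the packed message needs. */
static size_t demo_modex_size(const demo_module_t *mods, uint8_t n)
{
    size_t size = 1;  /* module count */
    for (uint8_t i = 0; i < n; ++i) {
        size += sizeof(demo_port_info_t) + 1;  /* port info + CPC count */
        for (uint8_t j = 0; j < mods[i].num_cpcs; ++j) {
            /* index + priority + blob length + blob */
            size += 3 + mods[i].cpcs[j].blob_len;
        }
    }
    return size;
}

/* Pack into a buffer of msg_size bytes (sized with demo_modex_size());
 * returns the number of bytes written. */
static size_t demo_modex_pack(uint8_t *message, size_t msg_size,
                              const demo_module_t *mods, uint8_t n)
{
    uint8_t *offset = message;

    demo_pack8(&offset, n);
    for (uint8_t i = 0; i < n; ++i) {
        memcpy(offset, &mods[i].port_info, sizeof(demo_port_info_t));
        offset += sizeof(demo_port_info_t);
        demo_pack8(&offset, mods[i].num_cpcs);
        for (uint8_t j = 0; j < mods[i].num_cpcs; ++j) {
            const demo_cpc_t *cpc = &mods[i].cpcs[j];
            demo_pack8(&offset, cpc->index);
            demo_pack8(&offset, cpc->priority);
            demo_pack8(&offset, cpc->blob_len);
            if (cpc->blob_len > 0) {
                memcpy(offset, cpc->blob, cpc->blob_len);
                offset += cpc->blob_len;
            }
            /* Same post-hoc sanity check as the listing. */
            assert((size_t) (offset - message) <= msg_size);
        }
    }
    return (size_t) (offset - message);
}
```

Computing the size in a separate pass mirrors the "calculated" versus "actual" comparison that the final debug output reports; a receiver would walk the same fields in the same order.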
2005-11-10 20:15:02 +00:00
/*
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
2. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
This commit adds Galen's fine-grain control of queue pair resources to
the OpenIB BTL (thanks to Gleb for fixing broken code and providing
additional functionality, Pasha for finding broken code, and Jeff for
doing all the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers
to post, and when to replenish the receive queue (low water mark).
For SRQ QPs, the number of outstanding sends can also be specified.
The following is an example of the syntax to describe QPs to the
OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
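The receive-queue syntax described in this message can be illustrated with a small parser. This is a sketch under stated assumptions, not the BTL's own parsing code: the `demo_` names are hypothetical, only the "P" and "S" forms with exactly the fields described above are accepted, and "ascending" is read as "not smaller than the previous QP's buffer size".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Hypothetical record for one parsed QP description.
typedef struct {
    char type;               // 'P' (per-peer) or 'S' (shared receive queue)
    unsigned size;           // receive buffer size in bytes
    unsigned num_buffers;    // number of receive buffers to allocate
    unsigned low_watermark;  // repost receive buffers at this level
    unsigned max_sends;      // outstanding sends allowed ('S' only, else 0)
} demo_qp_spec_t;

// Parse a ";"-delimited list of QP descriptions into out[].
// Returns the number of QPs parsed, or -1 on a malformed or
// out-of-order specification.
static int demo_parse_receive_queues(const char *spec,
                                     demo_qp_spec_t *out, int max_qps)
{
    int count = 0;
    const char *p = spec;

    while (*p != '\0') {
        size_t len = strcspn(p, ";");
        char field[64];
        char tail;  // catches trailing junk after the last number
        demo_qp_spec_t qp = {0};

        if (len == 0 || len >= sizeof(field) || count >= max_qps) {
            return -1;
        }
        memcpy(field, p, len);
        field[len] = '\0';

        if (field[0] == 'P') {
            if (sscanf(field, "P,%u,%u,%u%c", &qp.size, &qp.num_buffers,
                       &qp.low_watermark, &tail) != 3) {
                return -1;
            }
        } else if (field[0] == 'S') {
            if (sscanf(field, "S,%u,%u,%u,%u%c", &qp.size, &qp.num_buffers,
                       &qp.low_watermark, &qp.max_sends, &tail) != 4) {
                return -1;
            }
        } else {
            return -1;
        }
        qp.type = field[0];

        // QPs must be listed in ascending receive buffer size order.
        if (count > 0 && qp.size < out[count - 1].size) {
            return -1;
        }
        out[count++] = qp;

        p += len;
        if (*p == ';') {
            ++p;
            if (*p == '\0') {
                return -1;  // trailing semicolon with nothing after it
            }
        }
    }
    return count;
}
```

Given the example string from this message, the sketch yields four entries: one per-peer QP with 128-byte buffers followed by three SRQ QPs with 1024-, 4096-, and 65536-byte buffers.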
2007-07-18 01:15:59 +00:00
* Active Message callback function for control messages.
2005-11-10 20:15:02 +00:00
*/
2007-11-28 07:13:34 +00:00
static void btl_openib_control ( mca_btl_base_module_t * btl ,
mca_btl_base_tag_t tag , mca_btl_base_descriptor_t * des ,
void * cbdata )
2005-11-10 20:15:02 +00:00
{
2007-11-28 07:11:14 +00:00
/* don't return credits used for control messages */
2007-12-09 14:05:13 +00:00
mca_btl_openib_module_t * obtl = ( mca_btl_openib_module_t * ) btl ;
2007-11-28 07:13:34 +00:00
mca_btl_openib_endpoint_t * ep = to_com_frag ( des ) - > endpoint ;
2007-11-28 07:11:14 +00:00
mca_btl_openib_control_header_t * ctl_hdr =
2015-03-10 14:21:42 -06:00
( mca_btl_openib_control_header_t * ) to_base_frag ( des ) - > segment . seg_addr . pval ;
2006-03-26 08:30:50 +00:00
mca_btl_openib_eager_rdma_header_t * rdma_hdr ;
2007-12-09 14:05:13 +00:00
mca_btl_openib_header_coalesced_t * clsc_hdr =
( mca_btl_openib_header_coalesced_t * ) ( ctl_hdr + 1 ) ;
2008-01-15 05:32:53 +00:00
mca_btl_active_message_callback_t * reg ;
2015-01-06 08:48:33 -07:00
size_t len = des - > des_segments - > seg_len - sizeof ( * ctl_hdr ) ;
2008-01-21 12:11:18 +00:00
2006-03-26 08:30:50 +00:00
switch ( ctl_hdr - > type ) {
2006-09-05 16:02:09 +00:00
case MCA_BTL_OPENIB_CONTROL_CREDITS :
2007-11-28 07:13:34 +00:00
assert ( 0 ) ; /* Credit message is handled elsewhere */
2007-01-12 23:14:45 +00:00
break ;
2006-03-26 08:30:50 +00:00
case MCA_BTL_OPENIB_CONTROL_RDMA :
rdma_hdr = ( mca_btl_openib_eager_rdma_header_t * ) ctl_hdr ;
2008-01-21 12:11:18 +00:00
2011-07-04 14:00:41 +00:00
BTL_VERBOSE ( ( " prior to NTOH received rkey % " PRIu32
2009-05-14 15:39:53 +00:00
" , rdma_start.lval % " PRIx64 " , pval %p, ival % " PRIu32 ,
rdma_hdr - > rkey ,
rdma_hdr - > rdma_start . lval ,
rdma_hdr - > rdma_start . pval ,
rdma_hdr - > rdma_start . ival
2007-01-12 23:14:45 +00:00
) ) ;
2008-01-21 12:11:18 +00:00
2007-11-28 07:13:34 +00:00
if ( ep - > nbo ) {
2007-11-28 07:12:44 +00:00
BTL_OPENIB_EAGER_RDMA_CONTROL_HEADER_NTOH ( * rdma_hdr ) ;
}
2008-01-21 12:11:18 +00:00
2011-07-04 14:00:41 +00:00
BTL_VERBOSE ( ( " received rkey % " PRIu32
2009-05-14 15:39:53 +00:00
" , rdma_start.lval % " PRIx64 " , pval %p, "
" ival % " PRIu32 , rdma_hdr - > rkey ,
rdma_hdr - > rdma_start . lval ,
rdma_hdr - > rdma_start . pval , rdma_hdr - > rdma_start . ival ) ) ;
2008-01-21 12:11:18 +00:00
2007-11-28 07:13:34 +00:00
if ( ep - > eager_rdma_remote . base . pval ) {
2008-01-09 21:54:11 +00:00
BTL_ERROR ( ( " Got RDMA connect twice! " ) ) ;
return ;
2006-03-26 08:30:50 +00:00
}
2007-11-28 07:13:34 +00:00
ep - > eager_rdma_remote . rkey = rdma_hdr - > rkey ;
ep - > eager_rdma_remote . base . lval = rdma_hdr - > rdma_start . lval ;
ep - > eager_rdma_remote . tokens = mca_btl_openib_component . eager_rdma_num - 1 ;
2006-03-26 08:30:50 +00:00
break ;
2007-12-09 14:05:13 +00:00
case MCA_BTL_OPENIB_CONTROL_COALESCED :
2012-03-01 17:29:40 +00:00
{
size_t pad = 0 ;
while ( len > 0 ) {
size_t skip ;
mca_btl_openib_header_coalesced_t * unalign_hdr = 0 ;
mca_btl_base_descriptor_t tmp_des ;
mca_btl_base_segment_t tmp_seg ;
assert ( len > = sizeof ( * clsc_hdr ) ) ;
if ( ep - > nbo )
BTL_OPENIB_HEADER_COALESCED_NTOH ( * clsc_hdr ) ;
skip = ( sizeof ( * clsc_hdr ) + clsc_hdr - > alloc_size - pad ) ;
2015-01-06 08:48:33 -07:00
tmp_des . des_segments = & tmp_seg ;
tmp_des . des_segment_count = 1 ;
2012-03-01 17:29:40 +00:00
tmp_seg . seg_addr . pval = clsc_hdr + 1 ;
tmp_seg . seg_len = clsc_hdr - > size ;
/* call registered callback */
reg = mca_btl_base_active_message_trigger + clsc_hdr - > tag ;
reg - > cbfunc ( & obtl - > super , clsc_hdr - > tag , & tmp_des , reg - > cbdata ) ;
len - = ( skip + pad ) ;
unalign_hdr = ( mca_btl_openib_header_coalesced_t * )
( ( unsigned char * ) clsc_hdr + skip ) ;
pad = ( size_t ) BTL_OPENIB_COALESCE_HDR_PADDING ( unalign_hdr ) ;
clsc_hdr = ( mca_btl_openib_header_coalesced_t * ) ( ( unsigned char * ) unalign_hdr +
pad ) ;
}
2007-12-09 14:05:13 +00:00
}
break ;
2008-10-06 00:46:02 +00:00
case MCA_BTL_OPENIB_CONTROL_CTS :
OPAL_OUTPUT ( ( - 1 , " received CTS from %s (buffer %p): posted recvs %d, sent cts %d " ,
2014-10-03 14:19:48 -07:00
opal_get_proc_hostname ( ep - > endpoint_proc - > proc_opal ) ,
2008-10-06 00:46:02 +00:00
( void * ) ctl_hdr ,
ep - > endpoint_posted_recvs , ep - > endpoint_cts_sent ) ) ;
ep - > endpoint_cts_received = true ;
/* Only send the CTS back and mark connected if:
- we have posted our receives (it's possible that we can
get this CTS before this side's CPC has called
cpc_complete())
- we have not yet sent our CTS
We don't even want to mark the endpoint connected() until
we have posted our receives because otherwise we will
trigger credit management (because the rd_credits will
still be negative), and Bad Things will happen. */
if ( ep - > endpoint_posted_recvs ) {
2016-06-30 12:21:47 -06:00
/* need to hold the lock for both send_cts and connected */
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
2008-10-06 00:46:02 +00:00
if ( ! ep - > endpoint_cts_sent ) {
mca_btl_openib_endpoint_send_cts ( ep ) ;
}
mca_btl_openib_endpoint_connected ( ep ) ;
}
break ;
2011-01-19 20:58:22 +00:00
# if BTL_OPENIB_FAILOVER_ENABLED
2010-07-14 10:08:19 +00:00
case MCA_BTL_OPENIB_CONTROL_EP_BROKEN :
case MCA_BTL_OPENIB_CONTROL_EP_EAGER_RDMA_ERROR :
2012-03-01 17:29:40 +00:00
btl_openib_handle_failover_control_messages ( ctl_hdr , ep ) ;
break ;
2010-07-14 10:08:19 +00:00
# endif
2006-03-26 08:30:50 +00:00
default :
2007-01-12 23:14:45 +00:00
BTL_ERROR ( ( " Unknown message type received by BTL " ) ) ;
2006-03-26 08:30:50 +00:00
break ;
}
2005-11-10 20:15:02 +00:00
}
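In the coalesced case above, each sub-message is delivered by indexing a callback table with the message tag (`mca_btl_base_active_message_trigger + clsc_hdr->tag`) and calling the registered function with its stored `cbdata`. A minimal sketch of that dispatch pattern follows. The `demo_` names and the simplified callback signature are assumptions for illustration, and the NULL check is an addition that the listing does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TAG_MAX 256  /* one slot per possible 8-bit tag */

/* Simplified callback signature (hypothetical). */
typedef void (*demo_cbfunc_t)(uint8_t tag, const void *payload,
                              size_t len, void *cbdata);

typedef struct {
    demo_cbfunc_t cbfunc;
    void *cbdata;
} demo_am_callback_t;

/* Global table of registered callbacks, indexed by tag. */
static demo_am_callback_t demo_am_trigger[DEMO_TAG_MAX];

static void demo_am_register(uint8_t tag, demo_cbfunc_t cbfunc, void *cbdata)
{
    demo_am_trigger[tag].cbfunc = cbfunc;
    demo_am_trigger[tag].cbdata = cbdata;
}

/* Returns 0 after invoking the callback, -1 if none is registered. */
static int demo_am_dispatch(uint8_t tag, const void *payload, size_t len)
{
    /* Same pointer arithmetic as the listing: table base + tag. */
    demo_am_callback_t *reg = demo_am_trigger + tag;

    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(tag, payload, len, reg->cbdata);
    return 0;
}

/* Example callback: adds the payload length to a counter in cbdata. */
static void demo_count_bytes(uint8_t tag, const void *payload,
                             size_t len, void *cbdata)
{
    (void) tag;
    (void) payload;
    *(size_t *) cbdata += len;
}
```

Because the tag is a single byte and the table has 256 slots, the lookup cannot index out of bounds in this sketch.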
2015-11-02 12:07:08 -07:00
static int openib_reg_mr ( void * reg_data , void * base , size_t size ,
mca_rcache_base_registration_t * reg )
2006-12-17 12:26:41 +00:00
{
2008-07-23 00:28:59 +00:00
mca_btl_openib_device_t * device = ( mca_btl_openib_device_t * ) reg_data ;
2006-12-17 12:26:41 +00:00
mca_btl_openib_reg_t * openib_reg = ( mca_btl_openib_reg_t * ) reg ;
2015-09-29 15:28:00 -06:00
enum ibv_access_flags access_flag = 0 ;
2015-11-02 12:07:08 -07:00
if ( reg - > access_flags & MCA_RCACHE_ACCESS_REMOTE_READ ) {
2015-09-29 15:28:00 -06:00
access_flag | = IBV_ACCESS_REMOTE_READ ;
}
2015-11-02 12:07:08 -07:00
if ( reg - > access_flags & MCA_RCACHE_ACCESS_REMOTE_WRITE ) {
2015-11-04 15:23:11 -07:00
access_flag | = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_LOCAL_WRITE ;
2015-09-29 15:28:00 -06:00
}
2015-11-02 12:07:08 -07:00
if ( reg - > access_flags & MCA_RCACHE_ACCESS_LOCAL_WRITE ) {
2015-09-29 15:28:00 -06:00
access_flag | = IBV_ACCESS_LOCAL_WRITE ;
}
2006-12-17 12:26:41 +00:00
2015-01-05 14:36:57 -07:00
# if HAVE_DECL_IBV_ATOMIC_HCA
2015-11-02 12:07:08 -07:00
if ( reg - > access_flags & MCA_RCACHE_ACCESS_REMOTE_ATOMIC ) {
2015-11-04 15:23:11 -07:00
access_flag | = IBV_ACCESS_REMOTE_ATOMIC | IBV_ACCESS_LOCAL_WRITE ;
2015-09-29 15:28:00 -06:00
}
2015-01-05 14:36:57 -07:00
# endif
2012-07-19 17:52:21 +00:00
if ( device - > mem_reg_max & &
device - > mem_reg_max < ( device - > mem_reg_active + size ) ) {
2014-07-26 00:47:28 +00:00
return OPAL_ERR_OUT_OF_RESOURCE ;
2012-07-19 17:52:21 +00:00
}
device - > mem_reg_active + = size ;
2011-02-17 04:28:56 +00:00
# if HAVE_DECL_IBV_ACCESS_SO
2015-11-02 12:07:08 -07:00
if ( reg - > flags & MCA_RCACHE_FLAGS_SO_MEM ) {
2011-02-16 05:37:22 +00:00
access_flag | = IBV_ACCESS_SO ;
}
# endif
openib_reg - > mr = ibv_reg_mr ( device - > ib_pd , base , size , access_flag ) ;
2006-12-17 12:26:41 +00:00
2010-05-11 21:43:19 +00:00
if ( NULL = = openib_reg - > mr ) {
2013-10-04 18:56:06 +00:00
OPAL_OUTPUT_VERBOSE ( ( 5 , mca_btl_openib_component . memory_registration_verbose ,
" ibv_reg_mr() failed: base=%p, bound=%p, size=%d, flags=0x%x, errno=%d " ,
reg - > base , reg - > bound , ( int ) ( reg - > bound - reg - > base + 1 ) , reg - > flags , errno ) ) ;
2014-07-26 00:47:28 +00:00
return OPAL_ERR_OUT_OF_RESOURCE ;
2010-05-11 21:43:19 +00:00
}
2006-12-17 12:26:41 +00:00
2015-01-06 08:48:33 -07:00
openib_reg - > btl_handle . lkey = openib_reg - > mr - > lkey ;
openib_reg - > btl_handle . rkey = openib_reg - > mr - > rkey ;
2013-09-13 14:28:26 +00:00
OPAL_OUTPUT_VERBOSE ( ( 30 , mca_btl_openib_component . memory_registration_verbose ,
2013-09-13 19:47:05 +00:00
" openib_reg_mr: base=%p, bound=%p, size=%d, flags=0x%x " , reg - > base , reg - > bound ,
( int ) ( reg - > bound - reg - > base + 1 ) , reg - > flags ) ) ;
2013-09-13 14:28:26 +00:00
2013-11-01 12:19:40 +00:00
# if OPAL_CUDA_SUPPORT
2015-11-02 12:07:08 -07:00
if ( reg - > flags & MCA_RCACHE_FLAGS_CUDA_REGISTER_MEM ) {
mca_common_cuda_register ( base , size ,
openib_reg - > base . rcache - > rcache_component - > rcache_version . mca_component_name ) ;
2011-08-04 10:15:45 +00:00
}
# endif
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2006-12-17 12:26:41 +00:00
}
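The registration routine above does two things before it calls `ibv_reg_mr()`: it translates rcache access flags into verbs access flags (adding local write whenever remote write or remote atomic access is requested, as the verbs interface requires), and it checks the request against an optional per-device registration cap. The sketch below isolates both steps without depending on libibverbs. All `DEMO_` constants and `demo_` names are placeholders, not the real `MCA_RCACHE_*` or `IBV_ACCESS_*` values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder request flags (stand-ins for MCA_RCACHE_ACCESS_*). */
enum {
    DEMO_RCACHE_LOCAL_WRITE   = 0x1,
    DEMO_RCACHE_REMOTE_READ   = 0x2,
    DEMO_RCACHE_REMOTE_WRITE  = 0x4,
    DEMO_RCACHE_REMOTE_ATOMIC = 0x8
};

/* Placeholder verbs flags (stand-ins for IBV_ACCESS_*). */
enum {
    DEMO_IBV_LOCAL_WRITE   = 0x1,
    DEMO_IBV_REMOTE_WRITE  = 0x2,
    DEMO_IBV_REMOTE_READ   = 0x4,
    DEMO_IBV_REMOTE_ATOMIC = 0x8
};

/* Map request flags to verbs flags; remote write and remote atomic
 * access both pull in local write. */
static unsigned demo_translate_access(unsigned rcache_flags)
{
    unsigned access = 0;

    if (rcache_flags & DEMO_RCACHE_REMOTE_READ) {
        access |= DEMO_IBV_REMOTE_READ;
    }
    if (rcache_flags & DEMO_RCACHE_REMOTE_WRITE) {
        access |= DEMO_IBV_REMOTE_WRITE | DEMO_IBV_LOCAL_WRITE;
    }
    if (rcache_flags & DEMO_RCACHE_LOCAL_WRITE) {
        access |= DEMO_IBV_LOCAL_WRITE;
    }
    if (rcache_flags & DEMO_RCACHE_REMOTE_ATOMIC) {
        access |= DEMO_IBV_REMOTE_ATOMIC | DEMO_IBV_LOCAL_WRITE;
    }
    return access;
}

/* Per-device accounting; mem_reg_max == 0 means "no limit". */
typedef struct {
    uint64_t mem_reg_max;
    uint64_t mem_reg_active;
} demo_device_t;

/* Reserve size bytes against the cap; returns 0 on success, -1 if the
 * reservation would exceed the cap. */
static int demo_reserve(demo_device_t *dev, size_t size)
{
    if (dev->mem_reg_max && dev->mem_reg_max < dev->mem_reg_active + size) {
        return -1;
    }
    dev->mem_reg_active += size;
    return 0;
}

/* Give back a reservation, e.g. on deregistration. */
static void demo_release(demo_device_t *dev, size_t size)
{
    assert(dev->mem_reg_active >= size);
    dev->mem_reg_active -= size;
}
```

One design question the sketch leaves to the caller is whether to call `demo_release()` when the underlying registration fails after a successful reservation; doing so keeps the counter accurate, but whether the BTL intends that behaviour is not stated in the listing.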
2015-11-02 12:07:08 -07:00
static int openib_dereg_mr ( void * reg_data , mca_rcache_base_registration_t * reg )
2006-12-17 12:26:41 +00:00
{
2012-07-19 17:52:21 +00:00
mca_btl_openib_device_t * device = ( mca_btl_openib_device_t * ) reg_data ;
2006-12-17 12:26:41 +00:00
mca_btl_openib_reg_t * openib_reg = ( mca_btl_openib_reg_t * ) reg ;
2013-09-13 14:28:26 +00:00
OPAL_OUTPUT_VERBOSE ( ( 30 , mca_btl_openib_component . memory_registration_verbose ,
2013-09-13 19:47:05 +00:00
" openib_dereg_mr: base=%p, bound=%p, size=%d, flags=0x%x " , reg - > base , reg - > bound ,
( int ) ( reg - > bound - reg - > base + 1 ) , reg - > flags ) ) ;
2013-09-13 14:28:26 +00:00
2006-12-17 12:26:41 +00:00
if ( openib_reg - > mr ! = NULL ) {
if ( ibv_dereg_mr ( openib_reg - > mr ) ) {
2008-05-02 11:52:33 +00:00
BTL_ERROR ( ( " %s: error unpinning openib memory errno says %s " ,
2007-10-15 17:53:02 +00:00
__func__ , strerror ( errno ) ) ) ;
2014-07-26 00:47:28 +00:00
return OPAL_ERROR ;
2006-12-17 12:26:41 +00:00
}
2011-08-04 10:15:45 +00:00
2013-11-01 12:19:40 +00:00
# if OPAL_CUDA_SUPPORT
2015-11-02 12:07:08 -07:00
if ( reg - > flags & MCA_RCACHE_FLAGS_CUDA_REGISTER_MEM ) {
2011-08-04 10:15:45 +00:00
mca_common_cuda_unregister ( openib_reg - > base . base ,
2015-11-02 12:07:08 -07:00
openib_reg - > base . rcache - > rcache_component - > rcache_version . mca_component_name ) ;
2011-08-04 10:15:45 +00:00
}
# endif
2006-12-17 12:26:41 +00:00
}
2012-07-19 17:52:21 +00:00
device - > mem_reg_active - = ( uint64_t ) ( reg - > bound - reg - > base + 1 ) ;
2006-12-17 12:26:41 +00:00
openib_reg - > mr = NULL ;
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2006-12-17 12:26:41 +00:00
}
2012-10-30 19:45:18 +00:00
2013-03-27 21:09:41 +00:00
static inline int param_register_uint ( const char * param_name , unsigned int default_value , unsigned int * storage )
{
* storage = default_value ;
( void ) mca_base_component_var_register ( & mca_btl_openib_component . super . btl_version ,
param_name , NULL , MCA_BASE_VAR_TYPE_UNSIGNED_INT ,
NULL , 0 , 0 , OPAL_INFO_LVL_9 ,
MCA_BASE_VAR_SCOPE_READONLY , storage ) ;
return * storage ;
2007-06-13 12:47:38 +00:00
}
2006-12-17 12:26:41 +00:00
2008-07-23 00:28:59 +00:00
static int init_one_port ( opal_list_t * btl_list , mca_btl_openib_device_t * device ,
2007-04-22 10:22:12 +00:00
uint8_t port_num , uint16_t pkey_index ,
struct ibv_port_attr * ib_port_attr )
2006-06-28 07:23:08 +00:00
{
2008-01-28 10:38:08 +00:00
uint16_t lid , i , lmc , lmc_step ;
2006-06-28 07:23:08 +00:00
mca_btl_openib_module_t * openib_btl ;
mca_btl_base_selected_module_t * ib_selected ;
2006-09-22 10:27:12 +00:00
union ibv_gid gid ;
2007-01-12 22:42:20 +00:00
uint64_t subnet_id ;
2007-04-21 00:15:05 +00:00
2011-02-24 14:09:22 +00:00
/* Ensure that the requested GID index (via the
btl_openib_gid_index MCA param) is within the GID table
size. */
/* Valid GID indexes are 0 .. gid_tbl_len - 1 */
if ( mca_btl_openib_component . gid_index > =
ib_port_attr - > gid_tbl_len ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " gid index too large " ,
2014-10-03 14:19:48 -07:00
true , opal_process_info . nodename ,
2011-02-24 14:09:22 +00:00
ibv_get_device_name ( device - > ib_dev ) , port_num ,
mca_btl_openib_component . gid_index ,
ib_port_attr - > gid_tbl_len ) ;
2014-07-26 00:47:28 +00:00
        return OPAL_ERR_NOT_FOUND;
2011-02-24 14:09:22 +00:00
    }
    BTL_VERBOSE(("looking for %s:%d GID index %d",
                 ibv_get_device_name(device->ib_dev), port_num,
                 mca_btl_openib_component.gid_index));
2008-05-02 11:52:33 +00:00
    /* If we have struct ibv_device.transport_type, then we're >= OFED
2008-05-07 14:51:36 +00:00
       v1.2, and the transport could be iWarp or IB. If we don't have
       that member, then we're < OFED v1.2, and it can only be IB. */
2008-05-02 11:52:33 +00:00
#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
2008-07-23 00:28:59 +00:00
    if (IBV_TRANSPORT_IWARP == device->ib_dev->transport_type) {
2010-02-25 21:04:09 +00:00
        subnet_id = mca_btl_openib_get_ip_subnet_id(device->ib_dev, port_num);
2008-06-26 20:23:56 +00:00
        BTL_VERBOSE(("my iWARP subnet_id is %016" PRIx64, subnet_id));
2008-05-02 11:52:33 +00:00
    } else {
2008-06-26 20:23:56 +00:00
        memset(&gid, 0, sizeof(gid));
2011-07-04 14:00:41 +00:00
        if (0 != ibv_query_gid(device->ib_dev_context, port_num,
2011-02-24 14:09:22 +00:00
                               mca_btl_openib_component.gid_index, &gid)) {
            BTL_ERROR(("ibv_query_gid failed (%s:%d, %d)\n",
                       ibv_get_device_name(device->ib_dev), port_num,
                       mca_btl_openib_component.gid_index));
2014-07-26 00:47:28 +00:00
            return OPAL_ERR_NOT_FOUND;
2008-06-26 20:23:56 +00:00
        }
2010-02-25 20:57:05 +00:00
2013-08-22 17:44:20 +00:00
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
2010-02-25 20:57:05 +00:00
        if (IBV_LINK_LAYER_ETHERNET == ib_port_attr->link_layer) {
2013-10-23 06:08:54 +00:00
            subnet_id = mca_btl_openib_component.rroce_enable ? 0 :
                mca_btl_openib_get_ip_subnet_id(device->ib_dev, port_num);
2010-02-25 20:57:05 +00:00
        } else {
            subnet_id = ntoh64(gid.global.subnet_prefix);
        }
#else
2008-05-02 11:52:33 +00:00
        subnet_id = ntoh64(gid.global.subnet_prefix);
2010-02-25 20:57:05 +00:00
#endif
2011-07-04 14:00:41 +00:00
        BTL_VERBOSE(("my IB subnet_id for HCA %s port %d is %016" PRIx64,
2008-07-23 00:28:59 +00:00
                     ibv_get_device_name(device->ib_dev), port_num, subnet_id));
2008-05-02 11:52:33 +00:00
    }
#else
2011-07-04 14:00:41 +00:00
    if (0 != ibv_query_gid(device->ib_dev_context, port_num,
2011-02-24 14:09:22 +00:00
                           mca_btl_openib_component.gid_index, &gid)) {
        BTL_ERROR(("ibv_query_gid failed (%s:%d, %d)\n",
                   ibv_get_device_name(device->ib_dev), port_num,
                   mca_btl_openib_component.gid_index));
2014-07-26 00:47:28 +00:00
        return OPAL_ERR_NOT_FOUND;
2008-06-26 20:23:56 +00:00
    }
2007-01-12 22:42:20 +00:00
    subnet_id = ntoh64(gid.global.subnet_prefix);
2011-07-04 14:00:41 +00:00
    BTL_VERBOSE(("my IB-only subnet_id for HCA %s port %d is %016" PRIx64,
2008-07-23 00:28:59 +00:00
                 ibv_get_device_name(device->ib_dev), port_num, subnet_id));
2008-05-02 11:52:33 +00:00
#endif
2016-02-15 14:36:59 +09:00
    if (mca_btl_openib_component.num_default_gid_btls > 0 &&
2007-01-12 22:42:20 +00:00
        IB_DEFAULT_GID_PREFIX == subnet_id &&
2006-09-26 12:12:33 +00:00
        mca_btl_openib_component.warn_default_gid_prefix) {
2013-02-12 21:10:11 +00:00
        opal_show_help("help-mpi-btl-openib.txt", "default subnet prefix",
2014-10-03 14:19:48 -07:00
                       true, opal_process_info.nodename);
2006-09-26 12:12:33 +00:00
    }
2006-06-28 07:23:08 +00:00
2016-02-15 14:36:59 +09:00
    if (IB_DEFAULT_GID_PREFIX == subnet_id) {
        mca_btl_openib_component.num_default_gid_btls++;
    }
2006-06-28 07:23:08 +00:00
    lmc = (1 << ib_port_attr->lmc);
2008-01-28 10:38:08 +00:00
    lmc_step = 1;
2006-06-28 07:23:08 +00:00
2008-01-21 12:11:18 +00:00
    if (0 != mca_btl_openib_component.max_lmc &&
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
2006-08-14 19:30:37 +00:00
        mca_btl_openib_component.max_lmc < lmc) {
2006-06-28 07:23:08 +00:00
        lmc = mca_btl_openib_component.max_lmc;
2006-08-14 19:30:37 +00:00
    }
2006-06-28 07:23:08 +00:00
2008-05-02 11:52:33 +00:00
    /* APM support -- only meaningful if async event support is
       enabled. If async events are not enabled, then there's nothing
       to listen for the APM event to load the new path, so it's not
       worth enabling APM. */
2008-01-28 10:38:08 +00:00
    if (lmc > 1) {
2008-02-20 13:44:05 +00:00
        if (-1 == mca_btl_openib_component.apm_lmc) {
2008-01-28 10:38:08 +00:00
            lmc_step = lmc;
2008-02-20 13:44:05 +00:00
            mca_btl_openib_component.apm_lmc = lmc - 1;
        } else if (0 == lmc % (mca_btl_openib_component.apm_lmc + 1)) {
            lmc_step = mca_btl_openib_component.apm_lmc + 1;
2008-01-28 10:38:08 +00:00
        } else {
2013-02-12 21:10:11 +00:00
            opal_show_help("help-mpi-btl-openib.txt", "apm with wrong lmc", true,
2008-02-20 13:44:05 +00:00
                           mca_btl_openib_component.apm_lmc, lmc);
2014-07-26 00:47:28 +00:00
            return OPAL_ERROR;
2008-01-28 10:38:08 +00:00
        }
    } else {
2008-02-20 13:44:05 +00:00
        if (mca_btl_openib_component.apm_lmc) {
2008-01-28 10:38:08 +00:00
            /* Disable apm and report warning */
2008-02-20 13:44:05 +00:00
            mca_btl_openib_component.apm_lmc = 0;
2013-02-12 21:10:11 +00:00
            opal_show_help("help-mpi-btl-openib.txt", "apm without lmc", true);
2008-01-28 10:38:08 +00:00
        }
    }
2006-06-28 07:23:08 +00:00
    for (lid = ib_port_attr->lid;
2008-01-28 10:38:08 +00:00
         lid < ib_port_attr->lid + lmc; lid += lmc_step) {
2006-06-28 07:23:08 +00:00
        for (i = 0; i < mca_btl_openib_component.btls_per_lid; i++) {
2007-06-13 12:47:38 +00:00
            char param[40];
2008-01-14 23:22:03 +00:00
2013-04-09 21:55:31 +00:00
            openib_btl = (mca_btl_openib_module_t *) calloc(1, sizeof(mca_btl_openib_module_t));
2006-06-28 07:23:08 +00:00
            if (NULL == openib_btl) {
2008-05-02 11:52:33 +00:00
                BTL_ERROR(("Failed malloc: %s:%d", __FILE__, __LINE__));
2014-07-26 00:47:28 +00:00
                return OPAL_ERR_OUT_OF_RESOURCE;
2006-06-28 07:23:08 +00:00
            }
            memcpy(openib_btl, &mca_btl_openib_module,
                   sizeof(mca_btl_openib_module));
            memcpy(&openib_btl->ib_port_attr, ib_port_attr,
                   sizeof(struct ibv_port_attr));
            ib_selected = OBJ_NEW(mca_btl_base_selected_module_t);
            ib_selected->btl_module = (mca_btl_base_module_t *) openib_btl;
2008-07-23 00:28:59 +00:00
            openib_btl->device = device;
2006-06-28 07:23:08 +00:00
            openib_btl->port_num = (uint8_t) port_num;
2007-04-22 10:22:12 +00:00
            openib_btl->pkey_index = pkey_index;
2006-06-28 07:23:08 +00:00
            openib_btl->lid = lid;
2008-02-20 13:44:05 +00:00
            openib_btl->apm_port = 0;
2006-06-28 07:23:08 +00:00
            openib_btl->src_path_bits = lid - ib_port_attr->lid;
2008-05-02 11:52:33 +00:00
2007-01-12 22:42:20 +00:00
            openib_btl->port_info.subnet_id = subnet_id;
2008-07-23 00:28:59 +00:00
            openib_btl->port_info.mtu = device->mtu;
2008-02-20 13:44:05 +00:00
            openib_btl->port_info.lid = lid;
2008-05-02 11:52:33 +00:00
            openib_btl->cpcs = NULL;
            openib_btl->num_cpcs = 0;
2012-07-19 17:52:21 +00:00
            openib_btl->local_procs = 0;
2008-05-02 11:52:33 +00:00
2008-01-15 05:32:53 +00:00
            mca_btl_base_active_message_trigger[MCA_BTL_TAG_IB].cbfunc = btl_openib_control;
            mca_btl_base_active_message_trigger[MCA_BTL_TAG_IB].cbdata = NULL;
2007-04-21 00:15:05 +00:00
2015-01-06 08:48:33 -07:00
            if (openib_btl->super.btl_get_limit > openib_btl->ib_port_attr.max_msg_sz) {
                openib_btl->super.btl_get_limit = openib_btl->ib_port_attr.max_msg_sz;
            }
            openib_btl->super.btl_get_alignment = 0;
            if (openib_btl->super.btl_put_limit > openib_btl->ib_port_attr.max_msg_sz) {
                openib_btl->super.btl_put_limit = openib_btl->ib_port_attr.max_msg_sz;
            }
2015-02-19 16:13:37 -07:00
            openib_btl->super.btl_put_local_registration_threshold = openib_btl->device->max_inline_data;
            openib_btl->super.btl_get_local_registration_threshold = 0;
2015-01-05 14:36:57 -07:00
#if HAVE_DECL_IBV_ATOMIC_HCA
2015-11-23 16:07:12 -07:00
            openib_btl->atomic_ops_be = false;
2016-05-23 14:27:33 +09:00
#ifdef HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXT_ATOM
2015-11-30 12:32:04 -07:00
            /* check that 8-byte atomics are supported */
2015-11-30 22:21:26 -07:00
            if (!(device->ib_exp_dev_attr.ext_atom.log_atomic_arg_sizes & (1 << 3ull))) {
2015-11-30 12:32:04 -07:00
                openib_btl->super.btl_flags &= ~MCA_BTL_FLAGS_ATOMIC_FOPS;
                openib_btl->super.btl_atomic_flags = 0;
                openib_btl->super.btl_atomic_fop = NULL;
                openib_btl->super.btl_atomic_cswap = NULL;
            }
#endif
2016-07-21 09:19:54 +09:00
#ifdef HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXP_ATOMIC_CAP
2015-12-17 16:45:43 +09:00
            switch (openib_btl->device->ib_exp_dev_attr.exp_atomic_cap)
#else
            switch (openib_btl->device->ib_dev_attr.atomic_cap)
#endif
            {
2015-11-23 16:07:12 -07:00
            case IBV_ATOMIC_GLOB:
                openib_btl->super.btl_flags |= MCA_BTL_ATOMIC_SUPPORTS_GLOB;
                break;
#if HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE
            case IBV_EXP_ATOMIC_HCA_REPLY_BE:
                openib_btl->atomic_ops_be = true;
                break;
#endif
            case IBV_ATOMIC_HCA:
                break;
            case IBV_ATOMIC_NONE:
            default:
                /* no atomics or an unsupported atomic type */
2015-01-05 14:36:57 -07:00
                openib_btl->super.btl_flags &= ~MCA_BTL_FLAGS_ATOMIC_FOPS;
                openib_btl->super.btl_atomic_flags = 0;
                openib_btl->super.btl_atomic_fop = NULL;
                openib_btl->super.btl_atomic_cswap = NULL;
            }
#endif
2015-01-06 08:48:33 -07:00
            openib_btl->super.btl_put_alignment = 0;
            openib_btl->super.btl_registration_handle_size = sizeof(mca_btl_base_registration_handle_t);
2012-06-21 17:09:12 +00:00
2008-07-23 00:28:59 +00:00
            /* Check bandwidth configured for this device */
            sprintf(param, "bandwidth_%s", ibv_get_device_name(device->ib_dev));
2013-03-27 21:09:41 +00:00
            param_register_uint(param, openib_btl->super.btl_bandwidth, &openib_btl->super.btl_bandwidth);
2007-06-13 12:47:38 +00:00
2008-07-23 00:28:59 +00:00
            /* Check bandwidth configured for this device/port */
            sprintf(param, "bandwidth_%s:%d", ibv_get_device_name(device->ib_dev),
2007-06-13 12:47:38 +00:00
                    port_num);
2013-03-27 21:09:41 +00:00
            param_register_uint(param, openib_btl->super.btl_bandwidth, &openib_btl->super.btl_bandwidth);
2007-06-13 12:47:38 +00:00
2008-07-23 00:28:59 +00:00
            /* Check bandwidth configured for this device/port/LID */
2007-06-13 12:47:38 +00:00
            sprintf(param, "bandwidth_%s:%d:%d",
2008-07-23 00:28:59 +00:00
                    ibv_get_device_name(device->ib_dev), port_num, lid);
2013-03-27 21:09:41 +00:00
            param_register_uint(param, openib_btl->super.btl_bandwidth, &openib_btl->super.btl_bandwidth);
2007-06-13 12:47:38 +00:00
2008-07-23 00:28:59 +00:00
            /* Check latency configured for this device */
            sprintf(param, "latency_%s", ibv_get_device_name(device->ib_dev));
2013-03-27 21:09:41 +00:00
            param_register_uint(param, openib_btl->super.btl_latency, &openib_btl->super.btl_latency);
2007-06-13 12:47:38 +00:00
2008-07-23 00:28:59 +00:00
            /* Check latency configured for this device/port */
            sprintf(param, "latency_%s:%d", ibv_get_device_name(device->ib_dev),
2007-06-13 12:47:38 +00:00
                    port_num);
2013-03-27 21:09:41 +00:00
            param_register_uint(param, openib_btl->super.btl_latency, &openib_btl->super.btl_latency);
2007-06-13 12:47:38 +00:00
2008-07-23 00:28:59 +00:00
            /* Check latency configured for this device/port/LID */
            sprintf(param, "latency_%s:%d:%d", ibv_get_device_name(device->ib_dev),
2007-06-13 12:47:38 +00:00
                    port_num, lid);
2013-03-27 21:09:41 +00:00
            param_register_uint(param, openib_btl->super.btl_latency, &openib_btl->super.btl_latency);
2007-06-13 12:47:38 +00:00
2007-04-21 00:15:05 +00:00
            /* Auto-detect the port bandwidth */
            if (0 == openib_btl->super.btl_bandwidth) {
2014-07-26 00:47:28 +00:00
                if (OPAL_SUCCESS !=
                    opal_common_verbs_port_bw(ib_port_attr,
2012-09-01 01:42:37 +00:00
                                              &openib_btl->super.btl_bandwidth)) {
                    /* If we can't figure out the bandwidth, declare
                       this port unreachable (do not return
                       ERR_VALUE_OF_OUT_OF_BOUNDS; that is reserved
                       for when we exceed the number of allowable
                       BTLs). */
2014-07-26 00:47:28 +00:00
                    return OPAL_ERR_UNREACH;
2007-04-21 00:15:05 +00:00
                }
            }
2012-09-01 01:42:37 +00:00
2006-06-28 07:23:08 +00:00
            opal_list_append(btl_list, (opal_list_item_t *) ib_selected);
2008-07-23 00:28:59 +00:00
            opal_pointer_array_add(device->device_btls, (void *) openib_btl);
            ++device->btls;
2006-08-14 19:30:37 +00:00
            ++mca_btl_openib_component.ib_num_btls;
            if (-1 != mca_btl_openib_component.ib_max_btls &&
                mca_btl_openib_component.ib_num_btls >=
                mca_btl_openib_component.ib_max_btls) {
2014-07-26 00:47:28 +00:00
return OPAL_ERR_VALUE_OUT_OF_BOUNDS ;
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
2006-08-14 19:30:37 +00:00
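The MTU negotiation listed in the commit message above (each peer sends its maximum MTU and both select the lesser) can be sketched as follows. This is a minimal illustration and not the openib BTL's own code: `sketch_mtu_t`, `negotiate_mtu`, and `mtu_bytes` are hypothetical names, and the enum is an assumed stand-in in which larger values mean larger MTUs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an MTU enum; values grow with the MTU size. */
typedef enum {
    SKETCH_MTU_256  = 1,
    SKETCH_MTU_512  = 2,
    SKETCH_MTU_1024 = 3,
    SKETCH_MTU_2048 = 4,
    SKETCH_MTU_4096 = 5
} sketch_mtu_t;

/* Each peer advertises its maximum MTU; both sides then use the smaller one. */
static sketch_mtu_t negotiate_mtu(sketch_mtu_t local_max, sketch_mtu_t remote_max)
{
    return (local_max < remote_max) ? local_max : remote_max;
}

/* Convert the enum to a byte count: 256 << (value - 1). */
static uint32_t mtu_bytes(sketch_mtu_t mtu)
{
    assert(mtu >= SKETCH_MTU_256 && mtu <= SKETCH_MTU_4096);
    return 256u << ((unsigned) mtu - 1u);
}
```

Because both peers apply the same rule to the same pair of values, they arrive at the same MTU without a further round trip.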
}
2006-06-28 07:23:08 +00:00
}
}
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2006-06-28 07:23:08 +00:00
}
2008-07-23 00:28:59 +00:00
static void device_construct ( mca_btl_openib_device_t * device )
2007-12-23 12:29:34 +00:00
{
2008-07-23 00:28:59 +00:00
device - > ib_dev = NULL ;
device - > ib_dev_context = NULL ;
device - > ib_pd = NULL ;
device - > mpool = NULL ;
2015-11-02 12:07:08 -07:00
device - > rcache = NULL ;
2015-10-06 12:47:09 -06:00
# if OPAL_ENABLE_PROGRESS_THREADS == 1
2008-07-23 00:28:59 +00:00
device - > ib_channel = NULL ;
2014-06-25 20:43:28 +00:00
# endif
2008-07-23 00:28:59 +00:00
device - > btls = 0 ;
2013-04-09 21:55:31 +00:00
device - > endpoints = NULL ;
device - > device_btls = NULL ;
2008-07-23 00:28:59 +00:00
device - > ib_cq [ BTL_OPENIB_HP_CQ ] = NULL ;
device - > ib_cq [ BTL_OPENIB_LP_CQ ] = NULL ;
device - > cq_size [ BTL_OPENIB_HP_CQ ] = 0 ;
device - > cq_size [ BTL_OPENIB_LP_CQ ] = 0 ;
device - > non_eager_rdma_endpoints = 0 ;
device - > hp_cq_polls = mca_btl_openib_component . cq_poll_ratio ;
device - > eager_rdma_polls = mca_btl_openib_component . eager_rdma_poll_ratio ;
device - > pollme = true ;
device - > eager_rdma_buffers_count = 0 ;
device - > eager_rdma_buffers = NULL ;
2008-01-09 10:26:21 +00:00
# if HAVE_XRC
2008-07-23 00:28:59 +00:00
device - > xrc_fd = - 1 ;
2008-01-09 10:26:21 +00:00
# endif
2008-07-23 00:28:59 +00:00
device - > qps = NULL ;
OBJ_CONSTRUCT ( & device - > device_lock , opal_mutex_t ) ;
2015-02-19 13:41:41 -07:00
OBJ_CONSTRUCT ( & device - > send_free_control , opal_free_list_t ) ;
2008-07-23 00:28:59 +00:00
device - > max_inline_data = 0 ;
2014-01-06 19:51:30 +00:00
device - > ready_for_use = false ;
2007-12-23 12:29:34 +00:00
}
2008-07-23 00:28:59 +00:00
static void device_destruct ( mca_btl_openib_device_t * device )
2007-12-23 12:29:34 +00:00
{
2008-01-09 10:26:21 +00:00
int i ;
2015-10-06 12:47:09 -06:00
# if OPAL_ENABLE_PROGRESS_THREADS == 1
if ( device - > progress ) {
2008-07-23 00:28:59 +00:00
device - > progress = false ;
if ( pthread_cancel ( device - > thread . t_handle ) ) {
2008-06-05 12:20:13 +00:00
BTL_ERROR ( ( " Failed to cancel OpenIB progress thread " ) ) ;
2008-07-23 00:28:59 +00:00
goto device_error ;
2008-06-05 12:20:13 +00:00
}
2008-07-23 00:28:59 +00:00
opal_thread_join ( & device - > thread , NULL ) ;
2008-06-05 12:20:13 +00:00
}
2015-10-06 12:47:09 -06:00
2008-07-23 00:28:59 +00:00
if ( ibv_destroy_comp_channel ( device - > ib_channel ) ) {
2008-06-05 12:20:13 +00:00
BTL_VERBOSE ( ( " Failed to close comp_channel " ) ) ;
2008-07-23 00:28:59 +00:00
goto device_error ;
2008-06-05 12:20:13 +00:00
}
2014-06-25 20:43:28 +00:00
# endif
2015-10-06 12:47:09 -06:00
2008-07-23 00:28:59 +00:00
/* Signal the async thread to stop polling this device */
2015-10-06 12:47:09 -06:00
mca_btl_openib_async_rem_device ( device ) ;
2008-06-05 12:20:13 +00:00
2008-07-23 00:28:59 +00:00
if ( device - > eager_rdma_buffers ) {
2007-12-23 12:29:34 +00:00
int i ;
2008-07-23 00:28:59 +00:00
for ( i = 0 ; i < device - > eager_rdma_buffers_count ; i + + )
if ( device - > eager_rdma_buffers [ i ] )
OBJ_RELEASE ( device - > eager_rdma_buffers [ i ] ) ;
free ( device - > eager_rdma_buffers ) ;
2007-12-23 12:29:34 +00:00
}
2008-06-05 12:20:13 +00:00
2008-07-23 00:28:59 +00:00
if ( NULL ! = device - > qps ) {
2008-05-20 21:53:42 +00:00
for ( i = 0 ; i < mca_btl_openib_component . num_qps ; i + + ) {
2008-07-23 00:28:59 +00:00
OBJ_DESTRUCT ( & device - > qps [ i ] . send_free ) ;
OBJ_DESTRUCT ( & device - > qps [ i ] . recv_free ) ;
2008-05-20 21:53:42 +00:00
}
2008-07-23 00:28:59 +00:00
free ( device - > qps ) ;
2008-05-20 21:53:42 +00:00
}
2008-06-05 12:20:13 +00:00
2008-07-23 00:28:59 +00:00
OBJ_DESTRUCT ( & device - > send_free_control ) ;
2008-06-05 12:20:13 +00:00
/* Release CQs */
2008-07-23 00:28:59 +00:00
if ( device - > ib_cq [ BTL_OPENIB_HP_CQ ] ! = NULL ) {
if ( ibv_destroy_cq ( device - > ib_cq [ BTL_OPENIB_HP_CQ ] ) ) {
2008-06-05 12:20:13 +00:00
BTL_VERBOSE ( ( " Failed to close HP CQ " ) ) ;
2008-07-23 00:28:59 +00:00
goto device_error ;
2008-06-05 12:20:13 +00:00
}
}
2008-07-23 00:28:59 +00:00
if ( device - > ib_cq [ BTL_OPENIB_LP_CQ ] ! = NULL ) {
if ( ibv_destroy_cq ( device - > ib_cq [ BTL_OPENIB_LP_CQ ] ) ) {
2008-06-05 12:20:13 +00:00
BTL_VERBOSE ( ( " Failed to close LP CQ " ) ) ;
2008-07-23 00:28:59 +00:00
goto device_error ;
2008-06-05 12:20:13 +00:00
}
}
2015-11-02 12:07:08 -07:00
if ( OPAL_SUCCESS ! = mca_rcache_base_module_destroy ( device - > rcache ) ) {
BTL_VERBOSE ( ( " failed to release registration cache " ) ) ;
2008-07-23 00:28:59 +00:00
goto device_error ;
2008-06-05 12:20:13 +00:00
}
# if HAVE_XRC
2014-12-09 18:43:15 +09:00
2008-06-05 12:20:13 +00:00
if ( MCA_BTL_XRC_ENABLED ) {
2014-07-26 00:47:28 +00:00
if ( OPAL_SUCCESS ! = mca_btl_openib_close_xrc_domain ( device ) ) {
2008-06-05 12:20:13 +00:00
BTL_VERBOSE ( ( " XRC Internal error. Failed to close xrc domain " ) ) ;
2008-07-23 00:28:59 +00:00
goto device_error ;
2008-06-05 12:20:13 +00:00
}
}
# endif
2008-07-23 00:28:59 +00:00
if ( ibv_dealloc_pd ( device - > ib_pd ) ) {
2008-06-05 12:20:13 +00:00
BTL_VERBOSE ( ( " Warning! Failed to release PD " ) ) ;
2008-07-23 00:28:59 +00:00
goto device_error ;
2008-06-05 12:20:13 +00:00
}
2008-07-23 00:28:59 +00:00
OBJ_DESTRUCT ( & device - > device_lock ) ;
2008-06-05 12:20:13 +00:00
2008-07-23 00:28:59 +00:00
if ( ibv_close_device ( device - > ib_dev_context ) ) {
2014-07-26 00:47:28 +00:00
if ( 1 = = opal_leave_pinned | | opal_leave_pinned_pipeline ) {
2008-07-23 00:28:59 +00:00
BTL_VERBOSE ( ( " Warning! Failed to close device " ) ) ;
goto device_error ;
2008-06-05 12:20:13 +00:00
} else {
2008-07-23 00:28:59 +00:00
BTL_ERROR ( ( " Error! Failed to close device " ) ) ;
goto device_error ;
2008-06-05 12:20:13 +00:00
}
}
2008-07-23 00:28:59 +00:00
BTL_VERBOSE ( ( " device was successfully released " ) ) ;
2008-06-05 12:20:13 +00:00
return ;
2008-07-23 00:28:59 +00:00
device_error :
BTL_VERBOSE ( ( " Failed to destroy device resources " ) ) ;
2007-12-23 12:29:34 +00:00
}
2008-07-23 00:28:59 +00:00
OBJ_CLASS_INSTANCE ( mca_btl_openib_device_t , opal_object_t , device_construct ,
device_destruct ) ;
2007-12-23 12:29:34 +00:00
2008-01-09 10:27:15 +00:00
static int
2008-07-23 00:28:59 +00:00
get_port_list ( mca_btl_openib_device_t * device , int * allowed_ports )
2008-01-09 10:27:15 +00:00
{
int i , j , k , num_ports = 0 ;
const char * dev_name ;
char * name ;
2008-07-23 00:28:59 +00:00
dev_name = ibv_get_device_name ( device - > ib_dev ) ;
2008-01-09 10:27:15 +00:00
name = ( char * ) malloc ( strlen ( dev_name ) + 5 ) ; /* ':' + up to 3 port digits + NUL */
if ( NULL = = name ) {
return 0 ;
}
/* Assume that all ports are allowed. num_ports will be adjusted
below to reflect whether this is true or not . */
2008-07-23 00:28:59 +00:00
for ( i = 1 ; i < = device - > ib_dev_attr . phys_port_cnt ; + + i ) {
2008-01-09 10:27:15 +00:00
allowed_ports [ num_ports + + ] = i ;
}
num_ports = 0 ;
if ( NULL ! = mca_btl_openib_component . if_include_list ) {
2008-07-23 00:28:59 +00:00
/* If only the device name is given (e.g., mtdevice0,mtdevice1), use all
2008-01-09 10:27:15 +00:00
ports */
i = 0 ;
while ( mca_btl_openib_component . if_include_list [ i ] ) {
2008-01-21 12:11:18 +00:00
if ( 0 = = strcmp ( dev_name ,
2008-01-09 10:27:15 +00:00
mca_btl_openib_component . if_include_list [ i ] ) ) {
2008-07-23 00:28:59 +00:00
num_ports = device - > ib_dev_attr . phys_port_cnt ;
2008-01-09 10:27:15 +00:00
goto done ;
}
+ + i ;
}
2008-07-23 00:28:59 +00:00
/* Include only requested ports on the device */
for ( i = 1 ; i < = device - > ib_dev_attr . phys_port_cnt ; + + i ) {
2008-01-09 10:27:15 +00:00
sprintf ( name , " %s:%d " , dev_name , i ) ;
2008-01-21 12:11:18 +00:00
for ( j = 0 ;
2008-01-09 10:27:15 +00:00
NULL ! = mca_btl_openib_component . if_include_list [ j ] ; + + j ) {
2008-01-21 12:11:18 +00:00
if ( 0 = = strcmp ( name ,
2008-01-09 10:27:15 +00:00
mca_btl_openib_component . if_include_list [ j ] ) ) {
allowed_ports [ num_ports + + ] = i ;
break ;
}
}
}
} else if ( NULL ! = mca_btl_openib_component . if_exclude_list ) {
2008-07-23 00:28:59 +00:00
/* If only the device name is given (e.g., mtdevice0,mtdevice1), exclude
2008-01-09 10:27:15 +00:00
all ports */
i = 0 ;
while ( mca_btl_openib_component . if_exclude_list [ i ] ) {
2008-01-21 12:11:18 +00:00
if ( 0 = = strcmp ( dev_name ,
2008-01-09 10:27:15 +00:00
mca_btl_openib_component . if_exclude_list [ i ] ) ) {
num_ports = 0 ;
goto done ;
}
+ + i ;
}
2008-07-23 00:28:59 +00:00
/* Exclude the specified ports on this device */
for ( i = 1 ; i < = device - > ib_dev_attr . phys_port_cnt ; + + i ) {
2008-01-09 10:27:15 +00:00
sprintf ( name , " %s:%d " , dev_name , i ) ;
2008-01-21 12:11:18 +00:00
for ( j = 0 ;
2008-01-09 10:27:15 +00:00
NULL ! = mca_btl_openib_component . if_exclude_list [ j ] ; + + j ) {
2008-01-21 12:11:18 +00:00
if ( 0 = = strcmp ( name ,
2008-01-09 10:27:15 +00:00
mca_btl_openib_component . if_exclude_list [ j ] ) ) {
/* If found, set a sentinel value */
j = - 1 ;
break ;
}
}
/* If we didn't find it, it's ok to include in the list */
if ( - 1 ! = j ) {
allowed_ports [ num_ports + + ] = i ;
}
}
} else {
2008-07-23 00:28:59 +00:00
num_ports = device - > ib_dev_attr . phys_port_cnt ;
2008-01-09 10:27:15 +00:00
}
done :
/* Remove the following from the error-checking if_list:
- bare device name
- device name suffixed with port number */
if ( NULL ! = mca_btl_openib_component . if_list ) {
for ( i = 0 ; NULL ! = mca_btl_openib_component . if_list [ i ] ; + + i ) {
/* Look for raw device name */
if ( 0 = = strcmp ( mca_btl_openib_component . if_list [ i ] , dev_name ) ) {
j = opal_argv_count ( mca_btl_openib_component . if_list ) ;
opal_argv_delete ( & j , & ( mca_btl_openib_component . if_list ) ,
i , 1 ) ;
- - i ;
}
}
2008-07-23 00:28:59 +00:00
for ( i = 1 ; i < = device - > ib_dev_attr . phys_port_cnt ; + + i ) {
2008-01-09 10:27:15 +00:00
sprintf ( name , " %s:%d " , dev_name , i ) ;
for ( j = 0 ; NULL ! = mca_btl_openib_component . if_list [ j ] ; + + j ) {
if ( 0 = = strcmp ( mca_btl_openib_component . if_list [ j ] , name ) ) {
k = opal_argv_count ( mca_btl_openib_component . if_list ) ;
opal_argv_delete ( & k , & ( mca_btl_openib_component . if_list ) ,
j , 1 ) ;
- - j ;
break ;
}
}
}
}
free ( name ) ;
return num_ports ;
}
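The include/exclude filtering that get_port_list performs can be condensed into the sketch below. It is a simplified illustration under stated assumptions, not the function above: `list_contains` and `filter_ports` are hypothetical helpers, the lists are plain NULL-terminated string arrays rather than the component's if_include_list/if_exclude_list fields, and the if_list bookkeeping done after the `done` label is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `entry` appears in the NULL-terminated `list`, else 0. */
static int list_contains(const char *const *list, const char *entry)
{
    for (int i = 0; NULL != list && NULL != list[i]; ++i) {
        if (0 == strcmp(list[i], entry)) {
            return 1;
        }
    }
    return 0;
}

/* Fill `allowed` with the 1-based port numbers that pass the filters and
 * return how many were kept. A bare device name matches every port;
 * "device:port" matches a single port. An include list takes precedence
 * over an exclude list. */
static int filter_ports(const char *dev_name, int port_cnt,
                        const char *const *include,
                        const char *const *exclude,
                        int *allowed)
{
    char name[256];
    int num = 0;

    assert(NULL != dev_name && NULL != allowed && port_cnt >= 0);

    for (int port = 1; port <= port_cnt; ++port) {
        int keep;

        snprintf(name, sizeof(name), "%s:%d", dev_name, port);
        if (NULL != include) {
            keep = list_contains(include, dev_name) ||
                   list_contains(include, name);
        } else if (NULL != exclude) {
            keep = !list_contains(exclude, dev_name) &&
                   !list_contains(exclude, name);
        } else {
            keep = 1;
        }
        if (keep) {
            allowed[num++] = port;
        }
    }
    return num;
}
```

Using snprintf with a fixed-size buffer avoids having to size a heap allocation for the "device:port" string by hand.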
2008-05-20 21:53:42 +00:00
/*
* Prefer values that are already in the target
*/
2014-07-26 00:47:28 +00:00
static void merge_values ( opal_btl_openib_ini_values_t * target ,
opal_btl_openib_ini_values_t * src )
2008-01-09 10:27:15 +00:00
{
if ( ! target - > mtu_set & & src - > mtu_set ) {
target - > mtu = src - > mtu ;
target - > mtu_set = true ;
}
if ( ! target - > use_eager_rdma_set & & src - > use_eager_rdma_set ) {
target - > use_eager_rdma = src - > use_eager_rdma ;
target - > use_eager_rdma_set = true ;
}
2008-05-20 21:53:42 +00:00
if ( NULL = = target - > receive_queues & & NULL ! = src - > receive_queues ) {
target - > receive_queues = strdup ( src - > receive_queues ) ;
}
2008-06-24 17:18:07 +00:00
if ( ! target - > max_inline_data_set & & src - > max_inline_data_set ) {
target - > max_inline_data = src - > max_inline_data ;
target - > max_inline_data_set = true ;
}
2008-01-09 10:27:15 +00:00
}
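The "prefer values that are already in the target" rule used by merge_values can be illustrated with a reduced structure. This is a sketch under assumptions: `sketch_ini_values_t` and `sketch_merge` are hypothetical and carry only two of the fields, and the string is copied with malloc/memcpy instead of strdup so the block stays within standard C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced version of the INI values structure. */
typedef struct {
    uint32_t mtu;
    bool mtu_set;
    char *receive_queues;   /* NULL means "not set" */
} sketch_ini_values_t;

/* Copy a field from src only when the target has not set it yet, so the
 * first source merged wins. */
static void sketch_merge(sketch_ini_values_t *target,
                         const sketch_ini_values_t *src)
{
    assert(NULL != target && NULL != src);

    if (!target->mtu_set && src->mtu_set) {
        target->mtu = src->mtu;
        target->mtu_set = true;
    }
    if (NULL == target->receive_queues && NULL != src->receive_queues) {
        size_t len = strlen(src->receive_queues) + 1;
        target->receive_queues = malloc(len);
        if (NULL != target->receive_queues) {
            memcpy(target->receive_queues, src->receive_queues, len);
        }
    }
}
```

Merging sources in priority order therefore yields, for each field, the value from the highest-priority source that set it. The caller owns the copied string and is responsible for freeing it.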
static inline bool is_credit_message ( const mca_btl_openib_recv_frag_t * frag )
{
mca_btl_openib_control_header_t * chdr =
2015-03-10 14:21:42 -06:00
( mca_btl_openib_control_header_t * ) to_base_frag ( frag ) - > segment . seg_addr . pval ;
2013-08-30 14:53:59 +00:00
return ( MCA_BTL_TAG_IB = = frag - > hdr - > tag ) & &
2008-01-09 10:27:15 +00:00
( MCA_BTL_OPENIB_CONTROL_CREDITS = = chdr - > type ) ;
}
2008-10-06 00:46:02 +00:00
static inline bool is_cts_message ( const mca_btl_openib_recv_frag_t * frag )
{
mca_btl_openib_control_header_t * chdr =
2015-03-10 14:21:42 -06:00
( mca_btl_openib_control_header_t * ) to_base_frag ( frag ) - > segment . seg_addr . pval ;
2013-08-30 14:53:59 +00:00
return ( MCA_BTL_TAG_IB = = frag - > hdr - > tag ) & &
2008-10-06 00:46:02 +00:00
( MCA_BTL_OPENIB_CONTROL_CTS = = chdr - > type ) ;
}
2008-05-20 21:53:42 +00:00
static int32_t atoi_param ( char * param , int32_t dflt )
{
if ( NULL = = param | | ' \0 ' = = param [ 0 ] ) {
return dflt ? dflt : 1 ;
}
return atoi ( param ) ;
}
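The fallback behaviour of atoi_param is easy to misread, so here is a standalone copy with its edge cases spelled out. `sketch_atoi_param` is a hypothetical name for this illustration; the logic mirrors the function above, with the parameter taken as `const char *`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Missing or empty parameter: fall back to the default, except that a
 * default of 0 is replaced by 1. Otherwise parse the text with atoi(),
 * which yields 0 when the text does not start with a number. */
static int32_t sketch_atoi_param(const char *param, int32_t dflt)
{
    if (NULL == param || '\0' == param[0]) {
        return dflt ? dflt : 1;
    }
    return (int32_t) atoi(param);
}
```

Two consequences follow: a missing value never produces 0, while an explicit non-numeric value does, so callers that must reject 0 still need their own range check.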
2008-07-23 00:28:59 +00:00
static void init_apm_port ( mca_btl_openib_device_t * device , int port , uint16_t lid )
2008-02-20 13:44:05 +00:00
{
int index ;
struct mca_btl_openib_module_t * btl ;
2008-07-23 00:28:59 +00:00
for ( index = 0 ; index < device - > btls ; index + + ) {
2010-07-12 16:17:56 +00:00
btl = ( mca_btl_openib_module_t * ) opal_pointer_array_get_item ( device - > device_btls , index ) ;
2008-02-20 13:44:05 +00:00
/* OK, we already have a BTL for the first port;
* the second one will be used for APM */
btl - > apm_port = port ;
btl - > port_info . apm_lid = lid + btl - > src_path_bits ;
mca_btl_openib_component . apm_ports + + ;
BTL_VERBOSE ( ( " APM-PORT: Setting alternative port - %d, lid - %d "
, port , lid ) ) ;
}
}
2016-03-08 16:17:33 -07:00
static int get_var_source ( const char * var_name , mca_base_var_source_t * source )
{
int vari = mca_base_var_find ( " opal " , " btl " , " openib " , var_name ) ;
if ( 0 > vari ) {
return vari ;
}
return mca_base_var_get_value ( vari , NULL , source , NULL ) ;
}
2008-05-20 21:53:42 +00:00
static int setup_qps ( void )
{
char * * queues , * * params = NULL ;
int num_xrc_qps = 0 , num_pp_qps = 0 , num_srq_qps = 0 , qp = 0 ;
uint32_t max_qp_size , max_size_needed ;
int32_t min_freelist_size = 0 ;
2015-05-27 09:43:42 -06:00
int smallest_pp_qp = INT_MAX , ret = OPAL_ERROR ;
2008-05-20 21:53:42 +00:00
queues = opal_argv_split ( mca_btl_openib_component . receive_queues , ' : ' ) ;
if ( 0 = = opal_argv_count ( queues ) ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2008-05-20 21:53:42 +00:00
" no qps in receive_queues " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2008-05-20 21:53:42 +00:00
mca_btl_openib_component . receive_queues ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERROR ;
2008-05-20 21:53:42 +00:00
goto error ;
}
while ( queues [ qp ] ! = NULL ) {
if ( 0 = = strncmp ( " P, " , queues [ qp ] , 2 ) ) {
num_pp_qps + + ;
if ( smallest_pp_qp > qp ) {
smallest_pp_qp = qp ;
}
} else if ( 0 = = strncmp ( " S, " , queues [ qp ] , 2 ) ) {
num_srq_qps + + ;
} else if ( 0 = = strncmp ( " X, " , queues [ qp ] , 2 ) ) {
# if HAVE_XRC
num_xrc_qps + + ;
# else
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " No XRC support " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2008-05-20 21:53:42 +00:00
mca_btl_openib_component . receive_queues ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_NOT_AVAILABLE ;
2008-05-20 21:53:42 +00:00
goto error ;
# endif
} else {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2008-05-20 21:53:42 +00:00
" invalid qp type in receive_queues " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2008-05-20 21:53:42 +00:00
mca_btl_openib_component . receive_queues ,
queues [ qp ] ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2008-05-20 21:53:42 +00:00
goto error ;
}
qp + + ;
}
2015-05-27 09:43:42 -06:00
# if HAVE_XRC
2008-05-20 21:53:42 +00:00
/* The current XRC implementation can't be used with other QP types (PP
and SRQ) */
if ( num_xrc_qps > 0 & & ( num_pp_qps > 0 | | num_srq_qps > 0 ) ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " XRC with PP or SRQ " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2008-05-20 21:53:42 +00:00
mca_btl_openib_component . receive_queues ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2008-05-20 21:53:42 +00:00
goto error ;
}
/* The current XRC implementation can't be used with btls_per_lid > 1 */
if ( num_xrc_qps > 0 & & mca_btl_openib_component . btls_per_lid > 1 ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " XRC with BTLs per LID " ,
2014-10-03 14:19:48 -07:00
true , opal_process_info . nodename ,
2008-05-20 21:53:42 +00:00
mca_btl_openib_component . receive_queues , num_xrc_qps ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2008-05-20 21:53:42 +00:00
goto error ;
}
2015-05-27 09:43:42 -06:00
# endif
2008-05-20 21:53:42 +00:00
mca_btl_openib_component . num_pp_qps = num_pp_qps ;
mca_btl_openib_component . num_srq_qps = num_srq_qps ;
mca_btl_openib_component . num_xrc_qps = num_xrc_qps ;
mca_btl_openib_component . num_qps = num_pp_qps + num_srq_qps + num_xrc_qps ;
mca_btl_openib_component . qp_infos = ( mca_btl_openib_qp_info_t * )
malloc ( sizeof ( mca_btl_openib_qp_info_t ) *
mca_btl_openib_component . num_qps ) ;
2013-10-28 20:04:49 +00:00
if ( NULL = = mca_btl_openib_component . qp_infos ) {
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_OUT_OF_RESOURCE ;
2013-10-28 20:04:49 +00:00
goto error ;
}
2008-05-20 21:53:42 +00:00
qp = 0 ;
# define P(N) (((N) > count) ? NULL : params[(N)])
while ( queues [ qp ] ! = NULL ) {
int count ;
int32_t rd_low , rd_num ;
params = opal_argv_split_with_empty ( queues [ qp ] , ' , ' ) ;
count = opal_argv_count ( params ) ;
if ( ' P ' = = params [ 0 ] [ 0 ] ) {
int32_t rd_win , rd_rsv ;
if ( count < 3 | | count > 6 ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2008-05-20 21:53:42 +00:00
" invalid pp qp specification " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename , queues [ qp ] ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2008-05-20 21:53:42 +00:00
goto error ;
}
mca_btl_openib_component . qp_infos [ qp ] . type = MCA_BTL_OPENIB_PP_QP ;
mca_btl_openib_component . qp_infos [ qp ] . size = atoi_param ( P ( 1 ) , 0 ) ;
rd_num = atoi_param ( P ( 2 ) , 256 ) ;
/* by default set rd_low to be 3/4 of rd_num */
rd_low = atoi_param ( P ( 3 ) , rd_num - ( rd_num / 4 ) ) ;
rd_win = atoi_param ( P ( 4 ) , ( rd_num - rd_low ) * 2 ) ;
2015-05-27 09:43:42 -06:00
if ( 0 > = rd_win ) {
opal_show_help ( " help-mpi-btl-openib.txt " ,
" invalid pp qp specification " , true ,
opal_process_info . nodename , queues [ qp ] ) ;
ret = OPAL_ERR_BAD_PARAM ;
goto error ;
}
2008-05-20 21:53:42 +00:00
rd_rsv = atoi_param ( P ( 5 ) , ( rd_num * 2 ) / rd_win ) ;
BTL_VERBOSE ( ( " pp: rd_num is %d rd_low is %d rd_win %d rd_rsv %d " ,
rd_num , rd_low , rd_win , rd_rsv ) ) ;
/* Calculate the smallest freelist size that can be allowed */
if ( rd_num + rd_rsv > min_freelist_size ) {
min_freelist_size = rd_num + rd_rsv ;
}
mca_btl_openib_component . qp_infos [ qp ] . u . pp_qp . rd_win = rd_win ;
mca_btl_openib_component . qp_infos [ qp ] . u . pp_qp . rd_rsv = rd_rsv ;
if ( ( rd_num - rd_low ) > rd_win ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " non optimal rd_win " ,
2008-05-20 21:53:42 +00:00
true , rd_win , rd_num - rd_low ) ;
}
} else {
2009-12-15 15:52:10 +00:00
int32_t sd_max , rd_init , srq_limit ;
if ( count < 3 | | count > 7 ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2008-05-20 21:53:42 +00:00
" invalid srq specification " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename , queues [ qp ] ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2008-05-20 21:53:42 +00:00
goto error ;
}
mca_btl_openib_component . qp_infos [ qp ] . type = ( params [ 0 ] [ 0 ] = = ' X ' ) ?
MCA_BTL_OPENIB_XRC_QP : MCA_BTL_OPENIB_SRQ_QP ;
mca_btl_openib_component . qp_infos [ qp ] . size = atoi_param ( P ( 1 ) , 0 ) ;
rd_num = atoi_param ( P ( 2 ) , 256 ) ;
/* by default set rd_low to be 3/4 of rd_num */
rd_low = atoi_param ( P ( 3 ) , rd_num - ( rd_num / 4 ) ) ;
sd_max = atoi_param ( P ( 4 ) , rd_low / 4 ) ;
2009-12-15 15:52:10 +00:00
/* rd_init is initial value for rd_curr_num of all SRQs, 1/4 of rd_num by default */
rd_init = atoi_param ( P ( 5 ) , rd_num / 4 ) ;
/* by default set srq_limit to be 3/16 of rd_init (it's 1/4 of rd_low_local,
the value of rd_low_local we calculate in create_srq function ) */
srq_limit = atoi_param ( P ( 6 ) , ( rd_init - ( rd_init / 4 ) ) / 4 ) ;
/* If we set srq_limit less or greater than rd_init
( init value for rd_curr_num ) = > we receive the IBV_EVENT_SRQ_LIMIT_REACHED
event immediately and the value of rd_curr_num will be increased */
2009-12-16 14:05:35 +00:00
/* If we set srq_limit to zero, but size of SRQ greater than 1 => set it to be 1 */
if ( ( 0 = = srq_limit ) & & ( 1 < rd_num ) ) {
2009-12-15 15:52:10 +00:00
srq_limit = 1 ;
}
BTL_VERBOSE ( ( " srq: rd_num is %d rd_low is %d sd_max is %d rd_init is %d srq_limit is %d " ,
rd_num , rd_low , sd_max , rd_init , srq_limit ) ) ;
2008-05-20 21:53:42 +00:00
/* Calculate the smallest freelist size that can be allowed */
if ( rd_num > min_freelist_size ) {
min_freelist_size = rd_num ;
}
2009-12-15 15:52:10 +00:00
if ( rd_num < rd_init ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " rd_num must be >= rd_init " ,
2014-10-03 14:19:48 -07:00
true , opal_process_info . nodename , queues [ qp ] ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2009-12-15 15:52:10 +00:00
goto error ;
}
if ( rd_num < srq_limit ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " srq_limit must be > rd_num " ,
2014-10-03 14:19:48 -07:00
true , opal_process_info . nodename , queues [ qp ] ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2009-12-15 15:52:10 +00:00
goto error ;
}
2008-05-20 21:53:42 +00:00
mca_btl_openib_component . qp_infos [ qp ] . u . srq_qp . sd_max = sd_max ;
2009-12-15 15:52:10 +00:00
mca_btl_openib_component . qp_infos [ qp ] . u . srq_qp . rd_init = rd_init ;
mca_btl_openib_component . qp_infos [ qp ] . u . srq_qp . srq_limit = srq_limit ;
2008-05-20 21:53:42 +00:00
}
if ( rd_num < = rd_low ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " rd_num must be > rd_low " ,
2014-10-03 14:19:48 -07:00
true , opal_process_info . nodename , queues [ qp ] ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2008-05-20 21:53:42 +00:00
goto error ;
}
mca_btl_openib_component . qp_infos [ qp ] . rd_num = rd_num ;
mca_btl_openib_component . qp_infos [ qp ] . rd_low = rd_low ;
opal_argv_free ( params ) ;
qp + + ;
}
params = NULL ;
/* Sanity check some sizes */
max_qp_size = mca_btl_openib_component . qp_infos [ mca_btl_openib_component . num_qps - 1 ] . size ;
max_size_needed = ( mca_btl_openib_module . super . btl_eager_limit >
mca_btl_openib_module . super . btl_max_send_size ) ?
mca_btl_openib_module . super . btl_eager_limit :
mca_btl_openib_module . super . btl_max_send_size ;
2016-03-08 16:17:33 -07:00
if ( max_qp_size < max_size_needed ) {
mca_base_var_source_t eager_source = MCA_BASE_VAR_SOURCE_DEFAULT ;
mca_base_var_source_t max_send_source = MCA_BASE_VAR_SOURCE_DEFAULT ;
( void ) get_var_source ( " max_send_size " , & max_send_source ) ;
( void ) get_var_source ( " eager_limit " , & eager_source ) ;
/* the largest queue pair is too small for either the max send size or eager
* limit . check where we got the max_send_size and eager_limit and adjust if
* the user did not specify one or the other . */
if ( mca_btl_openib_module . super . btl_eager_limit > max_qp_size & &
MCA_BASE_VAR_SOURCE_DEFAULT = = eager_source ) {
mca_btl_openib_module . super . btl_eager_limit = max_qp_size ;
}
if ( mca_btl_openib_module . super . btl_max_send_size > max_qp_size & &
MCA_BASE_VAR_SOURCE_DEFAULT = = max_send_source ) {
mca_btl_openib_module . super . btl_max_send_size = max_qp_size ;
}
max_size_needed = ( mca_btl_openib_module . super . btl_eager_limit >
mca_btl_openib_module . super . btl_max_send_size ) ?
mca_btl_openib_module . super . btl_eager_limit :
mca_btl_openib_module . super . btl_max_send_size ;
}
2008-05-20 21:53:42 +00:00
if ( max_qp_size < max_size_needed ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2008-05-20 21:53:42 +00:00
" biggest qp size is too small " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename , max_qp_size ,
2008-05-20 21:53:42 +00:00
max_size_needed ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2008-05-20 21:53:42 +00:00
goto error ;
} else if ( max_qp_size > max_size_needed ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2008-05-20 21:53:42 +00:00
" biggest qp size is too big " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename , max_qp_size ,
2008-05-20 21:53:42 +00:00
max_size_needed ) ;
}
if ( mca_btl_openib_component . ib_free_list_max > 0 & &
min_freelist_size > mca_btl_openib_component . ib_free_list_max ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " freelist too small " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2008-05-20 21:53:42 +00:00
mca_btl_openib_component . ib_free_list_max ,
min_freelist_size ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_BAD_PARAM ;
2008-05-20 21:53:42 +00:00
goto error ;
}
mca_btl_openib_component . rdma_qp = mca_btl_openib_component . num_qps - 1 ;
2015-05-27 09:43:42 -06:00
if ( mca_btl_openib_component . num_qps > smallest_pp_qp ) {
mca_btl_openib_component . credits_qp = smallest_pp_qp ;
} else {
mca_btl_openib_component . credits_qp = mca_btl_openib_component . num_qps - 1 ;
}
2008-05-20 21:53:42 +00:00
2014-07-26 00:47:28 +00:00
ret = OPAL_SUCCESS ;
2008-05-20 21:53:42 +00:00
error :
if ( NULL ! = params ) {
opal_argv_free ( params ) ;
}
if ( NULL ! = queues ) {
opal_argv_free ( queues ) ;
}
return ret ;
}
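/* Illustrative sketch only, not part of the component: the default rules
 * used above for a per-peer ("P") queue specification, restated as a small
 * standalone parser. The names pp_spec, field_or_default and parse_pp_spec
 * are hypothetical, and treating an empty field the same as a missing one
 * is an assumption about how atoi_param behaves. */

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parsed values; not an Open MPI type. */
struct pp_spec {
    int size, rd_num, rd_low, rd_win, rd_rsv;
};

/* Return the integer in field `idx` of a comma-separated spec, or `dflt`
 * when that field is missing or empty. */
static int field_or_default(const char *spec, int idx, int dflt)
{
    const char *p = spec;

    assert(NULL != spec && idx >= 0);
    while (idx > 0 && NULL != p) {
        p = strchr(p, ',');
        if (NULL != p) {
            ++p;            /* step past the comma to the next field */
        }
        --idx;
    }
    if (NULL == p || '\0' == *p || ',' == *p) {
        return dflt;
    }
    return atoi(p);
}

/* Apply the same default chain as the per-peer branch above.
 * Returns 0 on success, -1 if rd_win is not positive. */
static int parse_pp_spec(const char *spec, struct pp_spec *out)
{
    out->size   = field_or_default(spec, 1, 0);
    out->rd_num = field_or_default(spec, 2, 256);
    /* rd_low defaults to 3/4 of rd_num */
    out->rd_low = field_or_default(spec, 3, out->rd_num - out->rd_num / 4);
    out->rd_win = field_or_default(spec, 4, (out->rd_num - out->rd_low) * 2);
    if (out->rd_win <= 0) {
        return -1;          /* would divide by zero below */
    }
    out->rd_rsv = field_or_default(spec, 5, (out->rd_num * 2) / out->rd_win);
    return 0;
}
```

/* With these rules, "P,128,256" yields rd_low = 192, rd_win = 128 and
 * rd_rsv = 4, so the smallest usable free list for that queue pair is
 * rd_num + rd_rsv = 260 entries. */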
2014-12-23 22:45:56 +02:00
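/* Read a single integer from a Linux module parameter file. Returns the default (value) if the file cannot be read or parsed; results above max are clamped to max. */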
2015-05-27 09:43:42 -06:00
static uint64_t read_module_param ( char * file , uint64_t value , uint64_t max )
2014-12-23 22:45:56 +02:00
{
int fd = open ( file , O_RDONLY ) ;
char buffer [ 64 ] ;
uint64_t ret ;
2015-05-27 09:43:42 -06:00
int rc ;
2014-12-23 22:45:56 +02:00
if ( 0 > fd ) {
return value ;
}
2015-05-27 09:43:42 -06:00
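rc = read ( fd , buffer , sizeof ( buffer ) - 1 ) ;
2014-12-23 22:45:56 +02:00
close ( fd ) ;
2015-05-27 09:43:42 -06:00
if ( 0 > = rc ) {
/* nothing was read, or the read failed */
return value ;
}
/* read() does not NUL-terminate; strtoull() needs a terminated string */
buffer [ rc ] = ' \0 ' ;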
2014-12-23 22:45:56 +02:00
errno = 0 ;
ret = strtoull ( buffer , NULL , 10 ) ;
2015-05-27 09:43:42 -06:00
if ( ret > max ) {
/* NTH: probably should report a bogus value */
ret = max ;
}
2014-12-23 22:45:56 +02:00
return ( 0 = = errno ) ? ret : value ;
}
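/* Illustrative sketch only: the parse-and-clamp step of read_module_param,
 * separated from the file I/O so it can be exercised on plain strings.
 * parse_clamped_u64 is a hypothetical helper, and rejecting input with no
 * digits at all (via the end pointer) is an extra check that the function
 * above does not perform. */

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a base-10 unsigned integer from `text`.
 * Returns `dflt` if no digits were found or the conversion overflowed,
 * and clamps any parsed value above `max` down to `max`. */
static uint64_t parse_clamped_u64(const char *text, uint64_t dflt, uint64_t max)
{
    char *end = NULL;
    unsigned long long v;

    assert(NULL != text);
    errno = 0;
    v = strtoull(text, &end, 10);
    if (0 != errno || end == text) {
        return dflt;        /* overflow, or nothing was converted */
    }
    return (v > max) ? max : (uint64_t) v;
}
```

/* A sysfs parameter file usually holds a short decimal string followed by a
 * newline, for example "20\n"; strtoull stops at the newline, so no
 * trimming is needed before the call. */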
/* calculate memory registration limits */
static uint64_t calculate_total_mem ( void )
{
hwloc_obj_t machine ;
machine = hwloc_get_next_obj_by_type ( opal_hwloc_topology , HWLOC_OBJ_MACHINE , NULL ) ;
if ( NULL = = machine ) {
return 0 ;
}
return machine - > memory . total_memory ;
}
static uint64_t calculate_max_reg ( const char * device_name )
{
struct stat statinfo ;
uint64_t mtts_per_seg = 1 ;
uint64_t num_mtt = 1 < < 19 ;
uint64_t reserved_mtt = 0 ;
uint64_t max_reg , mem_total ;
mem_total = calculate_total_mem ( ) ;
/* On older OFED (< 2.0), this parameter may need to be turned off */
if ( mca_btl_openib_component . allow_max_memory_registration ) {
max_reg = 2 * mem_total ;
/* Limit us to 87.5% of the registered memory (some fluff for QPs,
file systems , etc ) */
return ( max_reg * 7 ) > > 3 ;
}
2015-04-28 14:34:05 -07:00
/* Default to being able to register everything (to ensure that
max_reg is initialized in all cases ) */
max_reg = mem_total ;
2014-12-23 22:45:56 +02:00
if ( ! strncmp ( device_name , " mlx5 " , 4 ) ) {
max_reg = 2 * mem_total ;
} else if ( ! strncmp ( device_name , " mlx4 " , 4 ) ) {
if ( 0 = = stat ( " /sys/module/mlx4_core/parameters/log_num_mtt " , & statinfo ) ) {
2015-05-28 11:26:13 -06:00
mtts_per_seg = 1ull < < read_module_param ( " /sys/module/mlx4_core/parameters/log_mtts_per_seg " , 1 , 63 ) ;
num_mtt = 1ull < < read_module_param ( " /sys/module/mlx4_core/parameters/log_num_mtt " , 1 , 63 ) ;
2014-12-23 22:45:56 +02:00
if ( 1 = = num_mtt ) {
/* NTH: is 19 a minimum? when log_num_mtt is set to 0 use 19 */
num_mtt = 1 < < 19 ;
max_reg = ( num_mtt - reserved_mtt ) * opal_getpagesize ( ) * mtts_per_seg ;
} else {
max_reg = ( num_mtt - reserved_mtt ) * opal_getpagesize ( ) * mtts_per_seg ;
}
}
} else if ( ! strncmp ( device_name , " mthca " , 5 ) ) {
if ( 0 = = stat ( " /sys/module/ib_mthca/parameters/num_mtt " , & statinfo ) ) {
2015-05-28 11:26:13 -06:00
mtts_per_seg = 1ull < < read_module_param ( " /sys/module/ib_mthca/parameters/log_mtts_per_seg " , 1 , 63 ) ;
2015-05-27 09:43:42 -06:00
num_mtt = read_module_param ( " /sys/module/ib_mthca/parameters/num_mtt " , 1 < < 20 , ( uint64_t ) - 1 ) ;
reserved_mtt = read_module_param ( " /sys/module/ib_mthca/parameters/fmr_reserved_mtts " , 0 , ( uint64_t ) - 1 ) ;
2014-12-23 22:45:56 +02:00
max_reg = ( num_mtt - reserved_mtt ) * opal_getpagesize ( ) * mtts_per_seg ;
} else {
max_reg = mem_total ;
}
} else {
/* Need to update to determine the registration limit for this
configuration */
max_reg = mem_total ;
}
/* Print a warning if we can't register more than 75% of physical
memory . Abort if the abort_not_enough_reg_mem MCA param was
set . */
if ( max_reg < mem_total * 3 / 4 ) {
char * action ;
if ( mca_btl_openib_component . abort_not_enough_reg_mem ) {
action = " Your MPI job will now abort. " ;
} else {
action = " Your MPI job will continue, but may behave poorly and/or hang. " ;
}
opal_show_help ( " help-mpi-btl-openib.txt " , " reg mem limit low " , true ,
opal_process_info . nodename , ( unsigned long ) ( max_reg > > 20 ) ,
( unsigned long ) ( mem_total > > 20 ) , action ) ;
return 0 ; /* signal that we can't have enough memory */
}
/* Limit us to 87.5% of the registered memory (some fluff for QPs,
file systems , etc ) */
return ( max_reg * 7 ) > > 3 ;
}
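/* Illustrative sketch only: the registration-limit arithmetic from
 * calculate_max_reg, worked through as standalone helpers. The names
 * mtt_reg_limit and usable_reg_limit are hypothetical. Unlike the code
 * above, this sketch divides before multiplying (mem_total / 4 * 3 and
 * max_reg / 8 * 7) to stay clear of 64-bit overflow, so results can differ
 * slightly when the inputs are not multiples of 4 or 8. */

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that can be registered given the MTT module parameters:
 * (num_mtt - reserved_mtt) * page_size * mtts_per_seg. */
static uint64_t mtt_reg_limit(unsigned log_num_mtt, unsigned log_mtts_per_seg,
                              uint64_t reserved_mtt, uint64_t page_size)
{
    uint64_t num_mtt, mtts_per_seg;

    assert(log_num_mtt < 64 && log_mtts_per_seg < 64);
    num_mtt = UINT64_C(1) << log_num_mtt;
    mtts_per_seg = UINT64_C(1) << log_mtts_per_seg;
    assert(reserved_mtt <= num_mtt);
    return (num_mtt - reserved_mtt) * page_size * mtts_per_seg;
}

/* Return 0 when less than 75% of physical memory can be registered,
 * otherwise 87.5% of the limit (headroom for QPs, file systems, etc.). */
static uint64_t usable_reg_limit(uint64_t max_reg, uint64_t mem_total)
{
    if (max_reg < mem_total / 4 * 3) {
        return 0;
    }
    return max_reg / 8 * 7;
}
```

/* Example: log_num_mtt = 20, log_mtts_per_seg = 3 and 4096-byte pages give
 * 2^20 * 4096 * 2^3 = 2^35 bytes (32 GiB). On a 64 GiB machine that is
 * below the 48 GiB (75%) threshold, so the warning path would be taken. */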
2008-07-23 00:28:59 +00:00
static int init_one_device ( opal_list_t * btl_list , struct ibv_device * ib_dev )
2006-06-28 07:23:08 +00:00
{
2015-11-02 12:07:08 -07:00
mca_rcache_base_resources_t rcache_resources ;
2008-07-23 00:28:59 +00:00
mca_btl_openib_device_t * device ;
2007-06-14 01:59:25 +00:00
uint8_t i , k = 0 ;
int ret = - 1 , port_cnt ;
2014-07-26 00:47:28 +00:00
opal_btl_openib_ini_values_t values , default_values ;
2007-06-14 13:59:28 +00:00
int * allowed_ports = NULL ;
2008-06-24 17:18:07 +00:00
bool need_search ;
2012-09-12 20:47:47 +00:00
struct ibv_context * dev_context = NULL ;
/* Open up the device */
dev_context = ibv_open_device ( ib_dev ) ;
if ( NULL = = dev_context ) {
2015-04-25 05:49:32 -07:00
return OPAL_ERR_NOT_SUPPORTED ;
2012-09-12 20:47:47 +00:00
}
/* Find out if this device supports RC QPs */
2015-06-23 20:59:57 -07:00
if ( OPAL_SUCCESS ! = opal_common_verbs_qp_test ( dev_context ,
2014-07-26 00:47:28 +00:00
                                                  OPAL_COMMON_VERBS_FLAGS_RC)) {
2012-09-12 20:47:47 +00:00
        ibv_close_device(dev_context);
        BTL_VERBOSE(("openib: RC QPs not supported -- skipping %s",
                     ibv_get_device_name(ib_dev)));
2013-07-19 19:05:18 +00:00
        ++num_devices_intentionally_ignored;
2014-07-26 00:47:28 +00:00
        return OPAL_ERR_NOT_SUPPORTED;
2012-09-12 20:47:47 +00:00
    }
2008-01-21 12:11:18 +00:00
2008-07-23 00:28:59 +00:00
    device = OBJ_NEW(mca_btl_openib_device_t);
    if (NULL == device) {
2008-05-02 11:52:33 +00:00
        BTL_ERROR(("Failed malloc: %s:%d", __FILE__, __LINE__));
2012-09-12 20:47:47 +00:00
        ibv_close_device(dev_context);
2014-07-26 00:47:28 +00:00
        return OPAL_ERR_OUT_OF_RESOURCE;
2006-06-28 07:23:08 +00:00
    }
2008-01-21 12:11:18 +00:00
2012-07-19 17:52:21 +00:00
    device->mem_reg_active = 0;
2015-12-17 13:45:19 +06:00
    device->mem_reg_max_total = calculate_max_reg(ibv_get_device_name(ib_dev));
    device->mem_reg_max = device->mem_reg_max_total;
2014-12-23 22:45:56 +02:00
    if ((0 == device->mem_reg_max) && mca_btl_openib_component.abort_not_enough_reg_mem) {
        return OPAL_ERROR;
    }
2012-07-19 17:52:21 +00:00
2008-07-23 00:28:59 +00:00
    device->ib_dev = ib_dev;
2012-09-12 20:47:47 +00:00
    device->ib_dev_context = dev_context;
2008-07-23 00:28:59 +00:00
    device->ib_pd = NULL;
    device->device_btls = OBJ_NEW(opal_pointer_array_t);
    if (OPAL_SUCCESS != opal_pointer_array_init(device->device_btls, 2, INT_MAX, 2)) {
        BTL_ERROR(("Failed to initialize device_btls array: %s:%d", __FILE__, __LINE__));
2014-07-26 00:47:28 +00:00
        return OPAL_ERR_OUT_OF_RESOURCE;
2008-02-20 13:44:05 +00:00
    }
2007-12-23 12:29:34 +00:00
2008-07-23 00:28:59 +00:00
    if (NULL == device->ib_dev_context) {
2008-05-02 11:52:33 +00:00
        BTL_ERROR(("error obtaining device context for %s errno says %s",
2008-07-23 00:28:59 +00:00
                   ibv_get_device_name(device->ib_dev), strerror(errno)));
2008-01-09 10:26:21 +00:00
        goto error;
2008-01-21 12:11:18 +00:00
    }
2015-11-30 12:32:04 -07:00
#if HAVE_DECL_IBV_EXP_QUERY_DEVICE
2015-12-08 10:34:16 -07:00
    device->ib_exp_dev_attr.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1;
2015-11-30 22:21:26 -07:00
    if (ibv_exp_query_device(device->ib_dev_context, &device->ib_exp_dev_attr)) {
2015-11-30 12:32:04 -07:00
        BTL_ERROR(("error obtaining device attributes for %s errno says %s",
                   ibv_get_device_name(device->ib_dev), strerror(errno)));
        goto error;
    }
2015-11-30 22:21:26 -07:00
#endif
2008-07-23 00:28:59 +00:00
    if (ibv_query_device(device->ib_dev_context, &device->ib_dev_attr)) {
2008-05-02 11:52:33 +00:00
        BTL_ERROR(("error obtaining device attributes for %s errno says %s",
2008-07-23 00:28:59 +00:00
                   ibv_get_device_name(device->ib_dev), strerror(errno)));
2008-01-09 10:26:21 +00:00
        goto error;
2006-06-28 07:23:08 +00:00
    }
2007-06-14 01:59:25 +00:00
    /* If mca_btl_if_include/exclude were specified, get usable ports */
2008-07-23 00:28:59 +00:00
    allowed_ports = (int *) malloc(device->ib_dev_attr.phys_port_cnt * sizeof(int));
2013-10-28 20:04:49 +00:00
    if (NULL == allowed_ports) {
2014-07-26 00:47:28 +00:00
        ret = OPAL_ERR_OUT_OF_RESOURCE;
2013-10-28 20:04:49 +00:00
        goto error;
    }
2008-07-23 00:28:59 +00:00
    port_cnt = get_port_list(device, allowed_ports);
2008-05-20 21:53:42 +00:00
    if (0 == port_cnt) {
2014-07-26 00:47:28 +00:00
        ret = OPAL_SUCCESS;
2013-06-05 12:12:09 +00:00
        ++num_devices_intentionally_ignored;
2008-05-18 19:11:45 +00:00
        goto error;
    }
2008-05-20 21:53:42 +00:00
2008-07-23 00:28:59 +00:00
    /* Load in vendor/part-specific device parameters. Note that even if
2006-08-30 20:21:47 +00:00
       we don't find values for this vendor/part, "values" will be set
       indicating that it does not have good values */
2014-07-26 00:47:28 +00:00
    ret = opal_btl_openib_ini_query(device->ib_dev_attr.vendor_id,
2008-07-23 00:28:59 +00:00
                                    device->ib_dev_attr.vendor_part_id,
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
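The MTU handling summarized in the commit message above (an INI-supplied byte count is mapped to a verbs MTU code, and two peers then settle on the lesser of their MTUs) can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the component's implementation: `sketch_mtu`, `sketch_mtu_from_bytes`, and `sketch_negotiate_mtu` are hypothetical names, and the enum is a stand-in for `enum ibv_mtu` from `<infiniband/verbs.h>` so that the example compiles without libibverbs.

```c
#include <stddef.h>

/* Hypothetical stand-in for enum ibv_mtu. The only property the sketch
   relies on is that a larger MTU has a larger enum value. */
enum sketch_mtu {
    SKETCH_MTU_256  = 1,
    SKETCH_MTU_512  = 2,
    SKETCH_MTU_1024 = 3,
    SKETCH_MTU_2048 = 4,
    SKETCH_MTU_4096 = 5
};

/* Map an MTU given in bytes (as it would appear in an INI file) to an
   enum code. Returns 0 and stores the code in *out on success; returns
   -1 and leaves *out untouched if the byte count is not a valid MTU. */
static int sketch_mtu_from_bytes(int bytes, enum sketch_mtu *out)
{
    if (NULL == out) {
        return -1;
    }
    switch (bytes) {
    case 256:  *out = SKETCH_MTU_256;  return 0;
    case 512:  *out = SKETCH_MTU_512;  return 0;
    case 1024: *out = SKETCH_MTU_1024; return 0;
    case 2048: *out = SKETCH_MTU_2048; return 0;
    case 4096: *out = SKETCH_MTU_4096; return 0;
    default:   return -1;
    }
}

/* "Exchange maximum MTU; select the lesser of the two": each peer
   advertises its own MTU and both end up using the smaller one. */
static enum sketch_mtu sketch_negotiate_mtu(enum sketch_mtu local,
                                            enum sketch_mtu remote)
{
    return (local < remote) ? local : remote;
}
```

A caller that gets -1 from `sketch_mtu_from_bytes` would be expected to fall back to a default, as the `switch` on `values.mtu` in this function does with `IBV_MTU_1024`.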
2006-08-14 19:30:37 +00:00
                                    &values);
2014-07-26 00:47:28 +00:00
    if (OPAL_SUCCESS != ret &&
        OPAL_ERR_NOT_FOUND != ret) {
2006-08-30 20:21:47 +00:00
        /* If we get a serious error, propagate it upwards */
2008-01-09 10:26:21 +00:00
        goto error;
2006-08-30 20:21:47 +00:00
    }
2014-07-26 00:47:28 +00:00
    if (OPAL_ERR_NOT_FOUND == ret) {
2008-07-23 00:28:59 +00:00
        /* If we didn't find a matching device in the INI files, output a
2006-08-14 19:30:37 +00:00
           warning that we're using default values (unless overridden
           that we don't want to see these warnings) */
2008-07-23 00:28:59 +00:00
        if (mca_btl_openib_component.warn_no_device_params_found) {
2013-02-12 21:10:11 +00:00
            opal_show_help("help-mpi-btl-openib.txt",
2008-07-23 00:28:59 +00:00
                           "no device params found", true,
2014-10-03 14:19:48 -07:00
                           opal_process_info.nodename,
2008-07-23 00:28:59 +00:00
                           ibv_get_device_name(device->ib_dev),
                           device->ib_dev_attr.vendor_id,
                           device->ib_dev_attr.vendor_part_id);
2006-08-14 19:30:37 +00:00
        }
2006-08-30 20:21:47 +00:00
    }
2013-06-05 12:12:09 +00:00
    /* If we're supposed to ignore devices of this vendor/part ID,
       then do so */
    if (values.ignore_device_set && values.ignore_device) {
        BTL_VERBOSE(("device %s skipped; ignore_device=1",
                     ibv_get_device_name(device->ib_dev)));
2014-07-26 00:47:28 +00:00
        ret = OPAL_SUCCESS;
2013-06-05 12:12:09 +00:00
        ++num_devices_intentionally_ignored;
        goto error;
    }
2006-08-30 20:21:47 +00:00
    /* Note that even if we don't find default values, "values" will
       be set indicating that it does not have good values */
2014-07-26 00:47:28 +00:00
    ret = opal_btl_openib_ini_query(0, 0, &default_values);
    if (OPAL_SUCCESS != ret &&
        OPAL_ERR_NOT_FOUND != ret) {
2006-08-30 20:21:47 +00:00
        /* If we get a serious error, propagate it upwards */
2008-01-09 10:26:21 +00:00
        goto error;
2006-08-30 20:21:47 +00:00
    }
2008-07-23 00:28:59 +00:00
    /* If we did find values for this device (or in the defaults
2006-08-30 20:21:47 +00:00
       section), handle them */
    merge_values(&values, &default_values);
2015-09-02 02:22:30 -07:00
    /* If MCA param was set, use it. If not, check the INI file
       or default to IBV_MTU_1024 */
    if (0 < mca_btl_openib_component.ib_mtu) {
        device->mtu = mca_btl_openib_component.ib_mtu;
    } else if (values.mtu_set) {
2006-08-30 20:21:47 +00:00
        switch (values.mtu) {
        case 256:
2008-07-23 00:28:59 +00:00
            device->mtu = IBV_MTU_256;
2006-08-30 20:21:47 +00:00
            break;
        case 512:
2008-07-23 00:28:59 +00:00
            device->mtu = IBV_MTU_512;
2006-08-30 20:21:47 +00:00
            break;
        case 1024:
2008-07-23 00:28:59 +00:00
            device->mtu = IBV_MTU_1024;
2006-08-30 20:21:47 +00:00
            break;
        case 2048:
2008-07-23 00:28:59 +00:00
            device->mtu = IBV_MTU_2048;
2006-08-30 20:21:47 +00:00
            break;
        case 4096:
2008-07-23 00:28:59 +00:00
            device->mtu = IBV_MTU_4096;
2006-08-30 20:21:47 +00:00
            break;
        default:
2008-05-02 11:52:33 +00:00
            BTL_ERROR(("invalid MTU value specified in INI file (%d); ignored", values.mtu));
2015-09-02 02:22:30 -07:00
            device->mtu = IBV_MTU_1024;
2006-08-30 20:21:47 +00:00
            break;
2006-08-14 19:30:37 +00:00
        }
    } else {
        device->mtu = IBV_MTU_1024;
    }

    /* Allocate the protection domain for the device */
    device->ib_pd = ibv_alloc_pd(device->ib_dev_context);
    if (NULL == device->ib_pd) {
        BTL_ERROR(("error allocating protection domain for %s errno says %s",
                   ibv_get_device_name(device->ib_dev), strerror(errno)));
        goto error;
    }

    /* Figure out what the max_inline_data value should be for all
       ports and QPs on this device */
    need_search = false;
    if (-2 != mca_btl_openib_component.ib_max_inline_data) {
        /* User has explicitly set btl_openib_max_inline_data MCA parameter
           Per setup in _mca.c, we know that the MCA param value is guaranteed
           to be >= -1 */
        if (-1 == mca_btl_openib_component.ib_max_inline_data) {
            need_search = true;
        } else {
            device->max_inline_data = (uint32_t)
                mca_btl_openib_component.ib_max_inline_data;
        }
    } else if (values.max_inline_data_set) {
        if (-1 == values.max_inline_data) {
            need_search = true;
        } else if (values.max_inline_data >= 0) {
            device->max_inline_data = (uint32_t) values.max_inline_data;
        } else {
            if (default_values.max_inline_data_set &&
                default_values.max_inline_data >= -1) {
                BTL_ERROR(("Invalid max_inline_data value specified "
                           "in INI file (%d); using default value (%d)",
                           values.max_inline_data,
                           default_values.max_inline_data));
                device->max_inline_data = (uint32_t)
                    default_values.max_inline_data;
            } else {
                BTL_ERROR(("Invalid max_inline_data value specified "
                           "in INI file (%d)", values.max_inline_data));
                ret = OPAL_ERR_BAD_PARAM;
                goto error;
            }
        }
    }
    /* If we don't have a set max inline data size, search for it */
    if (need_search) {
        opal_common_verbs_find_max_inline(device->ib_dev,
                                          device->ib_dev_context,
                                          device->ib_pd,
                                          &device->max_inline_data);
    }

    /* Should we use RDMA for short / eager messages?  First check MCA
       param, then check INI file values. */
    if (mca_btl_openib_component.use_eager_rdma >= 0) {
        device->use_eager_rdma = mca_btl_openib_component.use_eager_rdma;
    } else if (values.use_eager_rdma_set) {
        device->use_eager_rdma = values.use_eager_rdma;
    }
    /* Eager RDMA is not currently supported with progress threads */
    if (device->use_eager_rdma && OPAL_ENABLE_PROGRESS_THREADS) {
        device->use_eager_rdma = 0;
        opal_show_help("help-mpi-btl-openib.txt",
                       "eager RDMA and progress threads", true);
    }

    asprintf(&rcache_resources.cache_name, "verbs.%" PRIu64, device->ib_dev_attr.node_guid);
    rcache_resources.reg_data = (void *) device;
    rcache_resources.sizeof_reg = sizeof(mca_btl_openib_reg_t);
    rcache_resources.register_mem = openib_reg_mr;
    rcache_resources.deregister_mem = openib_dereg_mr;
    device->rcache =
        mca_rcache_base_module_create(mca_btl_openib_component.ib_rcache_name,
                                      device, &rcache_resources);
    if (NULL == device->rcache) {
        /* Don't print an error message here -- we'll get one from
           mpool_create anyway */
        goto error;
    }

    device->mpool = mca_mpool_base_module_lookup(mca_btl_openib_component.ib_mpool_hints);
    if (NULL == device->mpool) {
        goto error;
    }

#if OPAL_ENABLE_PROGRESS_THREADS
    device->ib_channel = ibv_create_comp_channel(device->ib_dev_context);
    if (NULL == device->ib_channel) {
        BTL_ERROR(("error creating channel for %s errno says %s",
                   ibv_get_device_name(device->ib_dev),
                   strerror(errno)));
        goto error;
    }
#endif

    ret = OPAL_SUCCESS;

    /* Note ports are 1 based (i >= 1) */
    for (k = 0; k < port_cnt; k++) {
        struct ibv_port_attr ib_port_attr;
        i = allowed_ports[k];
        if (ibv_query_port(device->ib_dev_context, i, &ib_port_attr)) {
            BTL_ERROR(("error getting port attributes for device %s "
                       "port number %d errno says %s",
                       ibv_get_device_name(device->ib_dev), i, strerror(errno)));
            break;
        }
        if (IBV_PORT_ACTIVE == ib_port_attr.state) {
            /* Select the lower of the HCA MTU and the port's active MTU.
               With QLogic HCAs that are capable of 4K MTU we had an issue
               when connected to switches with 2K MTU.  This fix is valid
               for other IB vendors as well. */
            if (ib_port_attr.active_mtu < device->mtu) {
                device->mtu = ib_port_attr.active_mtu;
            }
            if (mca_btl_openib_component.apm_ports && device->btls > 0) {
                init_apm_port(device, i, ib_port_attr.lid);
                break;
            }
            if (0 == mca_btl_openib_component.ib_pkey_val) {
                ret = init_one_port(btl_list, device, i, 0, &ib_port_attr);
            } else {
                uint16_t pkey, j;
                for (j = 0; j < device->ib_dev_attr.max_pkeys; j++) {
                    if (ibv_query_pkey(device->ib_dev_context, i, j, &pkey)) {
                        BTL_ERROR(("error getting pkey for index %d, device %s "
                                   "port number %d errno says %s",
                                   j, ibv_get_device_name(device->ib_dev), i, strerror(errno)));
                    }
                    pkey = ntohs(pkey) & MCA_BTL_IB_PKEY_MASK;
                    if (pkey == mca_btl_openib_component.ib_pkey_val) {
                        ret = init_one_port(btl_list, device, i, j, &ib_port_attr);
                        break;
                    }
                }
            }
            if (OPAL_SUCCESS != ret) {
                /* Out of bounds error indicates that we hit max btl number
                 * don't propagate the error to the caller */
                if (OPAL_ERR_VALUE_OUT_OF_BOUNDS == ret) {
                    ret = OPAL_SUCCESS;
                }
                break;
            }
        }
    }
    free(allowed_ports);
    allowed_ports = NULL;

    /* If we made a BTL, check APM status and return.  Otherwise, fall
       through and destroy everything */
    if (device->btls > 0) {
        /* if apm was enabled it should be > 1 */
        if (1 == mca_btl_openib_component.apm_ports) {
            opal_show_help("help-mpi-btl-openib.txt",
                           "apm not enough ports", true);
            mca_btl_openib_component.apm_ports = 0;
        }

/* Check to ensure that all devices used in this process have
compatible receive_queues values ( we check elsewhere to see
if all devices used in other processes in this job have
compatible receive_queues values ) .
Not only is the check complex , but the reasons behind what
it does ( and does not do ) are complex . Before explaining
the code below , here ' s some notes :
1. The openib BTL component only supports 1 value of the
receive_queues between all of its modules .
- - > This could be changed to allow every module to have
its own receive_queues . But that would be a big
deal ; no one has time to code this up right now .
2. The receive_queues value can be specified either as an
MCA parameter or in the INI file . Specifying the value
as an MCA parameter overrides all INI file values
( meaning : that MCA param value will be used for all
openib BTL modules in the process ) .
Effectively , the first device through init_one_device ( )
gets to decide what the receive_queues will be for the all
modules in this process . This is an unfortunate artifact
of the openib BTL startup sequence ( see below for more
details ) . The first device will choose the receive_queues
2011-07-04 14:00:41 +00:00
value from : ( in priority order ) :
2010-02-10 16:53:26 +00:00
1. If the btl_openib_receive_queues MCA param was
specified , use that .
2. If this device has a receive_queues value specified in
the INI file , use that .
3. Otherwise , use the default MCA param value for
btl_openib_receive_queues .
If any successive device has a different value specified in
the INI file , we show_help and return up the stack that
this device failed .
In the case that the user does not specify a
mca_btl_openib_receive_queues value , the short description
of what is allowed is that either a ) no devices specify a
receive_queues value in the INI file ( in which case we use
the default MCA param value ) , b ) all devices specify the
same receive_queues value in the INI value , or c ) some / all
devices specify the same receive_queues value in the INI
value as the default MCA param value .
Let ' s take some sample cases to explain this more clearly . . .
           THESE ARE THE "GOOD" CASES
           --------------------------

           Case 1: no INI values
           - MCA parameter: not specified
           - default receive_queues: value A
           - device 0: no receive_queues in INI file
           - device 1: no receive_queues in INI file
           - device 2: no receive_queues in INI file
           --> use receive_queues value A with all devices

           Case 2: all INI values the same (same as default)
           - MCA parameter: not specified
           - default receive_queues: value A
           - device 0: receive_queues value A in the INI file
           - device 1: receive_queues value A in the INI file
           - device 2: receive_queues value A in the INI file
           --> use receive_queues value A with all devices

           Case 3: all INI values the same (but different than default)
           - MCA parameter: not specified
           - default receive_queues: value A
           - device 0: receive_queues value B in the INI file
           - device 1: receive_queues value B in the INI file
           - device 2: receive_queues value B in the INI file
           --> use receive_queues value B with all devices

           Case 4: some INI unspecified, but rest same as default
           - MCA parameter: not specified
           - default receive_queues: value A
           - device 0: receive_queues value A in the INI file
           - device 1: no receive_queues in INI file
           - device 2: receive_queues value A in the INI file
           --> use receive_queues value A with all devices

           Case 5: some INI unspecified (including device 0), but rest same as default
           - MCA parameter: not specified
           - default receive_queues: value A
           - device 0: no receive_queues in INI file
           - device 1: no receive_queues in INI file
           - device 2: receive_queues value A in the INI file
           --> use receive_queues value A with all devices

           Case 6: different default/INI values, but MCA param is specified
           - MCA parameter: value D
           - default receive_queues: value A
           - device 0: no receive_queues in INI file
           - device 1: receive_queues value B in INI file
           - device 2: receive_queues value C in INI file
           --> use receive_queues value D with all devices

           What this means is that this selection process is
           unfortunately tied to the order of devices.  :-( Device 0
           effectively sets what the receive_queues value will be for
           that process.  If any later device disagrees, that's
           problematic and we have to error/abort.

           ALL REMAINING CASES WILL FAIL
           -----------------------------

           Case 7: one INI value (different than default)
           - MCA parameter: not specified
           - default receive_queues: value A
           - device 0: receive_queues value B in INI file
           - device 1: no receive_queues in INI file
           - device 2: no receive_queues in INI file
           --> Jeff thinks that it would be great to use
               receive_queues value B with all devices.  However, it
               shares one of the problems cited in case 8, below.  So
               we need to fail this scenario; print an error and
               abort.

           Case 8: one INI value, different than default
           - MCA parameter: not specified
           - default receive_queues: value A
           - device 0: no receive_queues in INI file
           - device 1: receive_queues value B in INI file
           - device 2: no receive_queues in INI file
           --> Jeff thinks that it would be great to use
               receive_queues value B with all devices.  However, it
               has (at least) 2 problems:

               1. The check for local receive_queue compatibility is
                  done here in init_one_device().  By the time we call
                  init_one_device() for device 1, we have already
                  called init_one_device() for device 0, meaning that
                  device 0's QPs have already been created and setup
                  using the MCA parameter's default receive_queues
                  value.  So if device 1 *changes* the
                  component.receive_queues value, then device 0 and
                  device 1 now have different receive_queue sets (more
                  specifically: the QPs setup for device 0 are now
                  effectively lost).  This is Bad.

                  It would be great if we didn't have this restriction
                  -- either by letting each module have its own
                  receive_queues value or by scanning all devices and
                  figuring out a final receive_queues value *before*
                  actually setting up any QPs.  But that's not the
                  current flow of the code (patches would be greatly
                  appreciated here, of course!).  Unfortunately, no
                  one has time to code this up right now, so we're
                  leaving this as explicitly documented for some
                  future implementer...

2. Conside a scenario with server 1 having HCA A / subnet
X , and server 2 having HCA B / subnet X and HCA
C / subnet Y . And let ' s assume :
Server 1 :
HCA A : no receive_queues in INI file
Server 2 :
HCA B : no receive_queues in INI file
HCA C : receive_queues specified in INI file
2011-07-04 14:00:41 +00:00
2010-02-10 16:53:26 +00:00
A will therefore use the default receive_queues
value . B and C will use C ' s INI receive_queues .
But note that modex [ currently ] only sends around
vendor / part IDs for OpenFabrics devices - - not the
actual receive_queues value ( it was felt that
including the final receive_queues string value in
the modex would dramatically increase the size of
the modex ) . So processes on server 1 will get the
vendor / part ID for HCA B , look it up in the INI
file , see that it has no receive_queues value
specified , and then assume that it uses the default
receive_queues value . Hence , procs on server 1 will
try to connect HCA A - - > HCA B with the wrong
receive_queues value . Bad . Further , the error
won ' t be discovered by checks like this because A
won ' t check C ' s receive_queues because C is on a
different subnet .
This could be fixed , of course ; either by a ) sending
the final receive_queues value in the modex ( perhaps
compressing or encoding it so that it can be much
shorter than the string - - the current vendor / part
ID stuff takes 8 bytes for each device ) , or b )
replicating the determination process of each host
in each process ( i . e . , procs on server 1 would see
both B and C , and use them both to figure out what
the " final " receive_queues value is for B ) .
Unfortunately , no one has time to code this up right
now , so we ' re leaving this as explicitly documented
for some future implementer . . .
Because of both of these problems , this case is
problematic and must fail with a show_help error .
Case 9 : two devices with same INI value ( different than default )
- MCA parameter : not specified
- default receive_queues : value A
- device 0 : no receive_queues in INI file
- device 1 : receive_queues value B in INI file
- device 2 : receive_queues value B in INI file
- - > per case 8 , fail with a show_help message .
2011-07-04 14:00:41 +00:00
2010-02-10 16:53:26 +00:00
Case 10 : two devices with different INI values
- MCA parameter : not specified
- default receive_queues : value A
- device 0 : no receive_queues in INI file
- device 1 : receive_queues value B in INI file
- device 2 : receive_queues value C in INI file
- - > per case 8 , fail with a show_help message .
*/
2014-07-31 16:30:08 +00:00
{
/* we need to read this MCA param at this point in case someone
* altered it via MPI_T */
mca_base_var_source_t source ;
2016-03-08 16:17:33 -07:00
if ( OPAL_SUCCESS ! = ( ret = get_var_source ( " receive_queues " , & source ) ) ) {
BTL_ERROR ( ( " mca_base_var_get_value failed to get value for receive_queues: %s:%d " ,
__FILE__ , __LINE__ ) ) ;
goto error ;
2014-07-31 16:30:08 +00:00
}
2016-03-08 16:17:33 -07:00
mca_btl_openib_component . receive_queues_source = source ;
2014-07-31 16:30:08 +00:00
}
2010-02-10 16:53:26 +00:00
/* If the MCA param was specified, skip all the checks */
2014-11-04 14:25:02 -07:00
if ( MCA_BASE_VAR_SOURCE_DEFAULT ! = mca_btl_openib_component . receive_queues_source ) {
2010-02-10 16:53:26 +00:00
goto good ;
}
/* If we're the first device and we have a receive_queues
value from the INI file * that is different than the
already - existing default value * , then set the component to
use that . */
if ( 0 = = mca_btl_openib_component . devices_count ) {
if ( NULL ! = values . receive_queues & &
0 ! = strcmp ( values . receive_queues ,
mca_btl_openib_component . receive_queues ) ) {
if ( NULL ! = mca_btl_openib_component . receive_queues ) {
free ( mca_btl_openib_component . receive_queues ) ;
}
mca_btl_openib_component . receive_queues =
strdup ( values . receive_queues ) ;
mca_btl_openib_component . receive_queues_source =
2014-11-04 14:25:02 -07:00
BTL_OPENIB_RQ_SOURCE_DEVICE_INI ;
2010-02-10 16:53:26 +00:00
}
}
/* If we're not the first device, then we have to conform to
either the default value if the first device didn ' t set
anything , or to whatever the first device decided . */
else {
/* In all cases, if this device has a receive_queues value
in the INI , then it must agree with
component . receive_queues . */
if ( NULL ! = values . receive_queues ) {
2011-07-04 14:00:41 +00:00
if ( 0 ! = strcmp ( values . receive_queues ,
2009-12-15 14:25:07 +00:00
mca_btl_openib_component . receive_queues ) ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2010-02-10 16:53:26 +00:00
" locally conflicting receive_queues " , true ,
2014-05-08 02:26:45 +00:00
opal_install_dirs . opaldatadir ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2009-12-15 14:25:07 +00:00
ibv_get_device_name ( receive_queues_device - > ib_dev ) ,
receive_queues_device - > ib_dev_attr . vendor_id ,
receive_queues_device - > ib_dev_attr . vendor_part_id ,
mca_btl_openib_component . receive_queues ,
2010-02-10 16:53:26 +00:00
ibv_get_device_name ( device - > ib_dev ) ,
device - > ib_dev_attr . vendor_id ,
device - > ib_dev_attr . vendor_part_id ,
values . receive_queues ) ;
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_RESOURCE_BUSY ;
2009-12-15 14:25:07 +00:00
goto error ;
}
2010-02-10 16:53:26 +00:00
}
/* If this device doesn't have an INI receive_queues
value , then if the component . receive_queues value came
from the default , we ' re ok . But if the
component . receive_queues value came from the 1 st
device ' s INI file , we must error . */
2014-11-04 14:25:02 -07:00
else if ( ( mca_base_var_source_t ) BTL_OPENIB_RQ_SOURCE_DEVICE_INI = =
2010-02-10 16:53:26 +00:00
mca_btl_openib_component . receive_queues_source ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2010-02-10 16:53:26 +00:00
" locally conflicting receive_queues " , true ,
2014-05-08 02:26:45 +00:00
opal_install_dirs . opaldatadir ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2010-02-10 16:53:26 +00:00
ibv_get_device_name ( receive_queues_device - > ib_dev ) ,
receive_queues_device - > ib_dev_attr . vendor_id ,
receive_queues_device - > ib_dev_attr . vendor_part_id ,
mca_btl_openib_component . receive_queues ,
ibv_get_device_name ( device - > ib_dev ) ,
device - > ib_dev_attr . vendor_id ,
device - > ib_dev_attr . vendor_part_id ,
mca_btl_openib_component . default_recv_qps ) ;
2014-07-26 00:47:28 +00:00
ret = OPAL_ERR_RESOURCE_BUSY ;
2010-02-10 16:53:26 +00:00
goto error ;
2009-12-15 14:25:07 +00:00
}
}
2010-02-10 16:53:26 +00:00
receive_queues_device = device ;
good :
mca_btl_openib_component . devices_count + + ;
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
If you have an HCA with no active ports, we still create an mpool.
This mpool will have no btl module owner, because there was no btl created for
the HCA with no ports, but it will still be tracked in the mpool
framework (i.e., it's available).
If MPI_ALLOC_MEM is called by the app, one of two things will happen:
1. if there's an HCA on the host with some active ports, the openib
btl component will still be in the process space, and therefore
the "mpool with no btl" (MWNB) module will still be able to call
the reg/dereg functions, and all will be fine. However, if
MPI_FREE_MEM is never invoked to free the memory, bad things will
happen during MPI_FINALIZE. The pml is finalized, which finalizes
all the btls. The btls finalize all their mpools and all is fine.
But later we close down the mpool framework which then finalizes
any left over mpool modules, such as MWNB. However, the openib
BTL module functions that the MWNB was registered with are no
longer in the process space, and it segv's while trying to deregister
the memory.
2. if there are *no* HCA's on the host with active ports, then the
openib btl will have been unloaded, and when the MWNB tries to
register the memory, the functions it tries to call (in the openib
btl) are no longer there, and we segv.
This commit was SVN r15735.
2007-08-01 20:53:34 +00:00
}
2006-06-28 07:23:08 +00:00
2008-01-09 10:26:21 +00:00
error :
2014-07-26 00:47:28 +00:00
if ( OPAL_SUCCESS ! = ret ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2011-07-04 14:00:41 +00:00
" error in device init " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2008-10-06 00:46:02 +00:00
ibv_get_device_name ( device - > ib_dev ) ) ;
}
2015-03-03 13:22:01 +09:00
if ( NULL ! = allowed_ports ) {
free ( allowed_ports ) ;
}
2008-07-23 00:28:59 +00:00
OBJ_RELEASE ( device ) ;
2006-06-28 07:23:08 +00:00
return ret ;
}
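The receive_queues reconciliation at the end of init_one_device ( ) (cases 8 to 10 in the comment above) can be read as a small decision function. The following is only a minimal sketch under stated assumptions: `struct rq_state`, `enum rq_source`, `dup_string ( )` and `reconcile_receive_queues ( )` are hypothetical names invented for illustration, not Open MPI types or functions, and the sketch ignores the MPI_T re-read of the MCA parameter and the show_help reporting.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the component state; NOT the real Open MPI types. */
enum rq_source { RQ_FROM_DEFAULT, RQ_FROM_MCA_PARAM, RQ_FROM_DEVICE_INI };

struct rq_state {
    char *receive_queues;   /* current component-wide value (heap allocated) */
    enum rq_source source;  /* where that value came from */
    int devices_count;      /* devices accepted so far */
};

/* Portable string duplicate (strdup is POSIX, not ISO C11). */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (NULL != copy) {
        memcpy(copy, s, len);
    }
    return copy;
}

/* ini_value is the device's INI receive_queues, or NULL if the INI has none.
 * Returns 0 if the device is accepted, -1 on a local conflict or allocation
 * failure.  The state is left unchanged on failure. */
static int reconcile_receive_queues(struct rq_state *st, const char *ini_value)
{
    assert(NULL != st && NULL != st->receive_queues);

    if (RQ_FROM_MCA_PARAM == st->source) {
        /* The user set the MCA parameter: skip all checks. */
    } else if (0 == st->devices_count) {
        /* First device: an INI value that differs from the default wins. */
        if (NULL != ini_value && 0 != strcmp(ini_value, st->receive_queues)) {
            char *copy = dup_string(ini_value);
            if (NULL == copy) {
                return -1;
            }
            free(st->receive_queues);
            st->receive_queues = copy;
            st->source = RQ_FROM_DEVICE_INI;
        }
    } else if (NULL != ini_value) {
        /* Later device with an INI value: it must agree with the component. */
        if (0 != strcmp(ini_value, st->receive_queues)) {
            return -1;
        }
    } else if (RQ_FROM_DEVICE_INI == st->source) {
        /* Later device without an INI value, but the first device's INI
         * already overrode the default: conflict. */
        return -1;
    }

    st->devices_count++;
    return 0;
}
```

In this sketch, case 8 (default A, then a device whose INI says B) is rejected on the second call because QPs for the first device would already exist with value A.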
2006-12-17 12:26:41 +00:00
2007-10-23 12:57:45 +00:00
static int finish_btl_init ( mca_btl_openib_module_t * openib_btl )
{
2008-01-09 10:26:21 +00:00
int qp ;
2007-10-23 12:57:45 +00:00
openib_btl - > num_peers = 0 ;
/* Initialize module state */
OBJ_CONSTRUCT ( & openib_btl - > ib_lock , opal_mutex_t ) ;
2008-01-21 12:11:18 +00:00
2007-10-23 12:57:45 +00:00
/* setup the qp structure */
openib_btl - > qps = ( mca_btl_openib_module_qp_t * )
2008-01-09 10:26:21 +00:00
calloc ( mca_btl_openib_component . num_qps ,
sizeof ( mca_btl_openib_module_qp_t ) ) ;
2013-10-28 20:04:49 +00:00
if ( NULL = = openib_btl - > qps ) {
2014-07-26 00:47:28 +00:00
return OPAL_ERR_OUT_OF_RESOURCE ;
2013-10-28 20:04:49 +00:00
}
2008-01-21 12:11:18 +00:00
/* setup all the qps */
2008-05-16 03:28:20 +00:00
for ( qp = 0 ; qp < mca_btl_openib_component . num_qps ; qp + + ) {
if ( ! BTL_OPENIB_QP_TYPE_PP ( qp ) ) {
2007-11-28 07:12:44 +00:00
OBJ_CONSTRUCT ( & openib_btl - > qps [ qp ] . u . srq_qp . pending_frags [ 0 ] ,
opal_list_t ) ;
OBJ_CONSTRUCT ( & openib_btl - > qps [ qp ] . u . srq_qp . pending_frags [ 1 ] ,
opal_list_t ) ;
2008-01-21 12:11:18 +00:00
openib_btl - > qps [ qp ] . u . srq_qp . sd_credits =
2007-10-23 12:57:45 +00:00
mca_btl_openib_component . qp_infos [ qp ] . u . srq_qp . sd_max ;
2009-03-25 14:18:05 +00:00
openib_btl - > qps [ qp ] . u . srq_qp . srq = NULL ;
2007-10-23 12:57:45 +00:00
}
}
2008-07-23 00:28:59 +00:00
/* initialize the memory pool using the device */
openib_btl - > super . btl_mpool = openib_btl - > device - > mpool ;
2008-01-21 12:11:18 +00:00
2007-12-23 12:29:34 +00:00
openib_btl - > eager_rdma_channels = 0 ;
2007-10-23 12:57:45 +00:00
openib_btl - > eager_rdma_frag_size = OPAL_ALIGN (
sizeof ( mca_btl_openib_header_t ) +
2007-12-09 14:05:13 +00:00
sizeof ( mca_btl_openib_header_coalesced_t ) +
sizeof ( mca_btl_openib_control_header_t ) +
2007-10-23 12:57:45 +00:00
sizeof ( mca_btl_openib_footer_t ) +
openib_btl - > super . btl_eager_limit ,
mca_btl_openib_component . buffer_alignment , size_t ) ;
2015-06-23 20:59:57 -07:00
opal_output_verbose ( 1 , opal_btl_base_framework . framework_output ,
" [rank=%d] openib: using port %s:%d " ,
2014-11-11 17:00:42 -08:00
OPAL_PROC_MY_NAME . vpid ,
2014-07-21 20:38:52 +00:00
ibv_get_device_name ( openib_btl - > device - > ib_dev ) ,
openib_btl - > port_num ) ;
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2007-10-23 12:57:45 +00:00
}
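The eager_rdma_frag_size computation in finish_btl_init ( ) sums the header, footer and eager-limit sizes and rounds the total up to the buffer alignment. The sketch below illustrates that rounding, assuming OPAL_ALIGN rounds up to a power-of-two boundary; `align_up ( )` and `sketch_frag_size ( )` are hypothetical helpers, and the sizes passed in are placeholders rather than the real struct sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Round value up to the next multiple of alignment.
 * Assumption: alignment is a non-zero power of two. */
static size_t align_up(size_t value, size_t alignment)
{
    assert(0 != alignment && 0 == (alignment & (alignment - 1)));
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical equivalent of the frag-size expression: overhead stands for
 * the combined size of the headers and footer. */
static size_t sketch_frag_size(size_t overhead, size_t eager_limit,
                               size_t alignment)
{
    return align_up(overhead + eager_limit, alignment);
}
```

With a placeholder overhead of 40 bytes, an eager limit of 12288 and an alignment of 64, the unaligned total of 12328 rounds up to 12352.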
Per RFC, bring in the following changes:
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
2012-05-07 14:52:54 +00:00
struct dev_distance {
struct ibv_device * ib_dev ;
2014-07-02 07:56:40 +00:00
float distance ;
2012-05-07 14:52:54 +00:00
} ;
2008-02-11 10:34:11 +00:00
2012-05-07 14:52:54 +00:00
static int compare_distance ( const void * p1 , const void * p2 )
2008-02-11 10:34:11 +00:00
{
2012-05-07 14:52:54 +00:00
const struct dev_distance * d1 = ( const struct dev_distance * ) p1 ;
const struct dev_distance * d2 = ( const struct dev_distance * ) p2 ;
2008-02-11 10:34:11 +00:00
2014-07-02 07:56:40 +00:00
if ( d1 - > distance > ( d2 - > distance + EPS ) ) {
return 1 ;
} else if ( ( d1 - > distance + EPS ) < d2 - > distance ) {
return - 1 ;
} else {
return 0 ;
}
2012-05-07 14:52:54 +00:00
}
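compare_distance ( ) has the signature that qsort ( ) expects, and treats two distances within EPS of each other as equal. A minimal sketch of how such a comparator orders devices by distance follows; `SKETCH_EPS`, `struct sketch_dev`, `sketch_compare ( )` and `sort_by_distance ( )` are hypothetical names, and the epsilon value is an assumption rather than the value used in this file.

```c
#include <stdlib.h>

#define SKETCH_EPS 1.e-6f   /* assumed tolerance, for illustration only */

/* Hypothetical stand-in for struct dev_distance. */
struct sketch_dev {
    const char *name;
    float distance;
};

/* Same three-way logic as compare_distance(): distances closer than the
 * tolerance compare as equal. */
static int sketch_compare(const void *p1, const void *p2)
{
    const struct sketch_dev *d1 = (const struct sketch_dev *) p1;
    const struct sketch_dev *d2 = (const struct sketch_dev *) p2;

    if (d1->distance > (d2->distance + SKETCH_EPS)) {
        return 1;
    } else if ((d1->distance + SKETCH_EPS) < d2->distance) {
        return -1;
    }
    return 0;
}

/* Sort devices in place, nearest first. */
static void sort_by_distance(struct sketch_dev *devs, size_t count)
{
    qsort(devs, count, sizeof(devs[0]), sketch_compare);
}
```

Note that a tolerance-based comparison is not strictly transitive, so this only behaves predictably when distinct distances differ by much more than the tolerance. Also, qsort ( ) is not guaranteed to be stable, so devices at equal distance may end up in any order.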
2008-02-11 10:34:11 +00:00
2014-07-02 07:56:40 +00:00
static float get_ib_dev_distance ( struct ibv_device * dev )
2012-05-07 14:52:54 +00:00
{
/* If we don't have hwloc, we'll default to a distance of 0,
because we have no way of measuring . */
2014-07-02 07:56:40 +00:00
float distance = 0 ;
2008-02-11 10:34:11 +00:00
2014-01-14 15:41:56 +00:00
/* Override any distance logic so all devices are used */
if ( 0 ! = mca_btl_openib_component . ignore_locality ) {
return distance ;
}
2014-07-02 07:56:40 +00:00
float a , b ;
int i ;
2012-05-07 14:52:54 +00:00
    hwloc_cpuset_t my_cpuset = NULL, ibv_cpuset = NULL;
    hwloc_obj_t my_obj, ibv_obj, node_obj;
2008-02-11 10:34:11 +00:00
2012-05-07 14:52:54 +00:00
    /* Note that this struct is owned by hwloc; there's no need to
       free it at the end of time */
    static const struct hwloc_distances_s *hwloc_distances = NULL;
2008-02-11 10:34:11 +00:00
2012-05-07 14:52:54 +00:00
    if (NULL == hwloc_distances) {
2015-06-23 20:59:57 -07:00
        hwloc_distances =
            hwloc_get_whole_distance_matrix_by_type(opal_hwloc_topology,
2012-05-07 14:52:54 +00:00
                                                    HWLOC_OBJ_NODE);
    }
2008-02-11 10:34:11 +00:00
2012-05-07 14:52:54 +00:00
    /* If we got no info, just return 0 */
    if (NULL == hwloc_distances || NULL == hwloc_distances->latency) {
        goto out;
    }
2008-02-11 10:34:11 +00:00
2012-05-07 14:52:54 +00:00
    /* Next, find the NUMA node where this IBV device is located */
    ibv_cpuset = hwloc_bitmap_alloc();
    if (NULL == ibv_cpuset) {
        goto out;
    }
    if (0 != hwloc_ibv_get_device_cpuset(opal_hwloc_topology, dev, ibv_cpuset)) {
        goto out;
    }
    ibv_obj = hwloc_get_obj_covering_cpuset(opal_hwloc_topology, ibv_cpuset);
    if (NULL == ibv_obj) {
        goto out;
    }
2008-02-11 10:34:11 +00:00
2015-09-04 15:43:49 -04:00
    opal_output_verbose(5, opal_btl_base_framework.framework_output,
                        "hwloc_distances->nbobjs=%u", hwloc_distances->nbobjs);
    /* The latency matrix holds nbobjs * nbobjs entries, so walk all of
       them (2 * nbobjs would over-read when nbobjs is 1 and truncate
       the dump when nbobjs is greater than 2) */
    for (i = 0; i < (int) (hwloc_distances->nbobjs * hwloc_distances->nbobjs); i++) {
        opal_output_verbose(5, opal_btl_base_framework.framework_output,
                            "hwloc_distances->latency[%d]=%f", i, hwloc_distances->latency[i]);
    }
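As a rough illustration of how the flat latency array logged above is indexed: the surrounding code reads it as a row-major nbobjs x nbobjs matrix and, because the two directions may differ, keeps the larger value. The sketch below assumes that layout; `max_directed_latency` and `example_latency` are hypothetical names made up for this example, not part of hwloc or Open MPI, and the matrix values are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an hwloc or Open MPI function): return the
   larger of the two directed latencies between objects i and j, given a
   flat row-major nbobjs x nbobjs matrix where latency[i * nbobjs + j]
   is the latency from i to j. */
static float max_directed_latency(const float *latency, unsigned nbobjs,
                                  unsigned i, unsigned j)
{
    assert(NULL != latency);
    assert(i < nbobjs && j < nbobjs);

    float a = latency[i * nbobjs + j];   /* i -> j */
    float b = latency[j * nbobjs + i];   /* j -> i */
    return (a > b) ? a : b;
}

/* Made-up 2-node matrix for illustration; these are not measurements. */
static const float example_latency[4] = {
    1.0f, 2.0f,
    2.5f, 1.0f,
};
```

With this made-up matrix, asking for nodes 0 and 1 in either order yields the larger of the two off-diagonal entries, which mirrors the "take the max" choice made later in this function.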
2012-05-07 14:52:54 +00:00
    /* If ibv_obj is a NUMA node or below, we're good. */
    switch (ibv_obj->type) {
    case HWLOC_OBJ_NODE:
    case HWLOC_OBJ_SOCKET:
    case HWLOC_OBJ_CACHE:
    case HWLOC_OBJ_CORE:
    case HWLOC_OBJ_PU:
        while (NULL != ibv_obj && ibv_obj->type != HWLOC_OBJ_NODE) {
            ibv_obj = ibv_obj->parent;
        }
        break;
2008-02-11 10:34:11 +00:00
2015-06-23 20:59:57 -07:00
    default:
2012-05-07 14:52:54 +00:00
        /* If it's above a NUMA node, then I don't know how to compute
           the distance... */
2015-09-04 15:43:49 -04:00
        opal_output_verbose(5, opal_btl_base_framework.framework_output,
                            "ibv_obj is above a NUMA node; setting ibv_obj to NULL");
2012-05-07 14:52:54 +00:00
        ibv_obj = NULL;
        break;
    }
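The switch above climbs parent links from whatever object covers the device's cpuset until it reaches a NUMA node, and gives up when the starting object is already above NUMA level. A minimal, self-contained sketch of that walk follows. The `obj_t` struct, the `obj_type_t` enum, `find_numa_ancestor`, and the tiny example topology are all hypothetical stand-ins invented here so the sketch compiles without hwloc; they are not hwloc's real types.

```c
#include <stddef.h>

/* Hypothetical stand-ins for hwloc's object type and object struct;
   only the fields needed for the walk are modeled. */
typedef enum { OBJ_MACHINE, OBJ_NODE, OBJ_SOCKET, OBJ_CORE, OBJ_PU } obj_type_t;

typedef struct obj {
    obj_type_t type;
    unsigned logical_index;
    struct obj *parent;
} obj_t;

/* Climb parent links until a NUMA node is found.  Returns NULL when the
   starting object has no NUMA ancestor (e.g., it is the whole machine). */
static obj_t *find_numa_ancestor(obj_t *obj)
{
    while (NULL != obj && OBJ_NODE != obj->type) {
        obj = obj->parent;
    }
    return obj;
}

/* Tiny made-up topology: machine -> NUMA node 1 -> socket -> core. */
static obj_t example_machine = { OBJ_MACHINE, 0, NULL };
static obj_t example_node    = { OBJ_NODE,    1, &example_machine };
static obj_t example_socket  = { OBJ_SOCKET,  1, &example_node };
static obj_t example_core    = { OBJ_CORE,    4, &example_socket };
```

Starting from `example_core` the walk stops at `example_node`; starting from `example_machine` it runs off the top and returns NULL, which corresponds to the "give up" path that follows in the real function.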
2008-02-11 10:34:11 +00:00
2012-05-07 14:52:54 +00:00
    /* If we don't have an object for this ibv device, give up */
    if (NULL == ibv_obj) {
        goto out;
    }
2008-02-11 10:34:11 +00:00
2015-09-04 15:43:49 -04:00
    opal_output_verbose(5, opal_btl_base_framework.framework_output,
                        "ibv_obj->logical_index=%u", ibv_obj->logical_index);
2012-05-07 14:52:54 +00:00
    /* This function is only called if the process is bound, so let's
       find out where we are bound to.  For the moment, we only care
       about the NUMA node to which we are bound. */
    my_cpuset = hwloc_bitmap_alloc();
    if (NULL == my_cpuset) {
        goto out;
    }
    if (0 != hwloc_get_cpubind(opal_hwloc_topology, my_cpuset, 0)) {
        goto out;
    }
    my_obj = hwloc_get_obj_covering_cpuset(opal_hwloc_topology, my_cpuset);
    if (NULL == my_obj) {
        goto out;
2008-02-11 10:34:11 +00:00
    }
2012-05-07 14:52:54 +00:00
    /* If my_obj is a NUMA node or below, we're good. */
    switch (my_obj->type) {
    case HWLOC_OBJ_NODE:
    case HWLOC_OBJ_SOCKET:
    case HWLOC_OBJ_CACHE:
    case HWLOC_OBJ_CORE:
    case HWLOC_OBJ_PU:
        while (NULL != my_obj && my_obj->type != HWLOC_OBJ_NODE) {
            my_obj = my_obj->parent;
        }
        if (NULL != my_obj) {
2015-09-04 15:43:49 -04:00
            opal_output_verbose(5, opal_btl_base_framework.framework_output,
                                "my_obj->logical_index=%u", my_obj->logical_index);
2012-05-07 14:52:54 +00:00
            /* Distance may be asymmetrical, so calculate both of them
               and take the max */
2014-01-14 15:41:56 +00:00
            a = hwloc_distances->latency[my_obj->logical_index +
2015-06-23 20:59:57 -07:00
                                         (ibv_obj->logical_index *
2012-05-07 14:52:54 +00:00
                                          hwloc_distances->nbobjs)];
2014-01-14 15:41:56 +00:00
            b = hwloc_distances->latency[ibv_obj->logical_index +
2015-06-23 20:59:57 -07:00
                                         (my_obj->logical_index *
2012-05-07 14:52:54 +00:00
                                          hwloc_distances->nbobjs)];
            distance = (a > b) ? a : b;
        }
        break;
2008-02-11 10:34:11 +00:00
2015-06-23 20:59:57 -07:00
    default:
2012-05-07 14:52:54 +00:00
        /* If the obj is above a NUMA node, then we're bound to more than
           one NUMA node.  Find the max distance. */
        i = 0;
        for (node_obj = hwloc_get_obj_inside_cpuset_by_type(opal_hwloc_topology,
2015-06-23 20:59:57 -07:00
                                                            ibv_obj->cpuset,
2012-05-07 14:52:54 +00:00
HWLOC_OBJ_NODE, i);
2015-06-23 20:59:57 -07:00
NULL != node_obj;
2012-05-07 14:52:54 +00:00
node_obj = hwloc_get_obj_inside_cpuset_by_type(opal_hwloc_topology,
2015-06-23 20:59:57 -07:00
ibv_obj->cpuset,
2012-05-07 14:52:54 +00:00
HWLOC_OBJ_NODE, ++i)) {
2014-01-14 15:41:56 +00:00
a = hwloc_distances->latency[node_obj->logical_index +
2015-06-23 20:59:57 -07:00
(ibv_obj->logical_index *
2012-05-07 14:52:54 +00:00
hwloc_distances->nbobjs)];
2014-01-14 15:41:56 +00:00
b = hwloc_distances->latency[ibv_obj->logical_index +
2015-06-23 20:59:57 -07:00
(node_obj->logical_index *
2012-05-07 14:52:54 +00:00
hwloc_distances->nbobjs)];
a = (a > b) ? a : b;
distance = (a > distance) ? a : distance;
}
break;
}
2008-02-11 10:34:11 +00:00
2012-05-07 14:52:54 +00:00
out:
if (NULL != ibv_cpuset) {
hwloc_bitmap_free(ibv_cpuset);
}
if (NULL != my_cpuset) {
hwloc_bitmap_free(my_cpuset);
}
2008-02-11 10:34:11 +00:00
2012-05-07 14:52:54 +00:00
return distance;
2008-02-11 10:34:11 +00:00
}
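For each NUMA node under the binding, the loop above takes the larger of the two latency entries between that node and the device, and keeps the overall maximum. A minimal standalone sketch of that idea follows, assuming a flat nbobjs x nbobjs latency array indexed the same way as in the loop. The name example_max_latency_to_nodes and its parameters are hypothetical; they are not part of Open MPI or hwloc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an Open MPI or hwloc function): return the
   largest latency between object `dev` and any object listed in `nodes`.
   `latency` is a flat nbobjs x nbobjs matrix; both entries for a pair
   are examined because the matrix is not assumed to be symmetric. */
float example_max_latency_to_nodes(const float *latency, unsigned nbobjs,
                                   unsigned dev,
                                   const unsigned *nodes, size_t num_nodes)
{
    float distance = 0.0f;
    size_t k;

    assert(dev < nbobjs);
    for (k = 0; k < num_nodes; ++k) {
        unsigned node = nodes[k];
        float a, b;

        assert(node < nbobjs);
        a = latency[node + dev * nbobjs];   /* entry at row dev, column node */
        b = latency[dev + node * nbobjs];   /* entry at row node, column dev */
        a = (a > b) ? a : b;                /* worse of the two entries */
        distance = (a > distance) ? a : distance;
    }
    return distance;
}
```

Taking the maximum rather than the minimum is the pessimistic choice: a process bound across several NUMA nodes is treated as being as far from the device as its farthest node.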
static struct dev_distance *
sort_devs_by_distance(struct ibv_device **ib_devs, int count)
{
int i;
2010-07-13 01:52:22 +00:00
struct dev_distance *devs = (struct dev_distance *) malloc(count * sizeof(struct dev_distance));
2013-10-28 20:04:49 +00:00
if (NULL == devs) {
return NULL;
}
2008-02-11 10:34:11 +00:00
2008-05-16 03:28:20 +00:00
for (i = 0; i < count; i++) {
2008-02-11 10:34:11 +00:00
devs[i].ib_dev = ib_devs[i];
2015-09-04 15:43:49 -04:00
opal_output_verbose(5, opal_btl_base_framework.framework_output,
"Checking distance from this process to device=%s", ibv_get_device_name(ib_devs[i]));
2014-07-28 22:00:18 +00:00
/* If we're not bound, just assume that the device is close. */
devs[i].distance = 0;
if (opal_process_info.cpuset) {
2012-05-07 14:52:54 +00:00
/* If this process is bound to one or more PUs, we can get
an accurate distance. */
devs[i].distance = get_ib_dev_distance(ib_devs[i]);
}
2015-09-04 15:43:49 -04:00
opal_output_verbose(5, opal_btl_base_framework.framework_output,
"Process is %s: distance to device is %f",
(opal_process_info.cpuset ? "bound" : "not bound"), devs[i].distance);
2008-02-11 10:34:11 +00:00
}
qsort(devs, count, sizeof(struct dev_distance), compare_distance);
return devs;
}
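sort_devs_by_distance() relies on qsort() with a compare_distance callback that is not shown in this part of the file. The sketch below shows one way such a comparator could be written for float distances. The struct, the function names, and the use of a device name string in place of a struct ibv_device pointer are all assumptions made for illustration; this is not the component's actual comparator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dev_distance: a name string replaces
   the struct ibv_device pointer so the example needs no verbs headers. */
struct example_dev_distance {
    const char *name;
    float distance;
};

/* qsort() comparator giving ascending distance.  The two comparisons
   yield -1, 0, or 1; returning a float difference converted to int
   would truncate differences smaller than 1 to 0. */
int example_compare_distance(const void *p1, const void *p2)
{
    const struct example_dev_distance *d1 = p1;
    const struct example_dev_distance *d2 = p2;

    return (d1->distance > d2->distance) - (d1->distance < d2->distance);
}

/* Sort the array in place so that the closest device comes first. */
void example_sort_devs(struct example_dev_distance *devs, size_t count)
{
    qsort(devs, count, sizeof(devs[0]), example_compare_distance);
}
```

Note that qsort() is not guaranteed to be stable, so devices at equal distance may come out in any relative order.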
2008-08-06 14:22:03 +00:00
2005-06-30 21:28:35 +00:00
/*
* IB component initialization:
* (1) read interface list from kernel and compare against component parameters
* then create a BTL instance for selected interfaces
* (2) setup IB listen socket for incoming connection attempts
* (3) register BTL parameters with the MCA
*/
2008-01-21 12:11:18 +00:00
static mca_btl_base_module_t **
btl_openib_component_init(int *num_btl_modules,
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
2006-08-14 19:30:37 +00:00
bool enable_progress_threads,
bool enable_mpi_threads)
2005-06-30 21:28:35 +00:00
{
2008-01-21 12:11:18 +00:00
struct ibv_device **ib_devs;
2015-03-03 13:22:01 +09:00
mca_btl_base_module_t **btls = NULL;
2008-01-09 10:26:21 +00:00
int i, ret, num_devs, length;
2008-01-21 12:11:18 +00:00
opal_list_t btl_list;
mca_btl_openib_module_t *openib_btl;
mca_btl_base_selected_module_t *ib_selected;
opal_list_item_t *item;
2008-01-09 10:26:21 +00:00
mca_btl_openib_frag_init_data_t *init_data;
2008-02-11 10:34:11 +00:00
struct dev_distance *dev_sorted;
2014-07-02 07:56:40 +00:00
float distance;
2016-07-21 09:44:43 +09:00
int index;
2008-10-06 00:46:02 +00:00
bool found;
2013-03-27 21:09:41 +00:00
mca_base_var_source_t source;
2008-12-02 22:42:01 +00:00
int list_count = 0;
2005-07-19 21:04:22 +00:00
2005-06-30 21:28:35 +00:00
/* initialization */
*num_btl_modules = 0;
2008-01-21 12:11:18 +00:00
num_devs = 0;
2005-06-30 21:28:35 +00:00
2012-06-29 13:51:36 +00:00
/* If we got this far, then set up the memory alloc hook (because
2015-06-12 20:08:18 -07:00
we're most likely going to be using this component). The hook
is to be set up as early as possible in this function since we
2016-02-10 14:27:47 +02:00
want most of the allocated resources to be aligned.
*/
2016-02-18 10:33:14 +09:00
opal_memory->memoryc_set_alignment(32, mca_btl_openib_module.super.btl_eager_limit);
2008-10-28 18:29:57 +00:00
2008-05-28 22:05:47 +00:00
/* Per https://svn.open-mpi.org/trac/ompi/ticket/1305, check to
see if $sysfsdir/class/infiniband exists. If it does not,
assume that the RDMA hardware drivers are not loaded, and
therefore we don't want OpenFabrics verbs support in this OMPI
job. No need to print a warning. */
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
if (!opal_common_verbs_check_basics()) {
2008-10-28 18:29:57 +00:00
goto no_btls;
2008-05-28 22:05:47 +00:00
}
2008-07-23 00:28:59 +00:00
/* Read in INI files with device-specific parameters */
2014-07-26 00:47:28 +00:00
if (OPAL_SUCCESS != (ret = opal_btl_openib_ini_init())) {
2007-06-14 01:59:25 +00:00
goto no_btls;
2006-08-14 19:30:37 +00:00
}
2008-05-06 22:43:52 +00:00
2013-03-27 21:09:41 +00:00
index = mca_base_var_find("ompi", "btl", "openib", "max_inline_data");
2008-11-14 12:15:35 +00:00
if (index >= 0) {
2013-03-27 21:09:41 +00:00
if (OPAL_SUCCESS == mca_base_var_get_value(index, NULL, &source, NULL)) {
2011-07-04 14:00:41 +00:00
if (-1 == mca_btl_openib_component.ib_max_inline_data &&
2013-03-27 21:09:41 +00:00
MCA_BASE_VAR_SOURCE_DEFAULT == source) {
2008-11-14 12:15:35 +00:00
/* If the user has not explicitly set this MCA parameter
2011-07-04 14:00:41 +00:00
use max_inline_data value specified in the
2008-11-14 12:15:35 +00:00
device-specific parameters INI file */
mca_btl_openib_component.ib_max_inline_data = -2;
}
}
}
}
}
}
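The block above distinguishes "the user left max_inline_data at its default" from "the user explicitly asked for -1" by checking where the value came from, and marks the former case with -2. A small standalone sketch of that sentinel pattern follows. The enum names and example_resolve_max_inline are hypothetical, and reading -2 as "take the value from the INI file" is an interpretation of the comment above, not something this excerpt spells out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sentinel names; the component uses the bare values -1 and -2. */
enum {
    EXAMPLE_INLINE_UNSET    = -1,  /* value as registered by default */
    EXAMPLE_INLINE_FROM_INI = -2   /* assumed meaning: defer to the INI file */
};

/* Return the value to store: switch to the INI sentinel only when the
   parameter still holds the default AND nobody set it explicitly. */
int example_resolve_max_inline(int configured, bool source_is_default)
{
    if (EXAMPLE_INLINE_UNSET == configured && source_is_default) {
        return EXAMPLE_INLINE_FROM_INI;
    }
    return configured;
}
```

Checking the source as well as the value matters because a user who passes -1 on purpose should keep the -1 behavior rather than be silently redirected to the INI file.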
2008-07-24 22:51:26 +00:00
2015-02-19 13:41:41 -07:00
OBJ_CONSTRUCT(&mca_btl_openib_component.send_free_coalesced, opal_free_list_t);
OBJ_CONSTRUCT(&mca_btl_openib_component.send_user_free, opal_free_list_t);
OBJ_CONSTRUCT(&mca_btl_openib_component.recv_user_free, opal_free_list_t);
2008-01-09 10:26:21 +00:00
2010-07-12 16:17:56 +00:00
init_data = (mca_btl_openib_frag_init_data_t *) malloc(sizeof(mca_btl_openib_frag_init_data_t));
2013-10-28 20:04:49 +00:00
if (NULL == init_data) {
BTL_ERROR(("Failed malloc: %s:%d", __FILE__, __LINE__));
goto no_btls;
}
2008-01-21 12:11:18 +00:00
2008-01-09 10:26:21 +00:00
init_data->order = mca_btl_openib_component.rdma_qp;
init_data->list = &mca_btl_openib_component.send_user_free;
2008-01-21 12:11:18 +00:00
2013-02-20 21:42:13 +00:00
/* Align fragments on 8-byte boundaries (instead of 2) to fix bus errors that
occur on some 32-bit platforms. Depending on the size of the fragment this
will waste 2-6 bytes of space per frag. In most cases this shouldn't waste
any space. */
2015-02-19 13:41:41 -07:00
if (OPAL_SUCCESS != opal_free_list_init(
2008-01-09 10:26:21 +00:00
&mca_btl_openib_component.send_user_free,
2013-02-20 21:27:01 +00:00
sizeof(mca_btl_openib_put_frag_t), 8,
2008-01-09 10:26:21 +00:00
OBJ_CLASS(mca_btl_openib_put_frag_t),
0, 0,
mca_btl_openib_component.ib_free_list_num,
mca_btl_openib_component.ib_free_list_max,
mca_btl_openib_component.ib_free_list_inc,
2015-02-19 13:41:41 -07:00
NULL, 0, NULL, mca_btl_openib_frag_init, init_data)) {
2008-01-09 10:26:21 +00:00
goto no_btls;
}
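The alignment comment above says that 8-byte alignment costs 2 to 6 bytes per fragment, or nothing when the size is already a multiple of 8. The arithmetic behind that range can be sketched as below for sizes that are already multiples of 2. example_align_up and example_padding are hypothetical helpers for illustration only; how opal_free_list_init() applies the alignment internally is not shown in this excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `alignment`, which must be a
   nonzero power of two (hypothetical helper, not an OPAL function). */
size_t example_align_up(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Bytes of padding added by rounding `size` up to `alignment`. */
size_t example_padding(size_t size, size_t alignment)
{
    return example_align_up(size, alignment) - size;
}
```

For even sizes the padding under 8-byte alignment can only be 0, 2, 4, or 6, which matches the 2-6 byte range quoted in the comment.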
2008-01-21 12:11:18 +00:00
2010-07-12 16:17:56 +00:00
init_data = (mca_btl_openib_frag_init_data_t *) malloc(sizeof(mca_btl_openib_frag_init_data_t));
2013-10-28 20:04:49 +00:00
if (NULL == init_data) {
BTL_ERROR(("Failed malloc: %s:%d", __FILE__, __LINE__));
goto no_btls;
}
2008-01-09 10:26:21 +00:00
init_data->order = mca_btl_openib_component.rdma_qp;
init_data->list = &mca_btl_openib_component.recv_user_free;
2008-01-21 12:11:18 +00:00
2015-02-19 13:41:41 -07:00
if (OPAL_SUCCESS != opal_free_list_init(
2008-01-09 10:26:21 +00:00
&mca_btl_openib_component.recv_user_free,
2013-02-20 21:27:01 +00:00
sizeof(mca_btl_openib_get_frag_t), 8,
2008-01-09 10:26:21 +00:00
OBJ_CLASS(mca_btl_openib_get_frag_t),
0, 0,
mca_btl_openib_component.ib_free_list_num,
mca_btl_openib_component.ib_free_list_max,
mca_btl_openib_component.ib_free_list_inc,
2015-02-19 13:41:41 -07:00
NULL, 0, NULL, mca_btl_openib_frag_init, init_data)) {
2008-01-09 10:26:21 +00:00
goto no_btls;
}
2010-07-12 16:17:56 +00:00
init_data = (mca_btl_openib_frag_init_data_t *) malloc(sizeof(mca_btl_openib_frag_init_data_t));
2013-10-28 20:04:49 +00:00
if (NULL == init_data) {
BTL_ERROR(("Failed malloc: %s:%d", __FILE__, __LINE__));
goto no_btls;
}
2008-01-09 10:26:21 +00:00
length = sizeof(mca_btl_openib_coalesced_frag_t);
init_data->list = &mca_btl_openib_component.send_free_coalesced;
2015-02-19 13:41:41 -07:00
if (OPAL_SUCCESS != opal_free_list_init(
2008-01-09 10:26:21 +00:00
&mca_btl_openib_component.send_free_coalesced,
2013-02-20 21:27:01 +00:00
length, 8, OBJ_CLASS(mca_btl_openib_coalesced_frag_t),
2015-02-19 13:41:41 -07:00
0, 0, mca_btl_openib_component.ib_free_list_num,
2008-01-09 10:26:21 +00:00
mca_btl_openib_component.ib_free_list_max,
mca_btl_openib_component.ib_free_list_inc,
2015-02-19 13:41:41 -07:00
NULL, 0, NULL, mca_btl_openib_frag_init, init_data)) {
2008-01-09 10:26:21 +00:00
goto no_btls;
}
2015-02-23 10:30:35 +02:00
/* If fork support is requested, try to enable it */
if (OPAL_SUCCESS != (ret = opal_common_verbs_fork_test())) {
goto no_btls;
2007-04-21 00:15:05 +00:00
}
2007-06-14 01:59:25 +00:00
/* Parse the include and exclude lists, checking for errors */
mca_btl_openib_component.if_include_list =
2008-01-21 12:11:18 +00:00
mca_btl_openib_component.if_exclude_list =
2007-06-14 01:59:25 +00:00
mca_btl_openib_component.if_list = NULL;
2008-12-02 22:42:01 +00:00
if (NULL != mca_btl_openib_component.if_include)
list_count++;
if (NULL != mca_btl_openib_component.if_exclude)
list_count++;
if (NULL != mca_btl_openib_component.ipaddr_include)
list_count++;
if (NULL != mca_btl_openib_component.ipaddr_exclude)
list_count++;
if (list_count > 1) {
2013-02-12 21:10:11 +00:00
opal_show_help("help-mpi-btl-openib.txt",
2007-06-14 01:59:25 +00:00
"specified include and exclude", true,
2008-12-02 23:17:46 +00:00
NULL == mca_btl_openib_component.if_include ?
"<not specified>" : mca_btl_openib_component.if_include,
NULL == mca_btl_openib_component.if_exclude ?
"<not specified>" : mca_btl_openib_component.if_exclude,
NULL == mca_btl_openib_component.ipaddr_include ?
"<not specified>" : mca_btl_openib_component.ipaddr_include,
NULL == mca_btl_openib_component.ipaddr_exclude ?
"<not specified>" : mca_btl_openib_component.ipaddr_exclude,
NULL);
2007-06-14 01:59:25 +00:00
goto no_btls;
} else if (NULL != mca_btl_openib_component.if_include) {
2008-01-21 12:11:18 +00:00
mca_btl_openib_component.if_include_list =
2007-06-14 01:59:25 +00:00
opal_argv_split(mca_btl_openib_component.if_include, ',');
2008-01-21 12:11:18 +00:00
mca_btl_openib_component.if_list =
2007-06-14 01:59:25 +00:00
opal_argv_copy(mca_btl_openib_component.if_include_list);
} else if (NULL != mca_btl_openib_component.if_exclude) {
2008-01-21 12:11:18 +00:00
mca_btl_openib_component.if_exclude_list =
2007-06-14 01:59:25 +00:00
opal_argv_split(mca_btl_openib_component.if_exclude, ',');
2008-01-21 12:11:18 +00:00
mca_btl_openib_component.if_list =
2007-06-14 01:59:25 +00:00
opal_argv_copy(mca_btl_openib_component.if_exclude_list);
}
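The checks above treat the four include/exclude settings as mutually exclusive (at most one may be non-NULL) and then split the chosen comma-separated string into a list. A self-contained sketch of both steps follows. example_count_set, example_split, and example_free_list are hypothetical stand-ins written for this illustration; they are not opal_argv_split() or any other OPAL routine, and their handling of empty tokens may differ from the real helpers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count how many of the n optional settings were supplied (non-NULL). */
int example_count_set(const char *const *opts, size_t n)
{
    int set = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (NULL != opts[i]) {
            ++set;
        }
    }
    return set;
}

/* Free a NULL-terminated array produced by example_split(). */
void example_free_list(char **list)
{
    size_t i;

    if (NULL == list) {
        return;
    }
    for (i = 0; NULL != list[i]; ++i) {
        free(list[i]);
    }
    free(list);
}

/* Split `str` on `delim` into a NULL-terminated array of copies.
   Empty tokens are kept.  Returns NULL if allocation fails. */
char **example_split(const char *str, char delim)
{
    size_t count = 1, i = 0;
    const char *p;
    char **out;

    for (p = str; '\0' != *p; ++p) {
        if (*p == delim) {
            ++count;
        }
    }
    /* calloc zeroes the array, so it is always NULL-terminated */
    out = calloc(count + 1, sizeof(*out));
    if (NULL == out) {
        return NULL;
    }
    p = str;
    while (i < count) {
        const char *end = strchr(p, delim);
        size_t len = (NULL != end) ? (size_t) (end - p) : strlen(p);

        out[i] = malloc(len + 1);
        if (NULL == out[i]) {
            example_free_list(out);
            return NULL;
        }
        memcpy(out[i], p, len);
        out[i][len] = '\0';
        ++i;
        if (NULL == end) {
            break;
        }
        p = end + 1;
    }
    return out;
}
```

A caller would reject the configuration when example_count_set() returns more than 1, and otherwise split whichever single setting is present.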
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
ib_devs = opal_ibv_get_device_list ( & num_devs ) ;
2005-06-30 21:28:35 +00:00
2008-02-03 15:16:24 +00:00
if ( 0 = = num_devs | | NULL = = ib_devs ) {
2008-07-23 00:28:59 +00:00
mca_btl_base_error_no_nics ( " OpenFabrics (openib) " , " device " ) ;
2008-02-03 15:16:24 +00:00
goto no_btls ;
2005-06-30 21:28:35 +00:00
}
2008-02-11 10:34:11 +00:00
dev_sorted = sort_devs_by_distance ( ib_devs , num_devs ) ;
2013-10-28 20:04:49 +00:00
if ( NULL = = dev_sorted ) {
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
goto no_btls ;
}
2008-02-11 10:34:11 +00:00
2008-01-21 12:11:18 +00:00
OBJ_CONSTRUCT ( & btl_list , opal_list_t ) ;
2005-07-03 22:45:48 +00:00
OBJ_CONSTRUCT ( & mca_btl_openib_component . ib_lock , opal_mutex_t ) ;
2015-10-06 12:47:09 -06:00
2008-06-05 19:08:08 +00:00
distance = dev_sorted [ 0 ] . distance ;
2011-07-04 14:00:41 +00:00
for ( found = false , i = 0 ;
2008-10-06 00:46:02 +00:00
i < num_devs & & ( - 1 = = mca_btl_openib_component . ib_max_btls | |
2007-11-28 07:14:34 +00:00
mca_btl_openib_component . ib_num_btls <
mca_btl_openib_component . ib_max_btls ) ; i + + ) {
2013-11-20 22:23:12 +00:00
if ( 0 ! = mca_btl_openib_component . ib_num_btls & &
2014-07-02 07:56:40 +00:00
( dev_sorted [ i ] . distance - distance ) > EPS ) {
2015-06-23 20:59:57 -07:00
opal_output_verbose ( 1 , opal_btl_base_framework . framework_output ,
" [rank=%d] openib: skipping device %s; it is too far away " ,
2014-11-11 17:00:42 -08:00
OPAL_PROC_MY_NAME . vpid ,
2014-04-18 18:09:09 +00:00
ibv_get_device_name ( dev_sorted [ i ] . ib_dev ) ) ;
2008-02-11 10:34:11 +00:00
break ;
2008-05-16 03:28:20 +00:00
}
2008-02-11 10:34:11 +00:00
2008-10-06 00:46:02 +00:00
/* Only take devices that match the type specified by
btl_openib_device_type */
switch ( mca_btl_openib_component . device_type ) {
case BTL_OPENIB_DT_IB :
# if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
if ( IBV_TRANSPORT_IWARP = = dev_sorted [ i ] . ib_dev - > transport_type ) {
BTL_VERBOSE ( ( " openib: only taking infiniband devices -- skipping %s " ,
ibv_get_device_name ( dev_sorted [ i ] . ib_dev ) ) ) ;
continue ;
}
# endif
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
2006-08-14 19:30:37 +00:00
break ;
2006-09-19 08:56:32 +00:00
2008-10-06 00:46:02 +00:00
case BTL_OPENIB_DT_IWARP :
# if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
if ( IBV_TRANSPORT_IB = = dev_sorted [ i ] . ib_dev - > transport_type ) {
BTL_VERBOSE ( ( " openib: only taking iwarp devices -- skipping %s " ,
ibv_get_device_name ( dev_sorted [ i ] . ib_dev ) ) ) ;
continue ;
}
# else
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " no iwarp support " ,
2008-10-06 00:46:02 +00:00
true ) ;
# endif
break ;
case BTL_OPENIB_DT_ALL :
break ;
}
2007-06-14 01:59:25 +00:00
2008-10-06 00:46:02 +00:00
found = true ;
2013-06-05 12:12:09 +00:00
ret = init_one_device ( & btl_list , dev_sorted [ i ] . ib_dev ) ;
2015-04-25 05:49:32 -07:00
if ( OPAL_ERR_NOT_SUPPORTED = = ret ) {
+ + num_devices_intentionally_ignored ;
continue ;
} else if ( OPAL_SUCCESS ! = ret ) {
2008-10-06 00:46:02 +00:00
free ( dev_sorted ) ;
goto no_btls ;
}
}
2008-02-11 10:34:11 +00:00
free ( dev_sorted ) ;
2008-10-06 00:46:02 +00:00
if ( ! found ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " no devices right type " ,
2014-10-03 14:19:48 -07:00
true , opal_process_info . nodename ,
2008-10-06 00:46:02 +00:00
( ( BTL_OPENIB_DT_IB = = mca_btl_openib_component . device_type ) ?
" InfiniBand " :
( BTL_OPENIB_DT_IWARP = = mca_btl_openib_component . device_type ) ?
" iWARP " : " <any> " ) ) ;
goto no_btls ;
}
2008-02-11 10:34:11 +00:00
2008-07-23 00:28:59 +00:00
/* If we got back from checking all the devices and find that
there are still items in the component . if_list , that means that
they didn ' t exist . Show an appropriate warning if the warning
was not disabled . */
2007-06-14 01:59:25 +00:00
if ( 0 ! = opal_argv_count ( mca_btl_openib_component . if_list ) & &
mca_btl_openib_component . warn_nonexistent_if ) {
char * str = opal_argv_join ( mca_btl_openib_component . if_list , ' , ' ) ;
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " , " nonexistent port " ,
2014-10-03 14:19:48 -07:00
true , opal_process_info . nodename ,
2008-01-21 12:11:18 +00:00
( ( NULL ! = mca_btl_openib_component . if_include ) ?
2007-06-14 01:59:25 +00:00
" in " : " ex " ) , str ) ;
free ( str ) ;
}
2008-01-21 12:11:18 +00:00
2006-09-19 08:56:32 +00:00
if ( 0 = = mca_btl_openib_component . ib_num_btls ) {
2013-06-05 12:12:09 +00:00
/* If there were unusable devices that weren't specifically
ignored , warn about it */
if ( num_devices_intentionally_ignored < num_devs ) {
opal_show_help ( " help-mpi-btl-openib.txt " ,
2015-06-23 20:59:57 -07:00
" no active ports found " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ) ;
2013-06-05 12:12:09 +00:00
}
2008-10-06 00:46:02 +00:00
goto no_btls ;
2006-09-19 08:56:32 +00:00
}
2015-04-25 04:33:42 -07:00
/* Now that we know we have devices and ports that we want to use,
init CPC components */
if ( OPAL_SUCCESS ! = ( ret = opal_btl_openib_connect_base_init ( ) ) ) {
goto no_btls ;
}
2008-05-20 21:53:42 +00:00
/* Setup the BSRQ QP's based on the final value of
mca_btl_openib_component . receive_queues . */
2014-07-26 00:47:28 +00:00
if ( OPAL_SUCCESS ! = setup_qps ( ) ) {
2008-10-06 00:46:02 +00:00
goto no_btls ;
2008-05-28 11:36:38 +00:00
}
2010-02-18 09:48:16 +00:00
if ( mca_btl_openib_component . num_srq_qps > 0 | |
mca_btl_openib_component . num_xrc_qps > 0 ) {
opal_hash_table_t * srq_addr_table = & mca_btl_openib_component . srq_manager . srq_addr_table ;
if ( OPAL_SUCCESS ! = opal_hash_table_init (
srq_addr_table , ( mca_btl_openib_component . num_srq_qps +
mca_btl_openib_component . num_xrc_qps ) *
mca_btl_openib_component . ib_num_btls ) ) {
BTL_ERROR ( ( " SRQ internal error. Failed to allocate SRQ addr hash table " ) ) ;
goto no_btls ;
}
}
2008-05-20 21:53:42 +00:00
2011-07-04 14:00:41 +00:00
/* For XRC:
2008-05-28 11:31:38 +00:00
* from this point we know if MCA_BTL_XRC_ENABLED is true or false */
/* Init XRC IB Addr hash table */
if ( MCA_BTL_XRC_ENABLED ) {
OBJ_CONSTRUCT ( & mca_btl_openib_component . ib_addr_table ,
opal_hash_table_t ) ;
}
2005-06-30 21:28:35 +00:00
/* Allocate space for btl modules */
2006-06-28 07:23:08 +00:00
mca_btl_openib_component . openib_btls =
2010-07-12 16:17:56 +00:00
( mca_btl_openib_module_t * * ) malloc ( sizeof ( mca_btl_openib_module_t * ) *
2006-06-28 07:23:08 +00:00
mca_btl_openib_component . ib_num_btls ) ;
2005-07-12 13:38:54 +00:00
if ( NULL = = mca_btl_openib_component . openib_btls ) {
2008-05-02 11:52:33 +00:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2008-10-06 00:46:02 +00:00
goto no_btls ;
2005-06-30 21:28:35 +00:00
}
2008-02-03 15:16:24 +00:00
btls = ( struct mca_btl_base_module_t * * )
malloc ( mca_btl_openib_component . ib_num_btls *
sizeof ( struct mca_btl_base_module_t * ) ) ;
2005-06-30 21:28:35 +00:00
if ( NULL = = btls ) {
2008-05-02 11:52:33 +00:00
BTL_ERROR ( ( " Failed malloc: %s:%d " , __FILE__ , __LINE__ ) ) ;
2008-10-06 00:46:02 +00:00
goto no_btls ;
2005-06-30 21:28:35 +00:00
}
2006-08-14 19:30:37 +00:00
/* Copy the btl module structs into a contiguous array and fully
initialize them */
2009-06-11 17:30:30 +00:00
i = 0 ;
while ( NULL ! = ( item = opal_list_remove_first ( & btl_list ) ) ) {
2008-01-21 12:11:18 +00:00
ib_selected = ( mca_btl_base_selected_module_t * ) item ;
2008-05-28 11:31:38 +00:00
openib_btl = ( mca_btl_openib_module_t * ) ib_selected - > btl_module ;
2009-06-11 17:30:30 +00:00
/* Search for a CPC that can handle this port */
2014-07-26 00:47:28 +00:00
ret = opal_btl_openib_connect_base_select_for_local_port ( openib_btl ) ;
2009-06-11 17:30:30 +00:00
/* If we get NOT_SUPPORTED, then no CPC was found for this
port . But that ' s not a fatal error - - just keep going ;
let ' s see if we find any usable openib modules or not . */
2014-07-26 00:47:28 +00:00
if ( OPAL_ERR_NOT_SUPPORTED = = ret ) {
2009-06-11 17:30:30 +00:00
continue ;
2014-07-26 00:47:28 +00:00
} else if ( OPAL_SUCCESS ! = ret ) {
2009-06-11 17:30:30 +00:00
/* All others *are* fatal. Note that we already did a
show_help in the lower layer */
2008-10-06 00:46:02 +00:00
goto no_btls ;
2008-05-28 11:31:38 +00:00
}
2010-12-23 11:48:43 +00:00
if ( mca_btl_openib_component . max_hw_msg_size > 0 & &
2011-02-25 00:36:08 +00:00
( uint32_t ) mca_btl_openib_component . max_hw_msg_size > openib_btl - > ib_port_attr . max_msg_sz ) {
2011-02-23 15:50:37 +00:00
BTL_ERROR ( ( " max_hw_msg_size (% " PRIu32 " ) is larger than hw max message size (% " PRIu32 " ) " ,
mca_btl_openib_component . max_hw_msg_size , openib_btl - > ib_port_attr . max_msg_sz ) ) ;
2010-12-23 11:48:43 +00:00
}
2008-05-28 11:31:38 +00:00
mca_btl_openib_component . openib_btls [ i ] = openib_btl ;
2008-01-21 12:11:18 +00:00
OBJ_RELEASE ( ib_selected ) ;
2007-10-23 11:10:52 +00:00
btls [ i ] = & openib_btl - > super ;
2014-07-26 00:47:28 +00:00
if ( finish_btl_init ( openib_btl ) ! = OPAL_SUCCESS ) {
2008-10-06 00:46:02 +00:00
goto no_btls ;
}
2009-06-11 17:30:30 +00:00
+ + i ;
}
/* If we got nothing, then error out */
if ( 0 = = i ) {
goto no_btls ;
}
/* Otherwise reset to the number of openib modules that we
actually got */
mca_btl_openib_component . ib_num_btls = i ;
2005-06-30 21:28:35 +00:00
2006-08-14 19:30:37 +00:00
btl_openib_modex_send ( ) ;
2005-06-30 21:28:35 +00:00
* num_btl_modules = mca_btl_openib_component . ib_num_btls ;
2014-07-26 00:47:28 +00:00
opal_ibv_free_device_list ( ib_devs ) ;
2007-06-14 01:59:25 +00:00
if ( NULL ! = mca_btl_openib_component . if_include_list ) {
opal_argv_free ( mca_btl_openib_component . if_include_list ) ;
mca_btl_openib_component . if_include_list = NULL ;
}
if ( NULL ! = mca_btl_openib_component . if_exclude_list ) {
opal_argv_free ( mca_btl_openib_component . if_exclude_list ) ;
mca_btl_openib_component . if_exclude_list = NULL ;
}
2011-07-04 14:00:41 +00:00
2016-04-08 14:38:09 -06:00
# if OPAL_CUDA_SUPPORT
if ( mca_btl_openib_component . cuda_want_gdr & & ( 0 = = opal_leave_pinned ) ) {
opal_show_help ( " help-mpi-btl-openib.txt " ,
" CUDA_gdr_and_nopinned " , true ,
opal_process_info . nodename ) ;
goto no_btls ;
}
# endif /* OPAL_CUDA_SUPPORT */
2013-12-23 17:47:43 +00:00
mca_btl_openib_component . memory_registration_verbose = opal_output_open ( NULL ) ;
opal_output_set_verbosity ( mca_btl_openib_component . memory_registration_verbose ,
mca_btl_openib_component . memory_registration_verbose_level ) ;
2008-08-06 14:22:03 +00:00
/* setup the fork warning message as we are sensitive
* to memory corruption issues when fork is called
*/
2014-07-26 00:47:28 +00:00
opal_warn_fork ( ) ;
2005-06-30 21:28:35 +00:00
return btls ;
2007-06-14 01:59:25 +00:00
no_btls :
/* If we fail early enough in the setup, we just modex around that
there are no openib BTL ' s in this process and return NULL . */
mca_btl_openib_component . ib_num_btls = 0 ;
btl_openib_modex_send ( ) ;
2015-03-03 13:22:01 +09:00
if ( NULL ! = btls ) {
free ( btls ) ;
}
2007-06-14 01:59:25 +00:00
return NULL ;
2005-06-30 21:28:35 +00:00
}
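The device loop in the function above walks the distance-sorted devices, stops at the first device that is farther away than the nearest one once at least one module exists, and honors the max_btls cap (-1 meaning "use all"). The following is a minimal, self-contained sketch of that selection rule only. The names sketch_dev_t, SKETCH_EPS, cmp_distance, and select_near_devices are hypothetical and do not exist in Open MPI; the tolerance value is an assumption; and the sketch simplifies by counting one module per usable device, whereas the real code may create several per device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Assumed tolerance for comparing distances; the real EPS is defined
   elsewhere in the source file. */
#define SKETCH_EPS 1.e-6

/* Hypothetical stand-in for one entry of the sorted device array. */
typedef struct {
    const char *name;   /* device name, for illustration only */
    double distance;    /* locality distance from this process */
    bool usable;        /* false models an intentionally ignored device */
} sketch_dev_t;

/* qsort comparator: ascending by distance */
static int cmp_distance(const void *a, const void *b)
{
    const sketch_dev_t *da = a, *db = b;
    return (da->distance > db->distance) - (da->distance < db->distance);
}

/* Sort devs in place by distance and store pointers to the selected
   devices in selected[] (which must have room for num_devs entries).
   max_btls == -1 means "no cap".  Returns the number selected. */
static int select_near_devices(sketch_dev_t *devs, int num_devs, int max_btls,
                               const sketch_dev_t **selected)
{
    int count = 0;
    double nearest;

    assert(NULL != devs && NULL != selected);
    if (num_devs <= 0) {
        return 0;
    }

    qsort(devs, (size_t) num_devs, sizeof(devs[0]), cmp_distance);
    nearest = devs[0].distance;

    for (int i = 0; i < num_devs && (-1 == max_btls || count < max_btls); ++i) {
        /* Once something has been selected, stop at the first device
           that is farther away than the nearest one. */
        if (0 != count && (devs[i].distance - nearest) > SKETCH_EPS) {
            break;
        }
        /* Unusable devices are skipped without ending the search. */
        if (!devs[i].usable) {
            continue;
        }
        selected[count++] = &devs[i];
    }
    return count;
}
```

Because the distance cutoff only applies after something has been selected, a farther device is still taken when every nearer device turns out to be unusable, which mirrors the `0 != ib_num_btls` guard in the loop above.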
2009-04-23 20:04:39 +00:00
/*
* Progress the no_credits_pending_frags lists on all qp ' s
*/
static int progress_no_credits_pending_frags ( mca_btl_base_endpoint_t * ep )
2006-08-30 20:21:47 +00:00
{
2009-04-23 20:04:39 +00:00
int qp , pri , rc , len ;
2008-01-09 10:27:15 +00:00
opal_list_item_t * frag ;
2006-12-14 15:52:13 +00:00
2008-01-09 10:27:15 +00:00
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
2008-01-21 12:11:18 +00:00
2009-04-23 20:04:39 +00:00
/* Traverse all QPs and all priorities */
for ( qp = 0 ; qp < mca_btl_openib_component . num_qps ; + + qp ) {
for ( pri = 0 ; pri < 2 ; + + pri ) {
/* Note that entries in the no_credits_pending_frags list
may be eager RDMA or send fragments . So be sure to
check that we have at least 1 RDMA or send credit .
This loop needs a little explaining . : - \
In the body of the loop , we call _endpoint_post_send ( ) .
The frag will either be successfully sent , or it will
be [ re ] added to the no_credits_pending_frags list . So
if we keep trying to drain the no_credits_pending_frags
list , we could end up in an infinite loop . So instead ,
we get the initial length of the list and ensure to run
through every entry at least once . This attempts to
send * every * frag once and catches the case where a
frag may be on the RDMA list , but because of
coalescing , is now too big for RDMA and defaults over
to sending - - but then we ' re out of send credits , so it
doesn ' t go . But if we * do * still have some RDMA
credits and there are RDMA frags on the list behind
this now - too - big frag , they ' ll get a chance to go .
Specifically , the condition in this for loop is as follows :
- len > 0 : ensure to go through all entries in the list once
- the 2 nd part of the conditional checks to see if we
have any credits at all . Specifically , do we have
any RDMA credits or any send credits , * or * are we on
an SRQ , in which case we define that we * always * have
credits ( because the hardware will continually
retransmit for us ) .
*/
for ( len = opal_list_get_size ( & ep - > qps [ qp ] . no_credits_pending_frags [ pri ] ) ;
2011-07-04 14:00:41 +00:00
len > 0 & &
2009-04-23 20:04:39 +00:00
( ep - > eager_rdma_remote . tokens > 0 | |
ep - > qps [ qp ] . u . pp_qp . sd_credits > 0 | |
! BTL_OPENIB_QP_TYPE_PP ( qp ) ) ; - - len ) {
frag = opal_list_remove_first ( & ep - > qps [ qp ] . no_credits_pending_frags [ pri ] ) ;
2015-05-27 09:43:42 -06:00
assert ( NULL ! = frag ) ;
2009-04-23 20:04:39 +00:00
/* If _endpoint_post_send() fails because of
RESOURCE_BUSY, then the frag was re-added to the
no_credits_pending list. Specifically: either the
frag was initially an RDMA frag, but there were no
RDMA credits so it fell through to trying to send,
but we had no send credits and therefore re-added
the frag to the no_credits list, or the frag was a
send frag initially (and the same sequence
occurred, starting at the send frag out-of-credits
scenario). In this case, just continue and try the
rest of the frags in the list.
If it fails because of another error, return the
error upward. */
rc = mca_btl_openib_endpoint_post_send ( ep , to_send_frag ( frag ) ) ;
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
if ( OPAL_UNLIKELY ( OPAL_SUCCESS ! = rc & &
OPAL_ERR_RESOURCE_BUSY ! = rc ) ) {
2009-04-23 20:04:39 +00:00
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
return rc ;
}
2009-04-21 16:57:17 +00:00
}
2009-04-23 20:04:39 +00:00
}
2008-01-09 10:27:15 +00:00
}
2009-04-23 20:04:39 +00:00
2008-01-09 10:27:15 +00:00
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2008-01-09 10:27:15 +00:00
}
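The single-pass drain used by progress_no_credits_pending_frags() can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the openib BTL code: frag_queue_t, try_post() and drain_once() are hypothetical names, a fixed-size ring stands in for opal_list_t, and the fallback from RDMA to send for oversized frags is left out. The point it shows is that snapshotting the list length before the loop bounds the number of attempts even though failed frags are put back on the same list.

```c
#include <assert.h>

#define QUEUE_CAP 8

typedef enum { FRAG_SEND, FRAG_RDMA } frag_kind_t;

typedef struct {
    int id;
    frag_kind_t kind;
} frag_t;

/* Hypothetical fixed-size FIFO standing in for opal_list_t. */
typedef struct {
    frag_t items[QUEUE_CAP];
    int head;
    int count;
} frag_queue_t;

static void queue_push(frag_queue_t *q, frag_t f)
{
    assert(q->count < QUEUE_CAP);
    q->items[(q->head + q->count) % QUEUE_CAP] = f;
    q->count++;
}

static frag_t queue_pop(frag_queue_t *q)
{
    frag_t f;
    assert(q->count > 0);
    f = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return f;
}

/* Hypothetical post: spends one credit of the matching kind, or puts the
   frag back at the tail of the queue when that kind has no credits left.
   Returns 1 if the frag was sent, 0 if it was re-queued. */
static int try_post(frag_queue_t *q, frag_t f,
                    int *send_credits, int *rdma_credits)
{
    int *credits = (FRAG_RDMA == f.kind) ? rdma_credits : send_credits;
    if (*credits > 0) {
        --*credits;
        return 1;
    }
    queue_push(q, f);
    return 0;
}

/* Attempt each queued frag at most once. The length is read before the
   loop, so frags re-queued by try_post() are not retried in this call. */
static int drain_once(frag_queue_t *q, int *send_credits, int *rdma_credits)
{
    int sent = 0;
    int len;
    for (len = q->count;
         len > 0 && (*send_credits > 0 || *rdma_credits > 0);
         --len) {
        frag_t f = queue_pop(q);
        sent += try_post(q, f, send_credits, rdma_credits);
    }
    return sent;
}
```

With two send frags followed by one RDMA frag and one credit of each kind, the second send frag is re-queued, the RDMA frag behind it still goes out, and the call returns after three attempts.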
void mca_btl_openib_frag_progress_pending_put_get ( mca_btl_base_endpoint_t * ep ,
const int qp )
2008-01-21 12:11:18 +00:00
{
2008-01-09 10:27:15 +00:00
mca_btl_openib_module_t * openib_btl = ep - > endpoint_btl ;
opal_list_item_t * frag ;
size_t i , len = opal_list_get_size ( & ep - > pending_get_frags ) ;
2010-10-01 18:59:15 +00:00
int rc ;
2008-01-09 10:27:15 +00:00
2015-01-06 08:48:33 -07:00
for ( i = 0 ; i < len & & ep - > qps [ qp ] . qp - > sd_wqe > 0 & & ep - > get_tokens > 0 ; i + + ) {
2008-01-09 10:27:15 +00:00
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
frag = opal_list_remove_first ( & ( ep - > pending_get_frags ) ) ;
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
2015-01-06 08:48:33 -07:00
if ( NULL = = frag )
2008-01-09 10:27:15 +00:00
break ;
2015-01-06 08:48:33 -07:00
rc = mca_btl_openib_get_internal ( ( mca_btl_base_module_t * ) openib_btl , ep ,
to_get_frag ( frag ) ) ;
if ( OPAL_ERR_OUT_OF_RESOURCE = = rc ) {
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
opal_list_prepend ( & ep - > pending_get_frags , frag ) ;
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
2008-01-09 10:27:15 +00:00
break ;
2015-01-06 08:48:33 -07:00
}
2008-01-09 10:27:15 +00:00
}
2008-01-21 12:11:18 +00:00
2008-01-09 10:27:15 +00:00
len = opal_list_get_size ( & ep - > pending_put_frags ) ;
for ( i = 0 ; i < len & & ep - > qps [ qp ] . qp - > sd_wqe > 0 ; i + + ) {
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
frag = opal_list_remove_first ( & ( ep - > pending_put_frags ) ) ;
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
2015-01-06 08:48:33 -07:00
if ( NULL = = frag )
2008-01-09 10:27:15 +00:00
break ;
2015-01-06 08:48:33 -07:00
rc = mca_btl_openib_put_internal ( ( mca_btl_base_module_t * ) openib_btl , ep ,
to_put_frag ( frag ) ) ;
if ( OPAL_ERR_OUT_OF_RESOURCE = = rc ) {
OPAL_THREAD_LOCK ( & ep - > endpoint_lock ) ;
opal_list_prepend ( & ep - > pending_put_frags , frag ) ;
OPAL_THREAD_UNLOCK ( & ep - > endpoint_lock ) ;
2008-01-09 10:27:15 +00:00
break ;
2015-01-06 08:48:33 -07:00
}
2008-01-09 10:27:15 +00:00
}
2007-11-28 07:13:34 +00:00
}
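The put/get progress function above follows a "remove first, try, put back at the front on failure" pattern, which keeps pending operations in their original order. A minimal sketch of that pattern, assuming a hypothetical ring-buffer list (pending_list_t) and a made-up cost model in which each operation needs a given number of work queue entries; neither is taken from the openib BTL:

```c
#include <assert.h>
#include <stddef.h>

#define PENDING_CAP 8

/* Hypothetical ring buffer standing in for the pending put/get lists.
   Each item is the number of work queue entries the operation needs. */
typedef struct {
    int items[PENDING_CAP];
    size_t head;   /* index of the first element */
    size_t count;
} pending_list_t;

static void pending_append(pending_list_t *l, int op)
{
    assert(l->count < PENDING_CAP);
    l->items[(l->head + l->count) % PENDING_CAP] = op;
    l->count++;
}

static void pending_prepend(pending_list_t *l, int op)
{
    assert(l->count < PENDING_CAP);
    l->head = (l->head + PENDING_CAP - 1) % PENDING_CAP;
    l->items[l->head] = op;
    l->count++;
}

/* Returns 1 and stores the first element in *op, or 0 if the list is empty. */
static int pending_remove_first(pending_list_t *l, int *op)
{
    if (0 == l->count) {
        return 0;
    }
    *op = l->items[l->head];
    l->head = (l->head + 1) % PENDING_CAP;
    l->count--;
    return 1;
}

/* Hypothetical post: returns 0 on success, -1 when out of resources. */
static int post_op(int op, int *wqe_free)
{
    if (op > *wqe_free) {
        return -1;
    }
    *wqe_free -= op;
    return 0;
}

/* Post pending operations in order; on the first out-of-resource failure,
   put the operation back at the front and stop. Returns the number posted. */
static size_t progress_pending(pending_list_t *l, int *wqe_free)
{
    size_t posted = 0;
    size_t len = l->count;
    size_t i;
    for (i = 0; i < len && *wqe_free > 0; i++) {
        int op;
        if (!pending_remove_first(l, &op)) {
            break;
        }
        if (0 != post_op(op, wqe_free)) {
            pending_prepend(l, op);
            break;
        }
        posted++;
    }
    return posted;
}
```

Prepending rather than appending on failure means the operation that could not be posted is the first one retried on the next call.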
2006-08-30 20:21:47 +00:00
2006-09-12 09:17:59 +00:00
static int btl_openib_handle_incoming ( mca_btl_openib_module_t * openib_btl ,
2007-11-28 07:13:34 +00:00
mca_btl_openib_endpoint_t * ep ,
2008-01-21 12:11:18 +00:00
mca_btl_openib_recv_frag_t * frag ,
2007-11-28 07:12:44 +00:00
size_t byte_len )
2006-03-26 08:30:50 +00:00
{
2007-11-28 07:11:14 +00:00
mca_btl_base_descriptor_t * des = & to_base_frag ( frag ) - > base ;
mca_btl_openib_header_t * hdr = frag - > hdr ;
2007-11-28 07:13:34 +00:00
int rqp = to_base_frag ( frag ) - > base . order , cqp ;
uint16_t rcredits = 0 , credits ;
bool is_credit_msg ;
2007-11-28 07:11:14 +00:00
2007-11-28 07:13:34 +00:00
if ( ep - > nbo ) {
2007-11-28 07:11:14 +00:00
BTL_OPENIB_HEADER_NTOH ( * hdr ) ;
2007-01-12 23:14:45 +00:00
}
2007-03-14 14:36:03 +00:00
2006-09-12 09:17:59 +00:00
/* advance the segment address past the header and subtract from the
2007-11-28 07:13:34 +00:00
* length . */
2015-01-06 08:48:33 -07:00
des - > des_segments - > seg_len = byte_len - sizeof ( mca_btl_openib_header_t ) ;
2006-03-26 08:30:50 +00:00
2007-11-28 07:13:34 +00:00
if ( OPAL_LIKELY ( ! ( is_credit_msg = is_credit_message ( frag ) ) ) ) {
/* call registered callback */
2008-01-15 05:32:53 +00:00
mca_btl_active_message_callback_t * reg ;
2013-01-17 22:34:43 +00:00
2013-11-01 12:19:40 +00:00
# if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
2013-01-17 22:34:43 +00:00
/* The COPY_ASYNC flag should not be set */
assert ( 0 = = ( des - > des_flags & MCA_BTL_DES_FLAGS_CUDA_COPY_ASYNC ) ) ;
2013-11-01 12:19:40 +00:00
# endif /* OPAL_CUDA_SUPPORT */
2008-01-15 05:32:53 +00:00
reg = mca_btl_base_active_message_trigger + hdr - > tag ;
reg - > cbfunc ( & openib_btl - > super , hdr - > tag , des , reg - > cbdata ) ;
2013-11-01 12:19:40 +00:00
# if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
2015-06-23 20:59:57 -07:00
if ( des - > des_flags & MCA_BTL_DES_FLAGS_CUDA_COPY_ASYNC ) {
/* Since ASYNC flag is set, we know this descriptor is being used
2013-01-17 22:34:43 +00:00
* for asynchronous copy and cannot be freed yet . Therefore , set
* up callback for PML to call when complete , add argument into
* descriptor and return . */
des - > des_cbfunc = btl_openib_handle_incoming_completion ;
2015-03-10 14:34:38 -06:00
to_in_frag ( des ) - > endpoint = ep ;
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2013-01-17 22:34:43 +00:00
}
2013-11-01 12:19:40 +00:00
# endif /* OPAL_CUDA_SUPPORT */
2007-12-09 13:56:13 +00:00
if ( MCA_BTL_OPENIB_RDMA_FRAG ( frag ) ) {
cqp = ( hdr - > credits > > 11 ) & 0x0f ;
hdr - > credits & = 0x87ff ;
} else {
cqp = rqp ;
}
2007-11-28 07:13:34 +00:00
if ( BTL_OPENIB_IS_RDMA_CREDITS ( hdr - > credits ) ) {
rcredits = BTL_OPENIB_CREDITS ( hdr - > credits ) ;
hdr - > credits = 0 ;
2007-11-28 07:12:44 +00:00
}
2007-07-24 13:23:08 +00:00
} else {
2010-07-12 16:17:56 +00:00
mca_btl_openib_rdma_credits_header_t * chdr =
2015-01-06 08:48:33 -07:00
( mca_btl_openib_rdma_credits_header_t * ) des - > des_segments - > seg_addr . pval ;
2007-11-28 07:13:34 +00:00
if ( ep - > nbo ) {
BTL_OPENIB_RDMA_CREDITS_HEADER_NTOH ( * chdr ) ;
}
cqp = chdr - > qpn ;
rcredits = chdr - > rdma_credits ;
}
credits = hdr - > credits ;
2007-11-28 07:12:44 +00:00
2007-11-28 07:13:34 +00:00
if ( hdr - > cm_seen )
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . cm_sent , - hdr - > cm_seen ) ;
/* Now return fragment. Don't touch hdr after this point! */
if ( MCA_BTL_OPENIB_RDMA_FRAG ( frag ) ) {
mca_btl_openib_eager_rdma_local_t * erl = & ep - > eager_rdma_local ;
OPAL_THREAD_LOCK ( & erl - > lock ) ;
MCA_BTL_OPENIB_RDMA_MAKE_REMOTE ( frag - > ftr ) ;
while ( erl - > tail ! = erl - > head ) {
mca_btl_openib_recv_frag_t * tf ;
tf = MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG ( ep , erl - > tail ) ;
if ( MCA_BTL_OPENIB_RDMA_FRAG_LOCAL ( tf ) )
break ;
OPAL_THREAD_ADD32 ( & erl - > credits , 1 ) ;
MCA_BTL_OPENIB_RDMA_NEXT_INDEX ( erl - > tail ) ;
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
2. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
This adds Galen's fine-grain control of queue pair resources to the
OpenIB BTL (thanks to Gleb for fixing broken code and providing
additional functionality, Pasha for finding broken code, and Jeff for
doing all the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over receive buffer size, number of receive buffers to post,
and when to replenish the receive queue (low water mark); for SRQ QPs,
the number of outstanding sends can also be specified. The following
is an example of the syntax to describe QPs to the OpenIB BTL using
the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
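The receive_queues syntax described in this commit message can be illustrated with a small standalone parser. This is only a sketch of the format as described above, not the parser used by the openib BTL: qp_spec_t and parse_receive_queues() are hypothetical names, the 256-byte buffer limit is arbitrary, and "ascending" is interpreted here as non-decreasing buffer size.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_QPS 8

/* Hypothetical in-memory form of one QP description. */
typedef struct {
    char type;               /* 'P' = per-peer, 'S' = shared receive queue */
    unsigned long buf_size;  /* receive buffer size in bytes */
    int num_bufs;            /* number of receive buffers to allocate */
    int low_watermark;       /* repost receive buffers at this level */
    int max_sends;           /* 'S' only: outstanding sends allowed */
} qp_spec_t;

/* Parse "P,128,16,4;S,1024,256,128,32;..." into out[].
   Returns the number of QPs parsed, or -1 on a malformed or
   out-of-order specification. */
static int parse_receive_queues(const char *spec, qp_spec_t *out, int max)
{
    char copy[256];
    char *tok;
    int n = 0;

    assert(spec != NULL && out != NULL);
    if (strlen(spec) >= sizeof(copy)) {
        return -1;
    }
    strcpy(copy, spec);

    /* QP descriptions are separated by ';', fields by ','. */
    for (tok = strtok(copy, ";"); tok != NULL; tok = strtok(NULL, ";")) {
        qp_spec_t qp = { 0, 0, 0, 0, 0 };
        int fields;

        if (n == max) {
            return -1;
        }
        fields = sscanf(tok, "%c,%lu,%d,%d,%d", &qp.type, &qp.buf_size,
                        &qp.num_bufs, &qp.low_watermark, &qp.max_sends);
        /* Per-peer QPs carry 4 fields, shared receive queue QPs carry 5. */
        if (!(('P' == qp.type && 4 == fields) ||
              ('S' == qp.type && 5 == fields))) {
            return -1;
        }
        /* QPs must be listed in ascending receive buffer size order. */
        if (n > 0 && qp.buf_size < out[n - 1].buf_size) {
            return -1;
        }
        out[n++] = qp;
    }
    return n;
}
```

Applied to the example string above, this yields four entries: one per-peer QP with 128-byte buffers and three SRQ QPs with 1024, 4096 and 65536-byte buffers.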
2007-07-18 01:15:59 +00:00
}
2007-11-28 07:13:34 +00:00
OPAL_THREAD_UNLOCK ( & erl - > lock ) ;
} else {
2008-10-06 00:46:02 +00:00
if ( is_cts_message ( frag ) ) {
/* If this was a CTS, free it here (it was
malloc'ed + ibv_reg_mr'ed -- so it should *not* be
FRAG_RETURN'ed). */
2014-07-26 00:47:28 +00:00
int rc = opal_btl_openib_connect_base_free_cts ( ep ) ;
if ( OPAL_SUCCESS ! = rc ) {
2008-10-06 00:46:02 +00:00
return rc ;
}
2007-11-28 07:20:26 +00:00
} else {
2008-10-06 00:46:02 +00:00
/* Otherwise, FRAG_RETURN it and repost if necessary */
MCA_BTL_IB_FRAG_RETURN ( frag ) ;
if ( BTL_OPENIB_QP_TYPE_PP ( rqp ) ) {
if ( OPAL_UNLIKELY ( is_credit_msg ) ) {
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . cm_received , 1 ) ;
} else {
OPAL_THREAD_ADD32 ( & ep - > qps [ rqp ] . u . pp_qp . rd_posted , - 1 ) ;
}
mca_btl_openib_endpoint_post_rr ( ep , cqp ) ;
} else {
mca_btl_openib_module_t * btl = ep - > endpoint_btl ;
OPAL_THREAD_ADD32 ( & btl - > qps [ rqp ] . u . srq_qp . rd_posted , - 1 ) ;
mca_btl_openib_post_srr ( btl , rqp ) ;
}
2007-11-28 07:13:34 +00:00
}
}
assert ( ( cqp ! = MCA_BTL_NO_ORDER & & BTL_OPENIB_QP_TYPE_PP ( cqp ) ) | | ! credits ) ;
2009-04-23 20:04:39 +00:00
/* If we got any credits (RDMA or send), then try to progress all
the no_credits_pending_frags lists */
if ( rcredits > 0 ) {
OPAL_THREAD_ADD32 ( & ep - > eager_rdma_remote . tokens , rcredits ) ;
}
if ( credits > 0 ) {
2007-11-28 07:13:34 +00:00
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . sd_credits , credits ) ;
2009-04-23 20:04:39 +00:00
}
if ( rcredits + credits > 0 ) {
int rc ;
2014-07-26 00:47:28 +00:00
if ( OPAL_SUCCESS ! =
2009-04-23 20:04:39 +00:00
( rc = progress_no_credits_pending_frags ( ep ) ) ) {
return rc ;
}
2007-11-28 07:13:34 +00:00
}
2007-12-09 13:56:13 +00:00
send_credits ( ep , cqp ) ;
2007-11-28 07:13:34 +00:00
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2006-03-26 08:30:50 +00:00
}
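The shift and mask applied to hdr->credits for eager RDMA frags, (credits >> 11) & 0x0f followed by credits &= 0x87ff, imply that bits 11 to 14 of the 16-bit field carry a QP index and that the mask clears exactly those bits. A minimal sketch of that packing follows; pack_credits() and unpack_qp() are hypothetical helpers, and treating bit 15 as the "RDMA credits" flag is an assumption about what BTL_OPENIB_IS_RDMA_CREDITS tests, not something shown in this part of the file.

```c
#include <assert.h>
#include <stdint.h>

/* Layout implied by the shift and mask in the code above:
   bits 11..14 hold a QP index, bits 0..10 hold a credit count.
   Bit 15 as an RDMA-credits flag is an assumption. */
#define CREDITS_QP_SHIFT   11
#define CREDITS_QP_MASK    0x0f
#define CREDITS_KEEP_MASK  0x87ff   /* clears bits 11..14 only */
#define CREDITS_RDMA_FLAG  0x8000   /* assumed flag bit */

/* Hypothetical sender-side helper: build the 16-bit credits field. */
static uint16_t pack_credits(unsigned qp, unsigned count, int is_rdma)
{
    assert(qp <= CREDITS_QP_MASK && count <= 0x07ff);
    return (uint16_t)((is_rdma ? CREDITS_RDMA_FLAG : 0) |
                      (qp << CREDITS_QP_SHIFT) | count);
}

/* Receiver side: extract the QP index and strip it from the field,
   leaving the flag bit and the count untouched. */
static unsigned unpack_qp(uint16_t *credits)
{
    unsigned qp = (unsigned)(*credits >> CREDITS_QP_SHIFT) & CREDITS_QP_MASK;
    *credits = (uint16_t)(*credits & CREDITS_KEEP_MASK);
    return qp;
}
```

For example, a field built from QP index 3 and 5 credits is 0x1805; unpack_qp() returns 3 and leaves 0x0005 behind.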
2013-11-01 12:19:40 +00:00
# if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_RECV */
2013-01-17 22:34:43 +00:00
/**
* Called by the PML when the copying of the data out of the fragment
* is complete .
*/
2013-05-10 21:29:08 +00:00
static void btl_openib_handle_incoming_completion ( mca_btl_base_module_t * btl ,
2013-01-17 22:34:43 +00:00
mca_btl_base_endpoint_t * ep ,
mca_btl_base_descriptor_t * des ,
int status )
{
mca_btl_openib_recv_frag_t * frag = ( mca_btl_openib_recv_frag_t * ) des ;
mca_btl_openib_header_t * hdr = frag - > hdr ;
int rqp = to_base_frag ( frag ) - > base . order , cqp ;
uint16_t rcredits = 0 , credits ;
2015-03-10 14:34:38 -06:00
ep = to_in_frag ( des ) - > endpoint ;
2013-01-17 22:34:43 +00:00
OPAL_OUTPUT ( ( - 1 , " handle_incoming_complete frag=%p " , ( void * ) des ) ) ;
if ( MCA_BTL_OPENIB_RDMA_FRAG ( frag ) ) {
cqp = ( hdr - > credits > > 11 ) & 0x0f ;
hdr - > credits & = 0x87ff ;
} else {
cqp = rqp ;
}
if ( BTL_OPENIB_IS_RDMA_CREDITS ( hdr - > credits ) ) {
rcredits = BTL_OPENIB_CREDITS ( hdr - > credits ) ;
hdr - > credits = 0 ;
}
credits = hdr - > credits ;
if ( hdr - > cm_seen )
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . cm_sent , - hdr - > cm_seen ) ;
2013-05-10 21:29:08 +00:00
/* We should not be here with eager, control, or credit messages */
2013-01-17 22:34:43 +00:00
assert ( openib_frag_type ( frag ) ! = MCA_BTL_OPENIB_FRAG_EAGER_RDMA ) ;
assert ( 0 = = is_cts_message ( frag ) ) ;
2013-05-10 21:29:08 +00:00
assert ( 0 = = is_credit_message ( frag ) ) ;
2013-01-17 22:34:43 +00:00
/* HACK - clear out flags. There must be a better way. */
des - > des_flags = 0 ;
/* Otherwise, FRAG_RETURN it and repost if necessary */
MCA_BTL_IB_FRAG_RETURN ( frag ) ;
if ( BTL_OPENIB_QP_TYPE_PP ( rqp ) ) {
2013-05-10 21:29:08 +00:00
OPAL_THREAD_ADD32 ( & ep - > qps [ rqp ] . u . pp_qp . rd_posted , - 1 ) ;
2013-01-17 22:34:43 +00:00
mca_btl_openib_endpoint_post_rr ( ep , cqp ) ;
} else {
mca_btl_openib_module_t * btl = ep - > endpoint_btl ;
OPAL_THREAD_ADD32 ( & btl - > qps [ rqp ] . u . srq_qp . rd_posted , - 1 ) ;
mca_btl_openib_post_srr ( btl , rqp ) ;
}
assert ( ( cqp ! = MCA_BTL_NO_ORDER & & BTL_OPENIB_QP_TYPE_PP ( cqp ) ) | | ! credits ) ;
/* If we got any credits (RDMA or send), then try to progress all
the no_credits_pending_frags lists */
if ( rcredits > 0 ) {
OPAL_THREAD_ADD32 ( & ep - > eager_rdma_remote . tokens , rcredits ) ;
}
if ( credits > 0 ) {
OPAL_THREAD_ADD32 ( & ep - > qps [ cqp ] . u . pp_qp . sd_credits , credits ) ;
}
if ( rcredits + credits > 0 ) {
int rc ;
2014-07-26 00:47:28 +00:00
if ( OPAL_SUCCESS ! =
2013-01-17 22:34:43 +00:00
( rc = progress_no_credits_pending_frags ( ep ) ) ) {
2013-05-23 18:45:23 +00:00
/* This is a fatal issue so call into PML and let it know. */
mca_btl_openib_module_t * openib_btl = ( mca_btl_openib_module_t * ) btl ;
openib_btl - > error_cb ( & openib_btl - > super , MCA_BTL_ERROR_FLAGS_FATAL ,
NULL , NULL ) ;
return ;
2013-01-17 22:34:43 +00:00
}
}
send_credits ( ep , cqp ) ;
}
2013-11-01 12:19:40 +00:00
# endif /* OPAL_CUDA_SUPPORT */
2013-01-17 22:34:43 +00:00
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the return status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
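Two of the behaviours listed in this commit message, looking up cached INI values by the HCA's vendor_id and vendor_part_id and then selecting the lesser MTU during peer negotiation, can be sketched as below. The table contents, hca_params_t, lookup_hca_params() and negotiate_mtu() are all hypothetical and only illustrate the described flow; the IDs and MTU values are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached INI entry: values keyed by the HCA's IDs. */
typedef struct {
    uint32_t vendor_id;
    uint32_t vendor_part_id;
    int mtu;
} hca_params_t;

/* Made-up entries standing in for values read from the INI files. */
static const hca_params_t cached_params[] = {
    { 0x1111, 100, 1024 },
    { 0x2222, 200, 2048 },
};

/* Return the cached entry for this HCA, or NULL if none was found.
   A missing entry is not fatal; the caller would only print a warning. */
static const hca_params_t *lookup_hca_params(uint32_t vendor_id,
                                             uint32_t vendor_part_id)
{
    size_t i;
    for (i = 0; i < sizeof(cached_params) / sizeof(cached_params[0]); i++) {
        if (cached_params[i].vendor_id == vendor_id &&
            cached_params[i].vendor_part_id == vendor_part_id) {
            return &cached_params[i];
        }
    }
    return NULL;
}

/* Both peers exchange their maximum MTU; the connection uses the lesser. */
static int negotiate_mtu(int local_max, int remote_max)
{
    assert(local_max > 0 && remote_max > 0);
    return (local_max < remote_max) ? local_max : remote_max;
}
```

A peer whose HCA resolves to 2048 connecting to one that resolves to 1024 would end up with an MTU of 1024 on that connection.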
2006-08-14 19:30:37 +00:00
static char * btl_openib_component_status_to_string ( enum ibv_wc_status status )
2008-01-21 12:11:18 +00:00
{
switch ( status ) {
Bring over all the work from the /tmp/ib-hw-detect branch. In
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the retun status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
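The MTU negotiation described above can be sketched as follows. This is only an illustration: the function name negotiate_mtu_sketch and the use of plain byte counts are assumptions, not the BTL's actual code.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the openib BTL): each peer advertises
 * the largest MTU it supports, and both settle on the smaller value so
 * that neither side is asked to exceed its own limit. */
static uint32_t negotiate_mtu_sketch(uint32_t local_max_mtu,
                                     uint32_t remote_max_mtu)
{
    return (local_max_mtu < remote_max_mtu) ? local_max_mtu : remote_max_mtu;
}
```

For example, a peer limited to 1024 bytes talking to a peer that supports 2048 bytes would end up using 1024 on that connection.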
This commit was SVN r11182.
2006-08-14 19:30:37 +00:00
    case IBV_WC_SUCCESS:
2008-01-21 12:11:18 +00:00
        return "SUCCESS";
2006-08-14 19:30:37 +00:00
        break;
    case IBV_WC_LOC_LEN_ERR:
2008-01-21 12:11:18 +00:00
        return "LOCAL LENGTH ERROR";
2006-08-14 19:30:37 +00:00
        break;
    case IBV_WC_LOC_QP_OP_ERR:
        return "LOCAL QP OPERATION ERROR";
        break;
    case IBV_WC_LOC_PROT_ERR:
        return "LOCAL PROTOCOL ERROR";
        break;
    case IBV_WC_WR_FLUSH_ERR:
        return "WORK REQUEST FLUSHED ERROR";
        break;
    case IBV_WC_MW_BIND_ERR:
        return "MEMORY WINDOW BIND ERROR";
        break;
    case IBV_WC_BAD_RESP_ERR:
        return "BAD RESPONSE ERROR";
        break;
    case IBV_WC_LOC_ACCESS_ERR:
        return "LOCAL ACCESS ERROR";
        break;
    case IBV_WC_REM_INV_REQ_ERR:
        return "INVALID REQUEST ERROR";
        break;
    case IBV_WC_REM_ACCESS_ERR:
        return "REMOTE ACCESS ERROR";
        break;
    case IBV_WC_REM_OP_ERR:
        return "REMOTE OPERATION ERROR";
        break;
    case IBV_WC_RETRY_EXC_ERR:
        return "RETRY EXCEEDED ERROR";
        break;
    case IBV_WC_RNR_RETRY_EXC_ERR:
2007-04-26 21:03:38 +00:00
        return "RECEIVER NOT READY RETRY EXCEEDED ERROR";
2006-08-14 19:30:37 +00:00
        break;
    case IBV_WC_LOC_RDD_VIOL_ERR:
        return "LOCAL RDD VIOLATION ERROR";
        break;
    case IBV_WC_REM_INV_RD_REQ_ERR:
        return "INVALID READ REQUEST ERROR";
        break;
    case IBV_WC_REM_ABORT_ERR:
        return "REMOTE ABORT ERROR";
        break;
    case IBV_WC_INV_EECN_ERR:
        return "INVALID EECN ERROR";
        break;
    case IBV_WC_INV_EEC_STATE_ERR:
        return "INVALID EEC STATE ERROR";
        break;
2008-01-21 12:11:18 +00:00
    case IBV_WC_FATAL_ERR:
2006-08-14 19:30:37 +00:00
        return "FATAL ERROR";
        break;
    case IBV_WC_RESP_TIMEOUT_ERR:
        return "RESPONSE TIMEOUT ERROR";
        break;
    case IBV_WC_GENERAL_ERR:
        return "GENERAL ERROR";
        break;
    default:
        return "STATUS UNDEFINED";
        break;
    }
2006-06-05 20:02:41 +00:00
}
2006-08-14 19:30:37 +00:00
2008-01-09 14:46:41 +00:00
static void
progress_pending_frags_wqe(mca_btl_base_endpoint_t *ep, const int qpn)
2007-11-28 07:15:20 +00:00
{
2015-05-27 09:43:42 -06:00
int ret ;
2007-11-28 07:15:20 +00:00
opal_list_item_t * frag ;
2008-01-09 14:46:41 +00:00
    mca_btl_openib_qp_t *qp = ep->qps[qpn].qp;
    OPAL_THREAD_LOCK(&ep->endpoint_lock);
2015-05-27 09:43:42 -06:00
    for (int i = 0; i < 2; i++) {
2007-11-28 07:15:20 +00:00
        while (qp->sd_wqe > 0) {
2009-01-07 14:10:58 +00:00
            mca_btl_base_endpoint_t *tmp_ep;
            frag = opal_list_remove_first(&ep->qps[qpn].no_wqe_pending_frags[i]);
2007-11-28 07:15:20 +00:00
            if (NULL == frag)
                break;
2015-07-06 09:45:01 +09:00
            assert(0 == frag->opal_list_item_refcount);
2009-01-07 14:10:58 +00:00
            tmp_ep = to_com_frag(frag)->endpoint;
2015-05-27 09:43:42 -06:00
            ret = mca_btl_openib_endpoint_post_send(tmp_ep, to_send_frag(frag));
            if (OPAL_SUCCESS != ret) {
                /* NTH: this handles retrying if we are out of credits but other errors are not
                 * handled (maybe abort?). */
2015-07-06 09:45:01 +09:00
                if (OPAL_ERR_RESOURCE_BUSY != ret) {
                    opal_list_prepend(&ep->qps[qpn].no_wqe_pending_frags[i], (opal_list_item_t *) frag);
                }
2015-05-27 09:43:42 -06:00
                break;
            }
2007-11-28 07:15:20 +00:00
}
}
2008-01-09 14:46:41 +00:00
    OPAL_THREAD_UNLOCK(&ep->endpoint_lock);
2007-11-28 07:15:20 +00:00
}
2008-01-09 10:27:15 +00:00
static void progress_pending_frags_srq(mca_btl_openib_module_t *openib_btl,
                                       const int qp)
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asynchronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be set up to use PP or SRQ receive buffers, with fine-grained
control over the receive buffer size, the number of receive buffers to
post, and when to replenish the receive queue (low watermark). For SRQ
QPs, the number of outstanding sends can also be specified. The
following is an example of the syntax to describe QPs to the OpenIB
BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
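The outstanding-send limit described above amounts to a simple credit counter. The following is a minimal sketch of that idea only; the struct and function names are hypothetical, and the real BTL keeps actual fragment lists and uses atomic updates rather than a plain integer.

```c
#include <stdbool.h>

/* Hypothetical per-QP state: not the BTL's actual structure. */
typedef struct {
    int sd_credits; /* sends that may still be posted */
    int pending;    /* sends queued because no credit was available */
} srq_qp_sketch_t;

/* Try to post a send: consume a credit, or queue the send if none is left. */
static bool sketch_try_send(srq_qp_sketch_t *qp)
{
    if (qp->sd_credits > 0) {
        qp->sd_credits--;
        return true;
    }
    qp->pending++;
    return false;
}

/* A send completed: return its credit, then post queued sends while
 * credits remain. Returns the number of queued sends that progressed. */
static int sketch_send_complete(srq_qp_sketch_t *qp)
{
    int progressed = 0;

    qp->sd_credits++;
    while (qp->pending > 0 && qp->sd_credits > 0) {
        qp->pending--;
        qp->sd_credits--;
        progressed++;
    }
    return progressed;
}
```

With a limit of 32, the 33rd send issued before any completion arrives would be queued, and it would be posted as soon as one earlier send completes.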
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
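To make the syntax concrete, here is a rough sketch of how such a string could be parsed and checked. It is not the parser used by the BTL: the type qp_spec_sketch_t, the function name, the 64-byte item limit, and the choice to require strictly ascending sizes are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of one QP description. */
typedef struct {
    char type;     /* 'P' (per-peer) or 'S' (shared receive queue) */
    long size;     /* receive buffer size in bytes */
    long num;      /* number of receive buffers */
    long low_mark; /* low watermark at which buffers are reposted */
    long sd_max;   /* outstanding sends allowed ('S' only, else 0) */
} qp_spec_sketch_t;

/* Parse "P,128,16,4;S,1024,256,128,32;..." into out[].
 * Returns the number of QPs parsed, or -1 if the string is malformed
 * or the buffer sizes are not in ascending order. */
static int parse_receive_queues_sketch(const char *spec,
                                       qp_spec_sketch_t *out, int max_qps)
{
    int count = 0;
    long prev_size = 0;
    const char *p = spec;

    while (*p != '\0') {
        char item[64];
        size_t len = strcspn(p, ";");
        qp_spec_sketch_t q = { 0, 0, 0, 0, 0 };
        int n;

        if (len == 0 || len >= sizeof(item) || count >= max_qps) {
            return -1;
        }
        memcpy(item, p, len);
        item[len] = '\0';

        n = sscanf(item, "%c,%ld,%ld,%ld,%ld",
                   &q.type, &q.size, &q.num, &q.low_mark, &q.sd_max);
        /* 'P' takes exactly four fields, 'S' exactly five. */
        if (!((q.type == 'P' && n == 4) || (q.type == 'S' && n == 5))) {
            return -1;
        }
        /* Sizes must ascend (strictly, in this sketch). */
        if (q.size <= prev_size) {
            return -1;
        }
        prev_size = q.size;
        out[count++] = q;

        p += len;
        if (*p == ';') {
            p++;
        }
    }
    return count;
}
```

Applied to the example string above, this sketch would yield four entries, the first of type P with 128-byte buffers and the last of type S with 65536-byte buffers; listing the S QP before the smaller P QP would be rejected.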
This commit was SVN r15474.
2007-07-18 01:15:59 +00:00
{
2007-11-28 07:11:14 +00:00
opal_list_item_t * frag ;
2007-11-28 07:12:44 +00:00
int i ;
2008-01-21 12:11:18 +00:00
2007-11-28 07:18:59 +00:00
    assert(BTL_OPENIB_QP_TYPE_SRQ(qp) || BTL_OPENIB_QP_TYPE_XRC(qp));
2008-01-21 12:11:18 +00:00
2007-11-28 07:12:44 +00:00
    for (i = 0; i < 2; i++) {
2007-11-28 07:20:26 +00:00
        while (openib_btl->qps[qp].u.srq_qp.sd_credits > 0) {
2007-11-28 07:12:44 +00:00
            OPAL_THREAD_LOCK(&openib_btl->ib_lock);
2007-11-28 07:20:26 +00:00
            frag = opal_list_remove_first(
                &openib_btl->qps[qp].u.srq_qp.pending_frags[i]);
2007-11-28 07:12:44 +00:00
            OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
            if (NULL == frag)
                break;
            mca_btl_openib_endpoint_send(to_com_frag(frag)->endpoint,
                                         to_send_frag(frag));
        }
2007-05-08 21:47:21 +00:00
}
}
2007-12-23 12:29:34 +00:00
static char *cq_name[] = {"HP CQ", "LP CQ"};
2008-07-23 00:28:59 +00:00
static void handle_wc(mca_btl_openib_device_t *device, const uint32_t cq,
2007-12-23 12:29:34 +00:00
                      struct ibv_wc *wc)
2006-11-02 16:15:21 +00:00
{
2007-12-23 12:29:34 +00:00
static int flush_err_printed [ ] = { 0 , 0 } ;
2007-11-28 07:11:14 +00:00
mca_btl_openib_com_frag_t * frag ;
mca_btl_base_descriptor_t * des ;
2007-12-23 12:29:34 +00:00
mca_btl_openib_endpoint_t * endpoint ;
2007-08-20 12:28:25 +00:00
mca_btl_openib_module_t * openib_btl = NULL ;
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
const opal_proc_t * remote_proc = NULL ;
2008-02-18 17:39:30 +00:00
int qp , btl_ownership ;
2012-12-26 10:19:12 +00:00
int n ;
2006-11-02 16:15:21 +00:00
2007-12-23 12:29:34 +00:00
    des = (mca_btl_base_descriptor_t *)(uintptr_t) wc->wr_id;
    frag = to_com_frag(des);
2007-12-02 14:46:37 +00:00
2007-12-23 12:29:34 +00:00
    /* For receive fragments "order" contains QP idx the fragment was posted
     * to. For send fragments "order" contains QP idx the fragment was sent
     * through */
    qp = des->order;
2015-05-27 09:43:42 -06:00
    if (IBV_WC_RECV == wc->opcode && (wc->wc_flags & IBV_WC_WITH_IMM)) {
#if !defined(WORDS_BIGENDIAN) && OPAL_ENABLE_HETEROGENEOUS_SUPPORT
        wc->imm_data = ntohl(wc->imm_data);
#endif
        frag->endpoint = (mca_btl_openib_endpoint_t *)
            opal_pointer_array_get_item(device->endpoints, wc->imm_data);
    }
2007-12-23 12:29:34 +00:00
    endpoint = frag->endpoint;
2007-11-28 07:11:14 +00:00
2015-05-27 09:43:42 -06:00
    assert(NULL != endpoint);
    openib_btl = endpoint->endpoint_btl;
2007-11-28 07:11:14 +00:00
2008-10-06 00:46:02 +00:00
    if (wc->status != IBV_WC_SUCCESS) {
        OPAL_OUTPUT((-1, "Got WC: ERROR"));
2007-12-23 12:29:34 +00:00
goto error ;
2008-10-06 00:46:02 +00:00
}
2007-08-20 12:28:25 +00:00
2007-12-23 12:29:34 +00:00
/* Handle work completions */
    switch (wc->opcode) {
        case IBV_WC_RDMA_READ:
2015-01-05 14:36:57 -07:00
        case IBV_WC_COMP_SWAP:
        case IBV_WC_FETCH_ADD:
2015-01-06 08:48:33 -07:00
            OPAL_OUTPUT((-1, "Got WC: RDMA_READ or RDMA_WRITE"));
2015-01-05 14:36:57 -07:00
            OPAL_THREAD_ADD32(&endpoint->get_tokens, 1);
2015-01-06 08:48:33 -07:00
2015-01-05 14:36:57 -07:00
            mca_btl_openib_get_frag_t *get_frag = to_get_frag(des);
2015-01-06 08:48:33 -07:00
2015-11-23 16:07:12 -07:00
            /* check if atomic result needs to be byte swapped (mlx5) */
            if (openib_btl->atomic_ops_be && IBV_WC_RDMA_READ != wc->opcode) {
                *((int64_t *) frag->sg_entry.addr) = ntoh64(*((int64_t *) frag->sg_entry.addr));
            }
2015-01-05 14:36:57 -07:00
            get_frag->cb.func(&openib_btl->super, endpoint, (void *)(intptr_t) frag->sg_entry.addr,
                              get_frag->cb.local_handle, get_frag->cb.context, get_frag->cb.data,
                              OPAL_SUCCESS);
2015-05-27 09:43:42 -06:00
/* fall through */
2015-01-05 14:36:57 -07:00
        case IBV_WC_RDMA_WRITE:
            if (MCA_BTL_OPENIB_FRAG_SEND_USER == openib_frag_type(des)) {
2015-01-06 08:48:33 -07:00
                mca_btl_openib_put_frag_t *put_frag = to_put_frag(des);
                put_frag->cb.func(&openib_btl->super, endpoint, (void *)(intptr_t) frag->sg_entry.addr,
                                  put_frag->cb.local_handle, put_frag->cb.context, put_frag->cb.data,
                                  OPAL_SUCCESS);
                put_frag->cb.func = NULL;
            }
            /* fall through */
2007-12-23 12:29:34 +00:00
        case IBV_WC_SEND:
2008-10-06 00:46:02 +00:00
            OPAL_OUTPUT((-1, "Got WC: RDMA_WRITE or SEND"));
2007-12-23 12:29:34 +00:00
            if (openib_frag_type(des) == MCA_BTL_OPENIB_FRAG_SEND) {
                opal_list_item_t *i;
2008-02-18 17:39:30 +00:00
                while ((i = opal_list_remove_first(&to_send_frag(des)->coalesced_frags))) {
                    btl_ownership = (to_base_frag(i)->base.des_flags & MCA_BTL_DES_FLAGS_BTL_OWNERSHIP);
2011-01-19 20:58:22 +00:00
# if BTL_OPENIB_FAILOVER_ENABLED
2011-01-19 16:09:17 +00:00
                    /* The check for the callback flag is only needed when running
                     * with the failover case because there is a chance that a fragment
                     * generated from a sendi call (which does not set the flag) gets
                     * coalesced. In normal operation, this cannot happen as the sendi
                     * call will never queue up a fragment which could potentially become
                     * a coalesced fragment. It will revert to a regular send. */
2010-11-10 19:09:47 +00:00
                    if (to_base_frag(i)->base.des_flags & MCA_BTL_DES_SEND_ALWAYS_CALLBACK) {
2010-07-14 10:08:19 +00:00
# endif
                        to_base_frag(i)->base.des_cbfunc(&openib_btl->super, endpoint,
2014-07-26 00:47:28 +00:00
                                                         &to_base_frag(i)->base, OPAL_SUCCESS);
2011-01-19 20:58:22 +00:00
# if BTL_OPENIB_FAILOVER_ENABLED
2010-07-14 10:08:19 +00:00
}
# endif
2008-02-18 17:39:30 +00:00
                    if (btl_ownership) {
                        mca_btl_openib_free(&openib_btl->super, &to_base_frag(i)->base);
                    }
                }
2007-12-23 12:29:34 +00:00
}
/* Process a completed send/put/get */
2008-02-18 17:39:30 +00:00
            btl_ownership = (des->des_flags & MCA_BTL_DES_FLAGS_BTL_OWNERSHIP);
2009-03-25 16:53:26 +00:00
            if (des->des_flags & MCA_BTL_DES_SEND_ALWAYS_CALLBACK) {
2015-01-05 14:36:57 -07:00
                des->des_cbfunc(&openib_btl->super, endpoint, des, OPAL_SUCCESS);
2009-03-25 16:53:26 +00:00
}
2008-02-18 17:39:30 +00:00
            if (btl_ownership) {
                mca_btl_openib_free(&openib_btl->super, des);
            }
2007-11-28 07:11:14 +00:00
2007-12-23 12:29:34 +00:00
/* return send wqe */
            qp_put_wqe(endpoint, qp);
2005-11-10 20:15:02 +00:00
2012-12-26 10:19:12 +00:00
/* return wqes that were sent before this frag */
            n = qp_frag_to_wqe(endpoint, qp, to_com_frag(des));
2007-12-23 12:29:34 +00:00
            if (IBV_WC_SEND == wc->opcode && !BTL_OPENIB_QP_TYPE_PP(qp)) {
2012-12-26 10:19:12 +00:00
                OPAL_THREAD_ADD32(&openib_btl->qps[qp].u.srq_qp.sd_credits, 1 + n);
This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asychronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artifical changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
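The outstanding-send limit can be pictured as a simple credit counter. The sketch below is only an illustration under that assumption: send_window_t, try_send() and send_completed() are hypothetical names, and the real BTL bookkeeping may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical send window for one SRQ-based QP (illustration only). */
typedef struct {
    int credits;  /* sends that may still be started (fifth field at start) */
    int pending;  /* sends queued because no credit was available */
} send_window_t;

/* Try to start a send. Returns true if it may go out now, false if queued. */
static bool try_send(send_window_t *w)
{
    if (w->credits > 0) {
        w->credits--;
        return true;
    }
    w->pending++;
    return false;
}

/*
 * Called when a send completes. If a send is waiting, the freed credit is
 * handed straight to it and true is returned; otherwise the credit goes
 * back to the window.
 */
static bool send_completed(send_window_t *w)
{
    assert(w->credits >= 0 && w->pending >= 0);
    if (w->pending > 0) {
        w->pending--;
        return true;
    }
    w->credits++;
    return false;
}
```

With a limit of 32, the 33rd send is queued until one of the earlier sends completes.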
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to the 1.3 release.
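The syntax described above can be sketched as a small standalone parser. This is a minimal sketch, not the BTL's actual parsing code: qp_desc_t and parse_receive_queues() are hypothetical names, and rejecting any description whose buffer size is smaller than the previous one is just one reading of the ascending-order rule.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical parsed form of one QP description (illustration only). */
typedef struct {
    char type;           /* 'P' = per-peer, 'S' = shared receive queue */
    long size;           /* receive buffer size in bytes */
    long num_buffers;    /* number of receive buffers */
    long low_watermark;  /* repost threshold */
    long max_sends;      /* outstanding sends; SRQ only, 0 for per-peer */
} qp_desc_t;

/*
 * Parse a string such as "P,128,16,4;S,1024,256,128,32" into out[].
 * Returns the number of QPs parsed, or -1 if the string is malformed,
 * holds more than max descriptions, or the sizes are not ascending.
 */
static int parse_receive_queues(const char *spec, qp_desc_t *out, int max)
{
    const char *p = spec;
    long prev_size = 0;
    int n = 0;

    assert(spec != NULL && out != NULL);

    while (*p != '\0') {
        qp_desc_t q = { 0, 0, 0, 0, 0 };
        int consumed = 0;

        if (n == max) {
            return -1;
        }
        if (*p == 'P') {
            if (sscanf(p, "P,%ld,%ld,%ld%n", &q.size, &q.num_buffers,
                       &q.low_watermark, &consumed) != 3) {
                return -1;
            }
            q.type = 'P';
        } else if (*p == 'S') {
            if (sscanf(p, "S,%ld,%ld,%ld,%ld%n", &q.size, &q.num_buffers,
                       &q.low_watermark, &q.max_sends, &consumed) != 4) {
                return -1;
            }
            q.type = 'S';
        } else {
            return -1;
        }

        p += consumed;
        if (*p == ';') {
            p++;                 /* move on to the next description */
        } else if (*p != '\0') {
            return -1;           /* unexpected extra field or character */
        }

        if (q.size < prev_size) {
            return -1;           /* sizes must not decrease (assumed rule) */
        }
        prev_size = q.size;
        out[n++] = q;
    }
    return n;
}
```

Applied to the example string above, the sketch yields four descriptions: one per-peer QP followed by three SRQ QPs with 1024, 4096 and 65536 byte buffers.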
This commit was SVN r15474.
2007-07-18 01:15:59 +00:00
2007-12-23 12:29:34 +00:00
/* new SRQ credit available. Try to progress pending frags */
progress_pending_frags_srq ( openib_btl , qp ) ;
}
/* new wqe and/or get token available. Try to progress pending frags */
2008-01-09 14:46:41 +00:00
progress_pending_frags_wqe ( endpoint , qp ) ;
2007-12-23 12:29:34 +00:00
mca_btl_openib_frag_progress_pending_put_get ( endpoint , qp ) ;
break ;
case IBV_WC_RECV :
2009-05-14 15:39:53 +00:00
OPAL_OUTPUT ( ( - 1 , " Got WC: RDMA_RECV, qp %d, src qp %d, WR ID % " PRIx64 ,
wc - > qp_num , wc - > src_qp , wc - > wr_id ) ) ;
2008-12-10 22:07:48 +00:00
2007-12-23 12:29:34 +00:00
/* Process a RECV */
if ( btl_openib_handle_incoming ( openib_btl , endpoint , to_recv_frag ( frag ) ,
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
wc - > byte_len ) ! = OPAL_SUCCESS ) {
2010-05-19 11:55:45 +00:00
openib_btl - > error_cb ( & openib_btl - > super , MCA_BTL_ERROR_FLAGS_FATAL ,
NULL , NULL ) ;
2005-11-10 20:15:02 +00:00
break ;
2007-12-23 12:29:34 +00:00
}
2007-07-24 13:23:08 +00:00
2007-12-23 12:29:34 +00:00
/* decide if it is time to setup an eager rdma channel */
if ( ! endpoint - > eager_rdma_local . base . pval & & endpoint - > use_eager_rdma & &
wc - > byte_len < mca_btl_openib_component . eager_limit & &
openib_btl - > eager_rdma_channels <
mca_btl_openib_component . max_eager_rdma & &
OPAL_THREAD_ADD32 ( & endpoint - > eager_recv_count , 1 ) = =
mca_btl_openib_component . eager_rdma_threshold ) {
mca_btl_openib_endpoint_connect_eager_rdma ( endpoint ) ;
}
break ;
default :
BTL_ERROR ( ( " Unhandled work completion opcode is %d " , wc - > opcode ) ) ;
if ( openib_btl )
2010-05-19 11:55:45 +00:00
openib_btl - > error_cb ( & openib_btl - > super , MCA_BTL_ERROR_FLAGS_FATAL ,
NULL , NULL ) ;
2007-12-23 12:29:34 +00:00
break ;
}
2007-07-26 13:56:07 +00:00
2007-12-23 12:29:34 +00:00
return ;
error :
2015-05-28 11:26:13 -06:00
if ( endpoint - > endpoint_proc & & endpoint - > endpoint_proc - > proc_opal )
2014-07-26 00:47:28 +00:00
remote_proc = endpoint - > endpoint_proc - > proc_opal ;
2007-12-23 12:29:34 +00:00
2008-05-07 16:53:05 +00:00
/* For iWARP, the TCP connection is tied to the QP once the QP is
* in RTS . And destroying the QP is thus tied to connection
* teardown for iWARP . To destroy the connection in iWARP you
* must move the QP out of RTS , either into CLOSING for a nice
* graceful close ( e . g . , via rdma_disconnect ( ) ) , or to ERROR if
* you want to be rude ( e . g . , just destroying the QP without
* disconnecting first ) . In both cases , all pending non - completed
* SQ and RQ WRs will automatically be flushed .
2008-05-06 21:57:40 +00:00
*/
2008-05-12 11:56:14 +00:00
# if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
2011-07-04 14:00:41 +00:00
if ( IBV_WC_WR_FLUSH_ERR = = wc - > status & &
2008-07-23 00:28:59 +00:00
IBV_TRANSPORT_IWARP = = device - > ib_dev - > transport_type ) {
2008-05-06 21:57:40 +00:00
return ;
}
2008-05-12 11:56:14 +00:00
# endif
2008-05-06 21:57:40 +00:00
2008-05-12 12:05:16 +00:00
if ( IBV_WC_WR_FLUSH_ERR ! = wc - > status | | ! flush_err_printed [ cq ] + + ) {
2007-12-23 12:29:34 +00:00
BTL_PEER_ERROR ( remote_proc , ( " error polling %s with status %s "
2009-05-14 15:39:53 +00:00
" status number %d for wr_id % " PRIx64 " opcode %d vendor error %d qp_idx %d " ,
2007-12-23 12:29:34 +00:00
cq_name [ cq ] , btl_openib_component_status_to_string ( wc - > status ) ,
2011-07-04 14:00:41 +00:00
wc - > status , wc - > wr_id ,
2009-05-14 15:39:53 +00:00
wc - > opcode , wc - > vendor_err , qp ) ) ;
2007-12-23 12:29:34 +00:00
}
2009-08-05 22:23:26 +00:00
2008-06-02 11:03:48 +00:00
if ( IBV_WC_RNR_RETRY_EXC_ERR = = wc - > status | |
IBV_WC_RETRY_EXC_ERR = = wc - > status ) {
2013-08-21 19:01:54 +00:00
const char * peer_hostname ;
2014-10-03 14:19:48 -07:00
peer_hostname = opal_get_proc_hostname ( endpoint - > endpoint_proc - > proc_opal ) ;
2011-07-04 14:00:41 +00:00
const char * device_name =
2008-06-02 11:03:48 +00:00
ibv_get_device_name ( endpoint - > qps [ qp ] . qp - > lcl_qp - > context - > device ) ;
if ( IBV_WC_RNR_RETRY_EXC_ERR = = wc - > status ) {
2014-08-08 13:35:29 +00:00
// The show_help checker script gets confused if the topic
// is an inline logic check, so separate it into two calls
// to show_help.
if ( BTL_OPENIB_QP_TYPE_PP ( qp ) ) {
opal_show_help ( " help-mpi-btl-openib.txt " ,
" pp rnr retry exceeded " ,
true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2014-08-08 13:35:29 +00:00
device_name ,
peer_hostname ) ;
} else {
opal_show_help ( " help-mpi-btl-openib.txt " ,
" srq rnr retry exceeded " ,
true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2014-08-08 13:35:29 +00:00
device_name ,
peer_hostname ) ;
}
2008-06-02 11:03:48 +00:00
} else if ( IBV_WC_RETRY_EXC_ERR = = wc - > status ) {
2013-02-12 21:10:11 +00:00
opal_show_help ( " help-mpi-btl-openib.txt " ,
2008-06-02 11:03:48 +00:00
" pp retry exceeded " , true ,
2014-10-03 14:19:48 -07:00
opal_process_info . nodename ,
2013-08-21 19:01:54 +00:00
device_name , peer_hostname ) ;
2008-06-02 11:03:48 +00:00
}
}
2007-12-23 12:29:34 +00:00
2011-01-19 20:58:22 +00:00
# if BTL_OPENIB_FAILOVER_ENABLED
2010-07-14 10:08:19 +00:00
mca_btl_openib_handle_endpoint_error ( openib_btl , des , qp ,
remote_proc , endpoint ) ;
# else
2009-08-05 22:23:26 +00:00
if ( openib_btl )
2010-05-19 11:55:45 +00:00
openib_btl - > error_cb ( & openib_btl - > super , MCA_BTL_ERROR_FLAGS_FATAL ,
2014-07-26 00:47:28 +00:00
( struct opal_proc_t * ) remote_proc , NULL ) ;
2010-07-14 10:08:19 +00:00
# endif
2007-12-23 12:29:34 +00:00
}
2008-07-23 00:28:59 +00:00
static int poll_device ( mca_btl_openib_device_t * device , int count )
2007-12-23 12:29:34 +00:00
{
2008-01-21 12:11:18 +00:00
int ne = 0 , cq ;
2007-12-23 12:29:34 +00:00
uint32_t hp_iter = 0 ;
2012-11-05 14:02:37 +00:00
struct ibv_wc wc [ MCA_BTL_OPENIB_CQ_POLL_BATCH_DEFAULT ] ;
int i ;
2007-12-23 12:29:34 +00:00
2008-07-23 00:28:59 +00:00
device - > pollme = false ;
2007-12-23 12:29:34 +00:00
for ( cq = 0 ; cq < 2 & & hp_iter < mca_btl_openib_component . cq_poll_progress ; )
{
2012-11-05 14:02:37 +00:00
ne = ibv_poll_cq ( device - > ib_cq [ cq ] , mca_btl_openib_component . cq_poll_batch , wc ) ;
2007-12-23 12:29:34 +00:00
if ( 0 = = ne ) {
/* Don't check the low-priority CQ if there was something in the
 * high-priority CQ, but poll the low-priority CQ once for every
 * cq_poll_ratio high-priority CQ polls. */
2008-07-23 00:28:59 +00:00
if ( count & & device - > hp_cq_polls )
2005-06-30 21:28:35 +00:00
break ;
2007-12-23 12:29:34 +00:00
cq + + ;
2008-07-23 00:28:59 +00:00
device - > hp_cq_polls = mca_btl_openib_component . cq_poll_ratio ;
2007-12-23 12:29:34 +00:00
continue ;
2005-06-30 21:28:35 +00:00
}
2007-12-23 12:29:34 +00:00
if ( ne < 0 )
goto error ;
count + + ;
if ( BTL_OPENIB_HP_CQ = = cq ) {
2008-07-23 00:28:59 +00:00
device - > pollme = true ;
2007-12-23 12:29:34 +00:00
hp_iter + + ;
2008-07-23 00:28:59 +00:00
device - > hp_cq_polls - - ;
2007-12-23 12:29:34 +00:00
}
2008-01-21 12:11:18 +00:00
2012-11-05 14:02:37 +00:00
for ( i = 0 ; i < ne ; i + + )
handle_wc ( device , cq , & wc [ i ] ) ;
2005-06-30 21:28:35 +00:00
}
2008-01-21 12:11:18 +00:00
2005-06-30 21:28:35 +00:00
return count ;
2006-09-12 09:17:59 +00:00
error :
2008-05-02 11:52:33 +00:00
BTL_ERROR ( ( " error polling %s with %d errno says %s " , cq_name [ cq ] , ne ,
2008-01-21 12:11:18 +00:00
strerror ( errno ) ) ) ;
2006-09-05 15:59:02 +00:00
return count ;
2005-06-30 21:28:35 +00:00
}
2007-06-14 01:59:25 +00:00
2014-07-26 00:47:28 +00:00
# if OPAL_ENABLE_PROGRESS_THREADS
2008-01-09 10:27:15 +00:00
void * mca_btl_openib_progress_thread ( opal_object_t * arg )
2007-06-14 01:59:25 +00:00
{
2008-01-09 10:27:15 +00:00
opal_thread_t * thread = ( opal_thread_t * ) arg ;
2008-07-23 00:28:59 +00:00
mca_btl_openib_device_t * device = thread - > t_arg ;
2008-01-09 10:27:15 +00:00
struct ibv_cq * ev_cq ;
void * ev_ctx ;
2007-06-14 01:59:25 +00:00
2008-01-09 10:27:15 +00:00
/* This thread starts in a cancel-enabled state */
pthread_setcancelstate ( PTHREAD_CANCEL_ENABLE , NULL ) ;
pthread_setcanceltype ( PTHREAD_CANCEL_ASYNCHRONOUS , NULL ) ;
2007-06-14 01:59:25 +00:00
2008-06-09 14:53:58 +00:00
opal_output ( - 1 , " WARNING: the openib btl progress thread code *does not yet work*. Your run is likely to hang, crash, break the kitchen sink, and/or eat your cat. You have been warned. " ) ;
2008-01-09 10:27:15 +00:00
2008-07-23 00:28:59 +00:00
while ( device - > progress ) {
2012-08-16 22:53:14 +00:00
#if 0
Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.
Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.
Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.
I have gone through the code base and converted all the libevent calls I could find. However, I can neither compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:
1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)
2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.
There is also the issue of the new libevent "no recursion" behavior. As I described in a recent email, we will have to discuss this and figure out what, if anything, we need to do.
This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
while ( ompi_progress_threads ( ) ) {
while ( ompi_progress_threads ( ) )
2008-01-09 10:27:15 +00:00
sched_yield ( ) ;
usleep ( 100 ) ; /* give app a chance to re-enter library */
2007-06-14 01:59:25 +00:00
}
2012-08-16 22:53:14 +00:00
# endif
2008-01-09 10:27:15 +00:00
2008-07-23 00:28:59 +00:00
if ( ibv_get_cq_event ( device - > ib_channel , & ev_cq , & ev_ctx ) )
2008-01-09 10:27:15 +00:00
BTL_ERROR ( ( " Failed to get CQ event with error %s " ,
strerror ( errno ) ) ) ;
if ( ibv_req_notify_cq ( ev_cq , 0 ) ) {
BTL_ERROR ( ( " Couldn't request CQ notification with error %s " ,
strerror ( errno ) ) ) ;
2007-06-14 01:59:25 +00:00
}
2008-01-09 10:27:15 +00:00
ibv_ack_cq_events ( ev_cq , 1 ) ;
2008-07-23 00:28:59 +00:00
while ( poll_device ( device , 0 ) ) ;
2007-06-14 01:59:25 +00:00
}
2008-01-09 10:27:15 +00:00
return PTHREAD_CANCELED ;
}
2014-06-25 20:43:28 +00:00
# endif
2007-06-14 01:59:25 +00:00
2008-07-23 00:28:59 +00:00
static int progress_one_device ( mca_btl_openib_device_t * device )
2008-01-09 10:27:15 +00:00
{
int i , c , count = 0 , ret ;
mca_btl_openib_recv_frag_t * frag ;
mca_btl_openib_endpoint_t * endpoint ;
uint32_t non_eager_rdma_endpoints = 0 ;
2007-06-14 01:59:25 +00:00
2008-07-23 00:28:59 +00:00
c = device - > eager_rdma_buffers_count ;
non_eager_rdma_endpoints + = ( device - > non_eager_rdma_endpoints + device - > pollme ) ;
2008-01-09 10:27:15 +00:00
for ( i = 0 ; i < c ; i + + ) {
2008-07-23 00:28:59 +00:00
endpoint = device - > eager_rdma_buffers [ i ] ;
2008-01-09 10:27:15 +00:00
if ( ! endpoint )
continue ;
OPAL_THREAD_LOCK ( & endpoint - > eager_rdma_local . lock ) ;
frag = MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG ( endpoint ,
endpoint - > eager_rdma_local . head ) ;
if ( MCA_BTL_OPENIB_RDMA_FRAG_LOCAL ( frag ) ) {
uint32_t size ;
mca_btl_openib_module_t * btl = endpoint - > endpoint_btl ;
2008-10-15 16:56:42 +00:00
opal_atomic_mb ( ) ;
2008-01-09 10:27:15 +00:00
if ( endpoint - > nbo ) {
BTL_OPENIB_FOOTER_NTOH ( * frag - > ftr ) ;
2007-06-14 01:59:25 +00:00
}
2008-01-09 10:27:15 +00:00
size = MCA_BTL_OPENIB_RDMA_FRAG_GET_SIZE ( frag - > ftr ) ;
2009-05-06 20:11:28 +00:00
# if OPAL_ENABLE_DEBUG
2008-01-09 10:27:15 +00:00
if ( frag - > ftr - > seq ! = endpoint - > eager_rdma_local . seq )
BTL_ERROR ( ( " Eager RDMA wrong SEQ: received %d expected %d " ,
frag - > ftr - > seq ,
endpoint - > eager_rdma_local . seq ) ) ;
endpoint - > eager_rdma_local . seq + + ;
# endif
MCA_BTL_OPENIB_RDMA_NEXT_INDEX ( endpoint - > eager_rdma_local . head ) ;
OPAL_THREAD_UNLOCK ( & endpoint - > eager_rdma_local . lock ) ;
frag - > hdr = ( mca_btl_openib_header_t * ) ( ( ( char * ) frag - > ftr ) -
2012-03-01 17:29:40 +00:00
size - BTL_OPENIB_FTR_PADDING ( size ) + sizeof ( mca_btl_openib_footer_t ) ) ;
2015-03-10 14:21:42 -06:00
to_base_frag ( frag ) - > segment . seg_addr . pval =
2008-01-09 10:27:15 +00:00
( ( unsigned char * ) frag - > hdr ) + sizeof ( mca_btl_openib_header_t ) ;
ret = btl_openib_handle_incoming ( btl , to_com_frag ( frag ) - > endpoint ,
frag , size - sizeof ( mca_btl_openib_footer_t ) ) ;
2014-07-26 00:47:28 +00:00
if ( ret ! = OPAL_SUCCESS ) {
2010-05-19 11:55:45 +00:00
btl - > error_cb ( & btl - > super , MCA_BTL_ERROR_FLAGS_FATAL , NULL , NULL ) ;
2008-01-09 10:27:15 +00:00
return 0 ;
2007-06-14 01:59:25 +00:00
}
2008-01-09 10:27:15 +00:00
count + + ;
} else
OPAL_THREAD_UNLOCK ( & endpoint - > eager_rdma_local . lock ) ;
2007-06-14 01:59:25 +00:00
}
2008-07-23 00:28:59 +00:00
device - > eager_rdma_polls - - ;
2007-06-14 01:59:25 +00:00
2008-07-23 00:28:59 +00:00
if ( 0 = = count | | non_eager_rdma_endpoints ! = 0 | | ! device - > eager_rdma_polls ) {
count + = poll_device ( device , count ) ;
device - > eager_rdma_polls = mca_btl_openib_component . eager_rdma_poll_ratio ;
2008-01-09 10:27:15 +00:00
}
return count ;
}
/*
* IB component progress .
*/
static int btl_openib_component_progress ( void )
{
int i ;
int count = 0 ;
if ( OPAL_UNLIKELY ( mca_btl_openib_component . use_async_event_thread & &
2010-05-24 18:57:55 +00:00
mca_btl_openib_component . error_counter ) ) {
2008-01-09 10:27:15 +00:00
goto error ;
}
2008-07-23 00:28:59 +00:00
for ( i = 0 ; i < mca_btl_openib_component . devices_count ; i + + ) {
mca_btl_openib_device_t * device =
2010-07-12 16:17:56 +00:00
( mca_btl_openib_device_t * ) opal_pointer_array_get_item ( & mca_btl_openib_component . devices , i ) ;
2016-06-28 10:26:16 -06:00
if ( NULL ! = device ) {
count + = progress_one_device ( device ) ;
}
2008-01-09 10:27:15 +00:00
}
2013-11-01 12:19:40 +00:00
# if OPAL_CUDA_SUPPORT /* CUDA_ASYNC_SEND */
2013-01-17 22:34:43 +00:00
/* Check to see if there are any outstanding dtoh CUDA events that
* have completed . If so , issue the PML callbacks on the fragments .
* The only things that get completed here are asynchronous copies
* so there is no need to free anything .
*/
2015-06-23 20:59:57 -07:00
{
2013-01-17 22:34:43 +00:00
int local_count = 0 ;
mca_btl_base_descriptor_t * frag ;
while ( local_count < 10 & & ( 1 = = progress_one_cuda_dtoh_event ( & frag ) ) ) {
2013-05-14 20:49:42 +00:00
OPAL_OUTPUT ( ( - 1 , " btl_openib: event completed on frag=%p " , ( void * ) frag ) ) ;
2014-07-26 00:47:28 +00:00
frag - > des_cbfunc ( NULL , NULL , frag , OPAL_SUCCESS ) ;
2013-01-17 22:34:43 +00:00
local_count + + ;
}
count + = local_count ;
}
if ( count > 0 ) {
2013-05-14 20:49:42 +00:00
OPAL_OUTPUT ( ( - 1 , " btl_openib: DONE with openib progress, count=%d " , count ) ) ;
2013-01-17 22:34:43 +00:00
}
2013-11-01 12:19:40 +00:00
# endif /* OPAL_CUDA_SUPPORT */
2013-01-17 22:34:43 +00:00
2008-01-09 10:27:15 +00:00
return count ;
error :
/* Set the fatal counter to zero */
2010-05-24 18:57:55 +00:00
mca_btl_openib_component . error_counter = 0 ;
/* Let's find all error events */
2008-01-09 10:27:15 +00:00
for ( i = 0 ; i < mca_btl_openib_component . ib_num_btls ; i + + ) {
mca_btl_openib_module_t * openib_btl =
mca_btl_openib_component . openib_btls [ i ] ;
2008-07-23 00:28:59 +00:00
if ( openib_btl - > device - > got_fatal_event ) {
2010-05-19 11:55:45 +00:00
openib_btl - > error_cb ( & openib_btl - > super , MCA_BTL_ERROR_FLAGS_FATAL ,
NULL , NULL ) ;
2008-01-09 10:27:15 +00:00
}
2010-05-24 18:57:55 +00:00
if ( openib_btl - > device - > got_port_event ) {
/* These are non-fatal, so just ignore them. */
openib_btl - > device - > got_port_event = false ;
2011-01-19 20:58:22 +00:00
# if BTL_OPENIB_FAILOVER_ENABLED
2010-07-14 10:08:19 +00:00
mca_btl_openib_handle_btl_error ( openib_btl ) ;
# endif
2010-05-24 18:57:55 +00:00
}
2008-01-09 10:27:15 +00:00
}
return count ;
2007-06-14 01:59:25 +00:00
}
2007-11-28 14:52:31 +00:00
int mca_btl_openib_post_srr ( mca_btl_openib_module_t * openib_btl , const int qp )
{
2009-12-15 15:52:10 +00:00
int rd_low_local = openib_btl - > qps [ qp ] . u . srq_qp . rd_low_local ;
int rd_curr_num = openib_btl - > qps [ qp ] . u . srq_qp . rd_curr_num ;
2007-11-28 14:57:15 +00:00
int num_post , i , rc ;
struct ibv_recv_wr * bad_wr , * wr_list = NULL , * wr = NULL ;
2007-11-28 14:52:31 +00:00
assert ( ! BTL_OPENIB_QP_TYPE_PP ( qp ) ) ;
OPAL_THREAD_LOCK ( & openib_btl - > ib_lock ) ;
2009-12-15 15:52:10 +00:00
if ( openib_btl - > qps [ qp ] . u . srq_qp . rd_posted > rd_low_local ) {
2007-11-28 14:57:15 +00:00
OPAL_THREAD_UNLOCK ( & openib_btl - > ib_lock ) ;
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2007-11-28 14:57:15 +00:00
}
2009-12-15 15:52:10 +00:00
num_post = rd_curr_num - openib_btl - > qps [ qp ] . u . srq_qp . rd_posted ;
2007-11-28 14:57:15 +00:00
2010-06-24 09:59:45 +00:00
if ( 0 = = num_post ) {
OPAL_THREAD_UNLOCK ( & openib_btl - > ib_lock ) ;
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2010-06-24 09:59:45 +00:00
}
2007-11-28 14:57:15 +00:00
for ( i = 0 ; i < num_post ; i + + ) {
2015-02-19 13:41:41 -07:00
opal_free_list_item_t * item ;
item = opal_free_list_wait ( & openib_btl - > device - > qps [ qp ] . recv_free ) ;
2007-11-28 14:57:15 +00:00
to_base_frag ( item ) - > base . order = qp ;
to_com_frag ( item ) - > endpoint = NULL ;
if ( NULL = = wr )
wr = wr_list = & to_recv_frag ( item ) - > rd_desc ;
else
wr = wr - > next = & to_recv_frag ( item ) - > rd_desc ;
}
wr - > next = NULL ;
rc = ibv_post_srq_recv ( openib_btl - > qps [ qp ] . u . srq_qp . srq , wr_list , & bad_wr ) ;
if ( OPAL_LIKELY ( 0 = = rc ) ) {
2009-12-15 15:52:10 +00:00
struct ibv_srq_attr srq_attr ;
2007-11-28 14:52:31 +00:00
OPAL_THREAD_ADD32 ( & openib_btl - > qps [ qp ] . u . srq_qp . rd_posted , num_post ) ;
2009-12-15 15:52:10 +00:00
if ( true = = openib_btl - > qps [ qp ] . u . srq_qp . srq_limit_event_flag ) {
srq_attr . max_wr = openib_btl - > qps [ qp ] . u . srq_qp . rd_curr_num ;
srq_attr . max_sge = 1 ;
srq_attr . srq_limit = mca_btl_openib_component . qp_infos [ qp ] . u . srq_qp . srq_limit ;
openib_btl - > qps [ qp ] . u . srq_qp . srq_limit_event_flag = false ;
if ( ibv_modify_srq ( openib_btl - > qps [ qp ] . u . srq_qp . srq , & srq_attr , IBV_SRQ_LIMIT ) ) {
BTL_ERROR ( ( " Failed to request limit event for srq on %s. "
" Fatal error, stopping async event thread " ,
ibv_get_device_name ( openib_btl - > device - > ib_dev ) ) ) ;
OPAL_THREAD_UNLOCK ( & openib_btl - > ib_lock ) ;
2014-07-26 00:47:28 +00:00
return OPAL_ERROR ;
2009-12-15 15:52:10 +00:00
}
}
2007-11-28 14:57:15 +00:00
OPAL_THREAD_UNLOCK ( & openib_btl - > ib_lock ) ;
2014-07-26 00:47:28 +00:00
return OPAL_SUCCESS ;
2007-11-28 14:52:31 +00:00
}
2007-11-28 14:57:15 +00:00
for ( i = 0 ; wr_list & & wr_list ! = bad_wr ; i + + , wr_list = wr_list - > next ) ;
BTL_ERROR ( ( " error posting receive descriptors to shared receive "
" queue %d (%d from %d) " , qp , i , num_post ) ) ;
OPAL_THREAD_UNLOCK ( & openib_btl - > ib_lock ) ;
2014-07-26 00:47:28 +00:00
return OPAL_ERROR ;
2007-11-28 14:52:31 +00:00
}
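The posting routine above links receive descriptors into a NULL-terminated chain, hands the head to `ibv_post_srq_recv`, and on failure walks the chain up to `bad_wr` to report how many requests were accepted. The sketch below isolates that chaining and counting logic so it can be run without libibverbs. `struct demo_wr`, `demo_chain`, and `demo_count_before` are hypothetical names used only for illustration; they are not part of Open MPI or the verbs API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ibv_recv_wr: only the link field matters here. */
struct demo_wr {
    struct demo_wr *next;
    int id;
};

/* Link the first n entries of reqs into a NULL-terminated chain.
 * Returns the head of the chain, or NULL when n is 0. */
static struct demo_wr *demo_chain(struct demo_wr *reqs, int n)
{
    struct demo_wr *head = NULL, *tail = NULL;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        if (NULL == tail) {
            head = tail = &reqs[i];          /* first element starts the chain */
        } else {
            tail = tail->next = &reqs[i];    /* append and advance the tail */
        }
    }
    if (NULL != tail) {
        tail->next = NULL;                   /* terminate the chain */
    }
    return head;
}

/* Count the requests that precede `bad` in the chain, i.e. the ones that
 * were accepted before the failing request. If `bad` is not in the chain,
 * the whole chain length is returned. */
static int demo_count_before(const struct demo_wr *head, const struct demo_wr *bad)
{
    int i = 0;

    for (; NULL != head && head != bad; head = head->next) {
        i++;
    }
    return i;
}
```

Guarding the terminating `next = NULL` store with a NULL check keeps the helper safe for an empty chain; the original code avoids that case by returning early when `num_post` is 0.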
2012-05-08 20:42:09 +00:00
2015-10-06 12:47:09 -06:00
struct mca_btl_openib_event_t {
opal_event_t super ;
void * ( * fn ) ( void * ) ;
void * arg ;
opal_event_t * event ;
} ;
typedef struct mca_btl_openib_event_t mca_btl_openib_event_t ;
static void * mca_btl_openib_run_once_cb ( int fd , int flags , void * context )
{
mca_btl_openib_event_t * event = ( mca_btl_openib_event_t * ) context ;
void * ret ;
ret = event - > fn ( event - > arg ) ;
opal_event_del ( & event - > super ) ;
free ( event ) ;
return ret ;
}
int mca_btl_openib_run_in_main ( void * ( * fn ) ( void * ) , void * arg )
{
mca_btl_openib_event_t * event = malloc ( sizeof ( mca_btl_openib_event_t ) ) ;
if ( OPAL_UNLIKELY ( NULL = = event ) ) {
return OPAL_ERR_OUT_OF_RESOURCE ;
}
event - > fn = fn ;
event - > arg = arg ;
opal_event_set ( opal_sync_event_base , & event - > super , - 1 , OPAL_EV_READ ,
mca_btl_openib_run_once_cb , event ) ;
opal_event_active ( & event - > super , OPAL_EV_READ , 1 ) ;
return OPAL_SUCCESS ;
}
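`mca_btl_openib_run_in_main` allocates a small record holding a function and its argument, hands it to the event base, and lets the callback run the function once and free the record. The sketch below shows the same run-once deferral pattern with a plain linked list in place of the OPAL event base. All `demo_*` names are hypothetical, the queue is not thread-safe, and pending calls run in last-in, first-out order, which may differ from how the real event library orders activations.

```c
#include <assert.h>
#include <stdlib.h>

typedef void *(*demo_fn_t)(void *);

/* One deferred call: the function, its argument, and a link to the next record. */
struct demo_event {
    struct demo_event *next;
    demo_fn_t fn;
    void *arg;
};

/* Pending calls, newest first. Not protected by a lock in this sketch. */
static struct demo_event *demo_pending = NULL;

/* Queue fn(arg) for later execution. Returns 0 on success, -1 if allocation fails. */
static int demo_run_in_main(demo_fn_t fn, void *arg)
{
    struct demo_event *ev;

    assert(NULL != fn);
    ev = malloc(sizeof(*ev));
    if (NULL == ev) {
        return -1;
    }
    ev->fn = fn;
    ev->arg = arg;
    ev->next = demo_pending;
    demo_pending = ev;
    return 0;
}

/* Run every pending call exactly once and free its record.
 * Returns the number of calls that were run. */
static int demo_progress(void)
{
    int n = 0;

    while (NULL != demo_pending) {
        struct demo_event *ev = demo_pending;
        demo_pending = ev->next;     /* unlink before calling, so fn may queue more work */
        (void) ev->fn(ev->arg);
        free(ev);
        n++;
    }
    return n;
}

/* Example deferred function: increments the int that arg points to. */
static void *demo_increment(void *arg)
{
    ++*(int *) arg;
    return NULL;
}
```

A caller would queue work with `demo_run_in_main(demo_increment, &counter)` from any context and have the main loop call `demo_progress()` to drain it.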