There is still a problem with OpenIB and threads (external to C/R functionality). It has been reported in Ticket #1539
Additionally:
* Fix a file cleanup bug in CRS Base.
* Fix a possible deadlock in the TCP ft_event function
* Add a mca_base_param_deregister() function to MCA base
* Add whole process checkpoint timers
* Add support for BTL: OpenIB, MX, Shared Memory
* Add support Mpool: rdma, sm
* Sundry bounds checking an cleanup in some scattered functions
This commit was SVN r19756.
I run IMB exchange on two QS22 machines with r19674 and it got stucked after 256 or 512 bytes every time.
After applying r19717 the test passed, so I guess this is a essential patch.
This commit was SVN r19752.
The following SVN revision numbers were found above:
r19674 --> open-mpi/ompi@15c47a2473
r19717 --> open-mpi/ompi@0a765cd788
following names are all new for v1.3, and therefore haven't been
officially released yet:
* btl_openib_of_cq_size
* btl_openib_of_max_inline_data
* btl_openib_of_pkey
* btl_openib_of_psn
* btl_openib_of_mtu
The "_of_" (for OpenFabrics) in there is redundant. It used to be
"_ib_", indicating that these values are pretty much passed directly
to the verbs stack. But I think the "openib" in the name implies this
already; having "_of_" in there just seems redundant, makes the name
longer, and seems redundant. It's also redundant.
So I took those "_of_"'s out of the MCA names. The old (v1.2) names
are still valid (but deprecated), such ash btl_openib_ib_cq_size.
This commit was SVN r19718.
* Change name: mca_btl_openib_of_pkey_value -> mca_btl_openib_of_pkey
(since now there's no index, the "_value" suffix is somewhat
superfluous)
* Put in a better help message for the _pkey MCA param (to agree with
the new help message in v1.2.8)
This commit was SVN r19716.
Commit from a long-standing Mercurial tree that ended up incorporating a lot of things:
* A few fixes for CPC interface changes in all the CPCs
* Attempts (but not yet finished) to fix shutdown problems in the IB CM CPC
* #1319: add CTS support (i.e., initiator guarantees to send first message; automatically activated for iWARP over the RDMA CM CPC)
* Some variable and function renamings to make this be generic (e.g., alloc_credit_frag became alloc_control_frag)
* CPCs no longer post receive buffers; they only post a single receive buffer for the CTS if they use CTS. Instead, the main BTL now posts the main sets of receive buffers.
* CPCs allocate a CTS buffer only if they're about to make a connection
* RDMA CM improvements:
* Use threaded mode openib fd monitoring to wait for for RDMA CM events
* Synchronize endpoint finalization and disconnection between main thread and service thread to avoid/fix some race conditions
* Converted several structs to be OBJs so that we can use reference counting to know when to invoke destructors
* Make some new OBJ's have opal_list_item_t's as their base, thereby eliminating the need for the local list_item_t type
* Renamed many variables to be internally consistent
* Centralize the decision in an inline function as to whether this process or the remote process is supposed to be the initiator
* Add oodles of OPAL_OUTPUT statements for debugging (hard-wired to output stream -1; to be activated by developers if they want/need them)
* Use rdma_create_qp() instead of ibv_create_qp()
* openib fd monitoring improvements:
* Renamed a bunch of functions and variables to be a little more obvious as to their true function
* Use pipes to communicate between main thread and service thread
* Add ability for main thread to invoke a function back on the service thread
* Ensure to set initiator_depth and responder_resources properly, but putting max_qp_rd_ataom and ma_qp_init_rd_atom in the modex (see rdma_connect(3))
* Ensure to set the source IP address in rdma_resolve() to ensure that we select the correct OpenFabrics source port
* Make new MCA param: openib_btl_connect_rdmacm_resolve_timeout
* Other improvements:
* btl_openib_device_type MCA param: can be "iw" or "ib" or "all" (or "infiniband" or "iwarp")
* Somewhat improved error handling
* Bunches of spelling fixes in comments, VERBOSE, and OUTPUT statements
* Oodles of little coding style fixes
* Changed shutdown ordering of btl; the device is now an OBJ with ref counting for destruction
* Added some more show_help error messages
* Change configury to only build IBCM / RDMACM if we have threads (because we need a progress thread)
This commit was SVN r19686.
The following Trac tickets were found above:
Ticket 1210 --> https://svn.open-mpi.org/trac/ompi/ticket/1210
(related to the presence of posix threads and ptmalloc2) is now a
little outdated: since we don't build ptmalloc2 as part of libopal
anymore, the openib BTL's requirements are not directly tied to
ptmalloc2's anymore. Specifically, I altered the test to:
1. At compile time, if no threads are found, the ptmalloc2 component
is going to be built, '''and the ptmalloc2 component is going to be
inside libopal,''' then refuse to build the openib BTL.
1. At run time, if no threads were available at compile time and the
ptmalloc2 component is part of the process, then refuse to use the
openib BTL.
Fixes trac:1537.
This commit was SVN r19652.
The following Trac tickets were found above:
Ticket 1537 --> https://svn.open-mpi.org/trac/ompi/ticket/1537
figure it out at runtime (really meaning: we'll still default to "0"
unless something explicitly overrides to 1, such as the openib BTL).
This way, ompi_info doesn't confusingly report mpi_leave_pinned==0 for
mpi_leave_pinned, but we end up running with mpi_leave_pinned==1.
Fixes trac:1502.
This commit was SVN r19571.
The following Trac tickets were found above:
Ticket 1502 --> https://svn.open-mpi.org/trac/ompi/ticket/1502
See opal/mca/paffinity/paffinity.h for explanation as to the physical vs logical nature of the params used in the API.
Fixes trac:1435
This commit was SVN r19391.
The following Trac tickets were found above:
Ticket 1435 --> https://svn.open-mpi.org/trac/ompi/ticket/1435
Roland noticed that the QLogic HCA driver was using the PCIe vendor ID
for the ibv_query_device so the IEEE OUI value is now used. This means
the config file should recognize the vendor ID value 0x1175 too.
This commit was SVN r19277.
This is setup so that it only is issued once (as opposed to every time they do it), and goes through orte_show_help so the user doesn't get hammered by #procs copies of the warning. In addition, there is a new MCA param (can't have too many!) to shut the warning off altogether.
This closes ticket #1244
This commit was SVN r19196.
* add "register" function to mca_base_component_t
* converted coll:basic and paffinity:linux and paffinity:solaris to
use this function
* we'll convert the rest over time (I'll file a ticket once all
this is committed)
* add 32 bytes of "reserved" space to the end of mca_base_component_t
and mca_base_component_data_2_0_0_t to make future upgrades
[slightly] easier
* new mca_base_component_t size: 196 bytes
* new mca_base_component_data_2_0_0_t size: 36 bytes
* MCA base version bumped to v2.0
* '''We now refuse to load components that are not MCA v2.0.x'''
* all MCA frameworks versions bumped to v2.0
* be a little more explicit about version numbers in the MCA base
* add big comment in mca.h about versioning philosophy
This commit was SVN r19073.
The following Trac tickets were found above:
Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392
* Use synonym/deprecated MCA param API for some mca base params
* In openib BTL, if we have appropriate memory hooks support, and if
mpi_leave_pinned and mpi_leave_pinned_pipeline were not set by the
user, set mpi_leave_pinned to 1.
* Defer checking mpi_leave_pinned_* until as late as possible (i.e.,
until after the btl's have had a chance to set mpi_leave_pinned to
1):
* in ob1 pml
* in rdma mpool
This commit was SVN r19022.
The following Trac tickets were found above:
Ticket 1379 --> https://svn.open-mpi.org/trac/ompi/ticket/1379
"!OpenFabrics" / neutral (i.e., refer to IB and/or iWARP).
* Mostly just type, variable/field, and funcion name changes, such as
s/hca/device/g, etc.
* Changed the INI file for the hardware-specific parameters to be
mca-btl-openib-device-params.ini.
* Updated a lot of help messages in the help-*.txt files, not just to
update it to be !OpenFabrics/neutral language, but also for some
consistency of tone, indenting, etc.
* Deprecated a bunch of MCA params in favor of language-neutral new
ones:
* btl_openib_warn_no_hca_params_found (s/hca/device/)
* btl_openib_hca_param_files
* btl_openib_ib_cq_size (s/_ib_/_of_/)
* btl_openib_ib_max_inline_data
* btl_openib_ib_psn
* btl_openib_ib_mtu
* btl_openib_ib_pkey_ix
* btl_openib_ib_pkey_val
This commit was SVN r18985.
The following Trac tickets were found above:
Ticket 1295 --> https://svn.open-mpi.org/trac/ompi/ticket/1295
The rdmacm event handler has no way of reporting fatal errors to the upper
layers. By calling mca_btl_openib_endpoint_invoke_error in the rdmacm event
handler for the errors encountered, these errors can now be handled
appropriately.
Closes out Ticket #1283
This commit was SVN r18980.
If IBCM was explicitly specified with exclude/include parameter,
OpenIB BTL will enable verbose report for "/dev/infiniband/ucm" error,
other way the error will not be reported.
This commit was SVN r18868.
first to the trunk. So, here is the trunk checkin:
The call to orte_show_help() to notify truncation of the max_inline value
was missing the want_error_header boolean, which eventually results in
a SEGV. This change corrects the call with the bool set to true.
This commit was SVN r18839.
* Move the passive side QP move to RTS to before we send the reply
(vs. sending it after we get the RTU). A lengthy comment explains
the need for this.
* Add some timers to the code for analyzing where time is spent.
* Clarify a few error messages.
* Currently have a loop around ib_cm_listen() because sometimes it
fails for seeminly no reason. Have pending e-mails in to Sean
Hefty to see if we can figure out why this is happening. Note that
the more MPI processes you add, the more likely this error is to
occur (e.g., ran 720 processes and it happens at least 50% of the
time). This makes IBCM somewhat unattractive for general use;
hopefully we can get a fix...
This commit was SVN r18766.
endpoint.c) because it's almost identical in all the CPC's. OOB,
XOOB, and IBCM now all invoke the btl error handler properly if
there's an error during wireup. RDMACM still needs to be done.
This commit was SVN r18764.
The following SVN revision numbers were found above:
r18762 --> open-mpi/ompi@3eda04578f