* Move the passive side's QP transition to RTS to before we send the
reply (vs. doing it after we get the RTU). A lengthy comment explains
the need for this.
* Add some timers to the code for analyzing where time is spent.
* Clarify a few error messages.
* Currently have a loop around ib_cm_listen() because it sometimes
fails for seemingly no reason. Have e-mails pending with Sean
Hefty to see if we can figure out why this is happening. Note that
the more MPI processes you add, the more likely this error is to
occur (e.g., ran 720 processes and it happened at least 50% of the
time). This makes IBCM somewhat unattractive for general use;
hopefully we can get a fix...
This commit was SVN r18766.
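The retry loop around ib_cm_listen() mentioned in r18766 could be sketched roughly as follows. This is only an illustration, not the code from the commit: listen_fn_t, listen_with_retry() and flaky_listen() are hypothetical names, and the callback stands in for the real listen call so that the sketch needs no InfiniBand headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature standing in for a listen call such as
 * ib_cm_listen(); returns 0 on success, nonzero on failure. */
typedef int (*listen_fn_t)(void *ctx);

/* Retry a flaky listen call up to max_attempts times.
 * Returns 0 on success or the last error code. If attempts_out is
 * non-NULL it receives the number of calls actually made. */
static int listen_with_retry(listen_fn_t do_listen, void *ctx,
                             int max_attempts, int *attempts_out)
{
    int rc = -1;
    int attempts = 0;

    assert(NULL != do_listen && max_attempts > 0);

    while (attempts < max_attempts) {
        ++attempts;
        rc = do_listen(ctx);
        if (0 == rc) {
            break;              /* listen succeeded; stop retrying */
        }
    }
    if (NULL != attempts_out) {
        *attempts_out = attempts;
    }
    return rc;
}

/* Hypothetical stub: fails until the counter behind ctx reaches zero. */
static int flaky_listen(void *ctx)
{
    int *failures_left = (int *) ctx;

    if (*failures_left > 0) {
        --(*failures_left);
        return -1;
    }
    return 0;
}
```

Bounding the number of attempts keeps a persistent failure from turning into an infinite loop; the caller still sees the last error code and can report it.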
endpoint.c) because it's almost identical in all the CPCs. OOB,
XOOB, and IBCM now all invoke the btl error handler properly if
there's an error during wireup. RDMACM still needs to be done.
This commit was SVN r18764.
The following SVN revision numbers were found above:
r18762 --> open-mpi/ompi@3eda04578f
upper level btl (and therefore the PML) when something goes wrong
during wireup.
Refs trac:1283.
This commit was SVN r18762.
The following Trac tickets were found above:
Ticket 1283 --> https://svn.open-mpi.org/trac/ompi/ticket/1283
* Properly handle non-symmetric subnet IDs
* Be a bit more stringent when checking for the GID
* Add lots of BTL_VERBOSE's for diagnostics
This commit was SVN r18754.
as the send and the put share the same execution path and in the case of the
put the MCA_BTL_DES_SEND_ALWAYS_CALLBACK was not set.
This commit was SVN r18735.
ompi_info will crash due to a reference to an unallocated struct. Add a check for
this struct to prevent the crash.
This commit was SVN r18727.
The following SVN revision numbers were found above:
r18725 --> open-mpi/ompi@c2351d39f1
If the rdmacm CPC is disabled and an iWARP device is forced to be used, then
mca_btl_openib_get_iwarp_subnet_id will attempt to reference an uninitialized
list. Modify the code to check for the uninitialized list and return zero.
Also, fix a memory leak caused by not freeing allocated address structs during
shutdown.
This commit was SVN r18725.
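The two fixes in r18725 (return zero when the address list was never initialized, and free the allocated entries at shutdown) follow a common pattern that might be sketched as below. The types and the functions addr_list_add(), lookup_subnet_id() and addr_list_free() are hypothetical stand-ins for illustration; they are not the actual Open MPI structures or the body of mca_btl_openib_get_iwarp_subnet_id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the actual Open MPI structures. */
struct addr_entry {
    struct addr_entry *next;
    char dev_name[32];
    uint64_t subnet_id;
};

struct addr_list {
    bool initialized;           /* false until setup has run */
    struct addr_entry *head;
};

/* Add one entry at the head of the list.
 * Returns 0 on success, -1 on allocation failure. */
static int addr_list_add(struct addr_list *list, const char *dev_name,
                         uint64_t subnet_id)
{
    struct addr_entry *e;

    assert(NULL != list && NULL != dev_name);
    e = calloc(1, sizeof(*e));
    if (NULL == e) {
        return -1;
    }
    /* calloc zeroed the buffer, so the copy stays NUL-terminated */
    strncpy(e->dev_name, dev_name, sizeof(e->dev_name) - 1);
    e->subnet_id = subnet_id;
    e->next = list->head;
    list->head = e;
    list->initialized = true;
    return 0;
}

/* Return the subnet ID for dev_name, or 0 if the list was never
 * initialized or holds no matching entry. */
static uint64_t lookup_subnet_id(const struct addr_list *list,
                                 const char *dev_name)
{
    const struct addr_entry *e;

    if (NULL == list || !list->initialized) {
        return 0;               /* guard: nothing to walk */
    }
    for (e = list->head; NULL != e; e = e->next) {
        if (0 == strcmp(e->dev_name, dev_name)) {
            return e->subnet_id;
        }
    }
    return 0;
}

/* Free every entry at shutdown so nothing is leaked. */
static void addr_list_free(struct addr_list *list)
{
    struct addr_entry *e = list->head;

    while (NULL != e) {
        struct addr_entry *next = e->next;
        free(e);
        e = next;
    }
    list->head = NULL;
    list->initialized = false;
}
```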
the fragments that failed to be sent, there is no need to replicate the same
mechanism in the BTL.
Force the SM BTL to empty all ack fragments in the component progress function.
This commit was SVN r18724.
INI file use_eager_rdma value (fixes trac:1169)
* fixed a typo in a MCA param help message
* made the check for enabling short/eager RDMA more robust in the
presence of progress threads; it now emits a show_help warning
This commit was SVN r18723.
The following Trac tickets were found above:
Ticket 1169 --> https://svn.open-mpi.org/trac/ompi/ticket/1169
specified, probe for max value supported by device.
This commit was SVN r18720.
The following Trac tickets were found above:
Ticket 1355 --> https://svn.open-mpi.org/trac/ompi/ticket/1355
* Ensure that we destroy the correct QP (local->id[num]->qp will
always have a valid pointer in it, even if we set up a dummy QP)
* Note two places where we need to figure out how to propagate
errors up from the CPC to the main BTL / PML when errors
occur. We probably have the same issue in IBCM, too.
This commit was SVN r18700.
two processes on the same server (!). So for today, we'll simply mark
all local processes that use iWARP adapters as "unreachable".
More details in #1352.
This commit was SVN r18699.
(e.g., if you're excluding some devices, their destructors will be
invoked before the async event thread was set up for them).
This commit was SVN r18698.
multiple adapters (e.g., Chelsio T3).
But we need to figure out how to determine a good value for the
resident adapter(s) at runtime. It's problematic because, for
example, Mellanox ConnectX and Chelsio T3 report max_inline values
differently at runtime. If you call ibv_create_qp with a max_inline value
of 0, ConnectX reports back a value that is a formula based on a few
other values (e.g., max_send_sge and max_recv_sge). But T3 always
reports back "64".
We're looking into this to figure out the best way -- reducing the
default right now should allow other adapters to run while we figure
it out.
This commit was SVN r18697.
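One conceivable way to determine a usable max_inline value at runtime, given that adapters report it differently, is to probe downward from a large request until QP creation succeeds. The sketch below is an assumption for illustration only and is not what r18697 does (that commit only lowers the default). try_inline_fn_t, probe_max_inline() and fake_device() are hypothetical; in real code the callback would wrap ibv_create_qp() with cap.max_inline_data set to the requested value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns 0 if a QP could be created with the
 * requested max_inline value, nonzero otherwise. */
typedef int (*try_inline_fn_t)(uint32_t max_inline, void *ctx);

/* Halve the requested value until creation succeeds.
 * Returns the first value that worked, or 0 if none did. */
static uint32_t probe_max_inline(try_inline_fn_t try_create, void *ctx,
                                 uint32_t start)
{
    uint32_t value = start;

    assert(NULL != try_create);

    while (value > 0) {
        if (0 == try_create(value, ctx)) {
            return value;
        }
        value /= 2;
    }
    return 0;
}

/* Hypothetical stub device: accepts any request up to the limit
 * stored behind ctx. */
static int fake_device(uint32_t max_inline, void *ctx)
{
    const uint32_t *limit = (const uint32_t *) ctx;

    return (max_inline <= *limit) ? 0 : -1;
}
```

Halving finds a working value in few attempts but not necessarily the largest one the device supports; a binary search between the last failure and the first success would tighten the result at the cost of more QP creations.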
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.
I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.
This commit was SVN r18619.