openmpi/ompi/mca
Jeff Squyres 6ae45b34fc usnic: check connectivity on first communication to a peer
Previously, we were only checking connectivity upon first ''send'' to
a peer.  But this ignores the case where the first communication to a
peer is actually an ACK -- i.e., we successfully received something
from the peer and we need to send an ACK back.  So we need to verify
that the ACK will actually get there.

Specifically, certain asymmetric routing cases can lead to a hang if
we don't check the connectivity in both directions.  For example, the
sender may be able to get traffic to the receiver while the receiver
is unable to get traffic back to the sender because it made a
different routing decision than the sender.

In this case, the connectivity checker from the sender could succeed
(because the connectivity checker will ACK along the same path in
which the ping was received), but sending a BTL ACK could fail
(because the BTL ACK will be sent back along the path chosen by the
graph algorithm, which, in an erroneous asymmetric routing scenario,
may be different/wrong).

Hence, we want to trigger the connectivity checker at the first
communication from A->B, which may either be a BTL send or an ACK.
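
The trigger described above can be sketched roughly as follows.  This
is a minimal illustration only, not the actual usnic BTL code: the
peer_t struct and the functions start_connectivity_check(),
ensure_connectivity_checked(), btl_send(), and btl_send_ack() are all
hypothetical names invented for this example.  The idea it shows is a
per-peer flag that is tested on every outbound path, so that whichever
of a BTL send or an ACK happens first starts the check, and it starts
only once.

```c
#include <stdbool.h>

/* Hypothetical per-peer state; NOT the real usnic endpoint struct. */
typedef struct {
    int  peer_id;
    bool connectivity_checked;  /* set once the check has been started */
    int  checks_started;        /* counter used only for illustration */
} peer_t;

/* Hypothetical stand-in for kicking off the connectivity checker. */
static void start_connectivity_check(peer_t *peer)
{
    peer->checks_started++;
}

/* Called from every outbound path (assumption: sends and ACKs both
 * funnel through here), so the first communication of either kind
 * triggers the check exactly once. */
static void ensure_connectivity_checked(peer_t *peer)
{
    if (!peer->connectivity_checked) {
        peer->connectivity_checked = true;
        start_connectivity_check(peer);
    }
}

/* Hypothetical BTL send path. */
static void btl_send(peer_t *peer)
{
    ensure_connectivity_checked(peer);
    /* ... the actual send would go here ... */
}

/* Hypothetical BTL ACK path: before the fix described in this commit,
 * a path like this one would not have triggered the check. */
static void btl_send_ack(peer_t *peer)
{
    ensure_connectivity_checked(peer);
    /* ... the actual ACK would go here ... */
}
```

With this shape, a peer whose first outbound message is an ACK still
gets its return path verified, and a later send to the same peer does
not start a second check.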

Reviewed by Dave Goodell.

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32309.
2014-07-24 21:32:56 +00:00
allocator allocator/bucket: free all memory associated with a bucket allocator 2014-05-14 21:15:39 +00:00
bcol ompi_mpi_abort had one extra argument that was never used. Clean it up. 2014-07-03 00:34:44 +00:00
bml Per RFC: Remove des_src and des_dst members from the 2014-07-10 16:31:15 +00:00
btl usnic: check connectivity on first communication to a peer 2014-07-24 21:32:56 +00:00
coll HCOLL: fix misplaced hcoll_init return value check. 2014-07-22 18:47:34 +00:00
common Create new error message so we can better pinpoint where an error occurs. 2014-07-24 15:18:55 +00:00
crcp get the FT code to compile again by adding/removing #includes 2014-06-25 18:42:17 +00:00
dpm Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. 2014-07-15 03:48:00 +00:00
fbtl Next step of RFC: OMPI_CHECK_FUNC_LIB -> OPAL_CHECK_FUNC_LIB 2014-05-01 14:57:43 +00:00
fcoll clean up of the MCA parameters of the fcoll framework. Most parameters are now 2014-07-23 19:03:14 +00:00
fs The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature: 2014-05-07 21:48:53 +00:00
io clean up of the MCA parameters of the fcoll framework. Most parameters are now 2014-07-23 19:03:14 +00:00
mpool Fix CUDA registration where we run out of memory being allocated. 2014-07-23 21:10:17 +00:00
mtl MXM: use bulk connection establishment API 2014-07-17 08:35:55 +00:00
op The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature: 2014-05-07 21:48:53 +00:00
osc This commit fixes trac:4662 - "Portals4/MTL hangs in c_get_accumulate test". 2014-07-23 19:13:07 +00:00
pml Fix typo in r32196 2014-07-14 21:00:53 +00:00
pubsub Per the RFC issued here: 2014-06-01 04:28:17 +00:00
rcache Fix longstanding issue with our multi-project support. Rather than using 2014-01-07 22:11:15 +00:00
rte Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. 2014-07-15 03:48:00 +00:00
sbgp Looks like someone missed a '|' in this comparison, so correct it here 2014-05-14 16:53:22 +00:00
sharedfp The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature: 2014-05-07 21:48:53 +00:00
topo Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed. 2014-06-25 20:43:28 +00:00
vprotocol ompi_mpi_abort had one extra argument that was never used. Clean it up. 2014-07-03 00:34:44 +00:00