performance on a 2G Myrinet card, as it looks like pipelining the messages
in 1 MB fragments is faster than a simple send/receive. However, when using a
10G card the send/receive will limit the maximum bandwidth to 2.5 Gb/s. The
reason is that the scarce bus resources have to be shared between the Myrinet
hardware and the memcpy operation. The PUT protocol removes the memcpy, so we
now have a true zero-copy mechanism. But there is no pipelining yet, as it
looks like the RDMA pipeline somehow disappeared from the OB1 PML ...
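As a rough illustration of the difference (a minimal sketch: the 1 MB fragment
size comes from the pipelining above, while post_send() and rdma_put() are
hypothetical stand-ins, not the Open MPI BTL interface):

    #include <stddef.h>
    #include <string.h>

    #define FRAG_SIZE (1 << 20)                /* 1 MB pipeline fragments */

    /* Hypothetical stand-ins for the interconnect send primitives. */
    static void post_send(const void *buf, size_t len) { (void)buf; (void)len; }
    static void rdma_put(const void *buf, size_t len)  { (void)buf; (void)len; }

    static char bounce_buffer[FRAG_SIZE];

    /* Pipelined send/receive: each fragment is copied into a registered
     * bounce buffer, so every byte crosses the bus twice (memcpy + NIC DMA).
     * Overlapping the copy of fragment i with the send of fragment i-1 hides
     * the copy on a 2G card, but on a 10G card the shared bus becomes the
     * bottleneck. */
    static void pipelined_send(const char *user_buf, size_t len)
    {
        for (size_t off = 0; off < len; off += FRAG_SIZE) {
            size_t frag = (len - off < FRAG_SIZE) ? (len - off) : FRAG_SIZE;
            memcpy(bounce_buffer, user_buf + off, frag);
            post_send(bounce_buffer, frag);
        }
    }

    /* PUT protocol: the NIC reads the user buffer directly (true zero-copy),
     * so the memcpy and its bus traffic disappear. */
    static void put_send(const char *user_buf, size_t len)
    {
        rdma_put(user_buf, len);
    }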
This commit was SVN r12925.
It contains four algorithms:
Bruck (ceil(log P) steps), Recursive Doubling (log P steps for power-of-two numbers of processes),
Ring (P-1 steps), and Neighbor Exchange (P/2 steps for even numbers of processes).
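For reference, a minimal sketch of the Ring pattern (P-1 steps), assuming each
rank contributes a fixed-size block of ints; this is an illustration, not the
tuned implementation in the module:

    #include <mpi.h>

    static void ring_allgather(const int *sendbuf, int blockcount,
                               int *recvbuf, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        const int left  = (rank - 1 + size) % size;
        const int right = (rank + 1) % size;

        /* Start with our own block in place. */
        for (int i = 0; i < blockcount; i++) {
            recvbuf[rank * blockcount + i] = sendbuf[i];
        }

        /* In step s we forward the block received in step s-1, so after
         * P-1 steps every rank holds all P blocks. */
        for (int step = 0; step < size - 1; step++) {
            int sendblock = (rank - step + size) % size;
            int recvblock = (rank - step - 1 + size) % size;
            MPI_Sendrecv(recvbuf + sendblock * blockcount, blockcount, MPI_INT,
                         right, 0,
                         recvbuf + recvblock * blockcount, blockcount, MPI_INT,
                         left, 0, comm, MPI_STATUS_IGNORE);
        }
    }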
All algorithms passed occ, IMB-2.3, and intel verification tests from ompi-tests/ for up to 56 processes.
The fixed decision function is based on results collected over MX on the Grig cluster at
the University of Tennessee at Knoxville.
I have also added (and commented out) a copy of the MPICH2 decision function for allgather
(from their IJHPCA 2005 paper).
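The shape of such a fixed decision function is roughly the following (a sketch
with made-up thresholds; the real cut-off points come from the Grig/MX
measurements mentioned above):

    #include <stddef.h>

    enum allgather_alg { ALG_BRUCK, ALG_RECURSIVE_DOUBLING, ALG_RING, ALG_NEIGHBOR };

    static enum allgather_alg choose_allgather(int comm_size, size_t total_bytes)
    {
        const int pow2 = (comm_size & (comm_size - 1)) == 0;

        if (total_bytes < 8192) {            /* small messages: latency-bound */
            return pow2 ? ALG_RECURSIVE_DOUBLING : ALG_BRUCK;
        }
        if (comm_size % 2 == 0) {            /* large messages, even P */
            return ALG_NEIGHBOR;
        }
        return ALG_RING;                     /* large messages, odd P */
    }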
This commit was SVN r12910.
which can cause segfaults on shutdown. Calling mx_finalize() isn't enough
to shut down the thread, so we must close the endpoints as well.
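A sketch of the intended shutdown order (assuming the endpoints were obtained
from mx_open_endpoint(); error handling omitted):

    #include <myriexpress.h>

    static void shutdown_mx(mx_endpoint_t *endpoints, int num_endpoints)
    {
        /* Closing every endpoint is what actually stops the MX progress
         * thread; mx_finalize() alone does not. */
        for (int i = 0; i < num_endpoints; i++) {
            mx_close_endpoint(endpoints[i]);
        }
        mx_finalize();
    }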
Refs trac:513
This commit was SVN r12908.
The following Trac tickets were found above:
Ticket 513 --> https://svn.open-mpi.org/trac/ompi/ticket/513
if the remote architecture differs from the local architecture and the
btl doesn't support heterogeneous transport.
Refs trac:587
This commit was SVN r12879.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
The udapl/openib/vapi/gm mpools are deprecated. The rdma mpool has a parameter,
mpool_rdma_rcache_size_limit, that allows limiting its size (the default is 0 - unlimited).
This commit was SVN r12878.
Move the req_mtl structure back to the end of each of the structures in
the CM PML. The req_mtl structure is cast into an mtl_*_request structure
for each MTL, which is larger than req_mtl itself. The cast will cause
the *_request to overwrite parts of the heavy requests if the req_mtl
isn't the *LAST* thing on each structure (hence the comment). This was
moved as an optimization at some point, which caused buffer sends to fail...
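A self-contained illustration of the layout rule (the type names here are made
up; the real ones live in the CM PML and the individual MTLs):

    #include <stdio.h>

    typedef struct { int completion_flag; } mtl_request_t;   /* generic req_mtl */

    typedef struct {
        mtl_request_t super;
        char          extra_state[64];   /* MTL-specific fields past req_mtl */
    } mtl_foo_request_t;

    /* WRONG: req_mtl in the middle.  When the MTL casts &req_mtl to its own
     * (larger) request type and writes to it, it tramples `buffer`. */
    struct bad_heavy_request {
        mtl_request_t req_mtl;
        char          buffer[32];
    };

    /* RIGHT: req_mtl is the *LAST* member, with the MTL's extra room
     * allocated after the heavy request instead of on top of it. */
    struct good_heavy_request {
        char          buffer[32];
        mtl_request_t req_mtl;
        char          mtl_space[sizeof(mtl_foo_request_t) - sizeof(mtl_request_t)];
    };

    int main(void)
    {
        printf("an MTL request is %zu bytes, req_mtl only %zu: a mid-struct\n"
               "cast overwrites the %zu bytes that follow req_mtl.\n",
               sizeof(mtl_foo_request_t), sizeof(mtl_request_t),
               sizeof(mtl_foo_request_t) - sizeof(mtl_request_t));
        return 0;
    }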
Refs trac:669
This commit was SVN r12873.
The following SVN revision numbers were found above:
r12871 --> open-mpi/ompi@597598b712
The following Trac tickets were found above:
Ticket 669 --> https://svn.open-mpi.org/trac/ompi/ticket/669
CM PML. The req_mtl structure is cast into an mtl_*_request structure for
each MTL, which is larger than req_mtl itself. The cast will cause
the *_request to overwrite parts of the heavy requests if the req_mtl
isn't the *LAST* thing on each structure (hence the comment). This was
moved as an optimization at some point, which caused buffer sends to
fail...
Refs trac:669
This commit was SVN r12871.
The following Trac tickets were found above:
Ticket 669 --> https://svn.open-mpi.org/trac/ompi/ticket/669
Add the ability for ini files to recognize the "use_eager_rdma" flag. Set the
default to "no" (because we should assume that HCAs cannot support the
property necessary for using RDMA for eager messages -- that the last
byte of the message is guaranteed to be written to memory last --
unless proven otherwise. For example, iWARP cards apparently do not
provide this guarantee), and then set all Mellanox and IBM HCAs to
override the default to enable this behavior on these cards.
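The reason the guarantee matters is that eager RDMA arrival is detected by
polling the tail of the RDMA buffer, roughly like this (a sketch with
illustrative names, not the openib BTL's actual structures):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        char             payload[2048];
        volatile uint8_t footer_seq;   /* the last byte of the RDMA region */
    } eager_rdma_slot_t;

    /* The receiver polls the footer to detect a newly arrived message.
     * This is only safe if the HCA writes the last byte last
     * (use_eager_rdma = 1); otherwise the payload may still be in flight
     * when the footer already looks valid. */
    static const char *poll_eager_slot(eager_rdma_slot_t *slot,
                                       uint8_t expected_seq)
    {
        if (slot->footer_seq != expected_seq) {
            return NULL;               /* nothing fully delivered yet */
        }
        return slot->payload;
    }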
This commit was SVN r12851.
The following Trac tickets were found above:
Ticket 366 --> https://svn.open-mpi.org/trac/ompi/ticket/366
I found only two places that were looking at the tokens:
1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.
2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but it is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) strip the value structures of tokens and segment strings, and (c) correctly obtain process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).
This commit was SVN r12813.
usually is OK on little-endian systems, as the upper 32 bits will likely
be ignored, but on 32-bit big-endian systems lval is complete junk.
Use ival in 32-bit mode, lval in 64-bit mode.
Mixing 32- and 64-bit architectures won't work without more changes.
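A small demonstration of the failure mode, assuming the value was stored
through the 32-bit member of a union:

    #include <stdint.h>
    #include <stdio.h>

    union value {
        int32_t ival;
        int64_t lval;
    };

    int main(void)
    {
        union value v;
        v.lval = 0;      /* clear all 8 bytes so the output is deterministic */
        v.ival = 42;     /* the writer stored a 32-bit value */

        /* Little-endian: ival aliases the low-order half of lval, so reading
         * lval "happens" to give 42.  Big-endian: ival aliases the high-order
         * half, so lval reads as 42 << 32 -- complete junk. */
        printf("ival = %d, lval = %lld\n", v.ival, (long long)v.lval);
        return 0;
    }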
This commit was SVN r12802.
* Do not add new procs to the global list during modex callback or
when sharing orte names during accept/connect. For modex, we
cache the modex info for later, in case that proc ever does get
added to the global proc list. For accept/connect orte name
exchange between the roots, we only need the orte name, so no
need to add a proc structure anyway. The procs will be added
to the global process list during the proc exchange later in
  the wireup process.
* Rename proc_get_namebuf and proc_get_proclist to proc_pack
and proc_unpack and extend them to include all information
needed to build that proc struct on a remote node (which
includes ORTE name, architecture, and hostname). Change
unpack to call pml_add_procs for the entire list of new
procs at once, rather than one at a time.
* Remove ompi_proc_find_and_add from the public proc
interface and make it a private function. This function
would add a half-created proc to the global proc list, so
making it harder to call is a good thing.
This means that there are only two ways to add new procs to the global proc list at this time: during MPI_INIT via the call to ompi_proc_init, where my job is added to the list, and via ompi_proc_unpack using a buffer from a packed proc list sent to us by someone else. Currently, this is enough to implement MPI semantics. We can extend the interface more if we like, but that may require HNP communication to get the remote proc information, and I wanted to avoid that if at all possible.
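The packing idea, sketched with simplified stand-in types (not the real
ompi_proc_t or ORTE buffer API): everything needed to rebuild a proc travels
in one buffer, and unpack hands the whole list to add_procs in a single call.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        uint32_t jobid;          /* stand-in for the ORTE process name */
        uint32_t vpid;
        uint32_t arch;           /* architecture flags */
        char     hostname[64];
    } packed_proc_t;

    /* Pack a list of procs into a flat buffer owned by the caller. */
    static size_t proc_pack(const packed_proc_t *procs, size_t nprocs, void **buf)
    {
        size_t len = nprocs * sizeof(packed_proc_t);
        *buf = malloc(len);
        memcpy(*buf, procs, len);
        return len;
    }

    /* Unpack and hand the *entire* list to add_procs at once,
     * rather than one proc at a time. */
    static int proc_unpack(const void *buf, size_t len,
                           int (*add_procs)(const packed_proc_t *, size_t))
    {
        return add_procs((const packed_proc_t *)buf, len / sizeof(packed_proc_t));
    }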
Refs trac:564
This commit was SVN r12798.
The following Trac tickets were found above:
Ticket 564 --> https://svn.open-mpi.org/trac/ompi/ticket/564
r12714) for supporting compilers / architectures with different
padding rules.
This commit was SVN r12749.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r12491
r12714
hits the buffer on the other side. For this kind of BTL we need to send the
FIN through the same BTL the PUT was performed with, so the network will
handle the ordering for us. If we use another BTL, the receiver can get the
FIN before the data hits the buffer and complete the request prematurely. We
mark such problematic BTLs with the MCA_BTL_FLAGS_FAKE_RDMA flag (this kind
of RDMA is really fake, because real RDMA guarantees that the sender sees the
completion only after the receiver's NIC has confirmed that all the data was
received).
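In pseudo-C, the resulting rule looks roughly like this (only the flag name
comes from the code; the rest is illustrative):

    /* The flag value here is a placeholder, not the real constant. */
    #define MCA_BTL_FLAGS_FAKE_RDMA 0x0400

    typedef struct {
        unsigned int btl_flags;
    } btl_module_t;

    /* Pick the BTL that must carry the FIN for a completed PUT. */
    static btl_module_t *choose_fin_btl(btl_module_t *put_btl,
                                        btl_module_t *any_btl)
    {
        if (put_btl->btl_flags & MCA_BTL_FLAGS_FAKE_RDMA) {
            /* "Fake" RDMA: local completion does not mean the data reached
             * the peer's buffer, so only same-BTL ordering protects us. */
            return put_btl;
        }
        /* Real RDMA: completion implies the receiver's NIC got all the
         * data, so the FIN may travel over any BTL. */
        return any_btl;
    }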
This commit was SVN r12732.
It calls mca_pml_ob1_send_fin_btl(), which may fail, and doesn't check the
return code. This breaks all RDMA transports even when only one BTL is used.
Revert it for now; I am working on a real fix for the problem (I hope).
This commit was SVN r12731.
The following SVN revision numbers were found above:
r12720 --> open-mpi/ompi@3e3689320b
regression from v1.1 was reviewed and put into the v1.2 branch. So revert this
part of r12721.
This commit was SVN r12730.
The following SVN revision numbers were found above:
r12433 --> open-mpi/ompi@82f7c0dd69
r12721 --> open-mpi/ompi@3edd850d2e
protocol when multiple NICs are available between 2 peers. The fix forces
the FIN message to take exactly the same path as the fragment it describes
(i.e., same path means same BTL). Otherwise, the FIN can be received by
the peer before the RDMA completes and the request will get freed
too early.
This commit was SVN r12720.
- consistent error messages when something fails (via the BTL_ERROR macro)
- decrease the number of jumps
- clean up some parts of the code
This commit was SVN r12719.
The temporary solution is to switch into EV_NONBLOCK mode earlier (right after the mx_connect loop) so that there isn't a giant slowdown when processes enter the stage gate 2 barrier before other processes. They will now not block in the event library for any period of time, which appears to give a 50% speedup when running at > 64 procs.
Refs trac:645
This commit was SVN r12713.
The following Trac tickets were found above:
Ticket 645 --> https://svn.open-mpi.org/trac/ompi/ticket/645
so this isn't an issue there either. Refs trac:488
This commit was SVN r12675.
The following Trac tickets were found above:
Ticket 488 --> https://svn.open-mpi.org/trac/ompi/ticket/488
* Fix a counter roll-over issue that could result from a large (but
  not excessive) number of outstanding put/get/accumulate calls
  during a single synchronization epoch (Refs trac:506)
* Fix an epoch issue with the rdma component that would affect PWSC
  synchronization (Refs trac:507)
This commit was SVN r12673.
The following Trac tickets were found above:
Ticket 506 --> https://svn.open-mpi.org/trac/ompi/ticket/506
Ticket 507 --> https://svn.open-mpi.org/trac/ompi/ticket/507