Cisco v1.6 git commit 913ec6c and upstream trunk r29593 (segfault fix)
introduced a performance regression by inadvertently disabling the
`module_recv_buffers` functionality. With those changes in place, the
`btl_usnic_recv.c` logic would end up mallocing a buffer that should
have otherwise come from a `module_recv_buffers` pool. It also resulted
in a small, bounded memory leak (128 buffers at each power-of-two size
interval).
The new version just places the buffer after the free list item with a
flexible array member. I bumped the pool to allocate all 128 elements
up front because the deferred allocation was modestly impacting IMB
Sendrecv performance at a few sizes.
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29631.
The following SVN revision numbers were found above:
r29593 --> open-mpi/ompi@1ed9b8ff43
If we need to use a convertor, go back to stashing that convertor in the
frag and populating segments "on the fly" (in
ompi_btl_usnic_module_progress_sends). Previously we would pack into a
chain of chunk segments at prepare_src time, unnecessarily consuming
additional memory.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29592.
This new routine can be called in exceptional situations, either
conditionally in BTL code or from a debugger, to help with debugging in
cases where MSGDEBUG1/2 or stats logging are impractical but more detail
is needed.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
This commit was SVN r29483.
MSGDEBUG2 now means "print a one-liner for all PML calls into BTL, and
also when BTL calls PML with a recv completion (not send completions)"
MSGDEBUG1 means print more internal gory detail
MSGDEBUG is gone, replaced by MSGDEBUG1
In the process also found that PUT_DEST style fragments could
potentially be leaked in usnic_free() since send_fragment tests were
being applied to see if it was eligible to be freed.
This commit was SVN r29185.
changes required to support MPI_Bsend(). Introduces concept of
attaching a buffer to a large segment that the PML can scribble into and
we will send from. The reason we don't use a pinned buffer and send
directly from that is that usnic_verbs does not (yes) support num_sge>1
for regular sends. This means the data gets copied twice, but that is
unavoidable.
changed the logic in handle_large_send to be more sensible
Incorporated David's review comments
This commit was SVN r29184.
Do not assume that the "size" passed to alloc_send() will be the same as
the size of the message the resulting fragment will hold when
usnic_send() is called. This means usnic_send()/usnic_put() can never
trust any pre-computed size values, and are only allowed to look at the
lengths and pointers of the elements in the desc SG list.
This commit was SVN r29183.
- tag needs to be sent in *our* header, not the PML header
- usnic_alloc() should return smaller value if too much data requested
- be careful about callbacks vs removing items from lists
(we need to remove from outr lists *before* the callback)
- improve send callback handling
- add some more MSGDEBUG2 logging and cleanup
This commit was SVN r29181.
The fix for the HPL SEGV was incorrect because it assumed the
prepare_src() routine was always allowed to return "bytes processed"
less than the requested "bytes to send". It turns out this is only true
if the convertor is what limits the size, we are not allowed to limit
the data sent for our own reasons, else we break login in the upper
layers.
This means we need to learn the number of bytes out of the size
requested the convertor will give us, no matter how big the size is.
Unfortunately, this is a destructive test, and (currently) the only way to
learn that number is to actually have the convertor copy the data out into
buffers.
This change implements this, copying the entire data out into a chain of
send segments which are attached to the large send fragment. Now we can
always return the proper size value to the PML.
Fixes Cisco bug CSCuj08024
Authored-by: Reese Faucette <rfaucett@cisco.com>
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)
This commit was SVN r29137.
The following Trac tickets were found above:
Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
Brian (rightfully) hit me on the head with the
don't-use-ORTE-use-the-rte-framework clue bat; the usnic BTL now
nicely plays with the RTE framework.
This commit was SVN r28907.
This BTL accesses the Cisco usNIC Linux device via the Linux verbs
API via Unreliable Datagram queue pairs. A few noteworthy points:
* This BTL does most of its own fragmentation; it tells the PML that
it has a very high max_send_size (much higher than the network
MTU).
* Since UD fragments are, by definition, unreliable, the usnic BTL
handles all of its own reliability via a sliding window approach
using the opal_hotel construct and many tricks stolen from the
corpus of knowledge surrounding efficient TCP.
* There is a fun PML latency-metric based optimization for NUMA
awareness of short messages.
* Note that this is ''not'' a generic UD verbs BTL; it is specific to
the Cisco usNIC device.
This commit was SVN r28879.