With certain datatypes the opal_datatype_unpack method for performing
the accumulate operation does not work. This commit modifies the
accumulate code in the osc base to use opal_convertor_raw instead.
Fixes #385
The ompi_osc_signal_outgoing call was moved from ompi_osc_rdma_frag_start to
frag_send, which gave correct results for the bug reproducer but hung with
simple OSC tests. Moved the call back and it now passes all tests.
Closes #256
Some of the counters used by the "rdma" one-sided component are intended
to overflow. Since overflow behavior is undefined for signed integers in
C it is safer to use unsigned integers here.
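A minimal sketch of why unsigned counters tolerate overflow, assuming 32-bit counters. The helper names are hypothetical and are not the osc/rdma identifiers. Unsigned arithmetic is defined modulo 2^32, so the difference between two counters stays meaningful after either one wraps:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from osc/rdma): messages still outstanding,
 * given wrapping "expected" and "received" counters. */
static uint32_t sketch_outstanding(uint32_t expected, uint32_t received)
{
    /* Unsigned subtraction wraps modulo 2^32, so the result is correct
     * even if "expected" has already wrapped past UINT32_MAX. */
    return expected - received;
}

/* Hypothetical helper: true once every expected message has arrived. */
static bool sketch_all_received(uint32_t expected, uint32_t received)
{
    return 0 == sketch_outstanding(expected, received);
}
```

The same subtraction on signed int counters would rely on undefined behavior once a counter passes INT_MAX.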
osc/rdma uses counters to determine if all messages have been received
before exiting synchronization calls. The problem is that the active
target counter is always increasing (never zeroed). If more than 2^31-1
messages are sent the counter overflows (in itself this is not an error),
and the overflow causes test/wait to return before the communication is
complete. There is an additional error in the use of the fragment flush
function. If PSCW synchronization is in use this function CANNOT be
called unless a post message has arrived.
Relevant mailing list thread: http://www.open-mpi.org/community/lists/devel/2014/10/16016.php
This commit fixes both issues. Tested against MTT and issue reproducer.
Closes #224.
Initialize the blocking_fence flag to false as the code logic indicates that it should only be set if someone provides that flag.
Thanks to Lisandro Dalcin for reporting it
cmr=v1.8.4:reviewer=hjelmn
This commit was SVN r32812.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
- Portals4/OSC was unable to acquire an exclusive lock due to an invalid
local address in the atomic operation. This caused the reported hang.
- After fixing the hang, the test continued to fail because
ompi_datatype_is_contiguous_memory_layout() reports that MPI_EMPTY (the
origin datatype) is noncontiguous and Portals4/OSC does not support
noncontiguous datatypes at this time. However, in this case the origin
count is zero so contiguous/noncontiguous is irrelevant. Now we skip
the contiguous check if the count is zero.
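The zero-count shortcut described above can be sketched as follows. This is an illustration only: the function-pointer type, the stubs, and sketch_origin_supported are hypothetical names, and the callback merely stands in for the real layout query.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback standing in for the real layout query
 * (ompi_datatype_is_contiguous_memory_layout() in the text above). */
typedef bool (*sketch_is_contiguous_fn)(const void *datatype, size_t count);

/* Example stub: a datatype that always reports a noncontiguous layout. */
static bool sketch_always_noncontiguous(const void *datatype, size_t count)
{
    (void) datatype;
    (void) count;
    return false;
}

/* Example stub: a datatype that always reports a contiguous layout. */
static bool sketch_always_contiguous(const void *datatype, size_t count)
{
    (void) datatype;
    (void) count;
    return true;
}

/* With a zero count no data is moved, so the layout cannot matter and
 * the contiguity check is skipped entirely. */
static bool sketch_origin_supported(const void *datatype, size_t count,
                                    sketch_is_contiguous_fn is_contiguous)
{
    if (0 == count) {
        return true;
    }
    return is_contiguous(datatype, count);
}
```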
cmr=v1.8.3:reviewer=regrant:subject=Fix for "Portals4/MTL hangs in c_get_accumulate test"
This commit was SVN r32295.
The following Trac tickets were found above:
Ticket 4662 --> https://svn.open-mpi.org/trac/ompi/ticket/4662
This commit adds a check to see if the target is in an access epoch. If
not we return OMPI_ERR_RMA_SYNC. This fixes test_start3 in the onesided
test suite. The cost of this extra check is 1 byte/peer for the boolean
flag indicating that the peer is in an access epoch.
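A rough sketch of the per-peer check, assuming a window that keeps one boolean per peer. The struct, the function, and the numeric error value are hypothetical placeholders; only the name OMPI_ERR_RMA_SYNC comes from the text above.

```c
#include <stdbool.h>

enum {
    SKETCH_SUCCESS      = 0,
    SKETCH_ERR_RMA_SYNC = -1  /* placeholder standing in for OMPI_ERR_RMA_SYNC */
};

/* Hypothetical window state: one flag per peer, which is where the
 * one-byte-per-peer cost comes from on platforms where bool is one byte. */
struct sketch_window {
    bool *peer_in_access_epoch;
};

/* Reject an RMA operation whose target is not in an access epoch.
 * Assumes "target" is a valid index into the flag array. */
static int sketch_check_access_epoch(const struct sketch_window *win, int target)
{
    if (!win->peer_in_access_epoch[target]) {
        return SKETCH_ERR_RMA_SYNC;
    }
    return SKETCH_SUCCESS;
}
```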
I also fixed a problem where multiple unexpected post messages were not
correctly handled.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r32160.
cmr=v1.8.2:reviewer=tkordenbrock:subject=Portals4/MTL hanging fix
This commit was SVN r32113.
The following Trac tickets were found above:
Ticket 4681 --> https://svn.open-mpi.org/trac/ompi/ticket/4681
cmr=v1.8.2:reviewer=tkordenbrock:subject=Move r32112 to v1.8.2 branch
This commit was SVN r32112.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r32112
The following Trac tickets were found above:
Ticket 4682 --> https://svn.open-mpi.org/trac/ompi/ticket/4682
The post and start window calls are supposed to be matching. The code
did not check to see that an incoming post matched with the start call.
This commit fixes the bug by placing the post on a pending list that
will be checked by the next call to start.
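The pending-list idea can be sketched as below. This is a simplified illustration with hypothetical names and a fixed-size array, not the data structure the component actually uses: a post that arrives without a matching start is recorded, and the next start consumes it.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 64

/* Hypothetical pending list: ranks whose post arrived before a matching start. */
struct sketch_pending_posts {
    int    ranks[SKETCH_MAX_PENDING];
    size_t count;
};

/* Record an unmatched post from "rank". Returns false if the list is full. */
static bool sketch_post_enqueue(struct sketch_pending_posts *p, int rank)
{
    if (SKETCH_MAX_PENDING == p->count) {
        return false;
    }
    p->ranks[p->count++] = rank;
    return true;
}

/* Called from start: consume one pending post from "rank" if present. */
static bool sketch_post_match(struct sketch_pending_posts *p, int rank)
{
    for (size_t i = 0; i < p->count; ++i) {
        if (p->ranks[i] == rank) {
            /* Order is irrelevant here, so fill the hole with the last entry. */
            p->ranks[i] = p->ranks[--p->count];
            return true;
        }
    }
    return false;
}
```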
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r32017.
The replace callback did not increment the incoming frag counter. This
leads to a hang during synchronization. This commit adds the increment
and also puts the request on the garbage collection list to fix a leak.
This fixes a hang found when running the mpich test suite.
cmr=v1.8.2:reviewer=bbenton
This commit was SVN r32016.
The wrong type was used when calculating the amount of space needed
for an accumulate fragment. Fixed the calculation and took the
opportunity to eliminate the get_acc header as it is identical to the
acc header.
This fixes trac:4719 and #4718
Tracking these fixes for 1.8.2 in this CMR.
Throwing this to Brad for review as he is the one who ran into the issue.
cmr=v1.8.2:reviewer=bbenton
This commit was SVN r32015.
The following Trac tickets were found above:
Ticket 4719 --> https://svn.open-mpi.org/trac/ompi/ticket/4719
in self optimization
This addresses an issue found with the MPICH pscw_ordering test. Eager sends
were not yet active, which is OK for the standard path but not for the
self optimization. Fixed by waiting for all post messages before checking
the sync state.
Fixes trac:4724
Tracking the 1.8.2 issue in this CMR.
cmr=v1.8.2:reviewer=bbenton
This commit was SVN r32012.
The following Trac tickets were found above:
Ticket 4724 --> https://svn.open-mpi.org/trac/ompi/ticket/4724
Brad correctly pointed out that the total window size should not be an
int. Changed it to an unsigned long.
cmr=v1.8.2:reviewer=bbenton
This commit was SVN r32010.
Only one field is valid for RMA requests: MPI_ERROR. This field is set
to the correct value in ompi_request_empty so there is no reason to
allocate and keep track of osc/sm requests because they are always
complete on return. Since we are no longer using the osc/sm request
structure or free list they are now removed.
Closes trac:4723
Tracking this issue with the CMR. Brad, can you verify the issue is indeed fixed?
cmr=v1.8.2:reviewer=bbenton
This commit was SVN r32009.
The following Trac tickets were found above:
Ticket 4723 --> https://svn.open-mpi.org/trac/ompi/ticket/4723
This commit fixes a bug that can cause request and communicator leaks
when cleaning up an OSC window. This should prevent a hang seen with
IMB-EXT.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31539.
When sending PUT_LONG, the data is sent before the headers, and sometimes
the header is not flushed immediately. This creates a lot of unexpected
receives in the peer, since it posts a receive only when it gets the
header, which makes it run out of receive buffers. When the sender
eventually flushes the window, the receiver has no buffers left to
receive the header, which causes a deadlock.
The fix is to always flush the headers when doing put_long.
cmr=v1.8.1:reviewer=hjelmn
This commit was SVN r31378.
While testing one-sided on LANL systems I found a couple more OSC
bugs that were not caught during the initial testing:
- In the passive target code we read the read lock count as a
char instead of the intended uint32_t. This causes lock to
lock up when using shared locks after 127 iterations.
- The post code used the wrong group when trying to increment post
counters. This causes a segmentation fault.
- Both the post and wait code used the wrong check in the inner
loop leading to an infinite loop.
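The 127-iteration threshold in the first item matches the largest value a signed 8-bit char can hold. A small sketch of that limit, assuming 8-bit chars; the helper is hypothetical and does not reproduce the passive target code:

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a lock count survives being read through
 * a signed char. Counts above SCHAR_MAX (127 with 8-bit chars) cannot be
 * represented, which is why the count has to be read as uint32_t. */
static bool sketch_count_fits_in_char(uint32_t count)
{
    return count <= (uint32_t) SCHAR_MAX;
}
```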
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31354.
There was a typo in ompi_osc_gacc_long_start that caused a
segmentation fault when executing long get accumulate operations.
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31353.