A previous commit updated the one-sided code to register the state
region only once. This created an issue when using the scratch lock
with fetching atomics. In this case on any rank that isn't local rank
0 the module->state_handle is NULL. This commit fixes the issue by
removing the scratch lock and using a fragment pointer instead.
Fixes open-mpi/ompi#1290
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit changes the way ompi_proc_t's are retained/released by
ompi_group_t's. Before this change ompi_proc_t's were retained once
for the group and then once for each retain of a group. This method
adds unnecessary overhead (need to traverse the group list each time
the group is retained) and causes problems when using an async
add_procs.
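
For illustration only, a minimal sketch of the retain/release scheme, assuming
the new behavior is that each proc is retained once when the group is built and
released once when the group is destroyed. The sketch_* types and functions are
hypothetical stand-ins, not the real ompi_proc_t/ompi_group_t code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for ompi_proc_t and ompi_group_t. */
typedef struct { int refcount; } sketch_proc_t;

typedef struct {
    int refcount;
    size_t nprocs;
    sketch_proc_t **procs;
} sketch_group_t;

/* Each proc is retained exactly once, when the group is constructed. */
void sketch_group_init(sketch_group_t *g, sketch_proc_t **procs, size_t n)
{
    g->refcount = 1;
    g->nprocs = n;
    g->procs = procs;
    for (size_t i = 0; i < n; ++i) {
        procs[i]->refcount++;
    }
}

/* Retaining the group is O(1): the proc list is not traversed. */
void sketch_group_retain(sketch_group_t *g)
{
    g->refcount++;
}

/* Procs are released only when the last group reference is dropped. */
void sketch_group_release(sketch_group_t *g)
{
    assert(g->refcount > 0);
    if (--g->refcount == 0) {
        for (size_t i = 0; i < g->nprocs; ++i) {
            g->procs[i]->refcount--;
        }
    }
}
```

With this layout a group retain never touches the procs, so it does not depend
on every proc in the list already being available.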
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1324733: Null pointer dereferences (FORWARD_NULL)
Fix CID 1324734: Null pointer dereferences (FORWARD_NULL)
Fix CID 1324735: Null pointer dereferences (FORWARD_NULL)
Fix CID 1324736: Null pointer dereferences (FORWARD_NULL)
Fix CID 1324737: Null pointer dereferences (FORWARD_NULL)
Fix CID 1324751: Memory - illegal accesses (USE_AFTER_FREE)
Fix CID 1324750: (USE_AFTER_FREE)
Fix CID 1324749: Memory - corruptions (USE_AFTER_FREE)
Fix CID 1324748: Memory - illegal accesses (USE_AFTER_FREE)
Fix CID 1324747: (USE_AFTER_FREE)
Fix CID 1324746: Memory - corruptions (USE_AFTER_FREE)
Add missing return on an error path.
Fix CID 1324745: Code maintainability issues (UNUSED_VALUE)
Ignore return code from barrier. It was not being used anyway.
Fix CID 1324738: Null pointer dereferences (FORWARD_NULL)
Fix CID 1324741: Null pointer dereferences (REVERSE_INULL)
module->selected_btl cannot be NULL in osc/rdma during normal
operation. Removed the unnecessary NULL check.
Fix CID 1324752: Memory - illegal accesses (USE_AFTER_FREE)
Move ompi_osc_pt2pt_module_lock_remove to before the lock is freed.
Fix CID 1324744: Uninitialized variables (UNINIT)
Fix CID 1324743: Uninitialized variables (UNINIT)
This array is not used uninitialized but there is no reason not to use
calloc here to silence the warning.
The following CID is a false positive: 1324742. I will mark it as such in
Coverity.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for performing one-sided operations over
supported hardware (currently Infiniband and Cray Gemini/Aries). This
component is still undergoing active development.
Current features:
- Use network atomic operations (fadd, cswap) for implementing
locking and PSCW synchronization.
- Aggregate small contiguous puts.
- Reduced memory footprint by storing window data (pointer, keys,
etc) at the lowest rank on each node. The data is fetched as each
process needs to communicate with a new peer. This is a trade-off
between the performance of the first operation on a peer and the
memory utilization of a window.
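
As a rough sketch of how fadd and cswap can implement locking, the code below
uses C11 atomics on local memory as a stand-in for network atomics. The lock
word layout (exclusive bit plus shared count) and all sketch_* names are
assumptions made for illustration, not the component's actual layout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: top bit = exclusive holder, low bits = shared count. */
#define SKETCH_EXCLUSIVE ((uint64_t)1 << 63)

/* Exclusive lock: compare-and-swap from "free" (0) to the exclusive bit. */
bool sketch_try_lock_exclusive(_Atomic uint64_t *lock)
{
    uint64_t expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, SKETCH_EXCLUSIVE);
}

/* Shared lock: fetch-and-add 1, then back out if an exclusive holder exists. */
bool sketch_try_lock_shared(_Atomic uint64_t *lock)
{
    uint64_t old = atomic_fetch_add(lock, 1);
    if (old & SKETCH_EXCLUSIVE) {
        atomic_fetch_sub(lock, 1);
        return false;
    }
    return true;
}

void sketch_unlock_shared(_Atomic uint64_t *lock)
{
    atomic_fetch_sub(lock, 1);
}

/* Subtract rather than store 0 so a concurrent shared attempt that has
 * incremented the count but not yet backed out is not lost. */
void sketch_unlock_exclusive(_Atomic uint64_t *lock)
{
    atomic_fetch_sub(lock, SKETCH_EXCLUSIVE);
}
```

A caller would retry a failed attempt; in the network case each call would be a
remote atomic on the lock word held by the target.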
TODO:
- Add support for the accumulate_ops info key. If it is known that
the same op or same op/no op is used it may be possible to use
hardware atomics for fetch-and-op and compare-and-swap.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
osc/rdma uses counters to determine if all messages have been received
before exiting synchronization calls. The problem is that the active
target counter is always increasing (never zeroed). If over 2^31-1
messages are sent this causes the counter to overflow (in itself this
isn't an error). This causes test/wait to return before the communication
is complete. There is an additional error in the use of the fragment
flush function. If PSCW synchronization is in use this function CAN NOT
be called unless a post message has arrived.
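
To illustrate the overflow failure, the sketch below contrasts a naive signed
comparison with one wraparound-tolerant alternative. This is an assumption
about one possible remedy, not a description of what this commit does (resetting
the counter at the start of each epoch is another option). The function names
are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive check: once the expected count wraps past 2^31-1 it becomes
 * negative, so any received count compares as "enough" and wait returns
 * early. */
bool sketch_naive_complete(int32_t received, int32_t expected)
{
    return received >= expected;
}

/* Wraparound-tolerant check: compare the distance between the counters
 * instead of their absolute values. Valid while the counters are less
 * than 2^31 apart. The unsigned-to-signed conversion is
 * implementation-defined in C but is two's complement on common
 * platforms. */
bool sketch_wrap_safe_complete(uint32_t received, uint32_t expected)
{
    return (int32_t)(received - expected) >= 0;
}
```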
Relevant mailing list thread: http://www.open-mpi.org/community/lists/devel/2014/10/16016.php
This commit fixes both issues. Tested against MTT and issue reproducer.
Closes #224.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
This commit adds a check to see if the target is in an access epoch. If
not we return OMPI_ERR_RMA_SYNC. This fixes test_start3 in the onesided
test suite. The cost of this extra check is 1 byte/peer for the boolean
flag indicating that the peer is in an access epoch.
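
A minimal sketch of the check described above, assuming a per-peer structure
with a one-byte boolean flag. The sketch_* names and the numeric error values
are hypothetical; the real code returns OMPI_ERR_RMA_SYNC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes standing in for the OMPI constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_BAD_PARAM = -1, SKETCH_ERR_RMA_SYNC = -2 };

/* One boolean per peer records whether that peer is in an access epoch. */
typedef struct { bool access_epoch; } sketch_peer_t;

/* Reject an RMA operation whose target is not in an access epoch. */
int sketch_check_target(const sketch_peer_t *peers, size_t npeers, size_t target)
{
    assert(peers != NULL);
    if (target >= npeers) {
        return SKETCH_ERR_BAD_PARAM;
    }
    return peers[target].access_epoch ? SKETCH_SUCCESS : SKETCH_ERR_RMA_SYNC;
}
```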
I also fixed a problem where multiple unexpected post messages are not
correctly handled.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r32160.
The post and start window calls are supposed to be matching. The code
did not check to see that an incoming post matched with the start call.
This commit fixes the bug by placing the post on a pending list that
will be checked by the next call to start.
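
The pending-list idea can be sketched as follows. This is a simplified
illustration under stated assumptions: posts are identified only by source
rank, the list is a fixed-size array, and every post is queued until a start
call whose group contains its source consumes it. All sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 64

/* Hypothetical window state: source ranks of posts not yet matched. */
typedef struct {
    int pending[SKETCH_MAX_PENDING];
    size_t npending;
} sketch_window_t;

/* Queue an incoming post; returns false if the list is full. */
bool sketch_post_arrived(sketch_window_t *w, int source)
{
    assert(w != NULL);
    if (w->npending == SKETCH_MAX_PENDING) {
        return false;
    }
    w->pending[w->npending++] = source;
    return true;
}

static bool sketch_in_group(const int *group, size_t ngroup, int rank)
{
    for (size_t i = 0; i < ngroup; ++i) {
        if (group[i] == rank) {
            return true;
        }
    }
    return false;
}

/* On start, consume pending posts whose source is in the start group and
 * keep the rest for a later start. Returns the number of posts matched. */
size_t sketch_start(sketch_window_t *w, const int *group, size_t ngroup)
{
    size_t matched = 0, kept = 0;
    for (size_t i = 0; i < w->npending; ++i) {
        if (sketch_in_group(group, ngroup, w->pending[i])) {
            ++matched;
        } else {
            w->pending[kept++] = w->pending[i];
        }
    }
    w->npending = kept;
    return matched;
}
```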
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r32017.
The last fix prevented a hang but had some cases where the results were
wrong. Fixed. Tested with armci, openmpi/ibm, openmpi/onesided.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31284.
results
The code to handle completion messages did not correctly increment the
number of expected messages. This could cause wait to return before all
incoming messages are complete.
I also added a check to ensure that start returns an error if we are in
a passive access epoch.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31203.
- Return an error if the caller specified both MPI_MODE_NOPRECEDE and
MPI_MODE_NOSUCCEED to MPI_Win_fence.
- Return an error if the caller attempts to enter an active target
epoch while already in a passive target epoch.
- End an active target epoch if MPI_Win_fence is called with
MPI_MODE_NOSUCCEED.
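
The three rules above can be sketched as a single check. The flag values,
error codes, and sketch_* names are hypothetical placeholders rather than the
real MPI or OMPI constants, and opening an active epoch when
MPI_MODE_NOSUCCEED is absent is an assumption about the surrounding code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for MPI_MODE_NOPRECEDE / MPI_MODE_NOSUCCEED. */
enum { SKETCH_MODE_NOPRECEDE = 1, SKETCH_MODE_NOSUCCEED = 2 };
/* Hypothetical error codes. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_ASSERT = -1, SKETCH_ERR_RMA_SYNC = -2 };

typedef struct {
    bool active_epoch;   /* inside an active target (fence) epoch */
    bool passive_epoch;  /* inside a passive target (lock) epoch */
} sketch_win_t;

int sketch_fence(int mode, sketch_win_t *win)
{
    assert(win != NULL);
    /* Rule 1: NOPRECEDE and NOSUCCEED together are rejected. */
    if ((mode & SKETCH_MODE_NOPRECEDE) && (mode & SKETCH_MODE_NOSUCCEED)) {
        return SKETCH_ERR_ASSERT;
    }
    /* Rule 2: no active target epoch while in a passive target epoch. */
    if (win->passive_epoch) {
        return SKETCH_ERR_RMA_SYNC;
    }
    /* Rule 3: NOSUCCEED ends the active epoch; otherwise (assumed) the
     * fence leaves an active epoch open. */
    win->active_epoch = !(mode & SKETCH_MODE_NOSUCCEED);
    return SKETCH_SUCCESS;
}
```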
cmr=v1.7.5:ticket=trac:4382
This commit was SVN r31043.
The following Trac tickets were found above:
Ticket 4382 --> https://svn.open-mpi.org/trac/ompi/ticket/4382
need to enable the access epoch in MPI_Win_fence.
I missed this change when I fixed the semantics of MPI_Win_create. With
this commit our one-sided MTT runs are now running clean.
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r31041.
Dave Goodell correctly pointed out that it is unusual to return MPI
error classes from internal ompi functions. Correct this in the RMA
case by adding an internal error code to match MPI_ERR_RMA_SYNC.
This does change OMPI_ERR_MAX. I don't think this will cause any
problems with ABI.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31012.
This commit resolves a number of crashes discovered by the onesided
tests in MTT. The functions in question were operating on the assumption
that the user was calling RMA functions correctly.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31008.