This commit fixes a bug identified by MTT that occurred when mixing
passive and active target synchronization. The bugs fixed in this
commit are:
- Do not update incoming fragment counts for any type of unbuffered
control message. These messages are out-of-band and should not be
considered towards the signal counts.
- Complete a change from using received counts to expected counts for
lock, unlock, and flush acks. Part of the change made it into
master before the rest was ready. This was preventing wakeups in
some cases.
- Turn the passive_target_access_epoch module member into a
counter. As long as at least one peer is locked we are in a
passive-target epoch and not an active target one. This fix will
ensure that fragment flags are set appropriately.
fixes#538
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Based on some on-list and IM discussion with @hjelmn about
open-mpi/ompi@40b7643119, change the testing to a switch/case. If we
fall into the default case, assert() error (because it's an OMPI
developer programming error).
The fragment flush code tries to send the active fragment before
sending any queued fragments. This could cause osc messages to arrive
out-of-order at the target (bad). Ensure ordering by alway sending
the active fragment after sending queued fragments.
This commit also fixes a bug when a synchronization message (unlock,
flush, complete) can not be packed at the end of an existing active
fragment. In this case the source process will end up sending 1 more
fragment than claimed in the synchronization message. To fix the issue
a check has been added that fixes the fragment count if this situation
is detected.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.
All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit fixes a bug in osc/pt2pt which causes MPI_Win_unlock_all
to hang. The problem was caused by code refactoring that moved the
unlock of the local process to after the loop that waits for unlock
acks. This will cause the code to loop forever waiting on the self
ack.
Fixes#444
Use of the old ompi_free_list_t and ompi_free_list_item_t is
deprecated. These classes will be removed in a future commit.
This commit updates the entire code base to use opal_free_list_t and
opal_free_list_item_t.
Notes:
OMPI_FREE_LIST_*_MT -> opal_free_list_* (uses opal_using_threads ())
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds an owner file in each of the component directories
for each framework. This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page. Currently there are two
"fields" in the file, an owner and a status. A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
Commit open-mpi/ompi@1a3597aam changed the type of the `convertor`
variable from `ompi_osc_base_convertor_t` (which contained an
`opal_convertor_t`) to an `opal_convertor_t`. Hence, using memchecker
to ensure that the inner convertor of the `ompi_osc_base_convertor_t`
is considered initialized is now unnecessary.
With certain datatypes the opal_datatype_unpack method for performing
the accumulate operation does not work. This commit modifies the
accumulate code in the osc base to use opal_convertor_raw instead.
Fixes#385
The ompi_osc_signal_outgoing was moved from ompi_osc_rdma_frag_start to frag_send
which gave correct results for the bug reproducer but hangs with simple OSC
tests. Moved the ompi_osc_signal_outgoing back and it now passes all tests.
Closes#256
Some of the counters used by the "rdma" one-sided component are intended
to overflow. Since overflow behavior is undefined for signed integers in
C it is safer to use unsigned integers here.
osc/rdma uses counters to determine if all messages have been received
before exiting synchronization calls. The problem is that the active
target counter is always increasing (never zeroed). If over 2^31-1
messages are sent this causes the counter to overflow (in itself this
isn't an error). This causes test/wait to return before the communication
is complete. There is an additional error in the use of the fragment
flush function. If PSCW synchronization is in use this function CAN NOT
be called unless a post message has arrived.
Relevant mailing list thread: http://www.open-mpi.org/community/lists/devel/2014/10/16016.php
This commit fixes both issues. Tested against MTT and issue reproducer.
Closes#224.
Initialize the blocking_fence flag to false as the code logic indicates that it should only be set if someone provides that flag.
Thanks to Lisandro Dalcin for reporting it
cmr=v1.8.4:reviewer=hjelmn
This commit was SVN r32812.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
- Portals4/OSC was unable to acquire an exclusive lock due to an invalid
local address in the atomic operation. This caused the reported hang.
- After fixing the hang, the test continued to fail because
ompi_datatype_is_contiguous_memory_layout() reports that MPI_EMPTY (the
origin datatype) is noncontiguous and Portals4/OSC does not support
noncontiguous datatypes at this time. However, in this case the origin
count is zero so contiguous/noncontiguous is irrelevant. Now we skip
the contiguous check if the count is zero.
cmr=v1.8.3:reviewer=regrant:subject=Fix for "Portals4/MTL hangs in c_get_accumulate test"
This commit was SVN r32295.
The following Trac tickets were found above:
Ticket 4662 --> https://svn.open-mpi.org/trac/ompi/ticket/4662
This commit adds a check to see if the target is in an access epoch. If
not we return OMPI_ERR_RMA_SYNC. This fixes test_start3 in the onesided
test suite. The cost of this extra check is 1 byte/peer for the boolean
flag indicating that the peer is in an access epoch.
I also fixed a problem where mupliple unexpected post messages are not
correctly handled.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r32160.