285 Commits

Author SHA1 Message Date
Nathan Hjelm
1b564f62bd Revert "Merge pull request #275 from hjelmn/btlmod"
This reverts commit ccaecf0fd6, reversing
changes made to 6a19bf85dd.
2014-11-19 23:22:43 -07:00
Nathan Hjelm
0d413fb73f Revert "Remove stale file reference"
This reverts commit 4c8fa17234.
2014-11-19 23:16:16 -07:00
Ralph Castain
4c8fa17234 Remove stale file reference 2014-11-19 18:32:19 -08:00
Nathan Hjelm
5a0a48c3c4 osc: remove lingering rdma component files 2014-11-19 12:11:54 -07:00
Nathan Hjelm
22625b005b osc/pt2pt: threading fixes and code cleanup 2014-11-19 11:33:04 -07:00
Nathan Hjelm
45d1fac8af ugni thread safety fixes 2014-11-19 11:33:03 -07:00
Nathan Hjelm
29e4e1c90a Rename the OSC "rdma" component to pt2pt to better reflect that it does not actually use btl rdma 2014-11-19 11:33:03 -07:00
Ralph Castain
616f0894ce Add missing parens on values being passed to OPAL_THREAD_ADD32 2014-10-31 19:11:48 -07:00
Nathan Hjelm
672d96704c osc/rdma: fix regression introduced by eed7b45db5
The ompi_osc_signal_outgoing was moved from ompi_osc_rdma_frag_start to frag_send
which gave correct results for the bug reproducer but hangs with simple OSC
tests. Moved the ompi_osc_signal_outgoing back and it now passes all tests.

Closes #256
2014-10-30 23:16:11 -06:00
Nathan Hjelm
23dd3af946 osc/rdma: use unsigned types for all counters
Some of the counters used by the "rdma" one-sided component are intended
to overflow. Since overflow behavior is undefined for signed integers in
C it is safer to use unsigned integers here.
2014-10-22 15:36:15 -06:00
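The reasoning in 23dd3af946 can be illustrated with a small sketch. It assumes a pair of hypothetical message counters (the struct and function names are illustrative, not taken from the osc/rdma source) and shows why unsigned types make an intentionally overflowing counter safe: unsigned arithmetic wraps modulo 2^32, so the difference between two counters stays meaningful after a wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-window message counters; names are illustrative. */
typedef struct {
    uint32_t expected;  /* messages the peer announced */
    uint32_t received;  /* messages processed so far */
} msg_counters_t;

/* Record one received message. Unsigned arithmetic wraps modulo 2^32,
 * which C defines; overflow of a signed counter would be undefined. */
static void counters_record(msg_counters_t *c)
{
    assert(c != 0);
    c->received++;
}

/* Messages still outstanding. The result is reduced modulo 2^32, so it
 * remains correct even when one counter has wrapped and the other has not. */
static uint32_t counters_outstanding(const msg_counters_t *c)
{
    assert(c != 0);
    return c->expected - c->received;
}

/* True once every expected message has been recorded. */
static bool counters_complete(const msg_counters_t *c)
{
    return counters_outstanding(c) == 0;
}
```

A caller would invoke counters_record() for each incoming message and poll counters_complete() inside the synchronization call.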
George Bosilca
7541c03b4c Mark all instances where atomic operations are used but their return value is unnecessary 2014-10-15 21:47:32 -04:00
Nathan Hjelm
eed7b45db5 osc/rdma: fix issue identified by Berk Hess
osc/rdma uses counters to determine if all messages have been received
before exiting synchronization calls. The problem is that the active
target counter is always increasing (never zeroed). If over 2^31-1
messages are sent this causes the counter to overflow (in itself this
isn't an error). This causes test/wait to return before the communication
is complete. There is an additional error in the use of the fragment
flush function. If PSCW synchronization is in use this function CAN NOT
be called unless a post message has arrived.

Relevant mailing list thread: http://www.open-mpi.org/community/lists/devel/2014/10/16016.php

This commit fixes both issues. Tested against MTT and issue reproducer.

Closes #224.
2014-10-07 11:45:22 -06:00
Ralph Castain
eb95d6f892 ompi_info_get_bool returns "success" if the value isn't found, setting "flag" to false, but doesn't set the value of the param itself. So if you don't specify "blocking_fence" in MPI_Info, then the "blocking_fence" flag wasn't being set.
Initialize the blocking_fence flag to false as the code logic indicates that it should only be set if someone provides that flag.

Thanks to Lisandro Dalcin for reporting it

cmr=v1.8.4:reviewer=hjelmn

This commit was SVN r32812.
2014-09-29 17:21:28 +00:00
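The pattern behind eb95d6f892 can be sketched as follows. The lookup function here is a hypothetical stand-in for ompi_info_get_bool, written only to mirror the behavior the message describes (it reports success and leaves the output value untouched when the key is absent); its signature is an assumption, not the real Open MPI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical key/value entry standing in for an MPI_Info object. */
typedef struct {
    const char *key;
    bool value;
} info_entry_t;

/* Hypothetical lookup mirroring the described behavior: it always
 * returns 0 (success), sets *found, and writes *value only when the
 * key exists. */
static int info_lookup_bool(const info_entry_t *entries, int n,
                            const char *key, bool *value, bool *found)
{
    assert(entries != NULL || n == 0);
    *found = false;
    for (int i = 0; i < n; i++) {
        if (strcmp(entries[i].key, key) == 0) {
            *value = entries[i].value;
            *found = true;
            break;
        }
    }
    return 0;
}

/* Caller-side fix: assign the default before the lookup so that a
 * missing key cannot leave the option uninitialized. */
static bool read_blocking_fence(const info_entry_t *entries, int n)
{
    bool blocking_fence = false;  /* default when the key is absent */
    bool found;
    (void)info_lookup_bool(entries, n, "blocking_fence",
                           &blocking_fence, &found);
    return blocking_fence;
}
```

Without the initializer, read_blocking_fence() would return an indeterminate value whenever the caller did not supply the key.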
Gilles Gouaillardet
5f1e0f284a Fix compilation when --enable-heterogeneous is used
This commit was SVN r32410.
2014-08-04 10:35:08 +00:00
Ralph Castain
61bf7af9d2 Per Paul Hargrove's suggestion, create an opal_pagesize function to abstract the various ways of obtaining that value. Rather than creating a separate file for only that one function, put it in a convenient place that is at least somewhat related.
Refs trac:4826

This commit was SVN r32407.

The following Trac tickets were found above:
  Ticket 4826 --> https://svn.open-mpi.org/trac/ompi/ticket/4826
2014-08-02 18:38:16 +00:00
Ryan Grant
caa10a5faf Portals fixes after latest move
This commit was SVN r32330.
2014-07-28 19:25:03 +00:00
Ralph Castain
552c9ca5a0 George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT:    Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL

All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.

This commit was SVN r32317.
2014-07-26 00:47:28 +00:00
Todd Kordenbrock
42a871efd4 This commit fixes trac:4662 - "Portals4/MTL hangs in c_get_accumulate test".
- Portals4/OSC was unable to acquire an exclusive lock due to an invalid
local address in the atomic operation.  This caused the reported hang.
- After fixing the hang, the test continued to fail because
ompi_datatype_is_contiguous_memory_layout() reports that MPI_EMPTY (the
origin datatype) is noncontiguous and Portals4/OSC does not support
noncontiguous datatypes at this time.  However, in this case the origin
count is zero so contiguous/noncontiguous is irrelevant.  Now we skip
the contiguous check if the count is zero.

cmr=v1.8.3:reviewer=regrant:subject=Fix for "Portals4/MTL hangs in c_get_accumulate test"

This commit was SVN r32295.

The following Trac tickets were found above:
  Ticket 4662 --> https://svn.open-mpi.org/trac/ompi/ticket/4662
2014-07-23 19:13:07 +00:00
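The second fix in 42a871efd4 (skip the contiguity check when the count is zero) can be sketched like this. The descriptor type and function name are hypothetical; the real check named in the message is ompi_datatype_is_contiguous_memory_layout(), which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical datatype descriptor reduced to the one property that
 * matters for this sketch. */
typedef struct {
    bool contiguous;
} dtype_t;

/* Returns true when a transfer can proceed on a path that supports only
 * contiguous datatypes. A zero count moves no data, so the layout of
 * the datatype is irrelevant and the check is skipped. */
static bool transfer_supported(const dtype_t *dt, size_t count)
{
    assert(dt != NULL);
    if (count == 0) {
        return true;
    }
    return dt->contiguous;
}
```

Ordering the count test first means an empty origin buffer is accepted regardless of how its datatype is reported.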
Ralph Castain
3d1b32a2c6 Silence warning
cmr=v1.8.2:reviewer=hjelmn

This commit was SVN r32231.
2014-07-14 19:27:30 +00:00
Nathan Hjelm
32ab6f850e osc/rdma: fix warning
cmr=v1.8.2:reviewer=rhc

This commit was SVN r32201.
2014-07-10 18:42:55 +00:00
Nathan Hjelm
b6abe68972 osc/rdma: check for more types of window access violations
This commit adds a check to see if the target is in an access epoch. If
not we return OMPI_ERR_RMA_SYNC. This fixes test_start3 in the onesided
test suite. The cost of this extra check is 1 byte/peer for the boolean
flag indicating that the peer is in an access epoch.

I also fixed a problem where multiple unexpected post messages were not
correctly handled.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32160.
2014-07-08 21:11:12 +00:00
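The access-epoch check described in b6abe68972 might look roughly like the sketch below: one boolean flag per peer, consulted before an operation is issued. All type names, function names, and error values are hypothetical; the commit itself returns OMPI_ERR_RMA_SYNC, whose numeric value is not assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative status codes, not the Open MPI constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_RMA_SYNC = -1, SKETCH_ERR_NO_MEM = -2 };

typedef struct {
    int npeers;
    bool *in_access_epoch;  /* one flag per peer */
} window_t;

/* Allocate the per-peer flags, all initially false. */
static int window_init(window_t *w, int npeers)
{
    assert(w != NULL && npeers > 0);
    w->npeers = npeers;
    w->in_access_epoch = calloc((size_t)npeers, sizeof(bool));
    return w->in_access_epoch != NULL ? SKETCH_SUCCESS : SKETCH_ERR_NO_MEM;
}

/* Mark that an access epoch is now open for the given peer. */
static void window_start_access(window_t *w, int peer)
{
    assert(peer >= 0 && peer < w->npeers);
    w->in_access_epoch[peer] = true;
}

/* Reject an operation whose target has no open access epoch. */
static int window_check_target(const window_t *w, int target)
{
    assert(target >= 0 && target < w->npeers);
    return w->in_access_epoch[target] ? SKETCH_SUCCESS : SKETCH_ERR_RMA_SYNC;
}

static void window_free(window_t *w)
{
    free(w->in_access_epoch);
    w->in_access_epoch = NULL;
}
```

On platforms where sizeof(bool) is 1, the flag array matches the 1 byte/peer cost quoted in the message.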
Ryan Grant
a1d312343b This commit fixes trac:4681 - ibm c_fence_lock hangs
cmr=v1.8.2:reviewer=tkordenbrock:subject=Portals4/MTL hanging fix

This commit was SVN r32113.

The following Trac tickets were found above:
  Ticket 4681 --> https://svn.open-mpi.org/trac/ompi/ticket/4681
2014-07-01 17:03:03 +00:00
Ryan Grant
5cb8cc856c Refs trac:4682 - This commit fixes c_flush test failure in the ibm test suite for Portals 4 OSC
cmr=v1.8.2:reviewer=tkordenbrock:subject=Move r32112 to v1.8.2 branch

This commit was SVN r32112.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r32112

The following Trac tickets were found above:
  Ticket 4682 --> https://svn.open-mpi.org/trac/ompi/ticket/4682
2014-07-01 16:26:16 +00:00
Ralph Castain
f3cb124e50 Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed.
This commit was SVN r32089.

The following SVN revision numbers were found above:
  r32070 --> open-mpi/ompi@12d92d0c22
  r32082 --> open-mpi/ompi@aa6438ef7a
2014-06-25 20:43:28 +00:00
Ralph Castain
12d92d0c22 Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS
This commit was SVN r32070.
2014-06-24 17:05:11 +00:00
Nathan Hjelm
fd21b244ce osc/rdma: better name for lookup function
cmr=v1.8.2:ticket=trac:4732:reviewer=ompi-rm1.8

This commit was SVN r32021.

The following Trac tickets were found above:
  Ticket 4732 --> https://svn.open-mpi.org/trac/ompi/ticket/4732
2014-06-17 19:49:17 +00:00
Nathan Hjelm
390f8f52b4 osc/rdma: clean up group process matching a bit
cmr=v1.8.2:ticket=trac:4732:reviewer=dgoodell

This commit was SVN r32018.

The following Trac tickets were found above:
  Ticket 4732 --> https://svn.open-mpi.org/trac/ompi/ticket/4732
2014-06-17 17:48:30 +00:00
Nathan Hjelm
7f20868179 osc/rdma: ensure matching of post/start calls
The post and start window calls are supposed to be matching. The code
did not check to see that an incoming post matched with the start call.
This commit fixes the bug by placing the post on a pending list that
will be checked by the next call to start.

cmr=v1.8.2:reviewer=dgoodell

This commit was SVN r32017.
2014-06-17 15:23:06 +00:00
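The pending-list idea in 7f20868179 can be sketched as below. This is a simplified model under stated assumptions, not the component's data structure: a fixed-size array stands in for the pending list, ranks are plain integers, and the names pending_add and pending_match are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PENDING 16

/* Posts that arrived before a matching start call was made. */
typedef struct {
    int ranks[MAX_PENDING];
    int count;
} pending_posts_t;

/* Queue an incoming post that cannot be matched yet.
 * Returns false when the list is full. */
static bool pending_add(pending_posts_t *p, int rank)
{
    assert(p != 0);
    if (p->count == MAX_PENDING) {
        return false;
    }
    p->ranks[p->count++] = rank;
    return true;
}

/* Called by the next start: consume every pending post whose rank is in
 * the start group and keep the rest queued. Returns the number matched. */
static int pending_match(pending_posts_t *p, const int *group, int group_size)
{
    int matched = 0;
    int kept = 0;
    for (int i = 0; i < p->count; i++) {
        bool in_group = false;
        for (int j = 0; j < group_size; j++) {
            if (group[j] == p->ranks[i]) {
                in_group = true;
                break;
            }
        }
        if (in_group) {
            matched++;
        } else {
            p->ranks[kept++] = p->ranks[i];  /* compact in place */
        }
    }
    p->count = kept;
    return matched;
}
```

Posts from ranks outside the current group stay on the list, so a later start with a different group can still match them.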
Nathan Hjelm
927098d567 osc/rdma: fix hang when accumulating with MPI_REPLACE
The replace callback did not increment the incoming frag counter. This
leads to a hang during synchronization. This commit adds the increment
and also puts the request on the garbage collection list to fix a leak.

This fixes a hang found when running the mpich test suite.

cmr=v1.8.2:reviewer=bbenton

This commit was SVN r32016.
2014-06-17 14:53:29 +00:00
Nathan Hjelm
7f6de57653 osc/rdma: fix accumulate fragment size calculation
The wrong type was used when calculating the amount of space needed
for an accumulate fragment. Fixed the calculation and took the
opportunity to eliminate the get_acc header as it is identical to the
acc header.

This fixes trac:4719 and #4718

Tracking these fixes for 1.8.2 in this CMR.

Throwing this to Brad for review as he is the one who ran into the issue.

cmr=v1.8.2:reviewer=bbenton

This commit was SVN r32015.

The following Trac tickets were found above:
  Ticket 4719 --> https://svn.open-mpi.org/trac/ompi/ticket/4719
2014-06-17 14:53:24 +00:00
Nathan Hjelm
2f96f16416 osc/rdma: ensure eager sends are active before checking for sync errors
in self optimization

This addresses an issue found with the MPICH pscw_ordering test. Eager sends
were not yet active (which is ok for the standard path) but not ok for the
self optimization. Fixed by waiting for all post messages before checking
the sync state.

Fixes trac:4724

Tracking the 1.8.2 issue in this CMR.

cmr=v1.8.2:reviewer=bbenton

This commit was SVN r32012.

The following Trac tickets were found above:
  Ticket 4724 --> https://svn.open-mpi.org/trac/ompi/ticket/4724
2014-06-17 04:53:47 +00:00
Nathan Hjelm
41f0059f1e osc/sm: use an unsigned long when calculating the total segment size
Brad correctly pointed out that the total window size should not be an
int. Changed it to an unsigned long.

cmr=v1.8.2:reviewer=bbenton

This commit was SVN r32010.
2014-06-17 04:33:43 +00:00
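The motivation for 41f0059f1e can be shown with a short sketch. The function name and the array of per-rank sizes are assumptions made for illustration; the point is only that the accumulator for a total window size needs a type wider than int.

```c
#include <assert.h>
#include <stddef.h>

/* Sum per-rank window sizes into an unsigned long. With an int
 * accumulator the total would overflow once it exceeded INT_MAX
 * (about 2 GiB where int is 32 bits). */
static unsigned long total_segment_size(const unsigned long *rank_sizes,
                                        int nranks)
{
    assert(rank_sizes != NULL || nranks == 0);
    unsigned long total = 0;
    for (int i = 0; i < nranks; i++) {
        total += rank_sizes[i];
    }
    return total;
}
```

Three ranks contributing 1 GiB each already produce a total that a 32-bit int cannot represent.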
Nathan Hjelm
6ec9c6c422 osc/sm: return ompi_request_empty for all request ops
Only one field is valid for RMA requests: MPI_ERROR. This field is set
to the correct value in ompi_request_empty so there is no reason to
allocate and keep track of osc/sm requests because they are always
complete on return. Since we are no longer using the osc/sm request
structure or free list they are now removed.

Closes trac:4723

Tracking this issue with the CMR. Brad, can you verify the issue is indeed fixed?

cmr=v1.8.2:reviewer=bbenton

This commit was SVN r32009.

The following Trac tickets were found above:
  Ticket 4723 --> https://svn.open-mpi.org/trac/ompi/ticket/4723
2014-06-17 04:27:02 +00:00
Gilles Gouaillardet
96ae38823d Fix a memory leak in ompi_osc_base_finalize()
cmr=v1.8.2:reviewer=rhc

This commit was SVN r31788.
2014-05-16 05:25:24 +00:00
Nathan Hjelm
2f5b1ca4cf osc/rdma: do not leak the receive request
This commit fixes a bug that can cause request and communicator leaks
when cleaning up an OSC window. This should prevent a hang seen with
IMB-EXT.

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31539.
2014-04-28 19:55:18 +00:00
Yossi Etigin
7efb724d7b osc/rdma: fix deadlock with put_long protocol.
When sending PUT_LONG, the data is sent before the headers, and sometimes
the header is not flushed immediately. This creates a lot of unexpected
receives on the peer, since it posts a receive only when it gets the
header, which makes it run out of receive buffers. When the sender
eventually flushes the window, the receiver has no buffers left to
receive the header, which causes a deadlock.

The fix is to always flush the headers when doing put_long.

cmr=v1.8.1:reviewer=hjelmn

This commit was SVN r31378.
2014-04-13 16:24:56 +00:00
Nathan Hjelm
7aece0a7fd osc/sm: fix bugs in both the passive and active target paths
While testing one-sided on LANL systems I found a couple more OSC
bugs that were not caught during the initial testing:

 - In the passive target code we read the read lock count as a
   char instead of the intended uint32_t. This causes the lock call
   to lock up after 127 iterations when using shared locks.

 - The post code used the wrong group when trying to increment post
   counters. This causes a segmentation fault.

 - Both the post and wait code used the wrong check in the inner
   loop leading to an infinite loop.

cmr=v1.8.1:reviewer=jsquyres

This commit was SVN r31354.
2014-04-08 21:55:00 +00:00
Nathan Hjelm
a31bfbeb2c osc/rdma: fix typo in get accumulate path
There was a typo in the ompi_osc_gacc_long_start that was causing a
segmentation fault when executing long get accumulate operations.

cmr=v1.8.1:reviewer=jsquyres

This commit was SVN r31353.
2014-04-08 21:54:52 +00:00
Ralph Castain
b12ee27b3d Add missing files - thanks to Mr. Anonymous for reporting them as missing from the 1.8 tarball
cmr=v1.8.1:reviewer=jsquyres:subject=add missing portals4 files

This commit was SVN r31332.
2014-04-08 02:55:14 +00:00
Nathan Hjelm
fdf4c3b900 osc/rdma: really fix active message support
The last fix prevented a hang but had some cases where the results were
wrong. Fixed. Tested with armci, openmpi/ibm, openmpi/onesided.

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31284.
2014-03-28 22:06:16 +00:00
Nathan Hjelm
6913a0f3cf osc/base: defensive programming. handle one more possible datatype case
It might be possible (don't know) for a datatype to be made of a contiguous block
of a primitive datatype and have an lb. If this is ever the case the code
would have done the wrong thing. Add the lb in to be safe.

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31283.
2014-03-28 22:06:05 +00:00
Nathan Hjelm
ee7a1478ee osc/rdma: fix test/wait hang
There are differences between how active and passive messages are
accounted for in this component. Active message counts on the sender
side are set to zero before the control message is sent so we do not
have to add one to the expected number of messages or we end up
double counting the control message. This commit should fix that error.

Fixes regression in one-sided/test_rma1

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31281.
2014-03-28 20:49:20 +00:00
Nathan Hjelm
efa37c17c8 osc/base: fix one more case in ompi_osc_base_sndrcv_op
This fixes more issues identified by armci. More issues still remain and fixes are
coming for those as well.

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31272.
2014-03-28 18:31:10 +00:00
Nathan Hjelm
545d5daced osc: add missing MPI_ERR_RMA_SHARED error code and internal equivalent
cmr=v1.8:reviewer=jsquyres

This commit was SVN r31259.
2014-03-27 20:06:43 +00:00
Nathan Hjelm
fc941edaf8 osc/base: adjust the logic in ompi_osc_base_sndrcv_op to adjust for
the case fix in ompi_osc_base_process_op in r31204.

There are two cases that needed to be handled:

 - The target is a simple datatype (contiguous block of a primitive
   type) but the origin is not. In this case we still need to pack
   the origin data but we can not rely on the convertor to do the
   unpack (see r31204).

 - Both the origin and target datatypes are simple datatypes. In this
   case we can use ompi_op_reduce to do the accumulation without having
   to pack the origin data.

cmr=v1.8:ticket=trac:4449

This commit was SVN r31231.

The following SVN revision numbers were found above:
  r31204 --> open-mpi/ompi@949abe45cd

The following Trac tickets were found above:
  Ticket 4449 --> https://svn.open-mpi.org/trac/ompi/ticket/4449
2014-03-26 17:07:29 +00:00
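The two cases listed in fc941edaf8 amount to a small decision on whether the origin and target datatypes are "simple" (a contiguous block of a primitive type). The sketch below encodes that decision. The enum and function names are hypothetical, and the third branch (a non-simple target handled through the convertor) is an assumption about the general path that the message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative labels for the accumulate strategies. */
typedef enum {
    PATH_REDUCE_DIRECT,     /* both simple: reduce without packing */
    PATH_PACK_THEN_REDUCE,  /* simple target only: pack origin, then reduce */
    PATH_CONVERTOR          /* assumed general path for other targets */
} accumulate_path_t;

/* Pick a strategy from the two "simple datatype" properties. */
static accumulate_path_t choose_accumulate_path(bool origin_simple,
                                                bool target_simple)
{
    if (target_simple) {
        return origin_simple ? PATH_REDUCE_DIRECT : PATH_PACK_THEN_REDUCE;
    }
    return PATH_CONVERTOR;
}

/* Whether the chosen path requires packing the origin data first. */
static bool path_needs_pack(accumulate_path_t path)
{
    assert(path == PATH_REDUCE_DIRECT || path == PATH_PACK_THEN_REDUCE ||
           path == PATH_CONVERTOR);
    return path == PATH_PACK_THEN_REDUCE;
}
```

In the direct case the message names ompi_op_reduce as the routine that performs the accumulation; the sketch only selects the path and does not call it.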
Nathan Hjelm
925af4706c osc/sm: fix bugs in window initialization and finalization
Fixed two bugs:

 - Use module->comm NOT comm to get the CID for the shared memory backing
   file. This fixes the case where there are multiple shared memory windows
   at the same time.

 - Remember to unlink the shared memory backing file.

Refs trac:4438

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31227.

The following Trac tickets were found above:
  Ticket 4438 --> https://svn.open-mpi.org/trac/ompi/ticket/4438
2014-03-26 15:52:51 +00:00
Nathan Hjelm
020f011552 osc/rdma: fix bugs in lock_all and flush_all
This commit fixes two bugs:

 - We were not correctly setting the lock type in the outstanding lock
   for lock_all. This caused undefined behavior.

 - flush_all was incorrectly checking for comm size - 1 lock acks but
   comm size flush acks. This is the reverse of what was intended.

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31226.
2014-03-25 23:39:43 +00:00
Nathan Hjelm
0d703759f6 osc/rdma: fix possible error when encountering accumulate lock contention
It is possible to get into a situation where a small accumulate operation
can not be completed because a large accumulate operation holds the lock.
In this case we may return from wait/flush/etc before the operation is
complete. To handle this case increment the expected incoming fragment
count when queuing an accumulate operation and increment the incoming
fragment count after processing the accumulate operation.

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31224.
2014-03-25 21:00:43 +00:00
Nathan Hjelm
3df85b47e9 osc/rdma: quiet warning in r31197
cmr=v1.8:ticket=trac:4441

This commit was SVN r31223.

The following SVN revision numbers were found above:
  r31197 --> open-mpi/ompi@0ed44f2fdb

The following Trac tickets were found above:
  Ticket 4441 --> https://svn.open-mpi.org/trac/ompi/ticket/4441
2014-03-25 21:00:36 +00:00
Nathan Hjelm
20af8339e6 osc/base: add support for datatypes that are a contiguous combination
of the primitive datatype

In this case we can not use the convertor to run the accumulate operation
since the datatype is more or less a primitive type.

cmr=v1.8:ticket=trac:4449

This commit was SVN r31222.

The following Trac tickets were found above:
  Ticket 4449 --> https://svn.open-mpi.org/trac/ompi/ticket/4449
2014-03-25 21:00:26 +00:00