The primary issue with udcm is that the immediate data in message
acks was often bogus. This caused the sender to keep retrying even
though a message was received and acked. The fix is to use the
source LID and QP to determine which message is being acked. In
most cases this should work well since only one message will be
in flight to any peer.
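
A minimal sketch of the matching idea, not the actual udcm code: the
ack is matched against pending messages by the sender's LID and QP
number rather than by the immediate data. The pending_msg_t type and
both function names are hypothetical; in libibverbs the source LID and
QP would typically come from the slid and src_qp fields of the work
completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a message awaiting an ack (not the udcm struct). */
typedef struct pending_msg {
    uint16_t peer_lid;   /* LID of the peer the message was sent to */
    uint32_t peer_qpn;   /* QP number of the peer the message was sent to */
    int      acked;      /* set once a matching ack arrives */
} pending_msg_t;

/* Return the first un-acked message sent to (src_lid, src_qpn), or NULL. */
static pending_msg_t *find_acked_msg(pending_msg_t *msgs, size_t n,
                                     uint16_t src_lid, uint32_t src_qpn)
{
    for (size_t i = 0; i < n; ++i) {
        if (!msgs[i].acked &&
            msgs[i].peer_lid == src_lid && msgs[i].peer_qpn == src_qpn) {
            return &msgs[i];
        }
    }
    return NULL;
}

/* Handle an incoming ack; returns 1 if it matched a pending message. */
static int handle_ack(pending_msg_t *msgs, size_t n,
                      uint16_t src_lid, uint32_t src_qpn)
{
    pending_msg_t *msg = find_acked_msg(msgs, n, src_lid, src_qpn);
    if (NULL == msg) {
        return 0;        /* stale or unknown ack: ignore it */
    }
    msg->acked = 1;      /* the sender can stop retransmitting */
    return 1;
}
```

With only one message in flight per peer, the first match is the right
one; with several in flight this sketch would ack the oldest.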
This commit was SVN r28444.
of individual regions (each region is a multiple of page size in
length), and each process claims its own regions by binding them to its
local memory. Each process would end up membinding something like 16
individual regions in the overall shmem segment.
There were two errors in this code relating to the memory affinity
pinning. Some combination of these two errors would lead to kernel
panics (!) on my RHEL 6.2 x86_64 machines when used with mmap'ed
shared memory (not posix or sysv shared memory, curiously enough):
1. The shared memory segment is initially divided into two regions:
control and data. The control starts at the beginning of the shmem
segment, the data starts after that. The data portion, unfortunately,
was ''not'' aligned to a page. So all the multiple-of-page-size
regions that we divvy up were also not aligned on page boundaries. And
therefore all the regions we tried to membind were not on page
boundaries.
The solution was to ensure that the data portion started on a page
boundary. Then all of the individual regions were on page boundaries,
too.
That being said, in my tests, Linux mbind() fails gracefully when the
address is not on a page boundary. So I'm not sure how this worked at
all / led to a kernel panic...
2. There was some bad pointer math that resulted in membinding regions
larger than they should have been, resulting in region overlaps.
There were definitely overlaps between regions in the same process;
it's likely that there were overlaps between regions of multiple
processes, too -- I'm not sure (and don't care to figure out :-) ).
The solution was to fix the pointer math so that each region membinds
exactly itself and no neighboring/overlapping regions.
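
Both fixes can be sketched as follows. This is an illustration under
stated assumptions, not the committed code: align_up(), region_t and
region_at() are hypothetical names, offsets are relative to the start
of the shmem segment, and the page size is assumed to be a power of
two (as returned by sysconf(_SC_PAGESIZE) on Linux).

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of page_size (a power of two). */
static size_t align_up(size_t offset, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (offset + page_size - 1) & ~(page_size - 1);
}

/* Hypothetical region descriptor: byte offsets into the shmem segment. */
typedef struct region {
    size_t start;   /* offset of the first byte of the region */
    size_t len;     /* length in bytes, a multiple of the page size */
} region_t;

/* Compute the i-th region of region_len bytes in the data portion.
 * ctrl_len is the size of the control portion at the segment start. */
static region_t region_at(size_t ctrl_len, size_t page_size,
                          size_t region_len, size_t i)
{
    region_t r;
    assert(region_len % page_size == 0);
    /* Fix 1: the data portion starts on a page boundary. */
    r.start = align_up(ctrl_len, page_size) + i * region_len;
    /* Fix 2: bind exactly this region, not everything after it. */
    r.len = region_len;
    return r;
}
```

The address passed to the binding call (Linux mbind(), for example)
would then be the segment base plus r.start, with r.len as the length,
so that consecutive regions touch but never overlap.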
cmr:v1.7.2:reviewer=samuel
This commit was SVN r28442.
- increased the number of WQEs to minimize the number of RNRs
- it is better to have a high watermark and post a relatively small number of WQEs
- increased the TX queue size
This commit was SVN r28440.
MPI_COMM_SELF in reverse order during MPI_FINALIZE (well, actually,
''all'' attributes are now deleted in reverse order whenever a
communicator is destructed).
Also revamped a few things in the MPI attribute implementation:
* use a One Big Lock philosophy for making the implementation thread
safe (vs. the pair of locks we were using before). One Big Lock is
quite a bit simpler and has fewer corner cases; the code for
attributes is still complicated, but is definitely less complex
than it used to be.
* The COPY_ATTR_CALLBACKS and DELETE_ATTR_CALLBACKS macros no longer
return; they simply set a value if something went wrong. Then we
check this value after the macros complete. This simplifies
unlocking, etc.
* Added write barriers right before releasing locks to ensure memory
consistency.
* Fixed a bunch of typos in comments, and some indenting.
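
A rough sketch of the locking pattern described above, assuming a
pthread mutex stands in for the One Big Lock. All names here
(attr_lock, RUN_ATTR_CALLBACK, run_callbacks, the sample callbacks)
are hypothetical and are not the Open MPI macros; the point is only
that the macro records an error instead of returning, so there is a
single unlock point. Stopping at the first error is a choice made for
this sketch.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical single lock protecting all attribute state. */
static pthread_mutex_t attr_lock = PTHREAD_MUTEX_INITIALIZER;

typedef int (*attr_cb_t)(int key);

/* Hypothetical macro: record the first failure instead of returning. */
#define RUN_ATTR_CALLBACK(cb, key, err)        \
    do {                                       \
        int rc_ = (cb)(key);                   \
        if (0 != rc_ && 0 == (err)) {          \
            (err) = rc_;                       \
        }                                      \
    } while (0)

/* Run callbacks in reverse order; stop after the first failure. */
static int run_callbacks(attr_cb_t *cbs, const int *keys, int n)
{
    int err = 0;
    pthread_mutex_lock(&attr_lock);
    for (int i = n - 1; i >= 0 && 0 == err; --i) {
        RUN_ATTR_CALLBACK(cbs[i], keys[i], err);
    }
    pthread_mutex_unlock(&attr_lock);   /* the only unlock point */
    return err;
}

/* Sample callbacks that log the keys they were called with. */
static int calls[8];
static int ncalls = 0;

static int record_cb(int key)
{
    if (ncalls < 8) { calls[ncalls++] = key; }
    return 0;
}

static int failing_cb(int key)
{
    if (ncalls < 8) { calls[ncalls++] = key; }
    return -1;
}
```

Because the macro never returns from the enclosing function, the error
is checked once after the loop and the lock is always released.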
Many thanks to KAWASHIMA Takahiro who contributed the original patch
for attribute destruction ordering, and who helped test/debug/evolve
the patch to its final form.
Fixes trac:3123.
cmr:v1.7.2:reviewer=bosilca
This commit was SVN r28439.
The following Trac tickets were found above:
Ticket 3123 --> https://svn.open-mpi.org/trac/ompi/ticket/3123
many, many MTT timeouts when running jobs under SLURM: send the right
command at the end to cause remote orteds to shut down.
This commit was SVN r28438.
requests. This patch fixes trac:3475.
CMR v1.6, v1.7
This commit was SVN r28431.
The following Trac tickets were found above:
Ticket 3475 --> https://svn.open-mpi.org/trac/ompi/ticket/3475
from the list (just for good measure), and then free() it (without
using _SAFE, we were accessing memory that was just free()'d to get to
the next item). Also be a little more thorough -- DESTRUCT the list
when we're all done.
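
The bug class described here can be illustrated with a plain singly
linked list. This is a generic sketch, not the OPAL list API: node_t,
push_front() and free_all() are hypothetical names. The key point is
that the next pointer is saved before the current item is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node (not an opal_list_item_t). */
typedef struct node {
    struct node *next;
    int value;
} node_t;

/* Prepend a value; returns the new head, or the old head if malloc fails. */
static node_t *push_front(node_t *head, int value)
{
    node_t *item = malloc(sizeof(*item));
    if (NULL == item) {
        return head;
    }
    item->value = value;
    item->next = head;
    return item;
}

/* Free every node and reset the head; returns the number of nodes freed. */
static size_t free_all(node_t **head)
{
    size_t freed = 0;
    node_t *item = *head;
    while (NULL != item) {
        node_t *next = item->next;  /* save before free(): the "safe" part */
        free(item);                 /* item->next must not be read after this */
        item = next;
        ++freed;
    }
    *head = NULL;                   /* leave the list in a clean, empty state */
    return freed;
}
```

An unsafe loop would advance with item = item->next after free(item),
which reads memory that was just released.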
This commit was SVN r28429.
http://www.open-mpi.org/community/lists/users/2013/04/21689.php, the
assert in opal_datatype_is_contiguous_memory_layout() is not always
correct -- he supplied a test case where it was not valid,
essentially:
1. Call MPI_Type_create_indexed_block(0, ..., &newtype) and commit newtype
2. Call MPI_Type_create_resized(newtype, 0, nonzero_value, &resized) and commit resized
3. Call MPI_File_set_view with resized
This will eventually call opal_datatype_is_contiguous_memory_layout(),
and the assert will fail. After some consultation with George, it was
determined that the assert() is basically good, but it needs to also
check for (count != 0).
This commit was SVN r28398.