NOTE: This build system does not work with the current autogen.sh. A modified one is under heavy testing to make sure it does not have side effects.
This commit was SVN r16110.
simultaneously, but is doing it incorrectly. If the function is already running
for one communicator and it is called from another thread for another communicator
with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid()
will fail and the function will be executed for two different communicators by
two threads simultaneously. There is nothing in the algorithm that prevents it
from running simultaneously for different communicators as far as I can see,
but ompi_comm_unregister_cid() assumes that it is always called for a communicator
with the lowest cid and this is not always the case. This patch removes the bogus
lowest cid check and fixes ompi_comm_register_cid() to properly remove the cid from
the list.
This commit was SVN r16088.
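The removal problem described in the r16088 message can be sketched as follows. This is only an illustration under stated assumptions: `struct cid_entry` and `cid_list_remove()` are hypothetical stand-ins (the real code uses Open MPI's own list types), and the caller is assumed to hold whatever lock protects the list. The point is that the entry is found by its cid rather than assumed to be the lowest one.

```c
#include <stddef.h>

/* Hypothetical stand-in for a registered-cid list entry. */
struct cid_entry {
    unsigned cid;
    struct cid_entry *next;
};

/* Unlink and return the entry whose cid matches, or NULL if there is none.
 * The entry may sit anywhere in the list, not only at the lowest cid.
 * Assumption: the caller already holds the lock protecting the list. */
static struct cid_entry *cid_list_remove(struct cid_entry **head, unsigned cid)
{
    struct cid_entry **link = head;

    while (NULL != *link) {
        if ((*link)->cid == cid) {
            struct cid_entry *found = *link;
            *link = found->next;   /* splice the entry out of the list */
            found->next = NULL;
            return found;
        }
        link = &(*link)->next;
    }
    return NULL;
}
```

Walking the list through a pointer-to-pointer avoids a special case for removing the head entry.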
off and bogus addresses to show up for the requests, which in turn causes
message queues not to show up when debugging a 64 bit app on a 32 bit
tvd and dll, on Solaris SPARC only.
This commit was SVN r16052.
Basically revert this part of r16015.
This commit was SVN r16029.
The following SVN revision numbers were found above:
r16015 --> open-mpi/ompi@435e7d80e9
from showing up in the message queue graph. Tags are now cast to int
before the negative checks, since tags by the spec are stored as
mqs_tword_t, an unsigned long long.
This commit was SVN r16027.
The following SVN revision numbers were found above:
r15915 --> open-mpi/ompi@b9ea4c92e7
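The tag check in the r16027 message can be illustrated with a small sketch. Assumptions: `tword_t` below is a local stand-in for `mqs_tword_t` (described above as an unsigned long long), and `tag_is_negative()` is a hypothetical helper, not a function from the debugger DLL.

```c
/* Local stand-in for mqs_tword_t, described in the message as an
 * unsigned long long. */
typedef unsigned long long tword_t;

/* Returns 1 if the stored tag represents a negative tag, 0 otherwise.
 * Testing `tag < 0` directly on the unsigned value could never be true,
 * so the value is converted to int first. Note: converting an
 * out-of-range unsigned value to int is implementation-defined in C;
 * on common two's-complement platforms the low bits are kept, which is
 * the behavior this sketch assumes. */
static int tag_is_negative(tword_t tag)
{
    return (int) tag < 0;
}
```

With this helper, a tag that was stored from a negative int tests as negative again, while ordinary non-negative user tags do not.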
no more work associated with this request. No more outstanding completions or
packets and send scheduling isn't running in another thread.
This commit was SVN r16013.
the ompi_convertor_need_buffers function to only return 0 if the convertor
is homogeneous (which it never does on the trunk, but does on v1.2, but
that's a different issue). Only enable the heterogeneous rdma code for
a btl if it supports it (via a flag), as some btls need some work for this
to work properly. Currently only TCP and OpenIB are extensively tested.
This commit was SVN r15990.
side too. Otherwise the content of the recvreq->req_rdma array is replaced later
without freeing the previous content, and the refcount on the registration in the
mpool becomes wrong.
This commit was SVN r15978.
to always first check for a NULL frag pointer before trying to send the
fragment. This avoids an issue in multi-threaded execution in which
multiple threads working on the same endpoint can result in a thread
finding itself here with nothing to send.
This commit was SVN r15963.
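The NULL check described in the r15963 message follows a common pattern, sketched here under assumptions: `struct frag`, `struct endpoint`, `endpoint_pop()` and `endpoint_progress()` are hypothetical names, the "send" is reduced to a counter, and locking is left to the caller.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a fragment and an endpoint's pending queue. */
struct frag {
    struct frag *next;
    int id;
};

struct endpoint {
    struct frag *pending_head;
    int sent_count;              /* stand-in for the real send path */
};

/* Pop the next pending fragment, or return NULL if the queue is empty.
 * Assumption: in multi-threaded code this runs under the endpoint lock. */
static struct frag *endpoint_pop(struct endpoint *ep)
{
    struct frag *f = ep->pending_head;
    if (NULL != f) {
        ep->pending_head = f->next;
        f->next = NULL;
    }
    return f;
}

/* Returns 1 if a fragment was sent, 0 if there was nothing to send. */
static int endpoint_progress(struct endpoint *ep)
{
    struct frag *f = endpoint_pop(ep);
    if (NULL == f) {
        /* Another thread may already have drained the queue. */
        return 0;
    }
    ep->sent_count++;
    return 1;
}
```

A second caller arriving after the queue has been drained simply gets 0 back instead of dereferencing a NULL fragment.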
difference between the user specified length, and the one available from
Open MPI (this allows one to see the truncated receives). Moreover, if the
data-type used is named we now print the count as well as the name of
the used data-type.
This commit was SVN r15962.
from the posted generic_recv-queue.
- Move the PERUSE_COMM_MSG_MATCH_POSTED_REQ from
MCA_PML_OB1_RECV_REQUEST_MATCHED to
mca_pml_ob1_recv_frag_match() as suggested by Terry Dontje
Only post, if this is not a probe/iprobe request.
- Do not post PERUSE_COMM_REQ_MATCH_UNEX for probes / iprobes, and
post it in the correct order, before PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q
This commit was SVN r15947.
PERUSE_COMM_REQ_ACTIVATE event.
Therefore move the PERUSE_TRACE_COMM_EVENT for this event from
MCA_PML_BASE_SEND_REQUEST_INIT / MCA_PML_BASE_RECV_REQUEST_INIT
to the proper places into pml_ob1_isend.c / pml_ob1_irecv.c right
after the MCA_PML_OB1_SEND_REQUEST_INIT /
MCA_PML_OB1_RECV_REQUEST_INIT.
This commit was SVN r15945.
that it is >= 1, so making it a size_t makes it easier to interact
with all the other size_t variables and removes a compiler warning.
This commit was SVN r15935.
one HCA. Multiple ports, LMC, multiple BTLs per one LID. Having only one CQ for
all of them substantially reduces polling time.
This commit was SVN r15933.
on a parameter and are determined at run time. r15346 removed the calculation of
correct sizes for these structures. This patch adds it back and fixes trac:1116, #1114.
This commit was SVN r15932.
The following SVN revision numbers were found above:
r15346 --> open-mpi/ompi@433f8a7694
The following Trac tickets were found above:
Ticket 1116 --> https://svn.open-mpi.org/trac/ompi/ticket/1116
of the process in the MPI_COMM_WORLD, while a local rank is the rank
of the process in the communicator where the request was posted. In
order to get the message graph nicely, each request has to have the
global rank set correctly.
This commit was SVN r15926.
used at once (up to one unique collective module per collective function).
Matches r15795:15921 of the tmp/bwb-coll-select branch
This commit was SVN r15924.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15795
r15921
A subset of this patch needs to be applied to v1.2
Refs trac:928
This commit was SVN r15918.
The following Trac tickets were found above:
Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928
mpi_preconnect_oob_simultaneous > np. Need to scale back
simultaneous to equal np in those cases. Reviewed by Brian.
This commit fixes trac:1064.
This commit was SVN r15916.
The following Trac tickets were found above:
Ticket 1064 --> https://svn.open-mpi.org/trac/ompi/ticket/1064
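The scaling described in the r15916 message amounts to a clamp. A minimal sketch, assuming `simultaneous` holds the value of `mpi_preconnect_oob_simultaneous` and `np` is the number of processes; `clamp_simultaneous()` is a hypothetical helper name.

```c
/* Returns the number of simultaneous preconnect operations to use:
 * the requested value, scaled back to np when it exceeds np. */
static int clamp_simultaneous(int simultaneous, int np)
{
    return (simultaneous > np) ? np : simultaneous;
}
```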
fix some of the multi-threading problems for the cid allocation. Two bugs
specifically:
- since we do not have a queue for incoming fragments of unknown cid, we need
to synchronize all processes before exiting the communicator creation. This
synchronization was/is located in comm_activate, which was however too late
for the multi-threaded case. Thus, for multi-threaded scenarios we are now
synchronizing 'before' we allow another thread to enter the cid-allocation
loop.
- for synchronization, we used allreduce operations for the sake of
simplicity. It turns out that these operations interfered with the
allreductions in the cid-allocation routine, which led to nonsensical results
in the cid-allocation and potentially to endless loops.
Multi-threaded communicator creation seems to work now, but is still 'very
very' slow. I think the busy wait of threads is killing the performance of
the active threads in the cid allocation. But this is another topic.
This commit was SVN r15910.
only static libraries. Previously, we were linking the libraries
directly into the common, btl, and mtl code. This seemed to work fine
for me on my Opteron Fedora box, but caused Lisa some issues (PtlNIInit
would succeed, but the network handle would fail when used with
PtlEQAlloc).
Instead, link the portals libraries directly into libmpi and not at
all into the common, btl, or mtl components. Then use some linker
tricks to force the linker to bring in the public interface for the
reference implementation (which thankfully is pretty small).
This commit was SVN r15902.
lowest_free and number_free detect if the communicator list has changed.
If not, there is no reason to rebuild it, just use the old one.
This commit was SVN r15895.
Fixed a condition test while checking that all segments are empty.
Without this fix, a NULL segment pointer could make it past the
test, resulting in a SegV when dereferenced.
This commit was SVN r15891.
The following Trac tickets were found above:
Ticket 1134 --> https://svn.open-mpi.org/trac/ompi/ticket/1134
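The kind of condition fixed in r15891 can be sketched as below. The original test is not shown in the message, so this is only an assumed shape: `struct segment` and `all_segments_empty()` are hypothetical, and a NULL entry is treated as empty.

```c
#include <stddef.h>

/* Hypothetical segment descriptor; only the length matters here. */
struct segment {
    size_t len;
};

/* Returns 1 if every segment is empty, 0 otherwise.
 * A NULL entry is treated as empty and is never dereferenced. */
static int all_segments_empty(struct segment *const *segs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* Check the pointer before the field: && short-circuits, so
         * segs[i]->len is not evaluated when segs[i] is NULL. */
        if (NULL != segs[i] && 0 != segs[i]->len) {
            return 0;
        }
    }
    return 1;
}
```

Swapping the two operands of `&&`, or joining them with `||`, would let a NULL pointer reach the dereference, which is the class of bug the message describes.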
{{{
ompi/debuggers/ompi_dll.c:102: error: initializer element is not
constant
}}}
The fix is stupid and I suspect that we'll want to ''not'' print out
all this debugging information all the time. But I'll leave that to
George to fix... :-)
This commit was SVN r15880.
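For context on the compiler error quoted in the r15880 message: in C, an object with static storage duration must be initialized with a constant expression. The sketch below is a generic illustration, not the actual code at ompi_dll.c:102 (which is not shown in the message); `runtime_value()` and `get_cached()` are hypothetical.

```c
/* Hypothetical function whose result is only known at run time. */
static int runtime_value(void)
{
    return 42;
}

/* Writing `static int cached = runtime_value();` at file scope is
 * rejected by a C compiler with "initializer element is not constant".
 * One common workaround is to initialize lazily at run time. */
static int cached;
static int cached_ready = 0;

static int get_cached(void)
{
    if (!cached_ready) {
        cached = runtime_value();
        cached_ready = 1;
    }
    return cached;
}
```

This lazy form is not thread-safe as written; it only shows where the initialization has to move.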
* Code cleanup and rationalization
* Fixed: mca_pml_base_send/recv_request are now allocated before recreation by the PML-V
* Fixed: pointer arithmetic bug in sender based that crashed
* Changed: directory structure. This is one step forward using autogen.sh to build static-components.h (it needs to have the directory structure of a mca framework for this).
This commit was SVN r15878.
opal_list_t, ompi_free_list_t so every time there is a modification
in one of these files (such as changing the way we allocate the
elements in the free list) the debugger interface has to
reflect these changes.
This commit was SVN r15876.
semicolons but the new specification string used colons. The text
parser now looks for colons.
* Changed all opal_output() error messages to
much-more-helpful/descriptive opal_show_help() messages.
* A few minor style/indenting fixes
This commit was SVN r15850.
The following SVN revision numbers were found above:
r15848 --> open-mpi/ompi@dd30597f39
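The delimiter change in the r15850 message can be illustrated with a small splitter. The actual format of the specification string is not given in the message, so the input below is made up, and `split_on_colons()` is a hypothetical helper rather than the parser in the tree.

```c
#include <string.h>

/* Split `spec` in place on ':' characters.
 * Stores up to `max` field pointers in `fields` and returns how many
 * were stored. Empty fields are kept; fields beyond `max` are dropped. */
static int split_on_colons(char *spec, char **fields, int max)
{
    int n = 0;
    char *p = spec;

    while (n < max) {
        char *colon = strchr(p, ':');
        fields[n++] = p;
        if (NULL == colon) {
            break;              /* last field reached */
        }
        *colon = '\0';          /* terminate the current field */
        p = colon + 1;
    }
    return n;
}
```

A parser that still searched for ';' would see a colon-separated string as a single field, which matches the mismatch described in the message.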
in the OMPI proc structures. For now, use an extension of the modex that is
keyed on strings. Eventually, this should use the attribute put/get that is
part of the RSL interface.
This commit was SVN r15820.
/tmp/jms-modular-wireup branch):
* This commit moves all the openib BTL connection code out of
btl_openib_endpoint.c and into a connect "pseudo-component" area,
meaning that different schemes for making OFA connections can
be chosen via function pointer (i.e., MCA parameter) at run-time.
* The connect/connect.h file includes comments describing the
specific interface for the connect pseudo-component.
* Two pseudo-components are in this commit (more can certainly be
added).
* oob: use the same old oob/rml scheme for creating OFA connections
that we've had forever; this now just puts the logic into this
self-contained pseudo-component.
* rdma_cm: a currently-empty set of functions (that currently
return NOT_IMPLEMENTED) that will someday use the RDMA connection
manager to make OFA connections.
This commit was SVN r15786.
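The "pseudo-component" selection described in the r15786 message can be sketched as a table of function pointers chosen by name. Everything here is an assumption for illustration: the struct layout, the function names, and the return codes are hypothetical, and the real interface is the one documented in connect/connect.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes. */
enum { CONNECT_SUCCESS = 0, CONNECT_NOT_IMPLEMENTED = -1 };

/* Hypothetical interface: one set of function pointers per scheme. */
struct connect_scheme {
    const char *name;
    int (*open)(void);
    int (*start_connect)(int peer);
};

/* "oob": stand-in for the existing oob/rml-based wireup. */
static int oob_open(void) { return CONNECT_SUCCESS; }
static int oob_start(int peer) { (void) peer; return CONNECT_SUCCESS; }

/* "rdma_cm": placeholder functions that are not implemented yet. */
static int rdma_cm_open(void) { return CONNECT_NOT_IMPLEMENTED; }
static int rdma_cm_start(int peer) { (void) peer; return CONNECT_NOT_IMPLEMENTED; }

static const struct connect_scheme schemes[] = {
    { "oob",     oob_open,     oob_start },
    { "rdma_cm", rdma_cm_open, rdma_cm_start },
};

/* Pick a scheme by name, e.g. from a run-time parameter value.
 * Returns NULL if the name is unknown. */
static const struct connect_scheme *select_scheme(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(schemes) / sizeof(schemes[0]); i++) {
        if (0 == strcmp(schemes[i].name, name)) {
            return &schemes[i];
        }
    }
    return NULL;
}
```

Callers then go through the selected struct only, so adding another scheme means adding one more table entry.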
the Sandia reference implementation of Portals, and doesn't have the cnos
functions. This file should never be compiled (and wasn't being compiled)
on the Cray machines, so doesn't need to be updated to support CNL.
This commit was SVN r15778.
The following SVN revision numbers were found above:
r15756 --> open-mpi/ompi@755658694e
Application Level Placement Scheduler (ALPS).
This commit was tested under two Cray machines at ORNL: Jaguar (Catamount)
and Rizzo (CNL Test cage). Both machines performed as they should across
the commit.
It is likely that more changes will follow as the work and environment
stabilize.
Most of the infrastructure works the same for Catamount and CNL
except for a few bits. Below are the highlights:
Default IFACE Change:
On Catamount we can use PTL_IFACE_DEFAULT, but the CNL system we have access
to will fail on this interface, so it should be set to:
IFACE_FROM_BRIDGE_AND_NALID(PTL_BRIDGE_UK,PTL_IFACE_SS).
So if we detect that we are running with YOD then use the former interface
and if we detect that we are running with ALPS then use the latter.
We will want to pursue a more elegant solution if this interface continues to
change across machines.
PtlGetId and cnos_register_ptlid:
The header suggests that these should never be called when launching with YOD.
But in the ALPS environment the cnos_barrier() will hang forever if these
functions are not called after PtlNIInit(). Since these functions only need to
be called once, and the orte rmgr/cnos component is loaded before the ompi
common/portals component, just call these functions once in the rmgr/cnos
component.
cnos_barrier_init():
This is a noop for YOD, but critical for ALPS. So be sure to call it before
calling the first barrier in the rmgr/cnos component.
cnos_barrier vs cnos_pm_barrier:
It is suggested the cnos_pm_barrier only be used during finalization
as it will indicate to the launcher (yod or aprun) that the app is about
to complete. It was suggested that we use the regular cnos_barrier() instead.
I want to look into this a bit more to make sure there are not adverse
side effects. A note has been placed in the code to indicate this reasoning.
This commit was SVN r15756.
2GB to 4GB-1 by using long instead of size_t for the sm size.
* this is done to keep users from hitting a problem with ftruncate() in
the common sm component (and possibly others): ftruncate() takes an
off_t, which is a signed long integer. If we use an unsigned long,
the call runs into an invalid argument error (errno=22).
* See trac #1117
This commit was SVN r15752.
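The size limit in the r15752 message comes down to whether a requested size survives conversion to a signed type. A minimal sketch, assuming (as the message states) that off_t is a signed long on the platform in question; `size_fits_signed_long()` is a hypothetical helper.

```c
#include <limits.h>

/* Returns 1 if `size` can be represented as a non-negative signed long,
 * 0 if converting it would yield a value ftruncate() would reject.
 * Assumption: off_t is a signed long, as described in the message. */
static int size_fits_signed_long(unsigned long size)
{
    return size <= (unsigned long) LONG_MAX;
}
```

A caller could run this check before handing the size to ftruncate() and report a clear error instead of failing with errno 22.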
mpi_show_mpi_alloc_mem_leaks
When activated, MPI_FINALIZE displays a list of memory allocations
from MPI_ALLOC_MEM that were not freed by MPI_FREE_MEM (in each MPI
process).
* If set to a positive integer, display only that many leaks.
* If set to a negative integer, display all leaks.
* If set to 0, do not show any leaks.
This commit was SVN r15736.
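The three cases listed for mpi_show_mpi_alloc_mem_leaks in the r15736 message can be written as one small function. This is a sketch of the described semantics only; `leaks_to_display()` is a hypothetical helper, not the code in the tree.

```c
#include <stddef.h>

/* param: value of mpi_show_mpi_alloc_mem_leaks
 * total: number of leaked allocations found at finalize
 * Returns how many leaks to display. */
static size_t leaks_to_display(int param, size_t total)
{
    if (0 == param) {
        return 0;                 /* 0: do not show any leaks */
    }
    if (param < 0) {
        return total;             /* negative: show all leaks */
    }
    /* positive: show at most that many leaks */
    return ((size_t) param < total) ? (size_t) param : total;
}
```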
This mpool will have no btl module owner because there was no btl created for
the HCA with no ports, but it will still be tracked in the mpool
framework (i.e., it's available).
If MPI_ALLOC_MEM is called by the app, one of two things will happen:
1. if there's an HCA on the host with some active ports, the openib
btl component will still be in the process space, and therefore
the "mpool with no btl" (MWNB) module will still be able to call
the reg/dereg functions, and all will be fine. However, if
MPI_FREE_MEM is never invoked to free the memory, bad things will
happen during MPI_FINALIZE. The pml is finalized, which finalizes
all the btls. The btls finalize all their mpools and all is fine.
But later we close down the mpool framework which then finalizes
any left over mpool modules, such as MWNB. However, the openib
BTL module functions that the MWNB was registered with are no
longer in the process space, and it segv's while trying to deregister
the memory.
2. if there are *no* HCAs on the host with active ports, then the
openib btl will have been unloaded, and when the MWNB tries to
register the memory, the functions it tries to call (in the openib
btl) are no longer there, and we segv.
This commit was SVN r15735.
passed to MPI_COMM_FREE, it invoked the error handler on
MPI_COMM_WORLD, not on MPI_COMM_SELF. This commit changes the
behavior: if MPI_COMM_SELF is passed to MPI_COMM_FREE, we invoke the
error handler on MPI_COMM_SELF (not MPI_COMM_WORLD). Fixes trac:1109.
This commit was SVN r15682.
The following Trac tickets were found above:
Ticket 1109 --> https://svn.open-mpi.org/trac/ompi/ticket/1109