inserted in the ompi_proc_list as soon as it is created and is
removed only when its destructor is called. In ompi_proc_finalize
we loop over all procs on the list and release each of them once.
However, because a proc is not removed from the list right away, we
keep decreasing its ref count until it reaches zero and the proc is
finally removed. Thus, we cannot clean up the BML/BTL after the call
to ompi_proc_finalize.
A quick fix is to delay the call to ompi_proc_finalize until all
other frameworks have been finalized, and then the behavior
depicted above will give the expected outcome.
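As a rough sketch of the release pattern described above (simplified; not the literal code in this commit):

    static void ompi_proc_finalize_sketch(void)
    {
        /* A proc is only taken off ompi_proc_list by its destructor, so we
         * keep releasing the head of the list: each OBJ_RELEASE drops the
         * ref count by one, and the final release runs the destructor,
         * which removes the proc from the list. */
        while (!opal_list_is_empty(&ompi_proc_list)) {
            ompi_proc_t *proc = (ompi_proc_t *) opal_list_get_first(&ompi_proc_list);
            OBJ_RELEASE(proc);
        }
    }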
Background: In order to support atomics each btl needs to provide support
for communicating with self unless the btl module can guarantee global
atomicity. Before this commit bml/r2 discarded any BTL with lower
exclusivity than an existing send btl. This would cause the BML to
discard any btl other than self.
The new behavior is as follows:
- If an existing send btl has higher exclusivity, then the btl will not be
added to the send btl list for the endpoint.
- If a btl provides RDMA support then it is always added to the rdma btl
list.
- bml_btl weight for send btls is now calculated across all send btls.
- bml_btl weight for rdma btls is now calculated across all rdma btls.
With this change self should still win as the only send btl for loopback
without disqualifying other btls (ugni, openib) for atomic operations.
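Illustrative pseudo-C for the selection logic above (add_to_send_list() and
add_to_rdma_list() are placeholder helpers, not the real bml/r2 calls):

    for (size_t i = 0; i < num_btls; ++i) {
        mca_btl_base_module_t *btl = btls[i];

        /* send list: only skip the btl if an already-selected send btl is
         * strictly more exclusive */
        if (btl->btl_exclusivity >= best_send_exclusivity) {
            add_to_send_list(endpoint, btl);
        }

        /* rdma list: RDMA-capable btls are always added, so ugni/openib
         * remain usable for atomics even when self wins the send path */
        if (btl->btl_flags & MCA_BTL_FLAGS_RDMA) {
            add_to_rdma_list(endpoint, btl);
        }
    }
    /* weights are then normalized separately across the send list and
     * across the rdma list */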
attempting to break the get into multiple rdma fragments
A little background. Historically ob1 always registered the entire memory
region when the RGET protocol was in use. This changed when Mellanox
added support to fragment RGET using the btl_prepare_dst function. Now
that the BTL layer has changed to split out the limits of get/put there
is explicit fragmentation code in ob1. Before this commit the registration
was still done per RGET fragment.
This commit will attempt to register the entire region before creating
RGET fragments. If the registration is successful, then all RGET
fragments will use this registration; otherwise they will each attempt
to register their own segment of the receive buffer. If that fails
enough times each fragment will give up and fall back on send/recv.
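The intended flow is roughly the following (a sketch; register_region(),
post_rget_fragment() and fall_back_to_send_recv() stand in for the real
ob1/BTL calls):

    mca_mpool_base_registration_t *reg, *frag_reg;
    size_t offset;

    /* Try a single registration covering the whole receive region first. */
    reg = register_region(btl, recv_buf, total_len);

    for (offset = 0; offset < total_len; offset += frag_len) {
        frag_reg = reg;
        if (NULL == frag_reg) {
            /* Whole-region registration failed: this fragment registers
             * just its own segment of the receive buffer. */
            frag_reg = register_region(btl, recv_buf + offset, frag_len);
        }
        if (NULL == frag_reg) {
            /* Registration keeps failing (retry/threshold logic omitted):
             * give up on RGET and fall back on send/recv. */
            fall_back_to_send_recv(recvreq);
            break;
        }
        post_rget_fragment(btl, recv_buf + offset, frag_len, frag_reg);
    }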
PSM has issues when trying to call psm_ep_connect() more than once for a
specific peer. Use the psm_ep_connect mask argument to avoid connecting
to processes that are already connected.
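The mask-based call looks roughly like this (sketch; already_connected() and
the surrounding arrays are placeholders for the MTL's own bookkeeping):

    /* Build a mask so that psm_ep_connect() only acts on peers that have
     * not been connected yet; entries with mask[i] == 0 are skipped. */
    int *mask = calloc(nprocs, sizeof(int));
    psm_error_t *errors = malloc(nprocs * sizeof(psm_error_t));

    for (int i = 0; i < nprocs; ++i) {
        mask[i] = already_connected(&procs[i]) ? 0 : 1;
    }

    psm_error_t rc = psm_ep_connect(ep, nprocs, epids, mask, errors,
                                    epaddrs, timeout_ns);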
OMPI ticket #268.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
When running many ranks on a single node using PSM, it's possible to
exhaust the network hardware contexts (there are 16). This patch checks
if only a single node is being used. If so, the 'ipath' component of PSM
is disabled and no hardware contexts are opened.
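One way to express that with PSM's device list (a sketch, assuming the
PSM_DEVICES environment variable and a precomputed local/total process count):

    /* All peers are on this node: restrict PSM to the self and shared-memory
     * devices so no ipath hardware context is opened. The final 0 means an
     * explicit user setting of PSM_DEVICES is left untouched. */
    if (num_local_procs == num_total_procs) {
        setenv("PSM_DEVICES", "self,shm", 0);
    }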
ompi_osc_signal_outgoing was moved from ompi_osc_rdma_frag_start to frag_send,
which gave correct results for the bug reproducer but caused hangs with simple
OSC tests. Moving ompi_osc_signal_outgoing back fixes this; all tests now pass.
Closes #256
opal_mutex_t must be OBJ_DESTRUCTed in order to avoid
a memory leak (pthread_mutex_init allocates memory under
Cygwin, so pthread_mutex_destroy is mandatory).
Thanks to Marco Atzeri for reporting this issue.
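The required pairing, for reference:

    opal_mutex_t lock;

    OBJ_CONSTRUCT(&lock, opal_mutex_t);
    /* ... use the lock ... */
    OBJ_DESTRUCT(&lock);   /* runs the destructor, which calls
                              pthread_mutex_destroy() and releases the
                              memory pthread_mutex_init allocated on Cygwin */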
Some of the counters used by the "rdma" one-sided component are intended
to overflow. Since overflow behavior is undefined for signed integers in
C it is safer to use unsigned integers here.
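For reference, the C rule this relies on:

    #include <stdint.h>

    static void wraparound_example(void)
    {
        uint32_t counter = UINT32_MAX;
        counter += 1;        /* well defined: unsigned arithmetic wraps to 0 */
        (void) counter;

        /* int32_t c = INT32_MAX; c += 1;   <-- undefined behavior in C,
         * which is why the one-sided counters were made unsigned */
    }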
These two macros set the prefix for the OPAL and ORTE libraries,
respectively. Specifically, the OPAL library will be named
libPREFIXopen-pal.la and the ORTE library will be named
libPREFIXopen-rte.la.
These macros must be called, even if the prefix argument is empty.
The intent is that Open MPI will call these macros with an empty
prefix, but other projects (such as ORCM) will call these macros with
a non-empty prefix. For example, ORCM libraries can be named
liborcm-open-pal.la and liborcm-open-rte.la.
This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL. For example,
when running MPI applications under ORCM, if the ORTE and OPAL
libraries used by OMPI and ORCM are not identical (which is likely,
because they are released at different times), we need
to ensure that the OMPI applications link against their ORTE and OPAL
libraries, but the ORCM executables link against their ORTE and OPAL
libraries.
the OPAL and ORTE libraries. This is required by projects such as ORCM
that have their own ORTE and OPAL libraries in order to avoid library
confusion. By renaming their version of the libraries, the OMPI
applications can dynamically load the correct one for their
build."
This reverts commit 63f619f871.
Add MPI_Ibarrier.3in to reference MPI_Barrier.3, and update
MPI_Barrier.3in to include bindings for MPI_Ibarrier. Slightly update
the text to be inclusive of the non-blocking case.
Fixes #242.
* redefine orte_process_name_t so it can be converted
between host and network format as an opal_identifier_t
aka uint64_t by the OPAL layer (sketched below).
* correctly send OPAL_DSTORE_ARCH key
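A sketch of the conversion mentioned in the first bullet (field names and
widths as assumed here; the real structs live in the orte/opal headers):

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t opal_identifier_t;

    typedef struct {          /* stand-in for orte_process_name_t */
        uint32_t jobid;
        uint32_t vpid;
    } name_sketch_t;

    static opal_identifier_t name_to_identifier(const name_sketch_t *name)
    {
        opal_identifier_t id;
        /* same size as the name, so OPAL can hton/ntoh it as one uint64_t */
        memcpy(&id, name, sizeof(id));
        return id;
    }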
of the topology is higher than the communicator size
It is possible to have a topology degree higher than the size of the communicator.
For example, a periodic cartesian communicator on MPI_COMM_SELF. This will leave
the neighborhood collectives with a request buffer that is too small.
This commit introduces a semantic change:
from now on, c_topo must be set before invoking coll_select.
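For reference, a minimal case that produces a topology degree larger than the
communicator size:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int dims[1] = {1}, periods[1] = {1};
        MPI_Comm cart;

        /* Periodic 1-D Cartesian topology on a single rank: the topology
         * degree is 2 (left and right neighbor, both self) while the
         * communicator size is 1. */
        MPI_Cart_create(MPI_COMM_SELF, 1, dims, periods, 0, &cart);

        int sendbuf = 42, recvbuf[2];
        MPI_Neighbor_allgather(&sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, cart);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }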
osc/rdma uses counters to determine if all messages have been received
before exiting synchronization calls. The problem is that the active
target counter is always increasing (never zeroed). If over 2^31-1
messages are sent this causes the counter to overflow (in itself this
isn't an error). This causes test/wait to return before the communication
is complete. There is an additional error in the use of the fragment
flush function. If PSCW synchronization is in use this function CAN NOT
be called unless a post message has arrived.
Relevant mailing list thread: http://www.open-mpi.org/community/lists/devel/2014/10/16016.php
This commit fixes both issues. Tested against MTT and issue reproducer.
Closes #224.
Properly set up the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init, as the user may choose to remove fqdn and strip prefixes during that time. Set up the job_session_dir and other such info immediately when it becomes available during orte_init.
Update the VERSION file scheme:
* Remove "want_repo_rev".
* Add "tarball_version".
All values are now always included (major, minor, release, greek,
repo_rev). However, configure.ac now runs "opal_get_version.sh
... --tarball", which will return the value of tarball_version (if it
is non-empty) or the "full" version string (i.e.,
"major.minor.releasegreek").
Remove configure.params support: configure.params hasn't been used in
years.
Also remove autogen.subdirs support; those should really be handled by
their respective Makefile.am's.
A problem was found with the libnbc MPI_Iallgather
routine when using intercommunicators. Special
thanks to Takahiro Kawashima (Fujitsu) for the patch
and a test case. Verified master fails without the
patch and the test passes with the patch applied.
Fixes #219
Initialize the blocking_fence flag to false as the code logic indicates that it should only be set if someone provides that flag.
Thanks to Lisandro Dalcin for reporting it
cmr=v1.8.4:reviewer=hjelmn
This commit was SVN r32812.
MTT found that the addition of the MPI_SIZEOF interfaces to mpif.h was
causing a linker error with the Absoft compiler. Absoft is working on
a fix, but we can workaround the issue for now. See comment in
Makefile.am in this commit for a lengthy explanation.
Refs trac:4917
This commit was SVN r32797.
The following Trac tickets were found above:
Ticket 4917 --> https://svn.open-mpi.org/trac/ompi/ticket/4917
of the topology is higher than the communicator size
It is possible to have a topology degree higher than the size of the communicator.
For example, a periodic cartesian communicator on MPI_COMM_SELF. This will leave
the neighborhood collectives with a request buffer that is too small. This commit
adds a call that will dynamically increase the size of the request buffer if it
is too small.
A better fix would be to create the topology *before* calling the coll_select
routine on a communicator. This will take some discussion and the solution will
not likely be ready anytime soon.
Thanks to Lisandro Dalcin for reporting this.
Original thread: http://www.open-mpi.org/community/lists/devel/2014/08/15713.php
cmr=v1.8.3:reviewer=jsquyres
This commit was SVN r32796.
gfortran 4.8 does not support storage_size() on all relevant types
that we need. So add a configure test to check and see if the
compiler's storage_size() intrinsic supports enough types for us to do
MPI_SIZEOF.
Also remove an accidentally redundant check for fortran INTERFACE.
Refs trac:4917
This commit was SVN r32790.
The following Trac tickets were found above:
Ticket 4917 --> https://svn.open-mpi.org/trac/ompi/ticket/4917
mpi-f08.F90 includes sizeof_f08.h, so we need to add a Makefile
dependency to ensure that sizeof_f08.h is built first.
Refs trac:4917
This commit was SVN r32789.
The following Trac tickets were found above:
Ticket 4917 --> https://svn.open-mpi.org/trac/ompi/ticket/4917
1. Fixes according to (http://www.open-mpi.org/community/lists/devel/2014/09/15869.php)
2. Force mpisync:rank0 to gather results. Now sync info is written by rank0 to the output file.
3. Improve mpirun_prof: 1) adapt to the environment (SLURM/TORQUE); 2) recognize some noteset-related mpirun options.
This commit was SVN r32772.
CLEANFILES was previously set; we need to use += to add to it.
refs trac:4917
This commit was SVN r32769.
The following Trac tickets were found above:
Ticket 4917 --> https://svn.open-mpi.org/trac/ompi/ticket/4917
What started as a simple ticket ended up reaching the way up to the
MPI Forum.
It turns out that we are supposed to have MPI_SIZEOF for all Fortran
interfaces: mpif.h, the mpi module, and the mpi_f08 module.
It further turns out that to properly support MPI_SIZEOF, your Fortran
compiler *has* to support the INTERFACE keyword and ISO_FORTRAN_ENV. We
can't use "ignore TKR" functionality, because the whole point of
MPI_SIZEOF is that the implementation knows what type was passed to it
("ignore TKR" functionality, by definition, throws that information
away). Hence, we have to have an MPI_SIZEOF interface+implementation
for all intrinsic types, kinds, and ranks.
This commit therefore adds a perl script that generates both the
interfaces and implementations for MPI_SIZEOF in each of mpif.h, the
mpi module, and mpi_f08 module (yay consolidation!).
The perl script uses the results of some new configure tests:
* check if the Fortran compiler supports the INTERFACE keyword
* check if the Fortran compiler supports ISO_FORTRAN_ENV
* find the max array rank (i.e., dimension) that the compiler supports
If the Fortran compiler supports both INTERFACE and ISO_FORTRAN_ENV,
then we'll build the MPI_SIZEOF interfaces. If not, we'll skip
MPI_SIZEOF in mpif.h and the mpi module. Note that we won't build the
mpi_f08 module -- to include the MPI_SIZEOF interfaces -- if the
Fortran compiler doesn't support INTERFACE, ISO_FORTRAN_ENV, and a
whole bunch of other modern Fortran stuff.
Since MPI_SIZEOF interfaces are now generated by the perl script, this
commit also removes all the old MPI_SIZEOF implementations (which were
laden with a zillion #if blocks).
cmr=v1.8.3
This commit was SVN r32764.
reviewed by miked
cmr=v1.8.3:reviewer=ompi-rm1.8
This commit was SVN r32753.
The following SVN revision numbers were found above:
r32735 --> open-mpi/ompi@5fecf65daf
As noted in the comments of these files, they aren't used. Instead,
the Fortran interfaces for WTICK/WTIME just BIND(C) invoke the
back-end C functions (yay BIND(C)!). Hence, there's no need to keep
these old wrapper files around any more.
cmr=v1.8.3
This commit was SVN r32751.
reviewed by miked
cmr=v1.8.3:reviewer=ompi-rm1.8
This commit was SVN r32740.
The following SVN revision numbers were found above:
r32735 --> open-mpi/ompi@5fecf65daf
Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup.
This commit was SVN r32738.
sharedfp functionality is being used. Return an error, however, if no
sharedfp component is selected and the application calls a
file_read/write_shared function.
This commit was SVN r32718.
should not be stored on the file handle anyway, since it is not a property of
the file.
- protect a realloc for zero byte scenarios.
This commit was SVN r32678.
number of bytes written and read. The status now contains the actual number of
bytes written for individual operations. For collective operations, this is
unfortunately not possible.
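From the application side the change is visible through MPI_Get_count on the
returned status, e.g. (fragment; fh, buf and n come from the surrounding code):

    MPI_Status status;
    int count;

    MPI_File_write(fh, buf, n, MPI_INT, &status);
    MPI_Get_count(&status, MPI_INT, &count);   /* number of MPI_INT elements
                                                  actually written */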
This commit was SVN r32674.
when CHECK_AND_RECYCLE detects an error, a message is displayed;
if the error occurs on an intrinsic communicator, the program is
aborted (instead of trying to free the communicator)
cmr=v1.8.3:reviewer=hjelmn
This commit was SVN r32659.
r32622 was the first half of the fix -- we need the PMPI variants as well.
Refs trac:4882
This commit was SVN r32627.
The following SVN revision numbers were found above:
r32622 --> open-mpi/ompi@cf0f734a98
The following Trac tickets were found above:
Ticket 4882 --> https://svn.open-mpi.org/trac/ompi/ticket/4882
Thanks to Lisandro Dalcin for identifying the problem.
Fixes trac:4876
Submitted by George Bosilca, reviewed by Jeff Squyres.
cmr=v1.8.3:reviewer=ompi-rm1.8
This commit was SVN r32615.
The following Trac tickets were found above:
Ticket 4876 --> https://svn.open-mpi.org/trac/ompi/ticket/4876
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “pmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
In core library portions of the configury (e.g., top-level
configure.ac itself), we were calling AC_CHECK_LIB and
OPAL_CHECK_FUNC_LIB to check for various libraries.
'''SIDENOTE:''' It turns out that modern Autoconf has AC_SEARCH_LIBS,
which does just about exactly what OPAL_CHECK_FUNC_LIB does. So this
commit effectively replaces OPAL_CHECK_FUNC_LIB with AC_SEARCH_LIBS.
However, we never bothered to add these found libraries to the wrapper
compiler list of libraries used for static linking (doh!). We've been
getting lucky for quite a while that components were adding the same
libraries to their wrapper compiler LIBS list.
This is problematic, however, if we don't build some of these
components. For example, Paul Hargrove noticed that if he configured
with --disable-shared --enable-static --disable-io-romio, ROMIO was no
longer adding some libraries to the wrapper LIBS list -- libraries
that just happened to also be needed by core OPAL/ORTE/OMPI layers.
The solution is not to use AC_CHECK_LIB or OPAL_CHECK_FUNC_LIB, but
use a pair of new macros:
* OPAL_SEARCH_LIBS_CORE: a wrapper around AC_SEARCH_LIBS. If we add
something to $LIBS, then also add it to the wrapper list of static
libraries. This is the main piece of functionality that was
wrong/missing.
* OPAL_SEARCH_LIBS_COMPONENT: similar to OPAL_SEARCH_LIBS_CORE, but
instead of directly adding it to the wrapper list of static
libraries, add it to <framework>_<component>_LIBS (which eventually
gets slurped up into the wrapper list of static libraries. See the
lengthy comment in config/opal_setup_wrappers.m4 near the beginning
of OPAL_SETUP_WRAPPER_INIT() for a more detailed explanation).
Most components did this correctly already, but one or two weren't
right, so I implemented this second macro quite similar to the
first and put it everywhere we already used AC_SEARCH_LIBS or
OPAL_CHECK_FUNC_LIB.
This needs to soak for a day or two on the trunk before moving to the
v1.8 branch.
Refs trac:4834
cmr=v1.8.2:reviewer=ggouaillardet
This commit was SVN r32447.
The following Trac tickets were found above:
Ticket 4834 --> https://svn.open-mpi.org/trac/ompi/ticket/4834
also replace the OMPI_CAST_RTE_NAME macro with
an inline function if OPAL_ENABLE_DEBUG is set, so we can
get warnings from the compiler if an ampersand is missing.
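The shape of the change is roughly (sketch, simplified from the real header):

    #if OPAL_ENABLE_DEBUG
    /* An inline function type-checks its argument, so passing a name by
     * value (e.g. forgetting the '&') now draws a compiler warning. */
    static inline orte_process_name_t *OMPI_CAST_RTE_NAME(opal_process_name_t *name)
    {
        return (orte_process_name_t *) name;
    }
    #else
    #define OMPI_CAST_RTE_NAME(a) ((orte_process_name_t *) (a))
    #endif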
Thanks to Paul Hargrove for reporting the bugs
This commit was SVN r32408.
This fixes some duplicate symbols, once the .o files for the modules
were restored into the library (some compilers need the .o files, some
don't (!)).
Also, remove trailing whitespace. :-)
This commit was SVN r32386.
communication library should use to initialize itself.
Ralph will champion this change back with an RFC if there is a realistic
need/use case from the community.
This commit was SVN r32361.
The following SVN revision numbers were found above:
r32355 --> open-mpi/ompi@c903917f47
The only user of this code was coll/sm. I implemented a basic replacement
for the removed code. This gets the trunk compiling again with
--disable-dlopen.
This commit was SVN r32333.
common/ofacm is only used by the iboffload code in ompi. This code does
not currently work so it is safe to ignore these components until it is
fixed.
This commit was SVN r32331.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communication are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes.
UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with a few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
Let's not make the move to OPAL any harder than it has to be; this
commit can wait until after the BTL move.
This commit was SVN r32316.
The following SVN revision numbers were found above:
r32315 --> open-mpi/ompi@7b7ed8ed97
CMR'ing just to (try to) keep the differences between trunk and v1.8
branch (somewhat) small.
Reviewed by Dave Goodell
cmr=v1.8.3:reviewer=ompi-rm1.8
This commit was SVN r32315.
Previously, we were only checking connectivity upon first ''send'' to
a peer. But this ignores the case where the first communication to a
peer is actually an ACK -- i.e., we successfully received something
from the peer and we need to send an ACK back. So we need to verify
that the ACK will actually get there.
Specifically, certain asymmetric routing cases can lead to a hang if
we don't check the connectivity in both directions. E.g., if the
sender is able to get traffic to the receiver, but the receiver is
unable to get traffic back to the sender because it made a different
routing decision than the sender.
In this case, the connectivity checker from the sender could succeed
(because the connectivity checker will ACK along the same path in
which the ping was received), but sending a BTL ACK could fail
(because the BTL ACK will be sent back along the path chosen by the
graph algorithm, which, in an erroneous asymmetric routing scenario,
may be different/wrong).
Hence, we want to trigger the connectivity checker at the first
communication from A->B, which may either be a BTL send or an ACK.
Reviewed by Dave Goodell.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32309.
Ensure that target directories exists before creating symlinks.
cmr=v1.8.2:reviewer=jsquyres
Thanks to Jeff for stepping up as a reviewer.
This commit was SVN r32305.
- Portals4/OSC was unable to acquire an exclusive lock due to an invalid
local address in the atomic operation. This caused the reported hang.
- After fixing the hang, the test continued to fail because
ompi_datatype_is_contiguous_memory_layout() reports that MPI_EMPTY (the
origin datatype) is noncontiguous and Portals4/OSC does not support
noncontiguous datatypes at this time. However, in this case the origin
count is zero so contiguous/noncontiguous is irrelevant. Now we skip
the contiguous check if the count is zero.
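The added check amounts to something like (sketch, simplified):

    /* A zero-count origin transfers no data, so its layout is irrelevant;
     * only reject noncontiguous datatypes when there is actually data. */
    if (origin_count > 0 &&
        !ompi_datatype_is_contiguous_memory_layout(origin_dt, origin_count)) {
        return OMPI_ERR_NOT_SUPPORTED;
    }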
cmr=v1.8.3:reviewer=regrant:subject=Fix for "Portals4/MTL hangs in c_get_accumulate test"
This commit was SVN r32295.
The following Trac tickets were found above:
Ticket 4662 --> https://svn.open-mpi.org/trac/ompi/ticket/4662