Make sure to surround directory variables with quotes so that they
function properly, even if there's spaces in the directory name.
While Open MPI doesn't generally support directory names with spaces,
this fix at least allows `autogen.pl` to complete successfully.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 8f13c3b587)
In multi-threaded scenarios, any thread that attempts to read a CQ
when there's a pending error CQ entry gets an -FI_EAVAIL. Without
any serialization here (which is okay, since libfabric will protect
access to critical CQ objects), all threads proceed to read from the
error CQ, but only one thread fetches the entry while others get
-FI_EAGAIN indicating an empty queue, which is not erroneous.
Signed-off-by: Raghu Raja <craghun@amazon.com>
(cherry picked from commit 415dddb9af)
related to #7968
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 04f853d53d657b14f055c4d87a031829015a6929)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 2a606fbde3)
Also fix a probably harmless but definitely wrong missing comment
character in the copyright header.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Add the 4.0.5 release notes to the NEWS file and clean up some
of the formatting in recent releases to match previous updates.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 5d42272605)
Reword the error message to suggest only the Fortran INTEGER size
can be changed via adhoc compiler flags.
This is a one-off commit for the release branches, master does
it differently (and breaks ABI).
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
In the below type creation sequence
MPI_Type_create_resized(MPI_INT, 0, 6, &mydt1);
MPI_Type_contiguous(1, mydt1, &mydt2);
I think both mydt1 and mydt2 should have extent 6.
The Type_create_resized would add an UB marker into the type map,
and the definition of Type_contiguous would maintain the same
markers in the new map.
The only counter argument I can think of to the above is if
we declared that mydt1 is illegal because it's putting data
on addresses that don't satisfy the alignment requirement.
But in my interpretation of the standard the term "alignment
requirement" is a property of the system memory, and MPI defines
"extent" in a way to make it easy to create MPI datatypes that
support the system's alignment requirements. But the standard
isn't saying it's illegal to make MPI datatypes that don't satisfy
the system's alignment requirements. I think this is true also
because the MPI datatypes might be used in file IO where the
requirements are different, so that's my long winded explanation
for why I don't think we can declare mydt1 illegal.
Complete example:
#include <stdio.h>
#include <mpi.h>
int main() {
MPI_Datatype mydt1, mydt2;
MPI_Aint lb, ext;
MPI_Init(0, 0);
MPI_Type_create_resized(MPI_INT, 0, 6, &mydt1);
MPI_Type_commit(&mydt1);
MPI_Type_contiguous(1, mydt1, &mydt2);
MPI_Type_commit(&mydt2);
MPI_Type_get_extent(mydt1, &lb, &ext);
printf("mydt1 extent %d\n", (int)ext);
MPI_Type_get_extent(mydt2, &lb, &ext);
printf("mydt2 extent %d\n", (int)ext);
MPI_Type_free(&mydt1);
MPI_Type_free(&mydt2);
MPI_Finalize();
return(0);
}
% mpicc -o x test.c
% mpirun -np 1 ./x
Without this PR the output is
> mydt1 extent 6
> mydt2 extent 8
With this PR both extents are 6.
Fwiw I also tested with mpich and they give 6 for both extents.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit bca3c0ed17)
configure was previously failing to check for the fi_info.nic struct because
fabric.h relied on other header files in the ofi/include dir. This adds that
include to CPPFLAGS before running that check so that configure can check for
the struct.
Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
(cherry picked from commit 08e8205fb7)
Made a couple vars static if they didn't look like they were used
more than one place, and added prefixes to a few.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit 34b42b1b09)
This patch fix a race condition, which caused the main thread
to sometimes write to a closed pipe.
The following are details:
Currently, during shut down, the main thread will do the following:
1. set listen_thread_action to false.
2. write to stop_thread pipe to tell the listener thread to exit.
The listener thread do the following:
1. call select() to listen to a set of file descriptors with
a maximum wait time.
2. check listen_thread_action. If it is false, close the stop_thread
pipe.
The main thread will write to closed pipe, when
1. listener's call to select() finished because maximum wait time reached.
2. main thread set listen_thread_action to false
3. listener thread check listen_thread_action and closed the pipe
4. main thread write to the closed pipe.
This patch address the issue by having the main thread close the pipe
after the listener thread has been joined. This way, main thread
will both write and close the thread, so there is no conflict.
Note This patch was opened directly against v4.1.x branch because
the orte/mca/oob/tcp directory has been removed from master branch.
Signed-off-by: Wei Zhang <wzam@amazon.com>
This is a meta commit, that encapsulate all the ADAPT commits in the master
into a single PR for 4.1. The master commits included here are:
fe73586, a4be3bb, d712645, c2970a3, e59bde9, ee592f3 and c98e387.
Here is a detailed list of added capabilities:
* coll/adapt: Fix naming conventions and C11 atomic use
* coll/adapt: Remove unused component field in module
* Consistent handling of zero counts in the MPI API.
* Correctly handle non-blocking collectives tags
* As it is possible to have multiple outstanding non-blocking collectives
provided by different collective modules, we need a consistent
mechanism to allow them to select unique tags for each instance of a
collective.
* Add support for fallback to previous coll module on non-commutative operations (#30)
* Replace mutexes by atomic operations.
* Use the correct nbc request type (for both ibcast and ireduce)
* coll/base: document type casts in ompi_coll_base_retain_*
* add module-wide topology cache
* use standard instead of synchronous send and add mca parameter to control mode of initial send in ireduce/ibcast
* reduce number of memory allocations
* call the default request completion.
* Remove the requests from the Fortran lookup conversion tables before completing
and free it.
* piggybacking Bull functionalities
Signed-off-by: Xi Luo <xluo12@vols.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Marc Sergent <marc.sergent@atos.net>
Co-authored-by: Joseph Schuchart <schuchart@hlrs.de>
Co-authored-by: Lemarinier, Pierre <pierre.lemarinier@atos.net>
Co-authored-by: pierrele <31764860+pierrele@users.noreply.github.com>
The gather and scatter operations did not use the correct message size
(Only did datatype size * com size). This did not correctly reflect the
total message size and prevents fine tuning within a com size. This
patch multiplies the value by the number of elements sent.
Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 50823fe9a9)
improve configury to check whether icc is handling no long double.
This prevents seeing 100s of messages like this:
icc: command line warning #10148: option '-Wno-long-double' not supported
A similar patch will be needed for pmix.
Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
(cherry picked from commit 6df0e53421)
Reduce scatter block and reduce scatter algorithms were hitting
correctness issues for non commutative strided tests. We will revert to
the original default algorithms for those two collectives (basic linear
and non overlapping respectively) in the non commutative op case.
See #8010
Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 57b95bcb45)
the ofi mtl mrecv was not properly setting the message in/out
arg to MPI_MRECV to MPI_MESSAGE_NULL.
Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
(cherry picked from commit e6f81ed6d6)
The ofi_rxm provider is dependent upon the underlying hardware for its
implementation of FI_DELIVERY_COMPLETE. Since this can lead to early
completions, we disable the provider to avoid correctness issues.
This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.
Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 41acfee2bb)
EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric
versions. While FI_DELIVERY_COMPLETE would be advertised by the
provider, completions would return too early by not accounting for
bounce buffers on the receive side. This would cause the BTL
to receive early completions that lead to correctness issues.
This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.
Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit a7dcfd9874)
Alter the test to validate misaligned data.
Fixes#7954.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit b6d71aa893)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
The btl/ofi does not currently utilize the common ofi include/exclude
list. Added verification code similar to the mtl/ofi that will check if
the info object is in the include or exclude list. If it isn't in the
include list or is in the exclude list, validate_info will return
OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint
when calling getinfo, instead filtering the provider during
validate_info.
This patch also moves the is_in_list MTL function into common code and
adds additional debugging output to the BTL to match the MTL standard.
Signed-off-by: William Zhang <wilzhang@amazon.com>
(cherry picked from commit 9b8f463a76)