1. Allow fallback to a lesser level of AVX support during make.
Some distros restrict the compile architecture during make (while not
setting any restrictions during configure), so we need to detect the
target architecture during make as well in order to restrict the code
we generate.
2. Add comments and better protect the arch-specific code.
Identify all the vector functions used and classify them according to
the necessary hardware capabilities.
Use these requirements to protect the code for loads and stores (the
rest of the code is automatically generated, so it is more difficult to
protect).
3. Correctly check for AVX* support.
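As an illustration of item 2, the sketch below shows one way load and store code might be guarded by the hardware capabilities the compiler was actually allowed to target. It is not the op/avx code: the function name `sketch_sum_int32` and the `SKETCH_HAVE_*` macros are hypothetical, and the guards assume the compiler-defined `__AVX2__` and `__SSE2__` feature macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick the widest instruction set the compiler is allowed to emit.
 * __AVX2__ and __SSE2__ are predefined by GCC and Clang when the
 * corresponding -m flags (or a suitable -march) are in effect. */
#if defined(__AVX2__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX2 1
#elif defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for 32-bit integers. */
static void sketch_sum_int32(const int32_t *a, const int32_t *b,
                             int32_t *out, size_t count)
{
    size_t i = 0;
    assert(count == 0 || (a != NULL && b != NULL && out != NULL));

#if defined(SKETCH_HAVE_AVX2)
    /* 256-bit unaligned loads and stores, 8 integers per iteration. */
    for (; i + 8 <= count; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
#elif defined(SKETCH_HAVE_SSE2)
    /* 128-bit unaligned loads and stores, 4 integers per iteration. */
    for (; i + 4 <= count; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
#endif

    /* Scalar tail, and the complete fallback when no vector ISA is enabled. */
    for (; i < count; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Because the guards depend only on what the compiler reports, the same source builds to a lesser code path when make is run with a more restrictive architecture than configure saw.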
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
1. Consistent march flag order between configure and make.
2. op/avx: give the option to skip some tests
It is possible to skip some intrinsic tests by setting the following environment variables to "no" before invoking configure:
- ompi_cv_op_avx_check_avx512
- ompi_cv_op_avx_check_avx2
- ompi_cv_op_avx_check_avx
- ompi_cv_op_avx_check_sse41
- ompi_cv_op_avx_check_sse3
3. op/avx: update AVX512 flags
try
-mavx512f -mavx512bw -mavx512vl -mavx512dq
instead of
-march=skylake-avx512
since the former is less likely to conflict with user provided CFLAGS
(e.g. -march=...)
Thanks to Bart Oldeman for pointing this out.
4. op/avx: have the op/avx library depend on libmpi.so
Refs. open-mpi/ompi#8323
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Currently, mca_btl_ofi_put (and likewise get, aop, afop, acswp) allocates
a mca_btl_ofi_rdma_completion_t object and uses it as the context
for fi_write/fi_read/fi_atomic/fi_fetch_atomic/fi_compare_atomic.
In the normal code path, this completion object is released when the
completion entry is processed. However, when an error occurs while calling
fi_write/fi_read/fi_atomic/fi_fetch_atomic/fi_compare_atomic,
there will be no completion entry from libfabric, and in this case the
completion object's memory is leaked.
This patch addresses the issue by calling opal_free_list_return() in
the error handling code path.
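The pattern can be sketched as follows. This is a minimal illustration and not the btl/ofi code: every `sketch_*` name is hypothetical, the free list is a toy stand-in for opal_free_list_t, and `sketch_post` merely simulates a libfabric call that either succeeds or fails immediately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are Open MPI or libfabric APIs. */
typedef struct sketch_completion {
    struct sketch_completion *next;
} sketch_completion_t;

typedef struct {
    sketch_completion_t *head;
    size_t available;
} sketch_free_list_t;

static void sketch_list_return(sketch_free_list_t *list, sketch_completion_t *item)
{
    item->next = list->head;
    list->head = item;
    list->available++;
}

static sketch_completion_t *sketch_list_get(sketch_free_list_t *list)
{
    sketch_completion_t *item = list->head;
    if (item != NULL) {
        list->head = item->next;
        list->available--;
    }
    return item;
}

static void sketch_list_init(sketch_free_list_t *list,
                             sketch_completion_t *pool, size_t n)
{
    list->head = NULL;
    list->available = 0;
    for (size_t i = 0; i < n; ++i) {
        sketch_list_return(list, &pool[i]);
    }
}

/* Simulates posting an operation: 0 on success, negative on failure. */
static int sketch_post(int should_fail, void *context)
{
    (void)context;
    return should_fail ? -1 : 0;
}

static int sketch_put(sketch_free_list_t *list, int should_fail)
{
    assert(list != NULL);
    sketch_completion_t *comp = sketch_list_get(list);
    if (comp == NULL) {
        return -2; /* out of completion objects */
    }

    int rc = sketch_post(should_fail, comp);
    if (rc != 0) {
        /* The post failed, so no completion entry will ever be generated.
         * Return the object here; otherwise it would be leaked. */
        sketch_list_return(list, comp);
        return rc;
    }

    /* On success the completion handler releases the object later. */
    return 0;
}
```

After a failed `sketch_put`, the number of available objects is unchanged, while a successful call keeps one object outstanding until its completion is processed.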
Signed-off-by: Wei Zhang <wzam@amazon.com>
Remove the now unused MCA parameter, get rid of an unnecessary if-else
branch, and move setting the flag outside of the while loop.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
It is, however, restricted to collective I/O operations, at this point
only from vulcan and dynamic_gen2. This required adding some more
infrastructure to recognize individual I/O and multi-threaded environments.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
The lack of data sieving has been identified as a main reason for the poor performance in some instances on the Lustre file system. This commit introduces the fundamental ability to perform data sieving for read operations (which should not be controversial). The code itself is correct; what is still lacking is a) the logic for when and how to activate data sieving and b) the logic to limit the size of the temporary buffer when doing data sieving.
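For readers unfamiliar with the technique, the following is a minimal sketch of data sieving for reads, not the code in this commit. It assumes the requested pieces are sorted by offset and do not overlap, it uses an in-memory array as a stand-in for the file, and all `sketch_*` names are hypothetical. Like the commit, it makes no attempt to limit the size of the temporary buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One requested piece of the file (hypothetical type). */
typedef struct {
    size_t offset;
    size_t length;
} sketch_piece_t;

/* In-memory stand-in for a file that counts read requests. */
typedef struct {
    const unsigned char *data;
    size_t reads;
} sketch_file_t;

/* Stand-in for a single contiguous file read. */
static void sketch_read(sketch_file_t *file, size_t offset, size_t length,
                        unsigned char *dst)
{
    memcpy(dst, file->data + offset, length);
    file->reads++;
}

/* Read all pieces with one large read covering the whole extent, holes
 * included, then copy only the requested bytes into the user buffer.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
static int sketch_sieve_read(sketch_file_t *file, const sketch_piece_t *pieces,
                             size_t n, unsigned char *out)
{
    assert(file != NULL);
    if (n == 0) {
        return 0;
    }

    size_t start = pieces[0].offset;
    size_t end = pieces[n - 1].offset + pieces[n - 1].length;
    if (end == start) {
        return 0; /* nothing to read */
    }

    unsigned char *tmp = malloc(end - start);
    if (tmp == NULL) {
        return -1;
    }

    /* One request to the file system instead of n small ones. */
    sketch_read(file, start, end - start, tmp);

    size_t pos = 0;
    for (size_t i = 0; i < n; ++i) {
        memcpy(out + pos, tmp + (pieces[i].offset - start), pieces[i].length);
        pos += pieces[i].length;
    }

    free(tmp);
    return 0;
}
```

The trade-off is that bytes in the holes between pieces are read and discarded, which is why logic for deciding when to activate sieving and for bounding the temporary buffer matters.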
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
The dynamic_gen_file_write_all component distinguishes between the amount of data communicated
to aggregators and the amount of data written in a cycle by the aggregator (in contrast to, e.g., the vulcan component).
There was a bug in calculating which chunks have to be written in a cycle by an aggregator: we added elements to the
io_array until we had filled one stripe. Unfortunately, the metric used was the amount of data instead of ensuring that all offsets
fall within a single stripe. This commit fixes this issue. Note that the bug did not create a correctness problem, just a performance
problem in case there were gaps in the file view.
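The difference between the two metrics can be sketched as follows. This is an illustration based on the description above, not the component's code: the `sketch_*` names are hypothetical, entries are assumed to be sorted by offset, and each entry is assumed to have already been split so that it does not straddle a stripe boundary.

```c
#include <assert.h>
#include <stddef.h>

/* One element of a hypothetical io_array. */
typedef struct {
    size_t offset;
    size_t length;
} sketch_io_entry_t;

/* Volume-based metric: take entries until their accumulated length
 * would exceed one stripe, regardless of where they sit in the file. */
static size_t sketch_select_by_bytes(const sketch_io_entry_t *e, size_t n,
                                     size_t stripe)
{
    size_t bytes = 0;
    size_t i = 0;
    while (i < n && bytes + e[i].length <= stripe) {
        bytes += e[i].length;
        i++;
    }
    return i;
}

/* Offset-based metric: take only entries that lie within the stripe
 * containing the first entry. */
static size_t sketch_select_by_stripe(const sketch_io_entry_t *e, size_t n,
                                      size_t stripe)
{
    assert(stripe > 0);
    if (n == 0) {
        return 0;
    }
    size_t stripe_end = (e[0].offset / stripe + 1) * stripe;
    size_t i = 0;
    while (i < n && e[i].offset + e[i].length <= stripe_end) {
        i++;
    }
    return i;
}
```

With a stripe size of 100 and four 20-byte entries at offsets 0, 50, 120 and 180, the volume-based metric selects all four entries (80 bytes in total) even though they span two stripes, while the offset-based metric selects only the first two.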
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
A code path used in oversubscribed cases caused a help message to be
output for each process. This replaces the help message with debug output to
prevent excessive output unless the user enables debug output.
Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
Thanks to FX Coudert for reporting this issue and pointing
to a solution.
Refs. open-mpi/ompi#8218
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This has been shown to be more effective in achieving overlap
of inter- and intra-node communication and reduces the initial
delay before hitting the network.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Also make coll/tuned the default for shared memory communication
as coll/sm has shown performance issues that need investigation.
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>