1
1

396 Коммитов

Автор SHA1 Сообщение Дата
Joseph Schuchart
634f67b216
Merge pull request #7843 from devreal/clang-tidy-free
Some fixups for issues detected by clang-tidy
2020-06-25 17:30:04 +02:00
Joseph Schuchart
d9b11b29cd Properly free memory in case of error in mca_common_ompio_prepare_to_group
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-19 12:31:14 +02:00
Joseph Schuchart
ed1ca1a84b Don't free memory escaping mca_common_ompio_prepare_to_group
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2020-06-19 12:30:38 +02:00
Edgar Gabriel
4a8a330bba common/ompio: use avg. file view size in the aggregator selection logic
This is a fix  based on a bugreport on github/mailing list from CGNS.
The core of the problem was that different processes entered different branches of
our aggregator selection logic, due to the fact that in some cases processes had
a matching file_view size and contiguous chunk size (thus assuming 1-D distribution),
and some processes did not (thus assuming 2-D distribution). The fix is to calculate
the avg. file view size across all processes and use this value, thus ensuring that
all processes enter the same branch.

Fixes issue #7809

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2020-06-15 09:17:44 -05:00
Nathan Hjelm
160ff188b8
Merge pull request #7169 from hjelmn/fix_what_wg21_calls_our_problem_not_theirs_seriously__in_some_ways_they_are_correct_but_wtf
configure: use -iquote for non-system include paths
2020-03-30 09:22:54 -07:00
Noah Evans
ee3517427e Add threads framework
Add a framework to support different types of threading models including
user space thread packages such as Qthreads and argobot:

https://github.com/pmodels/argobots

https://github.com/Qthreads/qthreads

The default threading model is pthreads.  Alternate thread models are
specificed at configure time using the --with-threads=X option.

The framework is static.  The theading model to use is selected at
Open MPI configure/build time.

mca/threads: implement Argobots threading layer

config: fix thread configury

- Add double quotations
- Change Argobot to Argobots
config: implement Argobots check

If the poll time is too long, MPI hangs.

This quick fix just sets it to 0, but it is not good for the
Pthreads version. Need to find a good way to abstract it.

Note that even 1 (= 1 millisecond) causes disastrous performance
degradation.

rework threads MCA framework configury

It now works more like the ompi/mca/rte configury,
modulo some edge items that are special for threading package
linking, etc.

qthreads module
some argobots cleanup

Signed-off-by: Noah Evans <noah.evans@gmail.com>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-03-27 10:15:45 -06:00
Gilles Gouaillardet
69bc2e8372 misc: fix <> vs "" includes throught the ompi codebase
This commit fixes an issue with the include usage in some
ompi source files. These source files are using the <> form
of include when the "" form is correct (as these are internal,
**not** system headers).

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-03-09 21:13:49 -04:00
Austen Lauria
4f6978466d
Merge pull request #7284 from bosilca/fix/monitoring_registration
Minor cleanup in the monitoring PML.
2020-01-27 13:01:30 -05:00
Charles Shereda
cbc6feaab2 Created opal_gethostname() as safer gethostname substitute.
The opal_gethostname() function provides a more robust mechanism
to retrieve the hostname than gethostname(), which can return
results that are not null-terminated, and which can vary in its
behavior from system to system.

opal_gethostname() just returns the value in opal_process_info.nodename;
this is populated in opal_init_gethostname() inside opal_init.c.

-Changed all gethostname calls in opal subtree to opal_gethostname
-Changed all gethostname calls in orte subtree to opal_gethostname
-Changed all gethostname calls in ompi subdir to opal_gethostname
-Changed all gethostname calls in oshmem subdir to opal_gethostname
-Changed opal_if.c in test subdir to use opal_gethostname
-Changed opal_init.c to include opal_init_gethostname. This function
 returns an int and directly sets opal_process_info.nodename per
 jsquyres' modifications.

Relates to open-mpi#6801

Signed-off-by: Charles Shereda <cpshereda@lanl.gov>
2020-01-13 08:52:17 -08:00
George Bosilca
05093f9cb1
Minor cleanup in the monitoring PML.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-01-13 09:24:00 -05:00
XuanWang1982
b1dc58eeb2 First version for GPFS module. To be tested
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-11-25 09:01:38 -06:00
Edgar Gabriel
ad5d0df4e9 common/ompio: fix calculation in simple-grouping option
This is based on a bug reported on the mailing list using a netcdf testcase.
The problem occurs if processes are using a custom file view, but on some
of them it appears as if the default file view is being used. Because of that,
the simple-grouping option lead to different number of aggregators used on different
processes, and ultimately to a deadlock. This patch fixes the problem by not using
the file_view size anymore for the calculation in the simple-grouping option,
but the contiguous chunk size (which is identical on all processes).

Fixes issue #7109

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-10-29 12:30:41 -05:00
Edgar Gabriel
a130f569df comomn_ompio_file_read/write: fix 2GB limiting issue
individual read/write operations exceeding 2GB fail in ompio
due to improper conversions from size_t to int in two different
locations. This commit fixes an issue reported by Richard Warren
from the HDF5 group.

Fixes Issue #397

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-10-05 09:50:02 -05:00
George Bosilca
2930bd9d21 Whitespace cleanup
No code or logic changes.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-08-14 11:06:47 -04:00
Harald Klimach
e222a04ae5 Suggestion to fix division by zero in file view.
In common_ompi_aggregators calc_cost routine:
do not cast the real division to an int intermediately.
This patch removes the obsolete int variable c and assigns
the result of the P_a/P_x division directly to n_as.

With the intermediate int c variable, n_as gets 0 if P_a < P_x,
resulting in a division by 0 when computing n_s.

Signed-off-by: Harald Klimach <harald.klimach@uni-siegen.de>
2019-06-13 18:47:32 +02:00
Edgar Gabriel
8eda9f2ecd common/ompio: fix coverty warnings
this commmit fixes coverty warnings CID 1445198 and CID 1445197
For a reason that is a bit unclear to me, coverty only complained about the read
files, but the write operations had the same issue, so I fixed that within the
same commit as well.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-23 13:40:39 -05:00
Edgar Gabriel
27b2ec71a7 common/ompio: add support for read operations and collective I/O
external32 data representation is now support by ompio for everything
but non-blocking collective I/O operations. The support can further be improved
in a second step to limit the temporary buffer size (at least for blocking operations),
but it does work now for many scenarios.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 17:56:16 -05:00
Edgar Gabriel
ab56e6f0db common/ompio: make individual read operations work.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 17:22:33 -05:00
Edgar Gabriel
f6b3a0af52 common/ompio: individual write of external32 works
both blocking and non-blocking. collective write and read operations not yet.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 16:26:14 -05:00
Edgar Gabriel
d955753cb8 common/ompio: abstraction for different convertor types
introduce separate convertors for memory vs. file representation. Adjust the interfaces for decode_datatype to provide the convertor to be used for that.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 13:35:38 -05:00
Edgar Gabriel
35be18b266 common/ompio: rename ompio_cuda* to ompio_buffer*
the infrastructure put in place to manage cuda buffers is actually
a lot more generic than just for cuda buffers. Specifically, we ca
reuse much of the code to implement the external32 data representation.
This commit converts the code from common_ompio_cuda* to
common_ompio_buffer*. There are just very few places where we actually need to keep the OPAL_CUDA_SUPPORT ifdef in place.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 12:50:04 -05:00
Edgar Gabriel
a96efb7620 common/ompio: add comm_ompio_read_all/write_all functions
in preparation for adding support for the external32 data
representation.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-05-20 12:49:36 -05:00
Edgar Gabriel
d43427fc76 common/ompio: refactor the build_io_array function
abstract out the io_array structure to be used in common_ompio_build_io_array function.
This is preparation for a future component that would like to use the same function,
but not modify the io_array stored on the file handle itself.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-04-17 14:42:33 -05:00
George Bosilca
e42b573cd3
Fix the PVAR allocation usage.
According to the MPI standard the obj_handle is a pointer to an MPI
object, and therefore cannot be MPI_COMM_WORLD. The MPI standard example
14.6 highlight this usage.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-02-02 19:03:43 -05:00
Edgar Gabriel
c0f8ce0fff common/ompio: fix a floating point division problem
This commit fixes  a problem reported on the mailing list with
individual writes larger than 512 MB.

The culprit is a floating point division of two large, close values.
Changing the datatypes from float to double (which is what is being
used in the fcoll components) fixes the problem.

See issue #6285 and

 https://forum.hdfgroup.org/t/cannot-write-more-than-512-mb-in-1d/5118

Thanks for Axel Huebl and René Widera for reporting the issue.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-01-21 17:59:12 -06:00
René Widera
a91fab80a1 common/ompio: possible rounding issue
Similar to #6286 rounding number of bytes into a single precision floating point value to round up the result of a division is a potential risk due to rounding errors.

- remove floating point operations for `round up`
- removes floating point conversion for round down (native behavior of integer division)

Signed-off-by: René Widera <r.widera@hzdr.de>
2019-01-18 14:05:23 +01:00
Edgar Gabriel
bf058ca6b0 common/ompio: check datatypes when setting file view
return MPI_ERR_ARG if the size of the fileview is not a
multiple of the size of the etype provided.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-10-11 14:43:32 -05:00
Edgar Gabriel
05d25383c2 common/ompio: return correct error code for improper access
return MPI_ERR_ACCESS if the user tries to read from  a file
that was opened using MPI_MODE_WRONLY

return MPI_ERR_READ_ONLY if the user tries to write a file
that was opened using MPI_MODE_RDONLY

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-10-11 14:41:58 -05:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Jeff Squyres
5be0ba0247 common/monitoring: fix include files
Move includes to top of file.  Set some #defines so that
monitoring_prof.c compiles without warning (as identified by gcc 8 on
MacOS).  Also ensure to include the internal Open MPI "mpi.h" file
(not some random system <mpi.h> file).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-15 06:04:13 -07:00
Nathan Hjelm
000f9eed4d opal: add types for atomic variables
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:48:55 -06:00
Edgar Gabriel
e6a344ba63
Merge pull request #5561 from edgargabriel/pr/file_open_sharedfp_ordering
common/ompio: fix an ordering problem during file_open
2018-08-20 10:18:14 -05:00
Edgar Gabriel
2742273ee3 common/ompio: fix an ordering problem during file_open
the sharedfp component has to be selected and opened before
we set the default file view during file_open. Otherwise
there is a sperious error message from the sharefp_file_seek
operation that is called during the file_set_view.

Fixes Issue #5560

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-08-20 09:28:29 -05:00
Gaëtan Bossu
ccc96efc2e DDN's Infinite Memory Engine support for OMPIO
Changes made:
 - Create a new fs component for IME
 - Create a new fbtl component for IME
 - Modify the close function of OMPIO to finalize IME if necessary

Signed-off-by: Gaëtan Bossu <gbossu@ddn.com>
Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
2018-08-16 11:45:47 +02:00
Gaëtan Bossu
8522ba112c MCA/IO/OMPIO: fix MPI_File_delete implementation.
OMPIO now uses the correct delete function depending on the fs

mca_common_ompio_file_delete now works this way instead
of calling POSIX unlink:
 - create a minimal file handle with the given file name
 - select the best fs component using this file handle
 - call the component-specific file delete function

Signed-off-by: Gaëtan Bossu <gbossu@ddn.com>
2018-07-17 18:17:13 +02:00
Edgar Gabriel
fd8c5fba4e common/ompio: fix the fview based grouping options
a bug sneaked into constructing the list of aggregators
processes when using the fileview based grouping options

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-22 14:01:31 -05:00
Edgar Gabriel
743e0dff5a common/ompio: fix zero size fview issue
handle the situation where the user requests a non-zero amount
of data but has a zero-size fileview. My instrinct would have been
to return an error code, but according to the test that I used
it should be MPI_SUCCESS and zero bytes. It is definitely better
than segfaulting :-)

THis makes another test from the IBM testsuite pass.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-21 17:02:13 -05:00
Edgar Gabriel
7808379a47 common/ompio: incorporate George's comments
incorporate a couple of comments by George as part of the
review on github.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-21 09:29:49 -05:00
Edgar Gabriel
3c10ed4ed1 common/ompio: use allocator to manage temporary buffers
use an allocator to manage temporary buffers when copying
unmanaged data from GPU buffer to host. This is necessary,
since the buffers have to be pinned for better performance,
which is an expensive operation.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-21 09:25:50 -05:00
Edgar Gabriel
6a532101aa io/ompio and common/ompio: add initial support for cuda buffers in ompio
this commit adds the initial support for cuda buffers in ompio, for blocking
and non-blocking individual read and write operations.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-21 09:25:50 -05:00
Edgar Gabriel
df4431bd48 io/ompio: add support for some info objects
add support for the info objects cb_buffer_size and collective_buffering.
Also, introduce a new mca parameter that allows to give feedback
on whether an info object is recognized (and honored).

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-19 19:34:36 -05:00
Edgar Gabriel
bc0f60dfd9 sharedfp/all components: revamp internal operations
this commit revamps the internal operations of the sharedfp components.
Specifically, it is focused around removing the second file_open
operation for shared file pointers. This makes the code more efficient.
Because of that, there is no necessity anymore for the sharedfp_lazy_open
mca parameter.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-18 14:34:05 -05:00
Gilles Gouaillardet
cd45c7abb6 ompio: misc renames
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-06-14 09:41:10 +09:00
Gilles Gouaillardet
36b35ae0db ompio: fix abstraction
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-06-14 09:41:10 +09:00
Edgar Gabriel
14bd114973 common/ompio: return error code from file_delete operation in file_close
in case the user opened a file using the DELETE_ON_CLOSE flag,
return the error code generated in the delete operation.

Note, that this is however just a partial fix to the e_close_1 test
from the ibm testsuite, since the object destructor that triggers
the file_close function does not have a mechanism right now to recognize
and return an error code.

Signed-off-by: Edgar Gabriel <gabriel@cs.uh.edu>
2018-06-07 19:30:14 -05:00
Edgar Gabriel
8feb497dbe io/ompio: cleanup the aggregator selection logic
and some internal structure elements/components. Along the way,
add support for the cb_nodes Info object.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-07 16:47:10 -05:00
Edgar Gabriel
529d882ff0 io/ompio and common/ompio: relocate ompio_request code to common
since the request code is now being accessed also from the vulcan fcoll
component, the request code was relocated into the common/ompio
directory to avoid ld load problems.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-06-07 16:13:12 -05:00
Edgar Gabriel
6b03cee7f1 io/ompio: erroneous condition in selecting aggregator selection logic
fix the logic in the decision which aggregator selection algorithm
to use.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-05-24 15:52:19 -05:00
Ninad Prabhukhanolkar
1518d7e003 Updated aggregate_profile.pl
The files array was also storing $phase.prof. This was leading to
$phase.prof's output getting dumped into itself again and again. Updated
code to initialise files array with files other than $phase.prof.

Signed-off-by: Ninad Prabhukhanolkar <ninadchess96@gmail.com>
2018-04-26 20:34:24 +05:30
Edgar Gabriel
c4879ec29f io/ompio: don't reset amode if MODE_SEQUENTIAL is set
the ompio module resets the amode from WRONLY to RDWR in order
to accoomodate data sieving in the two-phase fcoll componet. This
leads however to an error if MPI_MODE_SEQUENTIAL has been requested
by the user, since MODE_SEQUENTIAL is incompatible with MODE_RDWR.
SInce the change to the amode was done after opening the file for
individual file pointers but before opening the file for shared filepointers,
this lead to an error message in the sharedfp component.

Note, that data sieving is never necessary if MODE_SEQUENTIAL is set,
so this should not be a problem for any scenario.

Fixes #4991

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-03-30 07:56:47 -05:00