This fixes a bug reported in-house occuring with this component. It is triggered if the data assigned to different aggregators is highly differing, leading to different number of internal iterations required to handle it.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
--disable-io-ompio is a shortcut that disable the following
frameworks and components
- fbtl
- fcoll
- sharedfp
- common/ompio
- io/ompio
Fixesopen-mpi/ompi#1934
- move the sort_iovec operations to fcoll/base
- move set_view_internal to common/ompio
- move set_file_default to common/ompio
- remove io_ompio_sort, not used anymore.
the coll_array functions are truly only used by the fcoll modules, so move
them to fcoll/base. There is currently one exception to that rule (number of aggreagtors
logic), but that function will be moved in a long term also to fcoll/base.
Fix CID 72362: Explicit null dereferenced (FORWARD_NULL)
From what I can tell the code @ fcoll_static_file_read_all.c:649
should be setting bytes_per_process[i] to 0 not bytes_per_process.
Fix CID 72361: Explicit null dereferenced (FORWARD_NULL)
Modified check to check for blocklen_per_process non-NULL before
trying to free blocklen_per_process[l]. This is sufficient because
free (NULL) is safe. Also cleaned up the initialization of this an a
couple other arrays. They were allocated with malloc() then
initialized to 0. Changed to used calloc().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 72296: Resource leak (RESOURCE_LEAK):
Changed code to goto exit instead of returning to ensure memory is
freed.
Fix CID 712589: Out-of-bounds read (OVERRUN):
In this loop i and j are identical and always less than
iov_count. The CID was triggered because i was incremented if i was <
iov_count. This meant that if the loop did go on the next iteration
would access an invalid index.
Fix CID 741363: Uninitialized scalar variable (UNINIT):
Allocate tmp_len with calloc to insure every index is initialized.
Fix CID 741364: Uninitialized pointer read (UNINIT):
Allocate recv_types with calloc to ensure all indices are always
initialized. Also added a check to not loop and destroy if recv_types
is NULL.
Also added a NULL check on the allocation of decoded iov. This is not
the cause of CID 126784 but should be fixed.
Fix CID 712588: Out-of-bounds read (OVERRUN):
Similar to CID 712589. Should silence the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fixes CID 72320: Explicit NULL dereferenced
On error it is possible that the blocklen_per_process array is
NULL. Change the NULL check before the free to check for non-NULL on
the array not the array element. Also clean up allocation of this
array to use calloc instead of malloc + setting each element to NULL.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fixes CIDs 72300, 72344, 1196764-1196768, 72300: Resource leaks
Mulitple allocated arrays are going out of scope at the end of
mca_fcoll_two_phase_file_write_all. Free these arrays. Also removed
the extraneous NULL checks since free (NULL) is safe in C.
Change returns to goto exit where the allocated resources are freed.
Fixes CIDs 72285-72292, 72297, 72298: Resource leaks
Change all appropriate return statements to goto exit to ensure that
all resources are freed. Also removed the NULL checks since free
(NULL) is safe in C.
Fixes CIDs 72295, 72296: Resource leaks
Moved free of requests and recv_types to after exit label. This will
ensure these are freed on error.
Also added a loop and statement to free send_buf which is going out of
scope at the end of the function.
Fixes CIDs 72336-72240, 735197, 735198: Resource leaks
Moved the exit label before to before the resources are released and
changed all appropriate return statements to goto exit. Also removed
extraneous NULL checks because free (NULL) is safe in C.
Fixes CIDs 72341, 72343, 1196805-1196809: Resource leaks
Free all resources after exit label and change return statements to
goto exit to ensure all resources are freed on error.
Fixes CID 1269973: Unused value
Check return code of ompi_request_wait_all. If it fails jump to the
exit.
Fixes CID 714119: Dereference before NULL check
Wrong value checked in conditional.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
After long debugging, I found last week the reason this optimization originally broke
some hdf5 tests. We now pass the hdf5 test suite with the optimization being actively used.
Specifically:
- reduce the number of realloc's and malloc's by moving
some arrays out of the cycle loop, if we know that there
size is not changing
- store the rank of the aggregator in a separate variable to avoid
continuous dereferencing
- change the wait_all logic in write_all to use a fix number of requests
(even if they are MPI_REQUEST_NULL)
- fix the timing to considere the two initial allgather and the one
allgatherv operation to be a part of it
- add more comments.
- make the internal structure follow the Open MPI naming convention
- provide a single flag/macro which controls the compilation/utilization of this
feature, to avoid that somebody using this has to modify every single
fcoll component. A configure option could be added later if desired.