Commit graph

6152 commits

Author SHA1 Message Date
Todd Kordenbrock
3498bed650 Merge pull request #1555 from shawone/check_reduce_ret
coll-portals4: check return value from reduce kary tree functions
2016-05-03 10:17:23 -05:00
Jeff Squyres
33dd8ca81e osc_rdma_peer: properly include ompi_config.h
Thanks to Paul Hargrove for reporting.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-03 07:39:55 -07:00
Devendar Bureddy
cafd55f18c HCOLL: fix hang in hcoll barrier called from finalize for MXM/yalla
tear down

HCOLL barrier may not complete if HCOLL progress is not called periodically,
which is the case during HCOLL teardown in finalize.
(cherry picked from commit 793244d75dd94d1d5e0243bcccf6d04318750f3f)
2016-05-03 00:49:57 +03:00
Nathan Hjelm
d3d779f6d9 osc/rdma: clear all_sync object when obtaining a lock
This commit fixes a bad synchronization detection bug that occurs when
mixing MPI_Win_fence() and MPI_Win_lock(). If no communication has
occurred in the fence epoch it is safe to just clear the all_sync
object (it was set up by fence).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 15:28:47 -06:00
Jeff Squyres
265e5b9795 Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1
ompi/opal/orte/oshmem/test: max hostname length cleanup
2016-05-02 09:44:18 -04:00
Ralph Castain
6ac7929bd0 Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Nathan Hjelm
7bda3eb2dc osc/rdma: fix global index array calculation
This commit fixes a bug that occurs when ranks are either not mapped
evenly or by something other than core.

Fixes open-mpi/ompi#1599

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-28 19:11:11 -06:00
Nathan Hjelm
f0f3383006 Merge pull request #1590 from hjelmn/thread_multiple
osc/pt2pt: do not drop/reacquire the ompi_request_lock
2016-04-26 16:48:37 -06:00
Nathan Hjelm
34ff6293bd osc/pt2pt: do not drop/reacquire the ompi_request_lock
This lock is now recursive so it is safe to call into the pml without
dropping the lock.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-26 14:19:38 -06:00
George Bosilca
bf190671e9 Make the request lock recursive.
If during the request completion callback we post another request that
completes right away (such as a small send or a match for an unexpected
short message) we will try to complete the second request while holding
the lock for the completion of the first. For performance reasons
(mainly to avoid unlocking and locking the request mutex several times)
we have made the request lock recursive.
2016-04-26 16:16:07 -04:00
Nathan Hjelm
c16e639b2f Merge pull request #1563 from hjelmn/ompi_coverity
ompi coverity fixes
2016-04-26 09:17:48 -06:00
Karol Mroz
3322347da9 ompi: fixup hostname max length usage
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-04-25 07:08:23 +02:00
Nathan Hjelm
ae0ffbb67f Merge pull request #1397 from hjelmn/enable_thread_multiple
ompi: always enable MPI_THREAD_MULTIPLE support
2016-04-23 08:40:22 -06:00
Nathan Hjelm
1ff3d3b16b pml/ob1: fix coverity issue
Fix CID 1357978 (1 of 1): Logically dead code (DEADCODE):

Remove duplicate check for NULL == endpoint.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 14:48:13 -06:00
Nathan Hjelm
70533e6d50 fcoll/static: fix coverity issues
Fix CID 72362: Explicit null dereferenced (FORWARD_NULL)

From what I can tell the code @ fcoll_static_file_read_all.c:649
should be setting bytes_per_process[i] to 0 not bytes_per_process.

Fix CID 72361: Explicit null dereferenced (FORWARD_NULL)

Modified check to check for blocklen_per_process non-NULL before
trying to free blocklen_per_process[l]. This is sufficient because
free (NULL) is safe. Also cleaned up the initialization of this and a
couple of other arrays. They were allocated with malloc() then
initialized to 0. Changed to use calloc().

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 14:48:13 -06:00
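The two cleanups named in commit 70533e6d50 (zero-initialised allocation with calloc() and relying on free(NULL) being safe) can be sketched as follows. The helper names `alloc_counters` and `free_blocks` are made up for this example and do not appear in the fcoll/static code.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate `count` per-process counters, all initialised to zero.
 * This replaces a malloc() followed by an explicit zeroing loop. */
static size_t *alloc_counters(size_t count)
{
    assert(count > 0);
    return calloc(count, sizeof(size_t));
}

/* Free an array of per-process blocks. Only the outer pointer needs a
 * NULL check: free(NULL) is a no-op, so unset entries are harmless. */
static void free_blocks(int **blocks, size_t count)
{
    if (NULL == blocks) {
        return;
    }
    for (size_t i = 0; i < count; i++) {
        free(blocks[i]);
    }
    free(blocks);
}
```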
Nathan Hjelm
8871bdb2f8 fcoll/two_phase: fix coverity issues
Fix CID 72296: Resource leak (RESOURCE_LEAK):

Changed code to goto exit instead of returning to ensure memory is
freed.

Fix CID 712589: Out-of-bounds read (OVERRUN):

In this loop i and j are identical and always less than
iov_count. The CID was triggered because i was incremented if i was <
iov_count. This meant that if the loop did go on the next iteration
would access an invalid index.

Fix CID 741363: Uninitialized scalar variable (UNINIT):

Allocate tmp_len with calloc to ensure every index is initialized.

Fix CID 741364: Uninitialized pointer read (UNINIT):

Allocate recv_types with calloc to ensure all indices are always
initialized. Also added a check to not loop and destroy if recv_types
is NULL.

Also added a NULL check on the allocation of decoded iov. This is not
the cause of CID 126784 but should be fixed.

Fix CID 712588: Out-of-bounds read (OVERRUN):

Similar to CID 712589. Should silence the issue.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 14:47:41 -06:00
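The "goto exit instead of returning" fix for the resource leak in commit 8871bdb2f8 follows a common C cleanup pattern. The sketch below is an assumption-laden illustration: `build_tables` and its `fail_midway` switch are hypothetical, and only the variable names `tmp_len` and `recv_types` are borrowed from the commit message.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 0 on success and -1 on failure. Every path, including the
 * error paths, leaves through the exit label so nothing is leaked.
 * `fail_midway` is a hypothetical switch that simulates a late error. */
static int build_tables(size_t n, int fail_midway)
{
    int rc = -1;
    int *tmp_len = NULL;
    int *recv_types = NULL;

    assert(n > 0);

    tmp_len = calloc(n, sizeof(int));
    if (NULL == tmp_len) {
        goto exit;
    }

    recv_types = calloc(n, sizeof(int));
    if (NULL == recv_types) {
        goto exit;
    }

    if (fail_midway) {
        goto exit;          /* an early `return -1` here would leak both arrays */
    }

    rc = 0;

exit:
    free(recv_types);       /* free(NULL) is safe if allocation never happened */
    free(tmp_len);
    return rc;
}
```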
Valentin Petrov
21f1c572c0 Adds mapping to hcoll complex dte 2016-04-19 14:14:28 +03:00
Nicolas Chevalier
c86d4035d2 coll-portals4: check return value from reduce kary tree functions 2016-04-18 12:02:30 +00:00
Nathan Hjelm
3245428e82 Merge pull request #1535 from kawashima-fj/pr/osc-pt2pt-header-fix
osc/pt2pt: Fix a struct name typo
2016-04-14 15:55:25 -06:00
Nathan Hjelm
330302c4b4 Merge pull request #1534 from kawashima-fj/pr/parallel-rma-fix
osc/pt2pt: Fix tag conflicts on parallel RMA communications
2016-04-14 15:13:32 -06:00
Jeff Squyres
fdf33674b3 Merge pull request #1532 from kmroz/wip-hindexed-cleanup-1
romio,java: cleanup deprecated hindexed call
2016-04-14 17:07:31 -04:00
KAWASHIMA Takahiro
35ea9e5c3c Add FUJITSU copyright 2016-04-12 13:47:53 +09:00
KAWASHIMA Takahiro
39bcbe439a osc/pt2pt: Fix a struct name typo
Fortunately the sizes of `ompi_osc_pt2pt_header_put_t` and
`ompi_osc_pt2pt_header_get_t` are the same, so this doesn't affect
the behavior.
2016-04-11 20:55:22 +09:00
KAWASHIMA Takahiro
28a0577364 osc/pt2pt: Insert breaks in long lines 2016-04-11 19:06:01 +09:00
KAWASHIMA Takahiro
5ac95df9dc osc/pt2pt: use two distinct "namespaces" for tags - revised
Before this commit, the same PML tag may be used for distinct
communications for long messages. For example, consider a condition
where rank A calls ```MPI_PUT``` targeting rank B and rank B calls
```MPI_GET``` targeting rank A simultaneously.
A PML tag for the ```MPI_PUT``` is acquired on rank A and is used
for the long-message communication from rank A to rank B.
A PML tag for the ```MPI_GET``` is acquired on rank B and is used
for the long-message communication from rank A to rank B.
These two tags may end up with the same value because they are managed
independently on each rank. This will cause data corruption.

This commit separates the tag used in a single RMA communication
call into two tags, one for communication from an origin to a target and one
for communication from a target to an origin. A "base" tag
is acquired using the ```get_tag``` function and the PML tag is calculated
from the base tag by the ```tag_to_target``` and ```tag_to_origin```
functions.
2016-04-11 19:05:20 +09:00
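The two tag "namespaces" from commit 5ac95df9dc can be sketched as below. The function names `get_tag`, `tag_to_target` and `tag_to_origin` come from the commit message, but the bodies, the even/odd encoding and the 16-bit `TAG_MASK` are assumptions made for this illustration and are not the actual osc/pt2pt code.

```c
#include <assert.h>

#define TAG_MASK 0xffff          /* assumed 16-bit tag space */

static int tag_counter;          /* hypothetical per-rank counter */

/* Acquire a base tag. Base tags are always even in this sketch, which
 * leaves the low bit free to encode the direction of the transfer. */
static int get_tag(void)
{
    tag_counter = (tag_counter + 2) & TAG_MASK;
    return tag_counter;
}

/* PML tag for data flowing from the origin to the target. */
static int tag_to_target(int base)
{
    assert(0 == (base & 1));
    return base;
}

/* PML tag for data flowing from the target back to the origin. */
static int tag_to_origin(int base)
{
    assert(0 == (base & 1));
    return base + 1;
}
```

In the scenario from the commit message, a put from rank A to rank B uses `tag_to_target(base)` for its A-to-B traffic, while a get issued by rank B against rank A uses `tag_to_origin(base)` for its A-to-B traffic. Even if both ranks independently pick the same base value, the two PML tags differ.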
KAWASHIMA Takahiro
3576ecafa7 Revert "osc/pt2pt: use two distinct "namespaces" for tags"
This reverts commit 06ecdb6aa7
to reimplement the fix completely.
2016-04-11 19:04:11 +09:00
Karol Mroz
5c54184986 romio: replace deprecated hindexed call
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-04-10 19:56:22 +02:00
Nathan Hjelm
c6b19818be bml: always enable the bml
This commit ensures the bml is always enabled whether or not it will
be used. This ensures that any available btls communicate their modex
so that they can be used for one-sided communication.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-08 21:14:17 -06:00
George Bosilca
896f857fc4 Thanks @hjelmn for catching the typo. 2016-04-07 13:56:26 -04:00
Thananon Patinyasakdikul
92290b94e0 Fixed Coverity reports 1358014-1358018 (DEADCODE and CHECK_RETURN) 2016-04-07 12:52:17 -04:00
Ryan Grant
7cdf50533c Merge pull request #1314 from francois-wellenreiter/osc_disable_portals4_evt_send
OSC portals4 : do not generate an EVENT_SEND to avoid having to filter it
2016-04-07 10:04:27 -06:00
George Bosilca
004c0cc05b Fix issues identified by @derbeyn. 2016-03-29 15:50:32 -04:00
Jeff Squyres
91c54d7a07 Merge pull request #1491 from ICLDisco/progress_thread
BTL TCP async progress
2016-03-29 06:26:10 -04:00
George Bosilca
f69eba1bc4 Update the copyright and cleanup the code.
Per @jsquyres suggestion remove all trailing spaces.
Credit to `sed -i.bak 's/ *$//' */[ch]`.
2016-03-28 14:41:01 -04:00
Thananon Patinyasakdikul
92062492b9 Enable Threading in the BTL TCP
Added mca parameter to turn progress thread on/off
Add a flag to check if we have btl progress thread.
Added macro for ob1 matching lock.
Update the AUTHORS file.
2016-03-28 14:41:01 -04:00
Nathan Hjelm
9d5eeecb8a pml/ob1: detect unreachable errors
This commit adds code to detect when procs are unreachable when using
the dynamic add_procs functionality.

Fixes #1501

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-28 10:52:40 -06:00
George Bosilca
57eadb0dd6 Fix for Coverity CID 1357152.
Or at least that was the origin of the issue. It turns out
we were freeing the wrong buffer (but as it only happens in the
case of an error we never noticed).
2016-03-24 00:53:30 -04:00
George Bosilca
4b38b6bd0c Fix multiple issues with the collective requests.
This patch addresses most (if not all) @derbeyn concerns
expressed on #1015. I added checks for the requests allocation
in all functions, ompi_coll_base_free_reqs is called with the
right number of requests, I removed the unnecessary basic_module_comm_t
and use the base_module_comm_t instead, I removed all uses of the
COLL_BASE_BCAST_USE_BLOCKING define, and other minor fixes.
2016-03-23 18:35:41 -04:00
Todd Kordenbrock
2122a15217 Merge pull request #1443 from francois-wellenreiter/fix_trig_rndv
MTL portals4 : fix around triggered rndv operations
2016-03-21 08:16:33 -05:00
Ralph Castain
c146c4969b Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
Fixes #1467

Restore debugger attach operations

Fixes #1225
2016-03-18 21:49:04 -07:00
Nathan Hjelm
075dfa4121 topo/treematch: fix component coverity issues
Fix CID 1315298: Resource leak (RESOURCE_LEAK) :
Fix CID 1315300: Resource leak (RESOURCE_LEAK):
Fix CID 1315299: Resource leak (RESOURCE_LEAK):
Fix CID 1315297 (#1 of 1): Resource leak (RESOURCE_LEAK):

Confirmed leaks in error paths. Added the leaked arrays to the
ERR_EXIT macro to ensure they are freed.

Fix CID 1315296 (#1 of 1): Resource leak (RESOURCE_LEAK):

Confirmed leak in error paths. Both the oversub and reqs arrays are
leaked. Free these arrays on error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 11:31:11 -06:00
Nathan Hjelm
3540b65f7d bcol: fix coverity issues
Fix CID 1269976 (#1 of 1): Unused value (UNUSED_VALUE):
Fix CID 1269979 (#1 of 1): Unused value (UNUSED_VALUE):

Removed unused variables k_temp1 and k_temp2.

Fix CID 1269981 (#1 of 1): Unused value (UNUSED_VALUE):
Fix CID 1269974 (#1 of 1): Unused value (UNUSED_VALUE):

Removed gotos and use the matched flags to decide whether to return.

Fix CID 715755 (#1 of 1): Dereference null return value (NULL_RETURNS):

This was also a leak. The items on cs->ctl_structures are allocated using OBJ_NEW so they must be released using OBJ_RELEASE not OBJ_DESTRUCT. Replaced the loop with OPAL_LIST_DESTRUCT().

Fix CID 715776 (#1 of 1): Dereference before null check (REVERSE_INULL):

Rework error path to remove REVERSE_INULL. Also added a free to an error path where it was missing.

Fix CID 1196603 (#1 of 1): Bad bit shift operation (BAD_SHIFT):
Fix CID 1196601 (#1 of 1): Bad bit shift operation (BAD_SHIFT):

Both of these are false positives but it is still worthwhile to fix so they no longer appear. The loop conditional has been updated to use radix_mask_pow instead of radix_mask to quiet these issues.

Fix CID 1269804 (#1 of 1): Argument cannot be negative (NEGATIVE_RETURNS):

In general close (-1) is safe but coverity doesn’t like it. Reworked the error path for open to not try to close (-1).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 10:59:46 -06:00
Nathan Hjelm
c8b077f232 coll/ml: fix coverity issues
Fix CID 715744 (#1 of 1): Logically dead code (DEADCODE):
Fix CID 715745 (#1 of 1): Logically dead code (DEADCODE):

The free of scratch_num in either place is defensive programming. Instead of removing the free the conditional around the free has been removed to quiet the warning.

Fix CID 715753 (#1 of 1): Dereference after null check (FORWARD_NULL):
Fix CID 715778 (#1 of 1): Dereference before null check (REVERSE_INULL):

Fixed the conditional to check for collective_alg != NULL instead of collective_alg->functions != NULL.

Fix CID 715749 (#1 of 4): Explicit null dereferenced (FORWARD_NULL):

Updated code to ensure that none of the parse functions are reached with a non-NULL value.

Fix CID 715746 (#1 of 1): Logically dead code (DEADCODE):

Removed dead code.

Fix CID 715768 (#1 of 1): Resource leak (RESOURCE_LEAK):
Fix CID 715769 (#2 of 2): Resource leak (RESOURCE_LEAK):
Fix CID 715772 (#1 of 1): Resource leak (RESOURCE_LEAK):

Move free calls to before error checks to cleanup leak in error paths.

Fix CID 741334 (#1 of 1): Explicit null dereferenced (FORWARD_NULL):

Added a check to ensure temp is not dereferenced if it is NULL.

Fix CID 1196605 (#1 of 1): Bad bit shift operation (BAD_SHIFT):

Fixed overflow in calculation by replacing int mask with 1ul.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 10:11:16 -06:00
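The BAD_SHIFT fix at the end of commit c8b077f232 (replacing an int mask with `1ul`) can be illustrated with a minimal sketch. The helper name `bit_mask` is hypothetical. The point is that the type of the left operand decides the width of the shift: `1 << bit` is evaluated as int, while `1ul << bit` is evaluated as unsigned long.

```c
#include <assert.h>
#include <limits.h>

/* Return a mask with only bit `bit` set. Using 1ul makes the shift
 * happen in unsigned long rather than int, so bit 31 (and higher bits
 * where unsigned long is wider) no longer overflows a signed int. */
static unsigned long bit_mask(unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    return 1ul << bit;
}
```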
Nathan Hjelm
2f4e5325aa coll/base: fix coverity issues
Fix CID 1325868 (#1 of 1): Dereference after null check (FORWARD_NULL):
Fix CID 1325869 (#1-2 of 2): Dereference after null check (FORWARD_NULL):

Here reqs can indeed be NULL. Added a check to
ompi_coll_base_free_reqs to prevent dereferencing NULL pointer.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 09:31:43 -06:00
Nathan Hjelm
2ed4501490 osc: fix coverity issues
Fix CID 1324726 (#1 of 1): Free of address-of expression (BAD_FREE):

Indeed, if a lock conflicts with the lock_all we will end up trying to
free an invalid pointer.

Fix CID 1328826 (#1 of 1): Dereference after null check (FORWARD_NULL):

This was intentional but it would be a good idea to check for
module->comm being non_NULL to be safe. Also cleaned out some checks
for NULL before free().

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 09:11:48 -06:00
Nathan Hjelm
ec9712050b Merge pull request #1118 from hjelmn/mpool_rewrite
mpool/rcache rewrite
2016-03-15 10:46:24 -06:00
Nathan Hjelm
deae9e52bf Merge pull request #1259 from kawashima-fj/pr/osc-sm-align
osc/sm: Fix a bus error on MPI_WIN_{POST,START}.
2016-03-15 09:13:38 -06:00
Francois WELLENREITER
2bc432d95f MTL portals4 : fix around triggered rndv operations 2016-03-15 15:31:04 +01:00
Nathan Hjelm
d4afb16f5a opal: rework mpool and rcache frameworks
This commit rewrites both the mpool and rcache frameworks. Summary of
changes:

 - Before this change a significant portion of the rcache
   functionality lived in mpool components. This meant that it was
   impossible to add a new memory pool to use with rdma networks
   (ugni, openib, etc) without duplicating the functionality of an
   existing mpool component. All the registration functionality has
   been removed from the mpool and placed in the rcache framework.

 - All registration cache mpools components (udreg, grdma, gpusm,
   rgpusm) have been changed to rcache components. rcaches are
   allocated and released in the same way mpool components were.

 - It is now valid to pass NULL as the resources argument when
   creating an rcache. At this time the gpusm and rgpusm components
   support this. All other rcache components require non-NULL
   resources.

 - A new mpool component has been added: hugepage. This component
   supports huge page allocations on linux.

 - Memory pools are now allocated using "hints". Each mpool component
   is queried with the hints and returns a priority. The current hints
   supported are NULL (uses posix_memalign/malloc), page_size=x (huge
   page mpool), and mpool=x.

 - The sm mpool has been moved to common/sm. This reflects that the sm
   mpool is specialized and not meant for any general
   allocations. This mpool may be moved back into the mpool framework
   if there is any objection.

 - The opal_free_list_init arguments have been updated. The unused0
   argument is now used to pass in the registration cache module. The
   mpool registration flags are now rcache registration flags.

 - All components have been updated to make use of the new framework
   interfaces.

As this commit makes significant changes to both the mpool and rcache
frameworks both versions have been bumped to 3.0.0.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-14 10:50:41 -06:00
Gilles Gouaillardet
fbed6df4a3 coll/base: fix a typo
typo was introduced in open-mpi/ompi@c98e97a46e
2016-03-11 14:18:03 +09:00
Aurélien Bouteiller
c98e97a46e Do not return MPI_ERR_PENDING from collectives. 2016-03-09 16:13:34 -05:00
Joshua Ladd
4dffae2f88 Fixing MXM Yalla and MTL add procs behavior. MXM cannot support dynamic add procs, so propagate this info to the MTL and PML layers. 2016-03-08 01:46:24 +02:00
Aurélien Bouteiller
892e1ed57e Fix a potential race condition in which a progress matching thread could match a request while we are cancelling it. 2016-03-01 16:43:45 -05:00
Gilles Gouaillardet
8aff67c399 topo/base: correctly support MPI_UNWEIGHTED in mca_topo_base_dist_graph_neighbors()
Thanks Jun Kudo for the bug report.
2016-03-01 10:28:28 +09:00
George Bosilca
dbe93b0b19 Use mca_bml_base_get_endpoint
Correctly use mca_bml_base_get_endpoint instead of accessing the
endpoint directly.
2016-02-25 11:00:30 -06:00
Sylvain Jeaugey
5f32f49eb8 pml/ob1: Fix segmentation fault on CUDA path.
Fix segfault due to mca_pml_ob1_cuda_need_buffers not handling the case of the
endpoint not being there. Calling mca_bml_get_endpoint() seems to fix the problem.

Fixes open-mpi/ompi#1402
2016-02-24 21:32:25 -08:00
Nathan Hjelm
230d04327e ompi: always enable MPI_THREAD_MULTIPLE support
This commit removes the --with-mpi-thread-multiple option and forces
MPI_THREAD_MULTIPLE support. This cleans up an abstraction violation
in opal where OMPI_ENABLE_THREAD_MULTIPLE determines whether the
opal_using_threads is meaningful. To reduce the performance hit on
MPI_THREAD_SINGLE programs an OPAL_UNLIKELY is used for the
check on opal_using_threads in OPAL_THREAD_* macros.

This commit does not clean up the arguments to the various functions
that take whether multi-threading support is enabled. That should be
done at a later time.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-23 10:02:14 -07:00
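The branch hint mentioned in commit 230d04327e can be sketched as follows. This is not the definition of `OPAL_UNLIKELY` or of the `OPAL_THREAD_*` macros; `UNLIKELY`, `using_threads` and `maybe_lock` are stand-ins invented for the example, and `__builtin_expect` is assumed to be available only on GCC-compatible compilers.

```c
#include <assert.h>
#include <stdbool.h>

/* Tell the compiler the condition is expected to be false, so the
 * single-threaded path stays the fall-through path. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

static bool using_threads;   /* hypothetical stand-in for opal_using_threads */
static int lock_calls;       /* counts how often the locking path was taken */

/* Take the lock only when threads are in use. */
static void maybe_lock(void)
{
    if (UNLIKELY(using_threads)) {
        lock_calls++;        /* a real implementation would lock a mutex here */
    }
}
```

The hint does not change behaviour, only code layout: a MPI_THREAD_SINGLE program still evaluates the check, but the compiler is told that the locking branch is the cold one.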
Edgar Gabriel
45003ef78d fix the data size counter for large ops for the static fcoll component 2016-02-23 08:33:50 -06:00
yohann
59b6d041f8 mtl/ofi: Check allocated pointer. 2016-02-19 16:59:47 -08:00
yohann
bd47062764 mtl/ofi: Fix error handling. 2016-02-19 16:58:41 -08:00
yohann
404987e9b3 mtl/ofi: Fix mismatching types. 2016-02-19 16:57:26 -08:00
yohann
3ad59435ce mtl/ofi: Prevent possible memory leak. 2016-02-19 16:57:02 -08:00
Edgar Gabriel
92d1b99468 optimize the shuffle step:
 1. use communicator collectives if possible for performance reasons
 2. combine multiple allgathers into a single one
2016-02-19 11:04:04 -06:00
Edgar Gabriel
e63836c653 clean up the mca parameter handling of the component. Add new parameters for number of sub groups and write chunk size. This will allow a systematic parameter study to be performed. 2016-02-19 10:15:28 -06:00
Edgar Gabriel
4f400314e0 add the dynamic_gen2 component into the fcoll selection table. 2016-02-19 09:32:54 -06:00
Edgar Gabriel
268d525053 change the tag to be a positive value. handle 0-byte situations correctly. 2016-02-19 08:28:50 -06:00
Edgar Gabriel
ad79012059 first cut on the version which overlaps the communication/computation of 2 iterations. 2016-02-19 08:28:50 -06:00
Ralph Castain
60a7bc2e50 Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
Fixes #1225
2016-02-18 09:29:12 -08:00
yohann
7fe395c82a mtl/ofi: cleanup 2016-02-16 09:57:57 -08:00
yohann
22eddfee10 mtl/ofi: update copyright dates. 2016-02-16 09:56:09 -08:00
George Bosilca
68c36ea9dc Fix two annoying warnings in our UCX support. 2016-02-14 00:02:16 -05:00
yohann
67ce4a080a mtl/ofi: FI_AV_MAP support only. 2016-02-12 10:06:52 -08:00
yohann
b3d8ead76e mtl/ofi: Fix dynamic add_procs. 2016-02-12 10:05:52 -08:00
Gilles Gouaillardet
b55b9e6aee sentinel: fix sentinel to proc_name conversion
converting an opal_process_name_t means the loss of one bit,
it was decided to restrict the local job id to 15 bits, so the
useful information of an opal_process_name_t can fit in 63 bits.
2016-02-10 15:44:07 +09:00
Gilles Gouaillardet
030a5f2054 sentinel: use type uintptr_t for sentinel
MSB is now automatically cleared when right shifting
Thanks George for pointing this out
2016-02-10 11:28:56 +09:00
George Bosilca
7c574a3530 Typo. 2016-02-07 07:22:22 +02:00
Nathan Hjelm
5b9c82a964 osc/pt2pt: bug fixes
This commit fixes several bugs identified by @ggouaillardet and MTT:

 - Fix SEGV in long send completion caused by missing update to the
   request callback data.

 - Add an MPI_Barrier to the fence short-cut. This fixes potential
   semantic issues where messages may be received before fence is
   reached.

 - Ensure fragments are flushed when using request-based RMA. This
   allows MPI_Test/MPI_Wait/etc to work as expected.

 - Restore the tag space back to 16-bits. It was intended that the
   space be expanded to 32-bits but the required change to the
   fragment headers was not committed. The tag space may be expanded
   in a later commit.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-04 16:59:39 -07:00
Gilles Gouaillardet
6eac6a8b00 osc/sm: create datafile into the per proc directory in order to make it unique per communicator
Thanks Peter Wind for the report
2016-02-03 10:12:37 +09:00
Nathan Hjelm
519fffb65e osc/pt2pt: eager sends are always active if MPI_MODE_NOCHECK is used
This commit fixes open-mpi/ompi#1299.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-02 12:44:17 -07:00
Nathan Hjelm
d7264aa613 osc/pt2pt: various threading fixes
This commit fixes several bugs identified by a new multi-threaded RMA
benchmarking suite. The following bugs have been identified and fixed:

 - The code that signaled the actual start of an access epoch changed
   the eager_send_active flag on a synchronization object without
   holding the object's lock. This could cause another thread waiting
   on eager sends to block indefinitely because the entirety of
   ompi_osc_pt2pt_sync_expected could execute between the check of
   eager_send_active and the condition wait of
   ompi_osc_pt2pt_sync_wait.

 - The bookkeeping of fragments could get screwed up when performing
   long put/accumulate operations from different threads. This was
   caused by the fragment flush code at the end of both put and
   accumulate. This code was put in place to avoid sending a large
   number of unexpected messages to a peer. To fix the bookkeeping
   issue we now 1) wait for eager sends to be active before starting
   any large isends, and 2) keep track of the number of large isends
   associated with a fragment. If the number of large isends reaches
   32 the active fragment is flushed.

 - Use atomics to update the large receive/send tag counters. This
   prevents duplicate tags from being used. The tag space has also
   been updated to use the entire 16-bits of the tag space.

These changes should also fix open-mpi/ompi#1299.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-02 12:33:33 -07:00
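The atomic tag counter from the last bullet of commit d7264aa613 can be sketched with C11 atomics. Open MPI uses its own atomic wrappers, so `<stdatomic.h>`, the name `next_tag` and the exact wrap-around rule are assumptions made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Shared counter for large send/receive tags. */
static _Atomic uint32_t tag_counter;

/* Hand out the next tag. atomic_fetch_add is a single read-modify-write,
 * so two threads can never observe the same previous value and therefore
 * never receive the same tag between wrap-arounds. The mask keeps the
 * result inside a 16-bit tag space. */
static uint16_t next_tag(void)
{
    uint32_t prev = atomic_fetch_add(&tag_counter, 1u);
    return (uint16_t) (prev & 0xffffu);
}
```

A plain `tag_counter++` is a separate load and store, which is how two threads could end up with duplicate tags before the fix.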
Edgar Gabriel
3f7fff5780 Merge pull request #1331 from edgargabriel/solaris-statfs-fix
Solaris statfs fix
2016-01-28 20:16:33 -06:00
Nathan Hjelm
a19c265ab5 osc/rdma: fix typo in ompi_osc_rdma_complete_atomic
The typo caused SEGVs on systems with only fetching atomic
support.

Fixes open-mpi/ompi#1329

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-26 15:44:07 -07:00
Edgar Gabriel
b4a725c26a need to check for the parent dir as well, since the file might not exist yet. 2016-01-26 13:49:21 -06:00
Edgar Gabriel
722aab92e6 - extend opal_path_nfs to retrieve the file system type
- use opal_path_nfs in the fs_base function to avoid code duplication.
2016-01-26 13:36:21 -06:00
Joshua Ladd
69e3c6f289 Merge pull request #1321 from jladd-mlnx/topic/add-allgatherv-reduce
Adding entry points for Allgatherv, iAllgatherv, Reduce, and iReduce.
2016-01-25 20:46:52 -05:00
Nathan Hjelm
500e90422d Merge pull request #1320 from hjelmn/osc_rdma_fix
osc/rdma: fix hang when performing large unaligned gets
2016-01-25 09:36:13 -07:00
Nathan Hjelm
45da311473 osc/rdma: fix hang when performing large unaligned gets
This commit adds code to handle large unaligned gets. There are two
possible code paths for these transactions:

 1) The remote region and local region have the same alignment. In
 this case the get will be broken down into at most three get
 transactions: 1 transaction to get the unaligned start of the region
 (buffered), 1 transaction to get the aligned portion of the region,
 and 1 transaction to get the end of the region.

 2) The remote and local regions do not have the same alignment. This
 should be an uncommon case and is not optimized. In this case a
 buffer is allocated and registered locally to hold the aligned data
 from the remote region. There may be cases where this fails (low
 memory, can't register memory). Those conditions are unlikely and
 will be handled later.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-22 21:06:46 -07:00
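Case 1 of commit 45da311473 (remote and local regions share the same alignment) splits a get into at most three transactions. The sketch below only computes the three lengths; `get_split_t` and `split_get` are hypothetical names, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lengths of the three pieces of a get: unaligned start, aligned middle,
 * and the remainder at the end. Any of them may be zero. */
typedef struct {
    size_t head;
    size_t body;
    size_t tail;
} get_split_t;

/* Split a get of `len` bytes starting at `remote_addr`.
 * `align` must be a power of two (an assumption of this sketch). */
static get_split_t split_get(uint64_t remote_addr, size_t len, size_t align)
{
    get_split_t s = {0, 0, 0};
    size_t misalign;
    size_t rest;

    assert(align != 0 && 0 == (align & (align - 1)));

    /* Bytes needed to reach the next aligned address, capped at len. */
    misalign = (size_t) (remote_addr & (align - 1));
    s.head = misalign ? align - misalign : 0;
    if (s.head > len) {
        s.head = len;
    }

    /* The rest starts aligned: whole units go in body, leftovers in tail. */
    rest = len - s.head;
    s.tail = rest & (align - 1);
    s.body = rest - s.tail;
    return s;
}
```

For example, a 100-byte get at address 0x1003 with 8-byte alignment yields a 5-byte head, an 88-byte aligned body and a 7-byte tail.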
Valentin Petrov
5e2a2c0755 Bugfix for coll/hcoll: coll_request must be set to ACTIVE when allocated
If the state of the request is not set to OMPI_REQUEST_ACTIVE
then MPI_Test would immediately signal such request completed
while hcoll may still be working on it.

Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>
2016-01-23 03:23:59 +02:00
Joshua Ladd
e398bf6f3a Adding entry points for Allgatherv, iAllgatherv, Reduce, and iReduce.
Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>
2016-01-23 03:09:29 +02:00
Nathan Hjelm
49d2f44b97 osc/rdma: use correct endpoint for local state
If atomics are not globally visible (cpu and nic atomics do not mix)
then a btl endpoint must be used to access local ranks. To avoid
issues that are caused by having the same region registered with
multiple handles osc/rdma was updated to always use the handle for
rank 0. There was a bug in the update that caused osc/rdma to continue
using the local endpoint for accessing the state even though the
pointer/handle are not valid for that endpoint. This commit fixes the
bug.

Fixes open-mpi/ompi#1241.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-22 10:41:27 -07:00
Nathan Hjelm
6180386bea osc/rdma: disable put aggregation when using threads
Optimizing put aggregation in the presence of threads will require a
redesign of the code. For now just ensure that put aggregation is
turned off when MPI_THREAD_MULTIPLE is enabled.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-21 15:50:35 -07:00
Edgar Gabriel
b253d4e887 fix CID 1349739, CID 1349738, CID 1349736 and (probably) CID 1349740 (not entirely sure about the last one, since I don't understand why block[i] is a problem but max_len[i] allocated and treated exactly the same way 1 line later is not). 2016-01-21 08:32:23 -06:00
Edgar Gabriel
9b8d769e41 will revisit the addproc component later in spring, right now it is constantly in the way of doing my tests. 2016-01-20 15:05:51 -06:00
Francois WELLENREITER
411b7301c3 OSC portals4 : do not generate an EVENT_SEND to avoid having to filter it 2016-01-20 11:47:46 +01:00
Edgar Gabriel
a9ca37059a improve the communication abstraction. This commit also allows all aggregators to work simultaneously, instead of the slightly staggered way of the previous version. 2016-01-17 09:48:49 -06:00
Edgar Gabriel
56e11bfc97 initialize the stripe_size variable as well. 2016-01-17 09:48:49 -06:00
Edgar Gabriel
26c57ef374 separate the size of the buffer used for the shuffle step and the size of the buffer used for a pwritev operation. 2016-01-17 09:48:49 -06:00
Edgar Gabriel
39d5c8c281 further bug fixes silencing a compiler warning and fixing a memory overrun 2016-01-17 09:48:49 -06:00
Edgar Gabriel
2bcae84e11 further debugging 2016-01-17 09:48:49 -06:00
Edgar Gabriel
2bdd6ba17a correctly free some buffers, and ensure that lustre_stripe_size and stripe_count are always read from the file system. 2016-01-17 09:48:49 -06:00
Edgar Gabriel
4bbb22bd0b add a new field to the ompio data structure (stripe_count) and set it correctly on pvfs2 and lustre. 2016-01-17 09:48:49 -06:00
Edgar Gabriel
d282e94b67 add the new dynamic_gen2 component, designed to coexist for now with the original dynamic component 2016-01-17 09:48:49 -06:00
Jeff Squyres
60ffe713b8 common syms: whitelist bison-generated common symbols
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-01-16 03:53:14 -08:00
Joshua Ladd
18c5a21562 Fix typo in error handling flow. 2016-01-14 22:28:54 +02:00
Joshua Ladd
afa62d8ca1 Addressing reviewers' comments for https://github.com/open-mpi/ompi-release/pull/891 2016-01-14 19:22:27 +02:00
Tomislav Janjusic
3858bc8e62 Adding support for dynamic endpoint creation
Signed-off-by: Tomislav Janjusic <tomislavj@mngx-apl-01.mtl.labs.mlnx>
Signed-off-by: Tomislavj Janjusic <tomislavj@mellanox.com>
Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>
2016-01-12 22:17:03 +02:00
Nathan Hjelm
dd4d49cbbb Merge pull request #1278 from ggouaillardet/poc/osc_pt2pt
osc/pt2pt: use two distinct "namespaces" for tags
2016-01-12 09:49:31 -07:00
Edgar Gabriel
0a1b735eed use the actual preadv and pwritev functions if available. That's what the fbtl interfaces have been designed for. 2016-01-07 08:29:17 -06:00
Edgar Gabriel
1b0b849994 remove the MCA parameter setting the number of hosts in PLFS, since the plfs_setxattr function used is causing linking problems with PLFS 2.5
remove unused variables.
2016-01-05 11:13:23 -06:00
Edgar Gabriel
7861a8c357 revise the logic in the fbtl plfs avoiding the memcpy operation 2016-01-05 10:04:46 -06:00
Edgar Gabriel
da309ac962 - use a unique pid for each process as requested by the API
- sync the file before closing it
- use plfs_access() instead of access() before closing the file
2016-01-05 10:04:12 -06:00
KAWASHIMA Takahiro
ad26899110 osc/sm: Fix a bus error on MPI_WIN_{POST,START}.
A bus error occurs in sm OSC under the following conditions.

- sparc64 or any other architectures which need strict alignment.
- `MPI_WIN_POST` or `MPI_WIN_START` is called for a window created
  by sm OSC.
- The communicator size is odd and greater than 3.

Lines 283-285 in the current `ompi/mca/osc/sm/osc_sm_component.c` have
the following code.

```c
module->global_state = (ompi_osc_sm_global_state_t *) (module->segment_base);
module->node_states = (ompi_osc_sm_node_state_t *) (module->global_state + 1);
module->posts[0] = (uint64_t *) (module->node_states + comm_size);
```

The size of `ompi_osc_sm_node_state_t` is a multiple of 4 but not
a multiple of 8. So if `comm_size` is odd, `module->posts[0]` is
not aligned to 8 bytes. This causes a bus error when accessing
`module->posts[i][j]`.

This patch fixes the alignment of `module->posts[0]` by setting
`module->posts[0]` first.
2016-01-05 19:04:53 +09:00
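The alignment arithmetic behind the osc/sm bus error in ad26899110 can be sketched roughly as follows. The sizes (64 and 12 bytes) are hypothetical stand-ins, chosen only so that the node-state size is a multiple of 4 but not of 8; they are not taken from the Open MPI headers. The sketch models only the odd-`comm_size` arithmetic, not the full segment layout.

```c
#include <stddef.h>

/* Hypothetical sizes, NOT taken from Open MPI: the node state is a
 * multiple of 4 bytes but not of 8, as the commit message describes. */
enum { GLOBAL_STATE_SIZE = 64, NODE_STATE_SIZE = 12, POST_ALIGNMENT = 8 };

/* Byte offset of posts[0] when it follows the global state and
 * comm_size node states (the layout quoted in the commit message). */
static size_t posts_offset_after_states(size_t comm_size)
{
    return GLOBAL_STATE_SIZE + NODE_STATE_SIZE * comm_size;
}

/* Nonzero if a uint64_t stored at this offset is 8-byte aligned,
 * assuming the segment base itself is 8-byte aligned. */
static int offset_is_aligned(size_t offset)
{
    return offset % POST_ALIGNMENT == 0;
}

/* Layout in the spirit of the fix: the uint64_t posts area comes first,
 * at offset 0, so its alignment no longer depends on comm_size. */
static size_t posts_offset_posts_first(size_t comm_size)
{
    (void) comm_size; /* offset is independent of the communicator size */
    return 0;
}
```

With these assumed sizes, an odd `comm_size` puts `posts[0]` on a 4-byte boundary, while placing the posts area first keeps it aligned for any size.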
Gilles Gouaillardet
06ecdb6aa7 osc/pt2pt: use two distinct "namespaces" for tags 2016-01-05 16:57:37 +09:00
Gilles Gouaillardet
14fdf75944 fs/pvfs2: fix typo
Thanks Dave Love for reporting this issue.

Fixes #1272
2016-01-03 23:28:35 +09:00
Artem Polyakov
2abb2972ac Fix Mellanox copyrights with respect to the following PRs:
* https://github.com/open-mpi/ompi/pull/1184
* https://github.com/open-mpi/ompi/pull/1188
* https://github.com/open-mpi/ompi/pull/1197
* https://github.com/open-mpi/ompi/pull/1202
* https://github.com/open-mpi/ompi/pull/1210
* https://github.com/open-mpi/ompi/pull/1216
* https://github.com/open-mpi/ompi/pull/1236
* https://github.com/open-mpi/ompi/pull/1237
* https://github.com/open-mpi/ompi/pull/1248
* https://github.com/open-mpi/ompi/pull/1260
* https://github.com/open-mpi/ompi/pull/1264
2015-12-30 00:12:19 +06:00
Ralph Castain
810f2446b7 Add pmix120 component, update the error handling functions in the PMIx API.
Update the configure logic for the new pmix120 component

ckpt

Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates

Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works

Cleanup the rename files to use the pretty macros
2015-12-28 23:15:44 +09:00
Gilles Gouaillardet
fec973efda configury: test portability
replace test ... -o ... with test ... || test ...
and test ... -a ... with test ... && test ...
2015-12-28 13:58:45 +09:00
Gilles Gouaillardet
ccc96ad204 fbtl/base: add missing #include "opal/util/output.h"
Thanks Marco Atzeri for contributing the original patch
2015-12-24 14:41:26 +09:00
Gilles Gouaillardet
cebde2a753 coll/tuned: add missing #include "opal/util/output.h"
Thanks Marco Atzeri for contributing the original patch
2015-12-24 14:41:17 +09:00
Gilles Gouaillardet
ad9693c604 pml/yalla: add missing #include <alloca.h> 2015-12-24 14:33:58 +09:00
Gilles Gouaillardet
b38c17dbcb pml/cm: add missing #include <alloca.h>
Thanks Paul Hargrove for reporting this issue
2015-12-24 14:33:58 +09:00
Gilles Gouaillardet
071ae39a44 osc/rdma: add missing #include <alloca.h> 2015-12-24 14:33:58 +09:00
Gilles Gouaillardet
77f199d1d7 coll/fca: add missing #include <alloca.h> 2015-12-24 14:33:58 +09:00
Todd Kordenbrock
8a3660138e mtl-portals4: initialize endpoint nid/pid when using logical mapping
When mtl-portals4 is configured for logical mapping, coll-portals4
must disqualify because it does not yet support logical mapping.
coll-portals4 looks for the endpoint pid to be zero which tells it
that mtl-portals4 is configured for logical mapping.  This commit
initializes the endpoint nid/pid to zero for logical mapping.
2015-12-22 11:20:18 -06:00
rhc54
aa17bdf6e8 Merge pull request #1239 from rhc54/topic/cleanup
Cleanup the warnings from the ompi layer when compiling optimized under Mac OSX
2015-12-21 07:23:31 -08:00
Edgar Gabriel
46c20a1246 correctly set all variables storing information on the file pointer position to zero when setting the file view 2015-12-21 09:41:39 +09:00
Ralph Castain
ac6289dca6 Cleanup the warnings from the ompi layer when compiling optimized under Mac OSX
Cleanup per George's comments
2015-12-17 17:39:15 -08:00
igor.ivanov@itseez.com
041a6a9f53 ompi/pml: Fix warnings in yalla component 2015-12-16 16:22:30 +02:00
igor.ivanov@itseez.com
38c253c74c ompi/mtl: Fix warnings in mxm component 2015-12-16 16:22:29 +02:00
igor.ivanov@itseez.com
0a9956927a ompi/coll: Fix warnings in fca components
warning: assignment from incompatible pointer type
2015-12-16 16:22:16 +02:00
igor.ivanov@itseez.com
8f45d83d46 ompi/coll: Fix warnings in hcoll component
warning: assignment from incompatible pointer type
2015-12-16 14:52:29 +02:00
Ralph Castain
3a56f0d34b Create the pmix external component. Fix a few places where opal/util/argv.h were required when building with an external pmix (go figure).
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.

Closes #1204  (replaces it)
Fixes #1064
2015-12-15 15:26:13 -08:00
Nathan Hjelm
0de9445fc7 osc/rdma: fix bugs when running more than one process per node
A previous commit updated the one-sided code to register the state
region only once. This created an issue when using the scratch lock
with fetching atomics. In this case on any rank that isn't local rank
0 the module->state_handle is NULL. This commit fixes the issue by
removing the scratch lock and using a fragment pointer instead.

Fixes open-mpi/ompi#1290

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-12-15 11:25:25 -07:00
Nathan Hjelm
b7ba301310 Merge pull request #1165 from hjelmn/add_procs_group
ompi/group: release ompi_proc_t's at group destruction
2015-12-14 13:53:42 -08:00
Nathan Hjelm
9d659465b7 Merge pull request #1210 from artpol84/icbarrier_fix
Fix NBC iBarrier for inter-communicators.
2015-12-14 13:52:38 -08:00
Nathan Hjelm
4b3dac5933 Merge pull request #1216 from artpol84/icgatherv_fix
Fix NBC iGatherv for inter-communicators.
2015-12-14 13:51:58 -08:00
Matias Cabral
7cfd7d50b9 Merge pull request #1219 from matcabral/PSM2_tag_hashing
Support for PSM2 hashing lookup in message queue.
2015-12-14 12:01:55 -08:00
matcabral
9a1f9be146 A new internal feature in PSM2 will use hash tables to
accelerate message queue lookups if the lookups have
the proper tag&mask layout. OpenMPI should follow
PSM2's preferred tag&mask spec, so that PSM2 can provide
a performance benefit.
2015-12-14 10:13:39 -08:00
Artem Polyakov
2d0919dbdc Fix NBC iGatherv for inter-communicators.
We need to use remote size to form a schedule.
2015-12-14 12:19:10 +06:00
Artem Polyakov
fc17deca43 Fix NBC iBarrier for inter-communicators.
Remove the send of the extra message. This bug was triggered by the
MPICH/coll/nbicbarrier test. In this test a series of communicators
are created.
This extra message was received after the original communicator was destroyed
and queued into non_existing_communicator_pending. When a new, completely
unrelated communicator with the same id as the original was created, this message
was pushed into the frags_cant_match queue and caused a sequence number skew and a hang.
2015-12-12 13:27:31 +06:00
Gilles Gouaillardet
3a3b13ea12 coll/base: fix an integer overflow in ompi_coll_base_reduce_generic
Refs open-mpi/ompi#1198
2015-12-11 13:55:59 +09:00
Alina Sklarevich
3ffd8dcd20 PML UCX: fix typo (following 7becc54d). 2015-12-10 13:51:10 +02:00
Nathan Hjelm
dae3746d2f Merge pull request #1190 from kawashima-fj/pr/sm-win-test-fix
osc/sm: Fix a bug that `MPI_WIN_TEST` does not update `flag` to 0
2015-12-08 06:39:16 -07:00
KAWASHIMA Takahiro
9c7b6a4352 osc/sm: Fix a bug that MPI_WIN_TEST does not update flag to 0.
`MPI_WIN_TEST` must update the `flag` parameter to 0 when not all
origin processes have called `MPI_WIN_COMPLETE`, but sm OSC does not.
If the caller initializes the `flag` argument to a non-0 value,
the caller will receive that non-0 `flag` value.
2015-12-08 19:23:21 +09:00
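The `MPI_WIN_TEST` behavior described in 9c7b6a4352 amounts to writing the output flag on every path. A minimal sketch, assuming a hypothetical helper `win_test_sketch` and plain counters in place of the real sm OSC state:

```c
/* Hypothetical test routine: reports whether all origin processes have
 * completed. It writes *flag on every path, so a caller that
 * pre-initialized flag to a non-zero value still sees 0 when the
 * epoch is incomplete. */
static int win_test_sketch(int completed_origins, int expected_origins, int *flag)
{
    if (completed_origins < expected_origins) {
        *flag = 0;   /* the kind of assignment the buggy path omitted */
        return 0;
    }
    *flag = 1;
    return 0;
}
```

A caller that sets `flag = 1` before the call and tests an incomplete epoch would then read back 0 rather than its own stale value.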
Gilles Gouaillardet
59a361b781 ompio: correctly handle zero f_cc_size in mca_io_ompio_simple_grouping 2015-12-08 17:00:11 +09:00
Nathan Hjelm
f68c315188 pml/ob1: add missing ompi_request_wait_completion for buffered sends
This commit adds a call to ompi_request_wait_completion for buffered
sends. Without this line it is possible to get into a state where the
data is never sent.

Fixes open-mpi/ompi#1185

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-12-07 22:28:07 -07:00
Gilles Gouaillardet
bfe8e03d9d fcoll/two_phase: use ompi_mpi_abort instead of PMPI_Abort
Thanks Jeff for the review
2015-12-07 11:34:36 +09:00
Gilles Gouaillardet
37c978f5e9 coll/libnbc: correctly handle changed types.
this fixes open-mpi/ompi@d816d1c194
thanks Jeff for the review
2015-12-07 10:13:43 +09:00
George Bosilca
3a9664ac9d Fix Coverity CIDs 1341584-1341589. 2015-12-06 14:06:36 -05:00
bosilca
8fee96c086 Merge pull request #1091 from bosilca/topic/datatype_span
Cleanup the temporary memory allocation in collectives
2015-12-03 19:25:04 -05:00
Gilles Gouaillardet
a5440ade5f topo/treematch: do not invoke hwloc_topology_{init,load}
* this is not necessary
 * this overwrites existing topology, that could be different if hwloc_base_topo_file is used
2015-12-03 11:24:32 +09:00
George Bosilca
688108cf7f Patch submitted by @ggouaillardet on ticket #1091. 2015-12-02 20:42:18 -05:00
George Bosilca
4d00c59b2e Cleanup the memory handling for temporary buffers in
some of the collective modules. Added a new function
opal_datatype_span, to compute the memory span of
`count` elements of a datatype, excluding the gaps at the
beginning and at the end. If a memory allocation is
made using the returned value, the gap (also returned)
should be removed from the allocated pointer.
2015-12-02 20:42:18 -05:00
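The span computation described in 4d00c59b2e can be illustrated with a simplified model. The struct `dtype_sketch_t` and the function `datatype_span_sketch` below are hypothetical and only mirror the idea in the commit message (span of `count` elements without leading and trailing gaps, plus the gap to subtract from the allocated pointer); they are not the real Open MPI definitions.

```c
#include <stddef.h>

/* Hypothetical, simplified datatype description (not the real
 * opal_datatype_t): lb/ub bound the extent, true_lb/true_ub bound the
 * bytes that actually hold data. */
typedef struct {
    ptrdiff_t lb, ub;
    ptrdiff_t true_lb, true_ub;
} dtype_sketch_t;

/* Bytes needed to hold `count` elements without the leading and
 * trailing gaps; *gap receives the offset of the first data byte. */
static ptrdiff_t datatype_span_sketch(const dtype_sketch_t *t, size_t count,
                                      ptrdiff_t *gap)
{
    if (0 == count) {
        *gap = 0;
        return 0;
    }
    *gap = t->true_lb;
    return (t->true_ub - t->true_lb)
           + (ptrdiff_t) (count - 1) * (t->ub - t->lb);
}
```

Under this model a caller would allocate `span` bytes and pass `allocated - gap` as the buffer address, so that the first data byte lands at the start of the allocation.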
Jeff Squyres
15325c8094 op/x86: change the owner to Ralph
Cisco no longer cares about this component, but Intel might.
Transferring ownership to Ralph.
2015-12-01 15:08:07 -08:00
Nathan Hjelm
5334d22a37 ompi/group: release ompi_proc_t's at group destruction
This commit changes the way ompi_proc_t's are retained/released by
ompi_group_t's. Before this change ompi_proc_t's were retained once
for the group and then once for each retain of a group. This method
adds unnecessary overhead (need to traverse the group list each time
the group is retained) and causes problems when using an async
add_procs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-11-30 23:03:47 -07:00
Ryan Grant
324534b191 Merge pull request #1161 from tkordenbrock/topic/add.triggered.scatter
coll-portals4: add scatter and iscatter implementations that use Portals4 triggered operations
2015-11-30 16:53:47 -07:00
Todd Kordenbrock
4721b70dd5 coll-portals4: add scatter and iscatter implementations that use Portals4 triggered operations
This commit adds implementations of scatter and iscatter using
Portals4 triggered operations.  Currently, the only algorithm
is linear.
2015-11-30 15:07:18 -06:00
Todd Kordenbrock
f6f525e0d8 coll-portals4: remove unneeded code from gather
This commit removes two pieces of unneeded code from gather.  First
it removes destroy_tree() calls from linear_top(), because the
linear algorithm does not create a tree, so there is no need to
destroy it.  Second it removes unpack_bytes from the gather request
because it was calculated but never used.
2015-11-30 10:38:51 -06:00
Gilles Gouaillardet
80f02518ff topo/base: correctly free the topo object in mca_topo_base_dist_graph_create_adjacent 2015-11-30 15:33:59 +09:00
Ryan Grant
81d482dca6 Merge pull request #1137 from francois-wellenreiter/trig_mtl_rdv
MTL portals4 : improve the rendez-vous protocol using PtlTriggeredGet…
2015-11-24 17:31:31 -07:00
Ryan Grant
219581e87e Merge pull request #1090 from tkordenbrock/topic/check.for.invalid.handles.in.finalize
mtl-portals4: test for valid handle before releasing resources
2015-11-20 07:54:44 -06:00
Mike Dubman
c544620a7c Merge pull request #1138 from igor-ivanov/pr/yalla-valgrind
yalla: fix valgrind error due to uninitialized status field.
2015-11-20 07:19:11 -05:00
Gilles Gouaillardet
002c7b8b3a fcoll/two_phase: use PMPI_* instead of MPI_* 2015-11-20 13:46:19 +09:00
Gilles Gouaillardet
561e7f6647 vprotocol/pessimist: use internal ompi_* instead of MPI_* 2015-11-20 13:46:19 +09:00
Gilles Gouaillardet
025fd8a9fc osc: use PMPI_* instead of MPI_* 2015-11-20 13:46:19 +09:00
Gilles Gouaillardet
d816d1c194 coll/libnbc: use PMPI_* and internal ompi_* instead of MPI_* 2015-11-20 13:46:19 +09:00
yosefe
3bb1270715 yalla: fix valgrind error due to uninitialized status field. 2015-11-19 10:59:31 +02:00
Francois WELLENREITER
9126ea5e82 MTL portals4 : improve the rendez-vous protocol using PtlTriggeredGet operation 2015-11-19 09:52:53 +01:00
Edgar Gabriel
9e5ade4e8b argh, a debugging sleep statement got into the source code. 2015-11-16 13:26:57 -06:00
Edgar Gabriel
dbfbcdecd5 make adjustments for the default settings of grouping parameters and the default contiguous group size option.
minor bug fix in the simple grouping strategy.
2015-11-16 08:17:27 -06:00
Edgar Gabriel
27628774c7 add a new option for a simple aggregator selection which has zero communication costs. 2015-11-16 08:17:26 -06:00
Edgar Gabriel
66c1ea5fcb change the default value of the grouping option. Also add new grouping option which skips the refinement step in the aggregator selection. 2015-11-16 08:17:23 -06:00
Edgar Gabriel
e8e117503d reduce the communication volume during MPI_File_set_view 2015-11-16 08:17:22 -06:00
yohann
005400a937 mtl/ofi: Make sure the resources are managed by the provider. 2015-11-13 16:16:58 -08:00
Nathan Hjelm
9ef0821856 osc/rdma: fix some threading bugs
There were two bugs in osc/rdma when using threads:

 - Deadlock in ompi_osc_rdma_start_atomic. This occurs because
   ompi_osc_rdma_frag_alloc is called with the module lock. To fix the
   issue the module lock is now recursive. In the future I will add a
   new lock to protect just the current rdma fragment.

 - Do not drop the lock in ompi_osc_rdma_frag_alloc when calling
   ompi_osc_rdma_frag_complete. Not only is it not needed but dropping
   the lock at this point can cause a competing thread to mess up the
   state.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-11-12 20:25:57 -07:00
Yossi
b750b72a81 Merge pull request #1127 from yosefe/topic/pml-ucx-implement-cancel
pml_ucx: implement cancel, and add small optimizations.
2015-11-12 10:50:48 +02:00
yosefe
7becc54d67 pml_ucx: fix typo. 2015-11-12 09:57:41 +02:00
Todd Kordenbrock
b9630f802b Merge pull request #1120 from francois-wellenreiter/mtl_min_mdbind
mtl-portals4 : remove useless PtlMDBind PtlMDRelease calls for rendez-vous messages
2015-11-10 14:34:19 -06:00
yosefe
d66b01d380 pml_ucx: implement cancel, and add small optimizations. 2015-11-10 17:40:06 +02:00
Gilles Gouaillardet
d6ff25b9a2 pml/monitoring: initialize common symbols 2015-11-10 13:58:54 +09:00
Jeff Squyres
e89ecac83c bml r2: fix exclusivity comparison
Fixes open-mpi/ompi#1106
2015-11-06 13:26:32 -08:00
Francois WELLENREITER
b301b49a40 MTL portals4 : remove useless PtlMDBind PtlMDRelease calls for rendez-vous messages 2015-11-06 15:55:44 +01:00
yosefe
45c3d04857 pml_ucx: fix request construct/destruct.
We should invoke OBJ_CONSTRUCT/OBJ_DESTRUCT only on regular requests
(which are embedded inside UCX requests) and for the completed request.
Persistent requests are already constructed/destructed by the free list.
This fixes an assertion in ompi_request_destruct.
2015-11-04 11:03:46 +02:00
Todd Kordenbrock
cefe50cf54 mtl-portals4: test for valid handle before releasing resources
During component finalize, mtl-portals4 would blindly release
resources without testing if the handle was valid.  This was OK,
but resource allocation is now delayed until add_procs().  If
mtl-portals4 is deselected, it will be finalized without
add_procs() ever being called.  This commit ensures that invalid
handles are not released.
2015-11-02 21:01:14 -06:00
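The guard described in cefe50cf54 is a check-before-release pattern. The sketch below uses a hypothetical handle type and sentinel (`handle_sketch_t`, `INVALID_HANDLE_SKETCH`) instead of the real Portals4 types and release calls, which are not shown in the commit message.

```c
/* Hypothetical handle type and sentinel; the real Portals4 types and
 * release calls are not used here. */
typedef int handle_sketch_t;
#define INVALID_HANDLE_SKETCH (-1)

static int release_count = 0; /* counts real releases, for illustration */

/* Release the resource only if it was ever allocated, then reset the
 * handle so a second finalize is harmless. Returns 1 if released. */
static int release_if_valid(handle_sketch_t *h)
{
    if (INVALID_HANDLE_SKETCH == *h) {
        return 0; /* allocation never happened: nothing to release */
    }
    release_count++;
    *h = INVALID_HANDLE_SKETCH;
    return 1;
}
```

Initializing every handle to the sentinel at component open is what makes the check meaningful when the allocation step is skipped.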
George Bosilca
5c60e76669 Fix Coverity CIDs 1338021, 1338020, 1338019, 1338018. 2015-11-02 17:38:51 -05:00
George Bosilca
b77c203068 Add more comments and restore the progress, flags, max tag, and max
context_id from the original PML.
2015-10-31 17:13:35 -04:00
George Bosilca
3efd494972 Make sure the monitoring infrastructure works well with the
new dynamic add_procs.
2015-10-31 17:13:35 -04:00
Guillaume Papauré
86714ad91e change pml_monitoring_messages_count and pml_monitoring_messages_size pvars to use the start/stop features 2015-10-31 17:13:35 -04:00
George Bosilca
a43c2ce529 Fully integrate the monitoring with the MPI_T PVAR.
Writing to the pml_monitoring_flush variable will set the filename of
the output file.
Stopping a session for the pml_monitoring_flush will force the
generation of the monitoring output file (as long as the filename
is not NULL).
To reset the monitoring, one has to bind the pml_monitoring_flush to a
session.
2015-10-31 17:13:35 -04:00
George Bosilca
646a662721 Use the new group interface and add const to the PML send functions. 2015-10-31 17:13:35 -04:00
George Bosilca
5224a7ce4d Allow the pvar to be written by invoking the associated callback.
Use a PVAR to generate the monitoring dump of the information into a
file.

Use the PVAR to instruct the PML monitoring when to do the dump.
2015-10-31 17:13:35 -04:00
George Bosilca
df167f4177 Rewrite the close logic to be cleaner. 2015-10-31 17:13:35 -04:00
George Bosilca
c801ffde86 Use MPI_T variables to handle the flush in a more MPI-blessed way.
Code cleanup.

Update the monitoring test to use MPI_T variables.
2015-10-31 17:13:35 -04:00
George Bosilca
4f88c82500 Fix a conversion problem and add a comment about the lack of component
retain in the new component infrastructure.

Clean Makefile.am to fix "make distcheck".

Update the gitignore rules.
2015-10-31 17:13:35 -04:00
George Bosilca
80343a0d39 add ability to query pml monitoring results with MPI Tools interface
using performance variables "pml_monitoring_messages_count" and
"pml_monitoring_messages_size"

Per Brice suggestion make all data count and message length be
uint64_t.
2015-10-31 17:13:35 -04:00
George Bosilca
a47d69202f Add a monitoring PML. This PML tracks all data exchanges between processes,
optionally counting the collective traffic as a separate entity. The need
for such a PML is simply because the PMPI interface doesn't allow us to
identify the collective generated traffic.
2015-10-31 17:13:35 -04:00
Rolf vandeVaart
578385ca78 Merge pull request #1079 from rolfv/pr/cuda-require-41
Make CUDA 4.1 a requirement for CUDA-aware support
2015-10-29 12:56:22 -04:00
Nathan Hjelm
b1e3936261 Merge pull request #1078 from rolfv/pr/disable-osc-rdma-for-cuda
Disable the use of osc rdma when we detect a GPU buffer
2015-10-29 10:03:28 -06:00
Rolf vandeVaart
f2ff6e03ab Make CUDA 4.1 a requirement for CUDA-aware support.
Remove all related preprocessor conditionals.
2015-10-29 11:24:02 -04:00
Matias Cabral
8ebcac1b2c Merge pull request #1075 from matcabral/psm2_symbol_rename
Updated psm2 mtl with new externally exposed symbols of psm2.so 
Fixes open-mpi/ompi#1018
Fixes open-mpi/ompi#1021
2015-10-28 13:55:45 -07:00
Rolf vandeVaart
87a4cc6118 Disable the use of osc rdma when we detect a GPU buffer as it is not supported in that component.
This forces a failover to the osc pt2pt component. Fixes #1012
2015-10-28 14:47:45 -04:00
yosefe
ae738d0434 pml_ucx: add pmi fence in del_procs 2015-10-28 18:34:36 +02:00
Matias A Cabral
ed16d8e1cc Updated psm2 mtl with new externally exposed symbols of psm2.so
Fixes open-mpi/ompi#1018
Fixes open-mpi/ompi#1021
2015-10-28 09:12:33 -07:00
yosefe
41b6230be3 pml_ucx: fix debug macros, and initialize mpi request properly. 2015-10-28 10:59:25 +02:00
yohann
8bf1c95cdc mtl/ofi: Remove unused help messages. 2015-10-27 09:38:04 -07:00
Nathan Hjelm
69d403d42b Merge pull request #1054 from hjelmn/add_procs_threading
add_procs: add threading protection for dynamic add_procs
2015-10-27 09:27:13 -06:00
yohann
a111d66f0f mtl/ofi: Change hints to FI_PROGRESS_MANUAL. 2015-10-26 15:32:30 -07:00
yohann
fde8b89ceb mtl/ofi: Use OFI's representation of ANY_SRC instead of NULL. 2015-10-26 14:38:41 -07:00
yohann
4246de4508 mtl/ofi: Treat error correctly. 2015-10-26 14:38:33 -07:00
George Bosilca
2622b9d3a1 Fix minor issues in the treematch topo
based on a patch provided by Guillaume.
2015-10-25 21:38:59 -04:00
Jeff Squyres
140cf90e3e osc_rdma: minor compiler warning stomp 2015-10-23 06:21:56 -07:00
Nathan Hjelm
63e744ffc6 osc/rdma: use only a single btl registration for local state
This commit fixes a bug that can occur on Cray Gemini networks. If
multiple registrations are used for the local state then we lose the
atomicity guarantees. To avoid issues like this use only a single
registration handle for all local state on a node.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-22 15:51:19 -06:00
Nathan Hjelm
f690fc8fd5 osc/pt2pt: fix warnings
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-22 15:50:40 -06:00
Nathan Hjelm
e4219aa692 Merge pull request #1059 from hjelmn/osc_fixes
osc/rdma: bug fixes
2015-10-22 11:25:51 -06:00
Nathan Hjelm
97c9732bad osc/rdma: bug fixes
This commit fixes the following:

 - CIDs 1328491, 1328492: Dead code caused by typos in a prior
   commit.

 - Fix the calculation of dynamic memory regions. This was causing
   incorrect RMA range errors when accessing the last partial page of
   an attachment.

 - Fix a SEGV when using dynamic memory windows with local state (all
   processes on the same node).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-22 09:49:38 -06:00
yohann
889c76634e mtl/ofi: Increase priority. 2015-10-22 08:39:36 -07:00
Nathan Hjelm
b2fa2a9bef Merge pull request #1056 from hjelmn/osc_fixes
osc/pt2pt: reset all_sync sync object before sending complete messages
2015-10-21 19:40:28 -06:00
Nathan Hjelm
864f88a2a3 osc/pt2pt: reset all_sync sync object before sending complete messages
This commit fixes a bug that occurs when a post message comes in when
sending complete messages or while waiting for all outgoing messages
to flush. In that case the post message might get incorrectly
associated with the ending sync object.

References open-mpi/ompi#1012

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-21 18:30:08 -06:00
Nathan Hjelm
08e267b811 add_procs: add threading protection for dynamic add_procs
This commit adds protection to the group, ob1, and bml endpoint lookup
code. For ob1 and the bml a lock has been added. For performance
reasons the lock is only held if a bml or ob1 endpoint does not
exist. ompi_group_dense_lookup now uses opal_atomic_cmpset to ensure
the proc is only retained by the thread that actually updates the
group.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-21 16:13:41 -06:00
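The compare-and-set idea in 08e267b811 (only the thread that actually updates the group entry retains the proc) can be sketched with C11 atomics. This is an illustration under assumptions: `atomic_compare_exchange_strong` stands in for `opal_atomic_cmpset`, and `proc_sketch_t` and `slot_lookup_or_install` are hypothetical names, not Open MPI code.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical proc object with a reference count. */
typedef struct { int refcount; } proc_sketch_t;

/* Install `proc` into an empty slot. Only the caller whose
 * compare-and-swap succeeds takes the reference; a caller that loses
 * the race uses the pointer that is already there. */
static proc_sketch_t *slot_lookup_or_install(_Atomic(proc_sketch_t *) *slot,
                                             proc_sketch_t *proc)
{
    proc_sketch_t *expected = NULL;
    if (atomic_compare_exchange_strong(slot, &expected, proc)) {
        proc->refcount++;  /* we updated the slot, so we retain */
        return proc;
    }
    return expected;       /* now holds the winner's pointer */
}
```

Because a failed exchange writes the current slot value into `expected`, the losing caller gets the installed pointer back without taking a second reference.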
Nathan Hjelm
386991d590 Merge pull request #1052 from hjelmn/osc_rdma_fixes
osc/rdma: use standard verbosity levels
2015-10-21 13:21:50 -06:00
Nathan Hjelm
9476c7bbca osc/rdma: use standard verbosity levels
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-21 12:31:41 -06:00
yohann
abe5002ee9 mtl/ofi: remove threading and progress hints. 2015-10-21 10:25:08 -07:00
yosefe
cc76db8d39 ucx: reduce components priority to 5. 2015-10-21 17:38:25 +03:00
Mike Dubman
4ea13f10f6 Merge pull request #1008 from alex-mikheev/topic/ucx_support
UCX support for ompi and oshmem
2015-10-21 09:33:33 +03:00
Nathan Hjelm
763744a32c Merge pull request #1046 from hjelmn/osc_rdma_fixes
osc/rdma: bug fixes
2015-10-20 16:44:38 -06:00
Nathan Hjelm
b8ee05d352 osc/rdma: bug fixes
This commit fixes several bugs in the osc/rdma component:

 - Complete aggregated requests immediately. Completion of RMA
   requests indicates local completion anyway. This fixes a hang in
   the c_reqops test.

 - Correctly mark Rget_accumulate requests.

 - Set the local base flag correctly on the local peer.

 - Clear or set the no locks flag on the window if the value is
   changed by MPI_Win_set_info.

 - Actually update the target when using MPI_OP_REPLACE.

Fixes open-mpi/ompi#1010

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-20 15:27:15 -06:00
Ryan Grant
f60c506c68 Merge pull request #999 from tkordenbrock/topic/add.triggered.gather
coll-portals4: add gather and igather implementations that use Portals4 triggered operations
2015-10-20 14:59:09 -06:00
yosefe
a313588337 ompi: Add UCX PML. 2015-10-20 19:46:06 +03:00
Nathan Hjelm
9602484568 Merge pull request #1040 from hjelmn/mtl_priority
Change how cm's priority is calculated
2015-10-19 14:18:36 -06:00
Nathan Hjelm
53f6b57c0a pml/cm: use the priority of the mtl component
This commit changes the priority of mtl components to be relative to
pml/ob1 and updates the mtl interface to expose this priority. cm now
sets its own priority based on the priority of the selected mtl
component.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:42 -06:00
Nathan Hjelm
8b5810f7f7 mca/base: add priority output to mca_base_select
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
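The optional output argument described in 8b5810f7f7 follows a common C convention: write the result only if the caller supplied a non-NULL pointer. The helper below, `select_best_sketch`, is hypothetical and does not reproduce the real `mca_base_select` signature.

```c
#include <stddef.h>

/* Hypothetical selection helper, not the real mca_base_select
 * signature. Returns the index of the highest-priority entry, or -1 if
 * there are none. priority_out is optional: pass NULL to ignore it. */
static int select_best_sketch(const int *priorities, size_t n, int *priority_out)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0 || priorities[i] > priorities[best]) {
            best = (int) i;
        }
    }
    if (best >= 0 && NULL != priority_out) {
        *priority_out = priorities[best];
    }
    return best;
}
```

Callers that do not care about the winning priority pass NULL, so existing call sites need only one extra argument.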
Nathan Hjelm
bedd80214e pml/ob1: remove priority check
This commit removes code that checks the ob1 priority vs the previous
priority. The previous priority is meaningless here and may only cause
ob1 to disable itself when it shouldn't.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
Nathan Hjelm
2fd176ac7f cm: fix selection priority
This patch removes a priority check that disables cm if the previous
pml had higher priority. The check was incorrect as coded and is
unnecessary as we finalize all but one pml anyway.

Fixes open-mpi/ompi#1035

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:26 -06:00
Gilles Gouaillardet
0f23037775 coll/base: fix memory allocation in mca_coll_base_alltoall_intra_basic_inplace 2015-10-19 16:47:59 +09:00
Edgar Gabriel
f97655f28e make sure the iov buffer is initialized to zero, otherwise bad things can happen for 0-byte contributions on a process. 2015-10-15 12:46:01 -05:00
Edgar Gabriel
0177918a08 limit the number of bytes used for the semaphore name depending on platform (31 bytes for MacOS, 252 for Linux) 2015-10-15 11:13:45 -05:00
Nathan Hjelm
341b60dd57 Merge pull request #1029 from kawashima-fj/pr/ob1-fin-memory-leak
pml/ob1: Fix a memory leak regarding pending FIN control messages.
2015-10-15 07:55:52 -06:00
KAWASHIMA Takahiro
4e56505202 pml/ob1: Fix a memory leak regarding pending FIN control messages.
Once a FIN control message is appended to the pending list,
the ob1 PML attempts to send the FIN again in the
`mca_pml_ob1_process_pending_packets` function.
But if the PML fails to send the FIN again, the `mca_pml_ob1_send_fin`
function creates a new `mca_pml_ob1_pckt_pending_t` object and the
old object is not returned to the free list.
2015-10-15 11:15:03 +09:00
Jeff Squyres
8307330e8a Merge pull request #989 from jsquyres/pr/friendly-message-when-dynamics-disabled
Print friendly message when dynamics disabled
2015-10-14 19:52:52 -04:00
Jeff Squyres
889d80a659 mxm/yalla: disable MPI dynamic process functionality
Disable the MPI dynamic process functionality when these components
are selected to be used.
2015-10-14 13:42:56 -07:00
Nathan Hjelm
e11f014c6e osc/rdma: fix segmentation fault when running 1 ppn
This commit fixes an issue identified by @rolfv. The local peer was
not being correctly initialized when running with a single process on
a node.

This fixes open-mpi/ompi#1010

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-14 12:40:52 -06:00
Jeff Squyres
62351f442a help: remove stale help messages and files
Found by contrib/check-help-strings.pl.
2015-10-13 16:50:20 -04:00
Todd Kordenbrock
7c738fb657 coll-portals4: add gather and igather implementations that use Portals4 triggered operations
This commit adds implementations of gather and igather using
Portals4 triggered operations.  The default algorithm is linear,
but binomial can be selected using an MCA parameter -
coll_portals4_use_binomial_gather_algorithm.
2015-10-13 11:26:35 -05:00
Nathan Hjelm
d8dc5292ed Merge pull request #1002 from hjelmn/ompi_coverity
ompi: fix coverity issues
2015-10-09 12:27:41 -06:00
bosilca
1310acc83f Merge pull request #912 from bosilca/topic/coll_requests
This patch fixes the issues identified by @ggouaillardet in the IBM tests (collectives and topologies). It also improves the memory usage of OMPI, as a communicator without collective communications will never allocate the array of requests needed to coordinate the basic collective algorithms. This ticket replaced #790.
2015-10-09 11:27:07 -04:00
Nathan Hjelm
4cb42f8264 ompi: fix coverity issues
Fixes CID 715741: Logically dead code

Verified. Removed dead code.

Fixes CID 1320878: Resource leak

Free proc_list before returning.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-10-09 08:41:27 -06:00
Todd Kordenbrock
141b20d991 osc-portals4: Initialize datatype in MPI_Get_accumulate and MPI_Rget_accumulate
Fix code paths that didn't convert the MPI datatype to the
corresponding Portals4 datatype.

Thanks to Nicolas Chevalier (@shawone) for finding this bug and
submitting a patch.
2015-10-08 12:17:19 -05:00
Gilles Gouaillardet
e946c82847 Revert "coll/basic: fix segmentation fault in neighborhood collectives if the degree"
This partially reverts commit open-mpi/ompi@76204dfafe.
2015-10-08 12:00:41 -04:00
Gilles Gouaillardet
99cca2cfd3 Revert "* comment on communicator creation in mca_topo_base_dist_graph_create(...)"
This partially reverts commit open-mpi/ompi@27e4389259.
2015-10-08 12:00:41 -04:00
George Bosilca
a8bdd8f668 Don't lose the pointer to the request array. Patch provided by
@ggouaillardet.
2015-10-08 12:00:41 -04:00
George Bosilca
88492a1e12 Consistently use the request array for all modules (single array stored
in the base).
Correctly deal with persistent requests (they must be always freed when
they are stored in the request array associated with the communicator).
Always use MPI_STATUS_IGNORE for single request waiting functions.
2015-10-08 12:00:41 -04:00
George Bosilca
01b32caf98 Update the basic module to dynamically allocate the right
number of requests.

Remove unnecessary fields.
2015-10-08 12:00:41 -04:00
George Bosilca
a324602174 Never allocate a temporary array for the requests. Instead rely on the
module_data to hold one with the largest necessary size. This array is
only allocated when needed, and it is released upon communicator
destruction.
2015-10-08 12:00:41 -04:00
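The on-demand request array described in a324602174 can be sketched as a lazily grown buffer owned by the module data. The names `module_data_sketch_t`, `get_reqs_sketch`, and `free_reqs_sketch` are hypothetical, and `void *` stands in for the real request handle type.

```c
#include <stdlib.h>

/* Hypothetical per-communicator module data; `void *` stands in for a
 * request handle type. */
typedef struct {
    void **reqs;
    size_t nreqs;
} module_data_sketch_t;

/* Return an array with room for at least `n` requests, allocating or
 * growing it only on demand. Returns NULL on allocation failure and
 * leaves the existing array untouched. */
static void **get_reqs_sketch(module_data_sketch_t *data, size_t n)
{
    if (data->nreqs >= n) {
        return data->reqs;
    }
    void **grown = realloc(data->reqs, n * sizeof(*grown));
    if (NULL == grown) {
        return NULL;
    }
    data->reqs = grown;
    data->nreqs = n;
    return data->reqs;
}

/* Called once, when the communicator is destroyed. */
static void free_reqs_sketch(module_data_sketch_t *data)
{
    free(data->reqs);
    data->reqs = NULL;
    data->nreqs = 0;
}
```

A communicator that never runs a collective never calls `get_reqs_sketch`, so under this scheme it never pays for the array.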
Ryan Grant
8134ba76f1 Merge pull request #998 from tkordenbrock/topic/fix.incorrect.ompi_proc.cast
Looks good to me.

mtl-portals4: fix bug in the Portals4 get_peer family
2015-10-08 08:38:16 -06:00
Todd Kordenbrock
88d79efd9f mtl-portals4: fix bug in the Portals4 get_peer family
The Portals4 get_peer family incorrectly cast the ompi_proc_t to
ptl_process_t and returned that as the peer.  The ptl_process_t is
actually found in the endpoint array.  This commit fixes the
Portals4 get_peer family to return the dereferenced endpoint
pointer.
2015-10-08 07:57:48 -05:00
Todd Kordenbrock
f33b0c1cdf coll-portals4: allreduce: remove extra %d from error message. 2015-10-08 07:57:33 -05:00
Devendar Bureddy
72f98ccf6c HCOLL: Enable alltoall interface 2015-10-07 08:00:04 +03:00
Edagr Gabriel
8af80cd02c update the interfaces of the sharedfp addproc component to match the changes made in the const commit. 2015-10-06 07:54:38 -05:00
Gilles Gouaillardet
de8de65b07 coll/tuned: remove unused prototypes from coll_tuned.h 2015-10-06 09:07:48 +09:00
Mike Dubman
e8d7373b14 COLL/FCA: revert to prev barrier if called from finalize
FCA barrier may not complete if FCA progress is not called periodically.
PMI/PMI2 API that can be used in rte barrier has no provision for calling
external progress function.

So it is possible that during finalize some ranks will be stuck
in fca barrier while others are in PMI barrier.
2015-10-04 09:40:19 +03:00
Nathan Hjelm
5122327727 fcoll/two_phase: fix new coverity errors
Fix CID 1325467: use after free

Remove extra free of aggregator_list.

Fix CID 1325466: resource leak

Fix typo in prior coverity fix.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-10-02 21:38:31 -06:00
Nathan Hjelm
eb79edff33 Merge pull request #963 from hjelmn/ompi_coverity
fcoll/two_phase: fix coverity errors
2015-10-02 10:22:24 -06:00
Devendar Bureddy
243b75aa80 HCOLL: Add alltoallv interface 2015-10-02 01:51:33 +03:00
Nathan Hjelm
95b95e19af fcoll/dynamic: fix coverity errors
Fixes CID 72320: Explicit NULL dereferenced

On error it is possible that the blocklen_per_process array is
NULL. Change the NULL check before the free to check for non-NULL on
the array not the array element. Also clean up allocation of this
array to use calloc instead of malloc + setting each element to NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-01 14:38:09 -06:00
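The fix described in the commit above (check the array itself before freeing, and allocate it with calloc) can be illustrated with a minimal sketch. The helper names `alloc_per_process` and `free_per_process` are hypothetical and are not taken from the fcoll/dynamic source; only the pattern follows the commit message.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate an array of per-process pointers.
 * calloc zero-fills the array, which on mainstream platforms leaves
 * every element as a NULL pointer, so no explicit loop is needed. */
int **alloc_per_process(size_t nprocs)
{
    return calloc(nprocs, sizeof(int *));
}

/* Hypothetical cleanup: test the array pointer itself, not array[0].
 * Testing an element first would dereference a NULL array on error paths. */
void free_per_process(int **array, size_t nprocs)
{
    if (NULL != array) {
        for (size_t i = 0; i < nprocs; ++i) {
            free(array[i]);  /* free(NULL) is a no-op */
        }
        free(array);
    }
}
```

Calling `free_per_process(NULL, n)` is safe in this sketch, which is the case the original NULL check missed.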
Nathan Hjelm
09df7aa205 fcoll/two_phase: fix coverity errors
Fixes CIDs 72300, 72344, 1196764-1196768, 72300: Resource leaks

Multiple allocated arrays are going out of scope at the end of
mca_fcoll_two_phase_file_write_all. Free these arrays. Also removed
the extraneous NULL checks since free (NULL) is safe in C.

Change returns to goto exit where the allocated resources are freed.

Fixes CIDs 72285-72292, 72297, 72298: Resource leaks

Change all appropriate return statements to goto exit to ensure that
all resources are freed. Also removed the NULL checks since free
(NULL) is safe in C.

Fixes CIDs 72295, 72296: Resource leaks

Moved free of requests and recv_types to after exit label. This will
ensure these are freed on error.

Also added a loop and statement to free send_buf which is going out of
scope at the end of the function.

Fixes CIDs 72336-72240, 735197, 735198: Resource leaks

Moved the exit label to before the resources are released and
changed all appropriate return statements to goto exit. Also removed
extraneous NULL checks because free (NULL) is safe in C.

Fixes CIDs 72341, 72343, 1196805-1196809: Resource leaks

Free all resources after exit label and change return statements to
goto exit to ensure all resources are freed on error.

Fixes CID 1269973: Unused value

Check return code of ompi_request_wait_all. If it fails jump to the
exit.

Fixes CID 714119: Dereference before NULL check

Wrong value checked in conditional.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-01 14:38:09 -06:00
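The "goto exit" cleanup pattern that the commit above applies throughout fcoll/two_phase can be sketched as follows. The function `copy_twice` and its buffers are hypothetical stand-ins; the real code frees different arrays, but the structure (a single exit label, unconditional `free` calls) is the same idea.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical function: copies src to out through two scratch buffers.
 * Returns 0 on success, -1 on allocation failure. */
int copy_twice(const char *src, size_t len, char *out)
{
    int ret = -1;
    char *first = NULL;
    char *second = NULL;

    first = malloc(len);
    if (NULL == first) {
        goto exit;          /* instead of "return -1", which would leak */
    }

    second = malloc(len);
    if (NULL == second) {
        goto exit;          /* first is still released below */
    }

    memcpy(first, src, len);
    memcpy(second, first, len);
    memcpy(out, second, len);
    ret = 0;

exit:
    /* No NULL checks needed: free(NULL) is defined to do nothing in C. */
    free(first);
    free(second);
    return ret;
}
```

Every return path goes through the exit label, so adding a new allocation only requires one more `free` after it.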
Nathan Hjelm
5fd9c35957 osc/rdma: fix incorrect assert
This commit fixes MTT failures in debug builds.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-29 15:37:40 -06:00
Nathan Hjelm
7b8ec48c68 osc/rdma: fix typos in arguments to btl_atomic_[f]op
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-09-29 08:09:00 -06:00
Nathan Hjelm
12bd300c40 Merge pull request #929 from hjelmn/add_procs
Update add_procs support
2015-09-28 17:29:13 -06:00
Nathan Hjelm
6611c000c9 Fix coverity warnings
Fix CID 1315271: Constant expression result

The intent of this conditional is to not produce a peruse event for
probe or mprobe requests. Coverity is correct that the expression is
always true. Changed the || to && to fix. Also moved the conditional
within an OMPI_WANT_PERUSE to ensure the conditional is not evaluated
if peruse is disabled.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-28 15:35:25 -06:00
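The constant-expression bug fixed in the commit above is easy to reproduce in isolation. In this sketch the enum and function names are hypothetical; only the `||` to `&&` change mirrors the commit message.

```c
#include <stdbool.h>

/* Hypothetical request types, standing in for the real request kinds. */
enum req_type { REQ_RECV, REQ_PROBE, REQ_MPROBE };

/* Buggy form: no value equals both PROBE and MPROBE at once,
 * so at least one inequality always holds and the result is always true. */
bool wants_event_buggy(enum req_type t)
{
    return t != REQ_PROBE || t != REQ_MPROBE;
}

/* Fixed form: true only when the request is neither a probe nor an mprobe. */
bool wants_event_fixed(enum req_type t)
{
    return t != REQ_PROBE && t != REQ_MPROBE;
}
```

With the buggy form, a probe request would still produce an event; the fixed form suppresses it.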
bosilca
0a3c54ed61 Merge pull request #942 from bosilca/topic/global_request
Fix for "Random errors on MPI_COMPARE_AND_SWAP with pt2pt OSC of Open MPI master" (#933)
2015-09-27 16:56:29 +02:00
Nathan Hjelm
552e1b59a5 osc/rdma: fix coverity issues
Fixes CID 1324730, 1327429, 1324728, 1196633, 1324731, 1324727, and
1196632: Logically dead code

OMPI_OSC_RDMA_REQUEST_ALLOC can never return a NULL request. Removed
unnecessary NULL checks.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-09-26 12:45:14 -06:00
Nathan Hjelm
ebf19ac5eb osc/pt2pt: fix coverity issues
Fixed CID 1269712, 1269709, 1269706, 1269703, 1269694: Logically dead code

Remove extra NULL check as OMPI_OSC_PT2PT_REQUEST_ALLOC can never set the
request to NULL.

Fixes CID 1269668: Unchecked return value

False positive. Add (void) to indicate we do not care about the return code
from opal_hash_table_get_uint32.

Fixes CID 1324726: Free of address-of expression

Do not free lock if it was not allocated.

Fixes CID 1269658: Free of address-of expression

Never will happen but because op is always a built-in op there is no
reason to retain/release it anyway.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-09-26 11:18:22 -06:00
George Bosilca
01d8e23ccc Fix the random errors related to the recursive sends and receives
identified by Fujitsu.
2015-09-26 00:44:51 +02:00
Nathan Hjelm
f84716fcd0 Merge pull request #941 from hjelmn/osc_pt2pt_fix
osc/pt2pt: fix heterogeneous build
2015-09-25 08:07:09 -06:00
Nathan Hjelm
ae7f47e04d osc/pt2pt: fix heterogeneous build
Fixes #940

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-09-25 00:15:02 -06:00
Todd Kordenbrock
3e63a3458c portals4: add support for dynamic add_procs() to all Portals4 components
In the default mode of operation, the Portals4 components support
dynamic add_procs().

The Portals4 components have two alternate modes (flow control and
logical-to-physical) that require knowledge of all procs at startup.
In these modes, mtl-portals4 sets the MCA_MTL_BASE_FLAG_REQUIRE_WORLD
flag and btl-portals4 sets the MCA_BTL_FLAGS_SINGLE_ADD_PROCS flag
to tell the PML that we need all the procs in one add_procs() call.
2015-09-24 22:12:57 -05:00
Nathan Hjelm
248212276d osc/sm: fix remaining coverity issues
Fixes CID 1324870: Memory - illegal accesses (USE_AFTER_FREE)

Free osc module after calling destruct on the lock.

Fixes CID 1324868: Integer handling issues (OVERFLOW_BEFORE_WIDEN)
Fixes CID 1324867: Integer handling issues (OVERFLOW_BEFORE_WIDEN)

Explicitly cast to uint64_t to ensure the widen happens before an overflow
can occur.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-24 15:55:01 -06:00
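The OVERFLOW_BEFORE_WIDEN class of issue addressed in the commit above can be shown with a small sketch. The function names and the count/size parameters are hypothetical; the commit message only states that an explicit cast to uint64_t was added.

```c
#include <stdint.h>

/* Buggy form: the multiplication is carried out in 32 bits (on platforms
 * where unsigned int is 32 bits wide) and wraps before the result is
 * widened to 64 bits for the return value. */
uint64_t total_bytes_buggy(uint32_t count, uint32_t size)
{
    return count * size;
}

/* Fixed form: casting one operand first forces a 64-bit multiplication,
 * so the widening happens before any overflow can occur. */
uint64_t total_bytes_fixed(uint32_t count, uint32_t size)
{
    return (uint64_t) count * size;
}
```

For example, 65536 * 65536 does not fit in 32 bits, so only the fixed form returns the full product.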
Nathan Hjelm
4ab4c4d7e9 Merge pull request #930 from hjelmn/ompi_coverity
coll/libnbc: fix coverity errors
2015-09-23 17:07:34 -06:00
Nathan Hjelm
54a4061d88 Add support for detecting when dynamic add_procs is not possible
This commit adds support to the pml, mtl, and btl frameworks for
components to indicate at runtime that they do not support the new
dynamic add_procs behavior. At the high end the lack of dynamic
add_procs support is signalled by the pml using the new pml_flags
member to the pml module structure. If the
MCA_PML_BASE_FLAG_REQUIRE_WORLD flag is set MPI_Init will generate the
ompi_proc_t array passed to add_proc from ompi_proc_world () instead
of ompi_proc_get_allocated ().

Both cm and ob1 have been updated to detect if the underlying mtl and
btl components support dynamic add_procs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-23 16:22:05 -06:00
Nathan Hjelm
2c89c7f47d ompi/proc: add function to get all allocated procs
This commit adds two new functions:

 - ompi_proc_get_allocated - Returns all procs in the current job that
   have already been allocated. This is used in init/finalize to
   determine which procs to pass to add_procs/del_procs.

 - ompi_proc_world_size - returns the number of processes in
   MPI_COMM_WORLD. This may be removed in favor of callers just
   looking at ompi_process_info.

The behavior of ompi_proc_world has been restored to return
ompi_proc_t's for all processes in the current job. The use of this
function is discouraged.

Code that was using ompi_proc_world() has been updated to make use of
the new functions to avoid the memory overhead of ompi_comm_world ().

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-23 16:22:05 -06:00
Nathan Hjelm
30f8d0b038 coll/libnbc: fix coverity errors
Fix CID 1196812: Resource Leak

dsts array was leaked on error.

Fix CID 710565: Copy-paste error

The line in question (nbc:513) is indeed a copy-paste error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-23 16:14:49 -06:00
William Throwe
80bb41a079 ROMIO configure looks for lstat in wrong header

The ROMIO configure script checks for a declaration of lstat in
unistd.h, but, at least on the Linux machines I checked, lstat is in
sys/stat.h.  (The detection failure led to a linker error when building
ROMIO as part of OpenMPI on one of my admittedly strangely configured
machines, somehow.)  It appears from the man page that either location
is possible, so check both.

(cherry picked from mpich/mpich@7b8bd055df)

Signed-off-by: Rob Latham <robl@mcs.anl.gov>
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-23 11:56:53 -06:00
Nathan Hjelm
db74fa9d0f bml/r2: fix memory leak
The add_procs change made some assumptions in the bml/r2 add_procs
wrong. This led to del_procs never being called. I removed the logic
that checks the ompi_proc_t reference count and removed an unnecessary
allocation. The allocation only makes sense if we pass more than a
single proc at a time to the btl del_procs.

This commit also ensures that the btl del_procs is called if the
endpoint is in the btl_rdma array but not the btl_send array.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-23 10:45:13 -06:00
Nathan Hjelm
ee5810813b osc/pt2pt: fix regression in pscw sync on 0 size groups
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-22 17:09:00 -06:00
Nathan Hjelm
f6920aa916 osc/rdma: check for usable btls during query
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-22 17:08:28 -06:00
Nathan Hjelm
903762e194 osc/sm: fix pscw synchronization
The osc/sm component was using a simple counter to determine if all
expected posts had arrived to start a PSCW access epoch. This is
incorrect as a post may arrive from a peer that isn't part of the
current start group. There are many ways this could have been fixed.
This commit adds an n^2 bitmap. When a process posts it sets a bit in
the bitmap associated with the access rank to indicate the post is
complete. The access rank checks for and clears the bits associated
with all the processes in the start group.

The bitmap requires comm_size ^ 2 bits of space. This should be
manageable as most nodes have relatively small numbers of processes. If
this changes another algorithm can be implemented.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-22 16:00:27 -06:00
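The n^2 post bitmap described in the commit above can be sketched as follows. This is a single-process illustration under stated assumptions: the type and function names (`post_bitmap_t`, `post_bitmap_post`, `post_bitmap_try_start`) are hypothetical, each row is padded to whole 64-bit words, and no atomics are used, whereas a real shared-memory implementation would need atomic updates.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bitmap: one row per access rank, one bit per posting rank. */
typedef struct {
    int comm_size;
    size_t words_per_row;
    uint64_t *bits;
} post_bitmap_t;

int post_bitmap_init(post_bitmap_t *bm, int comm_size)
{
    bm->comm_size = comm_size;
    bm->words_per_row = ((size_t) comm_size + 63) / 64;
    bm->bits = calloc((size_t) comm_size * bm->words_per_row, sizeof(uint64_t));
    return (NULL == bm->bits) ? -1 : 0;
}

/* A posting process sets its own bit in the row of the access rank. */
void post_bitmap_post(post_bitmap_t *bm, int poster, int access_rank)
{
    uint64_t *row = bm->bits + (size_t) access_rank * bm->words_per_row;
    row[poster / 64] |= UINT64_C(1) << (poster % 64);
}

/* The access rank checks the bits of every rank in its start group.
 * If all are set it clears them and returns true; posts from ranks
 * outside the group are left untouched for a later epoch. */
bool post_bitmap_try_start(post_bitmap_t *bm, int access_rank,
                           const int *group, int group_size)
{
    uint64_t *row = bm->bits + (size_t) access_rank * bm->words_per_row;

    for (int i = 0; i < group_size; ++i) {
        if (!(row[group[i] / 64] & (UINT64_C(1) << (group[i] % 64)))) {
            return false;
        }
    }
    for (int i = 0; i < group_size; ++i) {
        row[group[i] / 64] &= ~(UINT64_C(1) << (group[i] % 64));
    }
    return true;
}
```

Unlike a plain counter, a post from a rank outside the current start group cannot satisfy the check, because each poster has its own bit.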
Nathan Hjelm
036395dc0f osc/pt2pt: fix typos
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-22 10:30:01 -06:00
Nathan Hjelm
974061c38f osc: fixed issues identified by coverity
Fix CID 1324733: Null pointer dereferences  (FORWARD_NULL)
Fix CID 1324734: Null pointer dereferences  (FORWARD_NULL)
Fix CID 1324735: Null pointer dereferences  (FORWARD_NULL)
Fix CID 1324736: Null pointer dereferences  (FORWARD_NULL)
Fix CID 1324737: Null pointer dereferences  (FORWARD_NULL)
Fix CID 1324751: Memory - illegal accesses  (USE_AFTER_FREE)
Fix CID 1324750: (USE_AFTER_FREE)
Fix CID 1324749: Memory - corruptions  (USE_AFTER_FREE)
Fix CID 1324748: Memory - illegal accesses  (USE_AFTER_FREE)
Fix CID 1324747: (USE_AFTER_FREE)
Fix CID 1324746: Memory - corruptions  (USE_AFTER_FREE)

Add missing return on an error path.

Fix CID 1324745: Code maintainability issues  (UNUSED_VALUE)

Ignore return code from barrier. It was not being used anyway.

Fix CID 1324738: Null pointer dereferences  (FORWARD_NULL)
Fix CID 1324741: Null pointer dereferences  (REVERSE_INULL)

module->selected_btl can not be NULL in osc/rdma during normal
operation. Removed the unnecessary NULL check.

Fix CID 1324752: Memory - illegal accesses  (USE_AFTER_FREE)

Move ompi_osc_pt2pt_module_lock_remove to before the lock is freed.

Fix CID 1324744: Uninitialized variables  (UNINIT)
Fix CID 1324743: Uninitialized variables  (UNINIT)

This array is not used uninitialized but there is no reason not to use
calloc here to silence the warning.

The following CID is a false positive: 1324742. I will mark it such in
coverity.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-22 09:23:39 -06:00
Rolf vandeVaart
2c51faa58d Fix warnings due to missing const 2015-09-21 14:18:44 -04:00
Nathan Hjelm
60c2b0df48 Merge pull request #903 from hjelmn/new_osc_rdma
osc/rdma: add true RDMA one-sided component
2015-09-21 10:29:11 -06:00
Nathan Hjelm
88100ad670 Merge pull request #902 from hjelmn/new_osc
osc/pt2pt: reduce memory footprint of windows
2015-09-21 10:28:41 -06:00
Edgar Gabriel
01fcfb08fe do not set the contiguous flag in two_phase_file_read_all. This optimization
needs some more debugging for the two_phase component, and is disabled
for two_phase_file_write_all as well.
2015-09-18 09:30:50 -05:00
Edgar Gabriel
3734a38370 this file should have been part of the previous commit. for removing io_ompio_nbc.[ch] 2015-09-18 09:28:25 -05:00
Edgar Gabriel
cf46a6bd4d remove the io_ompio_nbc.[ch] files, they are not used anymore at this point in time. 2015-09-18 09:26:25 -05:00
Gilles Gouaillardet
a611274704 pml: fix commit open-mpi/ompi@6e6a3e965c
do not use the const modifier for allocator nor recv buffers
2015-09-18 09:54:18 +09:00
Jeff Squyres
567c9e3a5b mtl_ofi_component.c: add missing argv.h header 2015-09-17 10:05:05 -07:00
Nathan Hjelm
d8df9d414d osc/rdma: add true RDMA one-sided component
This commit adds support for performing one-sided operations over
supported hardware (currently Infiniband and Cray Gemini/Aries). This
component is still undergoing active development.

Current features:

 - Use network atomic operations (fadd, cswap) for implementing
   locking and PSCW synchronization.

 - Aggregate small contiguous puts.

 - Reduced memory footprint by storing window data (pointer, keys,
   etc) at the lowest rank on each node. The data is fetched as each
   process needs to communicate with a new peer. This is a trade-off
   between the performance of the first operation on a peer and the
   memory utilization of a window.

TODO:

 - Add support for the accumulate_ops info key. If it is known that
   the same op or same op/no op is used it may be possible to use
   hardware atomics for fetch-and-op and compare-and-swap.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-16 15:01:33 -06:00
Nathan Hjelm
fd42343ff0 osc/pt2pt: reduce memory footprint of window
This commit updates osc/pt2pt to allocate peer objects as they are
needed rather than all at once. Additionally, to help improve the
memory footprint a new synchronization structure has been added.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-16 13:01:56 -06:00
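The on-demand peer allocation described in the commit above can be sketched in a few lines. All names here (`peer_t`, `window_t`, `window_get_peer`) are hypothetical, and the array of pointers is a simplification chosen for brevity; the actual osc/pt2pt data structures are not shown in this log.

```c
#include <stdlib.h>

/* Hypothetical per-peer state. */
typedef struct {
    int rank;
} peer_t;

/* Hypothetical window: the peers array holds size pointers, all NULL
 * until a peer is first used. */
typedef struct {
    peer_t **peers;
    int size;
} window_t;

/* Return the peer object for rank, allocating it on first use.
 * Returns NULL for an out-of-range rank or on allocation failure. */
peer_t *window_get_peer(window_t *win, int rank)
{
    if (rank < 0 || rank >= win->size) {
        return NULL;
    }
    if (NULL == win->peers[rank]) {
        peer_t *p = calloc(1, sizeof(*p));
        if (NULL == p) {
            return NULL;
        }
        p->rank = rank;
        win->peers[rank] = p;
    }
    return win->peers[rank];
}
```

Only peers that are actually communicated with consume a `peer_t`, so the per-window cost grows with the number of active peers rather than with the communicator size (apart from the pointer array in this simplified version).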
George Bosilca
02624bd0b6 Fix all treematch issues identified by Coverity. 2015-09-15 23:49:11 -04:00
George Bosilca
6ab5f68fc3 indentation. 2015-09-15 22:46:13 -04:00