1
1
Граф коммитов

5869 Коммитов

Автор SHA1 Сообщение Дата
George Bosilca
004c0cc05b Fix issues identified by @derbeyn. 2016-03-29 15:50:32 -04:00
Jeff Squyres
91c54d7a07 Merge pull request #1491 from ICLDisco/progress_thread
BTL TCP async progress
2016-03-29 06:26:10 -04:00
George Bosilca
f69eba1bc4 Update the copyright and cleanup the code.
Per @jsquyres suggestion remove all trailing spaces.
Credit to `sed -i.bak 's/ *$//' */[ch]`.
2016-03-28 14:41:01 -04:00
Thananon Patinyasakdikul
92062492b9 Enable Threading in the BTL TCP
Added mca parameter to turn progress thread on/off
Add a flag to check if we have btl progress thread.
Added macro for ob1 matching lock.
Update the AUTHORS file.
2016-03-28 14:41:01 -04:00
Nathan Hjelm
9d5eeecb8a pml/ob1: detect unreachable errors
This commit adds code to detect when procs are unreachable when using
the dynamic add_procs functionality.

Fixes #1501

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-28 10:52:40 -06:00
George Bosilca
57eadb0dd6 Fix for Coverity CID 1357152.
Or at least that was the origin of the issue. It turns out
we were freeing the wrong buffer (but as it only happen in the
case of an error we never noticed).
2016-03-24 00:53:30 -04:00
George Bosilca
4b38b6bd0c Fix multiple issues with the collective requests.
This patch addresses most (if not all) @derbeyn concerns
expressed on #1015. I added checks for the requests allocation
in all functions, ompi_coll_base_free_reqs is called with the
right number of requests, I removed the unnecessary basic_module_comm_t
and use the base_module_comm_t instead, I remove all uses of the
COLL_BASE_BCAST_USE_BLOCKING define, and other minor fixes.
2016-03-23 18:35:41 -04:00
Todd Kordenbrock
2122a15217 Merge pull request #1443 from francois-wellenreiter/fix_trig_rndv
MTL portals4 : fix around triggered rndv operations
2016-03-21 08:16:33 -05:00
Ralph Castain
c146c4969b Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
Fixes #1467

Restore debugger attach operations

Fixes #1225
2016-03-18 21:49:04 -07:00
Nathan Hjelm
075dfa4121 topo/treematch: fix component coverity issues
Fix CID 1315298: Resource leak (RESOURCE_LEAK) :
Fix CID 1315300: Resource leak (RESOURCE_LEAK):
Fix CID 1315299: Resource leak (RESOURCE_LEAK):
Fix CID 1315297 (#1 of 1): Resource leak (RESOURCE_LEAK):

Confirmed leaks in error paths. Added the leaked arrays to the
ERR_EXIT macro to ensure they are freed.

Fix CID 1315296 (#1 of 1): Resource leak (RESOURCE_LEAK):

Confirmed leak in error paths. Both the oversub and reqs arrays are
leaked. Free these arrays on error.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 11:31:11 -06:00
Nathan Hjelm
3540b65f7d bcol: fix coverity issues
Fix CID 1269976 (#1 of 1): Unused value (UNUSED_VALUE):
Fix CID 1269979 (#1 of 1): Unused value (UNUSED_VALUE):

Removed unused variables k_temp1 and k_temp2.

Fix CID 1269981 (#1 of 1): Unused value (UNUSED_VALUE):
Fix CID 1269974 (#1 of 1): Unused value (UNUSED_VALUE):

Removed gotos and use the matched flags to decide whether to return.

Fix CID 715755 (#1 of 1): Dereference null return value (NULL_RETURNS):

This was also a leak. The items on cs->ctl_structures are allocated using OBJ_NEW so they mist be released using OBJ_RELEASE not OBJ_DESTRUCT. Replaced the loop with OPAL_LIST_DESTRUCT().

Fix CID 715776 (#1 of 1): Dereference before null check (REVERSE_INULL):

Rework error path to remove REVERSE_INULL. Also added a free to an error path where it was missing.

Fix CID 1196603 (#1 of 1): Bad bit shift operation (BAD_SHIFT):
Fix CID 1196601 (#1 of 1): Bad bit shift operation (BAD_SHIFT):

Both of these are false positives but it is still worthwhile to fix so they no longer appear. The loop conditional has been updated to use radix_mask_pow instead of radix_mask to quiet these issues.

Fix CID 1269804 (#1 of 1): Argument cannot be negative (NEGATIVE_RETURNS):

In general close (-1) is safe but coverity doesn’t like it. Reworked the error path for open to not try to close (-1).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 10:59:46 -06:00
Nathan Hjelm
c8b077f232 coll/ml: fix coverity issues
Fix CID 715744 (#1 of 1): Logically dead code (DEADCODE):
Fix CID 715745 (#1 of 1): Logically dead code (DEADCODE):

The free of scratch_num in either place is defensive programming. Instead of removing the free the conditional around the free has been removed to quiet the warning.

Fix CID 715753 (#1 of 1): Dereference after null check (FORWARD_NULL):
Fix CID 715778 (#1 of 1): Dereference before null check (REVERSE_INULL):

Fixed the conditional to check for collective_alg != NULL instead of collective_alg->functions != NULL.

Fix CID 715749 (#1 of 4): Explicit null dereferenced (FORWARD_NULL):

Updated code to ensure that none of the parse functions are reached with a non-NULL value.

Fix CID 715746 (#1 of 1): Logically dead code (DEADCODE):

Removed dead code.

Fix CID 715768 (#1 of 1): Resource leak (RESOURCE_LEAK):
Fix CID 715769 (#2 of 2): Resource leak (RESOURCE_LEAK):
Fix CID 715772 (#1 of 1): Resource leak (RESOURCE_LEAK):

Move free calls to before error checks to cleanup leak in error paths.

Fix CID 741334 (#1 of 1): Explicit null dereferenced (FORWARD_NULL):

Added a check to ensure temp is not dereferenced if it is NULL.

Fix CID 1196605 (#1 of 1): Bad bit shift operation (BAD_SHIFT):

Fixed overflow in calculation by replacing int mask with 1ul.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 10:11:16 -06:00
Nathan Hjelm
2f4e5325aa coll/base: fix coverity issues
Fix CID 1325868 (#1 of 1): Dereference after null check (FORWARD_NULL):
Fix CID 1325869 (#1-2 of 2): Dereference after null check (FORWARD_NULL):

Here reqs can indeed be NULL. Added a check to
ompi_coll_base_free_reqs to prevent dereferencing NULL pointer.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 09:31:43 -06:00
Nathan Hjelm
2ed4501490 osc: fix coverity issues
Fix CID 1324726 (#1 of 1): Free of address-of expression (BAD_FREE):

Indeed, if a lock conflicts with the lock_all we will end up trying to
free an invalid pointer.

Fix CID 1328826 (#1 of 1): Dereference after null check (FORWARD_NULL):

This was intentional but it would be a good idea to check for
module->comm being non_NULL to be safe. Also cleaned out some checks
for NULL before free().

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 09:11:48 -06:00
Nathan Hjelm
ec9712050b Merge pull request #1118 from hjelmn/mpool_rewrite
mpool/rcache rewrite
2016-03-15 10:46:24 -06:00
Nathan Hjelm
deae9e52bf Merge pull request #1259 from kawashima-fj/pr/osc-sm-align
osc/sm: Fix a bus error on MPI_WIN_{POST,START}.
2016-03-15 09:13:38 -06:00
Francois WELLENREITER
2bc432d95f MTL portals4 : fix around triggered rndv operations 2016-03-15 15:31:04 +01:00
Nathan Hjelm
d4afb16f5a opal: rework mpool and rcache frameworks
This commit rewrites both the mpool and rcache frameworks. Summary of
changes:

 - Before this change a significant portion of the rcache
   functionality lived in mpool components. This meant that it was
   impossible to add a new memory pool to use with rdma networks
   (ugni, openib, etc) without duplicating the functionality of an
   existing mpool component. All the registration functionality has
   been removed from the mpool and placed in the rcache framework.

 - All registration cache mpools components (udreg, grdma, gpusm,
   rgpusm) have been changed to rcache components. rcaches are
   allocated and released in the same way mpool components were.

 - It is now valid to pass NULL as the resources argument when
   creating an rcache. At this time the gpusm and rgpusm components
   support this. All other rcache components require non-NULL
   resources.

 - A new mpool component has been added: hugepage. This component
   supports huge page allocations on linux.

 - Memory pools are now allocated using "hints". Each mpool component
   is queried with the hints and returns a priority. The current hints
   supported are NULL (uses posix_memalign/malloc), page_size=x (huge
   page mpool), and mpool=x.

 - The sm mpool has been moved to common/sm. This reflects that the sm
   mpool is specialized and not meant for any general
   allocations. This mpool may be moved back into the mpool framework
   if there is any objection.

 - The opal_free_list_init arguments have been updated. The unused0
   argument is not used to pass in the registration cache module. The
   mpool registration flags are now rcache registration flags.

 - All components have been updated to make use of the new framework
   interfaces.

As this commit makes significant changes to both the mpool and rcache
frameworks both versions have been bumped to 3.0.0.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-14 10:50:41 -06:00
Gilles Gouaillardet
fbed6df4a3 coll/base: fix a typo
typo was introduced in open-mpi/ompi@c98e97a46e
2016-03-11 14:18:03 +09:00
Aurélien Bouteiller
c98e97a46e Do not return MPI_ERR_PENDING from collectives. 2016-03-09 16:13:34 -05:00
Joshua Ladd
4dffae2f88 Fixing MXM Yalla and MTL add procs behavior. MXM cannot support dynamic add procs, so propaget this info to the MTL and PML layers. 2016-03-08 01:46:24 +02:00
Aurélien Bouteiller
892e1ed57e Fix a potential race condition in which a progress matching thread could match a request while we are cancelling it. 2016-03-01 16:43:45 -05:00
Gilles Gouaillardet
8aff67c399 topo/base: correctly support MPI_UNWEIGHTED in mca_topo_base_dist_graph_neighbors()
Thanks Jun Kudo for the bug report.
2016-03-01 10:28:28 +09:00
George Bosilca
dbe93b0b19 Use mca_bml_base_get_endpoint
Correctly use mca_bml_base_get_endpoint instead of accessing the
endpoint directly.
2016-02-25 11:00:30 -06:00
Sylvain Jeaugey
5f32f49eb8 pml/ob1: Fix segmentation fault on CUDA path.
Fix segfault due to mca_pml_ob1_cuda_need_buffers not handling the case of the
endpoint not being there. Calling mca_bml_get_endpoint() seems to fix the problem.

Fixes open-mpi/ompi#1402
2016-02-24 21:32:25 -08:00
Edgar Gabriel
45003ef78d fix the data size counter for large ops for the static fcoll component 2016-02-23 08:33:50 -06:00
yohann
59b6d041f8 mtl/ofi: Check allocated pointer. 2016-02-19 16:59:47 -08:00
yohann
bd47062764 mtl/ofi: Fix error handling. 2016-02-19 16:58:41 -08:00
yohann
404987e9b3 mtl/ofi: Fix mismatching types. 2016-02-19 16:57:26 -08:00
yohann
3ad59435ce mtl/ofi: Prevent possible memory leak. 2016-02-19 16:57:02 -08:00
Edgar Gabriel
92d1b99468 optimize the shuffle step:
1. use communicator collectives if possible for performance reasons
 2. combined multiple allgathers into a single one
2016-02-19 11:04:04 -06:00
Edgar Gabriel
e63836c653 clean up the mca parameter handling of the component. Add new parameters for number of sub groups and write chunk size. This will allow to perform a systematic parameter study. 2016-02-19 10:15:28 -06:00
Edgar Gabriel
4f400314e0 add the dynamic_gen2 component into the fcoll selection table. 2016-02-19 09:32:54 -06:00
Edgar Gabriel
268d525053 change the tag to be a positive value. handle 0-byte situations correctly. 2016-02-19 08:28:50 -06:00
Edgar Gabriel
ad79012059 first cut on the version which overlaps the communication/computation of 2 iterations. 2016-02-19 08:28:50 -06:00
Ralph Castain
60a7bc2e50 Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
Fixes ##1225
2016-02-18 09:29:12 -08:00
yohann
7fe395c82a mtl/ofi: cleanup 2016-02-16 09:57:57 -08:00
yohann
22eddfee10 mtl/ofi: update copyright dates. 2016-02-16 09:56:09 -08:00
George Bosilca
68c36ea9dc Fix two annoying warnings in our UCX support. 2016-02-14 00:02:16 -05:00
yohann
67ce4a080a mtl/ofi: FI_AV_MAP support only. 2016-02-12 10:06:52 -08:00
yohann
b3d8ead76e mtl/ofi: Fix dynamic add_procs. 2016-02-12 10:05:52 -08:00
Gilles Gouaillardet
b55b9e6aee sentinel: fix sentinel to proc_name conversion
converting an opal_process_name_t means the loss of one bit,
it was decided to restrict the local job id to 15 bits, so the
useful information of an opal_process_name_t can fit in 63 bits.
2016-02-10 15:44:07 +09:00
Gilles Gouaillardet
030a5f2054 sentinel: use type uintptr_t for sentinel
MSB is now automatically cleared when right shifting
Thanks George for pointing this
2016-02-10 11:28:56 +09:00
George Bosilca
7c574a3530 Typo. 2016-02-07 07:22:22 +02:00
Nathan Hjelm
5b9c82a964 osc/pt2pt: bug fixes
This commit fixes several bugs identified by @ggouaillardet and MTT:

 - Fix SEGV in long send completion caused by missing update to the
   request callback data.

 - Add an MPI_Barrier to the fence short-cut. This fixes potential
   semantic issues where messages may be received before fence is
   reached.

 - Ensure fragments are flushed when using request-based RMA. This
   allows MPI_Test/MPI_Wait/etc to work as expected.

 - Restore the tag space back to 16-bits. It was intended that the
   space be expanded to 32-bits but the required change to the
   fragment headers was not committed. The tag space may be expanded
   in a later commit.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-04 16:59:39 -07:00
Gilles Gouaillardet
6eac6a8b00 osc/sm: create datafile into the per proc directory in order to make it unique per communicator
Thanks Peter Wind for the report
2016-02-03 10:12:37 +09:00
Nathan Hjelm
519fffb65e osc/pt2pt: eager sends are always active if MPI_MODE_NOCHECK is used
This commit fixes open-mpi/ompi#1299.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-02 12:44:17 -07:00
Nathan Hjelm
d7264aa613 osc/pt2pt: various threading fixes
This commit fixes several bugs identified by a new multi-threaded RMA
benchmarking suite. The following bugs have been identified and fixed:

 - The code that signaled the actual start of an access epoch changed
   the eager_send_active flag on a synchronization object without
   holding the object's lock. This could cause another thread waiting
   on eager sends to block indefinitely because the entirety of
   ompi_osc_pt2pt_sync_expected could exectute between the check of
   eager_send_active and the conditon wait of
   ompi_osc_pt2pt_sync_wait.

 - The bookkeeping of fragments could get screwed up when performing
   long put/accumulate operations from different threads. This was
   caused by the fragment flush code at the end of both put and
   accumulate. This code was put in place to avoid sending a large
   number of unexpected messages to a peer. To fix the bookkeeping
   issue we now 1) wait for eager sends to be active before stating
   any large isend's, and 2) keep track of the number of large isends
   associated with a fragment. If the number of large isends reaches
   32 the active fragment is flushed.

 - Use atomics to update the large receive/send tag counters. This
   prevents duplicate tags from being used. The tag space has also
   been updated to use the entire 16-bits of the tag space.

These changes should also fix open-mpi/ompi#1299.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-02 12:33:33 -07:00
Edgar Gabriel
3f7fff5780 Merge pull request #1331 from edgargabriel/solaris-statfs-fix
Solaris statfs fix
2016-01-28 20:16:33 -06:00
Nathan Hjelm
a19c265ab5 osc/rdma: fix typo in ompi_osc_rdma_complete_atomic
The typo caused SEGVs on systems with only fetching atomic
support.

Fixes open-mpi/ompi#1329

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-26 15:44:07 -07:00