Commit graph

5934 commits

Author SHA1 Message Date
Thananon Patinyasakdikul
60d0fbf683 Removal of ompi_request_lock from pml/ucx. 2016-05-26 12:36:58 -04:00
George Bosilca
90f294096e Remove more references to the request mutex.
Regarding BFO, it should be mentioned that this component is currently
unmaintained, and that despite my efforts I could not make it compile
(it would not compile before this patch either).
2016-05-25 23:27:06 -04:00
Nathan Hjelm
9d439664f0 pml/yalla: update for request changes
This commit brings the pml/yalla component up to date with the request
rework changes.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-25 15:42:53 -06:00
Nathan Hjelm
8445c885ce pml/cm: update for request changes
This fixes a hang caused by the request refactor work. The cm pml was
not updated and was hanging in most cases.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-25 15:35:32 -06:00
Valentin Petrov
5ff6372886 coll/hcoll: bugfix: initialize req_type field
If left uninitialized, a segfault is possible in MPI_Waitall when
the field happens to equal OMPI_REQUEST_GEN.
2016-05-25 15:38:01 +03:00
bosilca
b90c83840f Refactor the request completion (#1422)
* Remodel the request.
Added the wait sync primitive and integrated it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).

* Fix the condition to release the request.
2016-05-24 18:20:51 -05:00
Jeff Squyres
e7d46b96a3 Merge pull request #1680 from yburette/topic/fix_provider_selection
mtl/ofi: Change default provider selection behavior.
2016-05-23 15:06:02 -04:00
Gilles Gouaillardet
bca44592af Merge pull request #1643 from ggouaillardet/topic/romio_openbsd57
io/romio: fix filesystem type check on OpenBSD
2016-05-23 16:33:56 +09:00
Nathan Hjelm
31bfeede82 bml/r2: always add btl progress function
This commit changes the behavior of bml/r2 from conditionally
registering btl progress functions to always registering progress
functions. Any progress function belonging to a btl that is not yet in
use is registered as low-priority. As soon as a proc is added that
will make use of the btl, it is re-registered normally.

This works around an issue with some btls. In order to progress a
first message from an unknown peer both ugni and openib need to have
their progress functions called. If either btl is not in use after the
first call to add_procs the callback was never happening. This commit
ensures the btl progress function is called at some point but the
number of progress callbacks is reduced from normal to ensure lower
overhead when a btl is not used. The current ratio is 1 low priority
progress callback for every 8 calls to opal_progress().

Fixes open-mpi/ompi#1676

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-21 15:54:04 -04:00
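The low-priority scheme described in commit 31bfeede82 can be illustrated with a small sketch. This is not the Open MPI implementation: `progress_engine_t`, `engine_register`, `engine_promote`, `engine_progress`, and `demo_cb` are hypothetical names invented for illustration. Only the 1-in-8 ratio comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define LOW_PRIORITY_RATIO 8   /* ratio quoted in the commit message */
#define MAX_CALLBACKS 16       /* arbitrary limit for this sketch */

typedef int (*progress_cb_t)(void);

/* Hypothetical registry; not Open MPI's actual data structure. */
typedef struct {
    progress_cb_t normal[MAX_CALLBACKS];
    size_t normal_count;
    progress_cb_t low[MAX_CALLBACKS];
    size_t low_count;
    unsigned long calls;       /* number of engine_progress() calls so far */
} progress_engine_t;

/* Register a callback as normal or low priority. Returns 0 on success. */
static int engine_register(progress_engine_t *e, progress_cb_t cb, int low_priority)
{
    progress_cb_t *list = low_priority ? e->low : e->normal;
    size_t *count = low_priority ? &e->low_count : &e->normal_count;

    if (*count >= MAX_CALLBACKS) {
        return -1;
    }
    list[(*count)++] = cb;
    return 0;
}

/* Move a low-priority callback to the normal list (the component is now in use). */
static int engine_promote(progress_engine_t *e, progress_cb_t cb)
{
    for (size_t i = 0; i < e->low_count; i++) {
        if (e->low[i] == cb) {
            if (engine_register(e, cb, 0) != 0) {
                return -1;
            }
            e->low[i] = e->low[--e->low_count];  /* fill the hole with the last entry */
            return 0;
        }
    }
    return -1;
}

/* Call every normal callback; call low-priority ones on every 8th invocation. */
static int engine_progress(progress_engine_t *e)
{
    int events = 0;

    assert(e != NULL);
    for (size_t i = 0; i < e->normal_count; i++) {
        events += e->normal[i]();
    }
    if (++e->calls % LOW_PRIORITY_RATIO == 0) {
        for (size_t i = 0; i < e->low_count; i++) {
            events += e->low[i]();
        }
    }
    return events;
}

/* Demo callback that only counts how often it was invoked. */
static int demo_calls = 0;
static int demo_cb(void) { demo_calls++; return 0; }
```

A callback registered as low priority still runs eventually, so a first message from an unknown peer can be progressed, while an unused component adds only one callback per eight progress calls.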
yohann
2f0cde791a mtl/ofi: Change default provider selection behavior.
As more providers get added to libfabric, the default exclude list would need
to be updated.
Instead, we choose to include only the providers known to work by default.

New default:
  - include: psm,psm2,gni
  - exclude: none
2016-05-19 10:59:25 -07:00
Ralph Castain
a35bb8453a Unlock the mutex prior to destructing it.
Thanks to Nicolas Joly for the report
2016-05-19 10:36:58 -07:00
rhc54
8b534e9897 Merge pull request #1668 from rhc54/topic/slurm
When direct launching applications, we must allow the MPI layer to pr…
2016-05-16 12:23:19 -07:00
Jeff Squyres
5275e5e2a1 bml_r2: use __func__ to identify function names
There were some old/stale function names in some debugging/verbose
opal_output calls.  Use __func__ instead, so that they won't become
stale in the future.

Thanks to Durga Choudhury for pointing out the issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-16 11:06:47 -04:00
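Commit 5275e5e2a1 replaces hard-coded function names in debug output with `__func__`. A minimal sketch of the idea follows, using `snprintf` rather than `opal_output`. `format_verbose`, `VERBOSE`, and `add_endpoint_demo` are hypothetical names, not code from bml/r2.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "<function>: <message>" into buf. Hypothetical helper for this sketch. */
static int format_verbose(char *buf, size_t len, const char *func, const char *msg)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%s: %s", func, msg);
}

/* __func__ expands at the call site, so the name cannot go stale after a rename. */
#define VERBOSE(buf, len, msg) format_verbose((buf), (len), __func__, (msg))

static void add_endpoint_demo(char *buf, size_t len)
{
    VERBOSE(buf, len, "adding endpoint");
}
```

If `add_endpoint_demo` were renamed, the message prefix would follow automatically, which is the property the commit relies on.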
Ralph Castain
01ba861f2a When direct launching applications, we must allow the MPI layer to progress during RTE-level barriers. Neither SLURM nor Cray provide non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread spin in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize cpu utilization.
Update external as well

Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
2016-05-14 16:37:00 -07:00
Aurélien Bouteiller
7f65c2b18e forgot to update copyright in commits 627a89b 4899c89 2016-05-13 11:34:59 -04:00
George Bosilca
37e03e3e5b Don't update req_bytes_received if no bytes were received. 2016-05-12 23:39:32 -04:00
Matias A Cabral
528abff6ae Merge remote-tracking branch 'upstream/master' 2016-05-10 15:42:08 -07:00
Matias A Cabral
d28ee62a96 Update in PSM and PSM2 MTLs to detect entries created by drivers for
Intel TrueScale and Intel OmniPath, and detect a link in ACTIVE state.
This fix addresses the scenario reported in the below OMPI users email,
including the hardware formerly named QLogic IB, now Intel TrueScale. Given the
nature of the PSM/PSM2 mtls this fix applies to OmniPath:
https://www.open-mpi.org/community/lists/users/2016/04/29018.php
2016-05-09 12:08:44 -07:00
Gilles Gouaillardet
0a19337371 coll/base: return MPI_ERR_UNSUPPORTED_OPERATION when coll_base_*_two_procs algo is used on a communicator that does not have exactly two tasks
Thanks Dave Love for the report
2016-05-09 14:18:40 +09:00
Gilles Gouaillardet
b159587325 io/romio: fix filesystem type check on OpenBSD 5.7
check the existence of the f_type field in struct statfs

Thanks Paul Hargrove for the report
2016-05-09 13:54:46 +09:00
Ralph Castain
6b24e2779b Remove stale component - I'm not going to get to it 2016-05-07 04:13:34 -07:00
Edgar Gabriel
def1b95fd7 Merge pull request #1646 from edgargabriel/getview-preallocate-fixes
io/ompio: file_getview and file_preallocate fixes
2016-05-06 11:46:00 -05:00
Edgar Gabriel
e65e189671 io/ompio: fix file size after file_preallocate
Thanks to @dalcini for reporting
Fixes open-mpi/ompi#1633
2016-05-06 08:20:59 -05:00
Edgar Gabriel
d358965134 io/ompio: fix envelope of datatype returned by getview
Thanks to @dalcini for reporting
Fixes open-mpi/ompi#1632
2016-05-06 08:19:48 -05:00
Edgar Gabriel
7c92acaa78 Merge pull request #1637 from edgargabriel/pr/netbsd-compilation-problems
fs/lustre and fs/pvfs2: fix netbsd compilation problems
2016-05-06 08:05:36 -05:00
Gilles Gouaillardet
6c9d65c0ca coll/libnbc: fix MPI_Ireduce_scatter_block for one task communicator
Thanks Lisandro Dalcin for the report

Fixes open-mpi/ompi#248
2016-05-06 09:43:29 +09:00
Ralph Castain
08022d7af1 Some minor cleanups of warnings from gcc 6.0.0. Update s1/s2 pmix to get max_procs as required. 2016-05-05 15:28:13 -07:00
Jeff Squyres
f167be1c91 ompio: always return valid info from FILE_GET_INFO
MPI-3.1 says that even if no info keys are set on the file, we need to
return a new, empty info.

Thanks to Lisandro Dalcin for identifying the issue.

Fixes open-mpi/ompi#1630

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-05 12:03:29 -07:00
Aurélien Bouteiller
4899c89731 Fix a race condition when multiple threads try to create a bml endpoint simultaneously. 2016-05-05 10:49:30 -04:00
Aurélien Bouteiller
627a89bf71 Fix a race condition when multiple threads do the "first send" to an endpoint simultaneously. 2016-05-05 09:04:10 -04:00
Joshua Ladd
4771c9ece6 Merge pull request #1617 from jladd-mlnx/topic/disable-hcoll-barrier-in-finalize-ompi-trunk
HCOLL: fix hang in hcoll barrier called from finalize for MXM/yalla
2016-05-04 10:12:34 -04:00
Edgar Gabriel
78fa8bb2c4 remove some unused variables that can cause compilation problems on netbsd 2016-05-03 10:25:15 -05:00
Todd Kordenbrock
3498bed650 Merge pull request #1555 from shawone/check_reduce_ret
coll-portals4: check return value from reduce kary tree functions
2016-05-03 10:17:23 -05:00
Jeff Squyres
33dd8ca81e osc_rdma_peer: properly include ompi_config.h
Thanks to Paul Hargrove for reporting.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-03 07:39:55 -07:00
Devendar Bureddy
cafd55f18c HCOLL: fix hang in hcoll barrier called from finalize for MXM/yalla
tear down

HCOLL barrier may not complete if HCOLL progress is not called periodically,
which is the case during HCOLL teardown in finalize.
(cherry picked from commit 793244d75dd94d1d5e0243bcccf6d04318750f3f)
2016-05-03 00:49:57 +03:00
Nathan Hjelm
d3d779f6d9 osc/rdma: clear all_sync object when obtaining a lock
This commit fixes a bad synchronization detection bug that occurs when
mixing MPI_Win_fence() and MPI_Win_lock(). If no communication has
occurred in the fence epoch it is safe to just clear the all_sync
object (it was set up by fence).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 15:28:47 -06:00
Jeff Squyres
265e5b9795 Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1
ompi/opal/orte/oshmem/test: max hostname length cleanup
2016-05-02 09:44:18 -04:00
Ralph Castain
6ac7929bd0 Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Nathan Hjelm
7bda3eb2dc osc/rdma: fix global index array calculation
This commit fixes a bug that occurs when ranks are either not mapped
evenly or by something other than core.

Fixes open-mpi/ompi#1599

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-28 19:11:11 -06:00
Nathan Hjelm
f0f3383006 Merge pull request #1590 from hjelmn/thread_multiple
osc/pt2pt: do not drop/reacquire the ompi_request_lock
2016-04-26 16:48:37 -06:00
Nathan Hjelm
34ff6293bd osc/pt2pt: do not drop/reacquire the ompi_request_lock
This lock is now recursive so it is safe to call into the pml without
dropping the lock.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-26 14:19:38 -06:00
George Bosilca
bf190671e9 Make the request lock recursive.
If during the request completion callback we post another request that
completes right away (such as a small send or a match for an unexpected
short message) we will try to complete the second request while holding
the lock for the completion of the first. For performance reasons
(mainly to avoid unlocking and locking the request mutex several times)
we have made the request lock recursive.
2016-04-26 16:16:07 -04:00
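The situation described in commit bf190671e9, where a completion callback completes a second request while the lock is still held, can be sketched with a POSIX recursive mutex. This is an illustration only and assumes plain pthreads rather than Open MPI's own mutex classes. `request_lock`, `request_lock_init`, `complete_request`, and `completed` are hypothetical names.

```c
#define _XOPEN_SOURCE 700   /* expose PTHREAD_MUTEX_RECURSIVE on strict compilers */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t request_lock;
static int completed = 0;   /* number of requests completed so far */

/* Initialize request_lock as a recursive mutex. Returns 0 on success. */
static int request_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0) {
        rc = pthread_mutex_init(&request_lock, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/*
 * Complete a request while holding the lock. A depth greater than zero stands
 * in for a completion callback that posts another request which completes
 * immediately, so the same thread takes the lock again.
 */
static void complete_request(int depth)
{
    pthread_mutex_lock(&request_lock);
    completed++;
    if (depth > 0) {
        complete_request(depth - 1);   /* re-enters with the lock already held */
    }
    pthread_mutex_unlock(&request_lock);
}
```

With a default (non-recursive) mutex the nested `pthread_mutex_lock` call from the same thread would deadlock or fail; the recursive type avoids having to unlock and relock around the callback.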
Nathan Hjelm
c16e639b2f Merge pull request #1563 from hjelmn/ompi_coverity
ompi coverity fixes
2016-04-26 09:17:48 -06:00
Karol Mroz
3322347da9 ompi: fixup hostname max length usage
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-04-25 07:08:23 +02:00
Nathan Hjelm
ae0ffbb67f Merge pull request #1397 from hjelmn/enable_thread_multiple
ompi: always enable MPI_THREAD_MULTIPLE support
2016-04-23 08:40:22 -06:00
Nathan Hjelm
1ff3d3b16b pml/ob1: fix coverity issue
Fix CID 1357978 (1 of 1): Logically dead code (DEADCODE):

Remove duplicate check for NULL == endpoint.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 14:48:13 -06:00
Nathan Hjelm
70533e6d50 fcoll/static: fix coverity issues
Fix CID 72362: Explicit null dereferenced (FORWARD_NULL)

From what I can tell the code @ fcoll_static_file_read_all.c:649
should be setting bytes_per_process[i] to 0 not bytes_per_process.

Fix CID 72361: Explicit null dereferenced (FORWARD_NULL)

Modified check to check for blocklen_per_process non-NULL before
trying to free blocklen_per_process[l]. This is sufficient because
free (NULL) is safe. Also cleaned up the initialization of this and a
couple of other arrays. They were allocated with malloc() then
initialized to 0. Changed to use calloc().

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 14:48:13 -06:00
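The two cleanups described in commit 70533e6d50 (allocating with `calloc()` instead of `malloc()` plus manual zeroing, and relying on `free(NULL)` being a no-op) can be sketched as follows. `alloc_per_process` and `free_per_process` are hypothetical helpers, not functions from fcoll/static.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Allocate an array of nprocs pointers, all initially NULL.
 * calloc() zero-fills the memory; on mainstream platforms an all-zero
 * pointer is NULL, which is the assumption this sketch relies on.
 */
static int **alloc_per_process(size_t nprocs)
{
    return calloc(nprocs, sizeof(int *));
}

/* Free every entry and then the array itself. */
static void free_per_process(int **arr, size_t nprocs)
{
    if (NULL != arr) {                 /* guard the array, not each entry */
        for (size_t i = 0; i < nprocs; i++) {
            free(arr[i]);              /* free(NULL) is a no-op */
        }
        free(arr);
    }
}
```

Only the outer pointer needs a NULL check before indexing; entries that were never allocated stay NULL and are harmless to pass to `free()`.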
Nathan Hjelm
8871bdb2f8 fcoll/two_phase: fix coverity issues
Fix CID 72296: Resource leak (RESOURCE_LEAK):

Changed code to goto exit instead of returning to ensure memory is
freed.

Fix CID 712589: Out-of-bounds read (OVERRUN):

In this loop i and j are identical and always less than
iov_count. The CID was triggered because i was incremented if i was <
iov_count. This meant that if the loop did go on the next iteration
would access an invalid index.

Fix CID 741363: Uninitialized scalar variable (UNINIT):

Allocate tmp_len with calloc to ensure every index is initialized.

Fix CID 741364: Uninitialized pointer read (UNINIT):

Allocate recv_types with calloc to ensure all indices are always
initialized. Also added a check to not loop and destroy if recv_types
is NULL.

Also added a NULL check on the allocation of decoded iov. This is not
the cause of CID 126784 but should be fixed.

Fix CID 712588: Out-of-bounds read (OVERRUN):

Similar to CID 712589. Should silence the issue.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 14:47:41 -06:00
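The resource-leak fix in commit 8871bdb2f8 (jumping to a common exit label instead of returning early) follows a common C cleanup pattern. A minimal sketch is shown below. `process_buffers` and its `fail_midway` parameter are hypothetical and exist only to demonstrate the error path.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Allocate two work buffers and release both on every path.
 * Returns 0 on success, -1 on allocation failure, -2 on a simulated error.
 */
static int process_buffers(size_t n, int fail_midway)
{
    int ret = 0;
    int *lengths = NULL;
    int *offsets = NULL;

    lengths = calloc(n, sizeof *lengths);
    if (NULL == lengths) {
        ret = -1;
        goto exit;
    }

    offsets = calloc(n, sizeof *offsets);
    if (NULL == offsets) {
        ret = -1;
        goto exit;
    }

    if (fail_midway) {
        ret = -2;          /* an early "return -2" here would leak both buffers */
        goto exit;
    }

exit:
    free(offsets);         /* safe even if the allocation never happened */
    free(lengths);
    return ret;
}
```

Because every path leaves through the same label, adding a new error check later cannot accidentally skip the cleanup.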
Valentin Petrov
21f1c572c0 Adds mapping to hcoll complex dte 2016-04-19 14:14:28 +03:00
Nicolas Chevalier
c86d4035d2 coll-portals4: check return value from reduce kary tree functions 2016-04-18 12:02:30 +00:00