
2226 Commits

Author SHA1 Message Date
Howard Pritchard
6e7345c790 pmix/cray: more stubs plus a get_version method
Add more stubs to reduce likelihood of future
mysterious segfaults if some of the newer pmix
funcs start to get used within ompi.

Add a get_version to return the version of the
Cray PMI library being used, since the Cray PMI
library actually has a function to get that info.

Be more accurate about which functions have a hope
of being implemented using Cray PMI and those which
never will.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-03 12:51:50 -07:00
Ralph Castain
a772b46c15 Bring the MPI_Publish and friends online 2015-09-02 12:04:07 -07:00
matcabral
1f9218a0bc Fix for openib btl mca command line parameter btl_openib_mtu being
ignored.
2015-09-02 02:22:30 -07:00
Ralph Castain
95dbd70f44 Sync to PMIx 1.1, sha- 51479b0 2015-09-01 14:09:25 -07:00
Rolf vandeVaart
30b1a6e003 Merge pull request #836 from rolfv/pr/fix-cuda-war
Add config code to check for need of workaround. Add runtime way to turn it off just in case.
2015-09-01 15:05:29 -04:00
Nathan Hjelm
f926796e57 Merge pull request #828 from hjelmn/openib_thread_fix
openib thread fixes
2015-09-01 09:12:50 -06:00
rhc54
d8cb3fe705 Merge pull request #852 from rhc54/topic/pmix
Sync to PMIx tarball - includes:
2015-09-01 06:54:34 -07:00
Gilles Gouaillardet
6dfa996760 configury: fix a typo in opal/mca/pmix/pmix1xx/configure.m4 2015-09-01 14:59:07 +09:00
Ralph Castain
c1bbd7bc78 Sync to PMIx tarball - includes:
* update to configury to silence ident messages (thanks Gilles!)
* fix for warnings Jeff saw when get didn't find the requested data
* fix for Mac OSX operations
2015-08-31 21:51:02 -07:00
rhc54
2d3c6af8ad Merge pull request #851 from rhc54/topic/copyfix
Only copy the value across if the "get" operation succeeded
2015-08-31 19:51:13 -07:00
Ralph Castain
ef69958e01 Only copy the value across if the "get" operation succeeded 2015-08-31 17:11:26 -07:00
Jeff Squyres
8558458bb9 usnic: adjust for new PMIX argument type 2015-08-31 14:55:58 -07:00
Rolf vandeVaart
54ab0d1a51 Add config code to check for need of workaround. Add runtime way to turn it off just in case 2015-08-31 17:18:47 -04:00
rhc54
6e78e2c89b Merge pull request #846 from rhc54/topic/pmix
Sync to PMIx tarball
2015-08-31 08:53:07 -07:00
Nathan Hjelm
2aab6ad90f Merge pull request #827 from hjelmn/recursive_locks
Add support for recursive locks (revisited)
2015-08-31 07:52:23 -07:00
Ralph Castain
a3842af709 Sync to PMIx tarball 2015-08-31 07:47:46 -07:00
Ralph Castain
bcabd1e282 Sync with PMIx tarball, bringing across the warning fixes pointed out by Gilles 2015-08-30 21:13:55 -07:00
Gilles Gouaillardet
7e6a213465 pmix: fix compilation error
compilation failed because of missing prototypes when configure'd with --enable-debug --enable-picky on a CentOS 7 box
2015-08-31 10:33:13 +09:00
rhc54
51a8a0f5d7 Merge pull request #842 from rhc54/topic/smfix
Fix shared memory operations by resolving local peers
2015-08-30 14:49:43 -07:00
Ralph Castain
b0d7564400 Sync to PMIx 1.1 - do not check pmix version when making connections 2015-08-30 12:15:30 -07:00
Ralph Castain
38ba54366c Fix shared memory operations by resolving local peers 2015-08-30 12:07:14 -07:00
Ralph Castain
0d5814b5ca Cleanup Coverity issues 2015-08-29 21:19:27 -07:00
Ralph Castain
3cab860a01 Some cleanups - still some errors that impact shared memory operations 2015-08-29 18:11:11 -07:00
Ralph Castain
1d71037139 Update some APIs 2015-08-29 17:26:32 -07:00
Ralph Castain
79827ceaa8 Remove stale directory 2015-08-29 17:15:17 -07:00
Ralph Castain
cf6137b530 Integrate PMIx 1.0 with OMPI.
Bring Slurm PMI-1 component online
Bring the s2 component online

Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.

Bring the OMPI pubsub/pmi component online

Get comm_spawn working again

Ensure we always provide a cpuset, even if it is NULL

pmix/cray: adjust cray pmix component for pmix

Make changes so cray pmix can work within the integrated
ompi/pmix framework.

Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet

Cleanup comm_spawn - procs now starting, error in connect_accept

Complete integration
2015-08-29 16:04:10 -07:00
Nathan Hjelm
1d56007ab1 rcache/vma: make rcache lock recursive
There is currently a path through the grdma mpool and vma rcache that
leads to deadlock. It happens during the rcache insert. Before the
insert the rcache mutex is locked. During the call a new vma item is
allocated and then inserted into the rcache tree. The allocation
currently goes through the malloc hooks which may (and does) call back
into the mpool if the ptmalloc heap needs to be reallocated. This
callback tries to lock the rcache mutex which leads to the
deadlock. This has been observed with multi-threaded tests and the
openib btl.

This change may lead to some minor slowdown in the rcache vma when
threading is enabled. This will only affect larger message paths in
some of the btls.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-26 10:01:37 -06:00
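
The deadlock described in the rcache/vma commit above comes from one thread re-acquiring a lock it already holds. As a minimal sketch of why a recursive lock avoids it, the Python snippet below uses a toy cache; `ToyCache`, `allocator_hook`, and `can_reenter` are hypothetical names for illustration only and are not part of Open MPI.

```python
import threading


def can_reenter(lock):
    """Return True if the thread holding `lock` can acquire it a second time."""
    lock.acquire()
    try:
        # Non-blocking attempt, so a plain lock reports failure instead of hanging.
        got_it = lock.acquire(blocking=False)
        if got_it:
            lock.release()
        return got_it
    finally:
        lock.release()


class ToyCache:
    """Hypothetical stand-in for a cache whose insert path can call back into itself."""

    def __init__(self):
        self.lock = threading.RLock()  # recursive: the same thread may re-lock
        self.items = []
        self.heap_needs_growth = True

    def allocator_hook(self):
        # Plays the role of the allocation hook calling back into the cache.
        with self.lock:
            self.heap_needs_growth = False

    def insert(self, key):
        with self.lock:
            if self.heap_needs_growth:
                self.allocator_hook()  # second acquisition on the same thread
            self.items.append(key)


cache = ToyCache()
cache.insert("buf0")
print(can_reenter(threading.Lock()), can_reenter(threading.RLock()))
```

With a non-recursive lock in `ToyCache`, the nested acquisition in `allocator_hook` would block forever, which mirrors the callback path the commit message describes.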
Nathan Hjelm
f451876058 Merge pull request #825 from hjelmn/white_space_purge
periodic trailing whitespace purge
2015-08-25 19:23:52 -06:00
Nathan Hjelm
64e4419d76 btl/openib: allow the use of the openib btl in thread multiple
There were several issues preventing the openib btl from running in
thread multiple mode:

 - Missing locks in UDCM when generating a loopback endpoint. Fixed in
   open-mpi/ompi@8205d79819.

 - Incorrect sequence numbers generated in debug mode. This did not
   prevent the openib btl from running but instead produced incorrect
   error messages in debug builds.

 - Recursive locking of the rcache lock caused by the malloc
   hooks. This is fixed by open-mpi/ompi#827

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 16:04:52 -06:00
Nathan Hjelm
c101385f64 btl/openib: fix sequence number generation for debug mode
When using eager RDMA in debug builds the openib btl generates a
sequence number for each send. The code independently updated the head
index and the sequence number for the eager rdma transaction. If
multiple threads enter this code at the same time and run in the
following order:

thread 1: update sequence (0 -> 1)
thread 2: update sequence (1 -> 2)
thread 2: update head (0 -> 1)
thread 1: update head (1 -> 2)

the sequence number for head[0] gets 1 and the sequence number for
head[1] gets 0. The fix is to generate the sequence number from the
head index.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 16:00:06 -06:00
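
The interleaving listed in the commit above can be replayed deterministically. The sketch below is only an illustration in Python, not the openib code: the function names are hypothetical, and it assumes the sequence number can be taken directly from the claimed head index.

```python
def independent_counters():
    """Replay the four steps from the commit message with two separate counters."""
    seq_counter = 0
    head = 0
    slots = {}

    t1_seq, seq_counter = seq_counter, seq_counter + 1  # thread 1: sequence 0 -> 1
    t2_seq, seq_counter = seq_counter, seq_counter + 1  # thread 2: sequence 1 -> 2
    t2_head, head = head, head + 1                      # thread 2: head 0 -> 1
    t1_head, head = head, head + 1                      # thread 1: head 1 -> 2

    slots[t2_head] = t2_seq
    slots[t1_head] = t1_seq
    return slots


def sequence_from_head():
    """Same head ordering, but each sequence number is derived from the head index."""
    head = 0
    slots = {}

    t2_head, head = head, head + 1  # thread 2 claims slot 0
    t1_head, head = head, head + 1  # thread 1 claims slot 1

    slots[t2_head] = t2_head  # assumption: sequence == claimed head index
    slots[t1_head] = t1_head
    return slots


print(independent_counters())
print(sequence_from_head())
```

Because only one shared value is updated in the second version, the slot and its sequence number cannot drift apart regardless of thread ordering.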
Nathan Hjelm
8205d79819 btl/openib: add missing lock calls
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 12:21:49 -06:00
Nathan Hjelm
156ce6af21 periodic whitespace purge
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 09:32:33 -06:00
Nathan Hjelm
0a968de53f mca/base: use standard verbosity levels
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-17 11:48:06 -06:00
Nathan Hjelm
8cbf743cfa mca/base: standardize MCA verbosity levels
Up until this point we have had inconsistent usage for MCA verbosity
levels. This commit attempts to correct this by recommending
components use these standard levels: none (0), error (1), warn (10),
info (20), debug (40), and trace (60).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-17 11:47:57 -06:00
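
The numeric levels recommended in the commit above can be written down as a small enumeration. This is a sketch in Python for illustration only; the `Verbosity` class and `should_emit` helper are hypothetical names, and the rule that a message is emitted when its level does not exceed the configured verbosity is an assumption of this example.

```python
from enum import IntEnum


class Verbosity(IntEnum):
    """The level values listed in the commit message."""
    NONE = 0
    ERROR = 1
    WARN = 10
    INFO = 20
    DEBUG = 40
    TRACE = 60


def should_emit(configured, message_level):
    # Assumed rule: emit a message when its level is at or below the configured level.
    return message_level <= configured


print(should_emit(Verbosity.INFO, Verbosity.WARN))
print(should_emit(Verbosity.INFO, Verbosity.DEBUG))
```

The gaps between values leave room for a user to pick an intermediate number, such as 25, and still get everything up to `INFO`.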
Gilles Gouaillardet
950b07d17b Merge pull request #809 from ggouaillardet/topic/xrc_runtime_check
btl/openib: remove OFED version runtime check when XRC is used
2015-08-16 10:54:09 +09:00
Gilles Gouaillardet
d02ccd67de btl/openib: remove OFED version runtime check when XRC is used
This test seems broken:
 - some false positives were reported
 - it fails to detect some OFED version mismatches
This commit simply removes the test, which means the application
will likely fail if XRC is used and the OFED version differs
between compile time and run time.
2015-08-14 09:10:03 +09:00
Rolf vandeVaart
cb8c86910e Add static definitions where needed and remove one unused definition 2015-08-13 14:59:07 -04:00
Jeff Squyres
14340770c4 usnic: remove some logically dead code
This code really had no purpose; just assign FI_VERSION(1, 1).  This
fixes CID 1315274.

Also clarify the comment about why we still retain libfabric v1.0.0
compatibility code, even though configure.m4 requires libfabric >= v1.1.0.
2015-08-12 05:21:18 -07:00
Jeff Squyres
7f857034d9 common verbs: check return value of sscanf()
Fixes CID 1304563.
2015-08-12 05:14:58 -07:00
Gilles Gouaillardet
92b2d2ffeb configury: fix libevent configure.ac
fix interleaved messages :
checking for working epoll library interface... checking if epoll can build... yes
yes
2015-08-12 15:37:22 +09:00
Jeff Squyres
3369606c75 Merge pull request #791 from jsquyres/pr/usnic-async-events
usnic: move cchecker to OPAL-wide progress thread
2015-08-11 07:54:07 -04:00
Jeff Squyres
236cf7ff62 usnic: add v1.8/v1.10 compat code
Add compat code so that I can sync master against the v1.10 branch.
2015-08-10 16:27:38 -07:00
Jeff Squyres
7da1c4b875 usnic: avoid race condition in connectivity checker
In short applications, it's possible that the agent (i.e., local rank
0) will finalize after non-local rank 0 procs detect the connectivity
checker named socket, but before they complete a connect() on it.  As
such, their connect() gets ECONNREFUSED.

This commit adds a simple counter in the agent that won't let it quit
before it accept()'s from all local procs, or 10 seconds goes by
(whichever occurs first).  This is similar to the timeout for the
clients: they'll exit if they don't see the expected named socket
within 10 seconds.
2015-08-10 15:40:33 -07:00
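
The "wait for all local procs or 10 seconds, whichever occurs first" rule from the commit above can be sketched as a bounded polling loop. The Python below is an illustration only, not the usnic agent: `wait_for_clients` and its parameters are hypothetical, and the accepted-connection count is supplied by a caller-provided function.

```python
import time


def wait_for_clients(expected, accepted_count, timeout=10.0, poll=0.01):
    """Block until `accepted_count()` reaches `expected` or `timeout` seconds pass.

    Returns True if every expected client was accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while accepted_count() < expected:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


# Usage with a fake counter that reports one more accepted client per call.
state = {"accepted": 0}


def fake_accepted_count():
    state["accepted"] += 1
    return state["accepted"]


print(wait_for_clients(3, fake_accepted_count, timeout=1.0, poll=0.0))
```

Using a monotonic clock keeps the deadline stable even if the wall-clock time is adjusted while the agent is waiting.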
Jeff Squyres
bad508687e usnic: move cchecker to OPAL-wide progress thread
There's no longer any need for the usnic BTL to have its own progress
thread: it can use the opal_progress_thread() infrastructure.  This
commit removes the code to startup/shutdown the usnic-BTL-specific
progress thread and instead, just adds its events to the OPAL-wide
progress thread.

This necessitated a small change in the finalization step.
Previously, we would stop the progress thread and then tear down the
events.  We can no longer stop the progress thread, and if we start
tearing down events, this will cause shutdown/hangups to be sent
across sockets, potentially firing some of the still-remaining events
while some (but not all) of the data structures have been torn down.
Chaos ensues.

Instead, queue up an event to tear down all the pending events.  Since
the progress thread will only fire one event at a time, having a
teardown event means that it can tear down all the pending events
"atomically" and not have to worry that one of those events will get
fired in the middle of the teardown process.
2015-08-10 15:40:33 -07:00
Rolf vandeVaart
95d19af0eb Merge pull request #783 from rolfv/pr/fix-thread-issue
Refs open-mpi/ompi#627. Fix support for multi-threads with CUDA 7.0
2015-08-10 11:13:56 -04:00
Rolf vandeVaart
8cc6bef090 Refs open-mpi/ompi#627. Fix support for multi-threads with CUDA 7.0 2015-08-10 10:22:45 -04:00
Jeff Squyres
9e1e563120 event: remove opal_async_event_base
opal_async_event_base is not used anywhere.  The opal_progress_thread
API should be used instead.
2015-08-07 10:13:41 -07:00
Jeff Squyres
d7c25f683e pmix_native: update to the new opal_progress_thread API 2015-08-07 10:13:40 -07:00
Jeff Squyres
b5c37dbfe2 CSCuv67889: usnic: fix an error corner case
Ensure that we have non-NULL on all levels of pointers, which will
save us if there are exitable errors very early during component /
module initialization.
2015-08-06 10:54:28 -07:00
Jeff Squyres
cbcd16b399 usnic: remove a stale shell variable name 2015-07-31 18:53:54 -07:00