1
1
Граф коммитов

2145 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
1d56007ab1 rcache/vma: make rcache lock recursive
There is currently a path through the grdma mpool and vma rcache that
leads to deadlock. It happens during the rcache insert. Before the
insert the rcache mutex is locked. During the call a new vma item is
allocated and then inserted into the rcache tree. The allocation
currently goes through the malloc hooks which may (and does) call back
into the mpool if the ptmalloc heap needs to be reallocated. This
callback tries to lock the rcache mutex which leads to the
deadlock. This has been observed with multi-threaded tests and the
openib btl.

This change may lead to some minor slowdown in the rcache vma when
threading is enabled. This will only affect larger message paths in
some of the btls.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-26 10:01:37 -06:00
Nathan Hjelm
0a968de53f mca/base: use standard verbosity levels
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-17 11:48:06 -06:00
Nathan Hjelm
8cbf743cfa mca/base: standize MCA verbosity levels
Up until this point we have had inconsistent usage for MCA verbosity
levels. This commit attempts to correct this by recommending
components use these standard levels: none (0), error (1), warn (10),
info (20), debug (40), and trace (60).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-17 11:47:57 -06:00
Gilles Gouaillardet
950b07d17b Merge pull request #809 from ggouaillardet/topic/xrc_runtime_check
btl/openib: remove OFED version runtime check when XRC is used
2015-08-16 10:54:09 +09:00
Gilles Gouaillardet
d02ccd67de btl/openib: remove OFED version runtime check when XRC is used
this test seems broken :
 - some false positive were reported
 - it fails to detect some OFED version mismatch
this commit simply removes this test, which means the application
will likely fail if XRC is used ad OFED version is different
between compile time and runtime
2015-08-14 09:10:03 +09:00
Rolf vandeVaart
cb8c86910e Add static definitions where needed and remove one unused definition 2015-08-13 14:59:07 -04:00
Jeff Squyres
14340770c4 usnic: remove some logically dead code
This code really had no purpose; just assign FI_VERSION(1, 1).  This
fixes CID 1315274.

Also clarify the commet about why we still retain libfabric v1.0.0
compatibility code, even though configure.m4 requires libfabric >= v1.1.0.
2015-08-12 05:21:18 -07:00
Jeff Squyres
7f857034d9 common verbs: check return value of sscanf()
Fixes CID 1304563.
2015-08-12 05:14:58 -07:00
Gilles Gouaillardet
92b2d2ffeb configury: fix libevent configure.ac
fix interleaved messages :
checking for working epoll library interface... checking if epoll can build... yes
yes
2015-08-12 15:37:22 +09:00
Jeff Squyres
3369606c75 Merge pull request #791 from jsquyres/pr/usnic-async-events
usnic: move cchecker to OPAL-wide progress thread
2015-08-11 07:54:07 -04:00
Jeff Squyres
236cf7ff62 usnic: add v1.8/v1.10 compat code
Add compat code so that I can sync master against the v1.10 branch.
2015-08-10 16:27:38 -07:00
Jeff Squyres
7da1c4b875 usnic: avoid race condition in connectivity checker
In short applications, it's possible that the agent (i.e., local rank
0) will finalize after non-local rank 0 procs detect the connectivity
checker named socket, but before they complete a connect() on it.  As
such, their connect() gets ECONNREFUSED.

This commit adds a simple counter in the agent that won't let it quit
before it accept()'s from all local procs, or 10 seconds goes by
(whichever occurs first).  This is similar to the timeout for the
clients: they'll exit if they don't see the expected named socket
within 10 seconds.
2015-08-10 15:40:33 -07:00
Jeff Squyres
bad508687e usnic: move cchecker to OPAL-wide progress thread
There's no longer any need for the usnic BTL to have its own progress
thread: it can use the opal_progress_thread() infrastructure.  This
commit removes the code to startup/shutdown the usnic-BTL-specific
progress thread and instead, just adds its events to the OPAL-wide
progress thread.

This necessitated a small change in the finalization step.
Previously, we would stop the progress thread and then tear down the
events.  We can no longer stop the progress thread, and if we start
tearing down events, this will cause shutdown/hangups to be sent
across sockets, potentially firing some of the still-remaining events
while some (but not all) of the data structures have been torn down.
Chaos ensues.

Instead, queue up an event to tear down all the pending events.  Since
the progress thread will only fire one event at a time, having a
teardown event means that it can tear down all the pending events
"atomically" and not have to worry that one of those events will get
fired in the middle of the teardown process.
2015-08-10 15:40:33 -07:00
Rolf vandeVaart
95d19af0eb Merge pull request #783 from rolfv/pr/fix-thread-issue
Refs open-mpi/ompi#627. Fix support for multi-threads with CUDA 7.0
2015-08-10 11:13:56 -04:00
Rolf vandeVaart
8cc6bef090 Refs open-mpi/ompi#627. Fix support for multi-threads with CUDA 7.0 2015-08-10 10:22:45 -04:00
Jeff Squyres
9e1e563120 event: remove opal_async_event_base
opal_async_event_base is not used anywhere.  The opal_progress_thread
API should be used instead.
2015-08-07 10:13:41 -07:00
Jeff Squyres
d7c25f683e pmix_native: update to the new opal_progress_thread API 2015-08-07 10:13:40 -07:00
Jeff Squyres
b5c37dbfe2 CSCuv67889: usnic: fix an error corner case
Ensure that we have non-NULL on all levels of pointers, which will
save us if there are exitable errors very early during component /
module initialization.
2015-08-06 10:54:28 -07:00
Jeff Squyres
cbcd16b399 usnic: remove a stale shell variable name 2015-07-31 18:53:54 -07:00
Jeff Squyres
0ee8295e6e usnic: ensure that we have libfabric >= v1.1 2015-07-31 18:53:54 -07:00
rhc54
c6cc1a9707 Merge pull request #766 from rhc54/topic/hwloc
Update x86_32 cpuid assembly code.
2015-07-31 12:53:16 -07:00
Jeff Squyres
2e7f794aae usnic: convert to use fi_recvmsg / FI_MORE
Minor optimization to post 16 receive buffers at a time (vs. 1).
2015-07-31 12:45:40 -07:00
Ralph Castain
b42545b0cb Update x86_32 cpuid assembly code. Cheery-picked from
open-mpi/hwloc@40f9978bcc
2015-07-31 11:40:38 -07:00
George Bosilca
c03b3b135c Don't allow multiple pvar with the same pvar_index.
Fix Cisco copyright.
2015-07-25 15:57:50 -04:00
Guillaume Papauré
98b6d65385 avoid use of non initialized variable 2015-07-25 15:29:32 -04:00
Rolf vandeVaart
1f32fa21ae Fix arguments to error message, remove tabs and trailing spaces 2015-07-23 10:02:45 -04:00
Rolf vandeVaart
773b509407 Merge pull request #737 from rolfv/pr/add-cuda-war
Add a workaroud for issue in libcuda.so library
2015-07-22 16:14:14 -04:00
Rolf vandeVaart
7703c96496 Add a workaroud for issue in libcuda.so library 2015-07-22 11:35:27 -04:00
Jeff Squyres
ec3a38384f Merge pull request #688 from jsquyres/pr/usnic-libfabric-msg-prefix-fix
usnic fixes for differences between libfabric v1.0.0 and v1.1.0
2015-07-21 10:18:36 -04:00
Gilles Gouaillardet
f7cf7d5070 configury: fix XRC detection on OFED < 3.12
since ibv_create_xrc_rcv_qp is now deprecated, and in order to
be "future-proof", we have to consider the case in which only XRC Domains are supported.
also, correctly handle distro that ship broken ibverbs devel headers

Thanks Paul Hargrove for the detailled report.
2015-07-13 10:43:22 +09:00
Ralph Castain
219c4dfba5 Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one. 2015-07-12 08:23:34 -07:00
Ralph Castain
683efcb850 Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename. 2015-07-11 10:08:19 -07:00
rhc54
053d9b2a7c Merge pull request #713 from rhc54/topic/errhandler
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:58:57 -07:00
Ralph Castain
a2243dcddd Add an opal/errhandler so opal-level errors can be up-leveled 2015-07-11 07:09:11 -07:00
Ralph Castain
61fb067f14 Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base 2015-07-11 06:42:23 -07:00
Jeff Squyres
633da6641e usnic: gracefully handle when we can't alloc an ACK
The comment didn't match the debugging code (which was ugly, and
apparently never happens, anyway).  Just return and let the sender
retransmit.
2015-07-10 14:19:33 -07:00
Jeff Squyres
3327fa56b5 usnic: minor code cleanups 2015-07-10 10:10:43 -07:00
Jeff Squyres
f9c65a701e usnic: "sin" assignment needs to be outside the #if
The "sin" variable is used below; need to ensure that it is assigned
for all builds (not just debug builds).
2015-07-10 06:51:03 -07:00
Jeff Squyres
cd87c8ad41 usnic: misc compiler warnings fixes 2015-07-10 06:51:03 -07:00
Jeff Squyres
ba429dc890 usnic: temporarily disable the BTL put method
The usnic BTL put method is currently broken.  Disable it until we can
fix it properly.
2015-07-10 06:51:03 -07:00
Jeff Squyres
f265358fbe usnic: handle FI_MSG_PREFIX differences libfabric v1.0.0->v1.1.0
In libfabric v1.0.0 (i.e., API v1.0), the usnic provider handled
FI_MSG_PREFIX inconsistently between sends and receives.  This has
been fixed in libfabric v1.1.0 (i.e., API v1.1): FI_MSG_PREFIX is
handled consistently for both sends and receives.

Run-time detect which libfabric we are running with and adapt behavior
appropriately.
2015-07-10 06:51:03 -07:00
Jeff Squyres
ddd0de6cfc usnic: make more OS-bypass memory Valgrind-defined
This helps reduce false positives when running MPI apps through
Valgrind.
2015-07-10 06:51:03 -07:00
Jeff Squyres
9bc7a54e0c usnic: correctly count CRC errors
Handle the differences between libfabric v1.0.0 and v1.1.0 in the
return value of fi_cq_readerr().

Also consolidate CRC and truncation errors into the same handling
block, since truncation errors are typically another symptom of CRC
errors.  This ensures that buffers get reposted properly.
2015-07-10 06:51:03 -07:00
Jeff Squyres
fc686f5538 usnic: make configure complain if libfabric cannot be found
Instead of silently determining that the usnic BTL can't be built,
announce that usnic is checking for libfabric support, and then
AC_MSG_RESULT the result of that check.
2015-07-10 06:45:33 -07:00
Jeff Squyres
4341639a66 Revert "configury: fix (again) XRC detection on OFED < 3.12"
@ggouaillardet is likely offline for the weekend, but master is broken
on RHEL 6.5 systems that do not have MOFED installed.  So I'm taking
the liberty of revering this commit; I'm guessing Gilles will fixup
and re-commit next week.

This reverts commit 77f8282d51.
2015-07-10 06:45:33 -07:00
Gilles Gouaillardet
77f8282d51 configury: fix (again) XRC detection on OFED < 3.12
since ibv_create_xrc_rcv_qp is now deprecated, and in order to
be "future-proof", we have to consider the case in which only XRC Domains are supported.

Thanks Paul Hargrove for the detailled report.
2015-07-10 15:31:45 +09:00
Rolf vandeVaart
ae0f3cfee7 Make explicit call to initalize MCA parameters in common CUDA code. This allows us to view them with ompi_info and possibly modify with tools interface 2015-07-09 12:51:55 -04:00
Rolf vandeVaart
cdffa4724d Force smcuda BTL to use CUDA IPC path for all GPU buffers where possible 2015-07-08 17:11:25 -04:00
Ralph Castain
ed93154e43 Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line. 2015-07-07 12:52:16 -07:00
Gilles Gouaillardet
9f171de412 btl/openib: queue pending fragments once only when running out of credit
Fixes open-mpi/ompi#640
2015-07-06 09:45:01 +09:00