openmpi

Author	SHA1	Message	Date
Nathan Hjelm	64e4419d76	btl/openib: allow the use of the openib btl in thread muliple There were several issues preventing the openib btl from running in thread multiple mode: - Missing locks in UDCM when generating a loopback endpoint. Fixed in open-mpi/ompi@8205d79819. - Incorrect sequence numbers generated in debug mode. This did not prevent the openib btl from running but instead produced incorrect error messages in debug builds. - Recursive locking of the rcache lock caused by the malloc hooks. This is fixed by open-mpi/ompi#827 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 16:04:52 -06:00
Nathan Hjelm	c101385f64	btl/openib: fix sequence number generation for debug mode When using eager RDMA in debug builds the openib btl generates a sequence number for each send. The code independently updated the head index and the sequence number for the eager rdma transaction. If multiple threads enter this code at the same time and run in the following order: thread 1: update sequence (0 -> 1) thread 2: update sequence (1 -> 2) thread 2: update head (0 -> 1) thread 1: update head (1 -> 2) the sequence number for head[0] gets 1 and the sequence number for head[1] gets 0. The fix is to generate the sequence number from the head index. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 16:00:06 -06:00
Nathan Hjelm	8205d79819	btl/openib: add missing lock calls Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 12:21:49 -06:00
Gilles Gouaillardet	d02ccd67de	btl/openib: remove OFED version runtime check when XRC is used this test seems broken : - some false positive were reported - it fails to detect some OFED version mismatch this commit simply removes this test, which means the application will likely fail if XRC is used ad OFED version is different between compile time and runtime	2015-08-14 09:10:03 +09:00
Jeff Squyres	14340770c4	usnic: remove some logically dead code This code really had no purpose; just assign FI_VERSION(1, 1). This fixes CID 1315274. Also clarify the commet about why we still retain libfabric v1.0.0 compatibility code, even though configure.m4 requires libfabric >= v1.1.0.	2015-08-12 05:21:18 -07:00
Jeff Squyres	236cf7ff62	usnic: add v1.8/v1.10 compat code Add compat code so that I can sync master against the v1.10 branch.	2015-08-10 16:27:38 -07:00
Jeff Squyres	7da1c4b875	usnic: avoid race condition in connectivity checker In short applications, it's possible that the agent (i.e., local rank 0) will finalize after non-local rank 0 procs detect the connectivity checker named socket, but before they complete a connect() on it. As such, their connect() gets ECONNREFUSED. This commit adds a simple counter in the agent that won't let it quit before it accept()'s from all local procs, or 10 seconds goes by (whichever occurs first). This is similar to the timeout for the clients: they'll exit if they don't see the expected named socket within 10 seconds.	2015-08-10 15:40:33 -07:00
Jeff Squyres	bad508687e	usnic: move cchecker to OPAL-wide progress thread There's no longer any need for the usnic BTL to have its own progress thread: it can use the opal_progress_thread() infrastructure. This commit removes the code to startup/shutdown the usnic-BTL-specific progress thread and instead, just adds its events to the OPAL-wide progress thread. This necessitated a small change in the finalization step. Previously, we would stop the progress thread and then tear down the events. We can no longer stop the progress thread, and if we start tearing down events, this will cause shutdown/hangups to be sent across sockets, potentially firing some of the still-remaining events while some (but not all) of the data structures have been torn down. Chaos ensues. Instead, queue up an event to tear down all the pending events. Since the progress thread will only fire one event at a time, having a teardown event means that it can tear down all the pending events "atomically" and not have to worry that one of those events will get fired in the middle of the teardown process.	2015-08-10 15:40:33 -07:00
Jeff Squyres	b5c37dbfe2	CSCuv67889: usnic: fix an error corner case Ensure that we have non-NULL on all levels of pointers, which will save us if there are exitable errors very early during component / module initialization.	2015-08-06 10:54:28 -07:00
Jeff Squyres	cbcd16b399	usnic: remove a stale shell variable name	2015-07-31 18:53:54 -07:00
Jeff Squyres	0ee8295e6e	usnic: ensure that we have libfabric >= v1.1	2015-07-31 18:53:54 -07:00
Jeff Squyres	2e7f794aae	usnic: convert to use fi_recvmsg / FI_MORE Minor optimization to post 16 receive buffers at a time (vs. 1).	2015-07-31 12:45:40 -07:00
Jeff Squyres	ec3a38384f	Merge pull request #688 from jsquyres/pr/usnic-libfabric-msg-prefix-fix usnic fixes for differences between libfabric v1.0.0 and v1.1.0	2015-07-21 10:18:36 -04:00
Gilles Gouaillardet	f7cf7d5070	configury: fix XRC detection on OFED < 3.12 since ibv_create_xrc_rcv_qp is now deprecated, and in order to be "future-proof", we have to consider the case in which only XRC Domains are supported. also, correctly handle distro that ship broken ibverbs devel headers Thanks Paul Hargrove for the detailled report.	2015-07-13 10:43:22 +09:00
Ralph Castain	683efcb850	Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.	2015-07-11 10:08:19 -07:00
Ralph Castain	61fb067f14	Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base	2015-07-11 06:42:23 -07:00
Jeff Squyres	633da6641e	usnic: gracefully handle when we can't alloc an ACK The comment didn't match the debugging code (which was ugly, and apparently never happens, anyway). Just return and let the sender retransmit.	2015-07-10 14:19:33 -07:00
Jeff Squyres	3327fa56b5	usnic: minor code cleanups	2015-07-10 10:10:43 -07:00
Jeff Squyres	f9c65a701e	usnic: "sin" assignment needs to be outside the #if The "sin" variable is used below; need to ensure that it is assigned for all builds (not just debug builds).	2015-07-10 06:51:03 -07:00
Jeff Squyres	cd87c8ad41	usnic: misc compiler warnings fixes	2015-07-10 06:51:03 -07:00
Jeff Squyres	ba429dc890	usnic: temporarily disable the BTL put method The usnic BTL put method is currently broken. Disable it until we can fix it properly.	2015-07-10 06:51:03 -07:00
Jeff Squyres	f265358fbe	usnic: handle FI_MSG_PREFIX differences libfabric v1.0.0->v1.1.0 In libfabric v1.0.0 (i.e., API v1.0), the usnic provider handled FI_MSG_PREFIX inconsistently between sends and receives. This has been fixed in libfabric v1.1.0 (i.e., API v1.1): FI_MSG_PREFIX is handled consistently for both sends and receives. Run-time detect which libfabric we are running with and adapt behavior appropriately.	2015-07-10 06:51:03 -07:00
Jeff Squyres	ddd0de6cfc	usnic: make more OS-bypass memory Valgrind-defined This helps reduce false positives when running MPI apps through Valgrind.	2015-07-10 06:51:03 -07:00
Jeff Squyres	9bc7a54e0c	usnic: correctly count CRC errors Handle the differences between libfabric v1.0.0 and v1.1.0 in the return value of fi_cq_readerr(). Also consolidate CRC and truncation errors into the same handling block, since truncation errors are typically another symptom of CRC errors. This ensures that buffers get reposted properly.	2015-07-10 06:51:03 -07:00
Jeff Squyres	fc686f5538	usnic: make configure complain if libfabric cannot be found Instead of silently determining that the usnic BTL can't be built, announce that usnic is checking for libfabric support, and then AC_MSG_RESULT the result of that check.	2015-07-10 06:45:33 -07:00
Jeff Squyres	4341639a66	Revert "configury: fix (again) XRC detection on OFED < 3.12" @ggouaillardet is likely offline for the weekend, but master is broken on RHEL 6.5 systems that do not have MOFED installed. So I'm taking the liberty of revering this commit; I'm guessing Gilles will fixup and re-commit next week. This reverts commit 77f8282d51d8f40f6ae988ef84c9c852de75c625.	2015-07-10 06:45:33 -07:00
Gilles Gouaillardet	77f8282d51	configury: fix (again) XRC detection on OFED < 3.12 since ibv_create_xrc_rcv_qp is now deprecated, and in order to be "future-proof", we have to consider the case in which only XRC Domains are supported. Thanks Paul Hargrove for the detailled report.	2015-07-10 15:31:45 +09:00
Rolf vandeVaart	ae0f3cfee7	Make explicit call to initalize MCA parameters in common CUDA code. This allows us to view them with ompi_info and possibly modify with tools interface	2015-07-09 12:51:55 -04:00
Rolf vandeVaart	cdffa4724d	Force smcuda BTL to use CUDA IPC path for all GPU buffers where possible	2015-07-08 17:11:25 -04:00
Gilles Gouaillardet	9f171de412	btl/openib: queue pending fragments once only when running out of credit Fixes open-mpi/ompi#640	2015-07-06 09:45:01 +09:00
bosilca	77367ca02c	Merge pull request #687 from rolfv/pr/fix-smcuda-perfprob Add the ability use different size buffers for host and CUDA buffers	2015-07-02 18:42:41 -04:00
Rolf vandeVaart	30a872b478	Add the ability to send host buffers through one sized staging buffers and CUDA buffers through different sized buffers. Fixes performance issues	2015-07-02 11:11:15 -04:00
Alina Sklarevich	27797654db	openib btl: added a new vendor_part_id for Mellanox ConnectX4-LX.	2015-06-29 13:50:43 +03:00
Jeff Squyres	a172bd161e	usnic: switch to use the new libfabric common library The usnic BTL configure.m4 no longer needs to OPAL_CHECK_LIBFABRIC; it just uses the results from opal/mca/common/libfabric's configure.m4. We also now don't need to link against libfabric -- they just link against the opal_common_libfabric library.	2015-06-25 13:33:15 -07:00
Nathan Hjelm	ee36d813dc	Merge pull request #657 from hjelmn/c99 more c99 updates	2015-06-25 11:21:09 -06:00
Nathan Hjelm	4d92c9989e	more c99 updates This commit does two things. It removes checks for C99 required headers (stdlib.h, string.h, signal.h, etc). Additionally it removes definitions for required C99 types (intptr_t, int64_t, int32_t, etc). Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-25 10:14:13 -06:00
Howard Pritchard	e49a37c034	ownership: update ownership files per discussions at OMPI devel workshop Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-06-25 10:04:42 -06:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Jeff Squyres	8ab2b11f88	btl_openib.c: fix another compiler warning Remove this unused variable	2015-06-17 09:00:12 -07:00
Jeff Squyres	f688289aaf	btl_openib.c: fix compiler warning This return code is not used; tell the compiler we're not going to use it.	2015-06-17 08:56:56 -07:00
Jeff Squyres	dfa36197ea	usnic/Makefile.am: ensure static builds include -lfabric	2015-06-17 08:15:29 -07:00
Jeff Squyres	44e7646de9	usnic/configure.m4: convert to use external libfabric Use the new OPAL_CHECK_LIBFABRIC macro.	2015-06-15 15:17:06 -07:00
Jeff Squyres	4384131e65	openib: minor style and defensive programming fixes Minor comment/whitespace fixes. Also some minor logic changes that are mainly for defensive programming purposes (i.e., ensure to always set malloc_hook_set to true or false, and then check it before we try to actually invoke it).	2015-06-12 20:11:47 -07:00
Jeff Squyres	2f137ff151	openib: reset memalign threshhold properly Now that open-mpi/ompi#638 is fixed, reset the openib BTL memalign threshhold properly. This effectively re-instates commit open-mpi/ompi@ce915b5757.	2015-06-12 20:11:47 -07:00
Jeff Squyres	88c13adc8c	openib: only set the memory hook if it is enabled Instead of unconditionally setting the memory hook, only set it when the memory hooks are both available and have been enabled (e.g., opal/mca/memory/linux has decided that it can be enabled, and when the mpi_leave_pinned MCA param is set to 1, or is set to -1 and some component requested the memory hooks be enabled). If we set the memory hook when memory hooks are not enabled, __malloc_hook will be NULL, which will cause problems when btl_openib_malloc_hook() tries to invoke it. Fixes open-mpi/ompi#638.	2015-06-12 20:11:47 -07:00
Ralph Castain	12d3c9ca22	Revert "Fix a typo that incorrectly set the alignment threshold in the openib BTL." This reverts commit ce915b5757d428d3e914dcef50bd4b2636561bca.	2015-06-10 14:02:49 -07:00
Jeff Squyres	4b59be4e4c	btl tcp: cosmetic changes and updates No logic changes. Update some stale/incorrect comments, fix some indenting and style.	2015-06-06 10:17:20 -07:00
Jeff Squyres	cddc8945e0	btl_tcp_proc.c: add missing "continue" Also add another (superflous but symmetric) continue statement. This missing "continue" statement allows IPv4 "private network" matches to fall through and allow IPv6 matches to be made -- thereby overriding the IPv4 match that was already made. Fixes #585 (although several of the other issues identified on #585 still exist, the primary / initial bug that was reported there is now fixed).	2015-06-06 10:17:12 -07:00
Rolf vandeVaart	8622b34664	Check for GPU Direct RDMA and leave pinned turned off	2015-06-04 14:25:24 -04:00
Nathan Hjelm	5e2bc2c662	btl/openib: fix coverity issue CID 1269821 Dereference null return value (NULL_RETURNS) This is another false positive that can be silenced by looping on opal_list_remove_first instead of using both opal_list_is_empty and opal_list_remove_first. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-29 08:44:03 -06:00
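
The race described in commit c101385f64 (sequence number and head index updated independently, so two threads can pair them in the wrong order) can be illustrated with a minimal sketch. This is not the openib BTL code: `eager_ring`, `ring_claim`, and `RING_SIZE` are hypothetical names, and C11 atomics are assumed in place of whatever primitives the BTL actually uses. The point is only that a single atomic update of the head yields a value from which the sequence number is derived, so the two can no longer disagree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u   /* hypothetical ring size, not taken from the openib BTL */

/* Hypothetical stand-in for an eager RDMA ring; all names are illustrative. */
struct eager_ring {
    atomic_uint_fast32_t head;   /* monotonically increasing slot counter */
    uint32_t seq[RING_SIZE];     /* debug sequence number recorded per slot */
};

/* Claim the next slot. The sequence number is derived from the value returned
 * by the one atomic increment of head, so no interleaving of threads can give
 * slot 0 the sequence number 1 and slot 1 the sequence number 0. */
static uint32_t ring_claim(struct eager_ring *r, uint32_t *seq_out)
{
    assert(r != NULL && seq_out != NULL);

    uint32_t ticket = (uint32_t)atomic_fetch_add(&r->head, 1);
    uint32_t slot = ticket % RING_SIZE;

    r->seq[slot] = ticket;   /* sequence comes from the head value, not a second counter */
    *seq_out = ticket;
    return slot;
}
```

A caller would invoke `ring_claim(&ring, &seq)` once per send. With a zero-initialized ring, the first call yields slot 0 with sequence 0 and the second yields slot 1 with sequence 1, whichever thread gets there first.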


715 commits