openmpi

Author	SHA1	Message	Date
Jeff Squyres	515fd00411	CSCul95082: DMAR faults during mtt testing. usnic_channel_finalize() was deregistering recv buffers before destroying the QP to which they were posted. The QP needs to be destroyed first so that the NIC does not attempt to write to deregistered memory, causing the DMAR messages. Submitted by Reese, reviewed by Jeff. cmr=v1.7.4:reviewer=ompi-rm1.7 This commit was SVN r29963.	2013-12-19 00:01:35 +00:00
George Bosilca	430a13719f	Only if OMPI_BTL_SM_HAVE_CMA is set to 1. This commit was SVN r29916.	2013-12-15 16:49:27 +00:00
Jeff Squyres	0ab48ad0d2	Fix some annoying flex warnings that have been there for years. Many thanks to Tom Fogal for the initial patch. cmr=v1.7.4:reviewer=rhc:subject=Fix annoying flex warnings This commit was SVN r29904.	2013-12-14 00:36:12 +00:00
Rolf vandeVaart	b955dbd6d9	Fix various items discovered by review of ticket #3951 . This commit was SVN r29900.	2013-12-13 21:25:07 +00:00
Jeff Squyres	f4afa4fd1f	Add missing include, exposed in "external libevent" work. Refs trac:3694 This commit was SVN r29898. The following Trac tickets were found above: Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694	2013-12-13 21:21:30 +00:00
Jeff Squyres	bcfe2156d5	Bring over m4 quoting fix from v1.7 branch (in r29894) that was discovered when removing some components. This commit was SVN r29895. The following SVN revision numbers were found above: r29894 --> open-mpi/ompi@58ed00296c	2013-12-13 20:27:33 +00:00
Ralph Castain	f763be26c4	Closes trac:2433. Check for hetero architecture and disqualify sm connections if that is found as the sm btl currently doesn't support hetero operations. cmr=v1.7.4:reviewer=brbarret:subject=Disqualify sm btl for hetero procs This commit was SVN r29882. The following Trac tickets were found above: Ticket 2433 --> https://svn.open-mpi.org/trac/ompi/ticket/2433	2013-12-13 15:23:33 +00:00
Mike Dubman	fb3f94a16e	remove debug print Refs trac:3969 This commit was SVN r29876. The following Trac tickets were found above: Ticket 3969 --> https://svn.open-mpi.org/trac/ompi/ticket/3969	2013-12-13 06:08:44 +00:00
Mike Dubman	21be95c9b5	Initialize sm global variables in mca_btl_sm_component_open(), because they are destructed in mca_btl_sm_component_close(), and the init() function might not be called or might fail. For example, mca_btl_sm.knem_fd remained 0, and mca_btl_sm_component_close() ended up closing fd 0, which belongs to someone else. Fixed by Yossi, reviewed by miked cmr=v1.7.4:reviewer=ompi-rm1.7 This commit was SVN r29875.	2013-12-13 06:01:24 +00:00
Jeff Squyres	bac67e0d81	Per discussion @Chicago OMPI dev meeting Dec 2013: remove all MX support. This commit was SVN r29873.	2013-12-12 18:54:47 +00:00
Nathan Hjelm	3262080391	Cleanup udcm structures to avoid issues with nesting structures with flexible members. UDCM is ready to go for 1.7.4 with this patch. cmr=v1.7.4:ticket=3940 This commit was SVN r29861. The following Trac tickets were found above: Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940	2013-12-12 05:24:37 +00:00
Nathan Hjelm	e0e94a6029	Fix warning caused by typo in r29815 This commit was SVN r29860. The following SVN revision numbers were found above: r29815 --> open-mpi/ompi@d556b60b21	2013-12-11 21:45:39 +00:00
Nathan Hjelm	6ab69c758b	Fix warnings in udcm. cmr=v1.7.4:reviewer=rhc:ticket=3940 This commit was SVN r29859. The following Trac tickets were found above: Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940	2013-12-11 21:40:06 +00:00
Rolf vandeVaart	3ae88f8a24	Ensure no fork support with GDR. CUDA-aware code only. This commit was SVN r29854.	2013-12-10 18:08:53 +00:00
Rolf vandeVaart	1cc55f305f	Add extra check for GDR. Adjust some names and replace opal_output with opal_show_help. This commit was SVN r29853.	2013-12-10 16:04:08 +00:00
Jeff Squyres	0f61bb651e	Technically, the PORT_ACTIVE is not a "bad" event. Note that this event should never happen within a single OMPI job, because OMPI will ignore usnic ports that are down. The PORT_ACTIVE event should only occur if a port ''was'' down and is now ''up''. But what the heck -- if we ever do get this event, it is harmless -- just ignore it. This commit was SVN r29852.	2013-12-09 20:45:55 +00:00
Jeff Squyres	3bd9c603ff	Clean up variables used in configure with OPAL_VAR_SCOPE. This is helpful in the work for #3694: ensure that many places that eventually end up in configure don't overly-pollute the global shell variable space (because debugging accidental shell variable pollution can be a real pain). Refs trac:3694 This commit was SVN r29830. The following Trac tickets were found above: Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694	2013-12-06 23:40:34 +00:00
Rolf vandeVaart	d556b60b21	Change some CUDA configure code and macro names per review request by jsquyres in ticket #3880 . Functionally, nothing changes. This commit was SVN r29815.	2013-12-06 14:35:10 +00:00
Dave Goodell	da26226e3c	usnic: add some extra debug-build sanity checks On the off chance that the PML is twiddling fields that it really shouldn't be... Reviewed-by: Reese Faucette <rfaucett@cisco.com> This commit was SVN r29804.	2013-12-05 00:28:11 +00:00
Jeff Squyres	ba018b3603	Protect the container_of #define. MOFED apparently has a /usr/include/infiniband/verbs.h that also defines a (slightly different but fully compatible) container_of macro. So put proper #ifndef protection around our definition of container_of. Thanks to Rolf vandeVaart for pointing out the issue. Reviewed by Dave Goodell. cmr=v1.7.4:reviewer=ompi-rm1.7 This commit was SVN r29799.	2013-12-04 14:24:56 +00:00
Jeff Squyres	c74c1e86d3	Per suggestion from Paul Kapinos, report in BTL verbosity if a device is skipped because it is too far away. (see thread starting here: http://www.open-mpi.org/community/lists/devel/2013/06/12470.php) This commit was SVN r29790.	2013-12-03 22:44:11 +00:00
Rolf vandeVaart	ab77435d9b	Fix the CUDA-aware case where we are not sending any GPU data. This commit was SVN r29788.	2013-12-03 20:25:58 +00:00
Nathan Hjelm	fe327d9859	udcm: cleanup code and improve the ack handling Originally udcm acks used the immediate data to indicate which message was being acknowledged. This data was (mysteriously) junk when using QLogic HCAs so I updated udcm to use the source info (slid, qp, etc) to determine which message was being acked. This works as long as we don't have two messages simultaneously in flight to a particular peer and then lose the first of the two messages. The chances of this happening are tiny. To fix this case I updated the udcm message header to include a pointer to the in flight message. This pointer is then sent back to the sending process to ack receipt. cmr=v1.7.4:ticket=trac:3940 This commit was SVN r29775. The following Trac tickets were found above: Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940	2013-12-02 20:18:46 +00:00
Nathan Hjelm	fb0b0442c4	openib/connect: re-enable xrc support in the openib btl This commit updates the udcm cpc to support xrc. The steps followed by udcm mimic those in the removed xoob cpc. This update has been tested with both XRC and RC. Mellanox, this is intended to go into 1.7.4. Please review carefully and let me know if there are any issues. cmr=v1.7.4:reviewer=miked This commit was SVN r29767.	2013-11-27 22:28:04 +00:00
Devendar Bureddy	4a311ae9fd	continue search sorted openib device list if no btls found with nearest HCA. cmr=v1.7.4:reviewer=jladd This commit was SVN r29725.	2013-11-20 22:23:12 +00:00
Nathan Hjelm	24a7e7aa34	Add support for the udreg registration cache and dynamics on XE/XK/XC. To support the new mpool two changes were made to the mpool infrastructure: 1) Added an mpool flag to indicate that an mpool does not need the memory hooks to use the leave pinned protocols. This flag is checked in the mpool lookup. 2) Add a mpool context to the base registration. This new member is used by the udreg mpool to store the udreg context associated with the particular registration. The new member will not break the ABI compatibility as the new member is only currently used by the udreg mpool. Dynamics support for Cray systems makes use of the global rank provided by orte to give the ugni library a unique rank for each process. Dynamics support is not available under direct-launch (srun.) cmr=v1.7.4 This commit was SVN r29719.	2013-11-18 04:58:37 +00:00
Jeff Squyres	5206e877be	Help decrease conflicts between SVN trunk and Cisco git branch of OMPI v1.6 branch This commit was SVN r29715.	2013-11-15 21:35:56 +00:00
Jeff Squyres	e6ed7c9f4d	Avoid trivial "don't mix declarations and code" compiler warning This commit was SVN r29714.	2013-11-15 21:31:10 +00:00
Ralph Castain	22e30a680d	Given that the oob and xoob cpc's are no longer operable and haven't been since the OOB update, remove them to avoid confusion cmr:v1.7.4:reviewer=hjelmn:subject=Remove stale cpcs from openib This commit was SVN r29703.	2013-11-14 04:16:53 +00:00
Rolf vandeVaart	4964a5e98b	Per this RFC from October 8, 2013 and as discussed in telecon. http://www.open-mpi.org/community/lists/devel/2013/10/13072.php Add support for pinning GPU Direct RDMA in openib BTL for better small message latency of GPU buffers. Note that none of this is compiled in unless CUDA-aware support is requested. This commit was SVN r29680.	2013-11-13 13:22:39 +00:00
Jeff Squyres	98ff91cfeb	Refs trac:3091 Gah! The "device" variable isn't used at all in this loop (my eye glossed over the next line and thought that "device" was used in the free() statement, but it's actually "devices" -- not "device"). This commit was SVN r29665. The following Trac tickets were found above: Ticket 3091 --> https://svn.open-mpi.org/trac/ompi/ticket/3091	2013-11-12 23:01:04 +00:00
Jeff Squyres	7cb31111a6	Refs trac:3901 Feedback from Dave's review. This commit was SVN r29664. The following Trac tickets were found above: Ticket 3901 --> https://svn.open-mpi.org/trac/ompi/ticket/3901	2013-11-12 22:51:20 +00:00
Jeff Squyres	5a940f5ee7	Arrgh -- remove debugging printf. This commit was SVN r29657.	2013-11-11 22:44:28 +00:00
Jeff Squyres	e20217eccc	Expand the "btl_usnic" MPI_T enumeration to have strings of the form: <usnic device name>,<eth device>,<ip address>/<CIDR prefix> For example: usnic_0,eth4,10.1.0.15/16 This is just handy for mapping the usnic_X device back to the IP network to which it corresponds. This commit was SVN r29656.	2013-11-11 22:25:30 +00:00
Nathan Hjelm	3d3c29ae96	btl/scif: do not return resource busy if we started a connection attempt. Resolves a hang when using scif for shared memory transfers. This is a simple change and doesn't require a review. cmr=v1.7.4:reviewer=ompi-rm1.7 This commit was SVN r29653.	2013-11-11 19:36:34 +00:00
Rolf vandeVaart	3290cde630	Various minor changes to bring smcuda up to date with sm. This commit was SVN r29639.	2013-11-07 19:45:56 +00:00
Dave Goodell	82db913490	usnic: fix module_recv_buffers perf regression Cisco v1.6 git commit 913ec6c and upstream trunk r29593 (segfault fix) introduced a performance regression by inadvertently disabling the `module_recv_buffers` functionality. With those changes in place, the `btl_usnic_recv.c` logic would end up mallocing a buffer that should have otherwise come from a `module_recv_buffers` pool. It also resulted in a small, bounded memory leak (128 buffers at each power-of-two size interval). The new version just places the buffer after the free list item with a flexible array member. I bumped the pool to allocate all 128 elements up front because the deferred allocation was modestly impacting IMB Sendrecv performance at a few sizes. Reviewed-by: Reese Faucette <rfaucett@cisco.com> This commit was SVN r29631. The following SVN revision numbers were found above: r29593 --> open-mpi/ompi@1ed9b8ff43	2013-11-07 01:27:31 +00:00
Jeff Squyres	e28261898d	Per discussion on the devel list, rename the btl_usnic_devices MPI_T state pvar to be btl_usnic (i.e., the best suggestion so far). See http://www.open-mpi.org/community/lists/devel/2013/11/13188.php for more detail. This commit was SVN r29614.	2013-11-06 06:19:03 +00:00
Rolf vandeVaart	e57795f097	Revert r29594. That was just plain wrong. Sorry about workday configure change. This commit was SVN r29605. The following SVN revision numbers were found above: r29594 --> open-mpi/ompi@ed7ddcd9c7	2013-11-05 14:45:56 +00:00
Rolf vandeVaart	ed7ddcd9c7	Fix CUDA-aware compile error introduced with r29581. This commit was SVN r29594. The following SVN revision numbers were found above: r29581 --> open-mpi/ompi@ee7510b025	2013-11-05 00:08:33 +00:00
Dave Goodell	1ed9b8ff43	usnic: fix segfault at finalize time Without this commit, if you run IMB pingpong between two nodes with only one usnic selected (e.g., via `--mca btl_usnic_if_include usnic_0`) then the run will seem fine but will segfault at MPI_Finalize time. This behavior has happened since Cisco v1.6 git commit ec7ddf8, upstream trunk r29484, and upstream v1.7 r29507. Root cause was that the free list element was being used as the recv buffer instead of the data buffer associated with the element. So the reassembly code would stomp all over the free list element, which would cause the destructor to explode when the free list attempted to clean up all of its elements. This surprisingly did not cause any other problems until now. Reviewed-by: Reese Faucette <rfaucett@cisco.com> This commit was SVN r29593. The following SVN revision numbers were found above: r29484 --> open-mpi/ompi@a6ed232a10 r29507 --> open-mpi/ompi@790d269ce8	2013-11-04 22:52:14 +00:00
Dave Goodell	73a943492c	usnic: pack via convertor on the fly If we need to use a convertor, go back to stashing that convertor in the frag and populating segments "on the fly" (in ompi_btl_usnic_module_progress_sends). Previously we would pack into a chain of chunk segments at prepare_src time, unnecessarily consuming additional memory. Reviewed-by: Jeff Squyres <jsquyres@cisco.com> Reviewed-by: Reese Faucette <rfaucett@cisco.com> This commit was SVN r29592.	2013-11-04 22:52:03 +00:00
Dave Goodell	71d0d73575	usnic: refactor callback invocation This makes it a little easier to see what's happening with callbacks to the PML. Reviewed-by: Jeff Squyres <jsquyres@cisco.com> Reviewed-by: Reese Faucette <rfaucett@cisco.com> This commit was SVN r29591.	2013-11-04 22:51:48 +00:00
Dave Goodell	4c791e21d2	usnic: add MSGDEBUG1_OUT/MSGDEBUG2_OUT macros This includes suppressing picky-mode warnings about __VA_ARGS__, which we know are supported by any compilers we care about. Reviewed-by: Jeff Squyres <jsquyres@cisco.com> Reviewed-by: Reese Faucette <rfaucett@cisco.com> This commit was SVN r29590.	2013-11-04 22:51:35 +00:00
Dave Goodell	825686a205	usnic: certain send frag members are immutable Ensure that they never are touched by checking in their destructors. Reviewed-by: Jeff Squyres <jsquyres@cisco.com> Reviewed-by: Reese Faucette <rfaucett@cisco.com> This commit was SVN r29589.	2013-11-04 22:51:24 +00:00
Rolf vandeVaart	ee7510b025	Remove redundant macro. This was from review of an earlier ticket. Fixes trac:3878. Reviewed by jsquyres. This commit was SVN r29581. The following Trac tickets were found above: Ticket 3878 --> https://svn.open-mpi.org/trac/ompi/ticket/3878	2013-11-01 12:19:40 +00:00
Mike Dubman	b0e64427a9	ompi/mca/btl/openib: Fix memory leak and accessing free'd memory issues Let's imagine that we have two btls in btl_openib_component_init() that both point to the same openib_btl->device and as a result have the same openib_btl->device->endpoints array. The finalization phase calls mca_btl_openib_finalize()->mca_btl_openib_finalize_resources() twice. mca_btl_openib_finalize_resources() frees endpoint related btl. But the second call of mca_btl_openib_finalize_resources() checks an endpoint that was released by the previous call. Fixed by Igor, reviewed by miked/vasily cmr=v1.7.4:reviewer=ompi-gk1.7 This commit was SVN r29563.	2013-10-30 11:47:49 +00:00
Jeff Squyres	6569019b06	Move all usNIC stats to _stats.c|h and export them as MPI_T pvars. This commit moves all the module stats into their own struct so that the stats only need to appear as a single line in the module_t definition, and then moves all the logic for reporting the stats into btl_usnic_stats.c|h. Further, the stats are now exported as MPI_T_BIND_NO_OBJECT entities (i.e., not bound to any particular MPI handle), and are marked as READONLY and CONTINUOUS. They currently all default to verbose level 5 ("Application tuner / detailed", according to https://svn.open-mpi.org/trac/ompi/wiki/MCAParamLevels). Most of the statistics are counters, but a small number are high watermark values. Due to how counters are reported via MPI_T, none of the counters are exported through MPI_T if the MCA param btl_usnic_stats_relative=1 (i.e., the module resets the stats back to zero at a given frequency). When MPI_T_pvar_handle_alloc() is invoked on any of these pvars, it will return a count that is equal to the number of active usnic BTL modules. The values returned for any given pvar (e.g., num_total_sends) are an array containing one value for each active usnic BTL module. The ordering of values in the array is both consistent across all usnic pvars and stable throughout a single job: array slot 0 corresponds to module X, array slot 1 corresponds to module Y, etc. Mapping which array slot corresponds to which underlying Linux usnic_X device works as follows: * The btl_usnic_devices MPI_T state pvar is associated with a btl_usnic_device MPI_T enum, and can be obtained via MPI_T_pvar_get_info(). * If all usNIC pvars are of length N, the values [0,N) in the btl_usnic_device enum are associated with strings of the corresponding underlying Linux device. For example, to look up which Linux device is reported in all usNIC pvars' array slot 1, look up the int value 1 in the btl_usnic_devices enum. Its corresponding string value is the underlying Linux device name (e.g., "usnic_1"). cmr=v1.7.4:subject="usnic BTL MPI_T pvars" This commit was SVN r29545.	2013-10-28 22:23:08 +00:00
Nathan Hjelm	404cceb9c4	Always check the return of [mc]alloc and fix a warning introduced by r29479. This fixes some issues reported awhile ago in the openib btl. There are a couple more unchecked mallocs but they are a bit more difficult to fix since they are in void functions (btl_openib_endpoint.c). Refs trac:2401. cmr=v1.7.4:reviewer=miked This commit was SVN r29543. The following SVN revision numbers were found above: r29479 --> open-mpi/ompi@d6ead2a3a5 The following Trac tickets were found above: Ticket 2401 --> https://svn.open-mpi.org/trac/ompi/ticket/2401	2013-10-28 20:04:49 +00:00
Nathan Hjelm	26f3a029d3	Fix scif configury. cmr=v1.7.4:ticket=3862 This commit was SVN r29493. The following Trac tickets were found above: Ticket 3862 --> https://svn.open-mpi.org/trac/ompi/ticket/3862	2013-10-23 17:04:20 +00:00
Nathan Hjelm	6186b5ed9d	Remove extra file that made its way into r29490. cmr=v1.7.4:ticket=3862 This commit was SVN r29491. The following SVN revision numbers were found above: r29490 --> open-mpi/ompi@cde3b05ed3 The following Trac tickets were found above: Ticket 3862 --> https://svn.open-mpi.org/trac/ompi/ticket/3862	2013-10-23 16:17:51 +00:00
Nathan Hjelm	cde3b05ed3	Add support for the Intel scif interface. Depends on #3847. cmr=v1.7.4:reviewer=rhc This commit was SVN r29490.	2013-10-23 15:59:14 +00:00
Dave Goodell	d969cfa513	usnic: correctly clean up verbs resources Due to deallocation ordering (and an entirely missed deallocation), we were leaking modest amounts of memory inside libusnic_verbs. Reviewed-by: Jeff Squyres <jsquyres@cisco.com> This commit was SVN r29485.	2013-10-23 15:51:33 +00:00
Dave Goodell	a6ed232a10	usnic: fix several memory leaks - some free lists simply were not being OBJ_DESTRUCTed, so they never freed their internal memory - channel->recv_segs.ctx was being assigned in a way that got clobbered by ompi_free_list_init_new, so the cleanup code that relied on it being set never ran - numerous other ".ctx" assignments were similarly ineffectual and were not being consumed, so I deleted them Reviewed-by: Jeff Squyres <jsquyres@cisco.com> This commit was SVN r29484.	2013-10-23 15:51:22 +00:00
Dave Goodell	c9b2343982	usnic: add ompi_btl_usnic_component_debug helper This new routine can be called in exceptional situations, either conditionally in BTL code or from a debugger, to help with debugging in cases where MSGDEBUG1/2 or stats logging are impractical but more detail is needed. Reviewed-by: Jeff Squyres <jsquyres@cisco.com> This commit was SVN r29483.	2013-10-23 15:51:11 +00:00
Dave Goodell	d0b7d125b2	usnic: refactor usnic_stats_callback Pull the bulk of the functionality out into a new routine, ompi_btl_usnic_print_stats, which can be used in other debugging contexts. This also lets us eliminate the module->final_stats state tracking. Reviewed-by: Jeff Squyres <jsquyres@cisco.com> This commit was SVN r29482.	2013-10-23 15:50:57 +00:00
Jeff Squyres	0fb8edd720	Trivial comment change This commit was SVN r29480.	2013-10-23 10:15:18 +00:00
Mike Dubman	d6ead2a3a5	Add support for routable ROCE, where a different subnet_id is valid for proceeding with MPI routing (this can happen in the same LAN). Developed by vasily, reviewed by miked cmr=v1.7.4:reviewer=ompi-gk1.7 This commit was SVN r29479.	2013-10-23 06:08:54 +00:00
Nathan Hjelm	280a89448f	Make btl/vader valgrind safe. cmr=v1.7.4:reviewer=samuel This commit was SVN r29464.	2013-10-22 15:33:32 +00:00
Rolf vandeVaart	0cd1e8dfd9	Add runtime support to turn off CUDA IPC support. This commit was SVN r29444.	2013-10-16 16:48:18 +00:00
Jeff Squyres	b5e2ae86ad	Remove all of our "to-do" items from the README.txt. This commit was SVN r29424.	2013-10-11 16:43:56 +00:00
Jeff Squyres	66dadbe1e7	Per RFC, remove the udapl BTL. This commit was SVN r29400.	2013-10-08 15:18:59 +00:00
Rolf vandeVaart	3bd02fbaf5	Add one more verbose debug output that prints when we are out of memory. This commit was SVN r29378.	2013-10-04 18:56:06 +00:00
Rolf vandeVaart	66725f6973	Enable some CUDA-aware support on tcp btl. Only when configured in. This commit was SVN r29364.	2013-10-04 12:50:16 +00:00
Nathan Hjelm	f3d18028e5	Fix typo in uGNI prepare source that could cause incorrect results with non-contiguous datatypes. cmr=v1.7.3 This commit was SVN r29294.	2013-09-30 16:00:58 +00:00
Dave Goodell	a42fa78da7	usnic: SEGV in OSU benchmarks Prevent frag from being freed out from under us in the case the PML callback routine calls usnic_free(). We accomplish this by delaying decrement of sf_bytes_to_ack until after the callback is performed, since sf_bytes_to_ack == 0 is condition of freeing the frag. Fixes Cisco bug CSCuj45094. Authored-by: Reese Faucette <rfaucett@cisco.com> cmr=v1.7.3 This commit was SVN r29264.	2013-09-26 21:48:04 +00:00
Ralph Castain	34fbec1f49	Sadly, the connection priorities being defined at time of variable instantiation were being overridden just before registering the param. Thus, changes people made to the relative priority of the cpc methods were being lost. Fix it by removing the duplicate initialization, letting the value defined at instantiation be the one actually used. cmr:v1.7.4:reviewer=hjelmn This commit was SVN r29212.	2013-09-19 19:45:00 +00:00
Jeff Squyres	74d1278f48	btl_usnic_util.c:ompi_btl_usnic_util_abort() also passes in the strerror(). This commit was SVN r29188.	2013-09-17 12:35:51 +00:00
Reese Faucette	8f235e6977	usnic: wrong SG entry used to compute length for small put()s This commit was SVN r29186.	2013-09-17 08:18:02 +00:00
Reese Faucette	651d61f1a3	Clean up debugging logging a bit. MSGDEBUG2 now means "print a one-liner for all PML calls into BTL, and also when BTL calls PML with a recv completion (not send completions)" MSGDEBUG1 means print more internal gory detail MSGDEBUG is gone, replaced by MSGDEBUG1 In the process also found that PUT_DEST style fragments could potentially be leaked in usnic_free() since send_fragment tests were being applied to see if it was eligible to be freed. This commit was SVN r29185.	2013-09-17 07:29:40 +00:00
Reese Faucette	f35d9b50e3	Cisco CSCuj22803: fixes for Bsend changes required to support MPI_Bsend(). Introduces concept of attaching a buffer to a large segment that the PML can scribble into and we will send from. The reason we don't use a pinned buffer and send directly from that is that usnic_verbs does not (yet) support num_sge>1 for regular sends. This means the data gets copied twice, but that is unavoidable. Changed the logic in handle_large_send to be more sensible. Incorporated David's review comments. This commit was SVN r29184.	2013-09-17 07:27:39 +00:00
Reese Faucette	25b5c84d0f	Cisco CSCuj13135: Data corruption in MPI_Bsend_ator_c Do not assume that the "size" passed to alloc_send() will be the same as the size of the message the resulting fragment will hold when usnic_send() is called. This means usnic_send()/usnic_put() can never trust any pre-computed size values, and are only allowed to look at the lengths and pointers of the elements in the desc SG list. This commit was SVN r29183.	2013-09-17 07:25:05 +00:00
Reese Faucette	b9103c0f66	Cisco CSCuj12524: c_put_big segfault - usnic_free() cannot free the fragment until ACK is received This commit was SVN r29182.	2013-09-17 07:23:15 +00:00
Reese Faucette	89b5f0899b	Cisco CSCuj12520: various problems running c_fence_put_1 - tag needs to be sent in our header, not the PML header - usnic_alloc() should return smaller value if too much data requested - be careful about callbacks vs removing items from lists (we need to remove from our lists before the callback) - improve send callback handling - add some more MSGDEBUG2 logging and cleanup This commit was SVN r29181.	2013-09-17 07:20:44 +00:00
Rolf vandeVaart	096b8c022e	Also add flag to debug output. This commit was SVN r29163.	2013-09-13 19:47:05 +00:00
Rolf vandeVaart	c15b2a26b8	Fix some formatting. Move some CUDA-aware mca parameter initialization earlier. This commit was SVN r29162.	2013-09-13 17:43:41 +00:00
Rolf vandeVaart	d247c26b84	In the case that HAVE_IBV_FORK_INIT is not defined, we will need this variable so we can give the user an error if they ask for it. Also fixes compile error when HAVE_IBV_FORK_INIT is not defined. This commit was SVN r29160.	2013-09-13 14:38:49 +00:00
Rolf vandeVaart	ba9ec1b8bc	For debug builds, add the ability to view memory registrations and deregistrations in the openib BTL. This commit was SVN r29159.	2013-09-13 14:28:26 +00:00
Joshua Ladd	b3f88c4a1d	Per the RFC schedule, this commit adds Mellanox OpenSHMEM to the trunk. It does not yet run on OSX or with CM PML for an MTL other than MXM. Mellanox is aware of these issues and is in the process of resolving them. This should be added to \ncmr=v1.7.4:subject=Move OSHMEM to 1.7.4:reviewer=rhc This commit was SVN r29153.	2013-09-10 15:34:09 +00:00
Jeff Squyres	c9f05a2664	Delineate OMPI_FREE_LIST__MT separately. The FREE_LIST__MT stuff was introduced on the SVN trunk in r28722 (2013-07-04), but so far, has not been merged into the v1.7 branch yet (2013-09-06). So put it in its own #ifdef, rather than defining it based on OMPI_MAJOR_VERSION/OMPI_MINOR_VERSION. This commit was SVN r29148. The following SVN revision numbers were found above: r28722 --> open-mpi/ompi@c9e5ab9ed1	2013-09-06 19:22:56 +00:00
Jeff Squyres	e02cc0a7ec	No need for this header file. This commit was SVN r29147.	2013-09-06 19:22:28 +00:00
Jeff Squyres	c53b0890cf	Ensure that btl_usnic_compat.h is in the tarball. This commit was SVN r29140.	2013-09-06 15:53:56 +00:00
Dave Goodell	75fa28c303	usnic: v1.6<->trunk unification, trunk side The Cisco-maintained v1.6 port of the usnic BTL has diverged from the upstream trunk and v1.7 branches. This commit adjusts the trunk to more closely match the v1.6 branch to simplify future merging and cherry-picking. The usnic MCA parameters also need work on this side. Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760) This commit was SVN r29138. The following Trac tickets were found above: Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760	2013-09-06 03:21:34 +00:00
Dave Goodell	a669bd01e6	usnic: revamp convertor handling. The fix for the HPL SEGV was incorrect because it assumed the prepare_src() routine was always allowed to return "bytes processed" less than the requested "bytes to send". It turns out this is only true if the convertor is what limits the size; we are not allowed to limit the data sent for our own reasons, else we break logic in the upper layers. This means we need to learn the number of bytes out of the size requested the convertor will give us, no matter how big the size is. Unfortunately, this is a destructive test, and (currently) the only way to learn that number is to actually have the convertor copy the data out into buffers. This change implements this, copying the entire data out into a chain of send segments which are attached to the large send fragment. Now we can always return the proper size value to the PML. Fixes Cisco bug CSCuj08024 Authored-by: Reese Faucette <rfaucett@cisco.com> Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760) This commit was SVN r29137. The following Trac tickets were found above: Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760	2013-09-06 03:21:21 +00:00
Dave Goodell	0ef8336502	new bookkeeping code should return value indicating whether packet is good or not. Authored-by: Reese Faucette <rfaucett@cisco.com> Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760) This commit was SVN r29136. The following Trac tickets were found above: Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760	2013-09-06 03:19:32 +00:00
Dave Goodell	122890c2fd	usnic: "bookeeping" --> "bookkeeping" Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760) This commit was SVN r29135. The following Trac tickets were found above: Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760	2013-09-06 03:19:20 +00:00
Dave Goodell	0df6ed4acc	usnic: squash warnings from perf improvements Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760) This commit was SVN r29134. The following Trac tickets were found above: Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760	2013-09-06 03:19:08 +00:00
Dave Goodell	6dc54d372d	usnic: Basket of performance changes including: - round segment buffer allocation to cache-line - split some routines into an inline fast section and a called slower section - introduce receive fastpath in component_progress that: o returns immediately if there is a packet available on priority queue and fastpath is enabled o disables fastpath for 1 time after use to provide fairness to other processing o defers receive buffer posting o defers bookkeeping for receive until next call to usnic_component_progress Authored-by: Reese Faucette <rfaucett@cisco.com> Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760) This commit was SVN r29133. The following Trac tickets were found above: Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760	2013-09-06 03:18:57 +00:00
Dave Goodell	9cab9777d9	usnic: properly destroy embedded small send frag Without this, an `--enable-debug` build would hit an assertion in the list code when run under valgrind with `--malloc-fill=0xff` or any other case where malloc returned non-zeroed buffers. Also allow the normal OBJ_ machinery to handle the constructor invocation ordering for us instead of doing it by hand (which could have led to future bugs). Reviewed-by: jsquyres@cisco.com cmr=v1.7.4 Depends on trunk functionality in r29095 and r29096. Refs trac:3740,#3741. This commit was SVN r29127. The following SVN revision numbers were found above: r29095 --> open-mpi/ompi@d1b5940e97 r29096 --> open-mpi/ompi@a552921171 The following Trac tickets were found above: Ticket 3740 --> https://svn.open-mpi.org/trac/ompi/ticket/3740	2013-09-04 20:59:12 +00:00
Brian Barrett	16a1166884	Remove the proc_pml and proc_bml fields from ompi_proc_t and replace with a configure-time dynamic allocation of flags. The net result for platforms which only support BTL-based communication is a reduction of 8*nprocs bytes per process. Platforms which support both MTLs and BTLs will not see a space reduction, but will now be able to safely run both the MTL and BTL side-by-side, which will prove useful. This commit was SVN r29100.	2013-08-30 16:54:55 +00:00
Rolf vandeVaart	18962d296b	This has bothered me for a while. Change MCA_BTL_TAG_BTL to MCA_BTL_TAG_IB. They are the same value so this does not change anything. (MCA_BTL_TAG_IB = MCA_BTL_TAG_BTL + 0). This just makes it more correct. This commit was SVN r29099.	2013-08-30 14:53:59 +00:00
Dave Goodell	c5a7e8a079	usnic: stomp format specifier warnings The usnic BTL now builds cleanly under `--enable-picky` when `MSGDEBUG1` is set. Reviewed-by: jsquyres cmr=v1.7.4:reviewer=jsquyres This commit was SVN r29097.	2013-08-29 23:24:14 +00:00
George Bosilca	305fa88d4b	Remove two warnings from the SM BTL. The return code can be safely ignored as the internals of the SM BTL will repost the fragment until the send operation successfully completes. This commit was SVN r29077.	2013-08-28 06:36:01 +00:00
Dave Goodell	dd82bd3c19	usnic: fix invalid rfstart initialization endpoint_rfstart was being initialized from a value which was not yet set. Also ensure that rfstart is a valid index in the range 0..WINDOW_SIZE-1, since it is used as the index into endpoint_rcvd_segs, which has WINDOW_SIZE elements. Without this change there is significant risk of memory corruption or segfaults, resulting in hangs or crashes, if malloc ever returns us a value >=WINDOW_SIZE (4096). Right now we seem to be getting lucky that the malloc is returning zero-pages to us when we are allocating endpoint structures (possibly because the freelist performs a single large allocation for all endpoints). Fixes Cisco bug CSCui88781. Reviewed-by: rfaucett@cisco.com Reviewed-by: jsquyres@cisco.com cmr=v1.7.3:reviewer=jsquyres This commit was SVN r29075.	2013-08-27 22:43:20 +00:00
Rolf vandeVaart	96457df9bc	Fix compile errors created from changeset 29058. This commit was SVN r29061.	2013-08-22 18:25:23 +00:00
Jeff Squyres	63ac60864b	Refs trac:3730 Turns out that AC_CHECK_DECLS is one of the "new style" Autoconf macros that #defines the output to be 0 or 1 (vs. #define'ing or #undef'ing it). So don't check for "#if defined(..."; just check for "#if ...". This commit was SVN r29059. The following Trac tickets were found above: Ticket 3730 --> https://svn.open-mpi.org/trac/ompi/ticket/3730	2013-08-22 17:44:20 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. 
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. 
If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. 
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Rolf vandeVaart	504fa2cda9	Fix support in smcuda btl so it does not blow up when there is no CUDA IPC support between two GPUs. Also make it so CUDA IPC support is added dynamically. Fixes ticket 3531. This commit was SVN r29055.	2013-08-21 21:00:09 +00:00
Rolf vandeVaart	96fdb060ea	Fix compile errors and warnings from changeset 29052. This commit was SVN r29054.	2013-08-21 19:01:54 +00:00
Steve Wise	67fe3f23ed	Use the HAVE_DECL_IBV_LINK_LAYER_ETHERNET macro. Commit r27211 added ifdef checks for #define HAVE_IBV_LINK_LAYER_ETHERNET, which is incorrect. The correct #define is HAVE_DECL_IBV_LINK_LAYER_ETHERNET. This broke OMPI over iWARP. This fixes trac:3726 and should be added to cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29053. The following SVN revision numbers were found above: r27211 --> open-mpi/ompi@b27862e5c7 The following Trac tickets were found above: Ticket 3726 --> https://svn.open-mpi.org/trac/ompi/ticket/3726	2013-08-20 20:00:46 +00:00
Ralph Castain	45e695928f	As per the email discussion, revise the sparse handling of hostnames so that we avoid potential infinite loops while allowing large-scale users to improve their startup time: * add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit. * remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL" * modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded * removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base * added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames This commit was SVN r29052.	2013-08-20 18:59:36 +00:00
Jeff Squyres	b30ad28276	Remove some unused variables and an unused goto label. This commit was SVN r29044.	2013-08-19 16:18:35 +00:00
Ralph Castain	611d7f9f6b	When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require. This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times. Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes: * upon first request for data, have the OPAL db pmi component fetch and decode all the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally * reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test * reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. 
Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued). Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time. This commit was SVN r29040.	2013-08-17 00:49:18 +00:00
Ralph Castain	f8a72feb25	Silence uninitialized var warning This commit was SVN r29036.	2013-08-16 21:39:28 +00:00
Jeff Squyres	c09ec204ad	Change usNIC BTL to always use small fragments when there is a non-contiguous converter. We can't "convert on the fly" because the # of bytes requested may not divide evenly into the convertor data type. This commit was SVN r29014.	2013-08-11 17:04:13 +00:00
George Bosilca	837b3363fe	Silence a few warnings. This commit was SVN r29004.	2013-08-06 09:38:30 +00:00
Brian Barrett	2cc947513b	* Fix some compile errors * Need to subtract 1 off the size so that we stay in the bit length requirements This commit was SVN r28997.	2013-08-05 18:49:48 +00:00
Jeff Squyres	87910daf51	Fix a collection of bugs found by QA and Coverity, and make some minor improvements: * Fix minor memory leaks during component_init * Ensure that an initialization loop does not underflow an unsigned int * Improve mlock limit checking * Fix set of BTL modules created during component_init when failing to get QP resources or otherwise excluding some (but not all) usnic verbs devices * Fix/improve error messages to be consistent with other Cisco documentation * Randomize the initial sliding window sequence number so that we silently drop incoming frames from previous jobs that still have existent processes in the middle of dying (and are still transmitting) * Ensure we don't break out of add_procs too soon and create an asymmetrical view of what interfaces are available This commit was SVN r28975.	2013-08-01 16:56:15 +00:00
Jeff Squyres	f7337b8f77	Correct faulty max payload and MTU computations (and update some debugging that helped us find those). This commit was SVN r28942.	2013-07-24 16:06:28 +00:00
Jeff Squyres	5323051047	Use sysfs to check MPI has enough VFs, QPs, and CQs Use the new sysfs files to check that there are enough VFs, QPs, and CQs for all the MPI processes on this server. Move the checking code into its own subroutine to make it smaller and easier to read/grok. This commit was SVN r28937.	2013-07-24 00:38:32 +00:00
Jeff Squyres	b437041aeb	Update one more comment. This commit was SVN r28908.	2013-07-22 17:29:00 +00:00
Jeff Squyres	4b6006402d	Use the RTE framework instead of calling ORTE directly. Brian (rightfully) hit me on the head with the don't-use-ORTE-use-the-rte-framework clue bat; the usnic BTL now nicely plays with the RTE framework. This commit was SVN r28907.	2013-07-22 17:28:23 +00:00
Jeff Squyres	194b285447	First commit of the Cisco usNIC BTL. This BTL accesses the Cisco usNIC Linux device via the Linux verbs API via Unreliable Datagram queue pairs. A few noteworthy points: * This BTL does most of its own fragmentation; it tells the PML that it has a very high max_send_size (much higher than the network MTU). * Since UD fragments are, by definition, unreliable, the usnic BTL handles all of its own reliability via a sliding window approach using the opal_hotel construct and many tricks stolen from the corpus of knowledge surrounding efficient TCP. * There is a fun PML latency-metric based optimization for NUMA awareness of short messages. * Note that this is ''not'' a generic UD verbs BTL; it is specific to the Cisco usNIC device. This commit was SVN r28879.	2013-07-19 22:13:58 +00:00
Jeff Squyres	3546163c48	Devices that do not support RC QP's are also intentionally skipped; don't warn about skipping them. This commit was SVN r28874.	2013-07-19 19:05:18 +00:00
Rolf vandeVaart	49663fb802	Move CUDA-aware configury to its own file and other minor changes due to review. This commit was SVN r28832.	2013-07-17 22:12:29 +00:00
Mike Dubman	5bd2e15cbb	support for ConnectX3-Pro card. cmr:v1.7:reviewer=jsquyres cmr:v1.6:reviewer=jsquyres This commit was SVN r28787.	2013-07-14 06:44:19 +00:00
Nathan Hjelm	dfca3d4804	fix typos in the ugni and vader btls This commit was SVN r28772.	2013-07-12 17:55:33 +00:00
Nathan Hjelm	1119cd3e8a	Merge branch 'vader_fix' This commit was SVN r28764.	2013-07-11 23:30:20 +00:00
Brian Barrett	2f19fc52de	use the same multi-md workaround the rest of the Portals code is using. This commit was SVN r28761.	2013-07-11 21:00:11 +00:00
Nathan Hjelm	b5281778b0	btl/vader: improve small message performance This commit improved the small message latency and bandwidth when using the vader btl. These improvements should make performance competitive with other MPI implementations. This commit was SVN r28760.	2013-07-11 20:54:12 +00:00
Brian Barrett	bea54eeeb1	First take at a BTL for Portals 4 This commit was SVN r28759.	2013-07-11 20:47:08 +00:00
Jeff Squyres	baa3182794	Per RFC (http://www.open-mpi.org/community/lists/devel/2013/07/12534.php), remove a bunch of dead code. This commit was SVN r28756.	2013-07-11 17:34:28 +00:00
Rolf vandeVaart	5051cd53fd	Use new API. This commit was SVN r28754.	2013-07-11 17:06:14 +00:00
Jeff Squyres	80145742a3	Fix typo in comment This commit was SVN r28747.	2013-07-10 15:13:08 +00:00
Jeff Squyres	ea94936531	First cut at assigning some fine-grained "levels" to MCA parameters for the SM and TCP BTLs, as well as the mca_btl_base_param_register() function (which registers MCA params for all BTLs). The guidelines in https://svn.open-mpi.org/trac/ompi/wiki/MCAParamLevels were used to pick these levels. This commit was SVN r28746.	2013-07-10 00:47:52 +00:00
Aurelien Bouteiller	e1066143a4	rename ompi_free_list operations to _mt, as per discussions at last face to face meeting This commit was SVN r28734.	2013-07-08 22:07:52 +00:00
George Bosilca	483ed8da8c	Remove an unused variable resulting from the removal of the last parameter of the OMPI_FREE_LIST_GET macro. This commit was SVN r28723.	2013-07-04 09:19:00 +00:00
George Bosilca	c9e5ab9ed1	Our macros for the OMPI-level free list had one extra argument, a possible return value to signal that the operation of retrieving the element from the free list failed. However in this case the returned pointer was set to NULL as well, so the error code was redundant. Moreover, this was a continuous source of warnings when the picky mode is on. The attached patch removes the rc argument from the OMPI_FREE_LIST_GET and OMPI_FREE_LIST_WAIT macros, and changes the callers to check if the item is NULL instead of using the return code. This commit was SVN r28722.	2013-07-04 08:34:37 +00:00
George Bosilca	b82abf6bef	Silence a compiler warning. This commit was SVN r28686.	2013-07-01 11:40:42 +00:00
Jeff Squyres	a0b27f5b28	Better comment than what was submitted in r28614. This commit was SVN r28631. The following SVN revision numbers were found above: r28614 --> open-mpi/ompi@9556310bd0	2013-06-13 20:52:44 +00:00
Mike Dubman	9556310bd0	cosmetic: add comment with rationale for malloc.h include This commit was SVN r28614.	2013-06-12 05:58:32 +00:00
Nathan Hjelm	9b1f32bf12	BTL: add flags for signaled BTL operations As per discussion in the June 2013 developer meeting these flags will be used by the PML in the future to request asynchronous progress on an operation. The naming was chosen to reflect that a BTL supports this mode (MCA_BTL_FLAG_SIGNALED) and that a descriptor should "signal" the remote side to wake up and progress the message (MCA_BTL_DES_FLAG_SIGNAL). Future commits will update OB1 to take advantage of this feature when performing the RDMA get or RDMA rendezvous protocols. This commit was SVN r28612.	2013-06-11 21:52:20 +00:00
Mike Dubman	d18b3ae1a7	fix malloc deprecation error with gcc 4.6.3 on ubuntu/fedora This commit was SVN r28605.	2013-06-09 18:13:16 +00:00
Jeff Squyres	713e3aa3db	Refs trac:3626: that ticket specifically refers to the v1.6 branch; this commit is the trunk version of what is needed for #3626. Add the "ignore_device" field to the INI file. This allows us to specifically list devices that should be ignored by the openib BTL (such as the Intel Phi, at least as of May 2013 -- see #3626). Also add the Intel Phi to the ini file, and set its ignore_device=1. Finally, add the concept of counting intentionally ignored verbs devices. Devices are ignored for one of two reasons: * If the number of allowed ports on that device is 0 (i.e., if if_include/if_exclude was set such that we're intentionally ignoring this device). * If the INI ignore_device field for this device is set to 1. Once we have the count of devices that were intentionally ignored, only show the "Hey, there's verbs devices that you're not using!" show_help message if there are devices that were ''unintentionally'' ignored. This commit was SVN r28589. The following Trac tickets were found above: Ticket 3626 --> https://svn.open-mpi.org/trac/ompi/ticket/3626	2013-06-05 12:12:09 +00:00
Jeff Squyres	3019b7a3f8	Oops! Remove duplicate registration. This commit was SVN r28588.	2013-06-05 11:55:19 +00:00
Jeff Squyres	1de00b17ad	Properly check the return status from registering the MCA params. This commit was SVN r28587.	2013-06-05 11:53:18 +00:00
Rolf vandeVaart	3d1d158a80	Do not abort in BTL. Rather, callback into PML error function. Thanks George for review. This commit was SVN r28559.	2013-05-23 18:45:23 +00:00
Nathan Hjelm	721779d7ab	Per RFC: remove old MCA parameter system. This commit was SVN r28541.	2013-05-20 15:36:13 +00:00
Rolf vandeVaart	91fdb423d7	Fix warning in CUDA-aware code. This commit was SVN r28511.	2013-05-14 21:04:15 +00:00
Rolf vandeVaart	52ebb0b17f	Change some opal_output to OPAL_OUTPUT per CMR review. This commit was SVN r28510.	2013-05-14 20:49:42 +00:00
Nathan Hjelm	32a8ff5255	btl/openib: bump up udcm priority This commit was SVN r28505.	2013-05-14 20:02:40 +00:00
Rolf vandeVaart	9d569f1487	Fix warning when compiling in CUDA aware code. This commit was SVN r28476.	2013-05-10 21:29:08 +00:00
Nathan Hjelm	422331b4da	btl/openib: fix unconnected datagram connection method (udcm) The primary issue with udcm is that the immediate data in message acks were often bogus. This caused the sender to keep trying even though a message was received and acked. The fix is to use the source LID and QP to determine which message is being acked. In most cases this should work well since only one message will be in flight to any peer. This commit was SVN r28444.	2013-05-03 17:11:38 +00:00
Alex Mikheev	9e2fdc7d56	- correction of r28440 This commit was SVN r28441. The following SVN revision numbers were found above: r28440 --> open-mpi/ompi@93ce233530	2013-05-02 12:52:58 +00:00
Alex Mikheev	93ce233530	- btl_openib: changed default SRQ settings: - increase number of wqe to minimize number of RNRs - it is better to have high watermark and post relatively small number of wqes - increased TX queue size This commit was SVN r28440.	2013-05-02 12:46:35 +00:00
Alex Mikheev	f76680fbd0	- btl_openib: fix total registered memory calculation for ConnectIB and Ofed 2.0 This commit was SVN r28432.	2013-05-01 13:39:29 +00:00
Jeff Squyres	d92a8e01f8	Use the _SAFE list traversal macro so that we can remove each item from the list (just for good measure), and then free() it (without using _SAFE, we were accessing memory that was just free()'d to get to the next item). Also be a little more thorough -- DESTRUCT the list when we're all done. This commit was SVN r28429.	2013-05-01 12:26:16 +00:00
George Bosilca	8b0335380a	Fix the error messages to reference the correct function. This commit was SVN r28425.	2013-04-30 23:26:03 +00:00
George Bosilca	6a75c84fa8	Remove useless define. This commit was SVN r28424.	2013-04-30 23:24:59 +00:00
Ralph Castain	ceb4061214	Fix BTL_VERBOSE - when the MCA param change was committed, it left the base verbosity variable declared so things compiled. Sadly, the verbosity was now being set to a new variable, so debug never was output. This commit was SVN r28414.	2013-04-30 01:15:52 +00:00

2066 commits