openmpi

Автор	SHA1	Сообщение	Дата
Nathan Hjelm	3c33a8e94b	btl/ugni: adjust exclusivity below sm and vader Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-29 15:40:35 -06:00
Nathan Hjelm	12bd300c40	Merge pull request #929 from hjelmn/add_procs Update add_procs support	2015-09-28 17:29:13 -06:00
Todd Kordenbrock	3e63a3458c	portals4: add support for dynamic add_procs() to all Portals4 components In the default mode of operation, the Portals4 components support dynamic add_procs(). The Portals4 components have two alternate modes (flow control and logical-to-physical) that require knowledge of all procs at startup. In these modes, mtl-portals4 sets the MCA_MTL_BASE_FLAG_REQUIRE_WORLD flag and btl-portals4 sets the MCA_BTL_FLAGS_SINGLE_ADD_PROCS flag to tell the PML that we need all the procs in one add_procs() call.	2015-09-24 22:12:57 -05:00
Todd Kordenbrock	3afac9e37d	btl-portals4: fix PMIx integration problem After PMIx integration, the thrid parameter to OPAL_MODEX_RECV() is opal_process_name_t instead of opal_proc_t. This commit replaces proc with &proc->proc_name.	2015-09-24 21:53:20 -05:00
Nathan Hjelm	60f3dbd160	btl/openib: fix udcm coverity errors Fix CID 1312120: Uninitialized scalar variable The response type will always be set unless a message of another type is passed to this function. To make sure that error is caught I am adding an assert. Fix CID 1312116: Dereference after null check This is a potential bug. If there is no endpoint data for an incoming connection a rejection should be sent. In this case we would just SEGV. Fix CID 1312115: Dereference after null check Clear error in the error message. Use the queue pair number that was passed in. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-23 17:08:27 -06:00
Nathan Hjelm	54a4061d88	Add support for detecting when dynamic add_procs is not possible This commit adds support to the pml, mtl, and btl frameworks for components to indicate at runtime that they do not support the new dynamic add_procs behavior. At the high end the lack of dynamic add_procs support is signalled by the pml using the new pml_flags member to the pml module structure. If the MCA_PML_BASE_FLAG_REQUIRE_WORLD flag is set MPI_Init will generate the ompi_proc_t array passed to add_proc from ompi_proc_world () instead of ompi_proc_get_allocated (). Both cm and ob1 have been updated to detect if the underlying mtl and btl components support dynamic add_procs. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-23 16:22:05 -06:00
Jeff Squyres	7bf364ff04	usnic: create blocking libevent for progress thread emulation In the v1.10 opal_progress_thread emulation, ensure to create a blocking libevent timer far in the future. Without that, opal_event_loop() will return immediately (and therefore the progress thread spins hard, stealing CPU cycles).	2015-09-18 11:53:03 -07:00
Jeff Squyres	7cb6c2fcc2	usnic: put the HWLOC #if's back to preserve compat with v1.10 We try to keep the source code the same between master and v1.10. So put the #if's back for OPAL_HAVE_HWLOC (and just hard-code it to 1 on master) so that this code is also compilable in v1.10.	2015-09-18 11:53:03 -07:00
Ralph Castain	1b7930ad52	Silence some warnings and address Coverity issues	2015-09-16 07:58:22 -07:00
George Bosilca	0e7e14449f	Typo in the modex_recv.	2015-09-14 18:00:02 -04:00
Gilles Gouaillardet	d5af5d106c	btl/sm: mca_btl_sm_sendi: do not set *descriptor when descriptor is NULL	2015-09-14 14:04:40 +09:00
Jeff Squyres	f7d90abf42	usnic: update for new add_procs() downcall behavior	2015-09-10 08:55:55 -06:00
Jeff Squyres	2f2d5ff855	btl.h: update comment for new add_procs behavior	2015-09-10 08:55:55 -06:00
Nathan Hjelm	2041aac4e4	btl/openib: add support for dynamic add_procs Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:55 -06:00
Nathan Hjelm	40067f7ec4	btl/tcp: add support for dynamic add_procs This commit makes two changes to the tcp btl: - If a tcp proc does not exist when handling a new connection create a new proc and use it. The current implementation uses the opal_proc_by_name() function to get the opal_proc_t then calls add_procs on all btl modules. It may be sufficient to just call add_procs until an endpoint is created so this may change somewhat. - In add_procs add a check for an existing endpoint before creating one. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:55 -06:00
Nathan Hjelm	536aba1172	btl/portals4: add send flag to btl_flags	2015-09-10 08:55:55 -06:00
Nathan Hjelm	408da16d50	ompi/proc: add proc hash table for ompi_proc_t objects This commit adds an opal hash table to keep track of mapping between process identifiers and ompi_proc_t's. This hash table is used by the ompi_proc_by_name() function to lookup (in O(1) time) a given process. This can be used by a BTL or other component to get a ompi_proc_t when handling an incoming message from an as yet unknown peer. Additionally, this commit adds a new MCA variable to control the new add_procs behavior: mpi_add_procs_cutoff. If the number of ranks in the process falls below the threshold a ompi_proc_t is created for every process. If the number of ranks is above the threshold then a ompi_proc_t is only created for the local rank. The code needed to generate additional ompi_proc_t's for a communicator is not yet complete. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:54 -06:00
Nathan Hjelm	6f8f2325ed	btl: btls are now required to set the send flag if supported This commit updates each non-compliant btl to send the MCA_BTL_FLAGS_SEND flag in the btl_flags field if send is supported. This fixes a problem identified after the latest bml/r2 update which excplicitly checks for the send flag. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:54 -06:00
Matias Cabral	f360eebfeb	Merge pull request #855 from matcabral/btl_openib_mtu Fix for openib btl mca command line parameter btl_openib_mtu being ignored	2015-09-09 11:22:00 -07:00
Rolf vandeVaart	2e64a69fa9	Add some verbosity to help debug hwloc issues	2015-09-08 10:50:22 -07:00
Jeff Squyres	bc9e5652ff	whitespace: purge whitespace at end of lines Generated by running "./contrib/whitespace-purge.sh".	2015-09-08 09:47:17 -07:00
Jeff Squyres	f782a7640e	usnic: minor re-order of Makefile.am sources Put the hwloc.c file alphabetically in the list.	2015-09-05 05:02:00 -07:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
matcabral	1f9218a0bc	Fix for openib btl mca command line parameter btl_openib_mtu being ignored.	2015-09-02 02:22:30 -07:00
Nathan Hjelm	f926796e57	Merge pull request #828 from hjelmn/openib_thread_fix openib thread fixes	2015-09-01 09:12:50 -06:00
Jeff Squyres	8558458bb9	usnic: adjust for new PMIX argument type	2015-08-31 14:55:58 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Nathan Hjelm	64e4419d76	btl/openib: allow the use of the openib btl in thread muliple There were several issues preventing the openib btl from running in thread multiple mode: - Missing locks in UDCM when generating a loopback endpoint. Fixed in open-mpi/ompi@8205d79819. - Incorrect sequence numbers generated in debug mode. This did not prevent the openib btl from running but instead produced incorrect error messages in debug builds. - Recursive locking of the rcache lock caused by the malloc hooks. This is fixed by open-mpi/ompi#827 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 16:04:52 -06:00
Nathan Hjelm	c101385f64	btl/openib: fix sequence number generation for debug mode When using eager RDMA in debug builds the openib btl generates a sequence number for each send. The code independently updated the head index and the sequence number for the eager rdma transaction. If multiple threads enter this code at the same time and run in the following order: thread 1: update sequence (0 -> 1) thread 2: update sequence (1 -> 2) thread 2: update head (0 -> 1) thread 1: update head (1 -> 2) the sequence number for head[0] gets 1 and the sequence number for head[1] gets 0. The fix is to generate the sequence number from the head index. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 16:00:06 -06:00
Nathan Hjelm	8205d79819	btl/openib: add missing lock calls Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 12:21:49 -06:00
Gilles Gouaillardet	d02ccd67de	btl/openib: remove OFED version runtime check when XRC is used this test seems broken : - some false positive were reported - it fails to detect some OFED version mismatch this commit simply removes this test, which means the application will likely fail if XRC is used ad OFED version is different between compile time and runtime	2015-08-14 09:10:03 +09:00
Jeff Squyres	14340770c4	usnic: remove some logically dead code This code really had no purpose; just assign FI_VERSION(1, 1). This fixes CID 1315274. Also clarify the commet about why we still retain libfabric v1.0.0 compatibility code, even though configure.m4 requires libfabric >= v1.1.0.	2015-08-12 05:21:18 -07:00
Jeff Squyres	236cf7ff62	usnic: add v1.8/v1.10 compat code Add compat code so that I can sync master against the v1.10 branch.	2015-08-10 16:27:38 -07:00
Jeff Squyres	7da1c4b875	usnic: avoid race condition in connectivity checker In short applications, it's possible that the agent (i.e., local rank 0) will finalize after non-local rank 0 procs detect the connectivity checker named socket, but before they complete a connect() on it. As such, their connect() gets ECONNREFUSED. This commit adds a simple counter in the agent that won't let it quit before it accept()'s from all local procs, or 10 seconds goes by (whichever occurs first). This is similar to the timeout for the clients: they'll exit if they don't see the expected named socket within 10 seconds.	2015-08-10 15:40:33 -07:00
Jeff Squyres	bad508687e	usnic: move cchecker to OPAL-wide progress thread There's no longer any need for the usnic BTL to have its own progress thread: it can use the opal_progress_thread() infrastructure. This commit removes the code to startup/shutdown the usnic-BTL-specific progress thread and instead, just adds its events to the OPAL-wide progress thread. This necessitated a small change in the finalization step. Previously, we would stop the progress thread and then tear down the events. We can no longer stop the progress thread, and if we start tearing down events, this will cause shutdown/hangups to be sent across sockets, potentially firing some of the still-remaining events while some (but not all) of the data structures have been torn down. Chaos ensues. Instead, queue up an event to tear down all the pending events. Since the progress thread will only fire one event at a time, having a teardown event means that it can tear down all the pending events "atomically" and not have to worry that one of those events will get fired in the middle of the teardown process.	2015-08-10 15:40:33 -07:00
Jeff Squyres	b5c37dbfe2	CSCuv67889: usnic: fix an error corner case Ensure that we have non-NULL on all levels of pointers, which will save us if there are exitable errors very early during component / module initialization.	2015-08-06 10:54:28 -07:00
Jeff Squyres	cbcd16b399	usnic: remove a stale shell variable name	2015-07-31 18:53:54 -07:00
Jeff Squyres	0ee8295e6e	usnic: ensure that we have libfabric >= v1.1	2015-07-31 18:53:54 -07:00
Jeff Squyres	2e7f794aae	usnic: convert to use fi_recvmsg / FI_MORE Minor optimization to post 16 receive buffers at a time (vs. 1).	2015-07-31 12:45:40 -07:00
Jeff Squyres	ec3a38384f	Merge pull request #688 from jsquyres/pr/usnic-libfabric-msg-prefix-fix usnic fixes for differences between libfabric v1.0.0 and v1.1.0	2015-07-21 10:18:36 -04:00
Gilles Gouaillardet	f7cf7d5070	configury: fix XRC detection on OFED < 3.12 since ibv_create_xrc_rcv_qp is now deprecated, and in order to be "future-proof", we have to consider the case in which only XRC Domains are supported. also, correctly handle distro that ship broken ibverbs devel headers Thanks Paul Hargrove for the detailled report.	2015-07-13 10:43:22 +09:00
Ralph Castain	683efcb850	Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.	2015-07-11 10:08:19 -07:00
Ralph Castain	61fb067f14	Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base	2015-07-11 06:42:23 -07:00
Jeff Squyres	633da6641e	usnic: gracefully handle when we can't alloc an ACK The comment didn't match the debugging code (which was ugly, and apparently never happens, anyway). Just return and let the sender retransmit.	2015-07-10 14:19:33 -07:00
Jeff Squyres	3327fa56b5	usnic: minor code cleanups	2015-07-10 10:10:43 -07:00
Jeff Squyres	f9c65a701e	usnic: "sin" assignment needs to be outside the #if The "sin" variable is used below; need to ensure that it is assigned for all builds (not just debug builds).	2015-07-10 06:51:03 -07:00
Jeff Squyres	cd87c8ad41	usnic: misc compiler warnings fixes	2015-07-10 06:51:03 -07:00
Jeff Squyres	ba429dc890	usnic: temporarily disable the BTL put method The usnic BTL put method is currently broken. Disable it until we can fix it properly.	2015-07-10 06:51:03 -07:00
Jeff Squyres	f265358fbe	usnic: handle FI_MSG_PREFIX differences libfabric v1.0.0->v1.1.0 In libfabric v1.0.0 (i.e., API v1.0), the usnic provider handled FI_MSG_PREFIX inconsistently between sends and receives. This has been fixed in libfabric v1.1.0 (i.e., API v1.1): FI_MSG_PREFIX is handled consistently for both sends and receives. Run-time detect which libfabric we are running with and adapt behavior appropriately.	2015-07-10 06:51:03 -07:00
Jeff Squyres	ddd0de6cfc	usnic: make more OS-bypass memory Valgrind-defined This helps reduce false positives when running MPI apps through Valgrind.	2015-07-10 06:51:03 -07:00

1 2 3 4 5 ...

442 Коммитов