openmpi

Author	SHA1	Message	Date
George Bosilca	366d64b7e5	Move the collective structure outside the communicator. As we changed the ABI (forcing a major release), we can limit the size of the predefined communicators by moving the collective structure outside the communicator. This might have a minimal, but unnoticeable, impact on performance. This approach has been discussed during the January 2017 devel meeting. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-02-27 11:54:17 -06:00
Sylvain Jeaugey	f827b6b8dd	Fix more typos using the allgather module for allreduce operations, causing a crash when CUDA collectives are enabled. Signed-off-by: Sylvain Jeaugey <sjeaugey@nvidia.com> Signed-off-by: Akshay Venkatesh <akvenkatesh@nvidia.com>	2017-02-24 16:35:29 -08:00
Joshua Hursey	78006f93a4	coll: Move reduce_local into the coll framework * Since we are adding a new function to `mca_coll_base_module_2_1_0_t` we need to increase the version of the module structure to `2_2_0`. * Add a comment just above the PREDEFINED_COMMUNICATOR_PAD describing its purpose and when it should change, to help future developers trying to answer the question noted in the comment. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-02-14 08:56:07 -06:00
Joshua Hursey	a2d45f6e9f	communicator: Fix uninitialized variable Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-01-24 16:46:13 -06:00
Joshua Gerrard	d5a45bc12e	Swapped use of fprintf for opal_output_verbose Signed-off-by: Joshua Gerrard <enquiries@joshuagerrard.com>	2016-12-01 19:56:06 +00:00
Joshua Gerrard	7cf5de12b9	Fixed -Werror=unused-result warnings in comm_cid.c by adding error checking Signed-off-by: Joshua Gerrard <enquiries@joshuagerrard.com>	2016-11-29 21:08:12 +00:00
Ralph Castain	1e2019ce2a	Revert "Update to sync with OMPI master and cleanup to build" This reverts commit `cb55c88a8b`.	2016-11-22 15:03:20 -08:00
Ralph Castain	cb55c88a8b	Update to sync with OMPI master and cleanup to build Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-22 14:24:54 -08:00
Ralph Castain	a2919174d0	Bring the RML modifications across. This is the first step in a revamp of the ORTE messaging subsystem to support fabric-based communications during launch and wireup phases. When completed, the grpcomm and plm frameworks will each have their own "conduit" for communication - each conduit corresponds to a particular RML messaging transport. This can be the active OOB-based component, or a provider from within the RML/OFI component. Messages sent down the conduit will flow across the associated transport. Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted. While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress	2016-10-11 16:01:02 -07:00
Gilles Gouaillardet	6c6e35bb40	ompi/communicator: silence warnings	2016-10-06 15:03:06 +09:00
Joshua Hursey	f6f24a4f67	build: Custom libmpi(_FOO) name option in configure * Add a configure time option to rename libmpi(_FOO).* - `--with-libmpi-name=STRING` * This commit only impacts the installed libraries. Internal, temporary libraries have not been renamed to limit the scope of the patch to only what is needed. For example: ```shell shell$ ./configure --with-libmpi-name=wookie ... shell$ find . -name "libmpi" shell$ find . -name "libwookie" ./lib/libwookie.so.0.0.0 ./lib/libwookie.so.0 ./lib/libwookie.so ./lib/libwookie.la ./lib/libwookie_mpifh.so.0.0.0 ./lib/libwookie_mpifh.so.0 ./lib/libwookie_mpifh.so ./lib/libwookie_mpifh.la ./lib/libwookie_usempi.so.0.0.0 ./lib/libwookie_usempi.so.0 ./lib/libwookie_usempi.so ./lib/libwookie_usempi.la shell$ ```	2016-09-29 21:47:24 -05:00
George Bosilca	803897a915	Correctly indent the code.	2016-09-21 07:46:53 -04:00
Gilles Gouaillardet	3b968ec6bb	ompi/communicator: fix typos in CID generation use MPI_MIN instead of MPI_MAX when appropriate, otherwise a currently used CID can be reused, and bad things will likely happen. Refs open-mpi/ompi#2061	2016-09-09 10:10:35 +09:00
Nathan Hjelm	54cc829aab	comm/cid: use ibcast to distribute result in intercomm case This commit updates the intercomm allgather to do a local comm bcast as the final step. This should resolve a hang seen in intercomm tests. Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2016-09-07 10:49:04 -06:00
Nathan Hjelm	f3e9a72f1a	Merge pull request #1987 from hjelmn/cid comm/cid: fix threaded CID allocation	2016-08-19 14:26:39 -06:00
Nathan Hjelm	fbbf743c36	comm/cid: fix threaded CID allocation This commit should restore the pre-non-blocking behavior of the CID allocator when threads are used. There are two primary changes: 1) do not hold the cid allocator lock past the end of a request callback, and 2) if a lower id communicator is detected during CID allocation back off and let the lower id communicator finish before continuing. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-08-19 11:47:19 -06:00
Sylvain Jeaugey	61e900eea5	Fix typo calling allreduce with the allgather module. That was causing CUDA collective to crash.	2016-08-17 17:05:13 -07:00
Ralph Castain	ba77d9beff	Remove forced debugs	2016-08-08 13:20:24 -07:00
Ralph Castain	cacb582ecd	Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application	2016-07-28 14:09:06 -07:00
Gilles Gouaillardet	bbc6d4b3d4	ompi/communicator: remove another debug print statement in ompi_comm_allreduce_intra_pmix_nb()	2016-07-22 15:42:56 +09:00
Ralph Castain	01a653d50a	Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found". Remove stale file reference Restore autogen pass thru pmix Remove generated file	2016-07-20 00:58:19 -07:00
Ralph Castain	36a9063466	Silence warnings	2016-07-19 17:36:13 -07:00
Nathan Hjelm	4c49c42dd0	ompi/comm: improve comm_split_type scalability This commit introduces a new algorithm for MPI_Comm_split_type. The old algorithm performed an allgather on the communicator to decide which processes were part of the new communicators. This does not scale well in either time or memory. The new algorithm performs a couple of all reductions to determine the global parameters of the MPI_Comm_split_type call. If any rank gives an inconsistent split_type (as defined by the standard) an error is returned without proceeding further. The algorithm then creates a communicator with all the ranks that match the split_type (no communication required) in the same order as the original communicator. It then does an allgather on the new communicator (which should be much smaller) to determine 1) if the new communicator is in the correct order, and 2) if any ranks in the new communicator supplied MPI_UNDEFINED as the split_type. If either of these conditions are detected the new communicator is split using ompi_comm_split and the intermediate communicator is freed. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-07-18 12:47:05 -06:00
Nathan Hjelm	035c2e2e2a	ompi/comm: refactor communicator cid code This commit simplifies the communicator context ID generation by removing the blocking code. The high level calls: ompi_comm_nextcid and ompi_comm_activate remain but now call the non-blocking variants and wait on the resulting request. This was done to remove the parallel paths for context ID generation in preparation for further improvements of the CID generation code. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-07-18 12:47:05 -06:00
George Bosilca	73972768f8	Remove an apparently useless function.	2016-07-05 18:30:11 +02:00
Nathan Hjelm	65be935676	comm/split_type: allow MPI_UNDEFINED for split_type It is valid for any rank to deviate on the split_type argument if they specify MPI_UNDEFINED. The code was incorrectly not allowing this condition. Changed the split_type uniformity check and allow local_size to be 0 if the local split_type is MPI_UNDEFINED. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-06-16 17:42:28 -06:00
bosilca	b90c83840f	Refactor the request completion (#1422) * Remodel the request. Added the wait sync primitive and integrate it into the PML and MTL infrastructure. The multi-threaded requests are now significantly less heavy and less noisy (only the threads associated with completed requests are signaled). * Fix the condition to release the request.	2016-05-24 18:20:51 -05:00
Nathan Hjelm	ae0ffbb67f	Merge pull request #1397 from hjelmn/enable_thread_multiple ompi: always enable MPI_THREAD_MULTIPLE support	2016-04-23 08:40:22 -06:00
Nathan Hjelm	a1420003b6	ompi/comm: clean up includes in comm_request.h Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-03-22 09:17:38 -06:00
Nathan Hjelm	230d04327e	ompi: always enable MPI_THREAD_MULTIPLE support This commit removes the --with-mpi-thread-multiple option and forces MPI_THREAD_MULTIPLE support. This cleans up an abstraction violation in opal where OMPI_ENABLE_THREAD_MULTIPLE determines whether the opal_using_threads is meaningful. To reduce the performance hit on MPI_THREAD_SINGLE programs an OPAL_UNLIKELY is used for the check on opal_using_threads in OPAL_THREAD_* macros. This commit does not clean up the arguments to the various functions that take whether multi-threading support is enabled. That should be done at a later time. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-23 10:02:14 -07:00
Gilles Gouaillardet	030a5f2054	sentinel: use type uintptr_t for sentinel MSB is now automatically cleared when right shifting Thanks George for pointing this	2016-02-10 11:28:56 +09:00
Artem Polyakov	2abb2972ac	Fix Mellanox copyrights with respect to the following PRs: * https://github.com/open-mpi/ompi/pull/1184 * https://github.com/open-mpi/ompi/pull/1188 * https://github.com/open-mpi/ompi/pull/1197 * https://github.com/open-mpi/ompi/pull/1202 * https://github.com/open-mpi/ompi/pull/1210 * https://github.com/open-mpi/ompi/pull/1216 * https://github.com/open-mpi/ompi/pull/1236 * https://github.com/open-mpi/ompi/pull/1237 * https://github.com/open-mpi/ompi/pull/1248 * https://github.com/open-mpi/ompi/pull/1260 * https://github.com/open-mpi/ompi/pull/1264	2015-12-30 00:12:19 +06:00
Nathan Hjelm	b7ba301310	Merge pull request #1165 from hjelmn/add_procs_group ompi/group: release ompi_proc_t's at group destruction	2015-12-14 13:53:42 -08:00
Artem Polyakov	ee71e35a90	Fix ompi_comm_create when source communicator is inter-communicator. This bug was triggered by probe-intercom and icm tests from MPICH suite.	2015-12-09 15:44:26 +02:00
Artem Polyakov	7690f4027a	Yet one more fix to intercommunicator splitting logic. Previous commit `f2794740` reverts Nathan's changes. However it turns out that I was unable to trace his logic until I started investigating the icsplit hang. The bug was triggered when splitting an intercommunicator produced a group where one side of the communicator was empty (icsplit, intercom create #2). In this case remote_size == 0 and there is no way to distinguish between inter- and intra-communicator. Conclusion: We do need to distinguish between intra- and inter-communicators. So we should use ompi_mpi_group_null.group.	2015-12-08 08:43:08 +02:00
Artem Polyakov	f2794740b3	Fix intercommunicator split (was triggered by MPICH/icsend test)	2015-12-07 15:41:29 +02:00
Nathan Hjelm	406b9ff1e6	ompi/group: add helper function for creating plist groups This commit adds a helper function for creating groups from proc lists. The function is used by ompi_comm_fill_rest to create the local and remote groups. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-11-30 23:52:57 -07:00
Nathan Hjelm	5334d22a37	ompi/group: release ompi_proc_t's at group destruction This commit changes the way ompi_proc_t's are retained/released by ompi_group_t's. Before this change ompi_proc_t's were retained once for the group and then once for each retain of a group. This method adds unnecessary overhead (need to traverse the group list each time the group is retained) and causes problems when using an async add_procs. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-11-30 23:03:47 -07:00
Gilles Gouaillardet	a0782e1c7e	mpi: MPI_Neighbor_all* and MPI_Ineighbor_all* do not work with inter communicators (fail with MPI_ERR_COMM) or non process topologies communicators (fail with MPI_ERR_TOPOLOGY)	2015-10-21 16:21:19 +09:00
Nathan Hjelm	6b83fa2f58	ompi/comm: fix coverity errors Fixes CID 1323841: Logically dead code Wrong value in conditional. Should be newcomp not newcomm. Fixes CID 1269762: Explicit null dereference ompi_group_incl could return an error and not set local_group. Add a check to ensure ompi_group_incl succeeded before incrementing the proc count. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-28 15:55:19 -06:00
Nathan Hjelm	c84c05bab7	ompi/comm: fix comm_[i]dup on intracommunicators The behavior of ompi_comm_set was changed to get the remote size from the remote group. This broke how ompi_comm_[i]dup were using ompi_comm_set. In order to adapt to the new behavior these functions now pass NULL for the remote group if the communicator is not an inter-communicator. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-16 10:31:18 -06:00
Ralph Castain	c1bbbb5e2f	Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds	2015-09-15 13:08:35 -07:00
Nathan Hjelm	6379178046	ompi/comm: fix bug in ompi_comm_set This commit updates the behavior of ompi_comm_set to explicitly take either local/remote group(s) OR local/remote array(s). If array(s) are in use the sizes will be taken from the appropriate group(s). Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-15 11:37:44 -06:00
Ralph Castain	fbcf819d2e	Remove unnecessary include	2015-09-11 15:53:00 -07:00
Nathan Hjelm	c45789a222	ompi/comm: improve comm_split_type scalability This commit includes two changes. First, the locality code has been factored out to improve readability and maintainability. Second, instead of looking up each proc using ompi_group_peer_lookup the code now uses ompi_group_peer_lookup_existing. The code falls back on modex if a proc doesn't exist. This will prevent MPI_Comm_split_type from allocating ompi_proc_t's for every process in the job. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-11 13:53:48 -06:00
Nathan Hjelm	5b7943db78	ompi/group: do not allocate ompi_proc_t's on group union/difference This commit modifies the ompi_group_t union/difference code to compare/copy the raw group values. This will either be a ompi_proc_t or a sentinel value. This commit also adds helper functions to convert between opal process names and sentinel values. Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-09-10 08:55:55 -06:00
Nathan Hjelm	0bf06de3f1	group|comm: add initial support for group sentinel values This commit modifies ompi's process list group object to support a sentinel value for nonexistent ompi_proc_t objects. The sentinel was chosen to be the negative of the opal_process_name_t of the associated ompi_proc_t. This takes advantage of the fact that on most (all?) systems the top bit of a user-space pointer is never set. If this changes then a new sentinel will be needed. In addition this commit modifies the way ompi_mpi_comm_world is initialized to fill in the group with sentinel values if the number of processes exceeds the new add_procs behavior cutoff. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:54 -06:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Nathan Hjelm	69a0e1dd08	comm: fix coverity issues CID 1269683 Unchecked return value (CHECKED_RETURN) CID 1269684 Unchecked return value (CHECKED_RETURN) Use ompi_comm_rank instead of MPI_Comm_rank here. There is no reason to be using the MPI interface over the internal interface. This should clear up these issues. Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-02 09:40:26 -06:00


262 Commits