openmpi

Автор	SHA1	Сообщение	Дата
Nathan Hjelm	d4afb16f5a	opal: rework mpool and rcache frameworks This commit rewrites both the mpool and rcache frameworks. Summary of changes: - Before this change a significant portion of the rcache functionality lived in mpool components. This meant that it was impossible to add a new memory pool to use with rdma networks (ugni, openib, etc) without duplicating the functionality of an existing mpool component. All the registration functionality has been removed from the mpool and placed in the rcache framework. - All registration cache mpools components (udreg, grdma, gpusm, rgpusm) have been changed to rcache components. rcaches are allocated and released in the same way mpool components were. - It is now valid to pass NULL as the resources argument when creating an rcache. At this time the gpusm and rgpusm components support this. All other rcache components require non-NULL resources. - A new mpool component has been added: hugepage. This component supports huge page allocations on linux. - Memory pools are now allocated using "hints". Each mpool component is queried with the hints and returns a priority. The current hints supported are NULL (uses posix_memalign/malloc), page_size=x (huge page mpool), and mpool=x. - The sm mpool has been moved to common/sm. This reflects that the sm mpool is specialized and not meant for any general allocations. This mpool may be moved back into the mpool framework if there is any objection. - The opal_free_list_init arguments have been updated. The unused0 argument is not used to pass in the registration cache module. The mpool registration flags are now rcache registration flags. - All components have been updated to make use of the new framework interfaces. As this commit makes significant changes to both the mpool and rcache frameworks both versions have been bumped to 3.0.0. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-03-14 10:50:41 -06:00
Gilles Gouaillardet	d8482ce6f4	opal/mca/memory: add a memoryc_set_alignment subroutine to the OPAL memory MCA this commit also (partially) reverts : - open-mpi/ompi@7de01b347c - open-mpi/ompi@8b05f308f9	2016-02-24 09:50:12 +09:00
Gilles Gouaillardet	7de01b347c	ompi/init: fix abstraction violation This fixes open-mpi/ompi@8b05f308f9 libmpi.so cannot be built (unresolved symbols) with configure'd with --disable-mem-debug --disable-mem-profile --disable-memchecker --without-memory-manager	2016-02-16 16:39:21 +09:00
Igor Ivanov	8b05f308f9	opal/memory: Move Memory Allocation Hooks usage from openib These changes fix issue https://github.com/open-mpi/ompi/issues/1336 - improve abstractions: opal/memory/linux component should be single place that opeartes with Memory Allocation Hooks. - avoid collisions in case dynamic component open/close: it is safe because it is linked statically. - does not change original behaivour.	2016-02-11 14:46:35 +02:00
Ralph Castain	810f2446b7	Add pmix120 component, update the error handling functions in the PMIx API. Update the configure logic for the new pmix120 component ckpt Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works Cleanup the rename files to use the pretty macros	2015-12-28 23:15:44 +09:00
Ralph Castain	267ca8fcd3	Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now to continue current default behavior. Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_ modex is used.	2015-10-27 17:31:56 -07:00
Jeff Squyres	f5ad90c920	init/finalize: extensions Proposed extensions for Open MPI: - If MPI_INITLIZED is invoked and MPI is only partially initialized, wait until MPI is fully initialized before returning. - If MPI_FINALIZED is invoked and MPI is only partially finalized, wait until MPI is fully finalized before returning. - If the ompi_mpix_allow_multi_init MCA param is true, allow MPI_INIT and MPI_INIT_THREAD to be invoked multiple times without error (MPI will be safely initialized only the first time it is invoked).	2015-10-15 12:39:15 -04:00
Mike Dubman	5bebed45eb	OMPI: set "in finalize" indicator in finalize flow	2015-10-04 09:39:37 +03:00
Nathan Hjelm	54a4061d88	Add support for detecting when dynamic add_procs is not possible This commit adds support to the pml, mtl, and btl frameworks for components to indicate at runtime that they do not support the new dynamic add_procs behavior. At the high end the lack of dynamic add_procs support is signalled by the pml using the new pml_flags member to the pml module structure. If the MCA_PML_BASE_FLAG_REQUIRE_WORLD flag is set MPI_Init will generate the ompi_proc_t array passed to add_proc from ompi_proc_world () instead of ompi_proc_get_allocated (). Both cm and ob1 have been updated to detect if the underlying mtl and btl components support dynamic add_procs. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-23 16:22:05 -06:00
Nathan Hjelm	2c89c7f47d	ompi/proc: add function to get all allocated procs This commit adds two new functions: - ompi_proc_get_allocated - Returns all procs in the current job that have already been allocated. This is used in init/finalize to determine which procs to pass to add_procs/del_procs. - ompi_proc_world_size - returns the number of processes in MPI_COMM_WORLD. This may be removed in favor of callers just looking at ompi_process_info. The behavior of ompi_proc_world has been restored to return ompi_proc_t's for all processes in the current job. The use of this function is discouraged. Code that was using ompi_proc_world() has been updated to make use of the new functions to avoid the memory overhead of ompi_comm_world (). Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-23 16:22:05 -06:00
Nathan Hjelm	408da16d50	ompi/proc: add proc hash table for ompi_proc_t objects This commit adds an opal hash table to keep track of mapping between process identifiers and ompi_proc_t's. This hash table is used by the ompi_proc_by_name() function to lookup (in O(1) time) a given process. This can be used by a BTL or other component to get a ompi_proc_t when handling an incoming message from an as yet unknown peer. Additionally, this commit adds a new MCA variable to control the new add_procs behavior: mpi_add_procs_cutoff. If the number of ranks in the process falls below the threshold a ompi_proc_t is created for every process. If the number of ranks is above the threshold then a ompi_proc_t is only created for the local rank. The code needed to generate additional ompi_proc_t's for a communicator is not yet complete. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:54 -06:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Gilles Gouaillardet	9c77c6b66d	fortran: fix f08 bindings only define the unique fortran symbol depending on - CAPS - PLAIN - SINGLE_UNDERSCORE - DOUBLE_UNDERSCORE and bind the f08 symbol to the uniquely defined C symbol. Use real data structures to make the code simpler. (perl script written by Jeff)	2015-07-27 16:28:57 +09:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Gilles Gouaillardet	9d56b85b55	initialize common symbols from ompi	2015-05-08 10:11:58 +09:00
Nadezhda Kogteva	116169c38a	opal timing: added ability to choose the timer type	2015-04-17 11:15:55 +03:00
Nathan Hjelm	a7b0c00ab6	fix memory leaks and valgrind errors This commit fixes several vagrind errors. Included: - installdirs did not correctly reinitialize all pointers to NULL at close. This causes valgrind errors on a subsequent call to opal_init_tool. - several opal strings were leaked by opal_deregister_params which was setting them to NULL instead of letting them be freed by the MCA variable system. - move opal_net_init to AFTER the variable system is initialized and opal's MCA variables have been registered. opal_net_init uses a variable registered by opal_register_params! - do not leak ompi_mpi_main_thread when it is allocated by MPI_T_init_thread. - do not overwrite ompi_mpi_main_thread if it is already set (by MPI_T_init_thread). - mca_base_var: read_files was overwritting mca_base_var_file_list even if it was non-NULL. - mca_base_var: set all file global variables to initial states on finalize. - btl/vader: decrement enumerator reference count to ensure that it is freed. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-11 09:28:35 -06:00
Artem Polyakov	01601f3284	Merge pull request #305 from artpol84/timing Timing framework improvement	2014-12-16 15:13:48 +06:00
George Bosilca	3430714989	Correctly propagate the requested level of thread support during the component init calls.	2014-12-13 02:36:21 -05:00
Artem Polyakov	8ffad75a0a	Introduce timing interval measurement facility in timing framework	2014-12-10 16:47:49 +06:00
Ralph Castain	06e49d0e92	Per contribution from Pascal Deveze of Bull: move opal_set_using_threads earlier in MPI_Init (before datatype init) so the value gets set in time to be properly used.	2014-12-09 00:37:57 -08:00
Ralph Castain	780c93ee57	Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL. We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.	2014-11-11 17:00:42 -08:00
Ralph Castain	fd6a044b7f	Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages. Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.	2014-10-03 16:02:57 -06:00
Ralph Castain	dfb952fa78	[Contribution from Artem - moved it to svn from git for him] Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup. This commit was SVN r32738.	2014-09-15 18:00:46 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “lmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Jeff Squyres	132375f07f	helpfiles: fix filenames referenced by calls to show_help() This commit was SVN r32453.	2014-08-08 13:34:15 +00:00
Ralph Castain	daeb9b6c4f	Some more cleanups. Remove direct references to ORTE by changing OMPI_CAST_ORTE_NAME -> OMPI_CAST_RTE_NAME. Ensure that ORTE tools (mpirun, orted, tools) set the OPAL proc structure fields so OPAL knows what is going on and uses the correct print functions (still need to fix the problem for non-MPI apps). Properly return uint32_t from the opal utilities instead of int32_t as that is what the ORTE process name fields contain. Thanks to Gilles for pointing out some of the discrepancies. This commit was SVN r32398.	2014-08-01 14:44:11 +00:00
George Bosilca	9b2fcd898e	No more ORTE specifics in this file. This commit was SVN r32384.	2014-07-31 22:34:16 +00:00
George Bosilca	f39abb9e69	Reverting r32355: a number of processes is not a notion that a low level communication library should use to initialize itself. Ralph will champion this change back with an RFC if there is a realistic need/use case from the community. This commit was SVN r32361. The following SVN revision numbers were found above: r32355 --> open-mpi/ompi@c903917f47	2014-07-30 20:11:35 +00:00
Ralph Castain	c903917f47	Expose the num_procs information to the opal layer as the info is needed in several BTLs This commit was SVN r32355.	2014-07-30 09:33:41 +00:00
Ralph Castain	bcade48e27	Move the opal_process_info initialization to the right place - all that info is known (for ourselves only) immediately after rte_init. This commit was SVN r32329.	2014-07-28 19:19:35 +00:00
George Bosilca	a3feb627cf	Move some of the ompi_process_info down in OPAL. This commit was SVN r32324.	2014-07-26 21:43:34 +00:00
Ralph Castain	552c9ca5a0	George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-) WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic. This commit was SVN r32317.	2014-07-26 00:47:28 +00:00
Ralph Castain	6c5e592785	Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. Update the orte/test/mpi/coll_test.c test to show the revised example. This commit was SVN r32234. The following SVN revision numbers were found above: r32203 --> open-mpi/ompi@a523dba41d r32210 --> open-mpi/ompi@2ce11ed5c4 r32222 --> open-mpi/ompi@d55f16db50	2014-07-15 03:48:00 +00:00
Ralph Castain	a523dba41d	NOTE: this modifies the MPI-RTE interface We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch. which required modification s to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunti ng. This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't curren tly have a way of tagging the collective to a specific thread. From the comment in the code: * In order to allow scalable * generation of collective id's, they are formed as: * * top 32-bits are the jobid of the procs involved in * the collective. For collectives across multiple jobs * (e.g., in a connect_accept), the daemon jobid will * be used as the id will be issued by mpirun. This * won't cause problems because daemons don't use the * collective_id * * bottom 32-bits are a rolling counter that recycles * when the max is hit. The daemon will cleanup each * collective upon completion, so this means a job can * never have more than 2*32 collectives going on at a time. If someone needs more than that - they've got * a problem. * * Note that this means (for now) that RTE-level collectives * cannot be done by individual threads - they must be * done at the overall process level. This is required as * there is no guaranteed ordering for the collective id's, * and all the participants must agree on the id of the * collective they are executing. So if thread A on one * process asks for a collective id before thread B does, * but B asks before A on another process, the collectives will * be mixed and not result in the expected behavior. We may * find a way to relax this requirement in the future by * adding a thread context id to the jobid field (maybe taking the * lower 16-bits of that field). This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives. This commit was SVN r32203.	2014-07-10 18:53:12 +00:00
Ralph Castain	f3cb124e50	Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed. This commit was SVN r32089. The following SVN revision numbers were found above: r32070 --> open-mpi/ompi@12d92d0c22 r32082 --> open-mpi/ompi@aa6438ef7a	2014-06-25 20:43:28 +00:00
Ralph Castain	12d92d0c22	Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS This commit was SVN r32070.	2014-06-24 17:05:11 +00:00
Ralph Castain	248a4b100f	Per Artem, we don't know our VPID at the time of getting the initial timing mark, so just get it if timing is requested This commit was SVN r31951.	2014-06-04 16:28:41 +00:00
George Bosilca	750c6c7861	Update the UTK copyright on the topology related files. This commit was SVN r31805.	2014-05-16 22:23:52 +00:00
Ralph Castain	ab4f8585b0	When we abort during MPI_Init, we currently emit a totally incorrect error message stating that we were unable to aggregate error messages and cannot guarantee all other processes were killed. This simply isn't true IF the rte has been initialized. So track that the rte has reached that point, and only emit the new message if it is accurate. Note that we still generate a TON of output for a minor error: Ralphs-iMac:examples rhc$ mpirun -n 3 -mca btl sm ./hello_c -------------------------------------------------------------------------- At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL. Process 1 ([[50239,1],2]) is on host: Ralphs-iMac Process 2 ([[50239,1],2]) is on host: Ralphs-iMac BTLs attempted: sm Your MPI job is now going to abort; sorry. -------------------------------------------------------------------------- * An error occurred in MPI_Init * on a NULL communicator * MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, * and potentially your MPI job) * An error occurred in MPI_Init * on a NULL communicator * MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, * and potentially your MPI job) * An error occurred in MPI_Init * on a NULL communicator * MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, * and potentially your MPI job) -------------------------------------------------------------------------- MPI_INIT has failed because at least one MPI process is unreachable from another. This usually means that an underlying communication plugin -- such as a BTL or an MTL -- has either not loaded or not allowed itself to be used. Your MPI job will now abort. You may wish to try to narrow down the problem; * Check the output of ompi_info to see which BTL/MTL plugins are available. * Run your application with MPI_THREAD_SINGLE. * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose, if using MTL-based communications) to see exactly which communication plugins were considered and/or discarded. -------------------------------------------------------------------------- ------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted. ------------------------------------------------------- -------------------------------------------------------------------------- mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[50239,1],2] Exit code: 1 -------------------------------------------------------------------------- [Ralphs-iMac.local:23227] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc [Ralphs-iMac.local:23227] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [Ralphs-iMac.local:23227] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail Ralphs-iMac:examples rhc$ Hopefully, we can agree on a way to reduce this verbage! This commit was SVN r31686. The following SVN revision numbers were found above: r2 --> open-mpi/ompi@58fdc18855	2014-05-08 15:48:16 +00:00
Nathan Hjelm	595a6e94e6	Fix typos in r31260 Also added some missing values and sentinels. cmr=v1.8:ticket=trac:4470 This commit was SVN r31263. The following SVN revision numbers were found above: r31260 --> open-mpi/ompi@69036437b7 The following Trac tickets were found above: Ticket 4470 --> https://svn.open-mpi.org/trac/ompi/ticket/4470	2014-03-27 22:34:28 +00:00
George Bosilca	bde9619386	Various minor cleanups. This commit was SVN r30431.	2014-01-26 17:27:12 +00:00
Mike Dubman	2af0f878bc	remove bml_init call, called from btl add_proc. Refs trac:3763 This commit was SVN r30310. The following Trac tickets were found above: Ticket 3763 --> https://svn.open-mpi.org/trac/ompi/ticket/3763	2014-01-17 16:52:20 +00:00
Mike Dubman	b7750ccbf4	OSHMEM: bml initialization is moved into ompi_init it fixes race of mca_var segfault in finalization of shmem based on this thread: http://www.open-mpi.org/community/lists/devel/2014/01/13778.php Refs trac:3763 fixed by Igor, reviewed by Brian This commit was SVN r30304. The following Trac tickets were found above: Ticket 3763 --> https://svn.open-mpi.org/trac/ompi/ticket/3763	2014-01-17 06:09:29 +00:00
Ralph Castain	e7710873a1	Open/close the RTE framework cmr=v1.7.4:reviewer=hjelmn This commit was SVN r30270.	2014-01-13 17:43:24 +00:00
Nathan Hjelm	c17b21b11d	Due to MPI_Comm_idup we can no longer use the communicator's CID as the fortran handle. Use a seperate opal_pointer_array to keep track of the fortran handles of communicators. This commit also fixes a bug in ompi_comm_idup where the newcomm was not set until after the operation completed. cmr=v1.7.4:reviewer=jsquyres:ticket=trac:3796 This commit was SVN r29342. The following Trac tickets were found above: Ticket 3796 --> https://svn.open-mpi.org/trac/ompi/ticket/3796	2013-10-03 01:11:28 +00:00
Nathan Hjelm	195c892a9d	Add support for MPI_Comm_dup_with_info, MPI_Comm_create_group, and MPI_Comm_idup. As part of this work I implemented a basic request scheduler in ompi/comm/comm_request.c. This scheduler might be useful for more than just communicator requests and could be moved to ompi/request if there is a demand. Otherwise I will leave it where it is. Added a non-blocking version of ompi_comm_set to support ompi_comm_idup. The call makes a recursive call to comm_dup and a non-blocking version was needed. To simplify the code the blocking version calls the nonblocking version and waits on the resulting request if one exists. cmr=v1.7.4:reviewer=jsquyres:ticket=trac:3796 This commit was SVN r29334. The following Trac tickets were found above: Ticket 3796 --> https://svn.open-mpi.org/trac/ompi/ticket/3796	2013-10-02 14:26:40 +00:00
George Bosilca	2aed876be6	Fix the bug where MPI_is_thread_main returns true for all threads when not in MPI_THREAD_MULTIPLE mode. Thanks to Lisandro Dalcin for pointing it out. This commit was SVN r29113.	2013-09-04 11:10:51 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. * the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. * reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00

1 2 3 4 5

243 Коммитов