Commit graph

15 Commits

Author SHA1 Message Date
Ralph Castain
60a7bc2e50 Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
Fixes #1225
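To make the registration flow concrete, here is a minimal, self-contained sketch of the callback pattern the message describes: a component records an error handler at registration time and invokes it when the runtime reports an event (for example, to trigger debugger attach). All names here (example_register_errhandler, example_notification_fn_t) are hypothetical stand-ins for illustration, not the actual PMIx 1.x or OMPI entry points.

    /* Hypothetical sketch of error-notification registration; not the real PMIx/OMPI API. */
    #include <stdio.h>
    #include <stddef.h>

    typedef int example_status_t;

    /* Callback invoked when the runtime reports an error or event. */
    typedef void (*example_notification_fn_t)(example_status_t status, void *cbdata);

    static example_notification_fn_t registered_handler = NULL;
    static void *registered_cbdata = NULL;

    /* Components that support notification record the handler; components that
     * do not (per the commit message) simply ignore the registration request. */
    static void example_register_errhandler(example_notification_fn_t fn, void *cbdata)
    {
        registered_handler = fn;
        registered_cbdata = cbdata;
    }

    static void my_error_handler(example_status_t status, void *cbdata)
    {
        (void)cbdata;
        fprintf(stderr, "runtime reported event %d - attach logic could run here\n", status);
    }

    int main(void)
    {
        example_register_errhandler(my_error_handler, NULL);
        /* Simulate the runtime delivering a notification. */
        if (registered_handler) {
            registered_handler(-1, registered_cbdata);
        }
        return 0;
    }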
2016-02-18 09:29:12 -08:00
Ralph Castain
810f2446b7 Add pmix120 component, update the error handling functions in the PMIx API.
Update the configure logic for the new pmix120 component

ckpt

Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates

Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works

Cleanup the rename files to use the pretty macros
2015-12-28 23:15:44 +09:00
Ralph Castain
03eb1a80bf Update the PMIx native component to release v1.1.1, with addition of one bug-fix commit beyond the official release
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes

Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.

Update the PMIx code and continue attempting to debug direct modex

Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.

Update PMIx to v1.1.2
2015-12-12 18:46:38 -08:00
Ralph Castain
e6add86e4f Deal with connect/accept between two jobs from different mpirun's. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues. 2015-09-07 09:19:24 -07:00
Ralph Castain
f6948c2bb4 Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working 2015-09-04 10:07:17 -07:00
Ralph Castain
a772b46c15 Bring the MPI_Publish and friends online 2015-09-02 12:04:07 -07:00
Ralph Castain
cf6137b530 Integrate PMIx 1.0 with OMPI.
Bring Slurm PMI-1 component online
Bring the s2 component online

Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.

Bring the OMPI pubsub/pmi component online

Get comm_spawn working again

Ensure we always provide a cpuset, even if it is NULL

pmix/cray: adjust cray pmix component for pmix

Make changes so cray pmix can work within the integrated ompi/pmix framework.

Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet

Cleanup comm_spawn - procs now starting, error in connect_accept

Complete integration
2015-08-29 16:04:10 -07:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
1b24536941 Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail. 2015-03-25 13:22:01 -07:00
Ralph Castain
2581b41d08 Continue refactoring code by splitting the msg processing from the sendrecv code 2014-12-17 19:57:14 -08:00
Ralph Castain
f489e871c2 Take first step towards refactoring the PMIx server code by splitting out the proc_map function into its own file. Update ignore to include .DS_Store from the Mac 2014-12-17 19:08:52 -08:00
Ralph Castain
780c93ee57 Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
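As a rough sketch of the layout described above (field names are assumed here for illustration; the actual definition lives in the OPAL headers), the point is that a process name is just two uint32_t fields, so the same value can be passed between layers without per-call memcpy conversions:

    /* Illustrative two-field process name; field names (jobid, vpid) are assumed. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t jobid;   /* which job the process belongs to */
        uint32_t vpid;    /* rank of the process within that job */
    } example_process_name_t;

    int main(void)
    {
        example_process_name_t name = { .jobid = 42, .vpid = 3 };

        /* Because every layer uses the same two-field layout, handing a name
         * from one layer to another is a plain assignment rather than a
         * memcpy between differing representations. */
        example_process_name_t opal_view = name;

        printf("job %u, vpid %u\n", opal_view.jobid, opal_view.vpid);
        return 0;
    }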
2014-11-11 17:00:42 -08:00
Elena
03fc809bc9 This commit contains a new dstore component, sm, which is used for communication between the pmix server and clients on the same node via shared memory. 2014-11-06 16:01:19 +02:00
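For readers unfamiliar with the mechanism the commit above relies on, the following is an illustrative-only sketch of same-node data sharing via POSIX shared memory: one side creates and writes a segment, the other maps the same segment and reads it without any message exchange. The segment name and layout are made up for the example; this is not the actual dstore/sm code.

    /* Illustrative POSIX shared-memory exchange; link with -lrt on older glibc. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SEG_NAME "/example_dstore_sm"
    #define SEG_SIZE 4096

    int main(void)
    {
        /* "Server" side: create and size the segment, then publish data. */
        int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, SEG_SIZE) != 0) {
            perror("shm setup");
            return 1;
        }
        char *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        snprintf(seg, SEG_SIZE, "key=value");

        /* "Client" side (same node): map the same segment read-only and read
         * the published data directly. */
        char *client_view = mmap(NULL, SEG_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        if (client_view == MAP_FAILED) {
            perror("client mmap");
            return 1;
        }
        printf("client read: %s\n", client_view);

        munmap(client_view, SEG_SIZE);
        munmap(seg, SEG_SIZE);
        close(fd);
        shm_unlink(SEG_NAME);
        return 0;
    }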
Ralph Castain
2bfb18e004 Resolve some race conditions when async pmix modex modes are invoked. Since calls to "get" data can come both locally and remotely before data for a given proc has actually been received, we have to track all requests that cannot be immediately fulfilled and provide the data once it has been received.
This commit was SVN r32664.
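A minimal sketch of the bookkeeping described above, with hypothetical types: "get" requests that arrive (locally or remotely) before the target proc's data has been received are parked on a pending list, and every tracked request is answered once the data shows up.

    /* Hypothetical pending-request tracking; not the actual OMPI code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef void (*reply_fn_t)(const char *data);

    typedef struct pending_req {
        char proc[32];            /* proc whose data was requested */
        reply_fn_t reply;         /* how to answer the requester */
        struct pending_req *next;
    } pending_req_t;

    static pending_req_t *pending = NULL;

    /* Called when a get request cannot be fulfilled immediately. */
    static void track_request(const char *proc, reply_fn_t reply)
    {
        pending_req_t *r = malloc(sizeof(*r));
        snprintf(r->proc, sizeof(r->proc), "%s", proc);
        r->reply = reply;
        r->next = pending;
        pending = r;
    }

    /* Called when data for a proc finally arrives: answer every tracked
     * request for that proc, local or remote. */
    static void data_arrived(const char *proc, const char *data)
    {
        pending_req_t **cur = &pending;
        while (*cur) {
            if (0 == strcmp((*cur)->proc, proc)) {
                pending_req_t *done = *cur;
                done->reply(data);
                *cur = done->next;
                free(done);
            } else {
                cur = &(*cur)->next;
            }
        }
    }

    static void print_reply(const char *data) { printf("reply: %s\n", data); }

    int main(void)
    {
        track_request("job1.rank0", print_reply);   /* local request */
        track_request("job1.rank0", print_reply);   /* overlapping remote request */
        data_arrived("job1.rank0", "endpoint-info");
        return 0;
    }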
2014-09-02 20:04:17 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “pmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint (see the sketch after this list).

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
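As referenced in the signature item above, here is a sketch of the idea of keying a collective by the sorted set of participating proc names rather than by a globally assigned id. The types are illustrative only, not the actual ORTE definitions.

    /* Illustrative collective "signature" keyed by participants. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        uint32_t jobid;
        uint32_t vpid;
    } example_name_t;

    typedef struct {
        example_name_t *procs;   /* participants, kept in sorted order */
        size_t nprocs;
    } example_signature_t;

    static int cmp_name(const void *a, const void *b)
    {
        const example_name_t *x = a, *y = b;
        if (x->jobid != y->jobid) return x->jobid < y->jobid ? -1 : 1;
        if (x->vpid != y->vpid)   return x->vpid < y->vpid ? -1 : 1;
        return 0;
    }

    /* Two collectives match only if they involve exactly the same procs, so
     * any number of collectives over different groups can be in flight, while
     * only one can be active for a given combination of procs. */
    static int same_signature(const example_signature_t *a, const example_signature_t *b)
    {
        return a->nprocs == b->nprocs &&
               0 == memcmp(a->procs, b->procs, a->nprocs * sizeof(example_name_t));
    }

    int main(void)
    {
        example_name_t g1[] = { {1, 0}, {1, 1} };
        example_name_t g2[] = { {1, 1}, {1, 2} };
        qsort(g1, 2, sizeof(*g1), cmp_name);
        qsort(g2, 2, sizeof(*g2), cmp_name);

        example_signature_t s1 = { g1, 2 }, s2 = { g2, 2 };
        printf("same group? %s\n", same_signature(&s1, &s2) ? "yes" : "no");
        return 0;
    }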

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00