openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	9fe8153d38	Sync to IOF branch and continue fix of request for job info from unknown nspace Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit 02400d30d79ce3c7e7e28f9a08f7062a5b6f4c51)	2018-02-03 19:56:35 -08:00
Ralph Castain	93cf3c7203	Update OPAL and ORTE for thread safety (I swear, if I look this over one more time, I'll puke) Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-06 12:30:57 -07:00
Ralph Castain	db8943cedd	Provide further (hopefully) helpful messages about the hotel size Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-05 04:27:32 -07:00
Ralph Castain	734b90aa6b	Adjust the timeout for direct modex requests to reflect the size of the job. It can take several seconds to start all the procs, and we don't want to timeout due to differences in start times of the various procs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-04 18:20:51 -07:00
Gilles Gouaillardet	6b90b03c28	orted/pmix: plug a memoy leak in pmix_server_fencenb_fn() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:45 +09:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Ralph Castain	03eb1a80bf	Update the PMIx native component to release v1.1.1, with addition of one bug-fix commit beyond the official release Rename the pmix1xx component to pmix111 so it reflects the actual release it includes Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable. Update the PMIx code and continue attempting to debug direct modex Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node. Update PMIx to v1.1.2	2015-12-12 18:46:38 -08:00
Ralph Castain	267ca8fcd3	Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now to continue current default behavior. Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_ modex is used.	2015-10-27 17:31:56 -07:00
Ralph Castain	8f6855459d	Cleanup some coverity warnings	2015-09-30 10:33:53 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00

10 Коммитов