openmpi

Author	SHA1	Message	Date
Ralph Castain	8b1f01dfe6	Set the default modex parameters back to full blocking modex while we continue to test and debug the slow modex - it seems to be having issues on the Cray Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-22 15:19:46 -07:00
Ralph Castain	9fc3079ac2	Implement a background fence that collects all data during modex operation The direct modex operation is slow, especially at scale for even modestly-connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kickoff the modex operation in the background, and then trap any modex_recv's until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called. Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always out-perform the time spent doing a modex. However, this provides a decent compromise in the interim. This PR changes the default settings of a few relevant params to make "background modex" the default behavior: * pmix_base_async_modex -> defaults to true * pmix_base_collect_data -> continues to default to true (no change) * async_mpi_init - defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything. The logic in MPI_Init is: * if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation * if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed * if async_modex is not set, then we block until the fence completes (regardless of collecting data or not) * if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation. * if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case. HTH Ralph Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-21 10:29:23 -07:00
Ralph Castain	ffbfd22d84	Fix event registration - need to increment the event index and record the number of codes in the event handler Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-13 17:35:10 -07:00
Ralph Castain	9f73974fe1	Update to latest PMIx master, including disabling the pmi-1 and pmi-2 backward compatibility as these interfere with the s1,s2 components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-12 12:34:27 -07:00
Ralph Castain	95ae0d1df3	Cleanup timing macros for portability across compilers. Rename the --enable-timing configure option to be --enable-pmix-timing so it doesn't pickup external timing requests. Remove a stale function reference in PMIx so it can compile with timing enabled. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-10 12:56:38 +06:00
Ralph Castain	92c996487c	Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well. Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap Get the DVM running again Fix direct modex by eliminating race condition caused by releasing data while sending it Up the size limit before compressing Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 19:25:15 -07:00
Ralph Castain	2cc5fea8be	Update to PMIx v2.0alpha Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 10:02:29 -07:00
Ralph Castain	7dd34d0c9a	Use the correct callback data - the callback function was expecting a bool, not a pmix_ptl_sr_t. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-28 17:21:47 -07:00
Ralph Castain	35f817911e	Fix coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 08:09:46 -07:00
Ralph Castain	c0bcd11bcf	Fix permissions - no CI required Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-23 08:05:52 -07:00
Ralph Castain	55e4fba5f5	If we lose connection to the server after initiating a send/recv in PMIx (e.g., in PMIx_Abort), then we need to "resolve" all pending recvs to avoid hanging. Fixes #3225 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-23 02:53:21 -07:00
Ralph Castain	d645557fa0	Update to include the PMIx 2.0 APIs for monitoring and job control. Include required integration, but leave the monitors off for now. Move the sensor framework out of ORTE as it is being absorbed into PMIx Fix typo and silence warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 17:47:08 -07:00
Ralph Castain	4b6d220a83	You cannot include both pmi.h and pmi2.h as they have conflicting defines in them. Thanks to Kilian Cavalotti for pointing it out Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-19 11:53:54 -07:00
Ralph Castain	c6bc3ccb76	Sync to latest PMIx master and PMIx reference server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 12:50:38 -08:00
Ralph Castain	aca7091114	Fix some minor compatibility issues by ensuring job-level data gets stored against wildcard rank in the cray, s1, and s2 components, and that the ext1 component translates all wildcard rank requests into the peer's rank since v1.x of PMIx doesn't understand wildcard ranks Closes #3101 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-05 10:30:59 -08:00
Ralph Castain	1de72ff023	Silence an unnecessary error log Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-02 17:18:34 -08:00
Ralph Castain	e86a0dbf39	Update to PMIx master to include dlopen fixes and addition of libltdl support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-22 11:54:33 -08:00
Ralph Castain	8cffdcf127	Ensure that the pmix headers and lib get installed when --with-devel-headers is given so that PMIx applications can be built and executed against the "embedded" PMIx version Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-21 13:46:46 -08:00
Gilles Gouaillardet	bb2481a84b	pmix2x: synchronize to the latest PMIx master pmix/master@f57d9b2953 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-20 10:45:17 +09:00
Ralph Castain	f49118eaab	Fix some pmix configuration code Remove stale file reference that caused a check to always fail. Update psm2 function check to new libs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-16 10:54:47 -08:00
Howard Pritchard	b272f87926	Merge pull request #2968 from hjelmn/pmix_cray pmix/cray: performance improvements and cleanup	2017-02-16 11:41:59 -07:00
Ralph Castain	201f8571ca	Ensure we retain the peer object until we are done with it, then detect that the socket has closed due to a lost connection and cleanly release the message event Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 18:30:55 -08:00
Ralph Castain	9cd7349d7c	Instead of completely free'ing the event base, pause the PMIx progress thread before tearing down the infrastructure, and then release the event base at the end of the procedure. This allows any infrastructure objects holding events to delete them prior to free'ing the event base. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 05:02:43 -08:00
Ralph Castain	f7fe2f7189	Merge pull request #2977 from rhc54/topic/spawn Fix comm_spawn by registering nspace info only when needed	2017-02-15 04:31:54 -08:00
Ralph Castain	68b53e2179	Fix comm_spawn by registering nspace info only when needed - either when we have local procs, or when job-level info is required by connecting jobs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 19:47:56 -08:00
Ralph Castain	0c8609ca16	Update to newest PMIx master (includes configuration cleanups). Silence trivial Coverity warning in hwloc base. Cleanup a race condition segfault during finalize by ensuring the PMIx progress thread is stopped prior to starting to tear down the messaging components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 15:14:00 -08:00
Nathan Hjelm	3b912ea2a7	pmix/cray: performance improvements and cleanup Do not use opal_output_verbose inside O(n) loops. This was causing us to make O(n) calls to snprintf which was greatly slowing launch at scale. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-02-14 11:13:10 -07:00
Ralph Castain	35578b4009	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-13 23:19:26 -08:00
Gilles Gouaillardet	7acef4833e	pmix2x: Update to latest PMIx master pmix/master@6ed27be839 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-08 13:23:27 +09:00
KAWASHIMA Takahiro	4b2eba34a6	Merge pull request #2933 from kawashima-fj/pr/dstore-config-desc pmix/pmix2x: Correct configure option description	2017-02-08 13:03:27 +09:00
Jeff Squyres	100b112d3c	pmix: fix zlib protection macro usage It's possible that we can have zlib.h but still not have zlib support. Use the correct macro to protect the usage of calling zlib functions. This fixes 32-bit MTT builds at Cisco (e.g., https://mtt.open-mpi.org/index.php?do_redir=2389). Submitted upstream to PMIX: https://github.com/pmix/master/pull/290 Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-07 05:52:32 -08:00
KAWASHIMA Takahiro	750406f67b	pmix/pmix2x: Correct configure option description `--enable-pmix-dstore` option was enabled by default in f4a5511. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2017-02-07 11:52:56 +09:00
Ralph Castain	edcfdf2365	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-31 08:01:37 -08:00
Gilles Gouaillardet	b078e57e73	pmix/ext1x: fix misc memory leaks in namespace registration Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:42 +09:00
Gilles Gouaillardet	f51fc293a2	ext1x/pmix1x_client: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:42 +09:00
Gilles Gouaillardet	022cca79ea	pmix/ext1x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:36 +09:00
Gilles Gouaillardet	f485d12a82	pmix: rename the ext11 component to ext1x; also use the same naming scheme as pmix/ext2x Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:35 +09:00
Gilles Gouaillardet	dccb1899e6	pmix/ext11: correctly use PMIx_server_register_nspace() PMIx_server_register_nspace() is an asynchronous operation, so the pmix glue waits for it to complete before returning. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 09:23:19 +09:00
Gilles Gouaillardet	6955e1e25c	pmix/ext11: fix compilation the argc field from the opal_pmix_app_t struct was removed, so adjust the pmix/ext11 glue accordingly. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 09:23:18 +09:00
Ralph Castain	3302864a7d	Cleanup a typo that can cause a segfault - use a local variable name different than the one passed into the function Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-27 16:49:25 -08:00
Gilles Gouaillardet	896434b1bd	pmix/ext2x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-26 14:07:15 +09:00
Gilles Gouaillardet	6b8e1c217c	pmix/ext2x: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-26 14:06:58 +09:00
Gilles Gouaillardet	142b95df87	pmix/ext2x: plug misc memory leaks regarding opal_pmix2x_event_chain_t handling Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-25 16:17:10 +09:00
Gilles Gouaillardet	7a3d39f079	pmix/ext2x: plug a memory leak in _reg_nspace() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-25 16:17:01 +09:00
Gilles Gouaillardet	189da7fdab	pmix2x: plug a memory leak in _event_hdlr() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:30 +09:00
Gilles Gouaillardet	acbc32d3b2	pmix2x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Gilles Gouaillardet	b5b21043c4	pmix2x: plug a memory leak in _reg_nspace() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Gilles Gouaillardet	0f47310a75	pmix2x/pmix2x_client: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Ralph Castain	8c960bae8d	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-23 07:07:40 -08:00
Ralph Castain	e568b211e4	Silence Coverity CID 1398541 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-10 15:30:50 -08:00

527 Commits
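
Note on commit 9fc3079ac2: the MPI_Init decision rules listed in that commit message can be read as a small decision table. The following is a minimal sketch of that reading in Python, not the Open MPI implementation (which is in C). The names `ModexPlan`, `choose_modex_plan`, and `final_sync_for_background_modex` are hypothetical and exist only for illustration. The sketch also assumes that the "no non-blocking fence" rule takes precedence over the "skip the modex" rule, which the message suggests ("always") but does not state explicitly.

```python
from enum import Enum


class ModexPlan(Enum):
    """Hypothetical labels for the three outcomes named in the commit message."""
    BACKGROUND = "background"  # non-blocking fence runs while MPI_Init continues
    SKIP = "skip"              # no fence is performed at all
    BLOCKING = "blocking"      # MPI_Init blocks until the fence completes


def choose_modex_plan(async_modex: bool, collect_data: bool,
                      have_nonblocking_fence: bool) -> ModexPlan:
    """Pick a modex strategy from the three inputs described in the message."""
    # Assumption: without a non-blocking fence (e.g., not using PMIx),
    # the full blocking modex wins over every other rule.
    if not have_nonblocking_fence:
        return ModexPlan.BLOCKING
    # async_modex unset: block until the fence completes, regardless of collect_data.
    if not async_modex:
        return ModexPlan.BLOCKING
    # async_modex and collect_data both set: run the modex in the background.
    if collect_data:
        return ModexPlan.BACKGROUND
    # async_modex set but collect_data false: skip the modex entirely.
    return ModexPlan.SKIP


def final_sync_for_background_modex(modex_done: bool) -> str:
    """End-of-MPI_Init step when a background modex ran and a barrier was requested.

    Exactly one synchronization happens, never two.
    """
    # Modex already finished: execute the requested barrier.
    # Otherwise: wait for the modex to complete and skip the extra barrier.
    return "barrier" if modex_done else "wait_for_modex"
```

With the defaults introduced by 9fc3079ac2 (async modex on, data collection on) and a non-blocking fence available, this sketch selects the background plan. Commit 8b1f01dfe6 then sets the defaults back to the full blocking modex.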