openmpi/opal/mca
Ralph Castain 9fc3079ac2 Implement a background fence that collects all data during modex operation
The direct modex operation is slow, especially at scale, even for modestly connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kick off the modex operation in the background, and then trap any modex_recv calls until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options, as they are likely to perform other (non-communicating) setup operations after MPI_Init, so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called.

Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always out-perform the time spent doing a modex. However, this provides a decent compromise in the interim.

This PR changes the default settings of a few relevant params to make "background modex" the default behavior:

* pmix_base_async_modex -> defaults to true

* pmix_base_collect_data -> continues to default to true (no change)

* async_mpi_init -> defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything.

The logic in MPI_Init is:

* if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation

* if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed

* if async_modex is not set, then we block until the fence completes (regardless of collecting data or not)

* if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation.

* if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case.
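
The decision rules above can be sketched as two small pure functions. This is an illustrative sketch only, not the code in this commit: the enum and function names are hypothetical, the missing-non-blocking-fence rule is assumed to take priority over the others (the "always" in that bullet suggests so), and the behavior when no end-of-init barrier was requested is an assumption, since the list above does not cover it.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not Open MPI's actual identifiers. */
typedef enum {
    MODEX_BLOCKING,    /* block in MPI_Init until the fence completes */
    MODEX_BACKGROUND,  /* start a non-blocking fence and keep going */
    MODEX_SKIPPED      /* perform no fence at all */
} modex_mode_t;

typedef enum {
    END_NONE,            /* nothing extra at the end of MPI_Init */
    END_BARRIER,         /* modex already done: execute the requested barrier */
    END_WAIT_FOR_MODEX   /* modex still running: wait for it, skip the barrier */
} end_action_t;

/* Choose how the modex is performed, following the bullets above. */
static modex_mode_t choose_modex_mode(bool async_modex,
                                      bool collect_data,
                                      bool have_nonblocking_fence)
{
    /* Assumption: lack of a non-blocking fence overrides everything else. */
    if (!have_nonblocking_fence) {
        return MODEX_BLOCKING;
    }
    /* async_modex not set: block, regardless of collect_data. */
    if (!async_modex) {
        return MODEX_BLOCKING;
    }
    /* async_modex set but no data collection: skip the modex entirely. */
    if (!collect_data) {
        return MODEX_SKIPPED;
    }
    /* async_modex and collect_data both set: run the modex in the background. */
    return MODEX_BACKGROUND;
}

/* Choose the end-of-MPI_Init action; applies to the background-modex case only. */
static end_action_t choose_end_action(bool barrier_requested, bool modex_done)
{
    if (!barrier_requested) {
        return END_NONE;  /* assumption: not described in the list above */
    }
    /* Either way only one synchronization point is executed, never two. */
    return modex_done ? END_BARRIER : END_WAIT_FOR_MODEX;
}
```

With the new defaults (async_modex and collect_data both true) and a non-blocking fence available, choose_modex_mode returns MODEX_BACKGROUND; setting async_modex to false returns MODEX_BLOCKING whatever collect_data is.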

HTH
Ralph

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-21 10:29:23 -07:00
..
allocator opal: rework mpool and rcache frameworks 2016-03-14 10:50:41 -06:00
backtrace stacktrace: Add flexibility in stacktrace output 2017-01-26 11:55:32 -06:00
base Additional mpirun --help changes 2017-04-19 11:43:45 -06:00
btl usnic: ensure to set the iov_limit to 1 2017-04-20 13:28:15 -07:00
common btl/ugni: improve multi-threaded performance 2017-03-13 14:46:06 -06:00
compress mca/base: add priority output to mca_base_select 2015-10-19 12:32:41 -06:00
crs configury: test portability 2015-12-28 13:58:45 +09:00
dl dl/dlopen: add libs to wrapper LIBS 2017-04-15 09:30:18 -07:00
event libevent/external: Add opal_event_include to this component 2017-01-24 16:03:09 -06:00
hwloc hwloc: add CUDA include dir to CPPFLAGS 2017-04-05 11:46:22 +09:00
if opal/if: open the if framework once in opal_init_util 2016-12-01 14:24:30 +09:00
installdirs Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
memchecker configury: test portability 2015-12-28 13:58:45 +09:00
memcpy Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
memory memory/patcher: fix a compiler warning 2017-03-16 05:43:51 -07:00
mpool mpool/memkind: add a missing include file 2017-02-08 16:06:22 +09:00
patcher asm: rename the AMD64 into X86_64 2017-02-27 15:10:50 +09:00
pmix Implement a background fence that collects all data during modex operation 2017-04-21 10:29:23 -07:00
pstat powerpc: Add support for powerpcle in timer/pstat. 2016-11-07 02:35:44 -05:00
rcache rcache/base: do not free memory with the vma lock held 2017-02-21 21:04:46 -07:00
reachable configury: check libnl version and abort in case of conflict 2016-10-25 09:23:59 +09:00
shmem opal: standardize on max hostname length 2016-04-24 08:19:47 +02:00
timer timer/linux: rename component-specific functions 2017-03-15 21:03:13 -05:00
Makefile.am Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
mca.h ompi/hook: Add the hook/license framework 2017-02-27 12:05:53 -05:00