openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	9fc3079ac2	Implement a background fence that collects all data during modex operation The direct modex operation is slow, especially at scale for even modestly-connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kickoff the modex operation in the background, and then trap any modex_recv's until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called. Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always out-perform the time spent doing a modex. However, this provides a decent compromise in the interim. This PR changes the default settings of a few relevant params to make "background modex" the default behavior: * pmix_base_async_modex -> defaults to true * pmix_base_collect_data -> continues to default to true (no change) * async_mpi_init - defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything. The logic in MPI_Init is: * if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation * if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed * if async_modex is not set, then we block until the fence completes (regardless of collecting data or not) * if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation. * if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case. HTH Ralph Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-21 10:29:23 -07:00
Ralph Castain	c86f71376a	Increase fine grain of timing info Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-20 00:17:40 -07:00
Boris Karasev	36a0e71f2d	ompi/timings: preparing to production state Adds: - enabling/disabling of timings throught environment variable `OMPI_TIMING_ENABLE` - output format: [file name]:[function name]:[description]: avg/min/max - dynamically extending array of results for case then inited size was exhausted - catch and collect errors - cleanup Note: For use feature need to configure with `--enable-timings` and set env `OMPI_TIMING_ENABLE = 1` Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2017-04-07 21:16:57 +06:00
Artem Polyakov	e3acf2a339	ompi/timings: add OMPI-level timing framework. This is an extension of OPAL timing framework that allows to use MPI_reduce to provide the compact representation of the collected timings throughout the whole application. NOTE: the functionality is disabled now, it will be enabled after the runtime verification. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-04-07 21:16:22 +06:00
Artem Polyakov	1063c0d567	opal/timing: remove timings from MPI_Init and MPI_Finalize Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-04-07 21:16:21 +06:00
Joshua Hursey	c10bbfded6	ompi/hook: Add the hook/license framework * Include a 'demo' component that shows some of the features. * Currently has hooks for: - MPI_Initialized - top, bottom - MPI_Init_thread - top, bottom - MPI_Finalized - top, bottom - MPI_Init - top (pre-opal_init), top (post-opal_init), error, bottom - MPI_Finalize - top, bottom * Other places in ompi can 'register' to hook into any one of these places by passing back a component structure filled with function pointers. * Add a `MCA_BASE_COMPONENT_FLAG_REQUIRED` flag to the MCA structure that is checked by the `hook` framework. If a required, static component has been excluded then the `hook` framework will fail to initialize. - See note in `opal/mca/mca.h` as to why this is checked in the `hook` framework and not in `opal/mca/base/mca_base_component_find.c` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-02-27 12:05:53 -05:00
Ralph Castain	a7b8190fdc	Per f2f meeting: if async modex is given, default to no MPI init barrier, letting the user override that if desired. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-25 10:13:53 -08:00
Ralph Castain	fe68f23099	Only instantiate the HWLOC topology in an MPI process if it actually will be used. There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced: * shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string instead of the topology itself, if available, thus avoiding instantiating the topology * openib BTL. This uses the distance matrix. At present, I haven't developed a method for replacing that reference. Thus, this component will instantiate the topology * usnic BTL. Uses the distance matrix. * treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate the topology * ess base functions. If a process is direct launched and not bound at launch, this code attempts to bind it. Thus, procs in this scenario will instantiate the topology Note that instantiating the topology on complex chips such as KNL can consume megabytes of memory. Fix pernode binding policy Properly handle the unbound case Correct pointer usage Do not free static error messages! Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-29 10:33:29 -08:00
Ralph Castain	114e20ad66	Never collect data when doing the fence at the end of MPI_Init Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-29 08:31:35 -08:00
Artem Polyakov	06a73da5ea	ompi/init: always lazy-wait in ompi_mpi_init According to discussion in #2181 we don't need MCA parameter any more. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2016-11-12 07:54:48 +07:00
Gilles Gouaillardet	11dc86f26b	cleanup: always #include <pthread.h> pthreads are now mandatory, so there is no more need to Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-08 13:07:45 +09:00
Gilles Gouaillardet	efac15e9a1	ompi: use opal_setenv instead of putenv this fixes a memory leak at finalize Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-28 09:32:30 +09:00
Artem Polyakov	08618845a4	ompi/mpi_init: fix barrier Relax CPU usage pressure from the application processes when doing modex and barrier in ompi_mpi_init. We see significant latencies in SLURM/pmix plugin barrier progress because app processes are aggressively call opal_progress pushing away daemon process doing collective progress.	2016-09-27 07:28:52 +03:00
Gilles Gouaillardet	252fadf099	ompi: fix #if vs #ifdef HAVE___MALLOC_INITIALIZE_HOOK usage	2016-07-20 13:18:11 +09:00
rhc54	702a982271	Merge pull request #1767 from rhc54/topic/pmix2 Enable the PMIx event notification capability	2016-06-16 15:27:43 -07:00
Nathan Hjelm	7018aeda2b	opal/memory: disable __malloc_initialize_hook if poisoned Newer versions of gcc have "poisoned" the __malloc_initialize_hook name and it can no longer be used. Added a configure check and protection around its usage. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-06-15 12:00:49 -06:00
Ralph Castain	5d330d5220	Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler. Add PMIx 2.0 Remove PMIx 1.1.4 Cleanup copying of component Add missing file Touchup a typo in the Makefile.am Update the pmix ext114 component Minor cleanups and resync to master Update to latest PMIx 2.x Update to the PMIx event notification branch latest changes	2016-06-14 13:08:41 -07:00
Ralph Castain	2c086e56be	Add an experimental ability to skip the RTE barriers at the end of MPI_Init and the beginning of MPI_Finalize	2016-06-01 17:01:15 -07:00
Ralph Castain	01ba861f2a	When direct launching applications, we must allow the MPI layer to progress during RTE-level barriers. Neither SLURM nor Cray provide non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread sping in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize cpu utilization. Update external as well Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro	2016-05-14 16:37:00 -07:00
Nathan Hjelm	1e4daa2a0e	mpi_init: move opal_set_using_threads() earlier in MPI_Init() There is a potential race condition in MPI_Init() where an orte even thread could be in a function that uses OPAL_THREAD_LOCK / OPAL_THREAD_UNLOCK when ompi_mpi_init calls opal_set_using_threads(). Closes open-mpi/ompi#1586 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-04-26 13:02:42 -06:00
Nathan Hjelm	ae0ffbb67f	Merge pull request #1397 from hjelmn/enable_thread_multiple ompi: always enable MPI_THREAD_MULTIPLE support	2016-04-23 08:40:22 -06:00
Nathan Hjelm	b4e5b5c09e	Merge pull request #1531 from hjelmn/bml bml: always enable the bml	2016-04-14 10:22:33 -06:00
Nathan Hjelm	11e2d7886e	opal/memory: update component structure This commit makes it possible to set relative priorities for components. Before the addition of the patched component there was only one component that would run on any system but that is no longer the case. When determining which component to open each component's query function is called and the one that returns the highest priority is opened. The default priority of the patcher component is set slightly higher than the old ptmalloc2/ummunotify component. This commit fixes a long-standing break in the abstration of the memory components. ompi_mpi_init.c was referencing the linux malloc hook initilize function to ensure the hooks are initialized for libmpi.so. The abstraction break has been fixed by adding a memory base function that calls the open memory component's malloc hook init function if it has one. The code is not yet complete but is intended to support ptmalloc in 2.0.0. In that case the base function will always call the ptmalloc hook init if exists. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-04-13 17:14:51 -06:00
Nathan Hjelm	c6b19818be	bml: always enable the bml This commit ensures the bml is always enabled whether or not it will be used. This ensures that any available btls communicate their modex so that they can be used for one-sided communication. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-04-08 21:14:17 -06:00
Nathan Hjelm	d4afb16f5a	opal: rework mpool and rcache frameworks This commit rewrites both the mpool and rcache frameworks. Summary of changes: - Before this change a significant portion of the rcache functionality lived in mpool components. This meant that it was impossible to add a new memory pool to use with rdma networks (ugni, openib, etc) without duplicating the functionality of an existing mpool component. All the registration functionality has been removed from the mpool and placed in the rcache framework. - All registration cache mpools components (udreg, grdma, gpusm, rgpusm) have been changed to rcache components. rcaches are allocated and released in the same way mpool components were. - It is now valid to pass NULL as the resources argument when creating an rcache. At this time the gpusm and rgpusm components support this. All other rcache components require non-NULL resources. - A new mpool component has been added: hugepage. This component supports huge page allocations on linux. - Memory pools are now allocated using "hints". Each mpool component is queried with the hints and returns a priority. The current hints supported are NULL (uses posix_memalign/malloc), page_size=x (huge page mpool), and mpool=x. - The sm mpool has been moved to common/sm. This reflects that the sm mpool is specialized and not meant for any general allocations. This mpool may be moved back into the mpool framework if there is any objection. - The opal_free_list_init arguments have been updated. The unused0 argument is not used to pass in the registration cache module. The mpool registration flags are now rcache registration flags. - All components have been updated to make use of the new framework interfaces. As this commit makes significant changes to both the mpool and rcache frameworks both versions have been bumped to 3.0.0. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-03-14 10:50:41 -06:00
Gilles Gouaillardet	d8482ce6f4	opal/mca/memory: add a memoryc_set_alignment subroutine to the OPAL memory MCA this commit also (partially) reverts : - open-mpi/ompi@7de01b347c - open-mpi/ompi@8b05f308f9	2016-02-24 09:50:12 +09:00
Nathan Hjelm	230d04327e	ompi: always enable MPI_THREAD_MULTIPLE support This commit removes the --with-mpi-thread-multiple option and forces MPI_THREAD_MULTIPLE support. This cleans up an abstration violation in opal where OMPI_ENABLE_THREAD_MULTIPLE determines whether the opal_using_threads is meaningful. To reduce the performance hit on MPI_THREAD_SINGLE programs an OPAL_UNLIKELY is used for the check on opal_using_threads in OPAL_THREAD_* macros. This commit does not clean up the arguments to the various functions that take whether muti-threading support is enabled. That should be done at a later time. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-23 10:02:14 -07:00
Gilles Gouaillardet	7de01b347c	ompi/init: fix abstraction violation This fixes open-mpi/ompi@8b05f308f9 libmpi.so cannot be built (unresolved symbols) with configure'd with --disable-mem-debug --disable-mem-profile --disable-memchecker --without-memory-manager	2016-02-16 16:39:21 +09:00
Igor Ivanov	8b05f308f9	opal/memory: Move Memory Allocation Hooks usage from openib These changes fix issue https://github.com/open-mpi/ompi/issues/1336 - improve abstractions: opal/memory/linux component should be single place that opeartes with Memory Allocation Hooks. - avoid collisions in case dynamic component open/close: it is safe because it is linked statically. - does not change original behaivour.	2016-02-11 14:46:35 +02:00
Ralph Castain	810f2446b7	Add pmix120 component, update the error handling functions in the PMIx API. Update the configure logic for the new pmix120 component ckpt Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works Cleanup the rename files to use the pretty macros	2015-12-28 23:15:44 +09:00
Ralph Castain	267ca8fcd3	Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now to continue current default behavior. Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_ modex is used.	2015-10-27 17:31:56 -07:00
Jeff Squyres	f5ad90c920	init/finalize: extensions Proposed extensions for Open MPI: - If MPI_INITLIZED is invoked and MPI is only partially initialized, wait until MPI is fully initialized before returning. - If MPI_FINALIZED is invoked and MPI is only partially finalized, wait until MPI is fully finalized before returning. - If the ompi_mpix_allow_multi_init MCA param is true, allow MPI_INIT and MPI_INIT_THREAD to be invoked multiple times without error (MPI will be safely initialized only the first time it is invoked).	2015-10-15 12:39:15 -04:00
Mike Dubman	5bebed45eb	OMPI: set "in finalize" indicator in finalize flow	2015-10-04 09:39:37 +03:00
Nathan Hjelm	54a4061d88	Add support for detecting when dynamic add_procs is not possible This commit adds support to the pml, mtl, and btl frameworks for components to indicate at runtime that they do not support the new dynamic add_procs behavior. At the high end the lack of dynamic add_procs support is signalled by the pml using the new pml_flags member to the pml module structure. If the MCA_PML_BASE_FLAG_REQUIRE_WORLD flag is set MPI_Init will generate the ompi_proc_t array passed to add_proc from ompi_proc_world () instead of ompi_proc_get_allocated (). Both cm and ob1 have been updated to detect if the underlying mtl and btl components support dynamic add_procs. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-23 16:22:05 -06:00
Nathan Hjelm	2c89c7f47d	ompi/proc: add function to get all allocated procs This commit adds two new functions: - ompi_proc_get_allocated - Returns all procs in the current job that have already been allocated. This is used in init/finalize to determine which procs to pass to add_procs/del_procs. - ompi_proc_world_size - returns the number of processes in MPI_COMM_WORLD. This may be removed in favor of callers just looking at ompi_process_info. The behavior of ompi_proc_world has been restored to return ompi_proc_t's for all processes in the current job. The use of this function is discouraged. Code that was using ompi_proc_world() has been updated to make use of the new functions to avoid the memory overhead of ompi_comm_world (). Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-23 16:22:05 -06:00
Nathan Hjelm	408da16d50	ompi/proc: add proc hash table for ompi_proc_t objects This commit adds an opal hash table to keep track of mapping between process identifiers and ompi_proc_t's. This hash table is used by the ompi_proc_by_name() function to lookup (in O(1) time) a given process. This can be used by a BTL or other component to get a ompi_proc_t when handling an incoming message from an as yet unknown peer. Additionally, this commit adds a new MCA variable to control the new add_procs behavior: mpi_add_procs_cutoff. If the number of ranks in the process falls below the threshold a ompi_proc_t is created for every process. If the number of ranks is above the threshold then a ompi_proc_t is only created for the local rank. The code needed to generate additional ompi_proc_t's for a communicator is not yet complete. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:54 -06:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Gilles Gouaillardet	9c77c6b66d	fortran: fix f08 bindings only define the unique fortran symbol depending on - CAPS - PLAIN - SINGLE_UNDERSCORE - DOUBLE_UNDERSCORE and bind the f08 symbol to the uniquely defined C symbol. Use real data structures to make the code simpler. (perl script written by Jeff)	2015-07-27 16:28:57 +09:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Gilles Gouaillardet	9d56b85b55	initialize common symbols from ompi	2015-05-08 10:11:58 +09:00
Nadezhda Kogteva	116169c38a	opal timing: added ability to choose the timer type	2015-04-17 11:15:55 +03:00
Nathan Hjelm	a7b0c00ab6	fix memory leaks and valgrind errors This commit fixes several vagrind errors. Included: - installdirs did not correctly reinitialize all pointers to NULL at close. This causes valgrind errors on a subsequent call to opal_init_tool. - several opal strings were leaked by opal_deregister_params which was setting them to NULL instead of letting them be freed by the MCA variable system. - move opal_net_init to AFTER the variable system is initialized and opal's MCA variables have been registered. opal_net_init uses a variable registered by opal_register_params! - do not leak ompi_mpi_main_thread when it is allocated by MPI_T_init_thread. - do not overwrite ompi_mpi_main_thread if it is already set (by MPI_T_init_thread). - mca_base_var: read_files was overwritting mca_base_var_file_list even if it was non-NULL. - mca_base_var: set all file global variables to initial states on finalize. - btl/vader: decrement enumerator reference count to ensure that it is freed. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-11 09:28:35 -06:00
Artem Polyakov	01601f3284	Merge pull request #305 from artpol84/timing Timing framework improvement	2014-12-16 15:13:48 +06:00
George Bosilca	3430714989	Correctly propagate the requested level of thread support during the component init calls.	2014-12-13 02:36:21 -05:00
Artem Polyakov	8ffad75a0a	Introduce timing interval measurement facility in timing framework	2014-12-10 16:47:49 +06:00
Ralph Castain	06e49d0e92	Per contribution from Pascal Deveze of Bull: move opal_set_using_threads earlier in MPI_Init (before datatype init) so the value gets set in time to be properly used.	2014-12-09 00:37:57 -08:00
Ralph Castain	780c93ee57	Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL. We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.	2014-11-11 17:00:42 -08:00
Ralph Castain	fd6a044b7f	Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages. Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.	2014-10-03 16:02:57 -06:00
Ralph Castain	dfb952fa78	[Contribution from Artem - moved it to svn from git for him] Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup. This commit was SVN r32738.	2014-09-15 18:00:46 +00:00

1 2 3 4 5 ...

268 Коммитов