openmpi

Author	SHA1	Message	Date
Jeff Squyres	92bc8afd43	opal_progress_threads: fix double RELEASE If a thread failed to start, the tracker would be released twice. This commit fixes CID 1316020.	2015-08-12 05:11:40 -07:00
Jeff Squyres	99fa054507	opal_progress_threads: update to the API There are now four functions and one global constant: * opal_progress_thread_name: the name of the OPAL-wide async progress thread. If you have general purpose events that you need to run in a progress thread, but not a dedicated progress thread, use this name in the functions below to glom your events on to the general OPAL-wide async progress thread. * opal_progress_thread_init(): return an event base corresponding to a progress thread of the specified name (a progress thread will be created for that name if it does not already exist). * opal_progress_thread_finalize(): decrement the refcount on the passed progress thread name. If the refcount is 0, stop the thread and destroy the event base. * opal_progress_thread_pause(): stop processing events on the event base corresponding to the progress thread name, but do not destroy the event base. * opal_progress_thread_resume(): resume processing events on the event base corresponding to a previously-paused progress thread name.	2015-08-07 10:13:40 -07:00
Ralph Castain	219c4dfba5	Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one.	2015-07-12 08:23:34 -07:00
Ralph Castain	683efcb850	Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.	2015-07-11 10:08:19 -07:00
Nathan Hjelm	4d92c9989e	more c99 updates This commit does two things. It removes checks for C99 required headers (stdlib.h, string.h, signal.h, etc). Additionally it removes definitions for required C99 types (intptr_t, int64_t, int32_t, etc). Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-25 10:14:13 -06:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Jeff Squyres	d164fe9bc5	opal_params.c: fix typo in comment	2015-06-06 10:17:20 -07:00
Nathan Hjelm	427aebbaca	Fix cuda support MCA variables This commit fixes some issues with the cuda support parameters. There were a couple of duplicate registrations and an incorrect synonym (one variable was made a synonym of mpi_preconnect_mpi). Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-12 09:52:51 -06:00
Gilles Gouaillardet	c809aace47	initialize common symbols from opal A few uninitialized common symbols are remaining: common symbols generated by flex : * opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng * opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext * opal/util/show_help_lex.l: opal_show_help_yyleng * opal/util/show_help_lex.l: opal_show_help_yytext common symbol generated by "external" hwloc library: * opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map	2015-05-08 09:48:51 +09:00
Nathan Hjelm	8287e1d28f	Merge pull request #528 from hjelmn/add_destructor Add opal destructor/fini function	2015-04-20 11:30:15 -06:00
Jeff Squyres	1e6a558993	opal_info_support.c: whitespace cleanup No code changes	2015-04-20 08:56:42 -07:00
Jeff Squyres	1f237b78d1	*_info tools: quote parsable values if they contain colons Thanks to Lev Givon for the suggestion.	2015-04-20 08:56:42 -07:00
Nathan Hjelm	662460b06b	Modify destructor function configury Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-20 09:51:06 -06:00
Nathan Hjelm	38589c46c0	opal: add a destructor/fini function to opal This commit is related to an RFC from June 2014. Discussion can be found at: http://www.open-mpi.org/community/lists/devel/2014/07/15140.php The finalize function is set using either the linker option -fini or __attribute__((destructor)) depending on compiler support. I have confirmed that this hybrid approach works with all the major compilers. The attribute is supported by gcc, clang, llvm, xlc, and icc. The fini function will support pgi. If a compiler/linker combination does not support either the destructor or fini function a message will be printed on re-init indicating it is not supported (an improvement over the current behavior-- SEGV). I moved the following to the destructor function: - Class system finalize. This solves a bug when MPI_T_finalize is called before MPI_Init. The only downside to this change is we will leave the footprint of the opal class system after MPI_Finalize. This footprint should be relatively small. This is an alternative to #517 but the two PRs are not mutually-exclusive (with some modifications). This commit should also be safe for 1.8.x as it does not change internal or external ABI (#517 changes internal ABI). Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-16 19:53:52 -06:00
Nathan Hjelm	e794658f2d	Merge pull request #516 from hjelmn/repository_update RFC: Repository update	2015-04-15 10:03:08 -06:00
Nathan Hjelm	c954f457d9	mca/base: update the way dynamic components are handled This commit is a rework of the component repository. The changes included in this commit are: - Remove the component dependency code based off .ompi_info files. This code is legacy code dating back 10 years and is no longer used. - Move the plugin scanning code to the component repository. New calls have been added to add new scanning paths, query available components, and dlopen/load components. - Pass the framework down to mca_base_component_find/filter. Eventually the framework structure will be used to further validate components before they are used. - Add support to the MCA framework system to disable scanning for dlopened components on open (support already existed in register). This is really only relevant to installdirs as it has no register function and no DSO components. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-14 15:55:33 -06:00
Nathan Hjelm	a7b0c00ab6	fix memory leaks and valgrind errors This commit fixes several valgrind errors. Included: - installdirs did not correctly reinitialize all pointers to NULL at close. This causes valgrind errors on a subsequent call to opal_init_tool. - several opal strings were leaked by opal_deregister_params which was setting them to NULL instead of letting them be freed by the MCA variable system. - move opal_net_init to AFTER the variable system is initialized and opal's MCA variables have been registered. opal_net_init uses a variable registered by opal_register_params! - do not leak ompi_mpi_main_thread when it is allocated by MPI_T_init_thread. - do not overwrite ompi_mpi_main_thread if it is already set (by MPI_T_init_thread). - mca_base_var: read_files was overwriting mca_base_var_file_list even if it was non-NULL. - mca_base_var: set all file global variables to initial states on finalize. - btl/vader: decrement enumerator reference count to ensure that it is freed. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-11 09:28:35 -06:00
Nathan Hjelm	9cd955badf	opal: fix multiple bugs in MCA and opal This commit fixes the following bugs: - opal_output_finalize did not properly set internal state. This caused problems when calling the sequence opal_output_init (), opal_output_finalize (), opal_output_init (). - opal_info support called mca_base_open () but never called the matching mca_base_close (). mca_base_open () and mca_base_close () have been updated to use a open count instead of an open flag to allow mca_base_open to be called through multiple paths (as may be the case when MPI_T is in use). - orte_info support did not register opal variables. This can cause orte-info to not return opal variables. - opal_info, orte_info, and ompi_info support have been updated to use a register count. - When opening the dl framework the reference count was added to ensure the framework stuck around. The framework being closed prematurely was a bug in the MCA base that has since been corrected. The increment (and associated decrement) have been removed. - dl/dlopen did not set the value of mca_dl_dlopen_component.filename_suffixes_mca_storage on each call to register. Instead the value was set in the component structure. This caused the value to be lost when re-loading the component. Fixed by setting the default value in register. - Reset shmem framework state on close to avoid returning a stale component after reloading opal/shmem. - MCA base parameters were not properly deregistered when the MCA base was closed. This commit may fix #374. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-07 19:13:20 -06:00
Ralph Castain	9dbc69df0f	Stop an ugly infinite loop caused by continual re-opening of the opal if framework.	2015-03-24 17:50:14 -07:00
Adrian Reber	f45dd069bd	FT: fix compilation using --with-ft (1/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. This first patch moves orte_cr_continue_like_restart from ORTE to opal_cr_continue_like_restart in OPAL. This only leaves three calls from OPAL to ORTE in the FT code. As it is not yet 100% clear how to handle these calls the code orte_sstore.set_attr() has been #ifdef'd out for now.	2015-03-11 14:23:33 +01:00
Gilles Gouaillardet	f7f7fa73dd	opal_cr: fix incorrect NULL assignment as reported by Coverity with CID 1288084	2015-03-10 12:06:57 +09:00
Gilles Gouaillardet	1746e23f11	opal/cr: fix misc memory leak and error case as reported by Coverity with CIDs 71858 and 710640	2015-03-09 19:28:52 +09:00
Mike Dubman	98503b56e0	Revert "create the opal_common_verbs_want_fork_support parameter."	2015-03-03 14:28:31 +02:00
Alina Sklarevich	8fe42f1bc1	create the opal_common_verbs_want_fork_support parameter. call the opal_common_verbs_mca_register function to make sure that opal_common_verbs_want_fork_support mca parameter is created and therefore can be used to control the fork support.	2015-03-01 17:40:49 +02:00
Jeff Squyres	336626dafe	spelling: trivial spelling fix s/interupted/interrupted/gi	2015-02-27 18:30:43 -08:00
George Bosilca	aeace0468e	A more sensible fix, move the MCA variable in the verbs common area.	2015-02-26 16:51:09 -05:00
George Bosilca	6777f3ac3c	Add missing qualifiers to the global variable.	2015-02-26 16:25:56 -05:00
Alina Sklarevich	e4c4e7df5e	Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support. In order to have an effect, ibv_fork_init should be called in the beginning of the verbs initialization flow - before the calls to the ibv_create_qp and ibv_create_cq verbs. These functions are called from the oob/ud code and by the time the other verbs components (btl openib, pml yalla, ...) call ibv_fork_init, it's too late. This commit forces the call to ibv_fork_init (if it's requested) right at the beginning of all the components that are using verbs. (ibv_fork_init() can be safely called multiple times) This commit also removes the btl_openib_want_fork_support mca parameter and adds a new mca parameter instead - opal_verbs_want_fork_support. Through this new parameter, fork support may be requested for ALL components. The default value for this parameter is set to 1. Before this commit the btl_openib_want_fork_support parameter didn't provide fork support for the openib btl if its value was set to 1. (because when openib called ibv_fork_init, it was already after the calls to ibv_create_* in oob/ud and therefore it failed).	2015-02-25 10:58:50 +02:00
Jeff Squyres	04d9085c3b	opal_info_support: protect against (group->group_component==NULL) This was CID 1196660	2015-02-24 15:24:09 -05:00
Jeff Squyres	6c3ddf98ae	revert open-mpi/ompi@c75650e68f That commit was a bad idea.	2015-02-23 15:39:53 -08:00
Jeff Squyres	c75650e68f	opal_finalize: fix minor memory leak This string is strdup'ed in opal_init_util().	2015-02-23 12:34:49 -08:00
Jeff Squyres	4a85f759ec	opal_info_support.c: prevent a NULL pointer If NULL is passed in, then assume the caller meant "". This was CID 993714.	2015-02-12 13:41:29 -08:00
Ralph Castain	3ae3b96c17	Fix master compilation - a buried header dependency must have been removed.	2015-02-10 07:22:10 -08:00
Ralph Castain	f28238af59	Fix a race condition seen by Absoft during finalize. Stop the orte progress thread without cleaning it up, thus allowing the frameworks to still cancel their posted recv's. Then cleanup the memory footprint afterwards.	2015-02-05 11:41:37 -08:00
Gilles Gouaillardet	661c35ca67	cleanup dead code caused by the removal of the --with-threads configure option	2015-01-16 19:13:59 +09:00
Nathan Hjelm	d0da29351f	opal_progress: fix sched_yield check	2014-12-09 14:14:20 -07:00
Jeff Squyres	c22e1ae33b	configury: new OPAL_SET_LIB_PREFIX/ORTE_SET_LIB_PREFIX macros These two macros set the prefix for the OPAL and ORTE libraries, respectively. Specifically, the OPAL library will be named libPREFIXopen-pal.la and the ORTE library will be named libPREFIXopen-rte.la. These macros must be called, even if the prefix argument is empty. The intent is that Open MPI will call these macros with an empty prefix, but other projects (such as ORCM) will call these macros with a non-empty prefix. For example, ORCM libraries can be named liborcm-open-pal.la and liborcm-open-rte.la. This scheme is necessary to allow running Open MPI applications under systems that use their own versions of ORTE and OPAL. For example, when running MPI applications under ORTE, if the ORTE and OPAL libraries between OMPI and ORCM are not identical (which, because they are released at different times, are likely to be different), we need to ensure that the OMPI applications link against their ORTE and OPAL libraries, but the ORCM executables link against their ORTE and OPAL libraries.	2014-10-22 10:32:19 -07:00
Jeff Squyres	01fd96bfa5	Revert "Provide a mechanism by which an upstream project can rename the OPAL and ORTE libraries. This is required by projects such as ORCM that have their own ORTE and OPAL libraries in order to avoid library confusion. By renaming their version of the libraries, the OMPI applications can correctly dynamically load the correct one for their build." This reverts commit `63f619f871`.	2014-10-22 10:32:11 -07:00
Ralph Castain	63f619f871	Provide a mechanism by which an upstream project can rename the OPAL and ORTE libraries. This is required by projects such as ORCM that have their own ORTE and OPAL libraries in order to avoid library confusion. By renaming their version of the libraries, the OMPI applications can correctly dynamically load the correct one for their build.	2014-10-10 11:39:08 -07:00
Ralph Castain	fd6a044b7f	Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages. Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.	2014-10-03 16:02:57 -06:00
Jeff Squyres	413e775dbf	version configury: make dist now works Update the VERSION file scheme: * Remove "want_repo_rev". * Add "tarball_version". All values are now always included (major, minor, release, greek, repo_rev). However, configure.ac now runs "opal_get_version.sh ... --tarball", which will return the value of tarball_version (if it is non-empty) or the "full" version string (i.e., "major.minor.releasegreek").	2014-10-02 11:32:54 -07:00
Jeff Squyres	d4e2809531	version: always use all 3 version numbers In all previous releases, the version number would be "A.B.C" unless C was 0, in which case it would be "A.B". This commit changes that scheme to always be "A.B.C", even if C==0. Hence, v1.9.0 will be the first release where this new scheme is evident. This commit was SVN r32816.	2014-09-30 15:54:18 +00:00
Artem Polyakov	f2e586980b	Fix timing framework: 1. Fixes according to (http://www.open-mpi.org/community/lists/devel/2014/09/15869.php) 2. Force mpisync:rank0 to gather results. Now sync info is written by rank0 to the output file. 3. Improve mpirun_prof: 1) adapt to the environment (SLURM/TORQUE); 2) recognize some noteset-related mpirun options. This commit was SVN r32772.	2014-09-23 12:59:54 +00:00
Artem Polyakov	70587d1804	Remove outdated OPAL parameter "opal_pmi_version". Now PMI selection is handled by PMIx MCA. This commit was SVN r32767.	2014-09-20 02:30:23 +00:00
Ralph Castain	dfb952fa78	[Contribution from Artem - moved it to svn from git for him] Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup. This commit was SVN r32738.	2014-09-15 18:00:46 +00:00
Ralph Castain	e32d541c8d	Bring over a slight modification to the opal_init_test routine This commit was SVN r32676.	2014-09-07 15:46:53 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “pmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Jeff Squyres	0a398c155f	opal MCA params: Move (and adapt) help message to opal help file This commit was SVN r32547.	2014-08-16 11:54:41 +00:00
Ralph Castain	a347b19dc1	Add missing include This commit was SVN r32406.	2014-08-01 18:49:37 +00:00
Ralph Castain	76d82b885f	Correctly dereference the thread object This commit was SVN r32321.	2014-07-26 17:01:27 +00:00
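
Note on commit 9cd955badf: that message says mca_base_open () and mca_base_close () now use an open count instead of an open flag, so the base can be opened through multiple paths. The sketch below illustrates that general pattern only. It is not the Open MPI source; the names demo_base_open, demo_base_close and demo_base_is_open are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a subsystem that can be opened from several paths. */
static int demo_open_count = 0;
static bool demo_resources_ready = false;

/* Open: only the first caller performs the real setup. */
int demo_base_open(void)
{
    if (demo_open_count++ > 0) {
        return 0;                 /* already open; just record one more user */
    }
    demo_resources_ready = true;  /* real setup would go here */
    return 0;
}

/* Close: only the last caller performs the real teardown. */
int demo_base_close(void)
{
    assert(demo_open_count > 0);  /* an unbalanced close is a caller bug */
    if (--demo_open_count > 0) {
        return 0;                 /* other users remain; keep state alive */
    }
    demo_resources_ready = false; /* real teardown would go here */
    return 0;
}

/* Report whether the subsystem is currently set up. */
bool demo_base_is_open(void)
{
    return demo_resources_ready;
}
```

With a plain flag, the first close would tear the state down while a second user still depends on it; with a count, teardown happens only when every open has been matched by a close.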

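Note on commit d4e2809531: that message describes a switch from printing "A.B" when C is 0 to always printing "A.B.C". The two schemes might be sketched as follows. The function names demo_version_old and demo_version_new are hypothetical and are not taken from the Open MPI configury.

```c
#include <assert.h>
#include <stdio.h>

/* Old scheme (as described in the commit message): drop the release
 * number when it is zero, so 1.9.0 prints as "1.9". */
int demo_version_old(char *buf, size_t len, int major, int minor, int release)
{
    assert(buf != NULL && len > 0);
    if (release == 0) {
        return snprintf(buf, len, "%d.%d", major, minor);
    }
    return snprintf(buf, len, "%d.%d.%d", major, minor, release);
}

/* New scheme: always print all three numbers, even if release == 0. */
int demo_version_new(char *buf, size_t len, int major, int minor, int release)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%d.%d.%d", major, minor, release);
}
```

Both functions return the snprintf result, i.e. the length of the formatted string when the buffer is large enough.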

300 commits