1
1
Граф коммитов

300 Коммитов

Автор SHA1 Сообщение Дата
Jeff Squyres
92bc8afd43 opal_progress_threads: fix double RELEASE
If a thread failed to start, the tracker would be released twice.
This commit fixes CID 1316020.
2015-08-12 05:11:40 -07:00
Jeff Squyres
99fa054507 opal_progress_threads: update to the API
There are now four functions and one global constant:

* opal_progress_thread_name: the name of the OPAL-wide async progress
  thread.  If you have general purpose events that you need to run in
  *a* progress thread, but not a *dedicated* progress thread, use this
  name in the functions below to glom your events on to the general
  OPAL-wide async progress thread.
* opal_progress_thread_init(): return an event base corresponding to a
  progress thread of the specified name (a progress thread will be
  created for that name if it does not already exist).
* opal_progress_thread_finalize(): decrement the refcount on the
  passed progress thread name.  If the refcount is 0, stop the thread
  and destroy the event base.
* opal_progress_thread_pause(): stop processing events on the event
  base corresponding to the progress thread name, but do not destroy
  the event base.
* opal_progess_thread_resume(): resume processing events on the event
  base corresponding to a previously-paused progress thread name.
2015-08-07 10:13:40 -07:00
Ralph Castain
219c4dfba5 Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one. 2015-07-12 08:23:34 -07:00
Ralph Castain
683efcb850 Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename. 2015-07-11 10:08:19 -07:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Jeff Squyres
d164fe9bc5 opal_params.c: fix typo in comment 2015-06-06 10:17:20 -07:00
Nathan Hjelm
427aebbaca Fix cuda support MCA variables
This commit fixes some issues with the cuda support parameters. There
were a couple of duplicate registrations and an incorrect synonym (one
variable was made a synonym of mpi_preconnect_mpi).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-12 09:52:51 -06:00
Gilles Gouaillardet
c809aace47 initialize common symbols from opal
A few uninitialized common symbols are remaining:

common symbols generated by flex :
 * opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng
 * opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext
 * opal/util/show_help_lex.l: opal_show_help_yyleng
 * opal/util/show_help_lex.l: opal_show_help_yytext

common symbol generated by "external" hwloc library:
 * opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map
2015-05-08 09:48:51 +09:00
Nathan Hjelm
8287e1d28f Merge pull request #528 from hjelmn/add_destructor
Add opal destructor/fini function
2015-04-20 11:30:15 -06:00
Jeff Squyres
1e6a558993 opal_info_support.c: whitespace cleanup
No code changes
2015-04-20 08:56:42 -07:00
Jeff Squyres
1f237b78d1 *_info tools: quote parsable values if they contain colons
Thanks to Lev Givon for the suggestion.
2015-04-20 08:56:42 -07:00
Nathan Hjelm
662460b06b Modify destructor function configury
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-20 09:51:06 -06:00
Nathan Hjelm
38589c46c0 opal: add a destructor/fini function to opal
This commit is related to an RFC from June 2014. Disscussion can be
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to #517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-16 19:53:52 -06:00
Nathan Hjelm
e794658f2d Merge pull request #516 from hjelmn/repository_update
RFC: Repository update
2015-04-15 10:03:08 -06:00
Nathan Hjelm
c954f457d9 mca/base: update the way dynamic components are handled
This commit is a rework of the component repository. The changes
included in this commit are:

 - Remove the component dependency code based off .ompi_info
   files. This code is legacy code dating back 10 years that and is no
   longer used.

 - Move the plugin scanning code to the component repository. New
   calls have been added to add new scanning paths, query available
   components, and dlopen/load components.

 - Pass the framework down to mca_base_component_find/filter. Eventually
   the framework structure will be used to further validate components
   before they are used.

 - Add support to the MCA framework system to disable scanning for
   dlopened components on open (support already existed in
   register). This is really only relevant to installdirs as it has no
   register function and no DSO components.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-14 15:55:33 -06:00
Nathan Hjelm
a7b0c00ab6 fix memory leaks and valgrind errors
This commit fixes several vagrind errors. Included:

 - installdirs did not correctly reinitialize all pointers to NULL
   at close. This causes valgrind errors on a subsequent call to
   opal_init_tool.

 - several opal strings were leaked by opal_deregister_params which
   was setting them to NULL instead of letting them be freed by the
   MCA variable system.

 - move opal_net_init to AFTER the variable system is initialized and
   opal's MCA variables have been registered. opal_net_init uses a
   variable registered by opal_register_params!

 - do not leak ompi_mpi_main_thread when it is allocated by
   MPI_T_init_thread.

 - do not overwrite ompi_mpi_main_thread if it is already set (by
   MPI_T_init_thread).

 - mca_base_var: read_files was overwritting mca_base_var_file_list
   even if it was non-NULL.

 - mca_base_var: set all file global variables to initial states on
   finalize.

 - btl/vader: decrement enumerator reference count to ensure that it
   is freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-11 09:28:35 -06:00
Nathan Hjelm
9cd955badf opal: fix multiple bugs in MCA and opal
This commit fixes the following bugs:

 - opal_output_finalize did not properly set internal state. This
   caused problems when calling the sequence opal_output_init (),
   opal_output_finalize (), opal_output_init ().

 - opal_info support called mca_base_open () but never called the
   matching mca_base_close (). mca_base_open () and mca_base_close ()
   have been updated to use a open count instead of an open flag to
   allow mca_base_open to be called through multiple paths (as may be
   the case when MPI_T is in use).

 - orte_info support did not register opal variables. This can cause
   orte-info to not return opal variables.

 - opal_info, orte_info, and ompi_info support have been updated to
   use a register count.

 - When opening the dl framework the reference count was added to
   ensure the framework stuck around. The framework being closed
   prematurely was a bug in the MCA base that has since been
   corrected. The increment (and associated decrement) have been
   removed.

 - dl/dlopen did not set the value of
   mca_dl_dlopen_component.filename_suffixes_mca_storage on each call
   to register. Instead the value was set in the component
   structure. This caused the value to be lost when re-loading the
   component. Fixed by setting the default value in register.

 - Reset shmem framework state on close to avoid returning a stale
   component after reloading opal/shmem.

 - MCA base parameters were not properly deregistered when the MCA
   base was closed.

This commit may fix #374.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-07 19:13:20 -06:00
Ralph Castain
9dbc69df0f Stop an ugly infinite loop caused by continual re-opening of the opal if framework. 2015-03-24 17:50:14 -07:00
Adrian Reber
f45dd069bd FT: fix compilation using --with-ft (1/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
2015-03-11 14:23:33 +01:00
Gilles Gouaillardet
f7f7fa73dd opal_cr: fix incorrect NULL assignment
as reported by Coverity with CID 1288084
2015-03-10 12:06:57 +09:00
Gilles Gouaillardet
1746e23f11 opal/cr: fix misc memory leak and error case
as reported by Coverity with CIDs 71858 and 710640
2015-03-09 19:28:52 +09:00
Mike Dubman
98503b56e0 Revert "create the opal_common_verbs_want_fork_support parameter." 2015-03-03 14:28:31 +02:00
Alina Sklarevich
8fe42f1bc1 create the opal_common_verbs_want_fork_support parameter.
call the opal_common_verbs_mca_register function to make sure that
opal_common_verbs_want_fork_support mca parameter is created and therefore
can be used to control the fork support.
2015-03-01 17:40:49 +02:00
Jeff Squyres
336626dafe spelling: trivial spelling fix
s/interupted/interrupted/gi
2015-02-27 18:30:43 -08:00
George Bosilca
aeace0468e A more sensible fix, move the MCA variable in the verbs common area. 2015-02-26 16:51:09 -05:00
George Bosilca
6777f3ac3c Add missing qualifiers to the global variable. 2015-02-26 16:25:56 -05:00
Alina Sklarevich
e4c4e7df5e Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)

This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.

Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and thereofre it failed).
2015-02-25 10:58:50 +02:00
Jeff Squyres
04d9085c3b opal_info_support: protect against (group->group_component==NULL)
This was CID 1196660
2015-02-24 15:24:09 -05:00
Jeff Squyres
6c3ddf98ae revert open-mpi/ompi@c75650e68f
That commit was a bad idea.
2015-02-23 15:39:53 -08:00
Jeff Squyres
c75650e68f opal_finalize: fix minor memory leak
This string is strdup'ed in opal_init_util().
2015-02-23 12:34:49 -08:00
Jeff Squyres
4a85f759ec opal_info_support.c: prevent a NULL pointer
If NULL is passed in, then assume the caller meant "".

This was CID 993714.
2015-02-12 13:41:29 -08:00
Ralph Castain
3ae3b96c17 Fix master compilation - a buried header dependency must have been removed. 2015-02-10 07:22:10 -08:00
Ralph Castain
f28238af59 Fix a race condition seen by Absoft during finalize. Stop the orte progress thread without cleaning it up, thus allowing the frameworks to still cancel their posted recv's. Then cleanup the memory footprint afterwards. 2015-02-05 11:41:37 -08:00
Gilles Gouaillardet
661c35ca67 cleanup dead code caused by the removal of the --with-threads configure option 2015-01-16 19:13:59 +09:00
Nathan Hjelm
d0da29351f opal_progress: fix sched_yield check 2014-12-09 14:14:20 -07:00
Jeff Squyres
c22e1ae33b configury: new OPAL_SET_LIB_PREFIX/ORTE_SET_LIB_PREFIX macros
These two macros set the prefix for the OPAL and ORTE libraries,
respectively.  Specifically, the OPAL library will be named
libPREFIXopen-pal.la and the ORTE library will be named
libPREFIXopen-rte.la.

These macros must be called, even if the prefix argument is empty.

The intent is that Open MPI will call these macros with an empty
prefix, but other projects (such as ORCM) will call these macros with
a non-empty prefix.  For example, ORCM libraries can be named
liborcm-open-pal.la and liborcm-open-rte.la.

This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL.  For example,
when running MPI applications under ORTE, if the ORTE and OPAL
libraries between OMPI and ORCM are not identical (which, because they
are released at different times, are likely to be different), we need
to ensure that the OMPI applications link against their ORTE and OPAL
libraries, but the ORCM executables link against their ORTE and OPAL
libraries.
2014-10-22 10:32:19 -07:00
Jeff Squyres
01fd96bfa5 Revert "Provide a mechanism by which an upstream project can rename
the OPAL and ORTE libraries. This is required by projects such as ORCM
that have their own ORTE and OPAL libraries in order to avoid library
confusion. By renaming their version of the libraries, the OMPI
applications can correctly dynamically load the correct one for their
build."

This reverts commit 63f619f871.
2014-10-22 10:32:11 -07:00
Ralph Castain
63f619f871 Provide a mechanism by which an upstream project can rename the OPAL and ORTE libraries. This is required by projects such as ORCM that have their own ORTE and OPAL libraries in order to avoid library confusion. By renaming their version of the libraries, the OMPI applications can correctly dynamically load the correct one for their build. 2014-10-10 11:39:08 -07:00
Ralph Castain
fd6a044b7f Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
2014-10-03 16:02:57 -06:00
Jeff Squyres
413e775dbf version configury: make dist now works
Update the VERSION file scheme:

* Remove "want_repo_rev".
* Add "tarball_version".

All values are now always included (major, minor, release, greek,
repo_rev).  However, configure.ac now runs "opal_get_version.sh
... --tarball", which will return the value of tarball_version (if it
is non-empty) or the "full" version string (i.e.,
"major.minor.releasegreek").
2014-10-02 11:32:54 -07:00
Jeff Squyres
d4e2809531 version: always use all 3 version numbers
In all previous releases, the version number would be "A.B.C" unless C
was 0, in which case it would be "A.B".  This commit changes that
scheme to always be "A.B.C", even if C==0.

Hence, v1.9.0 will be the first release where this new scheme is evident.

This commit was SVN r32816.
2014-09-30 15:54:18 +00:00
Artem Polyakov
f2e586980b Fix timing framework:
1. Fixes according to (http://www.open-mpi.org/community/lists/devel/2014/09/15869.php)
2. Force mpisync:rank0 to gather results. Now sync info is written by rank0 to the output file.
3. Improve mpirun_prof: 1) adopt to the environment (SLURM/TORQUE); 2) recognize some noteset-related mpirun options.

This commit was SVN r32772.
2014-09-23 12:59:54 +00:00
Artem Polyakov
70587d1804 Remove outdated OPAL parameter "opal_pmi_version". Now PMI selection is handled by PMIx MCA.
This commit was SVN r32767.
2014-09-20 02:30:23 +00:00
Ralph Castain
dfb952fa78 [Contribution from Artem - moved it to svn from git for him]
Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup.

This commit was SVN r32738.
2014-09-15 18:00:46 +00:00
Ralph Castain
e32d541c8d Bring over a slight modification to the opal_init_test routine
This commit was SVN r32676.
2014-09-07 15:46:53 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “lmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
Jeff Squyres
0a398c155f opal MCA params: Move (and adapt) help message to opal help file
This commit was SVN r32547.
2014-08-16 11:54:41 +00:00
Ralph Castain
a347b19dc1 Add missing include
This commit was SVN r32406.
2014-08-01 18:49:37 +00:00
Ralph Castain
76d82b885f Correctly dereference the thread object
This commit was SVN r32321.
2014-07-26 17:01:27 +00:00