1
1

917 Коммитов

Автор SHA1 Сообщение Дата
George Bosilca
bd9e48d5cf Add the missing default case. Cleanup required by the author.
This commit was SVN r23939.
2010-10-25 18:55:18 +00:00
Ralph Castain
3c9d167bd2 Fix segfault in ompi_info when no pml_v includes provided
This commit was SVN r23926.
2010-10-24 19:24:44 +00:00
Ralph Castain
fceabb2498 Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.

Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.

Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.

I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:

1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)

2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.

There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.

This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
Rolf vandeVaart
148ed00dd1 Some more refactoring in the BFO PML. Getting it
as close to OB1 PML as possible.

This commit was SVN r23920.
2010-10-22 18:13:35 +00:00
Ralph Castain
1766bf271a Correct an abstraction break that causes ompi_info to segfault if pml-v is not built. Move the definition and instantation of the mca_pml_v struct to the vprotocol base. Include the vprotocol/base/base.h file in pml_v.h. Remove the now useless pml_v.c.
Perhaps somebody out there who cares and uses it can verify that vprotocol works?

This commit was SVN r23919.
2010-10-22 05:12:12 +00:00
Rolf vandeVaart
70fe48698c Change some of the bfo code to be more like the
ob1 code.  Create some new macros and functions to handle
some differences.

This commit was SVN r23913.
2010-10-19 17:46:51 +00:00
Rolf vandeVaart
24e5e38dce Remove a variable that is not needed. Just piggy
back on a pointer value.

This commit was SVN r23887.
2010-10-13 22:01:23 +00:00
Rolf vandeVaart
20c5e6e0d6 Fix a few more cases where we are using a function
as an argument to a macro which could result in it
being called twice.  I did not observe any issues,
but it should be fixed.  Also did some minor refactoring
for clarity and following code convention.

This commit was SVN r23886.
2010-10-12 20:11:48 +00:00
Rolf vandeVaart
44d7006f34 Just some more refactoring and cleanup of bfo PML.
This commit was SVN r23884.
2010-10-12 13:34:35 +00:00
Rolf vandeVaart
59e3fa8ed3 Some more formatting fixes and code refactoring. All
these changes are in the bfo so this has no affect on ob1.

This commit was SVN r23815.
2010-09-29 13:46:45 +00:00
Rolf vandeVaart
f808dd2881 Cosmetic changes to fix spaces. No code change.
This commit was SVN r23803.
2010-09-27 21:01:49 +00:00
Jeff Squyres
73bcc4a36b Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers
This commit was SVN r23801.

The following SVN revision numbers were found above:
  r23764 --> open-mpi/ompi@40a2bfa238
2010-09-24 22:53:28 +00:00
Rolf vandeVaart
3cc1fa45bf Fix a few more extraneous spaces. Also update csum
priority logic to match ob1.

This commit was SVN r23798.
2010-09-24 13:14:18 +00:00
Rolf vandeVaart
0331889495 Some more spaces, tabs, include file ordering changes.
No real code changes here.  

This commit was SVN r23789.
2010-09-22 13:48:22 +00:00
Shiqing Fan
a4c2ed7a87 Fix a few things for Windows build - type cast, modified variable names and unresolved symbols.
This commit was SVN r23783.
2010-09-21 09:40:26 +00:00
Rolf vandeVaart
77560269f2 More fixes of spaces, tabs, and ordering of include files
to make the 3 PMLs the same where they are the same.  No
real code changes.

This commit was SVN r23779.
2010-09-20 21:22:33 +00:00
Jeff Squyres
099816e59e Somehow this file got missed.
This commit was SVN r23766.
2010-09-18 04:37:37 +00:00
Ralph Castain
40a2bfa238 WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues.
This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change.

Please note that configure requirements on components HAVE CHANGED. For example. a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation.

This commit was SVN r23764.
2010-09-17 23:04:06 +00:00
Rolf vandeVaart
65e8277add Mostly fixes for tabs, spaces and indentations.
Also, some other changes to bring the csum PML up
to date with changes that happened in ob1 over the
last two years. This includes a few bug
fixes and some minor refactoring.  

This commit was SVN r23757.
2010-09-15 18:48:06 +00:00
Rolf vandeVaart
31a168695e Some more cleanup of extraneous spaces and tabs. Also
some changes to script to run diffs between PMLs.

This commit was SVN r23749.
2010-09-13 14:58:00 +00:00
Rolf vandeVaart
3bb587937a Just fix up some trailing spaces, tabs instead of spaces,
missing periods on copyrights, extraneous spaces on blank
lines.  No actual code change.

This commit was SVN r23739.
2010-09-10 21:01:52 +00:00
Josh Hursey
e12ca48cd9 A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php

Documentation:
  http://osl.iu.edu/research/ft/

Major Changes: 
-------------- 
 * Added C/R-enabled Debugging support. 
   Enabled with the --enable-crdebug flag. See the following website for more information: 
   http://osl.iu.edu/research/ft/crdebug/ 
 * Added Stable Storage (SStore) framework for checkpoint storage 
   * 'central' component does a direct to central storage save 
   * 'stage' component stages checkpoints to central storage while the application continues execution. 
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) 
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) 
 * Added Compression (compress) framework to support 
 * Add two new ErrMgr recovery policies 
   * {{{crmig}}} C/R Process Migration 
   * {{{autor}}} C/R Automatic Recovery 
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component 
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) 
   * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) 
   * {{{OMPI_CR_Restart}}} 
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) 
   * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) 
   * {{{OMPI_CR_Quiesce_start}}} 
   * {{{OMPI_CR_Quiesce_checkpoint}}} 
   * {{{OMPI_CR_Quiesce_end}}} 
   * {{{OMPI_CR_self_register_checkpoint_callback}}} 
   * {{{OMPI_CR_self_register_restart_callback}}} 
   * {{{OMPI_CR_self_register_continue_callback}}} 
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
 * Add a progress meter to: 
   * FileM rsh (filem_rsh_process_meter) 
   * SnapC full (snapc_full_progress_meter) 
   * SStore stage (sstore_stage_progress_meter) 
 * Added 2 new command line options to ompi-restart 
   * --showme : Display the full command line that would have been exec'ed. 
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) 
 * Deprecated some MCA params: 
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir 
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir 
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared 
   * snapc_base_store_in_place deprecated, replaced with different components of SStore 
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref 
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported 
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem 

Minor Changes: 
-------------- 
 * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. 
 * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components 
 * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. 
 * Fixes trac:2208 : Honor various TMPDIR varaibles instead of forcing {{{/tmp}}} 
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. 
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. 
 * Cleanup the CRS framework and components to work with the SStore framework. 
 * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). 
 * Add 'quiesce' hook to CRCP for a future enhancement. 
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. 
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface). 
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. 
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. 
 * {{{opal-restart}}} also support local decompression before restarting 
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata 
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata 
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. 
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. 
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. 
 * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. 
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. 
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. 
 * Improve the checks for 'already checkpointing' error path. 
 * A a recovery output timer, to show how long it takes to restart a job 
 * Do a better job of cleaning up the old session directory on restart. 
 * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) 
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. 

This commit was SVN r23587.

The following Trac tickets were found above:
  Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
  Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
  Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
  Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
  Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
  Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
  Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
2010-08-10 20:51:11 +00:00
Rolf vandeVaart
0324fdb407 Created two new macros that are used when filling in either the
status structure or the _ucount field in the status structure.
On 64-bit sparc, the macros resolve into integer array assignments.
For all others, they are just simple assignments.  This fixes 
possible BUS errors seen when running on the SPARC processor.
This bug was introduced when the _count field changed from an int
into a size_t.  See the changes to request.h for additional details.

This commit fixes trac:2514.

This commit was SVN r23554.

The following Trac tickets were found above:
  Ticket 2514 --> https://svn.open-mpi.org/trac/ompi/ticket/2514
2010-08-04 19:36:40 +00:00
George Bosilca
733d25a8a3 First step toward fixing the MPI_Get_count issues from the ticket #2241. Next
step is the configure and Fortran mojo that Jeff will put in. Until then I
guess the Fortran interface is broken (at least all functions using the hidden
count firld in the MPI_Status).

This commit was SVN r23467.
2010-07-21 20:07:00 +00:00
Rolf vandeVaart
45019a3abf Correctly handle zero-length match fragment.
This commit was SVN r23459.
2010-07-21 15:27:06 +00:00
Rolf vandeVaart
3abb5556a6 Fix bug pointed out by George Bosilca. Also
remove unneeded temp variable.

This commit was SVN r23424.
2010-07-15 19:32:31 +00:00
Jeff Squyres
bfe6c95dce Remove .windows because it doesn't exist in this directory.
This commit was SVN r23406.
2010-07-14 02:14:32 +00:00
Rolf vandeVaart
e27f953fa4 Fix casting in assert.
This commit was SVN r23388.
2010-07-13 12:02:12 +00:00
Rolf vandeVaart
19d007a6fc New PML to support failover between openib BTLs.
openib BTL changes coming soon.

This commit was SVN r23385.
2010-07-13 10:46:20 +00:00
Jeff Squyres
c8bb7537e7 Remove include/opal/sys/cache.h -- its only purpose in life was to
#define CACHE_LINE_SIZE to 128.  This name has a conflict on NetBSD,
and it seems kinda odd to have a header file that ''only'' defines a
single value.  Also, we'll soon be raising hwloc to be a first-class
item, so having this file around seemed kinda weird.

Therefore, I replaced CACHE_LINE_SIZE with opal_cache_line_size, an
int (in opal/runtime/opal_init.c and opal/runtime/opal.h) on the
rationale that we can fill this in at runtime with hwloc info (trunk
and v1.5/beyond, only).  The only place we ''needed'' a compile-time
CACHE_LINE_SIZE was in the BTL SM (for struct padding), so I made a
new BTL_SM_ preprocessor macro with the old CACHE_LINE_SIZE value
(128).  That use isn't suitable for run-time hwloc information,
anyway.

This commit was SVN r23349.
2010-07-06 14:33:36 +00:00
Rolf vandeVaart
03b3e75f86 Add two arguments to the PML error callback function. This
allows the BTL to specify a specific ompi_proc_t that had an
error.  Also add an optional descriptive string.  Currently, arguments
are not used but will be by future failover PML. 
Changes based on RFC.  Reviewed by George Bosilca.

This commit was SVN r23174.
2010-05-19 11:55:45 +00:00
Abhishek Kulkarni
afbe3e99c6 * Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with
(OMPI_ERR_* = OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
 SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
 back the native error code.

* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
  (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
  decode 'ret' to get the native error code.

This commit was SVN r23162.
2010-05-17 23:08:56 +00:00
George Bosilca
321213e779 Fix segmentation fault on heterogeneous architectures. Don't mess with the
ompi_ptr_t by translating into void*. Instead keep it as an ompi_ptr_t all
the way. Thanks to Timur Magomedov for helping to track down this issue and
test the patch.

cmr:v1.4
cmr:v1.5

This commit was SVN r23030.
2010-04-23 15:14:55 +00:00
Rainer Keller
a48a11821b - mca_base_param_reg_string_name allocates default_pml.
As it is strdup, just free(default_pml).

   cmr:v1.5

This commit was SVN r22955.
2010-04-12 19:54:07 +00:00
Rolf vandeVaart
0adb570693 Add pml_ob1_verbose flag. Fix the current location it is being used
This commit was SVN r22939.
2010-04-07 13:51:42 +00:00
Ralph Castain
522a23d6a3 A few changes to the FT-related configure options:
1. fix a bug that caused an infinite loop in configure when specifying want-ft but not want-ft-thread by removing a stale reference to the opal-progress-thread option

2. add want-ft=orcm so we can build the orcm errmgr component

3. cleanup the use of "ompi_want_ft_xxx" and replace it with "opal_want_ft_xxx" so that naming conventions are preserved

This commit was SVN r22885.
2010-03-25 22:53:48 +00:00
Ralph Castain
b400b84162 Merge in the modified thread configure option branch per today's telecon.
Remove the --enable-progress-threads option as this is no longer functional, and hardcode OPAL_ENABLE_PROGRESS_THREADS to 0.

Replace the --enable-mpi-threads option with --enable-mpi-thread-multiple as this is clearer as to meaning. This option automatically turns "on" opal thread support if it wasn't already so specified. If the user specifies --disable-opal-multi-threads --enable-mpi-thread-multiple, we will error out with a message

Add a new --enable-opal-multi-threads option that turns "on" opal thread support without doing anything wrt mpi-thread-multiple

This commit was SVN r22841.
2010-03-16 23:10:50 +00:00
Josh Hursey
e9b5162d79 Fix the configure logic for --with-ft so that it properly takes a comma separated list.
Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.

The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.

This commit was SVN r22824.
2010-03-12 23:57:50 +00:00
Rainer Keller
06f5ba1c19 - Reverse the logic (OPAL_LIKELY -> OPAL_UNLIKELY)
This commit was SVN r22796.
2010-03-08 14:00:59 +00:00
Nysal Jan
97d66bce78 This fixes trac:2154 - CSUM PML false positive. Needs to go to both cmr:v1.4.2 and cmr:v1.5
This commit was SVN r22590.

The following Trac tickets were found above:
  Ticket 2154 --> https://svn.open-mpi.org/trac/ompi/ticket/2154
2010-02-10 10:24:16 +00:00
Shiqing Fan
ad763c327d Restore several linked libraries that were deleted by mistake in r22405.
This commit was SVN r22415.

The following SVN revision numbers were found above:
  r22405 --> open-mpi/ompi@872a4047ba
2010-01-14 21:50:42 +00:00
Avneesh Pant
8bdd334d95 Allow the PSM component to return ERR_NOT_AVAIL so it can be unloaded silently if executed on a node with no QLogic IB hardware. Also minor modifications to have the CM PML allow itself to be unloaded if no MTL components are available. The component selection logic can then continue to use other PMLs.
This commit was SVN r22410.
2010-01-14 19:39:35 +00:00
Shiqing Fan
872a4047ba Fix the bug that caused by ADD_DEPENDENCIES() from different version of CMake.
In CMake 2.6 and earlier, this function add dependencies for targets and also link the target libraries automatically, but in CMake 2.8,this behavior has been changed, i.e. it will only add the dependencies but no link, which will cause linking errors at compilation time.

This commit was SVN r22405.
2010-01-14 18:10:20 +00:00
Jeff Squyres
2bdcb2a979 Move CM's MCA params into their own function (component.register).
This commit was SVN r22392.
2010-01-12 20:11:47 +00:00
Shiqing Fan
c37308b8eb Remove the deleted windows file from the tarball.
This commit was SVN r22347.
2009-12-29 16:11:32 +00:00
Shiqing Fan
a2d00d4ab8 Exclude a pml component that is not necessary for Windows.
This commit was SVN r22342.
2009-12-28 16:12:28 +00:00
Jeff Squyres
4f68dfb03c Remove some dead code (thanks to George for pointing it out).
This commit was SVN r22309.
2009-12-14 21:20:41 +00:00
George Bosilca
76222eb869 Get rid of the useless mca_pml_base_endpoint_t and replace it by
[the well known and widely used!] mca_pml_endpoint_t.

This commit was SVN r22277.
2009-12-08 17:29:54 +00:00
Aurelien Bouteiller
59156cd92a Fix gcc 4.3 warning berserk about non-literal string format.
This commit was SVN r22147.
2009-10-27 20:45:02 +00:00
George Bosilca
16c6370b73 A little bit of cleanup, the main logic is still the same.
This commit was SVN r22043.
2009-10-01 14:05:25 +00:00