1
1
Граф коммитов

203 Коммитов

Автор SHA1 Сообщение Дата
Shiqing Fan
2c9a4beffd Add and remove a few components for windows build.
This commit was SVN r25775.
2012-01-25 09:01:27 +00:00
Ralph Castain
bf103de66c My apologies for doing this outside of the usual time restrictions, but we need to get this in so we can make progress.
Move the ORTE-level debugger code back into orterun and out of the ORTE library to resolve symbol conflicts.

This commit was SVN r25713.
2012-01-11 15:53:09 +00:00
Ralph Castain
6cbd8fa6c9 Keep everyone in sync with new job state
This commit was SVN r25563.
2011-12-02 14:12:40 +00:00
Ralph Castain
07655e2945 Handle the case where the allocator "fibs" to us about the node names. In some cases (ahem...you know who you are!), the allocator will tell us a node number (e.g., "16"). However, the daemon will return a node name (e.g., "nid0016") - leaving us not recognizing its location.
So provide a new parameter (can't have too many!) that handles this situation by stripping the prefix from the returned node name. Also do a little cleanup to ensure we cleanly exit from errors, without generating too many annoying messages.

This commit was SVN r25562.
2011-12-02 14:10:08 +00:00
Ralph Castain
9b59d8de6f This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build.
Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations.

Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way.

This commit was SVN r25497.
2011-11-22 21:24:35 +00:00
Ralph Castain
c55cba55a7 Totally trivial spelling fix
This commit was SVN r25361.
2011-10-24 14:06:33 +00:00
Ralph Castain
3e72fccacf Cray's PMI implementation is quite different from slurm's - they extended PMI-1 by adding some, but not all, of the PMI-2 APIs. So you can't just switch to using PMI-2 functions as it isn't a complete implementation. Instead, you have to selectively figure out which ones they have in PMI-2, and use any missing ones from PMI-1. What fun.
Modify the configure logic and the PMI components to accommodate Cray's approach. Refactor the PMI error reporting code so it resides in only one place. Cray actually decided -not- to define the PMI-2 error codes, so we have to use the PMI-1 codes instead. More fun.

This commit was SVN r25348.
2011-10-21 04:54:38 +00:00
Ralph Castain
b771114086 Fix the fix :-)
If the errmgr is going to try and hold the orted until all routes and children are gone, then the exit cmd must do the same. Otherwise, the orted exits immediately without waiting for routes to be dismantled, which is why we don't see the connections close.

Also cleanup some diagnostics and add some debug to more clearly see what's going on.

This commit was SVN r25321.
2011-10-18 17:56:37 +00:00
George Bosilca
749b63c09d Provide a generic fix for the termination issue instead of r25248. The
termination condition is to be checked at the daemon/HNP level not down
in the routing.

This commit was SVN r25313.

The following SVN revision numbers were found above:
  r25248 --> open-mpi/ompi@b42ccc89b8
2011-10-18 03:07:37 +00:00
Ralph Castain
89a20de474 Remove unused includes. Ensure that the error log is at least always available as we otherwise segfault when reporting errors that occur prior to opening the errmgr framework
This commit was SVN r25288.
2011-10-14 18:45:11 +00:00
George Bosilca
872d377021 Tell what the update status is.
This commit was SVN r25259.
2011-10-11 19:49:12 +00:00
Brian Barrett
98e98ce2c5 * opal_atomic_trylock is documented to return 0 if the lock was acquired,
1 otherwise.  It was doing the opposite, so this patch fixes the
  return values.  All uses (all in ORTE) used the actual return values,
  not the documented values, so fix them as well.

This commit was SVN r25257.
2011-10-11 18:43:45 +00:00
Terry Dontje
c6691b4122 clean up local procs when abort or abort signal happens
This commit was SVN r25237.
2011-10-06 19:19:55 +00:00
Ralph Castain
1cd7b02df3 Add a set of default errmgr components that support solely the default "everything dies on error" behavior. Set their priority to be selected by default, but provide params to adjust those priorities to allow other component selection.
This commit was SVN r25139.
2011-09-13 22:03:45 +00:00
Ralph Castain
03ddf8520b Resolve not-used warnings
This commit was SVN r25101.
2011-08-27 14:27:15 +00:00
Wesley Bland
f542ecd578 Fix a couple of problems with the resil code not compiling.
This commit was SVN r25099.
2011-08-27 03:21:00 +00:00
George Bosilca
a4245b8d63 Remove some warnings related to the resilience patch.
This commit was SVN r25097.
2011-08-27 00:15:34 +00:00
George Bosilca
67ec5a0556 While doing cleanup remove pending warnings.
This commit was SVN r25095.
2011-08-26 23:38:06 +00:00
Wesley Bland
4e7ff0bd5e By popular demand the epoch code is now disabled by default.
To enable the epochs and the resilient orte code, use the configure flag:

--enable-resilient-orte

This will define both:

ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE

This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Ralph Castain
1c08a4006c Refactor some code to remove a few API handles from errmgr. Reviewed/tested by Wes.
This commit was SVN r25064.
2011-08-18 16:24:45 +00:00
Wesley Bland
a2a20c3766 I believe this should fix the race condition that Terry is seeing in the MTT
tests. It appears that nothing in the errmgr was using the mutexes to protect
the odls child list.

This commit was SVN r25062.
2011-08-18 14:52:30 +00:00
Shiqing Fan
627f1dd351 Correct several export declarations.
This commit was SVN r25047.
2011-08-15 09:45:51 +00:00
Wesley Bland
67feeb6aca Move the errmgr code back. This shouldn't cause the svn problems that I
apparently caused last time. Sorry about that. This one will just be a big
changelog.  

This commit was SVN r25016.
2011-08-08 16:01:08 +00:00
Ralph Castain
bd8e43a2de Correct debug output so it doesn't falsely report the module
This commit was SVN r25003.
2011-08-05 20:30:34 +00:00
Jeff Squyres
294e1f50cd Remove compiler warning about nested comment
This commit was SVN r24984.
2011-08-03 18:30:56 +00:00
Wesley Bland
87a96da99c Should fix some of the shutdown woes of the errmgr.
Correctly checks that the orted's job is completed.
Correctly tests to make sure that there is shutdown going on (doesn't rely on orte_orteds_term_ordered).
Adds a patch from Ralph to correctdly check the status of processes.

This commit was SVN r24962.
2011-08-01 14:00:41 +00:00
Wesley Bland
5fde3e0e00 Move the resilient orte errmgr code into a seperate errmgr for now while it's
still unstable. Reverted errmgr modules back to the original errmgr (with the
updates since the resilient code was brought into the trunk).

This commit was SVN r24958.
2011-07-28 21:24:34 +00:00
Wesley Bland
b972fd84e1 No longer sends extra FAILED_NOTIFICATION messages in the non-failure case.
Should reduce finalize complexity and avoid a race condition that has been
detected by a few users.

This commit was SVN r24952.
2011-07-26 20:47:44 +00:00
Ralph Castain
8d1b31b887 Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly.
Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list.

This commit was SVN r24894.
2011-07-13 20:11:14 +00:00
Abhishek Kulkarni
6bf02d1344 Fixes (after the new ORTE resiliency layer was merged) to make the trunk build with C/R flags turned on.
This commit was SVN r24872.
2011-07-10 23:36:26 +00:00
Ralph Castain
6496b2f845 Ensure we terminate properly on non-zero exit status
This commit was SVN r24859.
2011-07-07 14:33:49 +00:00
Ralph Castain
8ac35a8496 Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate.
This commit was SVN r24843.
2011-06-30 14:11:56 +00:00
Wesley Bland
84be81df95 Standardize the initialization of the EPOCH's.
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epoch to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.

This commit was SVN r24829.
2011-06-28 14:20:33 +00:00
Wesley Bland
e1ba09ad51 Add a resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
042ee3ec48 Support the option of outputting error_log messages with something other than the process name
This commit was SVN r24784.
2011-06-17 14:50:00 +00:00
Josh Hursey
0eb3b3b7b0 Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits.
Per RFC:
  http://www.open-mpi.org/community/lists/devel/2011/06/9335.php

This commit was SVN r24775.
2011-06-15 13:10:13 +00:00
Josh Hursey
9080eaedf3 Fully initialize the orte_errmgr_base_component_t structure for the app.
This commit was SVN r24765.
2011-06-09 14:22:25 +00:00
Ralph Castain
b586f2952e Arggg...revert r24645. I knew those fields were there for a reason...sigh.
This commit was SVN r24647.

The following SVN revision numbers were found above:
  r24645 --> open-mpi/ompi@e4732110da
2011-04-28 15:07:00 +00:00
Ralph Castain
e4732110da Remove a couple more stale fields
This commit was SVN r24645.
2011-04-28 00:26:38 +00:00
Ralph Castain
3a28556472 Expand our handling of non-zero exit status. If a process exits with non-zero status, pass that info along to the user in case it means something to them, even if the process also exited without calling MPI_Finalize. If the process calls MPI_Abort, that trumps the exit status question.
Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI.

This commit was SVN r24614.
2011-04-14 15:04:21 +00:00
Ralph Castain
e6a76cc923 Fixes CID #1954
This commit was SVN r24516.
2011-03-11 23:00:27 +00:00
Ralph Castain
b98a2917ff Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it.
This commit was SVN r24412.
2011-02-18 02:48:12 +00:00
Ralph Castain
a9dca25ca5 Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do.
Cleanup the heartbeat handling so it is associated with the proc, not a node.

Cleanup handling of recovery options so that defaults do not override user values iff they are provided.

This commit was SVN r24382.
2011-02-14 20:49:12 +00:00
Ralph Castain
b5de068533 Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address.
Add some new proc/job states

Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed.

Add pretty-print of signals for "proc aborted due to signal" reports.

This commit was SVN r24378.

The following SVN revision numbers were found above:
  r24371 --> open-mpi/ompi@93d28a5792
2011-02-14 19:29:09 +00:00
Josh Hursey
a9335ea423 Make sure to initialize the 'update_state' function for the default module.
This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB.

This commit was SVN r24365.
2011-02-08 20:42:32 +00:00
Josh Hursey
fa3f6485d8 Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around.
This commit was SVN r24326.
2011-01-31 19:09:47 +00:00
Josh Hursey
8ec85c6b8f Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally.
I want to thank Hugo Meyer for reporting this/these bugs.

Notes:
 * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation).
 * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function.
 * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery.
 * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup.
 * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading.

This commit was SVN r24317.
2011-01-27 20:40:23 +00:00
Josh Hursey
81fd41f811 Return an informative error message if the user requests a migration of a job that is not capable of it.
C/R Functionality cleanup

This commit was SVN r24307.
2011-01-26 15:36:34 +00:00
Josh Hursey
8f45fcb429 More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now.
* Fix the checkpoint-restart-checkpoint case which would previous reject the checkpoint of the newly restarted process. By making sure to re-enable checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation).
 * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer.
 * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process.
 * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario.

This commit was SVN r24306.
2011-01-26 14:56:35 +00:00
Abhishek Kulkarni
fd7ef7a1f1 Fixes broken trunk compile: call process status notify
only when ft-enable-cr is selected.

This commit was SVN r24255.
2011-01-14 18:37:07 +00:00