openmpi

Author	SHA1	Message	Date
Shiqing Fan	2c9a4beffd	Add and remove a few components for windows build. This commit was SVN r25775.	2012-01-25 09:01:27 +00:00
Ralph Castain	bf103de66c	My apologies for doing this outside of the usual time restrictions, but we need to get this in so we can make progress. Move the ORTE-level debugger code back into orterun and out of the ORTE library to resolve symbol conflicts. This commit was SVN r25713.	2012-01-11 15:53:09 +00:00
Ralph Castain	6cbd8fa6c9	Keep everyone in sync with new job state This commit was SVN r25563.	2011-12-02 14:12:40 +00:00
Ralph Castain	07655e2945	Handle the case where the allocator "fibs" to us about the node names. In some cases (ahem...you know who you are!), the allocator will tell us a node number (e.g., "16"). However, the daemon will return a node name (e.g., "nid0016") - leaving us not recognizing its location. So provide a new parameter (can't have too many!) that handles this situation by stripping the prefix from the returned node name. Also do a little cleanup to ensure we cleanly exit from errors, without generating too many annoying messages. This commit was SVN r25562.	2011-12-02 14:10:08 +00:00
Ralph Castain	9b59d8de6f	This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build. Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations. Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way. This commit was SVN r25497.	2011-11-22 21:24:35 +00:00
Ralph Castain	c55cba55a7	Totally trivial spelling fix This commit was SVN r25361.	2011-10-24 14:06:33 +00:00
Ralph Castain	3e72fccacf	Cray's PMI implementation is quite different from slurm's - they extended PMI-1 by adding some, but not all, of the PMI-2 APIs. So you can't just switch to using PMI-2 functions as it isn't a complete implementation. Instead, you have to selectively figure out which ones they have in PMI-2, and use any missing ones from PMI-1. What fun. Modify the configure logic and the PMI components to accommodate Cray's approach. Refactor the PMI error reporting code so it resides in only one place. Cray actually decided -not- to define the PMI-2 error codes, so we have to use the PMI-1 codes instead. More fun. This commit was SVN r25348.	2011-10-21 04:54:38 +00:00
Ralph Castain	b771114086	Fix the fix :-) If the errmgr is going to try and hold the orted until all routes and children are gone, then the exit cmd must do the same. Otherwise, the orted exits immediately without waiting for routes to be dismantled, which is why we don't see the connections close. Also cleanup some diagnostics and add some debug to more clearly see what's going on. This commit was SVN r25321.	2011-10-18 17:56:37 +00:00
George Bosilca	749b63c09d	Provide a generic fix for the termination issue instead of r25248. The termination condition is to be checked at the daemon/HNP level not down in the routing. This commit was SVN r25313. The following SVN revision numbers were found above: r25248 --> open-mpi/ompi@b42ccc89b8	2011-10-18 03:07:37 +00:00
Ralph Castain	89a20de474	Remove unused includes. Ensure that the error log is at least always available as we otherwise segfault when reporting errors that occur prior to opening the errmgr framework This commit was SVN r25288.	2011-10-14 18:45:11 +00:00
George Bosilca	872d377021	Tell what the update status is. This commit was SVN r25259.	2011-10-11 19:49:12 +00:00
Brian Barrett	98e98ce2c5	* opal_atomic_trylock is documented to return 0 if the lock was acquired, 1 otherwise. It was doing the opposite, so this patch fixes the return values. All uses (all in ORTE) used the actual return values, not the documented values, so fix them as well. This commit was SVN r25257.	2011-10-11 18:43:45 +00:00
Terry Dontje	c6691b4122	clean up local procs when abort or abort signal happens This commit was SVN r25237.	2011-10-06 19:19:55 +00:00
Ralph Castain	1cd7b02df3	Add a set of default errmgr components that support solely the default "everything dies on error" behavior. Set their priority to be selected by default, but provide params to adjust those priorities to allow other component selection. This commit was SVN r25139.	2011-09-13 22:03:45 +00:00
Ralph Castain	03ddf8520b	Resolve not-used warnings This commit was SVN r25101.	2011-08-27 14:27:15 +00:00
Wesley Bland	f542ecd578	Fix a couple of problems with the resil code not compiling. This commit was SVN r25099.	2011-08-27 03:21:00 +00:00
George Bosilca	a4245b8d63	Remove some warnings related to the resilience patch. This commit was SVN r25097.	2011-08-27 00:15:34 +00:00
George Bosilca	67ec5a0556	While doing cleanup remove pending warnings. This commit was SVN r25095.	2011-08-26 23:38:06 +00:00
Wesley Bland	4e7ff0bd5e	By popular demand the epoch code is now disabled by default. To enable the epochs and the resilient orte code, use the configure flag: --enable-resilient-orte This will define both: ORTE_ENABLE_EPOCH ORTE_RESIL_ORTE This commit was SVN r25093.	2011-08-26 22:16:14 +00:00
Ralph Castain	1c08a4006c	Refactor some code to remove a few API handles from errmgr. Reviewed/tested by Wes. This commit was SVN r25064.	2011-08-18 16:24:45 +00:00
Wesley Bland	a2a20c3766	I believe this should fix the race condition that Terry is seeing in the MTT tests. It appears that nothing in the errmgr was using the mutexes to protect the odls child list. This commit was SVN r25062.	2011-08-18 14:52:30 +00:00
Shiqing Fan	627f1dd351	Correct several export declarations. This commit was SVN r25047.	2011-08-15 09:45:51 +00:00
Wesley Bland	67feeb6aca	Move the errmgr code back. This shouldn't cause the svn problems that I apparently caused last time. Sorry about that. This one will just be a big changelog. This commit was SVN r25016.	2011-08-08 16:01:08 +00:00
Ralph Castain	bd8e43a2de	Correct debug output so it doesn't falsely report the module This commit was SVN r25003.	2011-08-05 20:30:34 +00:00
Jeff Squyres	294e1f50cd	Remove compiler warning about nested comment This commit was SVN r24984.	2011-08-03 18:30:56 +00:00
Wesley Bland	87a96da99c	Should fix some of the shutdown woes of the errmgr. Correctly checks that the orted's job is completed. Correctly tests to make sure that there is shutdown going on (doesn't rely on orte_orteds_term_ordered). Adds a patch from Ralph to correctly check the status of processes. This commit was SVN r24962.	2011-08-01 14:00:41 +00:00
Wesley Bland	5fde3e0e00	Move the resilient orte errmgr code into a separate errmgr for now while it's still unstable. Reverted errmgr modules back to the original errmgr (with the updates since the resilient code was brought into the trunk). This commit was SVN r24958.	2011-07-28 21:24:34 +00:00
Wesley Bland	b972fd84e1	No longer sends extra FAILED_NOTIFICATION messages in the non-failure case. Should reduce finalize complexity and avoid a race condition that has been detected by a few users. This commit was SVN r24952.	2011-07-26 20:47:44 +00:00
Ralph Castain	8d1b31b887	Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly. Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list. This commit was SVN r24894.	2011-07-13 20:11:14 +00:00
Abhishek Kulkarni	6bf02d1344	Fixes (after the new ORTE resiliency layer was merged) to make the trunk build with C/R flags turned on. This commit was SVN r24872.	2011-07-10 23:36:26 +00:00
Ralph Castain	6496b2f845	Ensure we terminate properly on non-zero exit status This commit was SVN r24859.	2011-07-07 14:33:49 +00:00
Ralph Castain	8ac35a8496	Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate. This commit was SVN r24843.	2011-06-30 14:11:56 +00:00
Wesley Bland	84be81df95	Standardize the initialization of the EPOCH's. Everyone will be starting at MIN anyway (until we implement restart of course) so there's no reason to set the epoch to INVALID and then immediately reset them to MIN. This way there's less room to make mistakes later. This commit was SVN r24829.	2011-06-28 14:20:33 +00:00
Wesley Bland	e1ba09ad51	Add resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Ralph Castain	042ee3ec48	Support the option of outputting error_log messages with something other than the process name This commit was SVN r24784.	2011-06-17 14:50:00 +00:00
Josh Hursey	0eb3b3b7b0	Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9335.php This commit was SVN r24775.	2011-06-15 13:10:13 +00:00
Josh Hursey	9080eaedf3	Fully initialize the orte_errmgr_base_component_t structure for the app. This commit was SVN r24765.	2011-06-09 14:22:25 +00:00
Ralph Castain	b586f2952e	Arggg...revert r24645. I knew those fields were there for a reason...sigh. This commit was SVN r24647. The following SVN revision numbers were found above: r24645 --> open-mpi/ompi@e4732110da	2011-04-28 15:07:00 +00:00
Ralph Castain	e4732110da	Remove a couple more stale fields This commit was SVN r24645.	2011-04-28 00:26:38 +00:00
Ralph Castain	3a28556472	Expand our handling of non-zero exit status. If a process exits with non-zero status, pass that info along to the user in case it means something to them, even if the process also exited without calling MPI_Finalize. If the process calls MPI_Abort, that trumps the exit status question. Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI. This commit was SVN r24614.	2011-04-14 15:04:21 +00:00
Ralph Castain	e6a76cc923	Fixes CID #1954 This commit was SVN r24516.	2011-03-11 23:00:27 +00:00
Ralph Castain	b98a2917ff	Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it. This commit was SVN r24412.	2011-02-18 02:48:12 +00:00
Ralph Castain	a9dca25ca5	Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do. Cleanup the heartbeat handling so it is associated with the proc, not a node. Cleanup handling of recovery options so that defaults do not override user values iff they are provided. This commit was SVN r24382.	2011-02-14 20:49:12 +00:00
Ralph Castain	b5de068533	Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address. Add some new proc/job states Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed. Add pretty-print of signals for "proc aborted due to signal" reports. This commit was SVN r24378. The following SVN revision numbers were found above: r24371 --> open-mpi/ompi@93d28a5792	2011-02-14 19:29:09 +00:00
Josh Hursey	a9335ea423	Make sure to initialize the 'update_state' function for the default module. This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB. This commit was SVN r24365.	2011-02-08 20:42:32 +00:00
Josh Hursey	fa3f6485d8	Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around. This commit was SVN r24326.	2011-01-31 19:09:47 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting this/these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Josh Hursey	81fd41f811	Return an informative error message if the user requests a migration of a job that is not capable of it. C/R Functionality cleanup This commit was SVN r24307.	2011-01-26 15:36:34 +00:00
Josh Hursey	8f45fcb429	More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. * Fix the checkpoint-restart-checkpoint case which would previously reject the checkpoint of the newly restarted process. Re-enabling checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation). * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer. * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process. * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario. This commit was SVN r24306.	2011-01-26 14:56:35 +00:00
Abhishek Kulkarni	fd7ef7a1f1	Fixes broken trunk compile: call process status notify only when ft-enable-cr is selected. This commit was SVN r24255.	2011-01-14 18:37:07 +00:00


203 commits