Ralph Castain
d6a1d7a082
Little more cleanup on paffinity. Provide a specific error code for affinity not supported so we can better report the problem. Move the error reporting to orterun so we only get one error message. Update the darwin paffinity module to return the correct new error codes.
...
This commit was SVN r23107.
2010-05-07 14:04:55 +00:00
Ralph Castain
d4f56cff61
More cleanup on paffinity....groan
...
It is okay to not have a paffinity module IF you aren't using paffinity anyway. So don't error out of MPI_Init because a paffinity module wasn't selected.
Cleanup error reporting in the odls default module to (once and for all!) eliminate messages originating in the fork'd process. Create some new error codes to allow us to pass enough info back to the parent process to provide useful error messages.
This commit was SVN r23106.
2010-05-06 20:57:17 +00:00
Jeff Squyres
477201e161
Fix "make dist" breakage
...
This commit was SVN r23105.
2010-05-06 18:47:20 +00:00
Shiqing Fan
76726f6094
Update a few module configurations for Windows.
...
This commit was SVN r23104.
2010-05-05 12:22:04 +00:00
Ralph Castain
2ff1ae13e1
Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set.
...
This commit was SVN r23102.
2010-05-05 00:48:43 +00:00
Ralph Castain
99f223210d
Add some contributed examples of how to start and configure the spread library. Do a little more cleanup on the spread module, and ensure that it isn't selected if spread isn't running.
...
This commit was SVN r23101.
2010-05-04 23:44:00 +00:00
Ralph Castain
43005b88e0
Catch one more spot...thanks to Sam for reporting it
...
This commit was SVN r23094.
2010-05-04 17:21:15 +00:00
Ralph Castain
f4ae2885e2
Add new error constant
...
This commit was SVN r23090.
2010-05-04 13:44:33 +00:00
Ralph Castain
cd569f8a79
Restore the global restart capability
...
This commit was SVN r23089.
2010-05-04 02:40:29 +00:00
Ralph Castain
3ca0b4138b
Let the nidmap functions update a new orte_process_info field as to the number of daemons in the system
...
This commit was SVN r23088.
2010-05-04 02:40:09 +00:00
Ralph Castain
8c5f442ee0
Fix some bugs in the spread rmcast component
...
This commit was SVN r23086.
2010-05-04 02:38:37 +00:00
Ralph Castain
8e7faf9119
Add a new test for the db framework, fix some minor bugs in the daemon module
...
This commit was SVN r23085.
2010-05-04 02:38:11 +00:00
Ralph Castain
9dfb5c7c62
Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set
...
This commit was SVN r23082.
2010-05-03 04:11:03 +00:00
Ralph Castain
5103ffead6
Continue work on local recovery for orteds
...
This commit was SVN r23080.
2010-05-03 04:08:13 +00:00
Ralph Castain
bcff0d6301
Some minor cleanup in the rmcast framework, ensure that a default multicast group is always defined for each app
...
This commit was SVN r23079.
2010-05-03 04:07:14 +00:00
Ralph Castain
f994a7edf4
Add recovery data to the jobdat object
...
This commit was SVN r23078.
2010-05-03 04:06:13 +00:00
Ralph Castain
323224b84b
Allow the restart data to be retrieved from a job's app->env, or default to that given to orterun
...
This commit was SVN r23077.
2010-05-03 04:05:19 +00:00
Ralph Castain
3f262bf0b6
Add a new reliable multicast component based on the "spread" library
...
Thanks to Srini Nariangadu (Cisco) for the contribution!
This commit was SVN r23076.
2010-05-02 17:29:41 +00:00
Ralph Castain
c93af95351
Since there is no defined behavior for the case where all application procs exit normally, but some or all have non-zero returns, just output a warning telling the user how many procs meet that criterion. Let the return code of mpirun in that scenario reflect any errors in OMPI/ORTE itself.
...
Clearly a temporary solution until a defined behavior can be established.
This commit was SVN r23075.
2010-04-30 19:01:10 +00:00
Ralph Castain
c62418d76d
Add a new test that checks behavior when we call exit with a non-zero return code after calling finalize - don't ask why.
...
Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status.
This commit was SVN r23071.
2010-04-29 19:58:44 +00:00
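The exit-code rule described in r23071 (take the first non-zero exit status, i.e., the one from the lowest rank) can be sketched as below. This is only an illustration under stated assumptions: the function name and the rank-indexed array of exit codes are hypothetical and are not taken from the actual check_complete code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not ORTE code): exit_codes[] is assumed to hold
 * the exit status of each proc, indexed by rank. Returns the first
 * non-zero status found when scanning from rank 0 upward, or 0 if every
 * proc returned zero. */
int first_nonzero_exit_status(const int *exit_codes, size_t nprocs)
{
    assert(exit_codes != NULL || nprocs == 0);
    for (size_t rank = 0; rank < nprocs; rank++) {
        if (exit_codes[rank] != 0) {
            return exit_codes[rank];   /* lowest rank with a non-zero status */
        }
    }
    return 0;
}
```

Scanning in rank order makes the result deterministic even when several procs exit with different non-zero codes.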
Ralph Castain
5689bd6d30
Remove debug
...
This commit was SVN r23062.
2010-04-28 22:14:23 +00:00
Ralph Castain
319758e3e0
Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params:
...
1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis
2. orte_max_local_restarts - default max number of local restarts, can be overridden
3. orte_max_global_restarts - default max number of relocates, can be overridden
Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code.
This commit was SVN r23057.
2010-04-28 04:06:57 +00:00
Ralph Castain
67568e7dec
Fix a warning
...
This commit was SVN r23049.
2010-04-27 15:59:24 +00:00
Ralph Castain
401fb22847
Fix the pack/unpack of process state values by getting the actual packing size to match the new value size.
...
Improve the debug output by printing the string
This commit was SVN r23048.
2010-04-27 15:58:04 +00:00
Ralph Castain
efc7d3e2f6
Update the LSF PLM module to ensure that the path is correctly passed.
...
Many thanks to Teng Lin for the error report - and the patch!
This commit was SVN r23047.
2010-04-27 03:46:04 +00:00
Ralph Castain
3434296836
Ensure we don't have a trailing separator on the end of our tmpdir as (a) it really looks weird, and (b) some exotic systems interpret that as indicating the rest of the path is to be treated as absolute. Makes for very strange and interesting behavior...
...
This commit was SVN r23046.
2010-04-27 03:40:44 +00:00
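A minimal sketch of the trailing-separator cleanup described in r23046, assuming a POSIX-style '/' separator and a writable, NUL-terminated path buffer. The function name is hypothetical and this is not the code from the commit.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not the committed code): strip trailing '/'
 * characters from path in place. A path consisting only of separators
 * collapses to a single "/" so the root directory stays valid. */
void strip_trailing_separators(char *path)
{
    assert(path != NULL);
    size_t len = strlen(path);
    while (len > 1 && path[len - 1] == '/') {
        path[--len] = '\0';
    }
}
```

A real implementation would presumably use the platform's path separator rather than a hard-coded '/'.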
Ralph Castain
55889934d8
After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon.
...
Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly.
This commit was SVN r23045.
2010-04-27 03:39:32 +00:00
Ralph Castain
b9893aacc5
Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here:
1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified.
2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set.
This support must be enabled by configuring --enable-sensors.
ompi_info and orte-info have been updated to include the new framework.
Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and the various ERRMGR framework globals are now organized into a single struct, as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized.
Implementation continues.
This commit was SVN r23043.
2010-04-26 22:15:57 +00:00
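The file activity check described in r23043 (notify when size, access time, and modification time all fail to change over a given number of samples) might look roughly like the sketch below. The struct, the function name, and the limit parameter are hypothetical, not the sensor module's actual interface; in practice the sampled values would likely come from a stat() call on the monitored file.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical per-file tracking state (not the ORTE sensor types).
 * Zero-initialize before the first sample. */
typedef struct {
    off_t  size;             /* size seen at the previous sample */
    time_t atime;            /* access time at the previous sample */
    time_t mtime;            /* modification time at the previous sample */
    int    unchanged_ticks;  /* consecutive samples with no change */
    bool   primed;           /* false until the first sample is stored */
} file_watch_t;

/* Feed one sample into the tracker. Returns true once the file has
 * shown no change in size, atime, or mtime for `limit` consecutive
 * samples; a caller could treat that as the point to notify the errmgr. */
bool file_watch_sample(file_watch_t *w, off_t size, time_t atime,
                       time_t mtime, int limit)
{
    assert(w != NULL && limit > 0);
    if (w->primed && size == w->size && atime == w->atime &&
        mtime == w->mtime) {
        w->unchanged_ticks++;
    } else {
        w->unchanged_ticks = 0;   /* any change resets the counter */
    }
    w->size = size;
    w->atime = atime;
    w->mtime = mtime;
    w->primed = true;
    return w->unchanged_ticks >= limit;
}
```

Keeping the comparison separate from the stat() call makes the stall logic easy to exercise with synthetic samples.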
Ralph Castain
43a89bbace
Extend the process and job states by adding values for exceeding sensor bounds. This changes the job state field to 32-bit to also provide room for future expansion.
...
This commit was SVN r23036.
2010-04-26 12:36:40 +00:00
Ralph Castain
e3164d2ac1
For the autogen-challenged (i.e., Jeff), create a new ORTE constant that tells the user that a required module was not found. Update the errmgr select function to output the error if no module is found.
...
This commit was SVN r23032.
2010-04-24 01:39:26 +00:00
Ralph Castain
b62ebb8b94
Enable these to build so others can more easily begin implementation
...
This commit was SVN r23031.
2010-04-23 22:46:33 +00:00
Ralph Castain
9c69900bbc
Ensure we get the jdata object when updating proc state so we don't segv if report-progress is set. Thanks to Sam for reporting it!
...
This commit was SVN r23029.
2010-04-23 15:14:21 +00:00
George Bosilca
95394456ca
No Windows EOF.
...
This commit was SVN r23024.
2010-04-23 05:51:29 +00:00
Ralph Castain
efbb5c9b7c
Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base:
...
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.
* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.
* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.
* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons
* update the routed modules to reflect that process state is updated in the errmgr
* ensure that the orted's open the errmgr and select their appropriate module
* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place
* define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired.
* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios
This commit was SVN r23023.
2010-04-23 04:44:41 +00:00
Shiqing Fan
d1e66bdd01
Use variables instead of hard-coded compiler flags, in order to support various C/C++ compilers on Windows.
...
This commit was SVN r23016.
2010-04-21 12:45:00 +00:00
Ralph Castain
86228aee38
Provide two new opal paffinity utilities for printing a hex representation of the cpu set and parsing that string back into a cpu set on the other end. Also add a new MCA param for passing the cpu set applied to a process during launch down to that process so it can know what we attempted to do.
...
All to be used in some new MPI extensions provided by Jeff so that users can easily query their binding situation.
This commit was SVN r22998.
2010-04-19 22:16:35 +00:00
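As an illustration of the print/parse round trip described in r22998, the sketch below formats a CPU set as a hex string and parses it back. The cpuset_t type, the 128-CPU size, the comma-separated word layout, and both function names are assumptions made for this example; the actual opal paffinity utilities and their string format may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CPUSET_WORDS 4   /* hypothetical: 4 x 32 bits = 128 CPUs */

typedef struct { uint32_t bits[CPUSET_WORDS]; } cpuset_t;

/* Write the set as comma-separated 8-digit hex words, most significant
 * word first. Returns 0 on success, -1 if buf is too small. */
int cpuset_to_hex(const cpuset_t *set, char *buf, size_t buflen)
{
    assert(set != NULL && buf != NULL);
    size_t used = 0;
    for (int i = CPUSET_WORDS - 1; i >= 0; i--) {
        int n = snprintf(buf + used, buflen - used, "%08x%s",
                         (unsigned int)set->bits[i], i > 0 ? "," : "");
        if (n < 0 || (size_t)n >= buflen - used) {
            return -1;   /* output error or truncated */
        }
        used += (size_t)n;
    }
    return 0;
}

/* Parse a string produced by cpuset_to_hex. Returns 0 on success and
 * -1 on malformed input; *set is only written on success. */
int cpuset_from_hex(const char *str, cpuset_t *set)
{
    assert(str != NULL && set != NULL);
    cpuset_t tmp;
    const char *p = str;
    for (int i = CPUSET_WORDS - 1; i >= 0; i--) {
        char *end;
        unsigned long v = strtoul(p, &end, 16);
        if (end == p || v > 0xffffffffUL) {
            return -1;   /* no digits, or value too wide for one word */
        }
        tmp.bits[i] = (uint32_t)v;
        if (i > 0) {
            if (*end != ',') return -1;
            p = end + 1;
        } else if (*end != '\0') {
            return -1;   /* trailing garbage after the last word */
        }
    }
    *set = tmp;
    return 0;
}
```

A string form like this is convenient for passing through an MCA param or environment variable, since it needs no binary-safe transport.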
Ralph Castain
4d06125a33
Establish a method by which a process knows if it has been bound by mpirun. This helps resolve a problem where a process gets "bound" to all available resources, which looks to the opal paffinity system as "not bound". This can cause mpi_init to attempt to "bind" the process itself, causing unintended behavior.
...
This commit was SVN r22985.
2010-04-17 01:58:26 +00:00
Ralph Castain
41428e6b61
Issue a warning if a requested binding operation results in processes being bound to all available processors, which is the equivalent of not being bound at all.
...
See the following email thread for further details:
http://www.open-mpi.org/community/lists/devel/2010/04/7745.php
This commit was SVN r22984.
2010-04-17 01:02:41 +00:00
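The condition that r22984 warns about can be expressed as a simple mask comparison. The sketch below assumes at most 64 processors represented as bits in a uint64_t; the function name and the representation are hypothetical and not the paffinity API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check (not the paffinity API): a binding only restricts
 * a process if it covers at least one available processor but not all
 * of them. Being bound to every available processor is equivalent to
 * not being bound at all. */
bool binding_is_effective(uint64_t bound_mask, uint64_t available_mask)
{
    uint64_t usable = bound_mask & available_mask;
    return usable != 0 && usable != available_mask;
}
```

A launcher could issue the warning whenever a binding was requested and this check returns false.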
Ralph Castain
6c379fed2e
IOF components should not assume they will be selected when queried - thus, they should not perform init functions until after selection. Create init/finalize entry points for that purpose, and have select init the module after it has been selected.
...
This commit was SVN r22982.
2010-04-16 18:51:27 +00:00
Ralph Castain
2ecc9fc2b3
Additional diag output
...
This commit was SVN r22981.
2010-04-16 14:48:37 +00:00
Ralph Castain
4308922f59
Ensure that any application-specific selection of ess module doesn't get overridden by what is given to the orted or orterun
...
Cleanup tool name determination for CM
This commit was SVN r22980.
2010-04-15 18:10:50 +00:00
Ralph Castain
c3e4c40cdf
Add another multicast tag for updating state
...
This commit was SVN r22979.
2010-04-15 18:08:53 +00:00
Ralph Castain
eeccf2f15c
Some minor changes to support vm's
...
This commit was SVN r22975.
2010-04-14 01:20:43 +00:00
Ralph Castain
854dc12fc0
Add a tag for multicasting IOF messages
...
This commit was SVN r22972.
2010-04-13 22:51:26 +00:00
Ralph Castain
81329c637e
Indentation corrections
...
This commit was SVN r22971.
2010-04-13 17:47:34 +00:00
Ralph Castain
8da781af84
Continue developing support for distributed virtual machines - minor changes to ensure correct jobid gets used and that dvm's can communicate with tools
...
This commit was SVN r22958.
2010-04-12 22:33:09 +00:00
Shiqing Fan
96b20a29b5
An easy solution to make singleton work on Windows.
...
This commit was SVN r22952.
2010-04-10 16:30:59 +00:00
Ralph Castain
d3ed4e68b7
Utilize an unused mapping policy bit to define a policy that uses only existing alive daemons to support virtual machines and restarting processes on already-active nodes
...
This commit was SVN r22951.
2010-04-10 05:02:47 +00:00
Ralph Castain
4f8279df3d
Enable substitution of the communication calls in the orted when sending messages back to the HNP by creating a function for this purpose and saving the pointer to it in orte_odls_base. Higher level libraries can then override the default function to use their own method.
...
This commit was SVN r22950.
2010-04-09 18:50:10 +00:00
Ralph Castain
c32f046d7c
Tiny cleanup - when the user kills us with a ctrl-c, there really isn't a need to tell him "your procs died and we don't know why". Just shaddup and die.
...
This commit was SVN r22949.
2010-04-09 18:47:35 +00:00