Ralph Castain
94f046e7f3
Fix an "oops" where we missed some instances of a name change. Thanks to Greg Koenig for spotting it!
...
This commit was SVN r23545.
2010-08-03 16:36:08 +00:00
Josh Hursey
01fbb729e3
Need to get the nidmap so that the apps have something to look at for wireup.
...
Without this we get errors like the following since the nidmap is empty in the apps:
[[13094,1],1] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 234
[[13094,1],1] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 281
This commit was SVN r23536.
2010-07-31 16:25:13 +00:00
Josh Hursey
ba7e94dd89
Some relatively minor C/R related cleanup
...
* Fix a configure warning for checking --enable-ft-thread
* In the hnp and orted ErrMgr components, check whether other components have already recovered this process before trying to recover it again.
* Fix 'npernode' for restarting using the resilient rmaps component
* Export ompi_info_set so that internal functionality can use it.
This commit was SVN r23535.
2010-07-30 18:59:34 +00:00
Ralph Castain
d8ec83f939
Remove an unneeded tag
...
This commit was SVN r23533.
2010-07-29 02:13:06 +00:00
Ralph Castain
94d4887452
Cleanup some race conditions at job start
...
This commit was SVN r23515.
2010-07-27 18:24:11 +00:00
Ralph Castain
97cfbdfc8c
Add debug
...
This commit was SVN r23514.
2010-07-27 16:55:16 +00:00
Ralph Castain
cfbfbb75a2
Don't abort if a race condition causes a nid not to be found
...
This commit was SVN r23513.
2010-07-27 16:20:58 +00:00
Ralph Castain
06fe2c4c20
Add debug
...
This commit was SVN r23512.
2010-07-27 16:20:29 +00:00
Ralph Castain
e16ec40283
Check the correct envar. Thanks to Philippe for pointing out the error!
...
This commit was SVN r23499.
2010-07-27 03:49:03 +00:00
Ralph Castain
0c201486e4
Allow for multiple channels for the same channel "name" - could be both input and output channels
...
This commit was SVN r23498.
2010-07-27 01:38:39 +00:00
Ralph Castain
6158cab2e1
Add missing header
...
This commit was SVN r23497.
2010-07-27 01:37:41 +00:00
Ralph Castain
1ba8bbe1a9
Don't require the existence of a multicast interface if --enable-multicast wasn't specified.
...
This commit was SVN r23494.
2010-07-26 15:09:57 +00:00
Ralph Castain
e858b029aa
Re-establish support for IBM's POE environment by resurrecting the poe launcher from the v1.2 series, updated for the current trunk. Modify the hnp errmgr module to support batch-operated jobs. Update the new ess generic module to support a wider range of mapping options.
...
This commit was SVN r23493.
2010-07-24 23:35:20 +00:00
Ralph Castain
ff2d573f7e
Adjust some default values, and ensure we don't start sending too soon
...
This commit was SVN r23492.
2010-07-23 19:37:16 +00:00
Ralph Castain
140e427a79
Ensure that wildcard recvs go to the end of the matching list so that recvs for specific tags take precedence.
...
Ensure we don't try to send tcp mcast messages to procs that haven't reported back yet
This commit was SVN r23491.
2010-07-23 19:31:34 +00:00
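The ordering rule described in r23491 can be illustrated with a small sketch. This is not the ORTE code: the names `WILDCARD`, `post_recv`, and `match` are hypothetical, and Python is used only for brevity. It assumes a matching list that is searched front to back, so placing wildcard receives last lets a receive posted for a specific tag win.

```python
WILDCARD = None  # hypothetical marker meaning "accept any tag"


def post_recv(posted, tag, handler):
    """Add a receive to the matching list.

    Wildcard receives are appended at the end; specific-tag receives
    are inserted ahead of the first wildcard entry.
    """
    entry = (tag, handler)
    if tag is WILDCARD:
        posted.append(entry)
        return
    for i, (existing_tag, _) in enumerate(posted):
        if existing_tag is WILDCARD:
            posted.insert(i, entry)
            return
    posted.append(entry)


def match(posted, tag):
    """Return the handler of the first posted receive that accepts `tag`."""
    for posted_tag, handler in posted:
        if posted_tag is WILDCARD or posted_tag == tag:
            return handler
    return None


posted = []
post_recv(posted, WILDCARD, "wildcard")  # posted first, still ends up last
post_recv(posted, 7, "tag-7")
```

With this ordering, a message carrying tag 7 is delivered to the tag-specific receive even though the wildcard receive was posted earlier; any other tag falls through to the wildcard.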
Ralph Castain
95a908cb27
Provide run-time control over heartbeat
...
This commit was SVN r23490.
2010-07-23 17:11:12 +00:00
Ralph Castain
dfdb1eca0c
Remove windows dist file requirement
...
This commit was SVN r23480.
2010-07-23 01:23:15 +00:00
Ralph Castain
67027cfce4
Add a new "generic" ess module that supports direct launched MPI jobs so long as certain minimum envars are provided, regardless of launch environment.
...
This commit was SVN r23478.
2010-07-22 21:51:43 +00:00
Ralph Castain
e7719f0aa4
Update platform files, adjust sensor heartbeat module selection rules
...
This commit was SVN r23477.
2010-07-22 21:50:46 +00:00
Jeff Squyres
64cb8f5d7f
Another round of man page cleanups from Debian maintainer Manuel
...
Prinz. Many thanks!
This commit was SVN r23445.
2010-07-20 14:07:18 +00:00
Ralph Castain
f3e13b9766
Improve the efficiency by making the check for uniqueness of incoming hnp contact info much faster by including the hnp_uri in the job_family tracker object. Replace the global buffer storage with a quick routine to build the buffer from the jobfams array
...
This commit was SVN r23443.
2010-07-20 08:30:47 +00:00
Ralph Castain
f85e69b64b
Continue enabling connect_accept across large numbers of independent jobs by replacing the hash tables with pointer_arrays to store routes to remote hnps to gain flexibility we'll need in the future.
...
This commit was SVN r23439.
2010-07-20 04:47:31 +00:00
Ralph Castain
248320b91a
Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started.
...
Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them.
Add a test provided by Philippe that tests this ability.
This commit was SVN r23438.
2010-07-20 04:22:45 +00:00
Ralph Castain
72525a5850
Allow the passing of NULL to orte_plm.kill_local_procs to match what we allow for the equivalent orte_odls call
...
This commit was SVN r23436.
2010-07-20 04:05:41 +00:00
Ralph Castain
ad5eaee4c6
Protect against NULL and provide additional resource check/error report
...
This commit was SVN r23432.
2010-07-19 18:33:32 +00:00
Ralph Castain
0355da6335
Cleanup pointer array addressing and protect against NULL
...
This commit was SVN r23431.
2010-07-19 18:30:04 +00:00
Ralph Castain
12cd07c9a9
Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases.
...
The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang.
This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is:
* pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place
* modified the errmgr to directly call the new routines when termination is detected
* removed the grpcomm.onesided_barrier and its associated RML tag
* added a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree
* used connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero
Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial.
This commit was SVN r23429.
2010-07-17 21:03:27 +00:00
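The termination scheme described in r23429 can be sketched roughly as follows. This is a toy model and not the routed framework's actual interface: the `Daemon` class, its fields, and the peer names are hypothetical. It assumes only what the commit message states, namely that a lost route removes the corresponding child and that a daemon terminates once `num_routes` reports zero.

```python
class Daemon:
    """Toy model of a daemon tracking routes to its children."""

    def __init__(self, children):
        self.children = set(children)  # hypothetical list of dependent routes
        self.terminated = False

    def num_routes(self):
        """Report the number of dependent routes still alive."""
        return len(self.children)

    def route_lost(self, peer):
        """Drop the route to `peer`; terminate when no routes remain."""
        self.children.discard(peer)
        if self.num_routes() == 0:
            self.terminated = True


d = Daemon(["orted-1", "orted-2"])
d.route_lost("orted-1")
still_running = not d.terminated  # one child route is still open
d.route_lost("orted-2")           # last route closed: the daemon can exit
```

The point of the design is that termination is driven by connection state alone, so no barrier message has to arrive for the tree to roll up.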
Ralph Castain
a99e8cf132
Re-issue the read event so that multiple debugger attachments can occur
...
This commit was SVN r23421.
2010-07-15 15:35:26 +00:00
Jeff Squyres
c98729694a
Fix a compile error in the xgrid plm. This does nothing to make it
...
work; it just now compiles on OS X.
This commit was SVN r23387.
2010-07-13 11:56:20 +00:00
Ralph Castain
570d19106b
Allow singletons to use ompi-server for rendezvous via pubsub as well as comm_spawn without starting their own local daemons
...
This commit was SVN r23384.
2010-07-13 06:33:07 +00:00
Ralph Castain
8bb0c16c2f
Add new tag
...
This commit was SVN r23382.
2010-07-13 06:29:13 +00:00
Ralph Castain
0b4081b162
The --output-filename option currently mixes the output from all processes with the same rank, but different jobids. Thus, the output from comm_spawn'd processes gets intermingled with their parents and each other.
...
Adding the jobid to the output filename solves the problem.
Thanks to Jody for pointing this out!
This commit was SVN r23380.
2010-07-12 21:43:49 +00:00
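A minimal sketch of why adding the jobid in r23380 removes the collision. The `output_filename` helper and the `<base>.<jobid>.<rank>` layout are assumptions made for illustration; the commit message does not state the exact filename format.

```python
def output_filename(base, jobid, rank):
    """Build a per-process output filename.

    Hypothetical layout: <base>.<jobid>.<rank>. Without the jobid,
    rank 0 of a parent job and rank 0 of a comm_spawn'd child job
    would map to the same file.
    """
    return f"{base}.{jobid}.{rank}"


parent = output_filename("out", 1, 0)  # rank 0 of the parent job
child = output_filename("out", 2, 0)   # rank 0 of a spawned child job
```

Because the two processes share a rank but not a jobid, they now write to distinct files.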
Ralph Castain
4c5ea3d1ef
Allow the ALPS allocator to run if the BASIL_RESERVATION_ID has been set. Update show_help messages and comments to reflect this new scenario.
...
Thanks to Jerome Soumagne for the patch!
This commit was SVN r23374.
2010-07-12 15:42:25 +00:00
Ralph Castain
2babebf9c3
Revert some of a prior commit - there is no need for another flag to indicate one-sided termination of the orteds.
...
This commit was SVN r23372.
2010-07-10 03:33:01 +00:00
Ralph Castain
da61b69b15
Ensure we don't incorrectly return a non-zero exit code when normally terminating a slurm job.
...
Slurm, of course, must always be different...
This commit was SVN r23371.
2010-07-09 19:14:10 +00:00
Ralph Castain
5d2233c950
Add some new tags
...
This commit was SVN r23369.
2010-07-09 17:51:28 +00:00
Ralph Castain
2193ace464
Ensure the debugger symbols are loaded into orterun prior to orte_init so they are found by attaching debuggers.
...
This commit was SVN r23368.
2010-07-09 17:51:00 +00:00
Rolf vandeVaart
cc9b768fdb
Fix typo and missing header - needed by Solaris
...
This commit was SVN r23367.
2010-07-09 14:33:13 +00:00
Shiqing Fan
c51c262e67
Relevant Windows fixes for r23360.
...
This commit was SVN r23363.
The following SVN revision numbers were found above:
r23360 --> open-mpi/ompi@31295e8dc2
2010-07-07 16:58:16 +00:00
Ralph Castain
62a8b73f1a
Correctly handle output sent to the group input channel
...
This commit was SVN r23362.
2010-07-07 14:17:48 +00:00
Ralph Castain
31295e8dc2
As discussed on today's telecon, reorganize the debugger attachment code in orte to better support efforts within the tool community aimed at exploring alternative methods. Move the debugger attachment code from the orterun directory to a new debugger framework. Organize the existing standard support code into an "mpir" component. Organize the current extensions for co-spawning debugger daemons into a separate "mpirx" component.
...
Since the MPIR symbols are now included in the ORTE library, remove duplicate declarations in OMPI and replace them with extern references to their ORTE instantiations.
This commit was SVN r23360.
2010-07-06 23:35:42 +00:00
Ralph Castain
98e7bc94f5
This belongs more properly in the ORCM repo now that we have MCA capability over there
...
This commit was SVN r23346.
2010-07-05 18:25:03 +00:00
Ralph Castain
dac1b9976e
Make the default group channels bidirectional
...
This commit was SVN r23345.
2010-07-05 14:45:57 +00:00
Jeff Squyres
5bebdb97fa
Need these header files for NetBSD. Thanks for the heads-up from
...
Aleksej Saushev.
This commit was SVN r23343.
2010-07-02 17:38:57 +00:00
Ralph Castain
ee4564c13b
Add some useful debug
...
This commit was SVN r23339.
2010-07-02 03:35:47 +00:00
Ralph Castain
f3d90dfb8d
Fully restore fault recovery, both at the individual process and daemon level.
...
NOTE: MPI fault recovery remains unavailable pending merge from Josh. This only covers ORTE-level processes.
This commit was SVN r23335.
2010-07-01 19:45:43 +00:00
Ralph Castain
7190415977
Fix JEFF's mistake - we cannot use orte_show_help if execv fails because we already closed all the file descriptors!
...
This commit was SVN r23334.
2010-07-01 19:41:26 +00:00
Ralph Castain
510ade9503
Do not use nodes that are flagged as down or do-not-use for this map. Modify error output to reflect possible reasons no nodes would be available
...
This commit was SVN r23333.
2010-07-01 19:39:31 +00:00
Ralph Castain
81a65f2c67
Define a new node state
...
This commit was SVN r23332.
2010-07-01 19:38:23 +00:00
Ralph Castain
f5548b8e0f
Remove a potential locking conflict, and let emacs go ahead and reformat the function (sigh)
...
This commit was SVN r23331.
2010-07-01 19:37:53 +00:00