Commit graph

2160 commits

Author SHA1 Message Date
Ralph Castain
f85e69b64b Continue enabling connect_accept across large numbers of independent jobs by replacing the hash tables with pointer_arrays to store routes to remote HNPs, which gives us flexibility we'll need in the future.
This commit was SVN r23439.
2010-07-20 04:47:31 +00:00
Ralph Castain
248320b91a Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started.
Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them.

Add a test provided by Philippe that tests this ability.

This commit was SVN r23438.
2010-07-20 04:22:45 +00:00
Ralph Castain
72525a5850 Allow the passing of NULL to orte_plm.kill_local_procs to match what we allow for the equivalent orte_odls call
This commit was SVN r23436.
2010-07-20 04:05:41 +00:00
Ralph Castain
ad5eaee4c6 Protect against NULL and provide additional resource check/error report
This commit was SVN r23432.
2010-07-19 18:33:32 +00:00
Ralph Castain
0355da6335 Cleanup pointer array addressing and protect against NULL
This commit was SVN r23431.
2010-07-19 18:30:04 +00:00
Ralph Castain
12cd07c9a9 Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases.
The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang.

This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is:

* pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place

* modified the errmgr to directly call the new routines when termination is detected

* removed the grpcomm.onesided_barrier and its associated RML tag

* added a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree

* used connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero

Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial.

This commit was SVN r23429.
2010-07-17 21:03:27 +00:00
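The termination rule described in 12cd07c9a9 above (drop a child when its route is lost, terminate once num_routes reports zero) can be sketched roughly as follows. This is only an illustration in Python, not the ORTE implementation: the `RoutingNode` class, its fields, and the daemon names are hypothetical, and the leaf check mentioned in the commit message is simplified to set membership.

```python
class RoutingNode:
    """Hypothetical stand-in for one daemon's view of its routing-tree children."""

    def __init__(self, name, children):
        self.name = name
        self.children = set(children)  # direct children in the routing tree
        self.terminated = False

    def num_routes(self):
        """Report how many dependent routes are still alive."""
        return len(self.children)

    def route_lost(self, peer):
        """Drop the lost route; terminate once no dependent routes remain."""
        self.children.discard(peer)  # no-op if peer was not one of our children
        if self.num_routes() == 0:
            self.terminated = True
        return self.terminated


# The HNP starts with two child daemons and only terminates after both are gone.
hnp = RoutingNode("hnp", ["daemon1", "daemon2"])
first = hnp.route_lost("daemon1")
second = hnp.route_lost("daemon2")
```

The point of the design, as the commit message describes it, is that termination follows from connection loss alone, so nothing has to jump back through the event library to finish the shutdown.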
Ralph Castain
a99e8cf132 Re-issue the read event so that multiple debugger attachments can occur
This commit was SVN r23421.
2010-07-15 15:35:26 +00:00
Jeff Squyres
c98729694a Fix a compile error in the xgrid plm. This does nothing to make it
work; it just compiles on OS X now.

This commit was SVN r23387.
2010-07-13 11:56:20 +00:00
Ralph Castain
570d19106b Allow singletons to use ompi-server for rendezvous via pubsub as well as comm_spawn without starting their own local daemons
This commit was SVN r23384.
2010-07-13 06:33:07 +00:00
Ralph Castain
8bb0c16c2f Add new tag
This commit was SVN r23382.
2010-07-13 06:29:13 +00:00
Ralph Castain
0b4081b162 The --output-filename option currently mixes the output from all processes with the same rank, but different jobids. Thus, the output from comm_spawn'd processes gets intermingled with their parents and each other.
Adding the jobid to the output filename solves the problem.

Thanks to Jody for pointing this out!

This commit was SVN r23380.
2010-07-12 21:43:49 +00:00
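The filename collision fixed in 0b4081b162 above is easy to see in a small sketch. The commit message does not give the actual naming format, so the `base.jobid.rank` layout and the `output_filename` helper below are assumptions made purely for illustration.

```python
def output_filename(base, jobid, rank):
    """Build a per-process output filename that includes the jobid.

    Without the jobid, rank 0 of a parent job and rank 0 of a
    comm_spawn'd child job would both map to the same file.
    """
    return f"{base}.{jobid}.{rank}"


# Same rank, different jobs: the names no longer collide.
parent = output_filename("out", 1, 0)
child = output_filename("out", 2, 0)
```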
Ralph Castain
4c5ea3d1ef Allow the ALPS allocator to run if the BASIL_RESERVATION_ID has been set. Update show_help messages and comments to reflect this new scenario.
Thanks to Jerome Soumagne for the patch!

This commit was SVN r23374.
2010-07-12 15:42:25 +00:00
Ralph Castain
2babebf9c3 Revert some of a prior commit - there is no need for another flag to indicate one-sided termination of the orteds.
This commit was SVN r23372.
2010-07-10 03:33:01 +00:00
Ralph Castain
da61b69b15 Ensure we don't incorrectly return a non-zero exit code when normally terminating a slurm job.
Slurm, of course, must always be different...

This commit was SVN r23371.
2010-07-09 19:14:10 +00:00
Ralph Castain
5d2233c950 Add some new tags
This commit was SVN r23369.
2010-07-09 17:51:28 +00:00
Ralph Castain
2193ace464 Ensure the debugger symbols are loaded into orterun prior to orte_init so they are found by attaching debuggers.
This commit was SVN r23368.
2010-07-09 17:51:00 +00:00
Rolf vandeVaart
cc9b768fdb Fix typo and missing header - needed by Solaris
This commit was SVN r23367.
2010-07-09 14:33:13 +00:00
Shiqing Fan
c51c262e67 Relevant Windows fixes for r23360.
This commit was SVN r23363.

The following SVN revision numbers were found above:
  r23360 --> open-mpi/ompi@31295e8dc2
2010-07-07 16:58:16 +00:00
Ralph Castain
62a8b73f1a Correctly handle output sent to the group input channel
This commit was SVN r23362.
2010-07-07 14:17:48 +00:00
Ralph Castain
31295e8dc2 As discussed on today's telecon, reorganize the debugger attachment code in orte to better support efforts within the tool community aimed at exploring alternative methods. Move the debugger attachment code from the orterun directory to a new debugger framework. Organize the existing standard support code into an "mpir" component. Organize the current extensions for co-spawning debugger daemons into a separate "mpirx" component.
Since the MPIR symbols are now included in the ORTE library, remove duplicate declarations in OMPI and replace them with extern references to their ORTE instantiations.

This commit was SVN r23360.
2010-07-06 23:35:42 +00:00
Ralph Castain
98e7bc94f5 This belongs more properly in the ORCM repo now that we have MCA capability over there
This commit was SVN r23346.
2010-07-05 18:25:03 +00:00
Ralph Castain
dac1b9976e Make the default group channels bidirectional
This commit was SVN r23345.
2010-07-05 14:45:57 +00:00
Jeff Squyres
5bebdb97fa Need these header files for NetBSD. Thanks for the heads-up from
Aleksej Saushev.

This commit was SVN r23343.
2010-07-02 17:38:57 +00:00
Ralph Castain
ee4564c13b Add some useful debug output
This commit was SVN r23339.
2010-07-02 03:35:47 +00:00
Ralph Castain
f3d90dfb8d Fully restore fault recovery, both at the individual process and daemon level.
NOTE: MPI fault recovery remains unavailable pending merge from Josh. This only covers ORTE-level processes.

This commit was SVN r23335.
2010-07-01 19:45:43 +00:00
Ralph Castain
7190415977 Fix JEFF's mistake - we cannot use orte_show_help if execv fails because we already closed all the file descriptors!
This commit was SVN r23334.
2010-07-01 19:41:26 +00:00
Ralph Castain
510ade9503 Do not use nodes that are flagged as down or do-not-use for this map. Modify error output to reflect possible reasons no nodes would be available
This commit was SVN r23333.
2010-07-01 19:39:31 +00:00
Ralph Castain
81a65f2c67 Define a new node state
This commit was SVN r23332.
2010-07-01 19:38:23 +00:00
Ralph Castain
f5548b8e0f Remove a potential locking conflict, and let emacs go ahead and reformat the function (sigh)
This commit was SVN r23331.
2010-07-01 19:37:53 +00:00
Ralph Castain
d463aec2f6 Don't try to send to dead daemons, keep accounting straight so we don't hang
This commit was SVN r23330.
2010-07-01 19:37:02 +00:00
Ralph Castain
dd85689560 Cleanup pointer array addressing
This commit was SVN r23329.
2010-07-01 19:33:10 +00:00
Ralph Castain
26fbae447e Don't try to forward input when we already ordered shutdown. Check return codes on sends
This commit was SVN r23328.
2010-07-01 19:32:08 +00:00
Ralph Castain
3237b9ec87 Print a nice error message when a daemon fails, and exit with a non-zero status
This commit was SVN r23314.
2010-06-28 16:38:54 +00:00
Ralph Castain
a1ea6bc130 Ignore debugger daemon termination status - we don't care how they died.
This commit was SVN r23306.
2010-06-26 03:08:50 +00:00
Ralph Castain
099c3aad97 Fix a major foopah that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0.
This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all.

Thanks to Dong Ahn (LLNL) for catching this problem!

Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm.

This commit was SVN r23300.
2010-06-24 05:13:53 +00:00
Shiqing Fan
681df0089b Add a few new files into the tarball.
This commit was SVN r23297.
2010-06-22 16:45:56 +00:00
Ralph Castain
8b2a682fba Return a silent error when -do-not-launch is given
This commit was SVN r23291.
2010-06-22 01:06:10 +00:00
Shiqing Fan
2e5e9f0a03 Fix a wrong Windows path in hpn_contack, which caused problems when looking up the session directories. Add two more ess modules for Windows.
This commit was SVN r23286.
2010-06-21 09:47:33 +00:00
Ethan Mallove
fc37e408c2 Avoid SEGV in case rsh/ssh is not in PATH (refs trac:1490)
This commit was SVN r23278.

The following Trac tickets were found above:
  Ticket 1490 --> https://svn.open-mpi.org/trac/ompi/ticket/1490
2010-06-17 14:58:09 +00:00
Ralph Castain
1e90b91b84 Unset envars we set during initialization so we leave environ intact after orte_finalize.
Thanks to Damien Gunter for pointing it out.

This commit was SVN r23277.
2010-06-17 13:42:21 +00:00
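The cleanup idea in 1e90b91b84 above (remember what initialization put into the environment and undo it at finalize) might look like the sketch below. The `EnvTracker` class and the variable name are hypothetical. The sketch also restores any value it overwrote, which goes beyond what the commit message states.

```python
import os


class EnvTracker:
    """Hypothetical helper that records environment changes made during init."""

    def __init__(self):
        self._previous = {}  # variable name -> prior value, or None if it was unset

    def setenv(self, name, value):
        """Set a variable, remembering its prior state the first time we touch it."""
        if name not in self._previous:
            self._previous[name] = os.environ.get(name)
        os.environ[name] = value

    def finalize(self):
        """Undo every change so the environment is left as we found it."""
        for name, old in self._previous.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old
        self._previous.clear()


tracker = EnvTracker()
tracker.setenv("GR_SKETCH_ENV_DEMO", "1")
inside = "GR_SKETCH_ENV_DEMO" in os.environ
tracker.finalize()
after = "GR_SKETCH_ENV_DEMO" in os.environ
```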
Ralph Castain
628ffd1d6e Make the mcast channel assignments unsigned ints so they can be used as array indices. Assign input/output channels for apps. Cleanup some bugs in open_channel
This commit was SVN r23275.
2010-06-16 19:40:59 +00:00
Ralph Castain
6cbe947810 Modify the multicast scheme so that applications have separate input and output channels to avoid cross-talk. Update the multicast test to conform.
This commit was SVN r23271.
2010-06-15 03:50:31 +00:00
Ralph Castain
da43547983 Don't define the active_jobid until -after- the job has been setup.
Cleanup references to pointer_array objects

This commit was SVN r23250.
2010-06-09 02:16:05 +00:00
Jeff Squyres
f1a7b5cc33 Make "processor affinity not supported" error message a little better:
 * Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic
   OPAL_ERR_NOT_SUPPORTED case.
 * When odls_default detects that processor affinity is not supported,
   it prints a specific message about it, and then it suppresses a
   generic HNP help message that would normally follow it (i.e., it's
   easier to have the "processor affinity is not supported" show_help
   message last).
 * Use some symbolic names in odls_default instead of fixed ints,
   just for slight readability improvements in the code.
 * Introduce orte_show_help_suppress(), which gives the ability to
   suppress any future showings of any arbitrary show_help() message.
   This is useful if you display message X and want to suppress
   message Y.  This suppression *only* works in environments where
   orte_show_help() does coalescing.

This commit was SVN r23249.
2010-06-08 20:16:07 +00:00
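The "display message X, suppress message Y" behavior described in f1a7b5cc33 above can be modeled as a set of suppressed (file, topic) pairs that is checked before a message is shown. This is a toy model rather than the real orte_show_help code: the `HelpAggregator` class, the help file names, and the topic names are all hypothetical, and keying on (filename, topic) is an assumption about how messages are identified.

```python
class HelpAggregator:
    """Toy model of a central place where help messages are coalesced."""

    def __init__(self):
        self._suppressed = set()  # (filename, topic) pairs that must not be shown
        self.shown = []           # messages that were actually displayed

    def suppress(self, filename, topic):
        """Mark a message so that any future attempt to show it is dropped."""
        self._suppressed.add((filename, topic))

    def show_help(self, filename, topic, text):
        """Display a message unless it has been suppressed; report whether it was shown."""
        if (filename, topic) in self._suppressed:
            return False
        self.shown.append(text)
        return True


agg = HelpAggregator()
agg.show_help("help-odls.txt", "affinity-not-supported",
              "Processor affinity is not supported")
agg.suppress("help-hnp.txt", "generic-failure")
shown = agg.show_help("help-hnp.txt", "generic-failure",
                      "A process failed to launch")
```

Because the check lives in the aggregator, it can only take effect where messages pass through one, which matches the commit's note that suppression only works where coalescing is done.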
Ralph Castain
e52a54183f Let max restarts be associated with an app_context instead of a job so that individual apps can have different values. Default to a single job-level value
This commit was SVN r23248.
2010-06-07 14:21:08 +00:00
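The per-app override with a job-level default described in e52a54183f above amounts to a simple fallback lookup. The sketch below assumes a hypothetical `AppContext` record in which `None` means "no per-app value given"; the real app_context struct is not shown in the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppContext:
    """Hypothetical per-application record; None means 'use the job default'."""
    name: str
    max_restarts: Optional[int] = None


def effective_max_restarts(app, job_default):
    """Prefer the app's own limit, falling back to the single job-level value."""
    return job_default if app.max_restarts is None else app.max_restarts


solver = AppContext("solver", max_restarts=5)
logger = AppContext("logger")
solver_limit = effective_max_restarts(solver, job_default=2)
logger_limit = effective_max_restarts(logger, job_default=2)
```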
Ralph Castain
799a77a187 Some updates to the routed-cm module so it properly supports the tcp rmcast module
This commit was SVN r23247.
2010-06-07 14:19:32 +00:00
Ralph Castain
bd045468e5 Let apps use the ess cm module too...
This commit was SVN r23246.
2010-06-07 14:16:34 +00:00
Ralph Castain
ec7b5dae2b Add missing include file
This commit was SVN r23245.
2010-06-07 14:15:25 +00:00
George Bosilca
f453265de2 Only call gettimeofday once.
This commit was SVN r23235.
2010-06-02 09:44:37 +00:00
Ralph Castain
69410f2a87 Ensure that we report the state on debugger daemon co-launch so that the spawn properly releases
This commit was SVN r23233.
2010-06-01 23:23:00 +00:00