1
1

72 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
18cb5c9762 Complete modifications for failed-to-start of applications. Modifications for failed-to-start of orteds coming next.
This completes the minor changes required to the PLS components. Basically, there is a small change required to the parameter list of the orted cmd functions. I caught and did it for xcpu and poe, in addition to the components listed in my email - so I think that only leaves xgrid unconverted.

The orted fail-to-start mods will also make changes in the PLS components, but those can be localized so they come in one at a time.

This commit was SVN r14499.
2007-04-24 20:53:54 +00:00
Ralph Castain
2d04298002 Update the orted cmd xmit functions to match orted recv's. This fixes trac:1004.
This commit was SVN r14482.

The following Trac tickets were found above:
  Ticket 1004 --> https://svn.open-mpi.org/trac/ompi/ticket/1004
2007-04-24 01:58:40 +00:00
Ralph Castain
18b2dca51c Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly.
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).

This also involved a slight change to the oob.xcast API, so propagated that as required.

Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)

Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.

This commit was SVN r14475.
2007-04-23 18:41:04 +00:00
George Bosilca
c15cd5e4ab Unload all non necessary PLS. Once the selection process is done, we should release all
unselected PLS. This decrease the footprint of all Open MPI based processes.

This commit was SVN r14322.
2007-04-12 04:55:23 +00:00
Josh Hursey
cd5047a9bf Refs trac:976
Collect the base 'orted' command line into a base function since most of the
PLS components were duplicating this code. Add AMCA parameter command line
component to the base set.

Add Aggregate MCA parameter support to the following PLS components:
 - gridengine
 - process
 - slurm
 - poe
 - tm

Improve support for 'rsh' component.

Did/could not support the following components:
 - bproc
 - proxy
 - xcpu
 - cnos
 - xgrid

The above components had peculiar needs that made it non-trivial to add an 
option. The authors of these components need to help in supporting this
new option.

I was only able to test the SLURM and RSH components due to system availability.
The others should work without problem.

This commit was SVN r14284.

The following Trac tickets were found above:
  Ticket 976 --> https://svn.open-mpi.org/trac/ompi/ticket/976
2007-04-10 14:23:32 +00:00
Tim Prins
2f74160a37 Fix some more memory leaks
This commit was SVN r14175.
2007-03-30 13:43:50 +00:00
George Bosilca
4bab882d17 These 2 ORTE_DECLSPEC are not required.
This commit was SVN r13825.
2007-02-27 15:45:40 +00:00
Sven Stork
d8a369936e - Fix more symbols that should be exported.
This commit was SVN r13824.
2007-02-27 15:17:17 +00:00
George Bosilca
79d76b044a ORTE_DECL everything that can be used outside the base directory. I
woner why this file is called private when it's included by all PLS ...

This commit was SVN r13573.
2007-02-09 03:16:19 +00:00
Jeff Squyres
8d872b195a Refs trac:726
Tested this functionality quite a bit more and made some fixes:

 * Print far fewer help messages
 * Fix one additional deadlock upon error
 * Change some ORTE_LOG messages to silent (because they're not
   errors)
 * Some code got re-indented, sorry...

Discussed and reviewed with Ralph.

This commit was SVN r13375.

The following Trac tickets were found above:
  Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
2007-01-30 23:03:13 +00:00
Ralph Castain
ab5ea61100 Bring over the rest of the ctrl-c fixes. This commit includes:
1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned.

2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we printout an appropriate error message and just give up.

3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded

4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond

This needs more testing before migrating to 1.2.

This commit was SVN r13304.
2007-01-25 14:17:44 +00:00
George Bosilca
9b16827049 Add ORTE_DECLSPEC, and few conversions.
This commit was SVN r13268.
2007-01-24 00:52:08 +00:00
Ralph Castain
64ec238b7b Repair support for Bproc 4 on 64-bit systems. Update the SMR framework to actually support the begin_monitoring API. Implement the get/set_node_state APIs.
This commit was SVN r12864.
2006-12-15 02:34:14 +00:00
George Bosilca
a0ed53d70b Make the compilers happy.
This commit was SVN r12729.
2006-12-03 00:19:11 +00:00
Ralph Castain
f771cc4fbd Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option.
Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution.

This commit was SVN r12619.
2006-11-17 19:06:10 +00:00
Ralph Castain
c1813e5c5a Extend the daemon reuse functionality to *most* of the other environments.
Note that Bproc won't support this operation, so we just ignore the --reuse-daemons directive.

I'm afraid I don't understand the POE and XGrid environments well enough to attempt the necessary modifications.

Also, please note that XGrid support has been broken on the trunk. I don't understand the code syntax well enough to make the required changes to that PLS component, so it won't compile at the moment. I'm hoping Brian has a few minutes to fix it after SC.

This commit was SVN r12614.
2006-11-16 15:11:45 +00:00
Ralph Castain
a17e27dfd5 Sign of old age.....fix some compiler complaints
This commit was SVN r12611.
2006-11-15 22:28:14 +00:00
Ralph Castain
f7fc19a2ca Create the ability to re-use existing daemons. Included in the commit:
1. new functionality in the pls base to check for reusable daemons and launch upon them

2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED

3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse".

This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on.

This commit was SVN r12608.
2006-11-15 21:12:27 +00:00
Ralph Castain
437f2b044d Modify the orted command communication system in two ways:
1. use non-blocking sends to transmit commands (this was actually done in a prior commit)

2. have an "ack" message sent back from the orted when it completes the command

The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occassionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done.

Best of both worlds.

This commit was SVN r12605.
2006-11-15 15:09:28 +00:00
Ralph Castain
6d6cebb4a7 Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.

I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).

This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
Ralph Castain
f95e20e2e1 Add another test program - an MPI app that just spins. This supports testing of system response to signal-terminated processes.
Add some debugger output to the ODLS default component.

Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time.

This commit was SVN r12585.
2006-11-13 21:51:34 +00:00
George Bosilca
2a863df0a5 Newline is required by some compilers at the end of a file.
This commit was SVN r12244.
2006-10-21 05:56:04 +00:00
Ralph Castain
153e38ffc9 Lesson to be learned: if you send an ack to a recv'd command, better not send it to the same tag it came from - at least, not if there is a persistent recv on that tag!
Fix the persistent daemon problem where it was exiting when a job completed. Problem was that the persistent daemon would order the job daemons to exit. They would then send an 'ack' back to the persistent daemon - but the ack consisted of an echo of the "exit" command, which was recv'd by the wrong listener who treated it as a properly sent cmd....and exited.

This commit was SVN r12243.
2006-10-21 02:53:19 +00:00
Ralph Castain
c07d4e2510 Cleaner rendition now extended to other environments. Remove MCA params for backend procs that can cause trouble. Specifically, any directives on the selection of components for RDS, RAS, RMAPS, PLS, and RMGR can be bad mojo on the backend.
This patch will cause a problem for cnos, however, as there we want to specifically tell the backends to be "null". I'm working on that issue.

This commit was SVN r12225.
2006-10-20 16:50:13 +00:00
George Bosilca
2aa3e51223 Nothing relevant. Only a set of castings to have a clean compile on
Windows. The cl.exe compiler is pretty good at complaining about
any kind of non explicit cast.

This commit was SVN r12207.
2006-10-20 02:25:50 +00:00
Ralph Castain
f91a95b3fe Fix the bug that caused mpirun to hang when a remote executable wasn't found using the rsh launcher. Will now test on a remote node
This commit was SVN r12095.
2006-10-11 18:43:13 +00:00
Ralph Castain
2da8245be0 Correctly propagate no-daemonize
This commit was SVN r12093.
2006-10-11 17:53:17 +00:00
George Bosilca
b56636c855 orte_pls belong to the PLS framework, therefore it should only be defined
in pls.h.

This commit was SVN r12089.
2006-10-11 17:12:22 +00:00
Ralph Castain
27e305347c Add a couple of options to orterun that support debugging of daemons for memory corruption.
Ensure that the environment provided to local application processes isn't "polluted" by the orteds

This commit was SVN r12087.
2006-10-11 15:18:57 +00:00
Ralph Castain
cebdc51762 Remove a debugging output
This commit was SVN r12066.
2006-10-09 02:10:52 +00:00
Ralph Castain
ae79894bad Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).

Gridengine compiles but I cannot test (believe it likely will run).

Poe and xgrid compile to the extent they can without the proper include files.

This commit was SVN r12059.
2006-10-07 15:45:24 +00:00
George Bosilca
cda46efd2a Some missing DECLSPEC
This commit was SVN r12047.
2006-10-06 15:21:52 +00:00
George Bosilca
03083cc1f6 Don't release the values[0] before it get initialized.
This commit was SVN r11999.
2006-10-05 05:25:18 +00:00
Ralph Castain
9eb14425b7 The last of the debug messages that keep hiding. My apologies.
This commit was SVN r11937.
2006-10-02 18:43:32 +00:00
Ralph Castain
3fd67a038f Bring comm_spawn and persistent operations online. Still some minor problems, though - so don't use them yet, please.
This won't affect anything except those two scenarios.

This commit was SVN r11936.
2006-10-02 18:29:15 +00:00
Ralph Castain
12328395ae Missed a couple of debug statements
This commit was SVN r11935.
2006-10-02 15:46:41 +00:00
Ralph Castain
7494a7a83f Clean out some debugging statements that were inadvertently left in the commit
This commit was SVN r11933.
2006-10-02 15:03:18 +00:00
Ralph Castain
559b9b0ae8 Continue beating on comm_spawn. Setup to debug bproc.
This commit was SVN r11932.
2006-10-02 14:58:22 +00:00
Ralph Castain
65593cd67e Fix a few Cyrador warnings
This commit was SVN r11930.
2006-10-02 13:00:32 +00:00
Ralph Castain
121f834776 Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component.
Still need some work on the proxy component, and on job termination for persistent daemon case.

This commit was SVN r11928.
2006-10-02 00:46:31 +00:00
Brian Barrett
9807a38458 Always initialize the base output stream, but only set verbose if requested.
Otherwise, the PLS components have pay more attention to debugging streams
than the rest of the OMPI source code

This commit was SVN r11922.
2006-10-01 22:37:30 +00:00
Ralph Castain
db6a93fa63 Fix a couple of reported issues:
1. PLS finalize was not being called. Now ensure that happens during orte_finalize.

2. Errmgr proxies were sending their messages to the wrong tag - typical cut/paste error.

This commit was SVN r11891.
2006-09-29 12:45:50 +00:00
Ralph Castain
37dfdb76eb Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done.
This commit was SVN r11661.
2006-09-14 21:29:51 +00:00
George Bosilca
f52c10d18e And ORTE is ready for prime-time. All Windows tricks are in:
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unamed structures
- no implicit cast.

Plus a full implementation for the orte_wait functions.

This commit was SVN r11347.
2006-08-23 03:32:36 +00:00
George Bosilca
6afa4c6c64 Windows friendly version. We have to split the OMPI_DECLSPEC in at least 3
different macros, one for each project. Therefore, now we have OPAL_DECLSPEC,
ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project.

This commit was SVN r11270.
2006-08-20 15:54:04 +00:00
Ralph Castain
8c7f0ed9ae Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system.
Other changes:

1. Remove the old xcpu components as they are not functional.

2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one.

This will require an autogen/configure, I'm afraid.

This commit was SVN r11228.
2006-08-16 16:35:09 +00:00
Ralph Castain
5dfd54c778 With the branch to 1.2 made....
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).

Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).

I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).

In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...

Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.

This commit was SVN r11204.
2006-08-15 19:54:10 +00:00
Josh Hursey
5c5ce7e051 When 'mca_oob_send_callback' accesses the callback 'orte_pls_rsh_terminate_job_cb'
with an error status (< 0) then the req buffer is NULL. Put checks around the
OBJ_RELEASE(req) calls so that we don't try to release NULL :/

This commit was SVN r10641.
2006-07-03 22:44:54 +00:00
Ralph Castain
ee5a626d25 Add ability to trap and propagate SIGUSR1/2 to remote processes. There are a number of small changes that hit a bunch of files:
1. Changed the RMGR and PLS APIs to add "signal_job" and "signal_proc" entry points. Only the "signal_job" entries are implemented - none of the components have implementations for "signal_proc" at this time. Thus, you can signal all of the procs in a job, but cannot currently signal only one specific proc.

2. Implemented those new API functions in all components except xgrid (Brian will do so very soon). Only the rsh/ssh and fork modules have been tested, however, and only under OS-X.

3. Added signal traps and callback functions for SIGUSR1/2 to orterun/mpirun that catch those signals and call the appropriate commands to propagate them out to all processes in the job.

4. Added a new test directory under the orte branch to (eventually) hold unit and system level tests for just the run-time. Since our test branch of the repository is under restricted access, people working on the RTE were continually developing their own system-level tests - thus making it hard to help diagnose problems. I have moved the more commonly-used functions here, and added one specifically for testing the SIGUSR1/2 functionality.

I will be contacting people directly to seek help with testing the changes on more environments. Other than compile issues, you should see absolutely no change in behavior on any of your systems - this additional functionality is transparent to anyone who does not issue a SIGUSR1/2 to mpirun.

Ralph

This commit was SVN r10258.
2006-06-08 18:27:17 +00:00
Jeff Squyres
b4765b6db6 Add <netdb.h>
This commit was SVN r9142.
2006-02-25 19:04:26 +00:00