1
1

321 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
5818a32245 Bring in a forgotten speed improvement for the TM launcher that was developed during SNL Tbird testing last year. Remove the redundant and slow calls to TM to resolve hostnames. Instead, read the host info from the PBS file during the RAS, and then just use that info in the PLS (rather than getting it again).
Adjust the RMAPS mapped_node object to propagate the required launch_id info now included in the ras_node object. This provides support for those few systems that don't use nodename to launch, but instead want some id (typically an index into the array of allocated nodes). This value gets set for each node in the RAS - the RMAPS just propagates it for easy launch.

This commit was SVN r13581.
2007-02-09 15:06:45 +00:00
George Bosilca
79d76b044a ORTE_DECL everything that can be used outside the base directory. I
woner why this file is called private when it's included by all PLS ...

This commit was SVN r13573.
2007-02-09 03:16:19 +00:00
Ralph Castain
890e3c7981 Reset the trunk so that the odls now sets the paffinity and sched_yield params again. The sched_yield is still overridden by any user-specified setting.
This change utilizes the new num_processors function. I also left the mods made to ompi_mpi_init and the bug fix for the default value of mpi_yield_when_idle. Note that the mods to mpi_init will not really take effect as the mca param will now *always* be set (either by user or odls). We will need those mods later, so no point in removing them now.

This commit was SVN r13519.
2007-02-06 19:51:05 +00:00
Jeff Squyres
c91fcd7fbd Fix a bunch of minor typos submitted by Bernhard Fischer.
This commit was SVN r13505.
2007-02-06 12:00:30 +00:00
Ralph Castain
a8202742ba Fix a missing function pointer - reference ticket #854
This commit was SVN r13476.
2007-02-02 23:10:14 +00:00
Ralph Castain
3daf8b341b Fix the sched_yield problem for generic environments. We now determine and set sched_yield during mpi_init based on the following logical sequence:
1. if the user has specified sched_yield, we simply do what we are told

2. if they didn't specify anything, try to get the number of processors on this node. Note that we already now get the number of local procs in our job that are sharing this node - that now comes in through the proc callback and is stored in the ompi_proc_t structures.

3. if we can get the number of processors, compare that to the number of local procs from my job that are sharing my node. If the number of local procs exceeds the number of processors, then set sched_yield to true. If not, then be a hog and set sched_yield to false

4. if we can't get the number of processors, default to conservative behavior and set sched_yield to true.

Note that I have not yet dealt with the need to dynamically adjust this setting as more processes are added via comm_spawn. So far, we are *only* looking within our own job. Given that we have now moved this logic to mpi_init (and away from the orteds), it isn't yet clear to me how a process will be informed about the number of procs in *other* jobs that are also sharing this node.

Something to continue to ponder.

This commit was SVN r13430.
2007-02-01 19:31:44 +00:00
Ralph Castain
c754523a14 Add cancel_operations to the pls module definition for tm
This commit was SVN r13416.
2007-02-01 16:52:28 +00:00
Jeff Squyres
8d872b195a Refs trac:726
Tested this functionality quite a bit more and made some fixes:

 * Print far fewer help messages
 * Fix one additional deadlock upon error
 * Change some ORTE_LOG messages to silent (because they're not
   errors)
 * Some code got re-indented, sorry...

Discussed and reviewed with Ralph.

This commit was SVN r13375.

The following Trac tickets were found above:
  Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
2007-01-30 23:03:13 +00:00
George Bosilca
dea69e3c7c Remove one of the %s.
This commit was SVN r13357.
2007-01-30 03:56:48 +00:00
George Bosilca
668a2bd7ac Remove some debug output.
This commit was SVN r13323.
2007-01-26 08:09:22 +00:00
George Bosilca
bd7eebda83 Deal with the argv problem from r13321 for the Windows PLS.
This commit was SVN r13322.

The following SVN revision numbers were found above:
  r13321 --> open-mpi/ompi@b439e87f96
2007-01-26 07:21:07 +00:00
George Bosilca
b439e87f96 We have this one starting from r12059. We save a pointer to the argv[*] and
then we modify the argv, forcing the reallocation of the array. With luck
the saved pointer still have a meaning ... without execve return with error
14 (EFAULT).

This commit was SVN r13321.

The following SVN revision numbers were found above:
  r12059 --> open-mpi/ompi@ae79894bad
2007-01-26 07:06:52 +00:00
Rich Graham
3488b394be fix typo in name of the cancel operation.
This commit was SVN r13312.
2007-01-25 19:07:27 +00:00
Jeff Squyres
580a7a108c Fix a compiler warning.
This commit was SVN r13310.
2007-01-25 17:22:01 +00:00
Ralph Castain
ab5ea61100 Bring over the rest of the ctrl-c fixes. This commit includes:
1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned.

2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we printout an appropriate error message and just give up.

3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded

4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond

This needs more testing before migrating to 1.2.

This commit was SVN r13304.
2007-01-25 14:17:44 +00:00
George Bosilca
dcce444ed4 When the user give a prefix that really means something. I expect to
start looking for the daemons using the prefix.

This commit was SVN r13297.
2007-01-25 07:35:25 +00:00
George Bosilca
3b988fcdfd Small update the the process PLS.
This commit was SVN r13293.
2007-01-25 00:17:54 +00:00
George Bosilca
9b16827049 Add ORTE_DECLSPEC, and few conversions.
This commit was SVN r13268.
2007-01-24 00:52:08 +00:00
Ralph Castain
46b7df5683 Fix bproc nodename to correctly assign process-to-nodename mapping
This commit was SVN r13244.
2007-01-22 19:39:00 +00:00
Jeff Squyres
6584df9262 For --prefix-like behavior, we used to modifiy environ directly and
then exec the "srun..." from there.  But somewhere along the line, we
switched to having a copy of environ and modifying that.  It looks
like we forgot to update the stuff for --prefix behavior.  So this
commit fixes the setenv's for PATH and LD_LIBRARY_PATH to modify the
environ copy (not environ itself) so that the values properly get
passed down to the srun environment via execve().

This restores --prefix behavior in the SLURM pls.

This commit was SVN r13239.
2007-01-22 15:50:35 +00:00
George Bosilca
1b92589179 Update the PLS process.
This commit was SVN r13236.
2007-01-22 05:48:25 +00:00
Brian Barrett
ffe35ef6b8 Update the Cray XT3 run-time support files to compile with latest RTE changes
This commit was SVN r13172.
2007-01-17 22:47:27 +00:00
George Bosilca
c8222b57eb The Windows PLS now is able to spawn process locally.
This commit was SVN r13074.
2007-01-11 00:16:58 +00:00
Rolf vandeVaart
9fd5e55b50 Need to include strings.h because that is where the rindex()
function prototype lives.  Without this, we get compile 
warnings.  In addition, for 64-bit Solaris, we get a 
segmentation fault from orterun without this include.

This commit was SVN r13065.
2007-01-10 18:44:08 +00:00
George Bosilca
c7da2b0a9a Add the process PLS. It's only intended for Windows users.
This commit was SVN r13051.
2007-01-09 00:19:52 +00:00
Brian Barrett
a34e67d743 Remove unneeded PARAM_INIT_FILE variable in configure.params files used by
components that use configure.m4 for configuration or are always built. 
The macro has not been needed since moving to configure types other than
configure.stub

Fixes trac:590

This commit was SVN r13031.

The following Trac tickets were found above:
  Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
2007-01-08 03:44:22 +00:00
Brian Barrett
b5057d923e fix compile error that crept in with orte changes. This just makes it
compile -- I can't check for correctness on this platform.  Someone
probably wants to do that.

This commit was SVN r12948.
2006-12-31 20:16:20 +00:00
Li-Ta Lo
6df4e80727 new XCPU PLS and SDS to work with libxcpu
This commit was SVN r12905.
2006-12-21 00:05:36 +00:00
Ralph Castain
a0ef517550 Fix some errors in the bproc components that prevented compiling. Thought I had already done this, but either those changes were lost when I did the merge, or my old man's memory is fading....
Whaz-at??? :-)

This commit was SVN r12874.
2006-12-15 19:40:04 +00:00
Ralph Castain
1e1d0e8a89 Set the app_num attribute into the process environment so we pick it up on the other end
This commit was SVN r12868.
2006-12-15 16:43:52 +00:00
Ralph Castain
677d1260aa cleanup nicely if we don't launch
This commit was SVN r12867.
2006-12-15 14:03:53 +00:00
Ralph Castain
cbb660504c Retain the ability to run valgrind on the bproc launcher - do not call bproc_version if "nolaunch" is specified.
This commit was SVN r12866.
2006-12-15 14:01:21 +00:00
Ralph Castain
64ec238b7b Repair support for Bproc 4 on 64-bit systems. Update the SMR framework to actually support the begin_monitoring API. Implement the get/set_node_state APIs.
This commit was SVN r12864.
2006-12-15 02:34:14 +00:00
Ralph Castain
58569546ed Fix the fix to remove compiler warning - an incorrect "\" was placed in the command string.
This commit was SVN r12805.
2006-12-08 04:17:38 +00:00
Sven Stork
78173a697a Replace the test opertion "-e" with "-r" to improve the protability.
Refs: #392

This commit was SVN r12790.
2006-12-07 12:14:40 +00:00
Brian Barrett
6f8b366acb Rename liborte to libopen-rte and libopal to libopen-pal per telecon today
and bug #632.

Refs trac:632

This commit was SVN r12762.

The following Trac tickets were found above:
  Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632
2006-12-05 18:27:24 +00:00
George Bosilca
a0ed53d70b Make the compilers happy.
This commit was SVN r12729.
2006-12-03 00:19:11 +00:00
George Bosilca
8df8d86b85 Complete the functions to match the expected prototype.
This commit was SVN r12680.
2006-11-28 00:44:30 +00:00
Ralph Castain
deb2470ba3 Move the waitpid callback in the bproc pls *after* we store the daemon info. Otherwise, a short-lived app could terminate before we store the daemon info, causing mpirun to not terminate the daemons since the call to get_active_daemons would return a NULL list.
This commit was SVN r12656.
2006-11-22 22:49:22 +00:00
Ralph Castain
b1ff5fe868 Move the name of the bproc common segment to the central schema location - avoids conflicts when bproc 3 components try to build
This commit was SVN r12654.
2006-11-22 20:23:17 +00:00
Ralph Castain
428c1f14c3 Modify the bproc components to resolve the current allocation problem
This commit was SVN r12652.
2006-11-22 19:10:58 +00:00
Ralph Castain
ea1e0d34c8 Fix the SLURM PLS and SDS components so they correctly communicate the name of the node. This repairs a number of comm_spawn issues seen on Odin and other SLURM machines.
This commit was SVN r12621.
2006-11-17 20:51:03 +00:00
Ralph Castain
f771cc4fbd Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option.
Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution.

This commit was SVN r12619.
2006-11-17 19:06:10 +00:00
Ralph Castain
c1813e5c5a Extend the daemon reuse functionality to *most* of the other environments.
Note that Bproc won't support this operation, so we just ignore the --reuse-daemons directive.

I'm afraid I don't understand the POE and XGrid environments well enough to attempt the necessary modifications.

Also, please note that XGrid support has been broken on the trunk. I don't understand the code syntax well enough to make the required changes to that PLS component, so it won't compile at the moment. I'm hoping Brian has a few minutes to fix it after SC.

This commit was SVN r12614.
2006-11-16 15:11:45 +00:00
Ralph Castain
a17e27dfd5 Sign of old age.....fix some compiler complaints
This commit was SVN r12611.
2006-11-15 22:28:14 +00:00
Ralph Castain
f7fc19a2ca Create the ability to re-use existing daemons. Included in the commit:
1. new functionality in the pls base to check for reusable daemons and launch upon them

2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED

3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse".

This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on.

This commit was SVN r12608.
2006-11-15 21:12:27 +00:00
Ralph Castain
437f2b044d Modify the orted command communication system in two ways:
1. use non-blocking sends to transmit commands (this was actually done in a prior commit)

2. have an "ack" message sent back from the orted when it completes the command

The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occassionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done.

Best of both worlds.

This commit was SVN r12605.
2006-11-15 15:09:28 +00:00
Ralph Castain
6d6cebb4a7 Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.

I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).

This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
Ralph Castain
f95e20e2e1 Add another test program - an MPI app that just spins. This supports testing of system response to signal-terminated processes.
Add some debugger output to the ODLS default component.

Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time.

This commit was SVN r12585.
2006-11-13 21:51:34 +00:00
Ralph Castain
4e50cdae52 This commit accomplishes two things:
1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs.

2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system.

This commit was SVN r12558.
2006-11-11 04:03:45 +00:00