1
1
Граф коммитов

150 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
42df4b2102 Enable the slurmd plm module for testing - only selected if specified
This commit was SVN r20495.
2009-02-09 21:16:24 +00:00
Ralph Castain
f0af389910 Enable comm_spawn of slave processes, currently only active for the rsh, slurm, and tm environments. Establish support for local rsh environments in the plm/base so that rsh of local slaves can be done by any environment that supports it. Create new orte_rsh_agent param so users can specify rsh agent from outside of rsh plm, and sym link that to the old plm_rsh_agent and pls_rsh_agent options.
Modify the orte-bootproxy to pass prefix for the remote slave to support hetero/hybrid scenarios

This commit was SVN r20492.
2009-02-09 20:44:44 +00:00
Ralph Castain
5bfd1f3fd0 Ensure we have a correct, non-zero exit status when daemons or procs abort or fail to launch
This commit was SVN r20478.
2009-02-07 00:57:17 +00:00
Ralph Castain
8924e00e4c Ensure we don't segfault if we don't know which proc failed
This commit was SVN r20474.
2009-02-06 20:04:36 +00:00
Ralph Castain
13749673ed Enable spawn of local slave processes - plm module implementation to follow
This commit was SVN r20466.
2009-02-06 15:31:33 +00:00
Ralph Castain
a6f9c1f2b1 Allocate the slots for use in the xgrid plm
This commit was SVN r20460.
2009-02-06 00:55:14 +00:00
Shiqing Fan
a20254c8a5 A few type casts, making the MS compiler silent.
This commit was SVN r20449.
2009-02-05 16:37:44 +00:00
George Bosilca
4804ee60a7 It barely compiles ...
This commit was SVN r20433.
2009-02-05 00:14:28 +00:00
Ralph Castain
c534757b59 Correct use of the return code from opal_pointer_array_add
This commit was SVN r20417.
2009-02-04 14:02:51 +00:00
Ralph Castain
f36b9332ab Pass along the new output-filename and xterm cmd line options to the orteds - otherwise, they won't work in ssh environments.
Modify the rsh launcher to add -X to ssh if xterm option was selected.

This commit was SVN r20407.
2009-02-03 20:06:05 +00:00
Ralph Castain
aa2abc8cac Fix xgrid plm by changing orte_pointer_array calls to opal_pointer_array
This commit was SVN r20404.
2009-02-03 18:43:00 +00:00
Ralph Castain
5e6d3ba289 Initial implementation of static ports. Provide an mca param to specify static port ranges to the OOB - can provide an
y combination of comma-separated values and ranges. Daemons will use the first port in the range, MPI procs will use the other ports in the range assuming that they know their node rank in time and enough ports were specified.

NOTE: this capability only works under specific conditions. I will outline more about this in a note to devel as the remainder of the implementation progresses. For now, the only environment where this works is slurm. The linear routed module has also been adjusted to work with static ports so that all messaging flows strictly through the topology, including the initial daemon callback - thus limiting the number of sockets opened by mpirun.

This commit was SVN r20390.
2009-01-30 18:31:43 +00:00
Jeff Squyres
bb3d258562 Round up a few places where PATH_MAX was used instead of
OMPI_PATH_MAX.  Thanks to Andrea Iob for the bug report.

This commit was SVN r20360.
2009-01-27 22:57:50 +00:00
Ralph Castain
a9af219ba7 Fix CID 723: a pointless whine about not checking a return code
This commit was SVN r20274.
2009-01-14 19:06:36 +00:00
Ralph Castain
2778c13fac Continue to refine the timing instrumentation to identify where launch time is being spent
This commit was SVN r20244.
2009-01-12 19:12:58 +00:00
Jeff Squyres
d1c6f3f89a * Fix a truckload of Cisco copyrights to be the same as the rest of
the code base.
 * Fix a few misspellings in other copyrights.

This commit was SVN r20241.
2009-01-11 02:30:00 +00:00
Ralph Castain
c009b51ad3 Silence warning about signed vs unsigned comparisons
This commit was SVN r20237.
2009-01-09 16:01:03 +00:00
George Bosilca
78d856e04c Release resources when a job is completed. This allows us to correctly
count and load balance MPI-2 dynamic type of applications.

This commit was SVN r20236.
2009-01-08 21:21:54 +00:00
Ralph Castain
25f578a7d2 Continue to improve timing instrumentation. Add ability to store timing data directly to a file instead of just to stdout.
This commit was SVN r20229.
2009-01-08 14:27:52 +00:00
Shiqing Fan
5ae5f0e173 - 4/4 commit for Windows Visual Studio and CCP support:
unnecessary clean up to non windows related files (within ifdef __WINDOWS__).

This commit was SVN r20111.
2008-12-10 21:13:27 +00:00
Shiqing Fan
8673f19f50 - 2/4 commit for Windows Visual Studio and CCP support:
changes to the already existing ccp components
  event/win32.c: merge old FD handling into new
  opal_installdirs_windows.c:fix the registry handling

This commit was SVN r20109.
2008-12-10 21:01:54 +00:00
Shiqing Fan
a5281f0434 - 1/4 commit for Windows Visual Studio and CCP support:
CMakeLists and .windows files.
  In contribs preconfigured and precompiled parts.

This commit was SVN r20108.
2008-12-10 20:59:20 +00:00
Ralph Castain
728a24c8ec After considerable patience and help with debugging/testing from Tim M and Jeff S, return a completed and pretty well tested patch of the IOF to the trunk. This commit includes the previously reverted r20074, r20068, and r20064, as well as changes to fix those commits.
Basically, the remaining problem turned out to be:

1. closing stdout/stderr during orte_finalize of mpirun

2. inadvertently setting up a write event on fd = -1

3. devising a scheme to more accurately track when the stdin write event was active vs closed so it only got released once

This passed prelim MTT testing by Jeff and Tim, but should soak for awhile before migrating to 1.3.

This commit was SVN r20106.

The following SVN revision numbers were found above:
  r20064 --> open-mpi/ompi@a07660aea8
  r20068 --> open-mpi/ompi@ec930d14a9
  r20074 --> open-mpi/ompi@2940309613
2008-12-10 20:40:47 +00:00
Ralph Castain
e28210d0dc Revert r20074, r20068, and r20064: remove the IOF proc completion code pending further off-trunk work.
This commit was SVN r20089.

The following SVN revision numbers were found above:
  r20064 --> open-mpi/ompi@a07660aea8
  r20068 --> open-mpi/ompi@ec930d14a9
  r20074 --> open-mpi/ompi@2940309613
2008-12-09 17:11:59 +00:00
Ralph Castain
61c21d787d Add missing param in tm launcher
This commit was SVN r20087.
2008-12-09 13:31:33 +00:00
Ralph Castain
ce4018efeb Take a step back on the slurm and tm launchers. Problems were occurring in the MTT runs, although not under non-MTT scenarios. Preserve the modified plm versions in new components that are ompi_ignored until we can resolve the problems.
This will allow for better MTT coverage until the problem can be better understood.

This commit was SVN r20083.
2008-12-09 00:32:04 +00:00
Ralph Castain
a07660aea8 Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes:
1. coordination of job completion notification to include a requirement for both waitpid detection AND notification that all iof pipes have been closed by the app

2. change of all IOF read and write events to be non-persistent so they can properly be shutdown and restarted only when required

3. addition of a delay (currently set to 10ms) before restarting the stdin read event. This was required to ensure that the stdout, stderr, and stddiag read events had an opportunity to be serviced in scenarios where large files are attached to stdin.

This commit was SVN r20064.
2008-12-03 17:45:42 +00:00
Ralph Castain
7213c109ac Revamp the TM plm module so that we detect orted termination without requiring a callback message by using the TM native capabilities. This allows TM to function with fully routed OOB comm, and to tell us what node failed to spawn a daemon.
This commit was SVN r20027.
2008-11-20 18:57:35 +00:00
Ralph Castain
89559396ea Resolve a race condition when running under a SLURM environment.
The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit.

If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state.

Mucho bad. :-/

This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths).

This commit was SVN r20018.
2008-11-18 13:59:23 +00:00
Jeff Squyres
f4ba25cf3c Remove linking components against ORTE and OPAL libs. This was
removed from all other components long ago; I'm not sure how these
survived.

This commit was SVN r19956.
2008-11-08 00:56:57 +00:00
Ralph Castain
58fe779388 Remove double destruct to fix segv when ctrl-c is used to terminate job
This commit was SVN r19875.
2008-11-02 02:25:20 +00:00
George Bosilca
9f17d1d67d Allow xgrid to compile with the changes from 19866.
This commit was SVN r19868.
2008-10-31 21:56:53 +00:00
Ralph Castain
f54fda489e This is a first step towards supporting fully-routed OOB communications:
1. remove direct routed module (hooray!)

2. add radix tree routed module (binomial remains default)

3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess

4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support

5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability

6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon)

Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm.

This will require an autogen...

This commit was SVN r19866.
2008-10-31 21:10:00 +00:00
Ralph Castain
30b3bc6761 Minor update - provide one more helpful hint regarding stdin target out-of-range, ensure we exit cleanly since daemons won't have been launched.
This commit was SVN r19847.
2008-10-29 16:00:48 +00:00
Ralph Castain
82ece176d5 Sanity check needs to allow vpid_invalid as this indicates the "none" scenario
This commit was SVN r19820.
2008-10-28 14:50:26 +00:00
Ralph Castain
71dcf61f9b Add sanity check to ensure that specified stdin target is within range of job. Print error message and exit if not.
Modify read_write test to allow specification of rank to read stdin.

IOF now validated to work for arbitrary rank as stdin target. Not validate to work for multiple simultaneous ranks reading stdin (untested).

This commit was SVN r19804.
2008-10-25 14:38:06 +00:00
George Bosilca
61317cb61d Complete the r19767 commit for XGrid, i.e. allow the PLM Xgrid to build.
This commit was SVN r19777.

The following SVN revision numbers were found above:
  r19767 --> open-mpi/ompi@6e5d844c36
2008-10-21 15:37:22 +00:00
Jeff Squyres
6d026b86b7 Fix a problem reported on the user list by Teng Lin: OPAL_PREFIX
wasn't exported in the Bourne-shell-flavor case on remote nodes.

This commit was SVN r19770.
2008-10-18 12:13:10 +00:00
Ralph Castain
6e5d844c36 Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code:
1. completely and cleanly separates responsibilities between the HNP, orted, and tool components.

2. removes all wireup messaging during launch and shutdown.

3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol.

4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0.

5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none".

6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout.

7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output"

This is not intended for the 1.3 release as it is a major change requiring considerable soak time.

This commit was SVN r19767.
2008-10-18 00:00:49 +00:00
Jeff Squyres
e34c93c46a Fix problem of missing ) noted by Mostyn Lewis.
This commit was SVN r19758.
2008-10-17 16:03:17 +00:00
Ralph Castain
48c3de1865 Fix a problem in the plm "failed to start" code observed by Jeff. When we are unable to launch to a specific node because it doesn't exist or is down, the system would hang and/or segv. The reason for the hang was that we were "firing" the orted exit trigger prior to its timer event being defined - thus "locking" that one-shot and preventing it from firing when we actually were ready to use it.
The segv was caused by the fact that we don't really know which daemon failed to start (at least, in most cases), so we didn't set a pointer to the aborted proc object. All we really wanted, though, was to ensure that mpirun returned a non-zero exit status, so the fix was to simply return the default error status.

This commit was SVN r19754.
2008-10-16 14:21:37 +00:00
Ralph Castain
a7afa869af Bring Jeff's changes over from v1.2 that restores the automatic source of .profile for bash and ksh shells.
This commit was SVN r19709.
2008-10-08 14:21:42 +00:00
Ralph Castain
f4f81c7308 Let the HNP only update the routing tree if necessary. Enable some debug output
This commit was SVN r19676.
2008-10-03 13:41:08 +00:00
Ralph Castain
037231fbcb MOdify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required.
This commit was SVN r19637.
2008-09-25 13:39:08 +00:00
Ralph Castain
16e4b0b698 Ensure that a child job inherits its parent job's prefix dir during comm_spawn operations
This commit was SVN r19538.
2008-09-10 19:05:23 +00:00
Ralph Castain
f326ee356e Add some error output to the plm rsh
This commit was SVN r19532.
2008-09-10 01:59:49 +00:00
Ralph Castain
9b8473fdbf Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use.
Cleanup when an orted actually opens the PLM. Unfortunately, some unmentionable people are pushing head node environs out to remote nodes, causing the daemons to think they are the HNP. This helps prevent the confusion.

This commit was SVN r19518.
2008-09-08 15:45:11 +00:00
Shiqing Fan
cd6ff74d89 Update the ccp module:
rename the get_cluster_message function for both ras/plm.
  use _umask instead of umask.
  add WIN32_DCOM definition to support Windows Vista.

This commit was SVN r19470.
2008-09-01 16:35:38 +00:00
Ralph Castain
43f8bcfe54 Update slurm plm to respect leave_session_attached
This commit was SVN r19370.
2008-08-19 18:30:30 +00:00
Ralph Castain
4e0f34a062 When we hit an error prior to actually launching daemons, it would be nice if orterun didn't bark about daemons failing to launch, mpirun detecting a job failed, etc.
Add a new job state to indicate that we never attempted to launch. Flag such a scenario and avoid hitting all the other error messages.

This commit was SVN r19366.
2008-08-19 15:19:30 +00:00