1
1
Граф коммитов

120 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
58fe779388 Remove double destruct to fix segv when ctrl-c is used to terminate job
This commit was SVN r19875.
2008-11-02 02:25:20 +00:00
George Bosilca
9f17d1d67d Allow xgrid to compile with the changes from 19866.
This commit was SVN r19868.
2008-10-31 21:56:53 +00:00
Ralph Castain
f54fda489e This is a first step towards supporting fully-routed OOB communications:
1. remove direct routed module (hooray!)

2. add radix tree routed module (binomial remains default)

3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess

4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support

5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability

6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon)

Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm.

This will require an autogen...

This commit was SVN r19866.
2008-10-31 21:10:00 +00:00
Ralph Castain
30b3bc6761 Minor update - provide one more helpful hint regarding stdin target out-of-range, ensure we exit cleanly since daemons won't have been launched.
This commit was SVN r19847.
2008-10-29 16:00:48 +00:00
Ralph Castain
82ece176d5 Sanity check needs to allow vpid_invalid as this indicates the "none" scenario
This commit was SVN r19820.
2008-10-28 14:50:26 +00:00
Ralph Castain
71dcf61f9b Add sanity check to ensure that specified stdin target is within range of job. Print error message and exit if not.
Modify read_write test to allow specification of rank to read stdin.

IOF now validated to work for arbitrary rank as stdin target. Not validate to work for multiple simultaneous ranks reading stdin (untested).

This commit was SVN r19804.
2008-10-25 14:38:06 +00:00
George Bosilca
61317cb61d Complete the r19767 commit for XGrid, i.e. allow the PLM Xgrid to build.
This commit was SVN r19777.

The following SVN revision numbers were found above:
  r19767 --> open-mpi/ompi@6e5d844c36
2008-10-21 15:37:22 +00:00
Jeff Squyres
6d026b86b7 Fix a problem reported on the user list by Teng Lin: OPAL_PREFIX
wasn't exported in the Bourne-shell-flavor case on remote nodes.

This commit was SVN r19770.
2008-10-18 12:13:10 +00:00
Ralph Castain
6e5d844c36 Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code:
1. completely and cleanly separates responsibilities between the HNP, orted, and tool components.

2. removes all wireup messaging during launch and shutdown.

3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol.

4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0.

5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none".

6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout.

7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output"

This is not intended for the 1.3 release as it is a major change requiring considerable soak time.

This commit was SVN r19767.
2008-10-18 00:00:49 +00:00
Jeff Squyres
e34c93c46a Fix problem of missing ) noted by Mostyn Lewis.
This commit was SVN r19758.
2008-10-17 16:03:17 +00:00
Ralph Castain
48c3de1865 Fix a problem in the plm "failed to start" code observed by Jeff. When we are unable to launch to a specific node because it doesn't exist or is down, the system would hang and/or segv. The reason for the hang was that we were "firing" the orted exit trigger prior to its timer event being defined - thus "locking" that one-shot and preventing it from firing when we actually were ready to use it.
The segv was caused by the fact that we don't really know which daemon failed to start (at least, in most cases), so we didn't set a pointer to the aborted proc object. All we really wanted, though, was to ensure that mpirun returned a non-zero exit status, so the fix was to simply return the default error status.

This commit was SVN r19754.
2008-10-16 14:21:37 +00:00
Ralph Castain
a7afa869af Bring Jeff's changes over from v1.2 that restores the automatic source of .profile for bash and ksh shells.
This commit was SVN r19709.
2008-10-08 14:21:42 +00:00
Ralph Castain
f4f81c7308 Let the HNP only update the routing tree if necessary. Enable some debug output
This commit was SVN r19676.
2008-10-03 13:41:08 +00:00
Ralph Castain
037231fbcb MOdify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required.
This commit was SVN r19637.
2008-09-25 13:39:08 +00:00
Ralph Castain
16e4b0b698 Ensure that a child job inherits its parent job's prefix dir during comm_spawn operations
This commit was SVN r19538.
2008-09-10 19:05:23 +00:00
Ralph Castain
f326ee356e Add some error output to the plm rsh
This commit was SVN r19532.
2008-09-10 01:59:49 +00:00
Ralph Castain
9b8473fdbf Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use.
Cleanup when an orted actually opens the PLM. Unfortunately, some unmentionable people are pushing head node environs out to remote nodes, causing the daemons to think they are the HNP. This helps prevent the confusion.

This commit was SVN r19518.
2008-09-08 15:45:11 +00:00
Shiqing Fan
cd6ff74d89 Update the ccp module:
rename the get_cluster_message function for both ras/plm.
  use _umask instead of umask.
  add WIN32_DCOM definition to support Windows Vista.

This commit was SVN r19470.
2008-09-01 16:35:38 +00:00
Ralph Castain
43f8bcfe54 Update slurm plm to respect leave_session_attached
This commit was SVN r19370.
2008-08-19 18:30:30 +00:00
Ralph Castain
4e0f34a062 When we hit an error prior to actually launching daemons, it would be nice if orterun didn't bark about daemons failing to launch, mpirun detecting a job failed, etc.
Add a new job state to indicate that we never attempted to launch. Flag such a scenario and avoid hitting all the other error messages.

This commit was SVN r19366.
2008-08-19 15:19:30 +00:00
Ralph Castain
49745c5f40 Provide a new option that allows a user to leave an ssh session open without getting deluged by ORTE debug output. The new option is --leave-session-attached, with a corresponding MCA param of orte_leave_session_attached.
Theoretically, any PLM could use this - but in reality, all of them except rsh/ssh already leave the session attached anyway.

This fixes trac:656 - a REALLY old ticket

This commit was SVN r19294.

The following Trac tickets were found above:
  Ticket 656 --> https://svn.open-mpi.org/trac/ompi/ticket/656
2008-08-14 18:59:01 +00:00
Ralph Castain
30f37f762d Enable co-location of debugger daemons during initial launch and when debugging a running job.
Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented.

No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and, for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release.

This commit was SVN r19275.
2008-08-13 17:47:24 +00:00
Ralph Castain
f017c55bfa Close a minor memory leak - we can reuse timer events
This commit was SVN r19251.
2008-08-12 12:53:30 +00:00
Ralph Castain
be02211b4f Modify the wakeup system to make it more Windows-friendly. This allows Shiqing to consolidate the Windows-specific modifications into one location, and generalizes the wakeup procedure in case we hit other system-specific requirements.
This needs some soak time to ensure we haven't opened any race conditions. I tried to loop everything in the shutdown procedure through that trigger event call to ensure it all goes through the one-time locks as it did before so that someone hitting ctrl-c when we are already shutting down shouldn't cause problems. Just want to let people use it for awhile to verify.

This commit was SVN r19159.
2008-08-05 15:09:29 +00:00
Ralph Castain
5b2f53a069 One more quick fix - ensure we are looking at the value and not its pointer
This commit was SVN r19123.
2008-08-01 23:39:55 +00:00
Jeff Squyres
26c7daf16a Fix typo
This commit was SVN r19121.
2008-08-01 21:30:53 +00:00
Ralph Castain
21cd4b9df8 Add pls_rsh_agent synonym to the PLM rsh component
This commit was SVN r19119.
2008-08-01 20:15:42 +00:00
Ralph Castain
a62b2a0150 Per the July technical meeting:
Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways.

Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument.

For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on you cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work.

This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do.

To solve this problem, we did the following:

1. removed all mca params from the individual plms for the launch agent

2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted".

3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values

4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry.

5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices

Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct.

This commit was SVN r19097.
2008-07-30 18:26:24 +00:00
George Bosilca
a4d905db4a Allow xgrid to compile.
This commit was SVN r19076.
2008-07-29 13:24:08 +00:00
Jeff Squyres
0af7ac53f2 Fixes trac:1392, #1400
* add "register" function to mca_base_component_t
   * converted coll:basic and paffinity:linux and paffinity:solaris to
     use this function
   * we'll convert the rest over time (I'll file a ticket once all
     this is committed)
 * add 32 bytes of "reserved" space to the end of mca_base_component_t
   and mca_base_component_data_2_0_0_t to make future upgrades
   [slightly] easier
   * new mca_base_component_t size: 196 bytes
   * new mca_base_component_data_2_0_0_t size: 36 bytes
 * MCA base version bumped to v2.0
   * '''We now refuse to load components that are not MCA v2.0.x'''
 * all MCA frameworks versions bumped to v2.0
 * be a little more explicit about version numbers in the MCA base
   * add big comment in mca.h about versioning philosophy

This commit was SVN r19073.

The following Trac tickets were found above:
  Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392
2008-07-28 22:40:57 +00:00
Thomas Herault
28dc80b67e Deal with the SIGCHLD issue in LSF.
lsb_launch tampers with SIGCHLD signal handler. We are forced to reinstall our own signal handler after a call to this function.

This commit fixes trac:1356.

This commit was SVN r19033.

The following Trac tickets were found above:
  Ticket 1356 --> https://svn.open-mpi.org/trac/ompi/ticket/1356
2008-07-25 15:23:23 +00:00
Thomas Herault
b6affd35e9 Small typos for LSF compilation and update Makefile.am
This commit was SVN r18998.
2008-07-23 14:42:26 +00:00
George Bosilca
bcac9a0540 Remove a warning about using map when it is not initialized.
This commit was SVN r18957.
2008-07-21 14:35:05 +00:00
Ralph Castain
b1f367563c Few minor code cleanups
This commit was SVN r18890.
2008-07-11 15:40:41 +00:00
Ralph Castain
58964b2bf8 Make lsf support compile
This commit was SVN r18889.
2008-07-11 15:40:25 +00:00
Shiqing Fan
67a842fd17 Too many actual parameters for this function, remove the wrong one in order to get rid of the compiler warnings.
This commit was SVN r18863.
2008-07-10 08:06:53 +00:00
Pak Lui
924bface15 The plm env var should set to the name of a current plm module, which is rsh.
This commit was SVN r18844.
2008-07-08 23:15:52 +00:00
Shiqing Fan
5d0f4dc88d - Clean up the unreferenced variables.
- Change the arguments for launch failed function according to changeset r18611.

This commit was SVN r18795.

The following SVN revision numbers were found above:
  r18611 --> open-mpi/ompi@7bee71aa59
2008-07-03 10:11:08 +00:00
Shiqing Fan
a3e1718126 Missing one argument for calling this function.
This commit was SVN r18793.
2008-07-01 18:01:22 +00:00
Ralph Castain
9cebe0ca96 Ckpt the bproc support. All compiles now except for PLM module
This commit was SVN r18744.
2008-06-26 03:48:22 +00:00
Ralph Castain
17fcd72b5d Restore bproc code - if someone wants to maintain it, then more power to them...but it would definitely be easier if the old code is in the trunk. This is all .ompi_ignore'd except for me so I can play with making it compile again in my copious free time.
This commit was SVN r18716.
2008-06-24 01:27:22 +00:00
Ralph Castain
3e61a3f92e Sandbox for next-gen launch
This commit was SVN r18715.
2008-06-24 01:25:51 +00:00
Ralph Castain
f799ea225f Orterun creates a "clean" copy of its environment for use in launching procs. This includes properly setting LD_LIBRARY_PATH and PATH, among other things. Unfortunately, our PLM modules were using the local environ instead of the saved copy, thus missing a number of things that really should have been included. From what I see, we got away with the error because the PLM's were duplicating all that setup logic themselves - I'll clean this up over the next few days.
Meantime, correct the PLM's so they use the correct environ for launching.

This commit was SVN r18713.
2008-06-23 22:39:36 +00:00
Ralph Castain
b56f8ced4f Ensure params are registered prior to parsing global cmd line options in orterun so that debugger options are properly captured and acted upon.
Ensure that routes to remote procs are set on the HNP before completing launch so that the debugger message can be sent. Solves a race condition that can exist in those environments where the HNP does not have local procs.

This commit was SVN r18674.
2008-06-19 02:58:14 +00:00
Ralph Castain
282a220e7e Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation
This commit was SVN r18670.
2008-06-18 15:28:46 +00:00
Ralph Castain
0532d799d6 Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm.
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.

This commit was SVN r18664.
2008-06-18 03:15:56 +00:00
Ralph Castain
1a422995ae Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak.
This commit was SVN r18641.
2008-06-10 17:53:28 +00:00
George Bosilca
f72ab90b16 Allow xgrid to compile again.
This commit was SVN r18631.
2008-06-09 21:51:41 +00:00
Ralph Castain
c13cadc3c7 Refs trac:1255
This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something.

So the ticket can be checked by Jeff when he returns from vacation... :-)

This commit was SVN r18625.

The following Trac tickets were found above:
  Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255
2008-06-09 20:34:14 +00:00
Ralph Castain
bf5c34d10a The rsh launcher is one place where multi-word MCA params would have to be passed via the orted cmd line. In such a case, we have to explicitly include quote marks about the param value. Add that capability here.
This commit fixes trac:1200

This commit was SVN r18621.

The following Trac tickets were found above:
  Ticket 1200 --> https://svn.open-mpi.org/trac/ompi/ticket/1200
2008-06-09 19:07:19 +00:00