
1910 Commits

Author SHA1 Message Date
George Bosilca
d23fe1bb10 Include Ralph's suggestions, i.e. keep the hnp and orted management in sync.
This commit was SVN r19872.
2008-11-01 00:39:46 +00:00
George Bosilca
9528d33e90 Nothing relevant: a few indentation fixes, and tabs replaced by spaces.
This commit was SVN r19870.
2008-10-31 22:24:52 +00:00
George Bosilca
ebe87d1842 Apply some suggestions from Ralph and avoid a pretty nasty race condition on the close of the fd.
The problem was that we close the same fd twice, and in the meantime the fd could have been reassigned
to some other file or socket.

This commit was SVN r19869.
2008-10-31 22:23:53 +00:00
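A double close of this kind is commonly avoided by invalidating the stored descriptor as soon as it is closed. The following is a minimal sketch of that general pattern, not the code from r19869; the helper name `close_fd_once` is hypothetical.

```c
#include <unistd.h>

/* Hypothetical helper (not taken from the Open MPI sources):
 * close *fd at most once. Resetting the stored descriptor to -1
 * turns any later call into a no-op, so a second close cannot hit
 * a descriptor number that has since been reassigned to another
 * file or socket. */
static int close_fd_once(int *fd)
{
    int rc = 0;

    if (*fd >= 0) {
        rc = close(*fd);
        *fd = -1;   /* mark as closed even if close() reported an error */
    }
    return rc;
}
```

A caller keeps the descriptor in a variable or struct field and always closes it through the helper, e.g. `close_fd_once(&child->stdout_fd)`, where `stdout_fd` is likewise an illustrative field name.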
George Bosilca
9f17d1d67d Allow xgrid to compile with the changes from 19866.
This commit was SVN r19868.
2008-10-31 21:56:53 +00:00
Ralph Castain
f54fda489e This is a first step towards supporting fully-routed OOB communications:
1. remove direct routed module (hooray!)

2. add radix tree routed module (binomial remains default)

3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess

4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support

5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability

6. set up new ability to shut down orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon)

Initial tests indicate that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm.

This will require an autogen...

This commit was SVN r19866.
2008-10-31 21:10:00 +00:00
George Bosilca
0ce76248e8 Close the file descriptors used to push or pull the data to the children.
Without this patch, doing spawn in a loop ended up by exhausting all
available file descriptors pretty quickly. There were about 5 file
descriptors opened per spawned process. Now the number of file
descriptors managed by the process (orted or HNP)
is a lot smaller.

This commit was SVN r19864.
2008-10-31 18:05:28 +00:00
Ralph Castain
30b3bc6761 Minor update - provide one more helpful hint regarding stdin target out-of-range, ensure we exit cleanly since daemons won't have been launched.
This commit was SVN r19847.
2008-10-29 16:00:48 +00:00
Ralph Castain
82ece176d5 Sanity check needs to allow vpid_invalid as this indicates the "none" scenario
This commit was SVN r19820.
2008-10-28 14:50:26 +00:00
Brad Penoff
d7b0fdfe5c small fix to compile trunk on FreeBSD 7
This commit was SVN r19817.
2008-10-28 03:44:23 +00:00
Ethan Mallove
2457df91b3 Add missing #include <errno.h> line (for SunStudio Solaris).
This commit was SVN r19814.
2008-10-27 17:41:33 +00:00
Jeff Squyres
b11d13cc05 Silence a trivial compiler warning (pgcc).
This commit was SVN r19810.
2008-10-27 14:23:02 +00:00
Jeff Squyres
c078ab6b09 Minor fix for a trivial compiler warning.
This commit was SVN r19809.
2008-10-27 14:18:49 +00:00
Ralph Castain
71dcf61f9b Add sanity check to ensure that specified stdin target is within range of job. Print error message and exit if not.
Modify read_write test to allow specification of rank to read stdin.

IOF now validated to work for arbitrary rank as stdin target. Not validated to work for multiple simultaneous ranks reading stdin (untested).

This commit was SVN r19804.
2008-10-25 14:38:06 +00:00
Jeff Squyres
d96b78fee1 If the script is there, there's no real reason to have these files in
the repo.

This commit was SVN r19795.
2008-10-24 13:42:26 +00:00
Jeff Squyres
0a741d7f81 Add scripty-foo to make the data files. Revamp the data files to be
non-uniform in content as a slightly better test.

This commit was SVN r19794.
2008-10-24 13:35:47 +00:00
Ralph Castain
c56cdac379 Finish cleanup of stdin. Set non-stdio file descriptors to non-blocking (thanks to Jeff for catching that one). Handle writes that result in "would have blocked" errno.
This commit was SVN r19793.
2008-10-24 01:42:58 +00:00
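The two steps named in this commit, switching a descriptor to non-blocking mode and treating a "would have blocked" errno as a retry-later condition, can be sketched with standard POSIX calls. This is an illustration under assumptions, not the IOF code itself; `set_nonblocking` and `write_some` are hypothetical names.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to the descriptor's existing status flags.
 * Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Try to write len (> 0) bytes.
 * Returns the number of bytes written, 0 if the write would have
 * blocked (the caller should keep the data and retry later), or
 * -1 on any other error. */
static ssize_t write_some(int fd, const char *buf, size_t len)
{
    ssize_t n;

    do {
        n = write(fd, buf, len);
    } while (n < 0 && errno == EINTR);   /* restart if interrupted */

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;
    }
    return n;
}
```

A caller would typically queue whatever `write_some` did not accept and retry once the descriptor is reported writable again.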
Ralph Castain
6100d88ded Cleanup the new IOF:
1. remove some stale files that were overlooked in original commit

2. add a test program and data to stress iof for stdin

3. cleanup a debug statement that caused memory corruption when reading large files

4. some minor cleanups to correctly handle xon/xoff scenarios

This commit was SVN r19792.
2008-10-23 19:11:05 +00:00
George Bosilca
61317cb61d Complete the r19767 commit for XGrid, i.e. allow the PLM Xgrid to build.
This commit was SVN r19777.

The following SVN revision numbers were found above:
  r19767 --> open-mpi/ompi@6e5d844c36
2008-10-21 15:37:22 +00:00
Ralph Castain
ebaa2c59bb Cleanup non-debug builds
This commit was SVN r19771.
2008-10-18 13:09:47 +00:00
Jeff Squyres
6d026b86b7 Fix a problem reported on the user list by Teng Lin: OPAL_PREFIX
wasn't exported in the Bourne-shell-flavor case on remote nodes.

This commit was SVN r19770.
2008-10-18 12:13:10 +00:00
Jeff Squyres
d96003fec5 Fix typo.
This commit was SVN r19769.
2008-10-18 11:52:41 +00:00
Jeff Squyres
8ea27c0ced Add a missing header file to the Makefile.am so that it can be
included in the distribution tarball.

This commit was SVN r19768.
2008-10-18 11:09:57 +00:00
Ralph Castain
6e5d844c36 Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code:
1. completely and cleanly separates responsibilities between the HNP, orted, and tool components.

2. removes all wireup messaging during launch and shutdown.

3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol.

4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0.

5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none".

6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout.

7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output"

This is not intended for the 1.3 release as it is a major change requiring considerable soak time.

This commit was SVN r19767.
2008-10-18 00:00:49 +00:00
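The xon/xoff flow control mentioned in item 3 can be illustrated with a small watermark-based state machine: the receiver asks the sender to pause once its buffered data passes a high-water mark, and to resume once it has drained to a low-water mark. The sketch below shows the general technique only; the type names, function names and thresholds are assumptions, not the ones used by the IOF framework.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receiver-side bookkeeping for xon/xoff flow control. */
typedef struct {
    size_t queued;      /* bytes buffered but not yet delivered        */
    size_t high_water;  /* request XOFF when queued rises above this   */
    size_t low_water;   /* request XON when queued falls to this level */
    bool   paused;      /* true between an XOFF and the next XON       */
} flow_ctl_t;

typedef enum { FLOW_NONE, FLOW_XOFF, FLOW_XON } flow_msg_t;

/* Account for newly received data; returns the control message
 * (if any) that should be sent back to the data source. */
static flow_msg_t flow_enqueue(flow_ctl_t *fc, size_t nbytes)
{
    fc->queued += nbytes;
    if (!fc->paused && fc->queued > fc->high_water) {
        fc->paused = true;
        return FLOW_XOFF;
    }
    return FLOW_NONE;
}

/* Account for data delivered to its consumer; returns FLOW_XON once
 * a paused source may resume sending. */
static flow_msg_t flow_drain(flow_ctl_t *fc, size_t nbytes)
{
    fc->queued = (nbytes > fc->queued) ? 0 : fc->queued - nbytes;
    if (fc->paused && fc->queued <= fc->low_water) {
        fc->paused = false;
        return FLOW_XON;
    }
    return FLOW_NONE;
}
```

Keeping the two watermarks apart avoids sending a pause/resume pair for every small change in the queue length.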
Jeff Squyres
e34c93c46a Fix problem of missing ) noted by Mostyn Lewis.
This commit was SVN r19758.
2008-10-17 16:03:17 +00:00
Josh Hursey
88aa45dd52 Commit to bring online OpenIB, MX, and shared memory support for Open MPI's checkpoint/restart functionality. Some tuning is still needed, but basic functionality is in place.
There is still a problem with OpenIB and threads (external to C/R functionality). It has been reported in Ticket #1539

Additionally:
* Fix a file cleanup bug in CRS Base.
* Fix a possible deadlock in the TCP ft_event function
* Add a mca_base_param_deregister() function to MCA base
* Add whole process checkpoint timers
* Add support for BTL: OpenIB, MX, Shared Memory
* Add support for Mpool: rdma, sm
* Sundry bounds checking and cleanup in some scattered functions

This commit was SVN r19756.
2008-10-16 15:09:00 +00:00
Ralph Castain
b46d3e766e Cleanup the plm failed-to-start problem a little - ensure that the event is always defined so we don't have to check when trying to trigger it, thus avoiding potential race conditions.
This commit was SVN r19755.
2008-10-16 14:58:32 +00:00
Ralph Castain
48c3de1865 Fix a problem in the plm "failed to start" code observed by Jeff. When we are unable to launch to a specific node because it doesn't exist or is down, the system would hang and/or segv. The reason for the hang was that we were "firing" the orted exit trigger prior to its timer event being defined - thus "locking" that one-shot and preventing it from firing when we actually were ready to use it.
The segv was caused by the fact that we don't really know which daemon failed to start (at least, in most cases), so we didn't set a pointer to the aborted proc object. All we really wanted, though, was to ensure that mpirun returned a non-zero exit status, so the fix was to simply return the default error status.

This commit was SVN r19754.
2008-10-16 14:21:37 +00:00
Shiqing Fan
3d4e89a5cd - Remove the unused code introduced with r19480, which was for serializing tcp events on Windows and not successful.
This commit was SVN r19747.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r19480
2008-10-15 08:39:30 +00:00
Shiqing Fan
8b60c755c2 - Bring r19742 into trunk.
- Unify the Windows and the others way of handling callbacks. Thanks to George.
- This will let Windows use the same callbacks as Linux does, which works also.

This commit was SVN r19746.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r19742
2008-10-15 08:14:24 +00:00
Ralph Castain
a7afa869af Bring Jeff's changes over from v1.2 that restore the automatic source of .profile for bash and ksh shells.
This commit was SVN r19709.
2008-10-08 14:21:42 +00:00
Ralph Castain
802d14b130 Default job controls to forward IO. Adjust debugger code to not forward IO unless requested.
This commit was SVN r19690.
2008-10-06 17:56:23 +00:00
Jeff Squyres
dbb932b619 Remove the missing app mpi_after_finalize from the Makefile.
This commit was SVN r19687.
2008-10-06 14:35:15 +00:00
Ralph Castain
f4f81c7308 Let the HNP only update the routing tree if necessary. Enable some debug output
This commit was SVN r19676.
2008-10-03 13:41:08 +00:00
Ralph Castain
0cc2e724f8 Separate var declaration from use to remove compiler warnings in non-debug builds
This commit was SVN r19675.
2008-10-03 13:40:31 +00:00
Ralph Castain
15c47a2473 Revise the daemon collective system to handle comm_spawn patterns that cross into new nodes that are not direct children on the routing tree of the HNP.
Refers to ticket #1548. Although this appears to fix the problem, the ticket will be held open pending further test prior to transition to the 1.3 branch.

This commit was SVN r19674.
2008-10-02 20:08:27 +00:00
Ralph Castain
aa11e0977c Correct a bug in the bookmarking code that incorrectly looked at #slots instead of #slots_allocated, thus causing slot reductions in hostfiles to be ignored when selecting our starting node.
Fixes trac:1527

This commit was SVN r19656.

The following Trac tickets were found above:
  Ticket 1527 --> https://svn.open-mpi.org/trac/ompi/ticket/1527
2008-09-29 14:09:02 +00:00
Ralph Castain
4f89adae0c Prettify the user level display of allocation and map to make it easier to see and understand
This commit was SVN r19655.
2008-09-28 16:44:09 +00:00
Ralph Castain
508cb45583 Add a little more diagnostic info when we cannot do an rml send
This commit was SVN r19654.
2008-09-28 02:13:49 +00:00
Ralph Castain
edb3d99687 Update SLURM environmental variables used to describe allocation. Retain backwards compatibility to SLURM 1.1 and earlier versions.
This commit was SVN r19647.
2008-09-26 02:38:37 +00:00
Kenneth Matney
91bbc6b919 Change algorithm from spawning a shell that spawns another shell, and
thereby runs apstat twice; and in the process thereof reads the ALPS
appinfo file TWICE; and in addition, experiences a failure sometimes
which causes mpirun to hang.  Change this to a looped read attempt
that breaks on success, thereby avoiding failure (except in the most

This commit was SVN r19642.
2008-09-25 20:44:16 +00:00
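The "looped read attempt that breaks on success" can be pictured as a bounded retry loop around an ordinary file read. The sketch below is generic and makes no claim about the ALPS appinfo format or about the actual code in r19642; `read_with_retry` and its parameters are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: try up to max_attempts times to read up to
 * len bytes from path into buf. Returns the number of bytes read on
 * the first successful attempt, or -1 if every attempt failed. */
static long read_with_retry(const char *path, char *buf, size_t len,
                            int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        FILE *fp = fopen(path, "rb");
        if (fp == NULL) {
            continue;               /* file not available yet: try again */
        }

        size_t n = fread(buf, 1, len, fp);
        int failed = ferror(fp);
        fclose(fp);

        if (!failed && n > 0) {
            return (long)n;         /* success: leave the loop */
        }
    }
    return -1;
}
```

A real caller would probably wait briefly between attempts; the delay is left out here to keep the sketch short.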
Ralph Castain
037231fbcb Modify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required.
This commit was SVN r19637.
2008-09-25 13:39:08 +00:00
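Hiding the field width behind a typedef means a later widening touches one line instead of every declaration. A minimal sketch of the idea follows; the type and macro names are assumptions and may not match the identifiers introduced in r19637.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. An 8-bit field can
 * only distinguish 256 ranks per node (0..255); 16 bits raise
 * that to 65536. */
typedef uint16_t node_rank_t;
typedef uint16_t local_rank_t;

/* Largest rank value representable in the current field width. */
#define NODE_RANK_MAX  UINT16_MAX
#define LOCAL_RANK_MAX UINT16_MAX
```

Code that stores or packs these ranks then uses `node_rank_t` and `local_rank_t` rather than a raw integer type.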
Jeff Squyres
78a25cf116 Commit a few missing header files, etc.
This commit was SVN r19626.
2008-09-24 15:41:42 +00:00
Ralph Castain
8d1ecdb361 Correct the creation of MPIR_Proctable so that the structs in the array correspond to the order of the ranks.
This commit was SVN r19624.
2008-09-24 14:55:46 +00:00
Ralph Castain
e64b79f30f Modify the --display-map and --display-alloc per note on devel list to reduce info for user understanding.
Add --display-devel-map and --display-devel-alloc to display all the detailed info we used to provide - it is only of use/interest to developers anyway and confuses users.

This commit was SVN r19608.
2008-09-23 15:46:34 +00:00
Jeff Squyres
e0a991a8c2 Print out a message telling the user how to enable non-aggregated help
/ error messages.

This commit was SVN r19604.
2008-09-22 17:42:56 +00:00
Josh Hursey
0cd65bfaa8 Fix a SIGPIPE that may occur when checkpointing a restarted process. This was a result of calling system() in the BLCR CRS. After inspection and testing it was determined that the operation was no longer necessary. So the call was removed thus fixing the bug.
This commit was SVN r19601.
2008-09-22 16:49:56 +00:00
Jeff Squyres
8eccda391a Fix comment to match the code.
This commit was SVN r19598.
2008-09-20 12:35:48 +00:00
Ralph Castain
16e4b0b698 Ensure that a child job inherits its parent job's prefix dir during comm_spawn operations
This commit was SVN r19538.
2008-09-10 19:05:23 +00:00
Ralph Castain
f326ee356e Add some error output to the plm rsh
This commit was SVN r19532.
2008-09-10 01:59:49 +00:00
Ralph Castain
20ece3cb86 Add new test that stresses MPI send/recv
This commit was SVN r19530.
2008-09-09 15:47:31 +00:00