1
1

12 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
2966206f58 Fix a race condition in the IOF and add some new user-requested features:
1. fix a race condition whereby a proc's output could trigger an event prior to the other outputs being setup, thus c ausing the IOF to declare the proc "terminated" too early. This was really rare, but could happen.

2. add a new "timestamp-output" option that timestamp's each line of output

3. add a new "output-filename" option that redirects each proc's output to a separate rank-named file.

4. add a new "xterm" option that redirects the output of the specified ranks to a separate xterm window.

This commit was SVN r20392.
2009-01-30 22:47:30 +00:00
Ralph Castain
007d68becc Make the data on local children and their jobs available globally on both daemons and the HNP. This simply shifts the data structures from the ODLS base to the orte globals area to support subsequent movement of the daemon collective operations from the odls to the grpcomm framework. As that will be a larger change, it will be implemented on a branch and rolled over separately.
This commit was SVN r20228.
2009-01-08 14:25:56 +00:00
George Bosilca
9528d33e90 Nothing relevant, few indentations and replace tab by spaces.
This commit was SVN r19870.
2008-10-31 22:24:52 +00:00
Ralph Castain
f54fda489e This is a first step towards supporting fully-routed OOB communications:
1. remove direct routed module (hooray!)

2. add radix tree routed module (binomial remains default)

3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess

4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support

5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability

6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon)

Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm.

This will require an autogen...

This commit was SVN r19866.
2008-10-31 21:10:00 +00:00
Ralph Castain
30f37f762d Enable co-location of debugger daemons during initial launch and when debugging a running job.
Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented.

No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and, for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release.

This commit was SVN r19275.
2008-08-13 17:47:24 +00:00
Josh Hursey
9971bc9d95 Merge in the mca_base_select changes per RFC:
http://www.open-mpi.org/community/lists/devel/2008/04/3779.php

{{{
svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play .
}}}

Any components not in the trunk, but in one of the effected frameworks *must* be
updated. Contact the list, look at the RFC, or look at the diff for how to do this.

Sorry for the early commit of this, but I wanted to get it in today (per RFC) and
didn't know if I would have a chance later today.

This commit was SVN r18381.
2008-05-06 18:08:45 +00:00
Ralph Castain
3e55fe6f6d Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit.
Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs.

Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node.

This commit was SVN r18338.
2008-04-30 19:49:53 +00:00
Ralph Castain
7c7304466c Add a binomial tree-based launch to ssh, turned "on" only when the plm_rsh_tree_spawned mca param is set to a non-zero value. This probably isn't a very optimized capability, but it does execute a tree-based launch that may scale better than linear at high node counts.
Add the daemon map capability to the ODLS to create and save a map of daemon vpid vs nodename from the launch message.

Cleanup a few places in the base plm launch support where we didn't adequately protect rml recv's from potentially executing sends.

This commit was SVN r18143.
2008-04-14 18:26:08 +00:00
Ralph Castain
3a0d09300b Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations.
Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study.

This commit was SVN r18115.
2008-04-09 22:10:53 +00:00
Ralph Castain
d70e2e8c2b Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately.
Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer

This commit was SVN r17632.
2008-02-28 01:57:57 +00:00
Ralph Castain
82a8e2d10d Reorganize the odls framework to place common functionality in the base, thus making maintenance easier. We still need this to be a framework as some environments (e.g., bproc) require significantly different functionality. However, there is quite a bit of commonality across the components, so this ensures that fixes in one get propagated across the others.
This patch also fixes a minor bug discovered along the way: we had "lost" the passing of the oversubscribed condition flag from the mapper to the orteds. Thus, we were not setting sched_yield correctly when in oversubscribed conditions (except when a hostfile was specified - different logic there because we treat the number of slots allocated on the node as "uncertain")

I did not modify the process component in this patch - I will send a proposed patch to the maintainers of that component so they can review it first.

This commit was SVN r16418.
2007-10-10 15:02:10 +00:00
Ralph Castain
37dfdb76eb Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done.
This commit was SVN r11661.
2006-09-14 21:29:51 +00:00