1
1
Граф коммитов

158 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
9d381a4ebf Add a '!' option to the xterm iof option to invoke the -hold feature of xterm.
Correct the orte-show-help file when a rank is out of bounds, and do that test where a wildcard doesn't get incorrectly flagged as out-of-bounds.

This commit was SVN r20398.
2009-02-02 15:06:23 +00:00
Ralph Castain
2966206f58 Fix a race condition in the IOF and add some new user-requested features:
1. fix a race condition whereby a proc's output could trigger an event prior to the other outputs being setup, thus c ausing the IOF to declare the proc "terminated" too early. This was really rare, but could happen.

2. add a new "timestamp-output" option that timestamp's each line of output

3. add a new "output-filename" option that redirects each proc's output to a separate rank-named file.

4. add a new "xterm" option that redirects the output of the specified ranks to a separate xterm window.

This commit was SVN r20392.
2009-01-30 22:47:30 +00:00
Ralph Castain
5e6d3ba289 Initial implementation of static ports. Provide an mca param to specify static port ranges to the OOB - can provide an
y combination of comma-separated values and ranges. Daemons will use the first port in the range, MPI procs will use the other ports in the range assuming that they know their node rank in time and enough ports were specified.

NOTE: this capability only works under specific conditions. I will outline more about this in a note to devel as the remainder of the implementation progresses. For now, the only environment where this works is slurm. The linear routed module has also been adjusted to work with static ports so that all messaging flows strictly through the topology, including the initial daemon callback - thus limiting the number of sockets opened by mpirun.

This commit was SVN r20390.
2009-01-30 18:31:43 +00:00
Ralph Castain
fd5e15ea58 Since parsing comma-delimited, range-capable options is being used in multiple places, create a new utility that consolidates that code.
Have orte-iof use it.

This commit was SVN r20346.
2009-01-25 17:16:25 +00:00
Ralph Castain
88a0af9726 Revise the way we output resolved hostnames to make life easier for the Eclipse folks. Store aliases for individual nodes (only when requested to show resolved hostnames) and then report them out as part of the display-map option.
This commit was SVN r20284.
2009-01-15 18:11:50 +00:00
Jeff Squyres
a568ba0468 Fix CID 25: it's not possible for sav to be non-NULL by the time it
gets here.

This commit was SVN r20273.
2009-01-14 18:57:48 +00:00
Ralph Castain
7818779760 Expose the nidmap and pidmap as orte globals so that components in other frameworks can access and/or manipulate them without forcing API modifications - modify the individual ess components that were affected so they use the global variables. Add a list of attributes to the nids for storing node-related data (e.g., modex attrs), and define a new object for that purpose.
Consolidate the nid/pid lookup code with the rest of the nid/pid code so that changes are easier to track. Add the ability to send cluster profile info as part of the nidmap. Cleanup the setup and teardown of the new global nidmap and pidmap objects.

This commit was SVN r20219.
2009-01-07 14:58:38 +00:00
Ralph Castain
d1ff02e924 Add a macro to construct a complete 32-bit jobid from a local jobid number. This inserts the mpirun's job family into the upper 16-bit field.
This commit was SVN r20161.
2008-12-20 23:27:25 +00:00
Ralph Castain
aff3d1df21 Remove IOF related utilities from tool communication lib - IOF has now been updated to include tool support directly.
This commit was SVN r20160.
2008-12-20 23:25:56 +00:00
Shiqing Fan
20cea164db - 3/4 commit for Windows Visual Studio and CCP support:
corrections to non-windows files (but within ifdef __WINDOWS__)
  type casts, event library for windows use win32. 
  in orte runtime, add windows sockets handling and object construction.

This commit was SVN r20110.
2008-12-10 21:13:10 +00:00
Ralph Castain
7e3ddb09d3 As requested by Aurelien at the July design meeting - long time coming, but finally got around to it.
Enable one mpirun to act as the server for another mpirun when doing MPI_Publish_name and its associated operations. The user is responsible, of course, for ensuring that the mpirun acting as a server outlives any mpiruns using it in that capacity.

Add a cmd line option to mpirun --report-pid that prints out mpirun's pid. Allow the --ompi-server option to now take pid:# (or PID:#) of the mpirun to be used as the server, and then look that pid up by searching the local mpirun contact infos for it.

This commit was SVN r20102.
2008-12-10 17:10:39 +00:00
Ralph Castain
51789c9049 Cleanup the output for nodename resolve reporting
This commit was SVN r20081.
2008-12-08 19:00:36 +00:00
Ralph Castain
ff8e83ff3b Per request from IBM/Eclipse, provide MCA param to request output when nodes are resolved to a different nodename. This really only happens for the node that mpirun executes on, but they need the alert so they can do string matching of node names.
This commit was SVN r20032.
2008-11-24 19:57:08 +00:00
Ralph Castain
d4dfb1b7a7 Plug a few bugs in the decoding of pidmaps:
1. after we get enough jobs in the pidmap, the address of the jobmap pointer array data can move due to realloc. Need to reset the jobs pointer each time through to ensure it is pointing to valid data

2. when we exit the loop, rc will be set to an error due to reading past end of buffer - need to reset so it is ignored

3. need to ensure we only try to read one jobid each time through loop

This commit was SVN r20030.
2008-11-24 17:57:55 +00:00
Ralph Castain
9a57db4a81 To support comm_spawn in fully routed environments, daemons need to know the route to all procs in their job family. They already had this information, but were not retaining it. The infrastructure to do so has existed for some time - just never had the time to complete it.
This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message.

This commit was SVN r20022.
2008-11-18 15:35:50 +00:00
Jeff Squyres
d4dfd49cdd Fix typo found in Makefile that caused problems with "make distclean";
thanks to Mehdi Bozzo-Rey for reporting the problem.

This commit was SVN r19936.
2008-11-05 20:58:27 +00:00
Ralph Castain
6db5737779 Remove a couple of mutex vars that were defined and used - but never initialized. No clear way to initialize them, and that area of the code should never see threads anyway.
This commit was SVN r19889.
2008-11-03 17:23:10 +00:00
Ralph Castain
f54fda489e This is a first step towards supporting fully-routed OOB communications:
1. remove direct routed module (hooray!)

2. add radix tree routed module (binomial remains default)

3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess

4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support

5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability

6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon)

Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm.

This will require an autogen...

This commit was SVN r19866.
2008-10-31 21:10:00 +00:00
Ralph Castain
6e5d844c36 Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code:
1. completely and cleanly separates responsibilities between the HNP, orted, and tool components.

2. removes all wireup messaging during launch and shutdown.

3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol.

4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0.

5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none".

6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout.

7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output"

This is not intended for the 1.3 release as it is a major change requiring considerable soak time.

This commit was SVN r19767.
2008-10-18 00:00:49 +00:00
Ralph Castain
037231fbcb MOdify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required.
This commit was SVN r19637.
2008-09-25 13:39:08 +00:00
Jeff Squyres
e0a991a8c2 Print out a message telling the user how to enable non-aggregated help
/ error messages.

This commit was SVN r19604.
2008-09-22 17:42:56 +00:00
Jeff Squyres
8eccda391a Fix comment to match the code.
This commit was SVN r19598.
2008-09-20 12:35:48 +00:00
George Bosilca
579d70edad We should use #ifdef and not #if
This commit was SVN r19504.
2008-09-05 12:44:19 +00:00
Shiqing Fan
ce40b8a35e - Fix typo ;-)
This commit was SVN r19438.
2008-08-27 17:06:40 +00:00
Ralph Castain
a5efefe980 Ensure var is init before use
This commit was SVN r19416.
2008-08-26 13:38:11 +00:00
Ralph Castain
28346b5bac Get -host to not use empty nodes called out specifically later in the -host list
This commit was SVN r19403.
2008-08-26 03:02:28 +00:00
Ralph Castain
6039e385cd Per request from Terry, make -host and -hostfile respect order when used as filters. In other words, if you specify -host host1,host3,host2, then we should use the hosts in that order. Previously, we used them in whatever order they were found in the allocation - all the -host did was tell us which nodes to use, not what order to use them in.
Relative node syntax remains supported. Also, if you specify empty nodes, but have a specific empty node called out later, we will not include that node in the empties we add. I'll provide examples in the manpage.

This commit was SVN r19402.
2008-08-26 02:56:10 +00:00
Ralph Castain
6d82efba21 Add relative indexing capabilities for hostfile and -host - we can now reference hosts using a relative syntax.
See the orte_hosts manpage for an explanation

This commit was SVN r19364.
2008-08-19 15:16:27 +00:00
Dan Lacher
7ef29d4abe More fixes for #1387. Minor fixes for the orte_host.7
man page file that was missed in the inital pass.

We are using $(am_dirstamp) instead of creating our own dirstamp since there
is src code in util/hostfile directory is created.  The automake process
creates the $(am_dirstamp), we found the use of this in the generated Makefile
in the util/Makefile

This commit was SVN r19230.
2008-08-08 19:10:02 +00:00
Ralph Castain
01a7259a7d This fixes ticket #1426 - mpirun is cleaning up ALL session dirs
Mpirun - and the orteds - were doing their best to whack all session dirs on their nodes just in case there was something lingering due to an abnormal termination. Unfortunately, they were -too- good at it. They were whacking all session directories under the user's name, even those from other mpiruns!

This adds another layer to the session dir tree so that we can denote which jobs come from our own job family, and restricts the cleanup operation to only session dirs from within our own job family. So we'll still cleanup anything due to our own mpirun, but won't whack any other mpirun from this user.

Call it being polite...

This commit was SVN r19083.
2008-07-29 18:58:35 +00:00
Ralph Castain
1a77b15523 Modify the handling of hostfiles to allow them to subdivide allocations. Utilize the "slots_alloc" field of the orte_node_t object - which had previously been unused - to track the #slots allocated to a given app_context. Let the hostfile filtering action utilize the #slots field to modify the allocated slots for each app_context.
This commit was SVN r19066.
2008-07-28 15:10:40 +00:00
Ralph Castain
3137ed9255 Update the manpages for comm_spawn(_multiple) - add man page to explain host/hostfile behavior
This commit was SVN r18961.
2008-07-21 17:58:12 +00:00
Ralph Castain
8e3658b320 Remove the nodename:pid prefix from show_help output so it doesn't disrupt the formatted output
This commit was SVN r18843.
2008-07-08 22:57:50 +00:00
Ralph Castain
ba5498cdc6 Repair the MPI-2 dynamic operations. This includes:
1. repair of the linear and direct routed modules

2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI

3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature

4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.

Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB

5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.

Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.

6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch

7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch

8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies

9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.

10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon

There is still more cleanup to be done for efficiency, but this at least works.

Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.

Fixes ticket #1256

This commit was SVN r18804.
2008-07-03 17:53:37 +00:00
Ralph Castain
6f85e34d66 Detect homo/hetero scenarios in the nidmap, setup to take appropriate actions in the basic grpcomm module.
NOT for inclusion in v1.3

This commit was SVN r18786.
2008-07-01 02:44:57 +00:00
Ralph Castain
b118779c08 It is okay for us to init the ORTE mca params multiple times. Indeed, it is absolutely required by orterun as the first time has to be done prior to parsing the command line, which means that the mca values haven't been parsed yet!
Add ability for sys admins to prohibit putting session directories under specified locations. Thus, they can now protect parallel file systems from foolish user mistakes.

This commit was SVN r18721.
2008-06-24 17:50:56 +00:00
Ralph Castain
5ebe10ebf1 Fix a bad typo - need to look at the node array as the arch array hasn't been built yet
This commit was SVN r18689.
2008-06-19 21:34:39 +00:00
Ralph Castain
3b5e80fa61 Shift responsibility for preconnecting the oob to the orte routed framework, which is the only place that knows what needs to be done. Only the direct module will actually do anything - it uses the same algo as the original preconnect function.
This commit was SVN r18677.
2008-06-19 13:48:26 +00:00
Ralph Castain
282a220e7e Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation
This commit was SVN r18670.
2008-06-18 15:28:46 +00:00
George Bosilca
0f9b9c0aff Remove a warning and add arequired header (otherwise we cannot compile when
--disable-debug is specified).

This commit was SVN r18665.
2008-06-18 08:10:02 +00:00
Ralph Castain
0532d799d6 Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm.
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.

This commit was SVN r18664.
2008-06-18 03:15:56 +00:00
Ralph Castain
d61fe87d04 Use the opal_show_help system if orte_show_help has not been initialized
This fixes ticket #1342

This commit was SVN r18644.
2008-06-11 12:50:40 +00:00
Ralph Castain
8d9ff44134 Add visibility required for some environments and configs
This commit was SVN r18629.
2008-06-09 21:28:19 +00:00
Ralph Castain
03ab4f5c64 Make the ifdef name mirror the change in filename
This commit was SVN r18626.
2008-06-09 20:36:55 +00:00
Ralph Castain
c13cadc3c7 Refs trac:1255
This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something.

So the ticket can be checked by Jeff when he returns from vacation... :-)

This commit was SVN r18625.

The following Trac tickets were found above:
  Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255
2008-06-09 20:34:14 +00:00
Ralph Castain
9613b3176c Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.

I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.

This commit was SVN r18619.
2008-06-09 14:53:58 +00:00
Ralph Castain
ca91ec525b Add a suffix to the opal_output stream descriptor object - we can now output both a prefix and a suffix for a given stream. Default the suffix to NULL.
Remove lingering references to a filtering system as this will no longer be implemented.

This commit was SVN r18586.
2008-06-04 20:52:20 +00:00
Ralph Castain
8ce4b64b5a Ensure we don't go past the end of the array
This commit was SVN r18569.
2008-06-03 21:31:02 +00:00
Ralph Castain
c992e99035 Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface
This commit was SVN r18557.
2008-06-03 14:24:01 +00:00
Ralph Castain
b2a566b610 Break an infinite loop in orte_output caused by debugging of ORTE comm subsystems
This commit was SVN r18534.
2008-05-29 12:21:17 +00:00