Commit graph

2140 commits

Author SHA1 Message Date
Ralph Castain
e694c0dac6 Get the various grpcomm modules to all inter-operate cleanly with the "hier" module
This commit was SVN r20426.
2009-02-04 22:26:35 +00:00
George Bosilca
c359762c2d We're supposed to read a string and not an int ...
This commit was SVN r20421.
2009-02-04 15:51:31 +00:00
Ralph Castain
c534757b59 Correct use of the return code from opal_pointer_array_add
This commit was SVN r20417.
2009-02-04 14:02:51 +00:00
Ralph Castain
f36b9332ab Pass along the new output-filename and xterm cmd line options to the orteds - otherwise, they won't work in ssh environments.
Modify the rsh launcher to add -X to ssh if xterm option was selected.

This commit was SVN r20407.
2009-02-03 20:06:05 +00:00
Ralph Castain
645f4c1f20 Silence compiler warnings about variables used before init
This commit was SVN r20406.
2009-02-03 20:04:27 +00:00
Ralph Castain
7282be4287 Silence compiler warnings about variables used before init
This commit was SVN r20405.
2009-02-03 20:04:01 +00:00
Ralph Castain
aa2abc8cac Fix xgrid plm by changing orte_pointer_array calls to opal_pointer_array
This commit was SVN r20404.
2009-02-03 18:43:00 +00:00
Shiqing Fan
eab19af55c Include the missing header that is used by the fix commit r20402, and use the correct reference for the parameter of the orte_odls_base_notify_iof_complete function call. Thanks Ralph for r20402.
This commit was SVN r20403.

The following SVN revision numbers were found above:
  r20402 --> open-mpi/ompi@f1084d6b84
2009-02-03 18:14:43 +00:00
Ralph Castain
f1084d6b84 Under Windows, tell the orted that the proc has met its IOF termination conditions when launched since Windows does its own IO forwarding.
This commit was SVN r20402.
2009-02-03 16:41:07 +00:00
Ralph Castain
104a0539e3 Fix a format statement to be compatible with all gcc compiler versions
This commit was SVN r20400.
2009-02-02 15:47:07 +00:00
Ralph Castain
9d381a4ebf Add a '!' option to the xterm iof option to invoke the -hold feature of xterm.
Correct the orte-show-help file when a rank is out of bounds, and do that test where a wildcard doesn't get incorrectly flagged as out-of-bounds.

This commit was SVN r20398.
2009-02-02 15:06:23 +00:00
Ralph Castain
0597fdd778 Ensure that orte-iof barks when given an unrecognized cmd line option
This commit was SVN r20397.
2009-02-02 14:10:54 +00:00
Ralph Castain
b19dc2a4fa Update mpirun's man page for report-pid and report-uri options
This commit was SVN r20396.
2009-02-02 13:49:07 +00:00
Ralph Castain
d207c17adf Fix a segv when an application isn't found - ensure we properly terminate.
This commit was SVN r20395.
2009-02-02 13:44:08 +00:00
Ralph Castain
c3261e1a05 Fix optimized builds
This commit was SVN r20394.
2009-02-01 20:58:17 +00:00
Ralph Castain
debf128e53 Ensure the static port array is correctly checked for size
This commit was SVN r20393.
2009-01-31 03:46:42 +00:00
Ralph Castain
2966206f58 Fix a race condition in the IOF and add some new user-requested features:
1. fix a race condition whereby a proc's output could trigger an event prior to the other outputs being setup, thus causing the IOF to declare the proc "terminated" too early. This was really rare, but could happen.

2. add a new "timestamp-output" option that timestamps each line of output

3. add a new "output-filename" option that redirects each proc's output to a separate rank-named file.

4. add a new "xterm" option that redirects the output of the specified ranks to a separate xterm window.

This commit was SVN r20392.
2009-01-30 22:47:30 +00:00
Rolf vandeVaart
0704b98668 Add the ability to forward SIGTSTP (converted to SIGSTOP) and
SIGCONT to the a.outs.  By default, they are not forwarded and
the behavior remains as it has always been.  However, if one
runs with --mca orte_forward_job_control 1, then mpirun will
catch those two signals and forward them to the orteds which
will deliver them to the a.outs.  We have had requests for
this feature.

This commit was SVN r20391.
2009-01-30 18:50:10 +00:00
Ralph Castain
5e6d3ba289 Initial implementation of static ports. Provide an mca param to specify static port ranges to the OOB - can provide any combination of comma-separated values and ranges. Daemons will use the first port in the range, MPI procs will use the other ports in the range assuming that they know their node rank in time and enough ports were specified.

NOTE: this capability only works under specific conditions. I will outline more about this in a note to devel as the remainder of the implementation progresses. For now, the only environment where this works is slurm. The linear routed module has also been adjusted to work with static ports so that all messaging flows strictly through the topology, including the initial daemon callback - thus limiting the number of sockets opened by mpirun.

This commit was SVN r20390.
2009-01-30 18:31:43 +00:00
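The port assignment described in 5e6d3ba289 can be pictured roughly as follows. This is only a sketch under stated assumptions: the function name and the exact mapping (node rank r takes the entry at index r + 1) are hypothetical and are not taken from the ORTE source; the commit message only says that daemons use the first port and MPI procs use the others.

```c
#include <stddef.h>

/*
 * Hypothetical selection rule: ports[0] is reserved for the daemon, and the
 * proc with node rank r takes ports[r + 1]. Returns -1 when the list is too
 * short or the rank is unusable; what the caller does then is outside this
 * sketch.
 */
static int example_select_static_port(const int *ports, int num_ports,
                                      int is_daemon, int node_rank)
{
    if (ports == NULL || num_ports < 1) {
        return -1;
    }
    if (is_daemon) {
        return ports[0];
    }
    if (node_rank < 0 || node_rank + 1 >= num_ports) {
        return -1;
    }
    return ports[node_rank + 1];
}
```

With the list {5000, 5001, 5002}, the daemon would get 5000 and the procs with node ranks 0 and 1 would get 5001 and 5002; a third proc would find no port left.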
Jeff Squyres
35c5e28a8e Up to SVN r20383
This commit was SVN r20384.

The following SVN revision numbers were found above:
  r20383 --> open-mpi/ompi@e0638c84c8
2009-01-29 17:59:04 +00:00
Jeff Squyres
bb3d258562 Round up a few places where PATH_MAX was used instead of
OMPI_PATH_MAX.  Thanks to Andrea Iob for the bug report.

This commit was SVN r20360.
2009-01-27 22:57:50 +00:00
Ralph Castain
c92f906d7c Move the daemon collectives out of the ODLS and into the GRPCOMM framework. This removes the inherent assumption that the OOB topology is a tree, thus allowing different grpcomm/routed combinations to implement collectives appropriate to their topology.
This commit was SVN r20357.
2009-01-27 19:13:56 +00:00
Ralph Castain
fd5e15ea58 Since parsing comma-delimited, range-capable options is being used in multiple places, create a new utility that consolidates that code.
Have orte-iof use it.

This commit was SVN r20346.
2009-01-25 17:16:25 +00:00
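For readers unfamiliar with the option format handled by the utility in fd5e15ea58, the following is a minimal sketch of parsing a comma-delimited, range-capable list such as "1,3-5,8". The function name, signature, and error conventions are hypothetical; the actual utility added by the commit may look quite different.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: expand a spec such as "1,3-5,8" into individual
 * non-negative values. Writes at most max_out values into out and returns
 * the count, or -1 on malformed input or when out is too small.
 */
static int example_parse_range_list(const char *spec, long *out, int max_out)
{
    int count = 0;
    const char *p = spec;

    while (*p != '\0') {
        char *end;
        long lo = strtol(p, &end, 10);
        long hi;

        if (end == p || lo < 0) {
            return -1;                 /* no number, or a negative one */
        }
        hi = lo;
        p = end;
        if (*p == '-') {               /* "lo-hi" range */
            p++;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo) {
                return -1;
            }
            p = end;
        }
        for (long v = lo; v <= hi; v++) {
            if (count >= max_out) {
                return -1;             /* caller's buffer is too small */
            }
            out[count++] = v;
        }
        if (*p == ',') {
            p++;
            if (*p == '\0') {
                return -1;             /* trailing comma */
            }
        } else if (*p != '\0') {
            return -1;                 /* unexpected character */
        }
    }
    return count;
}
```

Keeping this logic in one place means every option that accepts rank or port lists rejects malformed input in the same way.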
Ralph Castain
0435108834 Improve the efficiency of the launch system by changing the outer loop to being over app_context, and adding a flag to the app_context so the daemon can record that "this app is on my node" when decoding the launch msg.
If the --wdir option is given, check to see if the user provided a relative path. If so, convert it to an absolute path. This is needed to maintain consistent behavior across environments. Some environments automatically chdir to your current working directory when launching the remote orted, while others (e.g., ssh) don't. This levels the playing field and reduces user surprise.

This commit was SVN r20342.
2009-01-25 12:39:24 +00:00
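The relative-to-absolute conversion mentioned in 0435108834 could be sketched as below. This assumes POSIX-style paths and that the base directory (for example, the directory mpirun was started from, as returned by getcwd()) is already known; the function name is hypothetical and not part of ORTE.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated absolute form of wdir.
 * An already-absolute path is copied as-is; a relative one is joined to
 * base. The caller frees the result. Returns NULL on allocation failure.
 */
static char *example_make_absolute(const char *base, const char *wdir)
{
    size_t len;
    char *result;

    if (wdir[0] == '/') {
        len = strlen(wdir) + 1;
        result = malloc(len);
        if (result != NULL) {
            memcpy(result, wdir, len);
        }
        return result;
    }

    len = strlen(base) + 1 + strlen(wdir) + 1;   /* base + '/' + wdir + NUL */
    result = malloc(len);
    if (result != NULL) {
        snprintf(result, len, "%s/%s", base, wdir);
    }
    return result;
}
```

Resolving the path once, before the launch message is sent, gives every remote daemon the same directory regardless of where its launcher happened to start it.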
Ralph Castain
40b6ed4a40 Take another crack at fixing the -wdir problem. Move the context checking code down into just prior to launching each child app. This is necessary so that individual app context wdir options are respected. Also, ensure that we return to our "base" directory after each app is launched so that the relative positions of the wdir options for each app_context are with respect to our base directory, instead of the last wdir option.
Hopefully, this will pass the "BigRed test". :-)

This commit was SVN r20341.
2009-01-24 20:59:27 +00:00
Tim Mattox
c2d105a4d9 Refs trac:1763: Fix -wdir option
Reverted r20306 since the fix caused 100% failures on our BigRed system.

See the comments on ticket #1763 for the details.

This commit was SVN r20339.

The following SVN revision numbers were found above:
  r20306 --> open-mpi/ompi@8c87e48721

The following Trac tickets were found above:
  Ticket 1763 --> https://svn.open-mpi.org/trac/ompi/ticket/1763
2009-01-24 15:04:47 +00:00
Ralph Castain
c6c5bc17a0 Add a new hierarchical collective grpcomm component that performs modex and barrier across the procs instead of the daemons. Modeled on the tuned collectives. Collective code is in grpcomm base for eventual use by the daemon-based components as well.
This commit was SVN r20337.
2009-01-23 21:57:51 +00:00
Ralph Castain
7154cbf2e0 Cleanup a couple of mis-labeled diagnostic outputs
This commit was SVN r20332.
2009-01-23 20:46:54 +00:00
Josh Hursey
04c69b8a82 Fixes for --preload-files and --preload-binary.
* Improved the error propagation from a backend orted
* Fixed a hang in orterun due to failed files transferred
* Fix the movement of files with relative path names
* Improved error messages when a file cannot be moved
* Move file checks to FileM instead of embedding them in the ODLS

This commit Refs trac:1770

This commit was SVN r20331.

The following Trac tickets were found above:
  Ticket 1770 --> https://svn.open-mpi.org/trac/ompi/ticket/1770
2009-01-23 15:32:24 +00:00
Josh Hursey
d066c67b53 We need to update both context->app and context->argv[0] with the new path when we use --preload-binary. This keeps orte from checking the wrong path later in the odls [orte_util_check_context_app() called from odls_base_default_setup_fork()].
Refs trac:1770

This commit was SVN r20321.

The following Trac tickets were found above:
  Ticket 1770 --> https://svn.open-mpi.org/trac/ompi/ticket/1770
2009-01-22 19:18:36 +00:00
Ralph Castain
47740d1e87 Get the inequality the correct way!
This commit was SVN r20319.
2009-01-22 16:33:07 +00:00
Ralph Castain
f6ba4f6f30 Per discussion with Jeff, an invalid local rank value should never occur - if it does, it could be indicative of deeper problems in the launch procedure. Thus, rather than allowing the launch to proceed, let's abort.
This commit was SVN r20312.
2009-01-22 00:52:46 +00:00
Jeff Squyres
90e69ac6ff Fix some man page nits noticed by the Debian OMPI maintainers. Thanks
Dirk!

This commit was SVN r20307.
2009-01-21 18:38:37 +00:00
Ralph Castain
8c87e48721 Fix a user-reported bug whereby the -wdir option would only be applied from the last app_context.
This commit was SVN r20306.
2009-01-21 15:52:12 +00:00
Josh Hursey
abfc7c6076 Per ticket #1527 orte-restart should be using --default-hostfile instead of --hostfile with app contexts.
Thanks to Gregor Dschung for reporting the problem.

This commit was SVN r20305.
2009-01-21 14:08:16 +00:00
Ralph Castain
5d9de3326c Check for valid local/node ranks before using the returned values
This commit was SVN r20304.
2009-01-21 00:54:50 +00:00
Ralph Castain
a6a7335694 Catch a potential bug spanning several ESS modules. The node_rank and local_rank types were changed to uint16_t, however the modules returned UINT8_MAX as an "invalid" value. To clean this up, define an INVALID value for these types, and change the various modules so they return this value to indicate an invalid response.
This commit was SVN r20303.
2009-01-21 00:19:37 +00:00
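The mismatch fixed in a6a7335694 is easy to see in a small example. The sketch below assumes a 16-bit rank type and uses UINT16_MAX as the sentinel; the type and macro names are hypothetical, and the value ORTE actually chose for its INVALID constant is not stated in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the 16-bit local/node rank type. */
typedef uint16_t example_local_rank_t;

/*
 * A sentinel sized to the type. With a 16-bit rank, UINT8_MAX (255) is a
 * perfectly legitimate rank, so it cannot double as the "invalid" marker.
 */
#define EXAMPLE_LOCAL_RANK_INVALID UINT16_MAX

/* Callers test against the named sentinel instead of a hard-coded number. */
static bool example_local_rank_is_valid(example_local_rank_t rank)
{
    return rank != EXAMPLE_LOCAL_RANK_INVALID;
}
```

Defining the sentinel next to the type means that widening the type later only requires changing one constant.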
Ralph Castain
4da9f53fa4 Implement the xml formatted output of stdout/err/diag. Force -tag-output if -xml is set.
This commit was SVN r20302.
2009-01-20 16:58:31 +00:00
Ralph Castain
88a0af9726 Revise the way we output resolved hostnames to make life easier for the Eclipse folks. Store aliases for individual nodes (only when requested to show resolved hostnames) and then report them out as part of the display-map option.
This commit was SVN r20284.
2009-01-15 18:11:50 +00:00
Ralph Castain
253a54df12 Shutdown the socket before closing for cleaner termination.
This commit was SVN r20283.
2009-01-15 18:06:01 +00:00
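The ordering described in 253a54df12 corresponds to the usual POSIX pattern sketched below. The function name is hypothetical and the snippet assumes a POSIX sockets environment; it is not the code from the commit.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: shut down both directions before closing, so the
 * peer sees an orderly end of stream. shutdown() can fail (for example
 * with ENOTCONN on an unconnected socket); that is ignored here because
 * the descriptor is closed either way.
 */
static int example_close_socket(int sd)
{
    (void)shutdown(sd, SHUT_RDWR);
    return close(sd);
}
```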
Ralph Castain
a9af219ba7 Fix CID 723: a pointless whine about not checking a return code
This commit was SVN r20274.
2009-01-14 19:06:36 +00:00
Jeff Squyres
a568ba0468 Fix CID 25: it's not possible for sav to be non-NULL by the time it
gets here.

This commit was SVN r20273.
2009-01-14 18:57:48 +00:00
Jeff Squyres
0c8f8fe1ea Fix CID 733: remove some dead code (proc_name was set but effectively
never used).

This commit was SVN r20271.
2009-01-14 18:12:06 +00:00
Josh Hursey
a9da2dada1 Remove some unused variables.
This commit was SVN r20270.
2009-01-14 17:28:40 +00:00
Tim Mattox
5b70160626 For two error conditions in the ras_loadleveler_module, output
the error code reported by loadleveler.  Also, clean up a
few more internal error messages.

This commit was SVN r20255.
2009-01-13 15:44:26 +00:00
Brian Barrett
d3310a5ad1 fixes to get compiling on Red Storm again
This commit was SVN r20252.
2009-01-12 22:30:00 +00:00
Ralph Castain
694008e9bb Fix a reported bug whereby keyboard entry to a remote proc was being lost after the first iteration. In other words, if an application has a proc reading stdin from the keyboard, and that proc is not co-located with mpirun, then the system would hang.
The problem was eventually traced to two bugs in the code:

1. the orted wasn't resetting the write event flag, thus preventing itself from turning it on again.

2. the HNP needed to check if the stdin was attached to tty or not before adding the delay for fairness. If it is attached to a tty, there is no need for the delay. This prevents some strangely slow typing response.

This patch needs to move to 1.3

This commit was SVN r20246.
2009-01-12 20:12:58 +00:00
Josh Hursey
1420c32a5d Update SnapC Local Coordinator in reaction to structure changes in r20228. The list of local children became more globalized so I needed to update the loop invariants appropriately.
This commit was SVN r20245.

The following SVN revision numbers were found above:
  r20228 --> open-mpi/ompi@007d68becc
2009-01-12 19:45:48 +00:00
Ralph Castain
2778c13fac Continue to refine the timing instrumentation to identify where launch time is being spent
This commit was SVN r20244.
2009-01-12 19:12:58 +00:00
Jeff Squyres
d1c6f3f89a * Fix a truckload of Cisco copyrights to be the same as the rest of
the code base.
 * Fix a few misspellings in other copyrights.

This commit was SVN r20241.
2009-01-11 02:30:00 +00:00
Tim Mattox
820b209564 Oops, forgot to update the copyright date range...
This commit was SVN r20239.
2009-01-09 19:04:52 +00:00
Tim Mattox
af45569366 Clean up some debugging output in the loadleveler ras module.
Error output strings were changed to be unique per code site.
They are still pretty meaningless to the user, but at least now
developers might be able to find which unique place in the code
reported which error.

This commit was SVN r20238.
2009-01-09 19:03:52 +00:00
Ralph Castain
c009b51ad3 Silence warning about signed vs unsigned comparisons
This commit was SVN r20237.
2009-01-09 16:01:03 +00:00
George Bosilca
78d856e04c Release resources when a job is completed. This allows us to correctly
count and load balance MPI-2 dynamic type of applications.

This commit was SVN r20236.
2009-01-08 21:21:54 +00:00
Ralph Castain
25f578a7d2 Continue to improve timing instrumentation. Add ability to store timing data directly to a file instead of just to stdout.
This commit was SVN r20229.
2009-01-08 14:27:52 +00:00
Ralph Castain
007d68becc Make the data on local children and their jobs available globally on both daemons and the HNP. This simply shifts the data structures from the ODLS base to the orte globals area to support subsequent movement of the daemon collective operations from the odls to the grpcomm framework. As that will be a larger change, it will be implemented on a branch and rolled over separately.
This commit was SVN r20228.
2009-01-08 14:25:56 +00:00
Ralph Castain
80fb98ae32 Cleanup the modex-less operations for efficiency. Have the component default to normal modex operations if modex-less isn't specified.
This commit was SVN r20220.
2009-01-07 15:00:26 +00:00
Ralph Castain
7818779760 Expose the nidmap and pidmap as orte globals so that components in other frameworks can access and/or manipulate them without forcing API modifications - modify the individual ess components that were affected so they use the global variables. Add a list of attributes to the nids for storing node-related data (e.g., modex attrs), and define a new object for that purpose.
Consolidate the nid/pid lookup code with the rest of the nid/pid code so that changes are easier to track. Add the ability to send cluster profile info as part of the nidmap. Cleanup the setup and teardown of the new global nidmap and pidmap objects.

This commit was SVN r20219.
2009-01-07 14:58:38 +00:00
Ralph Castain
09d4a45fa5 Switch to non-blocking sends so the orted's can begin processing their own messages sooner
This commit was SVN r20218.
2009-01-07 14:52:12 +00:00
Ralph Castain
9dbcee9110 Increase efficiency for modex-less launch by storing byte objects in the profile file
This commit was SVN r20206.
2009-01-05 21:46:12 +00:00
Ralph Castain
dc3ba492a7 CID 1206: it's a complicated error path, but if a daemon is passed an ompi-top command and cannot correctly unpack the name of the tool, there really isn't anything it can do about it. Just return and let the tool hang.
This commit was SVN r20202.
2009-01-05 15:35:02 +00:00
Ralph Castain
5f5d8ad231 CID 1139-1141: remove outdated variable from the various routed components
This commit was SVN r20201.
2009-01-05 15:09:54 +00:00
Ralph Castain
1bc125c0a7 CID 1131: cleanup a minor memory leak
This commit was SVN r20200.
2009-01-05 15:05:05 +00:00
Jeff Squyres
6d0d8848ac Fix CID 1129: Remove variable that is set but never used.
This commit was SVN r20194.
2009-01-03 15:39:51 +00:00
Jeff Squyres
e52ac6da40 Fix CID 1130: remove variable that is set but never used.
This commit was SVN r20193.
2009-01-03 15:37:00 +00:00
Jeff Squyres
1bacdef317 Fix CID 1188. Minor issue; just convert to snprintf instead of sprintf.
This commit was SVN r20185.
2009-01-03 14:46:46 +00:00
Ralph Castain
91ada6c323 Ensure we avoid overflows, handle the odd number of nodes case
This commit was SVN r20171.
2008-12-31 01:11:57 +00:00
Ralph Castain
b012ed6c94 Add a somewhat unique launch time test
This commit was SVN r20170.
2008-12-30 21:42:51 +00:00
Ralph Castain
bb96474d6e Per request from Aurelien, make orterun report-pid and report-uri functions work the same as that of ompi-server. Since these are used for ompi-server-like functionality, it makes sense that the report options work the same. Make orte-top take the corresponding input the same way too for consistency.
The modified cmd line options are:

--report-uri x where x is either '-' for stdout, '+' for stderr, or a filename
--report-pid x where x is the same as above

For orte-top, you can now provide either a pid or a uri (which allows connection to remote mpiruns), specified either directly or with a "file:x" option as per mpirun's ompi-server option.

Note: I did not add a report-pid option to ompi-server as it probably wouldn't be useful - the report-uri option works as well, and allows remote access (which is likely the normal way it would be used).

This commit was SVN r20168.
2008-12-24 15:27:46 +00:00
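The argument convention for --report-uri and --report-pid in bb96474d6e ('-' for stdout, '+' for stderr, otherwise a filename) can be sketched as a small dispatch function. The function name is hypothetical, and opening the file in write mode is an assumption rather than something the commit message states.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: map the option argument to an output stream.
 * "-" selects stdout, "+" selects stderr, anything else is treated as a
 * filename. Returns NULL if the argument is missing or the file cannot be
 * opened. The caller should fclose() the stream only when it is neither
 * stdout nor stderr.
 */
static FILE *example_open_report_target(const char *arg)
{
    if (arg == NULL) {
        return NULL;
    }
    if (strcmp(arg, "-") == 0) {
        return stdout;
    }
    if (strcmp(arg, "+") == 0) {
        return stderr;
    }
    return fopen(arg, "w");
}
```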
Ralph Castain
7787f84540 Per the earlier RFC and some discussion at the Dec ORTE design meeting, add the ompi-top tool and all its supporting infrastructure. This includes a new OPAL pstat framework and data type, currently with rather weak support for Mac OSX and pretty complete support for Linux. The Sun team promised to add Solaris support as well.
Also, per chat with Jeff, modified the Makefile.am's of a few orte tools so that they were consistent in the way we generate the ompi-equivalent cmds.

This commit was SVN r20165.
2008-12-22 20:23:05 +00:00
Ralph Castain
d1ff02e924 Add a macro to construct a complete 32-bit jobid from a local jobid number. This inserts the mpirun's job family into the upper 16-bit field.
This commit was SVN r20161.
2008-12-20 23:27:25 +00:00
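As a rough illustration of the macro described in d1ff02e924: a minimal sketch, assuming a 32-bit jobid whose upper 16 bits hold mpirun's job family and whose lower 16 bits hold the local jobid. All names below are hypothetical and are not the identifiers used in the ORTE source.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 32-bit jobid type. */
typedef uint32_t example_jobid_t;

/* Put the 16-bit job family in the upper half, the local jobid in the lower. */
#define EXAMPLE_CONSTRUCT_JOBID(family, local)                        \
    ((example_jobid_t)((((uint32_t)(family) & 0xffffu) << 16) |       \
                       ((uint32_t)(local) & 0xffffu)))

/* Recover the two fields from a full jobid. */
#define EXAMPLE_JOB_FAMILY(jobid)  ((uint16_t)(((uint32_t)(jobid) >> 16) & 0xffffu))
#define EXAMPLE_LOCAL_JOBID(jobid) ((uint16_t)((uint32_t)(jobid) & 0xffffu))
```

For example, job family 0x1234 combined with local jobid 1 yields 0x12340001.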
Ralph Castain
aff3d1df21 Remove IOF related utilities from tool communication lib - IOF has now been updated to include tool support directly.
This commit was SVN r20160.
2008-12-20 23:25:56 +00:00
Ralph Castain
caa5771908 Don't force tools to dump core files when they abort
This commit was SVN r20159.
2008-12-20 23:24:36 +00:00
Ralph Castain
9f6c1b9d07 Per discussion at the Dec ORTE design meeting, add a "set_lifeline" API to the orte_routed framework. This allows the caller to define a "lifeline" process so that, if the connection to that lifeline is subsequently lost, the process will be terminated. This helps tools that connect to an mpirun to know when that mpirun completes and terminates.
This commit was SVN r20158.
2008-12-20 23:23:11 +00:00
Brian Barrett
64f7848a84 Number of small fixes to get the trunk to build again on Catamount
This commit was SVN r20141.
2008-12-16 20:09:56 +00:00
Ralph Castain
e878ee4fa3 Revert r20128. Setting a default hostfile name breaks all the filtering code we added to the system. It would require multiple entries in several places to ensure that, should the default hostfile in fact not exist, the system will still work correctly.
Too much complexity - just put the name in the default mca param file iff you actually have a default hostfile.

This commit was SVN r20129.

The following SVN revision numbers were found above:
  r20128 --> open-mpi/ompi@ea01da0eee
2008-12-15 17:37:21 +00:00
Ralph Castain
ea01da0eee Set default name for "default-hostfile" param to "openmpi-default-hostfile" to retain backwards compatibility with OMPI 1.2
This commit was SVN r20128.
2008-12-15 17:08:59 +00:00
Shiqing Fan
5ae5f0e173 - 4/4 commit for Windows Visual Studio and CCP support:
unnecessary clean up to non windows related files (within ifdef __WINDOWS__).

This commit was SVN r20111.
2008-12-10 21:13:27 +00:00
Shiqing Fan
20cea164db - 3/4 commit for Windows Visual Studio and CCP support:
corrections to non-windows files (but within ifdef __WINDOWS__)
  type casts, event library for windows use win32. 
  in orte runtime, add windows sockets handling and object construction.

This commit was SVN r20110.
2008-12-10 21:13:10 +00:00
Shiqing Fan
8673f19f50 - 2/4 commit for Windows Visual Studio and CCP support:
changes to the already existing ccp components
  event/win32.c: merge old FD handling into new
  opal_installdirs_windows.c:fix the registry handling

This commit was SVN r20109.
2008-12-10 21:01:54 +00:00
Shiqing Fan
a5281f0434 - 1/4 commit for Windows Visual Studio and CCP support:
CMakeLists and .windows files.
  In contribs preconfigured and precompiled parts.

This commit was SVN r20108.
2008-12-10 20:59:20 +00:00
Ralph Castain
728a24c8ec After considerable patience and help with debugging/testing from Tim M and Jeff S, return a completed and pretty well tested patch of the IOF to the trunk. This commit includes the previously reverted r20074, r20068, and r20064, as well as changes to fix those commits.
Basically, the remaining problem turned out to be:

1. closing stdout/stderr during orte_finalize of mpirun

2. inadvertently setting up a write event on fd = -1

3. devising a scheme to more accurately track when the stdin write event was active vs closed so it only got released once

This passed prelim MTT testing by Jeff and Tim, but should soak for awhile before migrating to 1.3.

This commit was SVN r20106.

The following SVN revision numbers were found above:
  r20064 --> open-mpi/ompi@a07660aea8
  r20068 --> open-mpi/ompi@ec930d14a9
  r20074 --> open-mpi/ompi@2940309613
2008-12-10 20:40:47 +00:00
Ralph Castain
9d7cb82bba Modify the daemon cmd processor to relay and then process the cmd locally. We couldn't do this before due to the daemon's needing to update contact info prior to doing the relay. However, the new routed system plus the inclusion of the nidmap in the launch message now makes this possible.
It is a small launch performance improvement as now we relay the launch cmd across to the next daemon before taking the time to launch our own local procs. Still, it does allow more parallel operations during the launch procedure.

This commit was SVN r20104.
2008-12-10 19:18:36 +00:00
Josh Hursey
67ae66326c remove unused variable
This commit was SVN r20103.
2008-12-10 18:08:46 +00:00
Ralph Castain
7e3ddb09d3 As requested by Aurelien at the July design meeting - long time coming, but finally got around to it.
Enable one mpirun to act as the server for another mpirun when doing MPI_Publish_name and its associated operations. The user is responsible, of course, for ensuring that the mpirun acting as a server outlives any mpiruns using it in that capacity.

Add a cmd line option to mpirun --report-pid that prints out mpirun's pid. Allow the --ompi-server option to now take pid:# (or PID:#) of the mpirun to be used as the server, and then look that pid up by searching the local mpirun contact infos for it.

This commit was SVN r20102.
2008-12-10 17:10:39 +00:00
Ralph Castain
1ace83c470 Enable modex-less launch. Consists of:
1. minor modification to include two new opal MCA params:
   (a) opal_profile: outputs what components were selected by each framework
       currently enabled for most, but not all, frameworks
   (b) opal_profile_file: name of file that contains profile info required
       for modex

2. introduction of two new tools:
   (a) ompi-probe: MPI process that simply calls MPI_Init/Finalize with
       opal_profile set. Also reports back the rml IP address for all
       interfaces on the node
   (b) ompi-profiler: uses ompi-probe to create the profile_file, also
       reports out a summary of what framework components are actually
       being used to help with configuration options

3. modification of the grpcomm basic component to utilize the
   profile file in place of the modex where possible

4. modification of orterun so it properly sees opal mca params and
   handles opal_profile correctly to ensure we don't get its profile

5. similar mod to orted as for orterun

6. addition of new test that calls orte_init followed by calls to
   grpcomm.barrier

This is all completely benign unless actively selected. At the moment, it only supports modex-less launch for openib-based systems. Minor mod to the TCP btl would be required to enable it as well, if people are interested. Similarly, anyone interested in enabling other BTL's for modex-less operation should let me know and I'll give you the magic details.

This seems to significantly improve scalability provided the file can be locally located on the nodes. I'm looking at an alternative means of disseminating the info (perhaps in launch message) as an option for removing that constraint.

This commit was SVN r20098.
2008-12-09 23:49:02 +00:00
Ralph Castain
e28210d0dc Revert r20074, r20068, and r20064: remove the IOF proc completion code pending further off-trunk work.
This commit was SVN r20089.

The following SVN revision numbers were found above:
  r20064 --> open-mpi/ompi@a07660aea8
  r20068 --> open-mpi/ompi@ec930d14a9
  r20074 --> open-mpi/ompi@2940309613
2008-12-09 17:11:59 +00:00
Ralph Castain
61c21d787d Add missing param in tm launcher
This commit was SVN r20087.
2008-12-09 13:31:33 +00:00
Ralph Castain
6e050bc78c Update the route when it comes from a different job family.
This fixes ticket #1699

This commit was SVN r20085.
2008-12-09 01:16:18 +00:00
Ralph Castain
ce4018efeb Take a step back on the slurm and tm launchers. Problems were occurring in the MTT runs, although not under non-MTT scenarios. Preserve the modified plm versions in new components that are ompi_ignored until we can resolve the problems.
This will allow for better MTT coverage until the problem can be better understood.

This commit was SVN r20083.
2008-12-09 00:32:04 +00:00
Ralph Castain
89792bbc72 May as well have the other "clean" outputs use the same channel
This commit was SVN r20082.
2008-12-08 19:37:22 +00:00
Ralph Castain
51789c9049 Cleanup the output for nodename resolve reporting
This commit was SVN r20081.
2008-12-08 19:00:36 +00:00
Ralph Castain
c2b18b363d Initialize a variable before use
This commit was SVN r20080.
2008-12-08 16:16:40 +00:00
Ralph Castain
2940309613 Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS:
1. a direct callback from waitpid - this set the waitpid_fired flag

2. a notify event callback from the IOF - this set the iof complete flag

3. a message via the daemon cmd processor from the proc "de-registering" the sync, thus indicating it was going through MPI_Finalize.

The problem is that these could overlap, with the first two allowing the orted to declare the proc complete before the daemon had responded to #3.

This change forces all three events to flow through the daemon cmd processor, thus ensuring an ordered handling. I'm not certain this will solve the problem, but will await further MTT reports to see. Unfortunately, the problem doesn't show up on any manual or script-based tests I have been able to run, even when I duplicate the exact cmd that fails under MTT.

This commit was SVN r20074.
2008-12-05 04:20:00 +00:00
Ralph Castain
ec930d14a9 Ensure IOF tags are properly assigned to sinks and read events
This commit was SVN r20068.
2008-12-04 01:09:20 +00:00
Ralph Castain
a07660aea8 Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes:
1. coordination of job completion notification to include a requirement for both waitpid detection AND notification that all iof pipes have been closed by the app

2. change of all IOF read and write events to be non-persistent so they can properly be shutdown and restarted only when required

3. addition of a delay (currently set to 10ms) before restarting the stdin read event. This was required to ensure that the stdout, stderr, and stddiag read events had an opportunity to be serviced in scenarios where large files are attached to stdin.

This commit was SVN r20064.
2008-12-03 17:45:42 +00:00
Josh Hursey
44109e0084 Fix the ft_event function in response to r20022. Also make the structure cleanup match the finalize() function a bit more closely.
This seems to fix the segv seen on process restart.

This commit was SVN r20051.

The following SVN revision numbers were found above:
  r20022 --> open-mpi/ompi@9a57db4a81
2008-12-02 21:18:32 +00:00
Ralph Castain
ff8e83ff3b Per request from IBM/Eclipse, provide MCA param to request output when nodes are resolved to a different nodename. This really only happens for the node that mpirun executes on, but they need the alert so they can do string matching of node names.
This commit was SVN r20032.
2008-11-24 19:57:08 +00:00
Ralph Castain
d4dfb1b7a7 Plug a few bugs in the decoding of pidmaps:
1. after we get enough jobs in the pidmap, the address of the jobmap pointer array data can move due to realloc. Need to reset the jobs pointer each time through to ensure it is pointing to valid data

2. when we exit the loop, rc will be set to an error due to reading past end of buffer - need to reset so it is ignored

3. need to ensure we only try to read one jobid each time through loop

This commit was SVN r20030.
2008-11-24 17:57:55 +00:00
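The first bug listed in d4dfb1b7a7 is a general hazard with growable arrays: once realloc moves the storage, any pointer taken earlier is stale. The sketch below uses a hypothetical integer array rather than the real jobmap pointer array, and simply shows the pattern of re-reading the data pointer on every pass through the loop.

```c
#include <stdlib.h>

/* Hypothetical growable array standing in for the jobmap pointer array. */
typedef struct {
    int   *data;
    size_t size;
    size_t capacity;
} example_array_t;

/* Append a value, growing the storage when full. realloc may move data. */
static int example_array_add(example_array_t *arr, int value)
{
    if (arr->size == arr->capacity) {
        size_t new_cap = arr->capacity ? arr->capacity * 2 : 4;
        int *tmp = realloc(arr->data, new_cap * sizeof(*tmp));
        if (tmp == NULL) {
            return -1;
        }
        arr->data = tmp;
        arr->capacity = new_cap;
    }
    arr->data[arr->size++] = value;
    return 0;
}

/* Add 0..n-1 and sum them, refreshing the data pointer on each iteration. */
static long example_fill_and_sum(example_array_t *arr, int n)
{
    long sum = 0;

    for (int i = 0; i < n; i++) {
        int *items;

        if (example_array_add(arr, i) != 0) {
            return -1;
        }
        items = arr->data;   /* re-read here; hoisting this out of the loop
                                would leave a stale pointer after a realloc */
        sum += items[arr->size - 1];
    }
    return sum;
}
```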
George Bosilca
7a30a98a89 Use the generic cast.
This commit was SVN r20028.
2008-11-24 15:52:36 +00:00
Ralph Castain
7213c109ac Revamp the TM plm module so that we detect orted termination without requiring a callback message by using the TM native capabilities. This allows TM to function with fully routed OOB comm, and to tell us what node failed to spawn a daemon.
This commit was SVN r20027.
2008-11-20 18:57:35 +00:00
Ralph Castain
5e6536eeda Ensure that mpirun properly accounts for itself when exiting without reply.
Move some debug output around so it is always seen.

This commit was SVN r20026.
2008-11-20 18:55:59 +00:00
Ralph Castain
9a57db4a81 To support comm_spawn in fully routed environments, daemons need to know the route to all procs in their job family. They already had this information, but were not retaining it. The infrastructure to do so has existed for some time - just never had the time to complete it.
This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message.

This commit was SVN r20022.
2008-11-18 15:35:50 +00:00
Ralph Castain
9ba78f6e5f Ensure exit-no-reply gets relayed to downstream orteds prior to exiting ourselves
This commit was SVN r20021.
2008-11-18 14:54:52 +00:00
Ralph Castain
89559396ea Resolve a race condition when running under a SLURM environment.
The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit.

If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state.

Mucho bad. :-/

This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths).

This commit was SVN r20018.
2008-11-18 13:59:23 +00:00
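The termination change in 89559396ea amounts to "do not exit until the launcher child has been reaped". The sketch below shows that idea with plain POSIX calls, assuming a hypothetical helper `launch_and_wait`. The real slurm plm is event-driven and its code is not reproduced here.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork/exec a launcher command (srun in the commit above) and block
 * until it has fully exited. Returns the child's exit status, or -1 if
 * the fork or wait fails or the child was killed by a signal.
 * Hypothetical helper for illustration only. */
static int launch_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* exec failed */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
        /* interrupted by a signal: keep waiting for the child */
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the parent's exit is tied to the child's exit rather than to messages from the daemons, the daemons are free to shut down in any order.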
Ralph Castain
182b15e252 Remove duplicate definition of orte_xml_output - thanks Shiqing for catching it!
This commit was SVN r20017.
2008-11-18 13:53:13 +00:00
Ralph Castain
68423f7544 Partially restore the iof changes - this repairs the initial observation of inconsistent and incomplete output
This commit was SVN r19999.
2008-11-14 20:36:18 +00:00
Ralph Castain
586334d1c8 Per discussion with Tim Mattox, reset the trunk to pre-19991 level for the iof only. I will shortly add a changeset that will repair the one known error where we were incorrectly closing the stdout/err/diag file descriptors when all we wanted to do was close stdin. I will leave out the changes associated with coordinating proc termination due to race conditions IU encountered during MTT testing. I have been unable to replicate those so far, but we hope to resolve it in the near future.
This commit was SVN r19998.
2008-11-14 20:22:36 +00:00
Ralph Castain
891630ae85 Handle a race condition between mpirun detecting stdin closed (and releasing the read event), and receiving an xon/xoff notice from a remote orted that detects proc termination and tells mpirun "don't send any more input - the proc is gone". This latter was necessary since we might have hung an infinite source of input on mpirun, while the proc terminated after some point in time.
This commit was SVN r19997.
2008-11-14 15:19:53 +00:00
Ralph Castain
101b6fdeb8 Cleanup a little on how we handle the stdin write when we encounter end-of-input. Ensure that mpirun handles it correctly if the proc receiving stdin is local to mpirun
This commit was SVN r19996.
2008-11-14 14:31:33 +00:00
Ralph Castain
875741a5e3 Don't set the stdin fd to -1 before calling the object destructor as that function calls event delete, which uses the fd as an index into the event array.
This commit was SVN r19994.
2008-11-13 19:34:29 +00:00
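The ordering problem fixed in 875741a5e3 can be shown with a toy event table indexed by file descriptor. Everything in this sketch (`registered`, `event_add`, `event_del`, `sink_t`, `sink_destruct`) is a hypothetical stand-in, not the ORTE IOF or libevent code.

```c
#define MAX_FDS 16

/* Toy event table: registered[fd] is 1 while an event exists for fd. */
static int registered[MAX_FDS];

typedef struct {
    int fd;
} sink_t;

static int event_add(int fd)
{
    if (fd < 0 || fd >= MAX_FDS) {
        return -1;
    }
    registered[fd] = 1;
    return 0;
}

/* Removing an event uses the fd as the index, so it needs the real value. */
static int event_del(int fd)
{
    if (fd < 0 || fd >= MAX_FDS) {
        return -1;                  /* a -1 fd cannot be looked up */
    }
    registered[fd] = 0;
    return 0;
}

/* Destructor: delete the event first, invalidate the fd afterwards. */
static int sink_destruct(sink_t *s)
{
    int rc = event_del(s->fd);
    s->fd = -1;
    return rc;
}
```

If the caller overwrites `s->fd` with -1 before calling `sink_destruct`, the lookup fails and the original event stays registered.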
Ralph Castain
b8ae4604ed Correct the notifier default module to include the newly added API
This commit was SVN r19993.
2008-11-13 18:03:41 +00:00
Ralph Castain
702fc7154c Remove stale function definition
This commit was SVN r19992.
2008-11-13 05:07:11 +00:00
Ralph Castain
555bbf0c02 Fix the iof race conditions wrt proc termination. This is comprised of two sections:
1. modify the iof to track when a proc actually closes all of its open iof output pipes. When this occurs, notify the odls that the proc's iof is complete. This is done via a zero-time event so that we can step out of the read event before processing the notification.

2. in the odls, modify the waitpid callback so it only flags that it was called. Add a function to receive the iof-complete notification, and a function that checks for both iof complete and waitpid callback before declaring a proc fully terminated. This ensures that we read and deliver -all- of the IO prior to declaring the job complete.

Also modified the odls call to orte_iof.close (and the component's implementation) so it only closes stdin, leaving the other io channels alone. This fixes the other half of the known problem.

This should fix the ticket on this subject, but I'll wait to close it pending further testing in the trunk.

This commit was SVN r19991.
2008-11-12 23:32:01 +00:00
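The two-condition termination rule described in 555bbf0c02 can be sketched as a pair of flags that are checked together. `child_t`, `on_waitpid` and `on_iof_complete` are hypothetical names, and the zero-time event used in the real code is left out.

```c
#include <stdbool.h>

/* Per-process bookkeeping; hypothetical, not the odls data structure. */
typedef struct {
    bool waitpid_fired;   /* the OS reported that the process exited */
    bool iof_complete;    /* every output pipe has been drained and closed */
    bool terminated;      /* set only when both of the above hold */
} child_t;

/* Declare the process terminated only when both conditions are met. */
static void check_complete(child_t *c)
{
    if (c->waitpid_fired && c->iof_complete) {
        c->terminated = true;
    }
}

/* The waitpid callback only records that it was called. */
static void on_waitpid(child_t *c)
{
    c->waitpid_fired = true;
    check_complete(c);
}

/* Called when the IOF reports that all output pipes are closed. */
static void on_iof_complete(child_t *c)
{
    c->iof_complete = true;
    check_complete(c);
}
```

Either notification may arrive first. Neither one alone marks the process as terminated, so buffered output cannot be dropped just because the process exited before its pipes were read.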
Ralph Castain
26cd1c1955 Fix a typo and some formatting
This commit was SVN r19990.
2008-11-12 22:01:40 +00:00
Ralph Castain
ce26e3a2fb Update the notifier framework in prep for move to v1.3. Add an API to handle the case where error messages have been expressed via "show_help" so they can look similar to what was presented to users. Add three key calls in the openib btl to drop messages into syslog.
This will sit in trunk for a few days - would like to actually see some errors reported to syslog before moving the code to 1.3

This commit was SVN r19986.
2008-11-12 18:03:51 +00:00
Josh Hursey
d5c38c2601 fix some typos. should be moved to v1.3
This commit was SVN r19964.
2008-11-10 19:05:26 +00:00
Josh Hursey
077b3df7cc Fix C/R restart case by passing the correct address to the orte_ess_base_build_nidmap() function. This cropped up from r19866.
It does not look like this affects the v1.3 branch since r19866 has not moved to the release branch.

Thanks to Leonardo Fialho for reporting this and supplying a patch.

This commit was SVN r19961.

The following SVN revision numbers were found above:
  r19866 --> open-mpi/ompi@f54fda489e
2008-11-10 15:19:28 +00:00
Ralph Castain
5889dcd30b Fix a warning reported by Jeff that actually could cause singleton operations to fail. Ensure that the byte object used to init the job map for singletons is properly initialized.
This commit was SVN r19957.
2008-11-08 01:09:06 +00:00
Jeff Squyres
f4ba25cf3c Remove linking components against ORTE and OPAL libs. This was
removed from all other components long ago; I'm not sure how these
survived.

This commit was SVN r19956.
2008-11-08 00:56:57 +00:00
Jeff Squyres
d4dfd49cdd Fix typo found in Makefile that caused problems with "make distclean";
thanks to Mehdi Bozzo-Rey for reporting the problem.

This commit was SVN r19936.
2008-11-05 20:58:27 +00:00
Ralph Castain
25491628b8 Discovered while documenting the "preconnect" mca params that several of them didn't make sense any more. After chatting with Jeff, we agreed to the following:
1. register "mpi_preconnect_all" as a deprecated synonym for "mpi_preconnect_mpi"

2. remove "mpi_preconnect_oob" and "mpi_preconnect_oob_simultaneous" as these are no longer valid.

3. remove the routed framework's "warmup_routes" API. With the removal of the direct routed component, this function at best only wasted communications. The daemon routes are completely "warmed up" during launch, so having MPI procs order the sending of additional messages is simply wasteful.

4. remove the call to orte_routed.warmup_routes from MPI_Init. This was the only place it was used anyway.

The FAQs will be updated to reflect this changed situation, and a CMR filed to move this to the 1.3 branch.

This commit was SVN r19933.
2008-11-05 19:41:16 +00:00
Ralph Castain
b48bbec366 Cleanup modex logic to allow modex-less launch:
1. minor change in base_modex to only set modex_reqd when it -is- reqd

2. cleanup logic in grpcomm-basic module

This commit was SVN r19903.
2008-11-03 21:48:52 +00:00
Ralph Castain
6db5737779 Remove a couple of mutex vars that were defined and used - but never initialized. No clear way to initialize them, and that area of the code should never see threads anyway.
This commit was SVN r19889.
2008-11-03 17:23:10 +00:00
Ralph Castain
b700fff1d1 Update the orterun man page - it was missing numerous options and had several out-of-date ones in it
This commit was SVN r19888.
2008-11-03 16:44:27 +00:00
Kenneth Matney
c650ef58c5 Build requires prototypes, defined by "orte/util/nidmap.h".
This commit was SVN r19887.
2008-11-03 16:23:42 +00:00
Ralph Castain
03257ebb9f Since orte-iof doesn't yet know how to handle multiple ranks, adjust the help message and the code to correctly handle one specified rank
This commit was SVN r19886.
2008-11-03 14:26:01 +00:00
Ralph Castain
55f52d7a4b Ensure we know how to route to a different job family when it connects to us
This commit was SVN r19885.
2008-11-03 14:25:14 +00:00
Ralph Castain
85bc7bb26a Minor cleanups:
* fix an if condition so that we do the right thing when procs local to mpirun output to stderr

* ensure that tools can handle relays of 0-byte output, indicating that a process closed that io channel

This commit was SVN r19884.
2008-11-03 14:03:08 +00:00
Ralph Castain
58fe779388 Remove double destruct to fix segv when ctrl-c is used to terminate job
This commit was SVN r19875.
2008-11-02 02:25:20 +00:00
George Bosilca
d23fe1bb10 Include Ralph's suggestions, i.e. keep the hnp and orted management in sync.
This commit was SVN r19872.
2008-11-01 00:39:46 +00:00
George Bosilca
9528d33e90 Nothing relevant: a few indentation fixes, and tabs replaced by spaces.
This commit was SVN r19870.
2008-10-31 22:24:52 +00:00
George Bosilca
ebe87d1842 Apply some suggestions from Ralph and avoid a pretty nasty race condition on the close of the fd.
The problem was that we closed the same fd twice, and in the meantime the fd could have been reassigned
to some other file or socket.

This commit was SVN r19869.
2008-10-31 22:23:53 +00:00
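The double-close race described in ebe87d1842 is commonly avoided by invalidating the descriptor at the moment it is closed. This is a minimal sketch of that idiom, assuming a hypothetical helper `close_once`. It is not the code from the commit.

```c
#include <unistd.h>

/* Close *fd at most once. After the call *fd is -1, so a second call
 * is a no-op instead of closing whatever file or socket the kernel
 * has since handed out under the same number.
 * Returns the result of close(), or 0 if there was nothing to close. */
static int close_once(int *fd)
{
    if (*fd < 0) {
        return 0;                   /* already closed: nothing to do */
    }
    int rc = close(*fd);
    *fd = -1;                       /* forget the number even on error */
    return rc;
}
```

POSIX hands out the lowest available descriptor number, so a stale number is likely to be reused quickly. That is why a second `close()` on it can silently close an unrelated file or socket.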
George Bosilca
9f17d1d67d Allow xgrid to compile with the changes from 19866.
This commit was SVN r19868.
2008-10-31 21:56:53 +00:00
Ralph Castain
f54fda489e This is a first step towards supporting fully-routed OOB communications:
1. remove direct routed module (hooray!)

2. add radix tree routed module (binomial remains default)

3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess

4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support

5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability

6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon)

Initial tests indicate that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests are required to confirm.

This will require an autogen...

This commit was SVN r19866.
2008-10-31 21:10:00 +00:00
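To give a sense of what a radix routing tree looks like, the sketch below computes parents and children for a plain k-ary numbering, where rank 0 is the root and rank r has children r*k+1 through r*k+k. This layout is an assumption chosen for illustration. The radix module added in f54fda489e may number its daemons differently, and `tree_parent` and `tree_children` are hypothetical names.

```c
#include <stddef.h>

/* Parent of `rank` in a k-ary tree rooted at rank 0.
 * Returns -1 for the root or for invalid input. */
static int tree_parent(int rank, int radix)
{
    if (rank <= 0 || radix < 1) {
        return -1;
    }
    return (rank - 1) / radix;
}

/* Write the children of `rank` that exist in a job of `nprocs` ranks
 * into `out` (at most max_out entries). Returns how many were written. */
static size_t tree_children(int rank, int radix, int nprocs,
                            int *out, size_t max_out)
{
    size_t n = 0;
    if (rank < 0 || radix < 1) {
        return 0;
    }
    for (int i = 1; i <= radix; i++) {
        long child = (long)rank * radix + i;
        if (child >= nprocs || n >= max_out) {
            break;
        }
        out[n++] = (int)child;
    }
    return n;
}
```

With this numbering a message reaches any rank by forwarding from parent to child, and a larger radix gives a shallower tree at the cost of more connections per node.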
George Bosilca
0ce76248e8 Close the file descriptors used to push or pull the data to the children.
Without this patch, doing spawn in a loop ended up exhausting all
available file descriptors pretty quickly. There were about 5 file
descriptors opened per spawned process. Now the number of file
descriptors managed by the process (orted or HNP)
is a lot smaller.

This commit was SVN r19864.
2008-10-31 18:05:28 +00:00
Ralph Castain
30b3bc6761 Minor update - provide one more helpful hint regarding stdin target out-of-range, ensure we exit cleanly since daemons won't have been launched.
This commit was SVN r19847.
2008-10-29 16:00:48 +00:00
Ralph Castain
82ece176d5 Sanity check needs to allow vpid_invalid as this indicates the "none" scenario
This commit was SVN r19820.
2008-10-28 14:50:26 +00:00
Brad Penoff
d7b0fdfe5c small fix to compile trunk on FreeBSD 7
This commit was SVN r19817.
2008-10-28 03:44:23 +00:00
Ethan Mallove
2457df91b3 Add missing #include <errno.h> line (for SunStudio Solaris).
This commit was SVN r19814.
2008-10-27 17:41:33 +00:00
Jeff Squyres
b11d13cc05 Silence a trivial compiler warning (pgcc).
This commit was SVN r19810.
2008-10-27 14:23:02 +00:00
Jeff Squyres
c078ab6b09 Minor fix for a trivial compiler warning.
This commit was SVN r19809.
2008-10-27 14:18:49 +00:00
Ralph Castain
71dcf61f9b Add sanity check to ensure that specified stdin target is within range of job. Print error message and exit if not.
Modify read_write test to allow specification of rank to read stdin.

IOF now validated to work for arbitrary rank as stdin target. Not validated to work for multiple simultaneous ranks reading stdin (untested).

This commit was SVN r19804.
2008-10-25 14:38:06 +00:00
Jeff Squyres
d96b78fee1 If the script is there, there's no real reason to have these files in
the repo.

This commit was SVN r19795.
2008-10-24 13:42:26 +00:00
Jeff Squyres
0a741d7f81 Add scripty-foo to make the data files. Revamp the data files to be
non-uniform in content as a slightly better test.

This commit was SVN r19794.
2008-10-24 13:35:47 +00:00
Ralph Castain
c56cdac379 Finish cleanup of stdin. Set non-stdio file descriptors to non-blocking (thanks to Jeff for catching that one). Handle writes that result in "would have blocked" errno.
This commit was SVN r19793.
2008-10-24 01:42:58 +00:00
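The two pieces mentioned in c56cdac379, switching a descriptor to non-blocking mode and coping with a write that would have blocked, can be sketched with standard POSIX calls. `set_nonblocking` and `try_write` are hypothetical helper names, and the real IOF code queues the unwritten data for a later write event rather than just returning.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Attempt one write on a non-blocking fd.
 * Returns the number of bytes written (possibly fewer than len),
 * 0 if the write would have blocked (the caller should retry later),
 * or -1 on any other error. */
static ssize_t try_write(int fd, const void *buf, size_t len)
{
    for (;;) {
        ssize_t n = write(fd, buf, len);
        if (n >= 0) {
            return n;
        }
        if (errno == EINTR) {
            continue;               /* interrupted: retry immediately */
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return 0;               /* would have blocked */
        }
        return -1;
    }
}
```

A caller would typically keep the unwritten remainder and retry once the descriptor is reported writable again. A short write and a "would block" result both mean that data is still pending.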
Ralph Castain
6100d88ded Cleanup the new IOF:
1. remove some stale files that were overlooked in original commit

2. add a test program and data to stress iof for stdin

3. cleanup a debug statement that caused memory corruption when reading large files

4. some minor cleanups to correctly handle xon/xoff scenarios

This commit was SVN r19792.
2008-10-23 19:11:05 +00:00
George Bosilca
61317cb61d Complete the r19767 commit for XGrid, i.e. allow the PLM Xgrid to build.
This commit was SVN r19777.

The following SVN revision numbers were found above:
  r19767 --> open-mpi/ompi@6e5d844c36
2008-10-21 15:37:22 +00:00
Ralph Castain
ebaa2c59bb Cleanup non-debug builds
This commit was SVN r19771.
2008-10-18 13:09:47 +00:00
Jeff Squyres
6d026b86b7 Fix a problem reported on the user list by Teng Lin: OPAL_PREFIX
wasn't exported in the Bourne-shell-flavor case on remote nodes.

This commit was SVN r19770.
2008-10-18 12:13:10 +00:00