Tim Mattox
c2d105a4d9
Refs trac:1763: Fix -wdir option
...
Reverted r20306 since the fix caused 100% failures on our BigRed system.
See the comments on ticket #1763 for the details.
This commit was SVN r20339.
The following SVN revision numbers were found above:
r20306 --> open-mpi/ompi@8c87e48721
The following Trac tickets were found above:
Ticket 1763 --> https://svn.open-mpi.org/trac/ompi/ticket/1763
2009-01-24 15:04:47 +00:00
Ralph Castain
c6c5bc17a0
Add a new hierarchical collective grpcomm component that performs modex and barrier across the procs instead of the daemons. Modeled on the tuned collectives. Collective code is in grpcomm base for eventual use by the daemon-based components as well.
...
This commit was SVN r20337.
2009-01-23 21:57:51 +00:00
Ralph Castain
7154cbf2e0
Cleanup a couple of mis-labeled diagnostic outputs
...
This commit was SVN r20332.
2009-01-23 20:46:54 +00:00
Josh Hursey
04c69b8a82
Fixes for --preload-files and --preload-binary.
...
* Improved the error propagation from a backend orted
* Fixed a hang in orterun caused by failed file transfers
* Fix the movement of files with relative path names
* Improved error messages when a file cannot be moved
* Move file checks to FileM instead of embedding them in the ODLS
This commit Refs trac:1770
This commit was SVN r20331.
The following Trac tickets were found above:
Ticket 1770 --> https://svn.open-mpi.org/trac/ompi/ticket/1770
2009-01-23 15:32:24 +00:00
Josh Hursey
d066c67b53
We need to update both context->app and context->argv[0] with the new path when we use --preload-binary. This keeps orte from checking the wrong path later in the odls [orte_util_check_context_app() called from odls_base_default_setup_fork()].
...
Refs trac:1770
This commit was SVN r20321.
The following Trac tickets were found above:
Ticket 1770 --> https://svn.open-mpi.org/trac/ompi/ticket/1770
2009-01-22 19:18:36 +00:00
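A minimal sketch of the idea described in r20321 above: when the binary is relocated, both the app path and argv[0] are updated together so that a later path check sees the same value in both places. The struct and function names here (app_context_t, copy_string, set_preloaded_path) are hypothetical stand-ins for illustration, not the actual ORTE definitions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the app context; only the two fields
 * named in the commit message are modeled here. */
typedef struct {
    char  *app;   /* path to the executable */
    char **argv;  /* NULL-terminated argument vector; argv[0] mirrors app */
} app_context_t;

/* Portable string copy (avoids relying on the POSIX strdup). */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL) {
        memcpy(p, s, n);
    }
    return p;
}

/* Point both app and argv[0] at new_path.
 * Returns 0 on success, -1 on error; on error the context is unchanged. */
static int set_preloaded_path(app_context_t *ctx, const char *new_path)
{
    if (ctx == NULL || ctx->argv == NULL || new_path == NULL) {
        return -1;
    }
    char *app_copy = copy_string(new_path);
    char *argv0_copy = copy_string(new_path);
    if (app_copy == NULL || argv0_copy == NULL) {
        free(app_copy);
        free(argv0_copy);
        return -1;
    }
    free(ctx->app);
    ctx->app = app_copy;
    free(ctx->argv[0]);
    ctx->argv[0] = argv0_copy;
    return 0;
}
```

Both copies are allocated before either field is touched, so a failed allocation cannot leave app and argv[0] pointing at different paths.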
Ralph Castain
47740d1e87
Get the inequality the correct way!
...
This commit was SVN r20319.
2009-01-22 16:33:07 +00:00
Ralph Castain
f6ba4f6f30
Per discussion with Jeff, an invalid local rank value should never occur - if it does, it could be indicative of deeper problems in the launch procedure. Thus, rather than allowing the launch to proceed, let's abort.
...
This commit was SVN r20312.
2009-01-22 00:52:46 +00:00
Jeff Squyres
90e69ac6ff
Fix some man page nits noticed by the Debian OMPI maintainers. Thanks
...
Dirk!
This commit was SVN r20307.
2009-01-21 18:38:37 +00:00
Ralph Castain
8c87e48721
Fix a user-reported bug whereby the -wdir option would only be applied from the last app_context.
...
This commit was SVN r20306.
2009-01-21 15:52:12 +00:00
Josh Hursey
abfc7c6076
Per ticket #1527, orte-restart should be using --default-hostfile instead of --hostfile with app contexts.
...
Thanks to Gregor Dschung for reporting the problem.
This commit was SVN r20305.
2009-01-21 14:08:16 +00:00
Ralph Castain
5d9de3326c
Check for valid local/node ranks before using the returned values
...
This commit was SVN r20304.
2009-01-21 00:54:50 +00:00
Ralph Castain
a6a7335694
Catch a potential bug spanning several ESS modules. The node_rank and local_rank types were changed to uint16_t; however, the modules still returned UINT8_MAX as an "invalid" value. To clean this up, define an INVALID value for these types, and change the various modules so they return this value to indicate an invalid response.
...
This commit was SVN r20303.
2009-01-21 00:19:37 +00:00
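To illustrate the problem fixed in r20303 above: once a rank type is 16 bits wide, 255 (UINT8_MAX) is a perfectly legitimate rank, so it can no longer double as the "invalid" marker. The sketch below uses hypothetical type and macro names, and assumes the sentinel is the maximum value of the 16-bit type; the commit message does not say which value was actually chosen.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the real ORTE identifiers may differ. */
typedef uint16_t local_rank_t;
typedef uint16_t node_rank_t;

/* Assumed sentinel: the largest value the 16-bit type can hold. */
#define LOCAL_RANK_INVALID ((local_rank_t)UINT16_MAX)
#define NODE_RANK_INVALID  ((node_rank_t)UINT16_MAX)

/* Callers check the returned rank before using it. */
static inline bool local_rank_is_valid(local_rank_t r)
{
    return r != LOCAL_RANK_INVALID;
}

static inline bool node_rank_is_valid(node_rank_t r)
{
    return r != NODE_RANK_INVALID;
}
```

With this definition a rank of 255 passes the validity check, whereas a module that still returned UINT8_MAX to signal failure would be indistinguishable from a process that really has rank 255.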
Ralph Castain
4da9f53fa4
Implement the xml formatted output of stdout/err/diag. Force -tag-output if -xml is set.
...
This commit was SVN r20302.
2009-01-20 16:58:31 +00:00
Ralph Castain
88a0af9726
Revise the way we output resolved hostnames to make life easier for the Eclipse folks. Store aliases for individual nodes (only when requested to show resolved hostnames) and then report them out as part of the display-map option.
...
This commit was SVN r20284.
2009-01-15 18:11:50 +00:00
Ralph Castain
253a54df12
Shutdown the socket before closing for cleaner termination.
...
This commit was SVN r20283.
2009-01-15 18:06:01 +00:00
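The pattern named in r20283 above can be sketched with the standard POSIX calls shutdown() and close(). The helper name close_socket_cleanly is hypothetical, and this is not the code from the commit itself.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Shut down both directions of a socket, then release the descriptor.
 * Returns the result of close(), or -1 for an invalid descriptor. */
static int close_socket_cleanly(int fd)
{
    if (fd < 0) {
        return -1;
    }
    /* shutdown() can fail harmlessly (e.g. ENOTCONN if the peer is
     * already gone), so its result is deliberately ignored. */
    (void)shutdown(fd, SHUT_RDWR);
    return close(fd);
}
```

Calling shutdown() first signals end-of-stream to the peer explicitly, even if another descriptor referring to the same socket is still open elsewhere in the process.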
Ralph Castain
a9af219ba7
Fix CID 723: a pointless whine about not checking a return code
...
This commit was SVN r20274.
2009-01-14 19:06:36 +00:00
Jeff Squyres
a568ba0468
Fix CID 25: it's not possible for sav to be non-NULL by the time it
...
gets here.
This commit was SVN r20273.
2009-01-14 18:57:48 +00:00
Jeff Squyres
0c8f8fe1ea
Fix CID 733: remove some dead code (proc_name was set but effectively
...
never used).
This commit was SVN r20271.
2009-01-14 18:12:06 +00:00
Josh Hursey
a9da2dada1
Remove some unused variables.
...
This commit was SVN r20270.
2009-01-14 17:28:40 +00:00
Tim Mattox
5b70160626
For two error conditions in the ras_loadleveler_module, output
...
the error code reported by loadleveler. Also, clean up a
few more internal error messages.
This commit was SVN r20255.
2009-01-13 15:44:26 +00:00
Brian Barrett
d3310a5ad1
fixes to get compiling on Red Storm again
...
This commit was SVN r20252.
2009-01-12 22:30:00 +00:00
Ralph Castain
694008e9bb
Fix a reported bug whereby keyboard entry to a remote proc was being lost after the first iteration. In other words, if an application has a proc reading stdin from the keyboard, and that proc is not co-located with mpirun, then the system would hang.
...
The problem was eventually traced to two bugs in the code:
1. the orted wasn't resetting the write event flag, thus preventing itself from turning it on again.
2. the HNP needed to check whether stdin was attached to a tty before adding the delay for fairness. If it is attached to a tty, there is no need for the delay. This prevents some strangely slow typing response.
This patch needs to move to 1.3
This commit was SVN r20246.
2009-01-12 20:12:58 +00:00
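The second fix in r20246 above amounts to a tty check on the input descriptor. A minimal sketch using the POSIX isatty() call follows; the function name needs_fairness_delay is hypothetical and the surrounding event-loop logic is omitted.

```c
#include <stdbool.h>
#include <unistd.h>

/* Decide whether reads from fd should be throttled for fairness.
 * Interactive (tty) input arrives at typing speed, so no delay is
 * needed; piped or redirected input can arrive in bulk, so it is. */
static bool needs_fairness_delay(int fd)
{
    return isatty(fd) == 0;
}
```

A caller would typically evaluate this once for STDIN_FILENO when stdin forwarding is set up, and skip the delay whenever it returns false.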
Josh Hursey
1420c32a5d
Update SnapC Local Coordinator in reaction to structure changes in r20228. The list of local children became more globalized so I needed to update the loop invariants appropriately.
...
This commit was SVN r20245.
The following SVN revision numbers were found above:
r20228 --> open-mpi/ompi@007d68becc
2009-01-12 19:45:48 +00:00
Ralph Castain
2778c13fac
Continue to refine the timing instrumentation to identify where launch time is being spent
...
This commit was SVN r20244.
2009-01-12 19:12:58 +00:00
Jeff Squyres
d1c6f3f89a
* Fix a truckload of Cisco copyrights to be the same as the rest of
...
the code base.
* Fix a few misspellings in other copyrights.
This commit was SVN r20241.
2009-01-11 02:30:00 +00:00
Tim Mattox
820b209564
Oops, forgot to update the copyright date range...
...
This commit was SVN r20239.
2009-01-09 19:04:52 +00:00
Tim Mattox
af45569366
Clean up some debugging output in the loadleveler ras module.
...
Error output strings were changed to be unique per code site.
They are still pretty meaningless to the user, but at least now
developers might be able to find which unique place in the code
reported which error.
This commit was SVN r20238.
2009-01-09 19:03:52 +00:00
Ralph Castain
c009b51ad3
Silence warning about signed vs unsigned comparisons
...
This commit was SVN r20237.
2009-01-09 16:01:03 +00:00
George Bosilca
78d856e04c
Release resources when a job is completed. This allows us to correctly
...
count and load balance MPI-2 dynamic type of applications.
This commit was SVN r20236.
2009-01-08 21:21:54 +00:00
Ralph Castain
25f578a7d2
Continue to improve timing instrumentation. Add ability to store timing data directly to a file instead of just to stdout.
...
This commit was SVN r20229.
2009-01-08 14:27:52 +00:00
Ralph Castain
007d68becc
Make the data on local children and their jobs available globally on both daemons and the HNP. This simply shifts the data structures from the ODLS base to the orte globals area to support subsequent movement of the daemon collective operations from the odls to the grpcomm framework. As that will be a larger change, it will be implemented on a branch and rolled over separately.
...
This commit was SVN r20228.
2009-01-08 14:25:56 +00:00
Ralph Castain
80fb98ae32
Cleanup the modex-less operations for efficiency. Have the component default to normal modex operations if modex-less isn't specified.
...
This commit was SVN r20220.
2009-01-07 15:00:26 +00:00
Ralph Castain
7818779760
Expose the nidmap and pidmap as orte globals so that components in other frameworks can access and/or manipulate them without forcing API modifications - modify the individual ess components that were affected so they use the global variables. Add a list of attributes to the nids for storing node-related data (e.g., modex attrs), and define a new object for that purpose.
...
Consolidate the nid/pid lookup code with the rest of the nid/pid code so that changes are easier to track. Add the ability to send cluster profile info as part of the nidmap. Cleanup the setup and teardown of the new global nidmap and pidmap objects.
This commit was SVN r20219.
2009-01-07 14:58:38 +00:00
Ralph Castain
09d4a45fa5
Switch to non-blocking sends so the orted's can begin processing their own messages sooner
...
This commit was SVN r20218.
2009-01-07 14:52:12 +00:00
Ralph Castain
9dbcee9110
Increase efficiency for modex-less launch by storing byte objects in the profile file
...
This commit was SVN r20206.
2009-01-05 21:46:12 +00:00
Ralph Castain
dc3ba492a7
CID 1206: it's a complicated error path, but if a daemon is passed an ompi-top command and cannot correctly unpack the name of the tool, there really isn't anything it can do about it. Just return and let the tool hang.
...
This commit was SVN r20202.
2009-01-05 15:35:02 +00:00
Ralph Castain
5f5d8ad231
CID 1139-1141: remove outdated variable from the various routed components
...
This commit was SVN r20201.
2009-01-05 15:09:54 +00:00
Ralph Castain
1bc125c0a7
CID 1131: cleanup a minor memory leak
...
This commit was SVN r20200.
2009-01-05 15:05:05 +00:00
Jeff Squyres
6d0d8848ac
Fix CID 1129: Remove variable that is set but never used.
...
This commit was SVN r20194.
2009-01-03 15:39:51 +00:00
Jeff Squyres
e52ac6da40
Fix CID 1130: remove variable that is set but never used.
...
This commit was SVN r20193.
2009-01-03 15:37:00 +00:00
Jeff Squyres
1bacdef317
Fix CID 1188. Minor issue; just convert to snprintf instead of sprintf.
...
This commit was SVN r20185.
2009-01-03 14:46:46 +00:00
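As an illustration of the kind of change made in r20185 above, the sketch below formats into a fixed-size buffer with snprintf and reports truncation instead of overrunning the buffer. The function name format_label and the format string are hypothetical, not taken from the commit.

```c
#include <stddef.h>
#include <stdio.h>

/* Write "<name>-<rank>" into buf, never writing more than len bytes.
 * Returns 0 if the whole string fit, -1 on truncation or error. */
static int format_label(char *buf, size_t len, const char *name, int rank)
{
    int n = snprintf(buf, len, "%s-%d", name, rank);
    if (n < 0 || (size_t)n >= len) {
        return -1;  /* output was truncated or an encoding error occurred */
    }
    return 0;
}
```

Unlike sprintf, snprintf stops at the buffer boundary and returns the length the full string would have needed, which is what makes the truncation check possible.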
Ralph Castain
91ada6c323
Ensure we avoid overflows, handle the odd number of nodes case
...
This commit was SVN r20171.
2008-12-31 01:11:57 +00:00
Ralph Castain
b012ed6c94
Add a somewhat unique launch time test
...
This commit was SVN r20170.
2008-12-30 21:42:51 +00:00
Ralph Castain
bb96474d6e
Per request from Aurelien, make orterun's report-pid and report-uri functions work the same as those of ompi-server. Since these are used for ompi-server-like functionality, it makes sense that the report options work the same. Make orte-top take the corresponding input the same way too for consistency.
...
The modified cmd line options are:
--report-uri x where x is either '-' for stdout, '+' for stderr, or a filename
--report-pid x where x is the same as above
For orte-top, you can now provide either a pid or a uri (which allows connection to remote mpiruns), specified either directly or with a "file:x" option as per mpirun's ompi-server option.
Note: I did not add a report-pid option to ompi-server as it probably wouldn't be useful - the report-uri option works as well, and allows remote access (which is likely the normal way it would be used).
This commit was SVN r20168.
2008-12-24 15:27:46 +00:00
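The argument convention described in r20168 above ('-' for stdout, '+' for stderr, anything else a filename) could be handled along the following lines. This is a sketch under that stated convention only; the function name open_report_target and the must_close flag are hypothetical and do not come from the orterun source.

```c
#include <stdio.h>
#include <string.h>

/* Map a --report-uri / --report-pid style argument to an output stream.
 * Sets *must_close to 1 only when a file was opened and the caller is
 * responsible for fclose(). Returns NULL if arg is NULL or the file
 * cannot be opened. */
static FILE *open_report_target(const char *arg, int *must_close)
{
    *must_close = 0;
    if (arg == NULL) {
        return NULL;
    }
    if (strcmp(arg, "-") == 0) {
        return stdout;
    }
    if (strcmp(arg, "+") == 0) {
        return stderr;
    }
    FILE *fp = fopen(arg, "w");
    if (fp != NULL) {
        *must_close = 1;
    }
    return fp;
}
```

Tracking ownership separately keeps the caller from accidentally closing stdout or stderr after writing the report.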
Ralph Castain
7787f84540
Per the earlier RFC and some discussion at the Dec ORTE design meeting, add the ompi-top tool and all its supporting infrastructure. This includes a new OPAL pstat framework and data type, currently with rather weak support for Mac OS X and pretty complete support for Linux. The Sun team promised to add Solaris support as well.
...
Also, per chat with Jeff, modified the Makefile.am's of a few orte tools so that they were consistent in the way we generate the ompi-equivalent cmds.
This commit was SVN r20165.
2008-12-22 20:23:05 +00:00
Ralph Castain
d1ff02e924
Add a macro to construct a complete 32-bit jobid from a local jobid number. This inserts the mpirun's job family into the upper 16-bit field.
...
This commit was SVN r20161.
2008-12-20 23:27:25 +00:00
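The bit layout described in r20161 above (job family in the upper 16 bits, local jobid in the lower 16 bits of a 32-bit jobid) can be sketched as follows. The macro names are hypothetical and are not claimed to match the macro added by the commit.

```c
#include <stdint.h>

/* Hypothetical macros illustrating the 16/16 split of a 32-bit jobid. */

/* Combine a job family and a local jobid into one 32-bit value. */
#define MAKE_JOBID(family, local)                         \
    ((uint32_t)((((uint32_t)(family) & 0xffffu) << 16) |  \
                ((uint32_t)(local) & 0xffffu)))

/* Extract the upper 16 bits (the mpirun's job family). */
#define JOBID_FAMILY(jobid) ((uint16_t)(((uint32_t)(jobid) >> 16) & 0xffffu))

/* Extract the lower 16 bits (the job number local to that mpirun). */
#define JOBID_LOCAL(jobid)  ((uint16_t)((uint32_t)(jobid) & 0xffffu))
```

For example, MAKE_JOBID(0x1234, 1) yields 0x12340001, and the two extraction macros recover 0x1234 and 1 from it. Masking both inputs keeps an out-of-range local value from spilling into the family field.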
Ralph Castain
aff3d1df21
Remove IOF related utilities from tool communication lib - IOF has now been updated to include tool support directly.
...
This commit was SVN r20160.
2008-12-20 23:25:56 +00:00
Ralph Castain
caa5771908
Don't force tools to dump core files when they abort
...
This commit was SVN r20159.
2008-12-20 23:24:36 +00:00
Ralph Castain
9f6c1b9d07
Per discussion at the Dec ORTE design meeting, add a "set_lifeline" API to the orte_routed framework. This allows the caller to define a "lifeline" process so that, if the connection to that lifeline is subsequently lost, the process will be terminated. This helps tools that connect to an mpirun to know when that mpirun completes and terminates.
...
This commit was SVN r20158.
2008-12-20 23:23:11 +00:00
Brian Barrett
64f7848a84
A number of small fixes to get the trunk to build again on Catamount
...
This commit was SVN r20141.
2008-12-16 20:09:56 +00:00