George Bosilca
6b52d8f519
The paffinity is apparently needed.
...
This commit was SVN r24749.
2011-06-06 01:20:01 +00:00
Ralph Castain
bd8d9a943a
Add diagnostics
...
This commit was SVN r24748.
2011-06-05 19:17:56 +00:00
Ralph Castain
1491d52bd7
Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3").
...
This commit was SVN r24747.
2011-06-05 19:16:42 +00:00
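The matching rules described in the commit above (r24747) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the oob tcp module's actual C code: the function name `interface_matches` and its arguments are hypothetical. It assumes each list entry is either a subnet in address/mask notation or an interface name, and that names are compared exactly so "eth1" is not confused with "eth1.3".

```python
import ipaddress

def interface_matches(if_name, if_addr, spec_list):
    """Return True if the interface matches any entry in a comma-separated list.

    An entry containing "/" is treated as a subnet ("192.168.1.0/24");
    any other entry is an interface name that must match exactly.
    (Hypothetical helper for illustration only.)
    """
    addr = ipaddress.ip_address(if_addr)
    for spec in spec_list.split(","):
        spec = spec.strip()
        if not spec:
            continue
        if "/" in spec:
            # Subnet+mask notation; strict=False tolerates host bits being set
            network = ipaddress.ip_network(spec, strict=False)
            if addr in network:
                return True
        elif spec == if_name:
            # Exact comparison, so "eth1" never matches "eth1.3" by prefix
            return True
    return False
```

Comparing whole names rather than prefixes is what keeps a virtual interface such as "eth1.3" separate from its parent "eth1".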
George Bosilca
454519842e
Report bindings if requested.
...
This commit was SVN r24743.
2011-06-02 17:17:10 +00:00
George Bosilca
1eccadbd87
No need for the paffinity here.
...
This commit was SVN r24742.
2011-06-02 17:16:25 +00:00
Ralph Castain
8f401a0563
Enable the ability to constrain applications to hosts on the basis of resources.
...
This commit was SVN r24736.
2011-05-28 22:18:19 +00:00
Brian Barrett
beb1bc70b2
* Add support for using modex to exchange NID/PID pairs when using Portals4.
...
Rather than try to support a bunch of lightweight environments like I did
with the Portals3 code, always use the "modex" and hack the grpcomm for
the SHMEM implementation to return the right nid/pid for a remote
process by "magic".
This commit was SVN r24733.
2011-05-25 22:10:27 +00:00
Ralph Castain
81b6c50daa
Correct stale typo
...
This commit was SVN r24725.
2011-05-23 17:34:22 +00:00
Ralph Castain
661f508e62
Fix typo
...
This commit was SVN r24723.
2011-05-22 00:20:42 +00:00
Ralph Castain
b2331113a5
Add some debug
...
This commit was SVN r24722.
2011-05-21 21:09:47 +00:00
Ralph Castain
8c08ee9c3d
Remove stale tool
...
This commit was SVN r24720.
2011-05-21 00:38:35 +00:00
Ralph Castain
1b5ca323c6
Always follow up with SIGKILL when killing local procs, as procs can trap SIGTERM and get stuck
...
This commit was SVN r24719.
2011-05-20 22:40:10 +00:00
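The kill sequence described in the commit above (r24719) might look like the sketch below. It is a Python illustration, not the ORTE implementation: the helper name `terminate_then_kill` and the grace period are assumptions. The idea is that SIGTERM can be trapped or ignored by the child, while SIGKILL cannot, so the second signal is always sent.

```python
import subprocess

def terminate_then_kill(proc, grace_seconds=1.0):
    """Send SIGTERM, wait briefly, then always follow up with SIGKILL.

    Returns the child's exit code (a negative signal number on POSIX).
    (Hypothetical helper; the grace period is an assumed value.)
    """
    proc.terminate()  # SIGTERM on POSIX; the child may trap or ignore it
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass  # the child did not exit within the grace period
    # Popen.kill() does nothing if the child has already been reaped
    proc.kill()  # SIGKILL cannot be trapped
    return proc.wait()
```

A child that ignores SIGTERM would otherwise hang the shutdown; with the follow-up it exits with the SIGKILL status instead.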
Ralph Castain
c5686ecfca
Don't sample stats for pid=0 children
...
This commit was SVN r24717.
2011-05-20 14:33:23 +00:00
Ralph Castain
b03e4481a3
Plug a couple of additional places in case orte_iof is not opened
...
This commit was SVN r24716.
2011-05-20 13:42:53 +00:00
Ralph Castain
dc0bb0571b
Record the number of heartbeats received each period for diagnostic purposes
...
This commit was SVN r24714.
2011-05-20 00:21:33 +00:00
Ralph Castain
69dce0ec10
Minor heartbeat cleanups
...
This commit was SVN r24713.
2011-05-19 21:27:44 +00:00
Ralph Castain
c3df95dd13
Prevent failure due to race condition during abnormal termination
...
This commit was SVN r24712.
2011-05-19 21:27:05 +00:00
Ralph Castain
b0f47e6f59
Allow orte_iof to not be opened
...
This commit was SVN r24711.
2011-05-19 21:26:30 +00:00
Ralph Castain
1f3911cc8b
Add a new proc state
...
This commit was SVN r24710.
2011-05-19 21:25:58 +00:00
Ralph Castain
b47ec2ee87
Remove lingering references to opal_profile option
...
This commit was SVN r24709.
2011-05-18 18:27:29 +00:00
Ralph Castain
9678e62613
Fix possible corruption of environ. Thanks to Ariel Burton and Peter Thompson for finding it!
...
This commit was SVN r24708.
2011-05-18 16:25:35 +00:00
Ralph Castain
d34bab541d
Remove the ompi-profiler tool and its attendant ompi-probe program. Also remove the grpcomm basic component since its only function was to support profiled clusters, which nobody was doing. :-(
...
This commit was SVN r24704.
2011-05-17 03:30:25 +00:00
Ralph Castain
a3e43594a4
Extend node stats to include additional memory info. Change "darwin" pstat module to "test" as we don't really know how to get all the stat info for darwin.
...
Add a new OPAL_ERROR_LOG macro similar to the ORTE_ERROR_LOG one.
This commit was SVN r24692.
2011-05-08 14:45:16 +00:00
Ralph Castain
c160f5d5a2
Add ability to specify mcast interfaces by name
...
This commit was SVN r24691.
2011-05-08 14:42:48 +00:00
Thomas Herault
fb3fd8fd0e
Items belonging to peer_send_queue are mca_oob_tcp_msg_t *, which are obtained through an opal_freelist.
...
They shouldn't be released, but returned to the freelist.
This commit was SVN r24679.
2011-05-03 21:03:09 +00:00
Ralph Castain
9df207aa51
Fix the case where a user supplies the -xterm option, which requires that we leave ssh sessions attached.
...
This commit was SVN r24668.
2011-05-02 12:39:55 +00:00
Ralph Castain
138928fcf4
Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces.
...
This commit was SVN r24666.
2011-04-29 18:46:40 +00:00
Shiqing Fan
9e90ade864
Missed one file from the last commit.
...
This commit was SVN r24664.
2011-04-29 14:44:02 +00:00
Shiqing Fan
4490fdbd34
Add the initial support for MinGW and MSYS.
...
Correctly check the dependencies of MSYS env.
Set up configure include and lib path for building the package.
Update a few more CMake scripts.
This commit was SVN r24663.
2011-04-29 14:42:07 +00:00
Ralph Castain
c78531ce8a
Don't free the envar that gets putenv'd as that messes up the environ
...
This commit was SVN r24660.
2011-04-29 08:50:29 +00:00
Ralph Castain
0ff0d20e72
Grr...get the prefix right - need to strip the bin out of absolute path to mpirun.
...
This commit was SVN r24658.
2011-04-28 22:20:55 +00:00
Ralph Castain
6af2677fb8
Check for both absolute-path-to-mpirun and -prefix being specified. If the two differ, print out a warning and ignore -prefix. If they are the same, or only one was given, then proceed as directed.
...
This commit was SVN r24657.
2011-04-28 22:12:41 +00:00
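The decision logic in the commit above (r24657), together with the follow-up fix in r24658 that strips "bin" from the absolute path, can be sketched roughly as below. This is an illustrative Python sketch, not mpirun's actual code: the function name `choose_prefix`, its arguments, and the warning text are hypothetical.

```python
import os.path
import warnings

def choose_prefix(mpirun_path, prefix_option=None):
    """Pick the install prefix from an mpirun path and/or a -prefix value.

    (Hypothetical helper for illustration only.)
    """
    path_prefix = None
    if os.path.isabs(mpirun_path):
        # /opt/ompi/bin/mpirun -> /opt/ompi (strip the executable, then "bin")
        path_prefix = os.path.dirname(os.path.dirname(mpirun_path))
    if path_prefix is None:
        # Only -prefix (or nothing) was given: proceed as directed
        return prefix_option
    if prefix_option is None:
        # Only the absolute path was given
        return path_prefix
    if os.path.normpath(prefix_option) != os.path.normpath(path_prefix):
        # Both were given and they differ: warn and ignore -prefix
        warnings.warn("-prefix differs from the absolute path to mpirun; "
                      "ignoring -prefix")
    return path_prefix
```

Normalizing both values before comparing means a trailing slash on the -prefix value alone does not trigger the warning.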
Ralph Castain
b586f2952e
Arggg...revert r24645. I knew those fields were there for a reason...sigh.
...
This commit was SVN r24647.
The following SVN revision numbers were found above:
r24645 --> open-mpi/ompi@e4732110da
2011-04-28 15:07:00 +00:00
Ralph Castain
859aaab93d
In the case of direct-launched processes running under slurm, psm requires that the pre_condition_transports MCA param be set. This is normally computed by mpirun and inserted into each proc's environ, but that doesn't work here.
...
So separate out the printing of that key, and let the individual procs generate it in a way that ensures they all get the same result.
This commit was SVN r24646.
2011-04-28 13:54:33 +00:00
Ralph Castain
e4732110da
Remove a couple more stale fields
...
This commit was SVN r24645.
2011-04-28 00:26:38 +00:00
Ralph Castain
39369f8807
Remove stale fields from global objects - have been moved to the layer that actually uses them
...
This commit was SVN r24644.
2011-04-28 00:20:49 +00:00
Ralph Castain
8858d9a40e
Add a marker for other layers to use in defining data types
...
This commit was SVN r24643.
2011-04-28 00:19:35 +00:00
Ralph Castain
9988b97b97
Extend/update how we handle process stats. Add the ability to collect node-level stats separate from the process stats. Update the process stat memory fields to report in MBytes instead of KBytes as I can't find any process that runs in KBytes nowadays.
...
Rename the memusage sensor plugin to "resusage" as it will soon be updated to include full process stat monitoring.
Extend the heartbeat sensor to report node and process stats in the heartbeat.
Store the process and node stats in their respective orte_xxx_t object.
This commit was SVN r24629.
2011-04-21 22:55:45 +00:00
Ralph Castain
5f64b830f9
Ensure we only kill threads once
...
This commit was SVN r24620.
2011-04-18 14:47:09 +00:00
Ralph Castain
8014e7432c
Send recovery defined flag in app_contexts, include recovery flags in debug prints
...
This commit was SVN r24619.
2011-04-18 14:46:42 +00:00
Ralph Castain
89501e6e24
Don't try to politely end threads when abnormally terminating as we can hang if the thread is in a stuck callback.
...
This commit was SVN r24618.
2011-04-18 12:21:47 +00:00
Ralph Castain
3a28556472
Expand our handling of non-zero exit status. If a process exits with non-zero status, pass that info along to the user in case it means something to them, even if the process also exited without calling MPI_Finalize. If the process calls MPI_Abort, that trumps the exit status question.
...
Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI.
This commit was SVN r24614.
2011-04-14 15:04:21 +00:00
Ralph Castain
30fb002524
Take the first small step towards rationalizing rsh support. Create a new "rshbase" component that contains a simple rsh module - no tree spawn, uses all the base functions for launch support. Extend the base rsh support functions to include those functions in common across all rsh modules.
...
Only a minor change was made to the current rsh module to avoid a naming conflict. Otherwise, left it alone to avoid creating conflicts with other external work. The current rsh module remains the default for rsh/ssh support, and continues to contain the support for SGE and LoadLeveler.
This commit was SVN r24593.
2011-03-30 01:15:07 +00:00
Nysal Jan
866ae8b43a
Close the file descriptor
...
This commit was SVN r24580.
2011-03-29 08:42:49 +00:00
Nysal Jan
c8c6b0edab
Improve LoadLeveler integration with Open MPI. Add support for LL native rsh agent - llspawn
...
This commit was SVN r24579.
2011-03-29 07:46:59 +00:00
Nathan Hjelm
8634b6394f
fixed plm/tm component
...
This commit was SVN r24577.
2011-03-25 22:20:15 +00:00
Ralph Castain
d7e029cb40
Convert heartbeat to multicast basis
...
This commit was SVN r24570.
2011-03-24 19:05:39 +00:00
Ralph Castain
90698a2c02
Ensure that blocking recvs wait until the data is actually recvd
...
This commit was SVN r24558.
2011-03-22 18:45:54 +00:00
Ralph Castain
888472f671
Do not release recv as the calling function needs that data and will release it later
...
This commit was SVN r24557.
2011-03-22 18:44:56 +00:00
Ralph Castain
30981de200
Minor cleanups courtesy of Nysal - thanks!
...
This commit was SVN r24552.
2011-03-22 13:48:58 +00:00