Brian Barrett
e52fef28ca
Do something rational for the disable full support case
...
This commit was SVN r24844.
2011-06-30 14:48:19 +00:00
Ralph Castain
8ac35a8496
Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate.
...
This commit was SVN r24843.
2011-06-30 14:11:56 +00:00
Ralph Castain
c449871ade
Add an mca param to set the "fork agent" - i.e., a program to be run when forking off a process (e.g., valgrind). While you could specify this by "mpirun -n N fork_agent ./my_app", not everyone launches procs with ORTE from mpirun.
...
Provide the ability to store recent stat histories using the ring_buffer class
This commit was SVN r24842.
2011-06-30 03:12:38 +00:00
Ralph Castain
2e1fa3e08e
Don't error out if the recv.cancel comes back not found as this is just a race condition
...
This commit was SVN r24841.
2011-06-30 01:19:50 +00:00
Ralph Castain
418229c71c
Define a new error constant
...
This commit was SVN r24833.
2011-06-28 19:47:16 +00:00
Wesley Bland
84be81df95
Standardize the initialization of the EPOCHs.
...
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epochs to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.
This commit was SVN r24829.
2011-06-28 14:20:33 +00:00
Ralph Castain
c203eee223
Since process names now have three fields, be sure to initialize all three of them
...
This commit was SVN r24828.
2011-06-27 20:50:08 +00:00
Ralph Castain
d316701d3c
Remove unnecessary mcast channel
...
This commit was SVN r24816.
2011-06-23 20:44:22 +00:00
Wesley Bland
e1ba09ad51
Add resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure.
...
Note that more work will be necessary to allow the MPI layer to
take advantage of this.
Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php
This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
391074cde6
Add a tag
...
This commit was SVN r24813.
2011-06-23 15:12:25 +00:00
Ralph Castain
afceaaa8e4
Add support for detecting rapid failures so the errmgr can respond accordingly
...
This commit was SVN r24796.
2011-06-21 16:08:41 +00:00
Samuel Gutierrez
81f38b258a
Commit of the new shared memory backing facility framework (shmem) and its components.
...
This commit was SVN r24795.
2011-06-21 15:41:57 +00:00
Ralph Castain
9491fbb60c
Remove two stale modules
...
This commit was SVN r24794.
2011-06-21 05:57:39 +00:00
Ralph Castain
b95ede99d5
Ensure we use the pmi grpcomm when using pmi
...
This commit was SVN r24793.
2011-06-20 21:57:47 +00:00
Ralph Castain
92a65f21bf
Restore slurm pmi support from long, long ago. Since we already have the ability to directly srun an MPI job, just conditionally add the PMI support for key values and provide a grpcomm module that uses PMI for barriers and modex.
...
Currently ompi_ignored, and unignored only for me (others to soon follow).
This commit was SVN r24792.
2011-06-20 21:04:46 +00:00
Jeff Squyres
9531e205e1
Minor fix to a comment
...
This commit was SVN r24789.
2011-06-20 17:51:01 +00:00
Ralph Castain
d563e58348
Remove another lingering bproc reference
...
This commit was SVN r24785.
2011-06-17 18:01:23 +00:00
Ralph Castain
042ee3ec48
Support the option of outputting error_log messages with something other than the process name
...
This commit was SVN r24784.
2011-06-17 14:50:00 +00:00
Ralph Castain
61dd7f4588
Need to pass identification of input/output channels
...
This commit was SVN r24783.
2011-06-17 14:48:59 +00:00
Ralph Castain
93dcfc15d0
Let the upper layer set the channels to be opened
...
This commit was SVN r24779.
2011-06-16 20:31:52 +00:00
Ralph Castain
7f2d2e3de7
Track the app_context rank - will equal overall rank for single app_context jobs
...
This commit was SVN r24778.
2011-06-16 20:31:30 +00:00
Josh Hursey
0eb3b3b7b0
Fix missing functionality in MPI_Abort: the group of peers defined by the communicator, which should be aborted along with this process, is now requested from the runtime before the local process exits.
...
Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9335.php
This commit was SVN r24775.
2011-06-15 13:10:13 +00:00
Ralph Castain
033cbbed31
Don't automatically assign group channels if not given - let the layer above figure it out.
...
This commit was SVN r24771.
2011-06-10 16:28:18 +00:00
Ralph Castain
e039c7b7ea
Avoid crashing when debugging rmaps and a non-string resource constraint is given
...
This commit was SVN r24770.
2011-06-10 16:27:30 +00:00
Josh Hursey
6539a31b23
Cleanup configure checks for C/R functionality.
...
Add a WANT_FT_CR flag different from WANT_FT so tools like *-checkpoint are not built when a different FT technique is requested.
Also fix the C/R thread check so that it is only enabled if C/R is enabled, not generally when threads are enabled.
This commit was SVN r24769.
2011-06-09 19:45:29 +00:00
Josh Hursey
9080eaedf3
Fully initialize the orte_errmgr_base_component_t structure for the app.
...
This commit was SVN r24765.
2011-06-09 14:22:25 +00:00
Josh Hursey
20339a7900
Minor coding style and indentation fixes.
...
This commit was SVN r24764.
2011-06-09 14:16:06 +00:00
Ralph Castain
fc8d920c56
Don't monitor resource usage unless requested, even if sensors are enabled during configure
...
This commit was SVN r24760.
2011-06-08 10:47:25 +00:00
Ralph Castain
906eb925f1
Update the resource usage sensor to initiate support for time analysis of measurements
...
This commit was SVN r24759.
2011-06-07 23:22:51 +00:00
Ralph Castain
f3cae3d6f3
Cleanup the handling of if_include and if_exclude arguments based on CIDR notation.
...
Fix a bug in the new code that prevented the system from correctly matching addresses.
Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide.
Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result.
Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages.
NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing.
This commit was SVN r24755.
2011-06-07 02:09:11 +00:00
George Bosilca
6b52d8f519
The paffinity is apparently needed.
...
This commit was SVN r24749.
2011-06-06 01:20:01 +00:00
Ralph Castain
bd8d9a943a
Add diagnostics
...
This commit was SVN r24748.
2011-06-05 19:17:56 +00:00
Ralph Castain
1491d52bd7
Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3").
...
This commit was SVN r24747.
2011-06-05 19:16:42 +00:00
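
The matching behaviour described in r24747 can be illustrated with a minimal sketch. This is not the ORTE implementation (which is written in C); it is a Python illustration under stated assumptions. The function name `if_matches` and its arguments are hypothetical. It assumes an include/exclude entry is either an interface name, compared exactly so that "eth1" does not match "eth1.3", or a subnet in CIDR or subnet/netmask form.

```python
import ipaddress


def if_matches(spec, if_name, if_addr):
    """Return True if one include/exclude entry matches an interface.

    spec    -- one entry from a comma-delimited list (hypothetical format):
               an interface name, or a subnet such as "10.1.0.0/16"
               or "10.1.0.0/255.255.0.0"
    if_name -- the interface name, e.g. "eth1" or "eth1.3"
    if_addr -- the interface address as a string
    """
    try:
        # strict=False tolerates host bits set in the subnet part
        net = ipaddress.ip_network(spec, strict=False)
    except ValueError:
        # Not a subnet: treat the entry as an interface name and compare
        # the whole string, so "eth1" never matches "eth1.3".
        return spec == if_name
    return ipaddress.ip_address(if_addr) in net
```

Comparing whole names rather than prefixes is the design choice that keeps a virtual interface such as "eth1.3" distinct from its parent "eth1".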
George Bosilca
454519842e
Report bindings if requested.
...
This commit was SVN r24743.
2011-06-02 17:17:10 +00:00
George Bosilca
1eccadbd87
No need for the paffinity here.
...
This commit was SVN r24742.
2011-06-02 17:16:25 +00:00
Ralph Castain
8f401a0563
Enable the ability to constrain applications to hosts on the basis of resources.
...
This commit was SVN r24736.
2011-05-28 22:18:19 +00:00
Brian Barrett
beb1bc70b2
* Add support for using modex to exchange NID/PID pairs when using Portals4.
...
Rather than try to support a bunch of lightweight environments like I did
with the Portals3 code, always use the "modex" and hack the grpcomm for
the SHMEM implementation to return the right nid/pid for a remote
process by "magic".
This commit was SVN r24733.
2011-05-25 22:10:27 +00:00
Ralph Castain
81b6c50daa
Correct stale typo
...
This commit was SVN r24725.
2011-05-23 17:34:22 +00:00
Ralph Castain
661f508e62
Fix typo
...
This commit was SVN r24723.
2011-05-22 00:20:42 +00:00
Ralph Castain
b2331113a5
Add some debug
...
This commit was SVN r24722.
2011-05-21 21:09:47 +00:00
Ralph Castain
8c08ee9c3d
Remove stale tool
...
This commit was SVN r24720.
2011-05-21 00:38:35 +00:00
Ralph Castain
1b5ca323c6
Always followup with sigkill when killing local procs as procs can trap sigterm and get stuck
...
This commit was SVN r24719.
2011-05-20 22:40:10 +00:00
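
The approach named in r24719 (always follow SIGTERM with SIGKILL) can be sketched as follows. This is a hedged Python illustration for POSIX systems, not the ORTE code. The function name `kill_local_proc` and the `grace` parameter are assumptions introduced here.

```python
import signal
import subprocess


def kill_local_proc(proc, grace=0.5):
    """Send SIGTERM, wait up to `grace` seconds, then always send SIGKILL.

    proc  -- a subprocess.Popen object for the local child (assumption)
    grace -- seconds to let the child exit on its own (hypothetical value)
    Returns the child's return code (negative signal number if killed).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass  # the child trapped or ignored SIGTERM and is still running
    # Unconditional follow-up: Popen.kill() does nothing if the child has
    # already been reaped, so sending it every time is safe.
    proc.kill()
    return proc.wait()
```

Sending SIGKILL unconditionally, rather than only after deciding the child is stuck, means a process that traps SIGTERM cannot be left behind.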
Ralph Castain
c5686ecfca
Don't sample stats for pid=0 children
...
This commit was SVN r24717.
2011-05-20 14:33:23 +00:00
Ralph Castain
b03e4481a3
Plug a couple of additional places in case orte_iof is not opened
...
This commit was SVN r24716.
2011-05-20 13:42:53 +00:00
Ralph Castain
dc0bb0571b
Record the number of heartbeats received each period for diagnostic purposes
...
This commit was SVN r24714.
2011-05-20 00:21:33 +00:00
Ralph Castain
69dce0ec10
Minor heartbeat cleanups
...
This commit was SVN r24713.
2011-05-19 21:27:44 +00:00
Ralph Castain
c3df95dd13
Prevent failure due to a race condition during abnormal termination
...
This commit was SVN r24712.
2011-05-19 21:27:05 +00:00
Ralph Castain
b0f47e6f59
Allow orte_iof to not be opened
...
This commit was SVN r24711.
2011-05-19 21:26:30 +00:00
Ralph Castain
1f3911cc8b
Add a new proc state
...
This commit was SVN r24710.
2011-05-19 21:25:58 +00:00
Ralph Castain
b47ec2ee87
Remove lingering references to opal_profile option
...
This commit was SVN r24709.
2011-05-18 18:27:29 +00:00