Ralph Castain
8d1b31b887
Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly.
...
Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list.
This commit was SVN r24894.
2011-07-13 20:11:14 +00:00
Ralph Castain
1405bacd85
Ensure we don't segfault if we report an error
...
This commit was SVN r24890.
2011-07-13 15:00:22 +00:00
Jeff Squyres
3893a5a1de
Fix compile error introduced in r24888.
...
This commit was SVN r24889.
The following SVN revision numbers were found above:
r24888 --> open-mpi/ompi@e5253647ea
2011-07-13 14:18:00 +00:00
Shiqing Fan
e5253647ea
Fix a type cast.
...
This commit was SVN r24888.
2011-07-13 09:00:17 +00:00
Nathan Hjelm
3f4e5d7dd6
add missing thread lock/unlock around condition_broadcast
...
This commit was SVN r24885.
2011-07-12 15:43:56 +00:00
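Illustrative note on r24885: a condition variable is normally broadcast while holding its associated lock, so that a waiter cannot miss the wakeup between checking its predicate and going to sleep. The sketch below shows that pattern in Python rather than the project's C. `CompletionFlag` and its methods are hypothetical names for illustration, not ORTE code.

```python
import threading


class CompletionFlag:
    """Hypothetical one-shot flag guarded by a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False

    def signal_done(self):
        # Take the lock before changing the predicate and broadcasting.
        # Python enforces this: notify_all() without the lock raises
        # RuntimeError.
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def wait_done(self, timeout=None):
        # wait_for() re-checks the predicate under the lock, so it also
        # returns promptly if signal_done() already ran.
        with self._cond:
            return self._cond.wait_for(lambda: self._done, timeout=timeout)
```

A waiter calls `wait_done()` and any other thread calls `signal_done()`; the order in which the two run does not matter.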
Nathan Hjelm
c3ec2e2614
fix a potential race condition in rml
...
This commit was SVN r24884.
2011-07-12 15:43:12 +00:00
Abhishek Kulkarni
6bf02d1344
Fixes (after the new ORTE resiliency layer was merged) to make the trunk build with C/R flags turned on.
...
This commit was SVN r24872.
2011-07-10 23:36:26 +00:00
Ralph Castain
05f4926bfe
Remove some remaining regular-expression cruft that caused the trunk to fail if regex wasn't being used
...
This commit was SVN r24861.
2011-07-08 06:42:12 +00:00
Ralph Castain
1ee7c39982
Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
...
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.
Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.
This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
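Illustrative note on r24860: the commit says the node list must be passed in a compact regular-expression form so the orted command line does not get too long. Below is a minimal sketch of the idea in Python, assuming hostnames consist of a shared prefix plus a numeric suffix. The `prefix[lo-hi]` output syntax and the `compress_nodes` name are assumptions for illustration and are not ORTE's actual regex format.

```python
import re
from itertools import groupby


def compress_nodes(names):
    """Fold names such as node001,node002,node003 into node[001-003]."""
    groups = {}  # (prefix, digit width) -> list of numeric suffixes
    others = []  # names without a numeric suffix are passed through
    for name in names:
        m = re.fullmatch(r"(.*?)(\d+)", name)
        if m is None:
            others.append(name)
            continue
        prefix, digits = m.groups()
        groups.setdefault((prefix, len(digits)), []).append(int(digits))

    parts = []
    for (prefix, width), nums in groups.items():
        nums = sorted(set(nums))
        ranges = []
        # Within a run of consecutive integers, value - index is constant.
        for _, run in groupby(enumerate(nums), key=lambda p: p[1] - p[0]):
            values = [v for _, v in run]
            lo, hi = values[0], values[-1]
            if lo == hi:
                ranges.append(f"{lo:0{width}d}")
            else:
                ranges.append(f"{lo:0{width}d}-{hi:0{width}d}")
        parts.append(f"{prefix}[{','.join(ranges)}]")
    return ",".join(parts + others)
```

For example, `compress_nodes(["node001", "node002", "node003", "node007", "login"])` returns `"node[001-003,007],login"`, which stays short however many consecutive nodes are in the job.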
Ralph Castain
6496b2f845
Ensure we terminate properly on non-zero exit status
...
This commit was SVN r24859.
2011-07-07 14:33:49 +00:00
Wesley Bland
0628963506
Fix a return code when a process isn't found.
...
This commit was SVN r24845.
2011-06-30 15:22:54 +00:00
Brian Barrett
e52fef28ca
Do something rational for the disable full support case
...
This commit was SVN r24844.
2011-06-30 14:48:19 +00:00
Ralph Castain
8ac35a8496
Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate.
...
This commit was SVN r24843.
2011-06-30 14:11:56 +00:00
Ralph Castain
c449871ade
Add an mca param to set the "fork agent" - i.e., a program to be run when forking off a process (e.g., valgrind). While you could specify this by "mpirun -n N fork_agent ./my_app", not everyone launches procs with ORTE from mpirun.
...
Provide the ability to store recent stat histories using the ring_buffer class
This commit was SVN r24842.
2011-06-30 03:12:38 +00:00
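Illustrative note on r24842: a ring buffer keeps only the most recent N samples, overwriting the oldest once it is full. This is a minimal Python sketch of that behaviour, not the project's ring_buffer class; `StatHistory` and its methods are hypothetical names.

```python
from collections import deque


class StatHistory:
    """Hypothetical fixed-capacity history of recent stat samples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self._samples = deque(maxlen=capacity)

    def record(self, sample):
        self._samples.append(sample)

    def recent(self):
        # Oldest first, newest last.
        return list(self._samples)
```

Recording the values 0 through 4 into a `StatHistory(3)` leaves `recent()` returning `[2, 3, 4]`.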
Ralph Castain
2e1fa3e08e
Don't error out if the recv.cancel comes back not found as this is just a race condition
...
This commit was SVN r24841.
2011-06-30 01:19:50 +00:00
Wesley Bland
84be81df95
Standardize the initialization of the EPOCHs.
...
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epochs to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.
This commit was SVN r24829.
2011-06-28 14:20:33 +00:00
Ralph Castain
d316701d3c
Remove unnecessary mcast channel
...
This commit was SVN r24816.
2011-06-23 20:44:22 +00:00
Wesley Bland
e1ba09ad51
Add resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure.
...
Note that more work will be necessary to allow the MPI layer to take advantage of this.
Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php
This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
391074cde6
Add a tag
...
This commit was SVN r24813.
2011-06-23 15:12:25 +00:00
Samuel Gutierrez
81f38b258a
Commit the new shared memory backing facility framework (shmem) and its components.
...
This commit was SVN r24795.
2011-06-21 15:41:57 +00:00
Ralph Castain
9491fbb60c
Remove two stale modules
...
This commit was SVN r24794.
2011-06-21 05:57:39 +00:00
Ralph Castain
b95ede99d5
Ensure we use the pmi grpcomm when using pmi
...
This commit was SVN r24793.
2011-06-20 21:57:47 +00:00
Ralph Castain
92a65f21bf
Restore slurm pmi support from long, long ago. Since we already have the ability to directly srun an MPI job, just conditionally add the PMI support for key values and provide a grpcomm module that uses PMI for barriers and modex.
...
Currently ompi_ignored, and unignored only for me (others to soon follow).
This commit was SVN r24792.
2011-06-20 21:04:46 +00:00
Ralph Castain
042ee3ec48
Support the option of outputting error_log messages with something other than the process name
...
This commit was SVN r24784.
2011-06-17 14:50:00 +00:00
Ralph Castain
61dd7f4588
Need to pass identification of input/output channels
...
This commit was SVN r24783.
2011-06-17 14:48:59 +00:00
Ralph Castain
93dcfc15d0
Let the upper layer set the channels to be opened
...
This commit was SVN r24779.
2011-06-16 20:31:52 +00:00
Ralph Castain
7f2d2e3de7
Track the app_context rank - will equal overall rank for single app_context jobs
...
This commit was SVN r24778.
2011-06-16 20:31:30 +00:00
Josh Hursey
0eb3b3b7b0
Fix missing functionality in MPI_Abort: the abort of the group of peers defined by the communicator, which should be aborted along with this process, is now requested from the runtime before the local process exits.
...
Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9335.php
This commit was SVN r24775.
2011-06-15 13:10:13 +00:00
Ralph Castain
033cbbed31
Don't automatically assign group channels if not given - let the layer above figure it out.
...
This commit was SVN r24771.
2011-06-10 16:28:18 +00:00
Ralph Castain
e039c7b7ea
Avoid crashing when debugging rmaps and a non-string resource constraint is given
...
This commit was SVN r24770.
2011-06-10 16:27:30 +00:00
Josh Hursey
9080eaedf3
Fully initialize the orte_errmgr_base_component_t structure for the app.
...
This commit was SVN r24765.
2011-06-09 14:22:25 +00:00
Josh Hursey
20339a7900
Minor coding style and indentation fixes.
...
This commit was SVN r24764.
2011-06-09 14:16:06 +00:00
Ralph Castain
fc8d920c56
Don't monitor resource usage unless requested, even if sensors are enabled during configure
...
This commit was SVN r24760.
2011-06-08 10:47:25 +00:00
Ralph Castain
906eb925f1
Update the resource usage sensor to initiate support for time analysis of measurements
...
This commit was SVN r24759.
2011-06-07 23:22:51 +00:00
Ralph Castain
f3cae3d6f3
Cleanup the handling of if_include and if_exclude arguments based on CIDR notation.
...
Fix a bug in the new code that prevented the system from correctly matching addresses.
Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide.
Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result.
Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages.
NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing.
This commit was SVN r24755.
2011-06-07 02:09:11 +00:00
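Illustrative note on r24755: the commit describes matching addresses against comma-delimited CIDR specifications and returning distinct codes so the caller can decide how to react. The following Python sketch shows one way such a check could look. `MatchResult` and `if_matches` are hypothetical names, and this is not the signature or behaviour of opal_ifmatches.

```python
import enum
import ipaddress


class MatchResult(enum.Enum):
    """Hypothetical result codes; the caller decides whether to abort."""
    MATCH = "match"
    NO_MATCH = "no match"
    BAD_SPEC = "bad spec"


def if_matches(address, spec_list):
    """Check an address against a comma-delimited list of CIDR specs."""
    addr = ipaddress.ip_address(address)
    for spec in spec_list.split(","):
        try:
            # strict=False accepts host bits, e.g. 10.1.2.3/8.
            network = ipaddress.ip_network(spec.strip(), strict=False)
        except ValueError:
            # Malformed spec: report it instead of silently continuing.
            return MatchResult.BAD_SPEC
        if addr in network:
            return MatchResult.MATCH
    return MatchResult.NO_MATCH
```

With this sketch, `if_matches("10.1.2.3", "192.168.0.0/16,10.0.0.0/8")` returns `MatchResult.MATCH`, while a malformed entry such as `10.0.0.0/33` yields `MatchResult.BAD_SPEC`, leaving the policy decision to the calling layer.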
George Bosilca
6b52d8f519
The paffinity is apparently needed.
...
This commit was SVN r24749.
2011-06-06 01:20:01 +00:00
Ralph Castain
bd8d9a943a
Add diagnostics
...
This commit was SVN r24748.
2011-06-05 19:17:56 +00:00
Ralph Castain
1491d52bd7
Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3").
...
This commit was SVN r24747.
2011-06-05 19:16:42 +00:00
George Bosilca
454519842e
Report bindings if requested.
...
This commit was SVN r24743.
2011-06-02 17:17:10 +00:00
George Bosilca
1eccadbd87
No need for the paffinity here.
...
This commit was SVN r24742.
2011-06-02 17:16:25 +00:00
Ralph Castain
8f401a0563
Enable the ability to constrain applications to hosts on the basis of resources.
...
This commit was SVN r24736.
2011-05-28 22:18:19 +00:00
Brian Barrett
beb1bc70b2
* Add support for using modex to exchange NID/PID pairs when using Portals4.
...
Rather than try to support a bunch of lightweight environments like I did
with the Portals3 code, always use the "modex" and hack the grpcomm for
the SHMEM implementation to return the right nid/pid for a remote
process by "magic".
This commit was SVN r24733.
2011-05-25 22:10:27 +00:00
Ralph Castain
81b6c50daa
Correct stale typo
...
This commit was SVN r24725.
2011-05-23 17:34:22 +00:00
Ralph Castain
661f508e62
Fix typo
...
This commit was SVN r24723.
2011-05-22 00:20:42 +00:00
Ralph Castain
b2331113a5
Add some debug
...
This commit was SVN r24722.
2011-05-21 21:09:47 +00:00
Ralph Castain
1b5ca323c6
Always follow up with SIGKILL when killing local procs, as procs can trap SIGTERM and get stuck
...
This commit was SVN r24719.
2011-05-20 22:40:10 +00:00
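Illustrative note on r24719: a process that traps or ignores SIGTERM will never exit on its own, so the terminate request is followed by SIGKILL, which cannot be caught. The sketch below shows the pattern in Python on a POSIX system. `terminate_child`, `spawn_stubborn_child`, and the grace period are assumptions for illustration, not ORTE code.

```python
import signal
import subprocess
import sys


def spawn_stubborn_child():
    """Start a child that ignores SIGTERM (stands in for a stuck proc)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "time.sleep(60)\n"
    )
    return subprocess.Popen([sys.executable, "-c", code])


def terminate_child(proc, grace_seconds=1.0):
    """Send SIGTERM, wait briefly, then always follow up with SIGKILL."""
    proc.terminate()  # SIGTERM: lets a well-behaved child clean up
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass  # the child trapped SIGTERM or is slow to exit
    proc.kill()  # SIGKILL: a no-op if the child was already reaped
    return proc.wait()
```

On POSIX the return value is the negated signal number that ended the child, so a child that really ignored SIGTERM reports `-signal.SIGKILL`.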
Ralph Castain
c5686ecfca
Don't sample stats for pid=0 children
...
This commit was SVN r24717.
2011-05-20 14:33:23 +00:00
Ralph Castain
b03e4481a3
Plug a couple of additional places in case orte_iof is not opened
...
This commit was SVN r24716.
2011-05-20 13:42:53 +00:00
Ralph Castain
dc0bb0571b
Record the number of heartbeats received each period for diagnostic purposes
...
This commit was SVN r24714.
2011-05-20 00:21:33 +00:00
Ralph Castain
69dce0ec10
Minor heartbeat cleanups
...
This commit was SVN r24713.
2011-05-19 21:27:44 +00:00