1
1
Граф коммитов

3156 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
aa92e0c4eb Replace a useless counter with a boolean check to see if we have already passed thru opal_finalize so we don't call finalize, and then don't pass thru it (as was happening on several tools)
This commit was SVN r24862.
2011-07-08 06:43:19 +00:00
Ralph Castain
05f4926bfe Remove some remaining cruft re regular expressions - caused the trunk to fail if regex wasn't being used
This commit was SVN r24861.
2011-07-08 06:42:12 +00:00
Ralph Castain
1ee7c39982 Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.

Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.

This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
Ralph Castain
6496b2f845 Ensure we terminate properly on non-zero exit status
This commit was SVN r24859.
2011-07-07 14:33:49 +00:00
Wesley Bland
0628963506 Fix a return code when a process isn't found.
This commit was SVN r24845.
2011-06-30 15:22:54 +00:00
Brian Barrett
e52fef28ca Do something rational for the disable full support case
This commit was SVN r24844.
2011-06-30 14:48:19 +00:00
Ralph Castain
8ac35a8496 Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate.
This commit was SVN r24843.
2011-06-30 14:11:56 +00:00
Ralph Castain
c449871ade Add an mca param to set the "fork agent" - i.e., a program to be run when forking off a process (e.g., valgrind). While you could specify this by "mpirun -n N fork_agent ./my_app", not everyone launches procs with ORTE from mpirun.
Provide the ability to store recent stat histories using the ring_buffer class

This commit was SVN r24842.
2011-06-30 03:12:38 +00:00
Ralph Castain
2e1fa3e08e Don't error out if the recv.cancel comes back not found as this is just a race condition
This commit was SVN r24841.
2011-06-30 01:19:50 +00:00
Ralph Castain
418229c71c Define a new error constant
This commit was SVN r24833.
2011-06-28 19:47:16 +00:00
Wesley Bland
84be81df95 Standardize the initialization of the EPOCH's.
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epoch to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.

This commit was SVN r24829.
2011-06-28 14:20:33 +00:00
Ralph Castain
c203eee223 Since process names now have three fields, be sure to initialize all three of them
This commit was SVN r24828.
2011-06-27 20:50:08 +00:00
Ralph Castain
d316701d3c Remove unnecessary mcast channel
This commit was SVN r24816.
2011-06-23 20:44:22 +00:00
Wesley Bland
e1ba09ad51 Add a resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
391074cde6 Add a tag
This commit was SVN r24813.
2011-06-23 15:12:25 +00:00
Ralph Castain
afceaaa8e4 Add support for detecting rapid failures so the errmgr can respond accordingly
This commit was SVN r24796.
2011-06-21 16:08:41 +00:00
Samuel Gutierrez
81f38b258a commit of new shared memory backing facility framework (shmem) and its components.
This commit was SVN r24795.
2011-06-21 15:41:57 +00:00
Ralph Castain
9491fbb60c Remove two stale modules
This commit was SVN r24794.
2011-06-21 05:57:39 +00:00
Ralph Castain
b95ede99d5 Ensure we use the pmi grpcomm when using pmi
This commit was SVN r24793.
2011-06-20 21:57:47 +00:00
Ralph Castain
92a65f21bf Restore slurm pmi support from long, long ago. Since we already have the ability to directly srun an MPI job, just conditionally add the PMI support for key values and provide a grpcomm module that uses PMI for barriers and modex.
Currently ompi_ignored, and unignored only for me (others to soon follow).

This commit was SVN r24792.
2011-06-20 21:04:46 +00:00
Jeff Squyres
9531e205e1 Minor fix to a comment
This commit was SVN r24789.
2011-06-20 17:51:01 +00:00
Ralph Castain
d563e58348 Remove another lingering bproc reference
This commit was SVN r24785.
2011-06-17 18:01:23 +00:00
Ralph Castain
042ee3ec48 Support the option of outputting error_log messages with something other than the process name
This commit was SVN r24784.
2011-06-17 14:50:00 +00:00
Ralph Castain
61dd7f4588 Need to pass identification of input/output channels
This commit was SVN r24783.
2011-06-17 14:48:59 +00:00
Ralph Castain
93dcfc15d0 Let the upper layer set the channels to be opened
This commit was SVN r24779.
2011-06-16 20:31:52 +00:00
Ralph Castain
7f2d2e3de7 Track the app_context rank - will equal overall rank for single app_context jobs
This commit was SVN r24778.
2011-06-16 20:31:30 +00:00
Josh Hursey
0eb3b3b7b0 Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits.
Per RFC:
  http://www.open-mpi.org/community/lists/devel/2011/06/9335.php

This commit was SVN r24775.
2011-06-15 13:10:13 +00:00
Ralph Castain
033cbbed31 Don't automatically assign group channels if not given - let the layer above figure it out.
This commit was SVN r24771.
2011-06-10 16:28:18 +00:00
Ralph Castain
e039c7b7ea Avoid crashing when debugging rmaps and a non-string resource constraint is given
This commit was SVN r24770.
2011-06-10 16:27:30 +00:00
Josh Hursey
6539a31b23 Cleanup configure checks for C/R functionality.
Add a WANT_FT_CR flag different from WANT_FT so tools like *-checkpoint are not built when a different FT technique is requested.

Also fix the C/R thread check so that it is only enabled if C/R is enabled, not generally when threads are enabled.

This commit was SVN r24769.
2011-06-09 19:45:29 +00:00
Josh Hursey
9080eaedf3 Fully initialize the orte_errmgr_base_component_t structure for the app.
This commit was SVN r24765.
2011-06-09 14:22:25 +00:00
Josh Hursey
20339a7900 Minor coding style and intentation fixes.
This commit was SVN r24764.
2011-06-09 14:16:06 +00:00
Ralph Castain
fc8d920c56 Dont monitor resource usage unless requested, even if sensors are enabled during configure
This commit was SVN r24760.
2011-06-08 10:47:25 +00:00
Ralph Castain
906eb925f1 Update the resource usage sensor to initiate support for time analysis of measurements
This commit was SVN r24759.
2011-06-07 23:22:51 +00:00
Ralph Castain
f3cae3d6f3 Cleanup the handling of if_include and if_exclude arguments based on CIDR notation.
Fix a bug in the new code that prevented the system from correctly matching addresses.

Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide.

Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result.

Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages.

NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing.

This commit was SVN r24755.
2011-06-07 02:09:11 +00:00
George Bosilca
6b52d8f519 The paffinity is apparently needed.
This commit was SVN r24749.
2011-06-06 01:20:01 +00:00
Ralph Castain
bd8d9a943a Add diagnostics
This commit was SVN r24748.
2011-06-05 19:17:56 +00:00
Ralph Castain
1491d52bd7 Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3").
This commit was SVN r24747.
2011-06-05 19:16:42 +00:00
George Bosilca
454519842e Report bindings if requested.
This commit was SVN r24743.
2011-06-02 17:17:10 +00:00
George Bosilca
1eccadbd87 No need for the paffinity here.
This commit was SVN r24742.
2011-06-02 17:16:25 +00:00
Ralph Castain
8f401a0563 Enable the ability to constrain applications to hosts on the basis of resources.
This commit was SVN r24736.
2011-05-28 22:18:19 +00:00
Brian Barrett
beb1bc70b2 * Add support for using modex to exchange NID/PID pairs when using Portals4.
Rather than try to support a bunch of lightweight environments like I did
  with the Portals3 code, always use the "modex" and hack the grpcomm for
  the SHMEM implementation to return the right nid/pid for a remote
  process by "magic".

This commit was SVN r24733.
2011-05-25 22:10:27 +00:00
Ralph Castain
81b6c50daa Correct stale typo
This commit was SVN r24725.
2011-05-23 17:34:22 +00:00
Ralph Castain
661f508e62 Fix typo
This commit was SVN r24723.
2011-05-22 00:20:42 +00:00
Ralph Castain
b2331113a5 Add some debug
This commit was SVN r24722.
2011-05-21 21:09:47 +00:00
Ralph Castain
8c08ee9c3d Remove stale tool
This commit was SVN r24720.
2011-05-21 00:38:35 +00:00
Ralph Castain
1b5ca323c6 Always followup with sigkill when killing local procs as procs can trap sigterm and get stuck
This commit was SVN r24719.
2011-05-20 22:40:10 +00:00
Ralph Castain
c5686ecfca Dont sample stats for pid=0 children
This commit was SVN r24717.
2011-05-20 14:33:23 +00:00
Ralph Castain
b03e4481a3 Plug a couple of additional places in case orte_iof is not opened
This commit was SVN r24716.
2011-05-20 13:42:53 +00:00
Ralph Castain
dc0bb0571b Record the number of heartbeats recvd each period for diag purposes
This commit was SVN r24714.
2011-05-20 00:21:33 +00:00