Wesley Bland
67feeb6aca
Move the errmgr code back. This shouldn't cause the svn problems that I
apparently caused last time. Sorry about that. This one will just be a big
changelog.
This commit was SVN r25016.
2011-08-08 16:01:08 +00:00
Wesley Bland
09274cd047
Make sure that the epoch is initialized everywhere so we don't get weird output
during valgrind. This shouldn't have caused any problems with any actual
execution. Just extra warnings in valgrind.
This commit was SVN r25015.
2011-08-08 15:11:55 +00:00
Ralph Castain
8014e3429e
Don't double-count procs as they are launched
This commit was SVN r25011.
2011-08-08 06:05:23 +00:00
Ralph Castain
4083dc617f
Fix computation of number of required files and file descriptors - it only depends on the total number of local procs, not on the number of procs in the entire job!
This commit was SVN r25008.
2011-08-08 04:09:40 +00:00
Ralph Castain
8b3c562b84
Adjust verbosity levels to make it easier to debug at scale
This commit was SVN r25006.
2011-08-07 21:14:21 +00:00
Ralph Castain
2418831bea
Pass the nodelist to the aprun command even when using all nodes
This commit was SVN r25004.
2011-08-06 04:19:41 +00:00
Ralph Castain
bd8e43a2de
Correct debug output so it doesn't falsely report the module
This commit was SVN r25003.
2011-08-05 20:30:34 +00:00
Ralph Castain
d603c79ab4
Fix the FAILED_TO_START scenario so orted doesn't segfault
This commit was SVN r25002.
2011-08-05 20:29:50 +00:00
Ralph Castain
c86bfb4e90
Need to copy the string
This commit was SVN r25001.
2011-08-05 19:03:28 +00:00
Ralph Castain
066022126e
Sort the nodes to be in numerically increasing order so the regex has a chance of working right.
This commit was SVN r24993.
2011-08-05 03:37:13 +00:00
Jeff Squyres
294e1f50cd
Remove compiler warning about nested comment
This commit was SVN r24984.
2011-08-03 18:30:56 +00:00
Wesley Bland
87a96da99c
Should fix some of the shutdown woes of the errmgr.
Correctly checks that the orted's job is completed.
Correctly tests to make sure that there is a shutdown going on (doesn't rely on orte_orteds_term_ordered).
Adds a patch from Ralph to correctly check the status of processes.
This commit was SVN r24962.
2011-08-01 14:00:41 +00:00
Ralph Castain
42b125ef35
Move the debug so it more accurately reports
This commit was SVN r24961.
2011-07-29 20:48:46 +00:00
Ralph Castain
70bca4691f
Add a new "sensor" module that supports fault tolerance tests - randomly kills local procs and/or the daemon itself
This commit was SVN r24960.
2011-07-29 20:48:22 +00:00
Wesley Bland
5fde3e0e00
Move the resilient orte errmgr code into a separate errmgr for now while it's
still unstable. Reverted errmgr modules back to the original errmgr (with the
updates since the resilient code was brought into the trunk).
This commit was SVN r24958.
2011-07-28 21:24:34 +00:00
Ralph Castain
6c879f87fb
Add a new param "orte_remote_tmpdir_base" for those situations where the compute nodes require a different session directory head than the head node.
This commit was SVN r24956.
2011-07-27 19:37:17 +00:00
Ralph Castain
decab98fb2
Do a little better job of catching up on missed mcast messages, and provide a way out of scenarios where catch-up is impossible.
This commit was SVN r24955.
2011-07-27 14:58:30 +00:00
Ralph Castain
c3bc33b3fb
Don't be so restrictive - accept "slots" as well as "slot" in rank file
This commit was SVN r24954.
2011-07-27 00:45:30 +00:00
Wesley Bland
b972fd84e1
No longer sends extra FAILED_NOTIFICATION messages in the non-failure case.
Should reduce finalize complexity and avoid a race condition that has been
detected by a few users.
This commit was SVN r24952.
2011-07-26 20:47:44 +00:00
Ralph Castain
db193555c2
Use non-blocking sends for recovering from lost multicast messages
This commit was SVN r24943.
2011-07-25 18:49:47 +00:00
Ralph Castain
869024f1c6
You have to initialize the daemon param -before- using it to get epoch!!
This commit was SVN r24926.
2011-07-23 20:19:43 +00:00
Ralph Castain
361bcef253
Close multicast before rml
This commit was SVN r24925.
2011-07-23 20:19:15 +00:00
Shiqing Fan
cc4403a863
Remove two unused windows files.
This commit was SVN r24913.
2011-07-21 12:53:32 +00:00
Brian Barrett
3bd66a5932
* Remove unused Portals3.3 reference implementation support
This commit was SVN r24906.
2011-07-20 23:30:29 +00:00
Eugene Loh
921852e1e5
Clean up the computations of num_procs_alive. Do some code
refactoring to improve readability and to compute num_procs_alive
correctly and to remove the use of loop iteration variables for
two loops nested one inside another (causing MPI_Comm_spawn_multiple
to fail).
This commit was SVN r24903.
2011-07-14 20:10:48 +00:00
Ralph Castain
8853e0e80a
Fix regular expression analyzer for slurmd - use a slurm-specific version
Fix multi-node routing for daemon startup when static ports are not set
This commit was SVN r24898.
2011-07-13 22:49:56 +00:00
Ralph Castain
8d1b31b887
Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly.
Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list.
This commit was SVN r24894.
2011-07-13 20:11:14 +00:00
Ralph Castain
1405bacd85
Ensure we don't segfault if we report an error
This commit was SVN r24890.
2011-07-13 15:00:22 +00:00
Jeff Squyres
3893a5a1de
Fix compile error introduced in r24888.
This commit was SVN r24889.
The following SVN revision numbers were found above:
r24888 --> open-mpi/ompi@e5253647ea
2011-07-13 14:18:00 +00:00
Shiqing Fan
e5253647ea
Fix a type cast.
This commit was SVN r24888.
2011-07-13 09:00:17 +00:00
Nathan Hjelm
3f4e5d7dd6
add missing thread lock/unlock around condition_broadcast
This commit was SVN r24885.
2011-07-12 15:43:56 +00:00
Nathan Hjelm
c3ec2e2614
fix a potential race condition in rml
This commit was SVN r24884.
2011-07-12 15:43:12 +00:00
Abhishek Kulkarni
6bf02d1344
Fixes (after the new ORTE resiliency layer was merged) to make the trunk build with C/R flags turned on.
This commit was SVN r24872.
2011-07-10 23:36:26 +00:00
Ralph Castain
05f4926bfe
Remove some remaining cruft re regular expressions - caused the trunk to fail if regex wasn't being used
This commit was SVN r24861.
2011-07-08 06:42:12 +00:00
Ralph Castain
1ee7c39982
Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.
Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.
This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
Ralph Castain
6496b2f845
Ensure we terminate properly on non-zero exit status
This commit was SVN r24859.
2011-07-07 14:33:49 +00:00
Wesley Bland
0628963506
Fix a return code when a process isn't found.
This commit was SVN r24845.
2011-06-30 15:22:54 +00:00
Brian Barrett
e52fef28ca
Do something rational for the disable full support case
This commit was SVN r24844.
2011-06-30 14:48:19 +00:00
Ralph Castain
8ac35a8496
Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate.
This commit was SVN r24843.
2011-06-30 14:11:56 +00:00
Ralph Castain
c449871ade
Add an mca param to set the "fork agent" - i.e., a program to be run when forking off a process (e.g., valgrind). While you could specify this by "mpirun -n N fork_agent ./my_app", not everyone launches procs with ORTE from mpirun.
Provide the ability to store recent stat histories using the ring_buffer class
This commit was SVN r24842.
2011-06-30 03:12:38 +00:00
Ralph Castain
2e1fa3e08e
Don't error out if the recv.cancel comes back not found as this is just a race condition
This commit was SVN r24841.
2011-06-30 01:19:50 +00:00
Wesley Bland
84be81df95
Standardize the initialization of the EPOCHs.
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epoch to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.
This commit was SVN r24829.
2011-06-28 14:20:33 +00:00
Ralph Castain
d316701d3c
Remove unnecessary mcast channel
This commit was SVN r24816.
2011-06-23 20:44:22 +00:00
Wesley Bland
e1ba09ad51
Add resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.
Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php
This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
391074cde6
Add a tag
This commit was SVN r24813.
2011-06-23 15:12:25 +00:00
Samuel Gutierrez
81f38b258a
commit of new shared memory backing facility framework (shmem) and its components.
This commit was SVN r24795.
2011-06-21 15:41:57 +00:00
Ralph Castain
9491fbb60c
Remove two stale modules
This commit was SVN r24794.
2011-06-21 05:57:39 +00:00
Ralph Castain
b95ede99d5
Ensure we use the pmi grpcomm when using pmi
This commit was SVN r24793.
2011-06-20 21:57:47 +00:00
Ralph Castain
92a65f21bf
Restore slurm pmi support from long, long ago. Since we already have the ability to directly srun an MPI job, just conditionally add the PMI support for key values and provide a grpcomm module that uses PMI for barriers and modex.
Currently ompi_ignored, and unignored only for me (others to soon follow).
This commit was SVN r24792.
2011-06-20 21:04:46 +00:00
Ralph Castain
042ee3ec48
Support the option of outputting error_log messages with something other than the process name
This commit was SVN r24784.
2011-06-17 14:50:00 +00:00