Ralph Castain
bd8e43a2de
Correct debug output so it doesn't falsely report the module
...
This commit was SVN r25003.
2011-08-05 20:30:34 +00:00
Ralph Castain
d603c79ab4
Fix the FAILED_TO_START scenario so orted doesn't segfault
...
This commit was SVN r25002.
2011-08-05 20:29:50 +00:00
Ralph Castain
c86bfb4e90
Need to copy the string
...
This commit was SVN r25001.
2011-08-05 19:03:28 +00:00
Ralph Castain
7b307d5bf0
Cleanup handling of all-numerical node names
...
This commit was SVN r25000.
2011-08-05 14:59:14 +00:00
Ralph Castain
157bad5435
If we can't compress the name, that's fine - but we still have to move to the next position
...
This commit was SVN r24999.
2011-08-05 14:43:36 +00:00
Ralph Castain
3199663613
Correctly handle the case of mixes of character-based names and all-number names
...
This commit was SVN r24998.
2011-08-05 14:37:36 +00:00
Ralph Castain
066022126e
Sort the nodes to be in numerically increasing order so the regex has a chance of working right.
...
This commit was SVN r24993.
2011-08-05 03:37:13 +00:00
Ralph Castain
5a634caad9
Cleanly handle the case where the node "name" is just a number, and avoid the N-N output when the number is not part of a sequence.
...
This commit was SVN r24992.
2011-08-05 03:36:30 +00:00
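The node-name handling described in r24992 and r24993 above (sort numerically, then avoid "N-N" output for a number that is not part of a sequence) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ORTE regex code: `demo_compress_nodes` and `cmp_int` are hypothetical names, node names are assumed to be distinct plain integers, and the comma-separated output format is an assumption made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for ints, written to avoid overflow on subtraction. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: sorts ids in place, then writes runs of consecutive
 * numbers as "lo-hi" and isolated numbers as a bare "N" (never "N-N").
 * Returns 0 on success, -1 if the output buffer is too small. */
int demo_compress_nodes(int *ids, size_t n, char *out, size_t outlen)
{
    size_t used = 0;

    assert(ids != NULL && out != NULL && outlen > 0);
    out[0] = '\0';
    qsort(ids, n, sizeof(ids[0]), cmp_int);

    for (size_t i = 0; i < n; ) {
        size_t j = i;
        int w;
        const char *sep = (used > 0) ? "," : "";

        /* Extend the run while the next id is exactly one larger. */
        while (j + 1 < n && ids[j + 1] == ids[j] + 1) {
            j++;
        }
        if (j == i) {
            w = snprintf(out + used, outlen - used, "%s%d", sep, ids[i]);
        } else {
            w = snprintf(out + used, outlen - used, "%s%d-%d",
                         sep, ids[i], ids[j]);
        }
        if (w < 0 || (size_t)w >= outlen - used) {
            return -1;  /* output would have been truncated */
        }
        used = strlen(out);
        i = j + 1;
    }
    return 0;
}
```

Sorting first is what lets a single linear pass find every run; without it, adjacent numbers that arrive out of order would be emitted as separate entries.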
Jeff Squyres
294e1f50cd
Remove compiler warning about nested comment
...
This commit was SVN r24984.
2011-08-03 18:30:56 +00:00
Wesley Bland
87a96da99c
Should fix some of the shutdown woes of the errmgr.
...
Correctly checks that the orted's job is completed.
Correctly tests to make sure that there is shutdown going on (doesn't rely on orte_orteds_term_ordered).
Adds a patch from Ralph to correctly check the status of processes.
This commit was SVN r24962.
2011-08-01 14:00:41 +00:00
Ralph Castain
42b125ef35
Move the debug so it more accurately reports
...
This commit was SVN r24961.
2011-07-29 20:48:46 +00:00
Ralph Castain
70bca4691f
Add a new "sensor" module that supports fault tolerance tests - randomly kills local procs and/or the daemon itself
...
This commit was SVN r24960.
2011-07-29 20:48:22 +00:00
Wesley Bland
5fde3e0e00
Move the resilient orte errmgr code into a separate errmgr for now while it's
...
still unstable. Reverted errmgr modules back to the original errmgr (with the
updates since the resilient code was brought into the trunk).
This commit was SVN r24958.
2011-07-28 21:24:34 +00:00
Ralph Castain
6c879f87fb
Add a new param "orte_remote_tmpdir_base" for those situations where the compute nodes require a different session directory head than the head node.
...
This commit was SVN r24956.
2011-07-27 19:37:17 +00:00
Ralph Castain
decab98fb2
Do a little better job of catching up on missed mcast messages, and provide a way out of scenarios where catch-up is impossible.
...
This commit was SVN r24955.
2011-07-27 14:58:30 +00:00
Ralph Castain
c3bc33b3fb
Don't be so restrictive - accept "slots" as well as "slot" in rank file
...
This commit was SVN r24954.
2011-07-27 00:45:30 +00:00
Wesley Bland
b972fd84e1
No longer sends extra FAILED_NOTIFICATION messages in the non-failure case.
...
Should reduce finalize complexity and avoid a race condition that has been
detected by a few users.
This commit was SVN r24952.
2011-07-26 20:47:44 +00:00
Ralph Castain
715f871605
Ignore the daemon job when reporting parseable output
...
This commit was SVN r24944.
2011-07-25 20:44:08 +00:00
Ralph Castain
db193555c2
Use non-blocking sends for recovering from lost multicast messages
...
This commit was SVN r24943.
2011-07-25 18:49:47 +00:00
Ralph Castain
199804fc35
complete implementation of parseable output
...
This commit was SVN r24929.
2011-07-23 22:23:24 +00:00
Ralph Castain
ffe6f5f40e
Fix map pack/unpack so they match
...
This commit was SVN r24928.
2011-07-23 22:23:05 +00:00
Ralph Castain
00647fa342
Update orte-ps to add parseable output - not fully tested because I couldn't get other parts of the system to work.
...
This commit was SVN r24927.
2011-07-23 20:20:31 +00:00
Ralph Castain
869024f1c6
You have to initialize the daemon param -before- using it to get epoch!!
...
This commit was SVN r24926.
2011-07-23 20:19:43 +00:00
Ralph Castain
361bcef253
Close multicast before rml
...
This commit was SVN r24925.
2011-07-23 20:19:15 +00:00
Shiqing Fan
cc4403a863
Remove two unused windows files.
...
This commit was SVN r24913.
2011-07-21 12:53:32 +00:00
Brian Barrett
3bd66a5932
* Remove unused Portals3.3 reference implementation support
...
This commit was SVN r24906.
2011-07-20 23:30:29 +00:00
Eugene Loh
921852e1e5
Clean up the computations of num_procs_alive. Do some code
...
refactoring to improve readability and to compute num_procs_alive
correctly and to remove the use of loop iteration variables for
two loops nested one inside another (causing MPI_Comm_spawn_multiple
to fail).
This commit was SVN r24903.
2011-07-14 20:10:48 +00:00
Ralph Castain
8853e0e80a
Fix regular expression analyzer for slurmd - use a slurm-specific version
...
Fix multi-node routing for daemon startup when static ports are not set
This commit was SVN r24898.
2011-07-13 22:49:56 +00:00
Ralph Castain
8d1b31b887
Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly.
...
Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list.
This commit was SVN r24894.
2011-07-13 20:11:14 +00:00
Ralph Castain
1405bacd85
Ensure we don't segfault if we report an error
...
This commit was SVN r24890.
2011-07-13 15:00:22 +00:00
Jeff Squyres
3893a5a1de
Fix compile error introduced in r24888.
...
This commit was SVN r24889.
The following SVN revision numbers were found above:
r24888 --> open-mpi/ompi@e5253647ea
2011-07-13 14:18:00 +00:00
Shiqing Fan
e5253647ea
Fix a type cast.
...
This commit was SVN r24888.
2011-07-13 09:00:17 +00:00
Ralph Castain
1ad110d2e9
After a nice, calm, rational discussion between Brian, Jeff, and myself, we decided to revert r24864 and r24862 to restore the reference counters in opal_init/finalize. The rationale was that we should instead change orte_init/finalize to also use reference counters to support multi-embedded libraries. Jeff and Brian will discuss proposing a similar change to mpi_init/finalize to the MPI Forum so that all three libraries will behave in similar manners.
...
It was agreed that opal_init_util had wound up being used in unintended ways, which raised the problem of getting reference counts to work right. However, fixing it would involve more pain than it was worth - and so long as the other layers are made to behave similarly, I have no preference either way.
Complete implementation will follow - for now, this just reverts the prior changes.
This commit was SVN r24886.
The following SVN revision numbers were found above:
r24862 --> open-mpi/ompi@aa92e0c4eb
r24864 --> open-mpi/ompi@a5062385c2
2011-07-12 17:07:41 +00:00
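The reference-counter rationale in r24886 above can be illustrated with a small sketch: each init call increments a counter, each finalize call decrements it, and real setup and teardown happen only on the first init and the last finalize, so several embedded libraries can share one underlying layer. All names here (`demo_init`, `demo_finalize`, `demo_is_up`) are hypothetical and this is not the opal_init/opal_finalize implementation; thread safety is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state: number of outstanding init calls and whether the
 * shared resources are currently set up. */
static int demo_refcount = 0;
static bool demo_resources_up = false;

/* Returns 0 on success. Only the first caller performs real setup. */
int demo_init(void)
{
    assert(demo_refcount >= 0);
    if (++demo_refcount > 1) {
        return 0;   /* already initialized by another embedded user */
    }
    demo_resources_up = true;   /* real setup would happen here */
    return 0;
}

/* Returns 0 on success, -1 if called without a matching init.
 * Only the last caller performs real teardown. */
int demo_finalize(void)
{
    if (demo_refcount == 0) {
        return -1;
    }
    if (--demo_refcount > 0) {
        return 0;   /* another user still needs the resources */
    }
    demo_resources_up = false;  /* real teardown would happen here */
    return 0;
}

/* Reports whether the shared resources are currently available. */
bool demo_is_up(void)
{
    return demo_resources_up;
}
```

By contrast, a single "already finalized" boolean (the approach that r24862 introduced and r24886 reverted) cannot tell how many users still depend on the layer, which is why the counter suits the multi-embedded case.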
Nathan Hjelm
3f4e5d7dd6
add missing thread lock/unlock around condition_broadcast
...
This commit was SVN r24885.
2011-07-12 15:43:56 +00:00
Nathan Hjelm
c3ec2e2614
fix a potential race condition in rml
...
This commit was SVN r24884.
2011-07-12 15:43:12 +00:00
Abhishek Kulkarni
6bf02d1344
Fixes (after the new ORTE resiliency layer was merged) to make the trunk build with C/R flags turned on.
...
This commit was SVN r24872.
2011-07-10 23:36:26 +00:00
Ralph Castain
a5062385c2
Fix singletons
...
This commit was SVN r24864.
2011-07-08 14:38:33 +00:00
Jeff Squyres
3affb8403e
Remove extra output
...
This commit was SVN r24863.
2011-07-08 13:01:30 +00:00
Ralph Castain
aa92e0c4eb
Replace a useless counter with a boolean check to see if we have already passed thru opal_finalize so we don't call finalize, and then don't pass thru it (as was happening on several tools)
...
This commit was SVN r24862.
2011-07-08 06:43:19 +00:00
Ralph Castain
05f4926bfe
Remove some remaining cruft re regular expressions - caused the trunk to fail if regex wasn't being used
...
This commit was SVN r24861.
2011-07-08 06:42:12 +00:00
Ralph Castain
1ee7c39982
Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
...
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.
Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.
This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
Ralph Castain
6496b2f845
Ensure we terminate properly on non-zero exit status
...
This commit was SVN r24859.
2011-07-07 14:33:49 +00:00
Wesley Bland
0628963506
Fix a return code when a process isn't found.
...
This commit was SVN r24845.
2011-06-30 15:22:54 +00:00
Brian Barrett
e52fef28ca
Do something rational for the disable full support case
...
This commit was SVN r24844.
2011-06-30 14:48:19 +00:00
Ralph Castain
8ac35a8496
Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate.
...
This commit was SVN r24843.
2011-06-30 14:11:56 +00:00
Ralph Castain
c449871ade
Add an mca param to set the "fork agent" - i.e., a program to be run when forking off a process (e.g., valgrind). While you could specify this by "mpirun -n N fork_agent ./my_app", not everyone launches procs with ORTE from mpirun.
...
Provide the ability to store recent stat histories using the ring_buffer class
This commit was SVN r24842.
2011-06-30 03:12:38 +00:00
Ralph Castain
2e1fa3e08e
Don't error out if the recv.cancel comes back not found as this is just a race condition
...
This commit was SVN r24841.
2011-06-30 01:19:50 +00:00
Ralph Castain
418229c71c
Define a new error constant
...
This commit was SVN r24833.
2011-06-28 19:47:16 +00:00
Wesley Bland
84be81df95
Standardize the initialization of the EPOCHs.
...
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epoch to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.
This commit was SVN r24829.
2011-06-28 14:20:33 +00:00
Ralph Castain
c203eee223
Since process names now have three fields, be sure to initialize all three of them
...
This commit was SVN r24828.
2011-06-27 20:50:08 +00:00