3414 Commits

Author SHA1 Message Date
Ralph Castain
23f47295a8 Add even more debug
This commit was SVN r25053.
2011-08-16 16:41:33 +00:00
Ralph Castain
d624d43f69 Add more debug
This commit was SVN r25052.
2011-08-16 15:47:37 +00:00
Ralph Castain
3d96497581 Add debug
This commit was SVN r25050.
2011-08-16 12:22:05 +00:00
Shiqing Fan
7292ee2387 One .windows file is missing in the tarball.
This commit was SVN r25049.
2011-08-15 10:21:25 +00:00
Shiqing Fan
3af7c9f7bb Complete the MinGW build support on Windows.
This commit was SVN r25048.
2011-08-15 09:47:23 +00:00
Shiqing Fan
627f1dd351 Correct several export declarations.
This commit was SVN r25047.
2011-08-15 09:45:51 +00:00
Ralph Castain
ca3d29a1e6 Extend regex support to a bigger audience
This commit was SVN r25046.
2011-08-12 21:02:48 +00:00
Ralph Castain
ea4e2c2db4 Unused variables
This commit was SVN r25045.
2011-08-12 21:02:09 +00:00
Jeff Squyres
1cbfb53801 r24976 wasn't quite right -- you now actually get a warning if you
specify btl_tcp_if_include because btl_tcp_if_exclude is defaulted to
the loopback devices.

This commit does a few things:

 * Introduce a new OPAL MCA base function:
   mca_base_param_check_exclusive_string().  It checks to see that the
   ''user'' does not set two MCA parameters that are mutually
   exclusive by checking the source of those MCA param values.
 * Use the above function in many BTLs (and the OOB TCP) to ensure
   that <foo>_if_include and <foo>_if_exclude are not both specified
   ''by the user''.
 * Re-arrange many of these BTLs to move their MCA registration code
   into a separate component_register() function (vs. the
   component_open() function).

This code has been nominally reviewed and checked by Ralph, George,
Terry, and Shiqing.

This commit was SVN r25043.

The following SVN revision numbers were found above:
  r24976 --> open-mpi/ompi@8f4ac54336
2011-08-10 17:24:36 +00:00
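
The check described in r25043 can be illustrated with a minimal sketch. This is not the OPAL C implementation: the `Source` enum, the `Param` record, and `check_exclusive()` are hypothetical names, and the values `eth0` and `lo` are made up for the example. The only idea taken from the commit message is that two mutually exclusive parameters conflict only when the user set both, which is decided by looking at where each value came from rather than at whether a value exists.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    """Where a parameter value came from (hypothetical categories)."""
    DEFAULT = "default"
    ENV = "environment"
    FILE = "file"
    CMDLINE = "command line"


@dataclass
class Param:
    """A parameter value together with its source (hypothetical record)."""
    name: str
    value: Optional[str]
    source: Source


def set_by_user(p: Param) -> bool:
    # A built-in default does not count as a user choice.
    return p.value is not None and p.source is not Source.DEFAULT


def check_exclusive(a: Param, b: Param) -> bool:
    """Return True if the pair is acceptable, False if the user set both."""
    return not (set_by_user(a) and set_by_user(b))


# Illustrative values only: the exclude list comes from a default,
# so specifying the include list alone is not a conflict.
include = Param("btl_tcp_if_include", "eth0", Source.CMDLINE)
exclude_default = Param("btl_tcp_if_exclude", "lo", Source.DEFAULT)
exclude_user = Param("btl_tcp_if_exclude", "lo", Source.ENV)

ok_with_default = check_exclusive(include, exclude_default)
ok_with_user = check_exclusive(include, exclude_user)
```

In this sketch a defaulted `btl_tcp_if_exclude` no longer triggers a complaint when only `btl_tcp_if_include` is given, which is the behaviour the commit message says r24976 got wrong.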
Ralph Castain
b360c98afd Per request from Pasha, revert r25004 - but modified a touch to reflect fact that opal_argv_append copies the provided string, so we don't need to print it and then free it.
This commit was SVN r25037.

The following SVN revision numbers were found above:
  r25004 --> open-mpi/ompi@2418831bea
2011-08-09 22:42:27 +00:00
Nathan Hjelm
aa3d302a05 use persistent rml_recv in iof
This commit was SVN r25035.
2011-08-09 21:30:12 +00:00
Ralph Castain
f1951e7ccd If we are abnormally terminating, then don't wait for orteds to report back. Send them a "halt_vm" command, which instructs them to kill their local procs and immediately terminate, doing their best to cleanup on the way out.
Also do a little cleanup on debug output in rshbase.

This commit was SVN r25033.
2011-08-09 17:42:19 +00:00
Wesley Bland
67feeb6aca Move the errmgr code back. This shouldn't cause the svn problems that I
apparently caused last time. Sorry about that. This one will just be a big
changelog.  

This commit was SVN r25016.
2011-08-08 16:01:08 +00:00
Wesley Bland
09274cd047 Make sure that the epoch is initialized everywhere so we don't get weird output
during valgrind. This shouldn't have caused any problems with any actual
execution. Just extra warnings in valgrind.

This commit was SVN r25015.
2011-08-08 15:11:55 +00:00
Ralph Castain
8014e3429e Don't double-count procs as they are launched
This commit was SVN r25011.
2011-08-08 06:05:23 +00:00
Ralph Castain
7b9f958dcf Add some missing error strings. Update test to show silent errors
This commit was SVN r25010.
2011-08-08 04:21:02 +00:00
Ralph Castain
4083dc617f Fix computation of number of required files and file descriptors - it only depends on the total number of local procs, not on the number of procs in the entire job!
This commit was SVN r25008.
2011-08-08 04:09:40 +00:00
Ralph Castain
590ac70e88 Add a simple test program for error string output
This commit was SVN r25007.
2011-08-07 21:32:25 +00:00
Ralph Castain
8b3c562b84 Adjust verbosity levels to make it easier to debug at scale
This commit was SVN r25006.
2011-08-07 21:14:21 +00:00
Ralph Castain
2418831bea Pass the nodelist to the aprun command even when using all nodes
This commit was SVN r25004.
2011-08-06 04:19:41 +00:00
Ralph Castain
bd8e43a2de Correct debug output so it doesn't falsely report the module
This commit was SVN r25003.
2011-08-05 20:30:34 +00:00
Ralph Castain
d603c79ab4 Fix the FAILED_TO_START scenario so orted doesn't segfault
This commit was SVN r25002.
2011-08-05 20:29:50 +00:00
Ralph Castain
c86bfb4e90 Need to copy the string
This commit was SVN r25001.
2011-08-05 19:03:28 +00:00
Ralph Castain
7b307d5bf0 Cleanup handling of all-numerical node names
This commit was SVN r25000.
2011-08-05 14:59:14 +00:00
Ralph Castain
157bad5435 If we can't compress the name, that's fine - but still have to move to next posn
This commit was SVN r24999.
2011-08-05 14:43:36 +00:00
Ralph Castain
3199663613 Correctly handle the case of mixes of character-based names and all-number names
This commit was SVN r24998.
2011-08-05 14:37:36 +00:00
Ralph Castain
066022126e Sort the nodes to be in numerically increasing order so the regex has a chance of working right.
This commit was SVN r24993.
2011-08-05 03:37:13 +00:00
Ralph Castain
5a634caad9 Cleanly handle the case where the node "name" is just a number, and avoid the N-N output when the number is not part of a sequence.
This commit was SVN r24992.
2011-08-05 03:36:30 +00:00
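
The node-name commits r24992 and r24993 describe two behaviours: sort all-numeric names in increasing order, and avoid printing "N-N" for a number that is not part of a sequence. The sketch below shows that idea in isolation. It is not the ORTE regex code; the function name `compress_numeric_names()` and the comma-separated output format are assumptions made for illustration.

```python
def compress_numeric_names(names):
    """Collapse all-numeric node names into sorted ranges.

    Assumption: every name is a plain decimal string. Leading zeros
    are not preserved in this sketch.
    """
    # Sort numerically (not lexically) and drop duplicates.
    numbers = sorted({int(n) for n in names})

    ranges = []  # list of [low, high] pairs
    for n in numbers:
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n       # extend the current run
        else:
            ranges.append([n, n])   # start a new run

    # A lone number is printed as "N", never as "N-N".
    return ",".join(
        str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges
    )


result = compress_numeric_names(["10", "3", "2", "1", "7"])
```

Sorting first matters because consecutive runs can only be detected when neighbouring values are adjacent in the list.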
Jeff Squyres
294e1f50cd Remove compiler warning about nested comment
This commit was SVN r24984.
2011-08-03 18:30:56 +00:00
Wesley Bland
87a96da99c Should fix some of the shutdown woes of the errmgr.
Correctly checks that the orted's job is completed.
Correctly tests to make sure that there is shutdown going on (doesn't rely on orte_orteds_term_ordered).
Adds a patch from Ralph to correctly check the status of processes.

This commit was SVN r24962.
2011-08-01 14:00:41 +00:00
Ralph Castain
42b125ef35 Move the debug so it more accurately reports
This commit was SVN r24961.
2011-07-29 20:48:46 +00:00
Ralph Castain
70bca4691f Add a new "sensor" module that supports fault tolerance tests - randomly kills local procs and/or the daemon itself
This commit was SVN r24960.
2011-07-29 20:48:22 +00:00
Wesley Bland
5fde3e0e00 Move the resilient orte errmgr code into a separate errmgr for now while it's
still unstable. Reverted errmgr modules back to the original errmgr (with the
updates since the resilient code was brought into the trunk).

This commit was SVN r24958.
2011-07-28 21:24:34 +00:00
Ralph Castain
6c879f87fb Add a new param "orte_remote_tmpdir_base" for those situations where the compute nodes require a different session directory head than the head node.
This commit was SVN r24956.
2011-07-27 19:37:17 +00:00
Ralph Castain
decab98fb2 Do a little better job of catching up on missed mcast messages, and provide a way out of scenarios where catch-up is impossible.
This commit was SVN r24955.
2011-07-27 14:58:30 +00:00
Ralph Castain
c3bc33b3fb Don't be so restrictive - accept "slots" as well as "slot" in rank file
This commit was SVN r24954.
2011-07-27 00:45:30 +00:00
Wesley Bland
b972fd84e1 No longer sends extra FAILED_NOTIFICATION messages in the non-failure case.
Should reduce finalize complexity and avoid a race condition that has been
detected by a few users.

This commit was SVN r24952.
2011-07-26 20:47:44 +00:00
Ralph Castain
715f871605 Ignore the daemon job when reporting parseable output
This commit was SVN r24944.
2011-07-25 20:44:08 +00:00
Ralph Castain
db193555c2 Use non-blocking sends for recovering from lost multicast messages
This commit was SVN r24943.
2011-07-25 18:49:47 +00:00
Ralph Castain
199804fc35 complete implementation of parseable output
This commit was SVN r24929.
2011-07-23 22:23:24 +00:00
Ralph Castain
ffe6f5f40e Fix map pack/unpack so they match
This commit was SVN r24928.
2011-07-23 22:23:05 +00:00
Ralph Castain
00647fa342 Update orte-ps to add parseable output - not fully tested because I couldn't get other parts of the system to work.
This commit was SVN r24927.
2011-07-23 20:20:31 +00:00
Ralph Castain
869024f1c6 You have to initialize the daemon param -before- using it to get epoch!!
This commit was SVN r24926.
2011-07-23 20:19:43 +00:00
Ralph Castain
361bcef253 Close multicast before rml
This commit was SVN r24925.
2011-07-23 20:19:15 +00:00
Shiqing Fan
cc4403a863 Remove two unused windows files.
This commit was SVN r24913.
2011-07-21 12:53:32 +00:00
Brian Barrett
3bd66a5932 * Remove unused Portals3.3 reference implementation support
This commit was SVN r24906.
2011-07-20 23:30:29 +00:00
Eugene Loh
921852e1e5 Clean up the computations of num_procs_alive. Do some code
refactoring to improve readability and to compute num_procs_alive
correctly and to remove the use of loop iteration variables for
two loops nested one inside another (causing MPI_Comm_spawn_multiple
to fail).

This commit was SVN r24903.
2011-07-14 20:10:48 +00:00
Ralph Castain
8853e0e80a Fix regular expression analyzer for slurmd - use a slurm-specific version
Fix multi-node routing for daemon startup when static ports are not set

This commit was SVN r24898.
2011-07-13 22:49:56 +00:00
Ralph Castain
8d1b31b887 Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly.
Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list.

This commit was SVN r24894.
2011-07-13 20:11:14 +00:00
Ralph Castain
1405bacd85 Ensure we don't segfault if we report an error
This commit was SVN r24890.
2011-07-13 15:00:22 +00:00