Ralph Castain
648c85b41b
Add a simple pattern mapper as an example of how to use the topology info to create desired mappings. Let the user specify a pattern based on resource types, and map that pattern across all available nodes as resources permit.
...
Don't automatically display the topology for each node when --display-devel-map is set as it can overwhelm the reader. Use a separate flag --display-topo to get it.
This commit was SVN r25396.
2011-10-29 15:12:45 +00:00
Ralph Castain
12a589130a
Add some debug
...
This commit was SVN r25395.
2011-10-29 15:07:58 +00:00
Ralph Castain
965b04d1a5
Use the new utilities to get a topology that reflects available cpus
...
This commit was SVN r25394.
2011-10-29 15:07:36 +00:00
Ralph Castain
e50bcbf028
Add the ability to specify a topology-containing xml file to describe the simulated nodes to support mapping tests against arbitrary topologies
...
This commit was SVN r25388.
2011-10-29 02:01:11 +00:00
Ralph Castain
7fa5f82d70
Add simulator component to support testing of large scale mapping methods. Automatically sets do-not-resolve and do-not-launch, and creates however many nodes the user wants to simulate in the system.
...
This commit was SVN r25386.
2011-10-28 23:48:53 +00:00
Ralph Castain
e2eb8d5f78
Remove bad param registration - that param was already registered as an int_name in another location.
...
This commit was SVN r25381.
2011-10-28 19:14:43 +00:00
Josh Hursey
6726590b1c
Remove the 'ess_node_rank' accessor from here. This caused running under 'tm' to segv at the orteds.
...
It just looks like this part of the component was not updated during r25331. It was removed from the 'env' and 'slurm' environments in that patch. It looks like 'tm' was updated, but did not get this particular piece.
This commit was SVN r25380.
The following SVN revision numbers were found above:
r25331 --> open-mpi/ompi@b44f8d4b28
2011-10-28 17:41:35 +00:00
Josh Hursey
59ff1dbbfb
Fix indentation problem that caused a segv when running without regex.
...
This was introduced in r25063.
This commit was SVN r25379.
The following SVN revision numbers were found above:
r25063 --> open-mpi/ompi@e58623cd5b
2011-10-28 13:39:32 +00:00
Samuel Gutierrez
922e41a318
fix typo. use PMI_Initialized for init status instead of PMI_Init.
...
This commit was SVN r25377.
2011-10-27 22:27:30 +00:00
Ralph Castain
951d72692c
Reverse the #if direction so we report daemon failure to the errmgr - otherwise, we just hang if a daemon fails to start.
...
Reviewed with Josh.
This commit was SVN r25366.
2011-10-25 19:09:52 +00:00
Ralph Castain
c55cba55a7
Totally trivial spelling fix
...
This commit was SVN r25361.
2011-10-24 14:06:33 +00:00
Ralph Castain
955d8e7d46
Allow apps to use pmi when launched by mpirun, if desired, without affecting daemons
...
This commit was SVN r25359.
2011-10-23 15:57:13 +00:00
Nathan Hjelm
e8af0d8589
don't use alps paffinity
...
This commit was SVN r25358.
2011-10-21 22:52:03 +00:00
Abhishek Kulkarni
46952e9008
Fix C/R functionality in trunk. Intra-node checkpointing of a job now works as expected.
...
Signed-off-by: Abhishek Kulkarni <adkulkar@osl.iu.edu>
This commit was SVN r25357.
2011-10-21 22:07:35 +00:00
Nathan Hjelm
7b1172b346
need a terminating character in the decoded string
...
This commit was SVN r25355.
2011-10-21 16:46:28 +00:00
Nathan Hjelm
cd257ac707
fixed typo in pmi grpcomm
...
This commit was SVN r25353.
2011-10-21 16:28:36 +00:00
Shiqing Fan
5711414eb7
Fix Windows build
...
This commit was SVN r25351.
2011-10-21 14:46:58 +00:00
Ralph Castain
53ef085567
Fix a minor issue seen by Jeff in specific failure pathway
...
This commit was SVN r25350.
2011-10-21 14:44:48 +00:00
Ralph Castain
3e72fccacf
Cray's PMI implementation is quite different from slurm's - they extended PMI-1 by adding some, but not all, of the PMI-2 APIs. So you can't just switch to using PMI-2 functions as it isn't a complete implementation. Instead, you have to selectively figure out which ones they have in PMI-2, and use any missing ones from PMI-1. What fun.
...
Modify the configure logic and the PMI components to accommodate Cray's approach. Refactor the PMI error reporting code so it resides in only one place. Cray actually decided -not- to define the PMI-2 error codes, so we have to use the PMI-1 codes instead. More fun.
This commit was SVN r25348.
2011-10-21 04:54:38 +00:00
Nathan Hjelm
beb8d8ce32
pmi return code wtf
...
This commit was SVN r25336.
2011-10-20 17:51:24 +00:00
Ralph Castain
84713d5a84
Fix singletons again - must have been broken for a very long time, which only shows how little anyone cares about this capability.
...
This commit was SVN r25332.
2011-10-19 20:19:08 +00:00
Ralph Castain
b44f8d4b28
Complete implementation of the ess.proc_get_locality API. Up to this point, the API was only capable of telling if the specified proc was sharing a node with you. However, the returned value was capable of telling you much more detailed info - e.g., if the proc shares a socket, a cache, or numa node. We just didn't have the data to provide that detail.
...
Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times.
Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h
Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun.
This commit was SVN r25331.
2011-10-19 20:18:14 +00:00
Ralph Castain
2958f3de34
Add some clarifying comments and a small efficiency improvement
...
This commit was SVN r25322.
2011-10-18 18:30:43 +00:00
Ralph Castain
b771114086
Fix the fix :-)
...
If the errmgr is going to try and hold the orted until all routes and children are gone, then the exit cmd must do the same. Otherwise, the orted exits immediately without waiting for routes to be dismantled, which is why we don't see the connections close.
Also cleanup some diagnostics and add some debug to more clearly see what's going on.
This commit was SVN r25321.
2011-10-18 17:56:37 +00:00
Ralph Castain
ae8e556d14
Okay, once again let's fix the vpid calculator. Identified problem with prior commit (some rmaps components already place their procs in the jdata->procs array, and others don't), so account for those variations.
...
This commit was SVN r25315.
2011-10-18 15:50:11 +00:00
George Bosilca
749b63c09d
Provide a generic fix for the termination issue instead of r25248. The
...
termination condition is to be checked at the daemon/HNP level not down
in the routing.
This commit was SVN r25313.
The following SVN revision numbers were found above:
r25248 --> open-mpi/ompi@b42ccc89b8
2011-10-18 03:07:37 +00:00
George Bosilca
f28890fbb7
Revert r25302 as it break the --bynode option.
...
This commit was SVN r25311.
The following SVN revision numbers were found above:
r25302 --> open-mpi/ompi@d7a8553179
2011-10-18 02:48:17 +00:00
Ralph Castain
2fdd9c6dea
Ensure mpirun doesn't pick this component
...
This commit was SVN r25307.
2011-10-17 22:28:28 +00:00
Ralph Castain
8f0ef54130
Complete implementation of pmi support. Ensure we support both mpirun and direct launch within same configuration to avoid requiring separate builds. Add support for generic pmi, not just under slurm. Add publish/subscribe support, although slurm's pmi implementation will just return an error as it hasn't been done yet.
...
This commit was SVN r25303.
2011-10-17 20:51:22 +00:00
Ralph Castain
d7a8553179
Fix the mapping algo for computing vpids - it was borked for bynode operations when using nperxxx directives
...
This commit was SVN r25302.
2011-10-17 19:49:04 +00:00
Ralph Castain
f1a5a26ba0
Minor cleanups
...
This commit was SVN r25289.
2011-10-14 18:46:03 +00:00
Ralph Castain
89a20de474
Remove unused includes. Ensure that the error log is at least always available as we otherwise segfault when reporting errors that occur prior to opening the errmgr framework
...
This commit was SVN r25288.
2011-10-14 18:45:11 +00:00
Ralph Castain
07dbbc6513
Sorry for mid-day correction - but folks are trying to test this, and we didn't realize it was still ignored :-(
...
This commit was SVN r25287.
2011-10-14 16:19:20 +00:00
Ralph Castain
7bb294f917
Fix debug flags - thanks Terry!
...
This commit was SVN r25286.
2011-10-14 16:10:21 +00:00
Ralph Castain
054c485dcf
Cleanup a race condition and an unreliable method that caused us to not properly handle procs that trapped sigterm for cleanup purposes while ORTE was trying to kill them. Thanks to Rick Payne and Ian Wells of Cisco for spending weeks chasing this down.
...
Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-(
This commit was SVN r25285.
2011-10-14 15:39:54 +00:00
Ralph Castain
08fa9e1c6a
Correct include path
...
This commit was SVN r25282.
2011-10-13 23:46:52 +00:00
Ralph Castain
b96ef2161d
Complete the PMI support. Generalize PMI operations to support both slurm and non-slurm environments. Correct some configuration issues - we really only want the PMI integration at the individual component level. Ensure that the pmi grpcomm component doesn't get selected when launching via mpirun by setting its priority below the bad component.
...
Only verified in a slurm environment as that's all I have access to...
This commit was SVN r25275.
2011-10-12 20:59:25 +00:00
Ralph Castain
634f83fc52
Fix the routed components. All had errors, some completely broken. You cannot test
...
0 == ORTE_EPOCH_CMP(target->epoch,ORTE_EPOCH_INVALID)
when epoch is not configured as this will always return true. This caused get_route to return an error in all non-binomial routed modules, and caused all components to return an error when delete_route was called.
So protect the checks with ORTE_ENABLE_EPOCH so we get the correct behavior.
This commit was SVN r25274.
2011-10-12 20:18:57 +00:00
Ralph Castain
24a46f2acb
These were missed by prior commit - need to remove lingering references to OPAL_HWLOC_HAVE_XML
...
This commit was SVN r25272.
2011-10-12 16:54:03 +00:00
George Bosilca
872d377021
Tell what the update status is.
...
This commit was SVN r25259.
2011-10-11 19:49:12 +00:00
Brian Barrett
98e98ce2c5
* opal_atomic_trylock is documented to return 0 if the lock was acquired,
...
1 otherwise. It was doing the opposite, so this patch fixes the
return values. All uses (all in ORTE) used the actual return values,
not the documented values, so fix them as well.
This commit was SVN r25257.
2011-10-11 18:43:45 +00:00
Ralph Castain
2f38ff5e54
Ensure we don't try to build this module unless pmi is specifically requested
...
This commit was SVN r25252.
2011-10-11 06:12:04 +00:00
Ralph Castain
baefdabd98
Add some debug. Now confirmed to work correctly (prior problem was with odin tcp connection, not code).
...
This commit was SVN r25249.
2011-10-11 02:15:17 +00:00
Ralph Castain
b42ccc89b8
Although this didn't solve the earlier termination problem, the code will be required once we get connection terminations properly detected. If a daemon (or HNP) is trying to terminate, then we need to check for termination conditions whenever a route is lost - when all child connections are gone, then we are free to finalize.
...
This commit was SVN r25248.
2011-10-10 21:41:49 +00:00
Ralph Castain
1aa1c2e9b4
Get the slurm pmi support working. Cannot use infiniband, of course, as the oob can't make the connection - may try other existing methods. Modex may not quite be working right yet
...
as odin was having trouble making TCP connections, but at least the configure now works so things build, so save that for now
This commit was SVN r25247.
2011-10-10 21:39:10 +00:00
Swen Boehm
08b4322a1a
patched the lex files to not issue the following compiler warning:
...
'yyunput' defined but not used
This commit was SVN r25246.
2011-10-10 18:13:04 +00:00
Ralph Castain
f1a3a35fcd
Cannot rely on detection of connection terminations for deciding when to exit as they don't always go away immediately. There is no info coming back anyway, so it's okay to just exit once the relay has been sent. The relay is sent via a blocking API, so just go ahead and quit.
...
This commit was SVN r25245.
2011-10-10 16:38:46 +00:00
George Bosilca
649af6c925
Enumerated mixed with another type (int) is tolerated but
...
easily fixable.
This commit was SVN r25241.
2011-10-09 03:54:52 +00:00
Terry Dontje
c6691b4122
clean up local procs when abort or abort signal happens
...
This commit was SVN r25237.
2011-10-06 19:19:55 +00:00
Nathan Hjelm
79b14fc3b1
removed licensing warning
...
This commit was SVN r25235.
2011-10-05 20:31:27 +00:00
Nathan Hjelm
34afb5a0fa
first cut at general pmi check
...
This commit was SVN r25234.
2011-10-05 17:14:24 +00:00
George Bosilca
80c02647c8
Each level (OPAL/ORTE/OMPI) should only return it's own constants,
...
instead of the current mismatch.
This commit was SVN r25230.
2011-10-04 14:50:31 +00:00
George Bosilca
c6d6c9aece
Remove some #if by using the correct macro (aka. ORTE_EPOCH_CMP).
...
This commit was SVN r25229.
2011-10-04 14:42:40 +00:00
Samuel Gutierrez
25cbf79592
modifications to ras alps. this commit allows users to mpirun without having to set id environment variables (BASIL_RESERVATION_ID, OMPI_ALPS_RESID). note, however, that we preserved the old behavior. if an id environment variable is set, it will be obeyed and our new code path is essentially bypassed. if we missed something, please yell at us. with this commit, the use of ras-alps-command.sh is no longer needed... at least that is our hope.
...
This commit was SVN r25181.
2011-09-26 21:31:08 +00:00
Ralph Castain
8347385630
Fix the radix routed component.
...
This commit was SVN r25175.
2011-09-22 09:32:53 +00:00
Jeff Squyres
ecd603256a
* Rename opal_hwloc_components to opal_hwloc_base_components
...
* Fix some comments
This commit was SVN r25150.
2011-09-17 11:54:36 +00:00
Ralph Castain
1cd7b02df3
Add a set of default errmgr components that support solely the default "everything dies on error" behavior. Set their priority to be selected by default, but provide params to adjust those priorities to allow other component selection.
...
This commit was SVN r25139.
2011-09-13 22:03:45 +00:00
Ralph Castain
3c4f04f4d9
Ensure opal_hwloc_topology is NULL after being destroyed
...
This commit was SVN r25138.
2011-09-13 19:21:10 +00:00
Nathan Hjelm
079ccdf8b1
fix debugger co-location launching
...
This commit was SVN r25136.
2011-09-13 15:08:03 +00:00
Ralph Castain
ca7638553f
Remove stale code
...
This commit was SVN r25133.
2011-09-12 23:00:41 +00:00
Ralph Castain
556a05566e
Silence warning
...
This commit was SVN r25130.
2011-09-12 16:21:51 +00:00
Shiqing Fan
0aea775837
Set the compiler flags in a better way.
...
This commit was SVN r25125.
2011-09-12 08:24:27 +00:00
Ralph Castain
92c7372e20
Per the RFC from Jeff, move hwloc from opal/mca/common to its own static framework ala libevent. Have ORTE daemons collect the topology info at startup and, if --enable-hwloc-xml is set, send that info back to the HNP for later use. The HNP only retains unique topology "templates" to reduce memory footprint. Have the daemon include the local topology info in the nidmap buffer sent to each app so the apps don't all hammer the local system to discover it for themselves.
...
Remove the sysinfo framework as hwloc replaces that functionality.
This commit was SVN r25124.
2011-09-11 19:02:24 +00:00
Ralph Castain
2091e39bee
Record the file descriptor on the read event when building optimized
...
This commit was SVN r25123.
2011-09-11 18:57:14 +00:00
Rainer Keller
9d5afc58c6
- Fix breakage of the epoch changes with PGI:
...
Don't juse include pre-processor macros between two strins ("s1" #if 0 ... "s2")...
Rather print out the epoch as 0 always...
This commit was SVN r25110.
2011-08-31 08:40:31 +00:00
Wesley Bland
f8740e5478
Correct a typo reported by Pasha.
...
This commit was SVN r25109.
2011-08-30 18:44:52 +00:00
Ralph Castain
03ddf8520b
Resolve not-used warnings
...
This commit was SVN r25101.
2011-08-27 14:27:15 +00:00
Ralph Castain
56ebfa23cc
ORTE configure options belong in orte/config, not opal.
...
This commit was SVN r25100.
2011-08-27 14:23:49 +00:00
Wesley Bland
f542ecd578
Fix a couple of problems with the resil code not compiling.
...
This commit was SVN r25099.
2011-08-27 03:21:00 +00:00
George Bosilca
a4245b8d63
Remove some warnings related to the resilience patch.
...
This commit was SVN r25097.
2011-08-27 00:15:34 +00:00
George Bosilca
67ec5a0556
While doing cleanup remove pending warnings.
...
This commit was SVN r25095.
2011-08-26 23:38:06 +00:00
George Bosilca
fc1184c41f
Remove a warnings introduced by the epoch commit.
...
This commit was SVN r25094.
2011-08-26 23:36:52 +00:00
Wesley Bland
4e7ff0bd5e
By popular demand the epoch code is now disabled by default.
...
To enable the epochs and the resilient orte code, use the configure flag:
--enable-resilient-orte
This will define both:
ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE
This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Ralph Castain
1c08a4006c
Refactor some code to remove a few API handles from errmgr. Reviewed/tested by Wes.
...
This commit was SVN r25064.
2011-08-18 16:24:45 +00:00
Ralph Castain
e58623cd5b
Bring alps back to full operations by correctly computing daemon names. Unfortunately, alps doesn't assign cnos rank in node-based order - i.e., cnos rank=0 isn't necessarily on the first node of the execution. So adjust when using static ports.
...
Add some debug to nidmap
Ensure that the HNP's node name is not included in the regex when launching via rshbase as that node is automatically included in the daemon map.
This commit was SVN r25063.
2011-08-18 14:59:18 +00:00
Wesley Bland
a2a20c3766
I believe this should fix the race condition that Terry is seeing in the MTT
...
tests. It appears that nothing in the errmgr was using the mutexes to protect
the odls child list.
This commit was SVN r25062.
2011-08-18 14:52:30 +00:00
Shiqing Fan
6d0ab9bd6c
One library was missing for linking orterun on Windows.
...
This commit was SVN r25057.
2011-08-18 09:33:41 +00:00
Ralph Castain
23f47295a8
Add even more debug
...
This commit was SVN r25053.
2011-08-16 16:41:33 +00:00
Ralph Castain
d624d43f69
Add more debug
...
This commit was SVN r25052.
2011-08-16 15:47:37 +00:00
Ralph Castain
3d96497581
Add debug
...
This commit was SVN r25050.
2011-08-16 12:22:05 +00:00
Shiqing Fan
7292ee2387
One .windows file is missing in the tarball.
...
This commit was SVN r25049.
2011-08-15 10:21:25 +00:00
Shiqing Fan
3af7c9f7bb
Complete the MinGW build support on Windows.
...
This commit was SVN r25048.
2011-08-15 09:47:23 +00:00
Shiqing Fan
627f1dd351
Correct several export declarations.
...
This commit was SVN r25047.
2011-08-15 09:45:51 +00:00
Ralph Castain
ca3d29a1e6
Extend regex support to a bigger audience
...
This commit was SVN r25046.
2011-08-12 21:02:48 +00:00
Ralph Castain
ea4e2c2db4
Unused variables
...
This commit was SVN r25045.
2011-08-12 21:02:09 +00:00
Jeff Squyres
1cbfb53801
r24976 wasn't quite right -- you now actually get a warning if you
...
specify btl_tcp_if_include because btl_tcp_if_exclude is defaulted to
the loopback devices.
This commit does a few things:
* Introduce a new OPAL MCA base function:
mca_base_param_check_exclusive_string(). It checks to see that the
''user'' does not set two MCA parameters that are mutually
exclusive by checking the source of those MCS param values.
* Use the above function in many BTLs (and the OOB TCP) to ensure
that <foo>_if_include and <foo>_if_exclude are not both specified
''by the user''.
* Re-arrange many of these BTLs to move their MCA registration code
into a separate component_register() function (vs. the
component_open() function).
This code has been nominally reviewed and checked by Ralph, George,
Terry, and Shiqing.
This commit was SVN r25043.
The following SVN revision numbers were found above:
r24976 --> open-mpi/ompi@8f4ac54336
2011-08-10 17:24:36 +00:00
Ralph Castain
b360c98afd
Per request from Pasha, revert r25004 - but modified a touch to reflect fact that opal_argv_append copies the provided string, so we don't need to print it and then free it.
...
This commit was SVN r25037.
The following SVN revision numbers were found above:
r25004 --> open-mpi/ompi@2418831bea
2011-08-09 22:42:27 +00:00
Nathan Hjelm
aa3d302a05
use persistent rml_recv in iof
...
This commit was SVN r25035.
2011-08-09 21:30:12 +00:00
Ralph Castain
f1951e7ccd
If we are abnormally terminating, then don't wait for orteds to report back. Send them a "halt_vm" command, which instructs them to kill their local procs and immediately terminate, doing their best to cleanup on the way out.
...
Also do a little cleanup on debug output in rshbase.
This commit was SVN r25033.
2011-08-09 17:42:19 +00:00
Wesley Bland
67feeb6aca
Move the errmgr code back. This shouldn't cause the svn problems that I
...
apparently caused last time. Sorry about that. This one will just be a big
changelog.
This commit was SVN r25016.
2011-08-08 16:01:08 +00:00
Wesley Bland
09274cd047
Make sure that the epoch is initialized everywhere so we don't get weird output
...
during valgrind. This shouldn't have caused any problems with any actual
execution. Just extra warnings in valgrind.
This commit was SVN r25015.
2011-08-08 15:11:55 +00:00
Ralph Castain
8014e3429e
Don't double-count procs as they are launched
...
This commit was SVN r25011.
2011-08-08 06:05:23 +00:00
Ralph Castain
7b9f958dcf
Add some missing error strings. Update test to show silent errors
...
This commit was SVN r25010.
2011-08-08 04:21:02 +00:00
Ralph Castain
4083dc617f
Fix computation of number of required files and file descriptors - it only depends on the total number of local procs, not on the number of procs in the entire job!
...
This commit was SVN r25008.
2011-08-08 04:09:40 +00:00
Ralph Castain
590ac70e88
Add a simple test program for error string output
...
This commit was SVN r25007.
2011-08-07 21:32:25 +00:00
Ralph Castain
8b3c562b84
Adjust verbosity levels to make it easier to debug at scale
...
This commit was SVN r25006.
2011-08-07 21:14:21 +00:00
Ralph Castain
2418831bea
Pass the nodelist to the aprun command even when using all nodes
...
This commit was SVN r25004.
2011-08-06 04:19:41 +00:00
Ralph Castain
bd8e43a2de
Correct debug output so it doesn't falsely report the module
...
This commit was SVN r25003.
2011-08-05 20:30:34 +00:00
Ralph Castain
d603c79ab4
Fix the FAILED_TO_START scenario so orted doesn't segfault
...
This commit was SVN r25002.
2011-08-05 20:29:50 +00:00
Ralph Castain
c86bfb4e90
Need to copy the string
...
This commit was SVN r25001.
2011-08-05 19:03:28 +00:00
Ralph Castain
7b307d5bf0
Cleanup handling of all-numerical node names
...
This commit was SVN r25000.
2011-08-05 14:59:14 +00:00
Ralph Castain
157bad5435
If we can't compress the name, that's fine - but still have to move to next posn
...
This commit was SVN r24999.
2011-08-05 14:43:36 +00:00
Ralph Castain
3199663613
Correctly handle the case of mixes of character-based names and all-number names
...
This commit was SVN r24998.
2011-08-05 14:37:36 +00:00
Ralph Castain
066022126e
Sort the nodes to be in numerically increasing order so the regex has a chance of working right.
...
This commit was SVN r24993.
2011-08-05 03:37:13 +00:00
Ralph Castain
5a634caad9
Cleanly handle the case where the node "name" is just a number, and avoid the N-N output when the number is not part of a sequence.
...
This commit was SVN r24992.
2011-08-05 03:36:30 +00:00
Jeff Squyres
294e1f50cd
Remove compiler warning about nested comment
...
This commit was SVN r24984.
2011-08-03 18:30:56 +00:00
Wesley Bland
87a96da99c
Should fix some of the shutdown woes of the errmgr.
...
Correctly checks that the orted's job is completed.
Correctly tests to make sure that there is shutdown going on (doesn't rely on orte_orteds_term_ordered).
Adds a patch from Ralph to correctdly check the status of processes.
This commit was SVN r24962.
2011-08-01 14:00:41 +00:00
Ralph Castain
42b125ef35
Move the debug so it more accurately reports
...
This commit was SVN r24961.
2011-07-29 20:48:46 +00:00
Ralph Castain
70bca4691f
Add a new "sensor" module that supports fault tolerance tests - randomly kills local procs and/or the daemon itself
...
This commit was SVN r24960.
2011-07-29 20:48:22 +00:00
Wesley Bland
5fde3e0e00
Move the resilient orte errmgr code into a seperate errmgr for now while it's
...
still unstable. Reverted errmgr modules back to the original errmgr (with the
updates since the resilient code was brought into the trunk).
This commit was SVN r24958.
2011-07-28 21:24:34 +00:00
Ralph Castain
6c879f87fb
Add a new param "orte_remote_tmpdir_base" for those situations where the compute nodes require a different session directory head than the head node.
...
This commit was SVN r24956.
2011-07-27 19:37:17 +00:00
Ralph Castain
decab98fb2
Do a little better job of catching up on missed mcast messages, and provide a way out of scenarios where catch-up is impossible.
...
This commit was SVN r24955.
2011-07-27 14:58:30 +00:00
Ralph Castain
c3bc33b3fb
Don't be so restrictive - accept "slots" as well as "slot" in rank file
...
This commit was SVN r24954.
2011-07-27 00:45:30 +00:00
Wesley Bland
b972fd84e1
No longer sends extra FAILED_NOTIFICATION messages in the non-failure case.
...
Should reduce finalize complexity and avoid a race condition that has been
detected by a few users.
This commit was SVN r24952.
2011-07-26 20:47:44 +00:00
Ralph Castain
715f871605
Ignore the daemon job when reporting parseable output
...
This commit was SVN r24944.
2011-07-25 20:44:08 +00:00
Ralph Castain
db193555c2
Use non-blocking sends for recovering from lost multicast messages
...
This commit was SVN r24943.
2011-07-25 18:49:47 +00:00
Ralph Castain
199804fc35
complete implementation of parseable output
...
This commit was SVN r24929.
2011-07-23 22:23:24 +00:00
Ralph Castain
ffe6f5f40e
Fix map pack/unpack so they match
...
This commit was SVN r24928.
2011-07-23 22:23:05 +00:00
Ralph Castain
00647fa342
Update orte-ps to add parseable output - not fully tested because I couldn't get other parts of the system to work.
...
This commit was SVN r24927.
2011-07-23 20:20:31 +00:00
Ralph Castain
869024f1c6
You have to initialize th daemon param -before- using it to get epoch!!
...
This commit was SVN r24926.
2011-07-23 20:19:43 +00:00
Ralph Castain
361bcef253
Close multicast before rml
...
This commit was SVN r24925.
2011-07-23 20:19:15 +00:00
Shiqing Fan
cc4403a863
Remove two unused windows files.
...
This commit was SVN r24913.
2011-07-21 12:53:32 +00:00
Brian Barrett
3bd66a5932
* Remove unused Portals3.3 reference implementation support
...
This commit was SVN r24906.
2011-07-20 23:30:29 +00:00
Eugene Loh
921852e1e5
Clean up the computations of num_procs_alive. Do some code
...
refactoring to improve readability and to compute num_procs_alive
correctly and to remove the use of loop iteration variables for
two loops nested one inside another (causing MPI_Comm_spawn_multiple
to fail).
This commit was SVN r24903.
2011-07-14 20:10:48 +00:00
Ralph Castain
8853e0e80a
Fix regular expression analyzer for slurmd - use a slurm-specific version
...
Fix multi-node routing for daemon startup when static ports are not set
This commit was SVN r24898.
2011-07-13 22:49:56 +00:00
Ralph Castain
8d1b31b887
Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly.
...
Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list.
This commit was SVN r24894.
2011-07-13 20:11:14 +00:00
Ralph Castain
1405bacd85
Ensure we dont segfault if we report an error
...
This commit was SVN r24890.
2011-07-13 15:00:22 +00:00
Jeff Squyres
3893a5a1de
Fix compile error introduced in r24888.
...
This commit was SVN r24889.
The following SVN revision numbers were found above:
r24888 --> open-mpi/ompi@e5253647ea
2011-07-13 14:18:00 +00:00
Shiqing Fan
e5253647ea
Fix a type cast.
...
This commit was SVN r24888.
2011-07-13 09:00:17 +00:00
Ralph Castain
1ad110d2e9
After a nice, calm, rational discussion between Brian, Jeff, and myself, we decided to revert r24864 and r24862 to restore the reference counters in opal_init/finalize. The rationale was that we should instead change orte_init/finalize to also use reference counters to support multi-embedded libraries. Jeff and Brian will discuss proposing a similar change to mpi_init/finalize to the MPI Forum so that all three libraries will behave in similar manners.
...
It was agreed that opal_init_util had wound up being used in unintended ways, which raised the problem of getting reference counts to work right. However, fixing it would involve more pain than it was worth - and so long as the other layers are made to behave similarly, I have no preference either way.
Complete implementation will follow - for now, this just reverts the prior changes.
This commit was SVN r24886.
The following SVN revision numbers were found above:
r24862 --> open-mpi/ompi@aa92e0c4eb
r24864 --> open-mpi/ompi@a5062385c2
2011-07-12 17:07:41 +00:00
Nathan Hjelm
3f4e5d7dd6
add missing thread lock/unlock around condition_broadcast
...
This commit was SVN r24885.
2011-07-12 15:43:56 +00:00
Nathan Hjelm
c3ec2e2614
fix a potential race condition in rml
...
This commit was SVN r24884.
2011-07-12 15:43:12 +00:00
Abhishek Kulkarni
6bf02d1344
Fixes (after the new ORTE resiliency layer was merged) to make the trunk build with C/R flags turned on.
...
This commit was SVN r24872.
2011-07-10 23:36:26 +00:00
Ralph Castain
a5062385c2
Fix singletons
...
This commit was SVN r24864.
2011-07-08 14:38:33 +00:00
Jeff Squyres
3affb8403e
Remove extra output
...
This commit was SVN r24863.
2011-07-08 13:01:30 +00:00
Ralph Castain
aa92e0c4eb
Replace a useless counter with a boolean check to see if we have already passed thru opal_finalize so we don't call finalize, and then don't pass thru it (as was happening on several tools)
...
This commit was SVN r24862.
2011-07-08 06:43:19 +00:00
Ralph Castain
05f4926bfe
Remove some remaining cruft re regular expressions - caused the trunk to fail if regex wasn't being used
...
This commit was SVN r24861.
2011-07-08 06:42:12 +00:00
Ralph Castain
1ee7c39982
Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
...
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.
Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.
This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
Ralph Castain
6496b2f845
Ensure we terminate properly on non-zero exit status
...
This commit was SVN r24859.
2011-07-07 14:33:49 +00:00
Wesley Bland
0628963506
Fix a return code when a process isn't found.
...
This commit was SVN r24845.
2011-06-30 15:22:54 +00:00
Brian Barrett
e52fef28ca
Do something rational for the disable full support case
...
This commit was SVN r24844.
2011-06-30 14:48:19 +00:00
Ralph Castain
8ac35a8496
Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate.
...
This commit was SVN r24843.
2011-06-30 14:11:56 +00:00
Ralph Castain
c449871ade
Add an mca param to set the "fork agent" - i.e., a program to be run when forking off a process (e.g., valgrind). While you could specify this by "mpirun -n N fork_agent ./my_app", not everyone launches procs with ORTE from mpirun.
...
Provide the ability to store recent stat histories using the ring_buffer class
This commit was SVN r24842.
2011-06-30 03:12:38 +00:00
Ralph Castain
2e1fa3e08e
Don't error out if the recv.cancel comes back not found as this is just a race condition
...
This commit was SVN r24841.
2011-06-30 01:19:50 +00:00
Ralph Castain
418229c71c
Define a new error constant
...
This commit was SVN r24833.
2011-06-28 19:47:16 +00:00
Wesley Bland
84be81df95
Standardize the initialization of the EPOCH's.
...
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epoch to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.
This commit was SVN r24829.
2011-06-28 14:20:33 +00:00
Ralph Castain
c203eee223
Since process names now have three fields, be sure to initialize all three of them
...
This commit was SVN r24828.
2011-06-27 20:50:08 +00:00
Ralph Castain
d316701d3c
Remove unnecessary mcast channel
...
This commit was SVN r24816.
2011-06-23 20:44:22 +00:00
Wesley Bland
e1ba09ad51
Add a resilience to ORTE. Allows the runtime to continue after a process (or
...
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.
Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php
This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
391074cde6
Add a tag
...
This commit was SVN r24813.
2011-06-23 15:12:25 +00:00