Ralph Castain
84713d5a84
Fix singletons again - must have been broken for a very long time, which only shows how little anyone cares about this capability.
...
This commit was SVN r25332.
2011-10-19 20:19:08 +00:00
Ralph Castain
b44f8d4b28
Complete implementation of the ess.proc_get_locality API. Up to this point, the API was only capable of telling if the specified proc was sharing a node with you. However, the returned value was capable of telling you much more detailed info - e.g., if the proc shares a socket, a cache, or numa node. We just didn't have the data to provide that detail.
...
Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times.
Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h
Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun.
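A minimal sketch of how a 16-bit locality bitmask of this kind can be built and tested. The flag names, the toy_topology_t layout, and toy_locality() are hypothetical, for illustration only; the real macros are the ones in opal/mca/paffinity/paffinity.h, and the real cpuset comparison is done through hwloc rather than plain 64-bit masks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names and values; not the actual OPAL macros. */
typedef uint16_t locality_t;

#define LOC_ON_NODE   0x0001u
#define LOC_ON_NUMA   0x0002u
#define LOC_ON_SOCKET 0x0004u

/* Test a single resource bit in a locality mask. */
#define LOC_SHARES(mask, flag) (((mask) & (flag)) != 0)

/* Toy topology: each entry is a bitmask of the logical CPUs
 * belonging to one NUMA node or one socket (assumption: 2 of each). */
typedef struct {
    uint64_t numa_cpus[2];
    uint64_t socket_cpus[2];
} toy_topology_t;

/* True if both cpusets have at least one CPU inside the given resource. */
static bool both_touch(uint64_t resource_cpus, uint64_t a, uint64_t b)
{
    return (resource_cpus & a) != 0 && (resource_cpus & b) != 0;
}

/* Compare the local cpuset against a peer's cpuset and report
 * which resources the two processes share. */
locality_t toy_locality(const toy_topology_t *t, uint64_t mine, uint64_t peer)
{
    /* Both cpusets refer to the same topology, so the node is shared. */
    locality_t loc = LOC_ON_NODE;

    for (int i = 0; i < 2; i++) {
        if (both_touch(t->numa_cpus[i], mine, peer)) {
            loc |= LOC_ON_NUMA;
        }
        if (both_touch(t->socket_cpus[i], mine, peer)) {
            loc |= LOC_ON_SOCKET;
        }
    }
    return loc;
}
```

A caller would compute the mask once per peer, cache it, and then test individual bits, e.g. LOC_SHARES(loc, LOC_ON_SOCKET), to pick a transport or collective algorithm.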
This commit was SVN r25331.
2011-10-19 20:18:14 +00:00
Ralph Castain
2958f3de34
Add some clarifying comments and a small efficiency improvement
...
This commit was SVN r25322.
2011-10-18 18:30:43 +00:00
Ralph Castain
b771114086
Fix the fix :-)
...
If the errmgr is going to try and hold the orted until all routes and children are gone, then the exit cmd must do the same. Otherwise, the orted exits immediately without waiting for routes to be dismantled, which is why we don't see the connections close.
Also cleanup some diagnostics and add some debug to more clearly see what's going on.
This commit was SVN r25321.
2011-10-18 17:56:37 +00:00
Ralph Castain
ae8e556d14
Okay, once again let's fix the vpid calculator. Identified problem with prior commit (some rmaps components already place their procs in the jdata->procs array, and others don't), so account for those variations.
...
This commit was SVN r25315.
2011-10-18 15:50:11 +00:00
George Bosilca
749b63c09d
Provide a generic fix for the termination issue instead of r25248.
...
The termination condition is to be checked at the daemon/HNP level, not down in the routing.
This commit was SVN r25313.
The following SVN revision numbers were found above:
r25248 --> open-mpi/ompi@b42ccc89b8
2011-10-18 03:07:37 +00:00
George Bosilca
f28890fbb7
Revert r25302 as it breaks the --bynode option.
...
This commit was SVN r25311.
The following SVN revision numbers were found above:
r25302 --> open-mpi/ompi@d7a8553179
2011-10-18 02:48:17 +00:00
Ralph Castain
2fdd9c6dea
Ensure mpirun doesn't pick this component
...
This commit was SVN r25307.
2011-10-17 22:28:28 +00:00
Ralph Castain
8f0ef54130
Complete implementation of pmi support. Ensure we support both mpirun and direct launch within same configuration to avoid requiring separate builds. Add support for generic pmi, not just under slurm. Add publish/subscribe support, although slurm's pmi implementation will just return an error as it hasn't been done yet.
...
This commit was SVN r25303.
2011-10-17 20:51:22 +00:00
Ralph Castain
d7a8553179
Fix the mapping algo for computing vpids - it was borked for bynode operations when using nperxxx directives
...
This commit was SVN r25302.
2011-10-17 19:49:04 +00:00
Ralph Castain
f1a5a26ba0
Minor cleanups
...
This commit was SVN r25289.
2011-10-14 18:46:03 +00:00
Ralph Castain
89a20de474
Remove unused includes. Ensure that the error log is at least always available as we otherwise segfault when reporting errors that occur prior to opening the errmgr framework
...
This commit was SVN r25288.
2011-10-14 18:45:11 +00:00
Ralph Castain
07dbbc6513
Sorry for mid-day correction - but folks are trying to test this, and we didn't realize it was still ignored :-(
...
This commit was SVN r25287.
2011-10-14 16:19:20 +00:00
Ralph Castain
7bb294f917
Fix debug flags - thanks Terry!
...
This commit was SVN r25286.
2011-10-14 16:10:21 +00:00
Ralph Castain
054c485dcf
Cleanup a race condition and an unreliable method that caused us to not properly handle procs that trapped sigterm for cleanup purposes while ORTE was trying to kill them. Thanks to Rick Payne and Ian Wells of Cisco for spending weeks chasing this down.
...
Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-(
This commit was SVN r25285.
2011-10-14 15:39:54 +00:00
Ralph Castain
08fa9e1c6a
Correct include path
...
This commit was SVN r25282.
2011-10-13 23:46:52 +00:00
Ralph Castain
b96ef2161d
Complete the PMI support. Generalize PMI operations to support both slurm and non-slurm environments. Correct some configuration issues - we really only want the PMI integration at the individual component level. Ensure that the pmi grpcomm component doesn't get selected when launching via mpirun by setting its priority below the bad component.
...
Only verified in a slurm environment as that's all I have access to...
This commit was SVN r25275.
2011-10-12 20:59:25 +00:00
Ralph Castain
634f83fc52
Fix the routed components. All had errors, some completely broken.
...
You cannot test
0 == ORTE_EPOCH_CMP(target->epoch,ORTE_EPOCH_INVALID)
when epoch is not configured, as this will always return true. This caused get_route to return an error in all non-binomial routed modules, and caused all components to return an error when delete_route was called.
So protect the checks with ORTE_ENABLE_EPOCH so we get the correct behavior.
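The failure mode can be sketched as follows, assuming the comparison macro collapses to a constant 0 when epochs are compiled out. All TOY_* names and toy_name_t are hypothetical stand-ins, not the ORTE definitions.

```c
#include <stdbool.h>

/* Hypothetical build switch standing in for ORTE_ENABLE_EPOCH. */
#ifndef TOY_ENABLE_EPOCH
#define TOY_ENABLE_EPOCH 0
#endif

#define TOY_EPOCH_INVALID (-1)

#if TOY_ENABLE_EPOCH
/* Three-way compare: <0, 0, >0. */
#define TOY_EPOCH_CMP(a, b) (((a) > (b)) - ((a) < (b)))
#else
/* Epochs compiled out: every comparison reports "equal". */
#define TOY_EPOCH_CMP(a, b) 0
#endif

typedef struct {
    int vpid;
    int epoch;
} toy_name_t;

/* Unguarded check: with epochs disabled this is always true,
 * so every target looks invalid and the caller returns an error. */
bool toy_target_invalid_unguarded(const toy_name_t *t)
{
    (void)t; /* unused when the macro collapses to 0 */
    return 0 == TOY_EPOCH_CMP(t->epoch, TOY_EPOCH_INVALID);
}

/* Guarded check: the epoch test only exists when epochs are enabled. */
bool toy_target_invalid_guarded(const toy_name_t *t)
{
#if TOY_ENABLE_EPOCH
    return 0 == TOY_EPOCH_CMP(t->epoch, TOY_EPOCH_INVALID);
#else
    (void)t;
    return false;
#endif
}
```

With TOY_ENABLE_EPOCH left at 0, the unguarded function rejects every target while the guarded one accepts them, which is the behavior difference the commit describes.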
This commit was SVN r25274.
2011-10-12 20:18:57 +00:00
Ralph Castain
24a46f2acb
These were missed by prior commit - need to remove lingering references to OPAL_HWLOC_HAVE_XML
...
This commit was SVN r25272.
2011-10-12 16:54:03 +00:00
George Bosilca
872d377021
Tell what the update status is.
...
This commit was SVN r25259.
2011-10-11 19:49:12 +00:00
Brian Barrett
98e98ce2c5
* opal_atomic_trylock is documented to return 0 if the lock was acquired, 1 otherwise.
...
It was doing the opposite, so this patch fixes the return values. All uses (all in ORTE) relied on the actual return values, not the documented values, so fix them as well.
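For reference, a trylock with the documented convention (0 on success, 1 on failure) might look like the sketch below. It uses C11 atomic_flag and hypothetical toy_* names; it is not the OPAL implementation, which uses its own per-architecture atomics.

```c
#include <stdatomic.h>

/* Hypothetical lock type for illustration only. */
typedef struct {
    atomic_flag held;
} toy_lock_t;

#define TOY_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns 0 if the lock was acquired, 1 otherwise.
 * test_and_set returns the previous value: false means the flag
 * was clear and we now own the lock. */
int toy_trylock(toy_lock_t *lock)
{
    return atomic_flag_test_and_set_explicit(&lock->held,
                                             memory_order_acquire) ? 1 : 0;
}

/* Release the lock so the next trylock can succeed. */
void toy_unlock(toy_lock_t *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

Callers written against this convention test for success with `0 == toy_trylock(&lock)`, which is why code that depended on the inverted return values had to change in the same patch.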
This commit was SVN r25257.
2011-10-11 18:43:45 +00:00
Ralph Castain
2f38ff5e54
Ensure we don't try to build this module unless pmi is specifically requested
...
This commit was SVN r25252.
2011-10-11 06:12:04 +00:00
Ralph Castain
baefdabd98
Add some debug. Now confirmed to work correctly (prior problem was with odin tcp connection, not code).
...
This commit was SVN r25249.
2011-10-11 02:15:17 +00:00
Ralph Castain
b42ccc89b8
Although this didn't solve the earlier termination problem, the code will be required once we get connection terminations properly detected. If a daemon (or HNP) is trying to terminate, then we need to check for termination conditions whenever a route is lost - when all child connections are gone, then we are free to finalize.
...
This commit was SVN r25248.
2011-10-10 21:41:49 +00:00
Ralph Castain
1aa1c2e9b4
Get the slurm pmi support working.
...
Cannot use infiniband, of course, as the oob can't make the connection - may try other existing methods. Modex may not quite be working right yet, as odin was having trouble making TCP connections, but at least the configure now works so things build, so save that for now.
This commit was SVN r25247.
2011-10-10 21:39:10 +00:00
Swen Boehm
08b4322a1a
patched the lex files to not issue the following compiler warning:
...
'yyunput' defined but not used
This commit was SVN r25246.
2011-10-10 18:13:04 +00:00
Ralph Castain
f1a3a35fcd
Cannot rely on detection of connection terminations for deciding when to exit as they don't always go away immediately. There is no info coming back anyway, so it's okay to just exit once the relay has been sent. The relay is sent via a blocking API, so just go ahead and quit.
...
This commit was SVN r25245.
2011-10-10 16:38:46 +00:00
George Bosilca
649af6c925
Mixing an enumerated type with another type (int) is tolerated but easily fixable.
...
This commit was SVN r25241.
2011-10-09 03:54:52 +00:00
Terry Dontje
c6691b4122
clean up local procs when abort or abort signal happens
...
This commit was SVN r25237.
2011-10-06 19:19:55 +00:00
Nathan Hjelm
79b14fc3b1
removed licensing warning
...
This commit was SVN r25235.
2011-10-05 20:31:27 +00:00
Nathan Hjelm
34afb5a0fa
first cut at general pmi check
...
This commit was SVN r25234.
2011-10-05 17:14:24 +00:00
George Bosilca
80c02647c8
Each level (OPAL/ORTE/OMPI) should only return its own constants, instead of the current mismatch.
...
This commit was SVN r25230.
2011-10-04 14:50:31 +00:00
George Bosilca
c6d6c9aece
Remove some #if by using the correct macro (aka. ORTE_EPOCH_CMP).
...
This commit was SVN r25229.
2011-10-04 14:42:40 +00:00
Samuel Gutierrez
25cbf79592
modifications to ras alps. this commit allows users to mpirun without having to set id environment variables (BASIL_RESERVATION_ID, OMPI_ALPS_RESID). note, however, that we preserved the old behavior. if an id environment variable is set, it will be obeyed and our new code path is essentially bypassed. if we missed something, please yell at us. with this commit, the use of ras-alps-command.sh is no longer needed... at least that is our hope.
...
This commit was SVN r25181.
2011-09-26 21:31:08 +00:00
Ralph Castain
8347385630
Fix the radix routed component.
...
This commit was SVN r25175.
2011-09-22 09:32:53 +00:00
Jeff Squyres
ecd603256a
* Rename opal_hwloc_components to opal_hwloc_base_components
...
* Fix some comments
This commit was SVN r25150.
2011-09-17 11:54:36 +00:00
Ralph Castain
1cd7b02df3
Add a set of default errmgr components that support solely the default "everything dies on error" behavior. Set their priority to be selected by default, but provide params to adjust those priorities to allow other component selection.
...
This commit was SVN r25139.
2011-09-13 22:03:45 +00:00
Ralph Castain
3c4f04f4d9
Ensure opal_hwloc_topology is NULL after being destroyed
...
This commit was SVN r25138.
2011-09-13 19:21:10 +00:00
Nathan Hjelm
079ccdf8b1
fix debugger co-location launching
...
This commit was SVN r25136.
2011-09-13 15:08:03 +00:00
Ralph Castain
ca7638553f
Remove stale code
...
This commit was SVN r25133.
2011-09-12 23:00:41 +00:00
Ralph Castain
556a05566e
Silence warning
...
This commit was SVN r25130.
2011-09-12 16:21:51 +00:00
Shiqing Fan
0aea775837
Set the compiler flags in a better way.
...
This commit was SVN r25125.
2011-09-12 08:24:27 +00:00
Ralph Castain
92c7372e20
Per the RFC from Jeff, move hwloc from opal/mca/common to its own static framework ala libevent. Have ORTE daemons collect the topology info at startup and, if --enable-hwloc-xml is set, send that info back to the HNP for later use. The HNP only retains unique topology "templates" to reduce memory footprint. Have the daemon include the local topology info in the nidmap buffer sent to each app so the apps don't all hammer the local system to discover it for themselves.
...
Remove the sysinfo framework as hwloc replaces that functionality.
This commit was SVN r25124.
2011-09-11 19:02:24 +00:00
Ralph Castain
2091e39bee
Record the file descriptor on the read event when building optimized
...
This commit was SVN r25123.
2011-09-11 18:57:14 +00:00
Rainer Keller
9d5afc58c6
- Fix breakage of the epoch changes with PGI:
...
Don't just include pre-processor macros between two strings ("s1" #if 0 ... "s2")...
Rather print out the epoch as 0 always...
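A portable way to get the same effect is to keep a single format string and make the conditional part an ordinary expression, as sketched below. The names are hypothetical and this is not the code from the commit. The motivation: when the output call is itself a macro, the C standard leaves preprocessing directives inside its argument list undefined, so compilers are free to disagree.

```c
#include <stdio.h>

/* Hypothetical build switch standing in for the epoch configure option. */
#ifndef TOY_ENABLE_EPOCH
#define TOY_ENABLE_EPOCH 0
#endif

typedef struct {
    int vpid;
    int epoch;
} toy_name_t;

/* Decide once, outside any call, what "the epoch" means in this build. */
#if TOY_ENABLE_EPOCH
#define TOY_EPOCH_OF(name) ((name)->epoch)
#else
#define TOY_EPOCH_OF(name) 0
#endif

/* One unconditional format string; no #if between string literals. */
int toy_format_name(char *buf, size_t len, const toy_name_t *name)
{
    return snprintf(buf, len, "vpid %d epoch %d",
                    name->vpid, TOY_EPOCH_OF(name));
}
```

With epochs disabled the epoch field is simply printed as 0, so the format string is identical in both configurations.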
This commit was SVN r25110.
2011-08-31 08:40:31 +00:00
Wesley Bland
f8740e5478
Correct a typo reported by Pasha.
...
This commit was SVN r25109.
2011-08-30 18:44:52 +00:00
Ralph Castain
03ddf8520b
Resolve not-used warnings
...
This commit was SVN r25101.
2011-08-27 14:27:15 +00:00
Ralph Castain
56ebfa23cc
ORTE configure options belong in orte/config, not opal.
...
This commit was SVN r25100.
2011-08-27 14:23:49 +00:00
Wesley Bland
f542ecd578
Fix a couple of problems with the resil code not compiling.
...
This commit was SVN r25099.
2011-08-27 03:21:00 +00:00
George Bosilca
a4245b8d63
Remove some warnings related to the resilience patch.
...
This commit was SVN r25097.
2011-08-27 00:15:34 +00:00