Ralph Castain
951d72692c
Reverse the #if direction so we report daemon failure to the errmgr - otherwise, we just hang if a daemon fails to start.
...
Reviewed with Josh.
This commit was SVN r25366.
2011-10-25 19:09:52 +00:00
Nathan Hjelm
433cfa3665
use single copy for some sends
...
This commit was SVN r25365.
2011-10-25 18:38:42 +00:00
Mike Dubman
9ffeeb69d9
fix help message
...
This commit was SVN r25364.
2011-10-25 14:02:43 +00:00
Samuel Gutierrez
c646c93eec
remove unneeded flags from cray xe6 platform file.
...
This commit was SVN r25363.
2011-10-24 18:42:43 +00:00
Samuel Gutierrez
663f4546f5
fix define typo in psm mtl.
...
This commit was SVN r25362.
2011-10-24 18:38:12 +00:00
Ralph Castain
c55cba55a7
Totally trivial spelling fix
...
This commit was SVN r25361.
2011-10-24 14:06:33 +00:00
Ralph Castain
a7cbc25658
Minor cleanups - check hwloc returns everywhere. Thanks to Chris Yeoh for pointing this out.
...
This commit was SVN r25360.
2011-10-24 14:05:26 +00:00
Ralph Castain
955d8e7d46
Allow apps to use pmi when launched by mpirun, if desired, without affecting daemons
...
This commit was SVN r25359.
2011-10-23 15:57:13 +00:00
Nathan Hjelm
e8af0d8589
don't use alps paffinity
...
This commit was SVN r25358.
2011-10-21 22:52:03 +00:00
Abhishek Kulkarni
46952e9008
Fix C/R functionality in trunk. Intra-node checkpointing of a job now works as expected.
...
Signed-off-by: Abhishek Kulkarni <adkulkar@osl.iu.edu>
This commit was SVN r25357.
2011-10-21 22:07:35 +00:00
Samuel Gutierrez
949364d2d6
update LANL Cray XE6 platform files to include PMI support.
...
This commit was SVN r25356.
2011-10-21 21:05:23 +00:00
Nathan Hjelm
7b1172b346
need a terminating character in the decoded string
...
This commit was SVN r25355.
2011-10-21 16:46:28 +00:00
Nathan Hjelm
fb19f56965
Cray doesn't define PMI2_SUCCESS
...
This commit was SVN r25354.
2011-10-21 16:34:22 +00:00
Nathan Hjelm
cd257ac707
fixed typo in pmi grpcomm
...
This commit was SVN r25353.
2011-10-21 16:28:36 +00:00
Nathan Hjelm
cd68dbe2b8
only try to build vader if xpmem is installed. unignore vader
...
This commit was SVN r25352.
2011-10-21 15:45:05 +00:00
Shiqing Fan
5711414eb7
Fix Windows build
...
This commit was SVN r25351.
2011-10-21 14:46:58 +00:00
Ralph Castain
53ef085567
Fix a minor issue seen by Jeff in specific failure pathway
...
This commit was SVN r25350.
2011-10-21 14:44:48 +00:00
Jeff Squyres
cbafea8f69
Add a DEPENDENCIES line so that if you edit something down in the
hwloc tree, it'll get picked up by the component (and therefore by
libopen-pal).
Thanks to Terry for finding the problem.
This commit was SVN r25349.
2011-10-21 11:39:52 +00:00
Ralph Castain
3e72fccacf
Cray's PMI implementation is quite different from slurm's - they extended PMI-1 by adding some, but not all, of the PMI-2 APIs. So you can't just switch to using PMI-2 functions as it isn't a complete implementation. Instead, you have to selectively figure out which ones they have in PMI-2, and use any missing ones from PMI-1. What fun.
...
Modify the configure logic and the PMI components to accommodate Cray's approach. Refactor the PMI error reporting code so it resides in only one place. Cray actually decided -not- to define the PMI-2 error codes, so we have to use the PMI-1 codes instead. More fun.
This commit was SVN r25348.
2011-10-21 04:54:38 +00:00
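The per-function fallback described in r25348 might look roughly like the C sketch below. It is only an illustration under stated assumptions: every name here (the EXAMPLE_HAVE_* feature flags, the example_pmi1_* and example_pmi2_* stand-ins, the EXAMPLE_PMI_* codes, example_pmi_error) is hypothetical and is not taken from the Open MPI sources or from any PMI header. The idea is that configure defines one flag per PMI-2 function it finds, each wrapper falls back to the PMI-1 call when its flag is off, and return codes are translated to text in a single place.

```c
#include <stddef.h>

/* Hypothetical PMI-1 style return codes; real values come from the
 * vendor's PMI header and may differ. */
#define EXAMPLE_PMI_SUCCESS         0
#define EXAMPLE_PMI_FAIL            1
#define EXAMPLE_PMI_ERR_INVALID_ARG 2

/* Hypothetical feature flags that a configure script would normally
 * define, one per PMI-2 function it detected. Defaults are assumed. */
#ifndef EXAMPLE_HAVE_PMI2_KVS_PUT
#define EXAMPLE_HAVE_PMI2_KVS_PUT 1
#endif
#ifndef EXAMPLE_HAVE_PMI2_KVS_FENCE
#define EXAMPLE_HAVE_PMI2_KVS_FENCE 0
#endif

/* Records which API family served the last call (illustration only). */
const char *example_last_api = "none";

/* Stand-ins for the vendor's PMI-1 functions. */
int example_pmi1_kvs_put(const char *key, const char *value)
{
    if (key == NULL || value == NULL) return EXAMPLE_PMI_ERR_INVALID_ARG;
    example_last_api = "pmi1";
    return EXAMPLE_PMI_SUCCESS;
}

int example_pmi1_barrier(void)
{
    example_last_api = "pmi1";
    return EXAMPLE_PMI_SUCCESS;
}

/* Stand-ins for the subset of PMI-2 functions the vendor provides. */
#if EXAMPLE_HAVE_PMI2_KVS_PUT
int example_pmi2_kvs_put(const char *key, const char *value)
{
    if (key == NULL || value == NULL) return EXAMPLE_PMI_ERR_INVALID_ARG;
    example_last_api = "pmi2";
    return EXAMPLE_PMI_SUCCESS;
}
#endif

#if EXAMPLE_HAVE_PMI2_KVS_FENCE
int example_pmi2_kvs_fence(void)
{
    example_last_api = "pmi2";
    return EXAMPLE_PMI_SUCCESS;
}
#endif

/* Wrappers: use the PMI-2 call when present, else the PMI-1 one. */
int example_put(const char *key, const char *value)
{
#if EXAMPLE_HAVE_PMI2_KVS_PUT
    return example_pmi2_kvs_put(key, value);
#else
    return example_pmi1_kvs_put(key, value);
#endif
}

int example_fence(void)
{
#if EXAMPLE_HAVE_PMI2_KVS_FENCE
    return example_pmi2_kvs_fence();
#else
    return example_pmi1_barrier();
#endif
}

/* Single place that turns a return code into text, so every caller
 * reports errors the same way. */
const char *example_pmi_error(int rc)
{
    switch (rc) {
    case EXAMPLE_PMI_SUCCESS:         return "success";
    case EXAMPLE_PMI_FAIL:            return "operation failed";
    case EXAMPLE_PMI_ERR_INVALID_ARG: return "invalid argument";
    default:                          return "unknown PMI error";
    }
}
```

With the assumed defaults, example_put() goes through the PMI-2 stand-in and example_fence() falls back to the PMI-1 one, which mirrors the "some, but not all" situation the commit message describes.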
Ralph Castain
e2adc8fa3a
Ignore until Nathan can fix - probably configure problem
...
This commit was SVN r25347.
2011-10-21 03:43:01 +00:00
Ralph Castain
5947f61b86
Remove windows reference for now
...
This commit was SVN r25346.
2011-10-21 01:19:03 +00:00
Nathan Hjelm
414677a082
default to no xpmem support
...
This commit was SVN r25345.
2011-10-20 22:13:45 +00:00
Nathan Hjelm
ce29170968
update lanl xe6 platform files for vader
...
This commit was SVN r25344.
2011-10-20 21:50:53 +00:00
Nathan Hjelm
808a73a5c5
removed erroneous add of .deps
...
This commit was SVN r25343.
2011-10-20 21:41:51 +00:00
Nathan Hjelm
3dbaaf6879
initial commit of vader (xpmem) btl
...
This commit was SVN r25342.
2011-10-20 21:39:44 +00:00
Nathan Hjelm
586403f052
more pmi return code wtf
...
This commit was SVN r25337.
2011-10-20 17:53:04 +00:00
Nathan Hjelm
beb8d8ce32
pmi return code wtf
...
This commit was SVN r25336.
2011-10-20 17:51:24 +00:00
Ralph Castain
43e35486a4
Correct flag type - thanks George!
...
This commit was SVN r25335.
2011-10-20 04:00:13 +00:00
Nathan Hjelm
e1e8837992
add a uintptr_t to the seg_key union
...
This commit was SVN r25334.
2011-10-19 21:48:52 +00:00
George Bosilca
78751b3b2d
Put back the OPI errors after the ORTE one.
...
This commit was SVN r25333.
2011-10-19 20:57:13 +00:00
Ralph Castain
84713d5a84
Fix singletons again - must have been broken for a very long time, which only shows how little anyone cares about this capability.
...
This commit was SVN r25332.
2011-10-19 20:19:08 +00:00
Ralph Castain
b44f8d4b28
Complete implementation of the ess.proc_get_locality API. Up to this point, the API was only capable of telling if the specified proc was sharing a node with you. However, the returned value was capable of telling you much more detailed info - e.g., if the proc shares a socket, a cache, or numa node. We just didn't have the data to provide that detail.
...
Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times.
Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h.
Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun.
This commit was SVN r25331.
2011-10-19 20:18:14 +00:00
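To illustrate the locality bitmask described in r25331, here is a minimal C sketch of how such a mask could be built and tested. It rests on assumptions: the flag names and values, example_location_t, and example_compute_locality are all hypothetical. The real test macros live in opal/mca/paffinity/paffinity.h, and the real comparison works on hwloc cpusets rather than the simple resource ids used here.

```c
#include <stdint.h>

/* Hypothetical 16-bit locality mask. Flag names and values are
 * illustrative and are NOT the definitions from paffinity.h. */
typedef uint16_t example_locality_t;

#define EXAMPLE_LOCALITY_NONE 0x0000u
#define EXAMPLE_ON_NODE       0x0001u
#define EXAMPLE_ON_NUMA       0x0002u
#define EXAMPLE_ON_SOCKET     0x0004u
#define EXAMPLE_ON_CACHE      0x0008u

/* Test macro in the spirit of the ones the commit message mentions. */
#define EXAMPLE_SHARES(mask, flag) (((mask) & (flag)) != 0)

/* Simplified stand-in for a process's cpuset: one id per resource
 * level. A real implementation would compare hwloc bitmaps. */
typedef struct {
    int node;
    int numa;
    int socket;
    int cache;
} example_location_t;

/* Compare the local proc's location with a peer's and set one bit
 * for every resource level the two have in common. */
example_locality_t example_compute_locality(const example_location_t *me,
                                            const example_location_t *peer)
{
    example_locality_t mask = EXAMPLE_LOCALITY_NONE;

    /* Procs on different nodes share nothing. */
    if (me->node != peer->node) {
        return mask;
    }
    mask |= EXAMPLE_ON_NODE;

    if (me->numa == peer->numa) {
        mask |= EXAMPLE_ON_NUMA;
    }
    if (me->socket == peer->socket) {
        mask |= EXAMPLE_ON_SOCKET;
    }
    if (me->cache == peer->cache) {
        mask |= EXAMPLE_ON_CACHE;
    }
    return mask;
}
```

A caller would compute the mask once per peer, cache it (the commit stores it in the pmap object), and then test individual bits, for example EXAMPLE_SHARES(mask, EXAMPLE_ON_SOCKET), when choosing a transport or collective algorithm.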
George Bosilca
1bc5da0911
These are supposed to be OPAL level errors.
...
This commit was SVN r25326.
2011-10-19 14:22:09 +00:00
Ralph Castain
72a4b0bd8a
Fix constants
...
This commit was SVN r25325.
2011-10-19 14:14:58 +00:00
George Bosilca
a5f24bcdcf
The error here is meaningless.
...
This commit was SVN r25324.
2011-10-19 13:04:46 +00:00
George Bosilca
efd88e10d7
Cleanup the error codes. Get rid of all the useless ones, and
mark the distinction between ORTE and OMPI errors.
This commit was SVN r25323.
2011-10-19 03:51:53 +00:00
Ralph Castain
2958f3de34
Add some clarifying comments and a small efficiency improvement
...
This commit was SVN r25322.
2011-10-18 18:30:43 +00:00
Ralph Castain
b771114086
Fix the fix :-)
...
If the errmgr is going to try and hold the orted until all routes and children are gone, then the exit cmd must do the same. Otherwise, the orted exits immediately without waiting for routes to be dismantled, which is why we don't see the connections close.
Also clean up some diagnostics and add some debug output to show more clearly what's going on.
This commit was SVN r25321.
2011-10-18 17:56:37 +00:00
Nathan Hjelm
adf950f4ab
LANL: don't use per-peer receive queues on rr-class
...
This commit was SVN r25320.
2011-10-18 16:45:44 +00:00
Nathan Hjelm
9155f1ba1f
LANL: up cq size
...
This commit was SVN r25319.
2011-10-18 16:40:35 +00:00
Nathan Hjelm
e16559983e
LANL: match tlcc QP settings with tlcc2
...
This commit was SVN r25318.
2011-10-18 16:32:05 +00:00
Nathan Hjelm
607d387088
LANL: use only shared receive queues on tlcc
...
This commit was SVN r25317.
2011-10-18 16:23:46 +00:00
Nathan Hjelm
90c55c5b35
LANL: use pmi on tlcc
...
This commit was SVN r25316.
2011-10-18 16:12:14 +00:00
Ralph Castain
ae8e556d14
Okay, once again let's fix the vpid calculator. Identified problem with prior commit (some rmaps components already place their procs in the jdata->procs array, and others don't), so account for those variations.
...
This commit was SVN r25315.
2011-10-18 15:50:11 +00:00
George Bosilca
749b63c09d
Provide a generic fix for the termination issue instead of r25248. The
termination condition is to be checked at the daemon/HNP level not down
in the routing.
This commit was SVN r25313.
The following SVN revision numbers were found above:
r25248 --> open-mpi/ompi@b42ccc89b8
2011-10-18 03:07:37 +00:00
George Bosilca
c453614f8b
A more meaningful name for this function (mpi_proc_complete_init
instead of ompi_proc_set_arch). Change the comment to reflect the
real behavior of the function.
This commit was SVN r25312.
2011-10-18 02:54:38 +00:00
George Bosilca
f28890fbb7
Revert r25302 as it breaks the --bynode option.
...
This commit was SVN r25311.
The following SVN revision numbers were found above:
r25302 --> open-mpi/ompi@d7a8553179
2011-10-18 02:48:17 +00:00
Ralph Castain
0bf4f48aa3
Don't need priority in this framework
...
This commit was SVN r25308.
2011-10-17 22:39:15 +00:00
Ralph Castain
2fdd9c6dea
Ensure mpirun doesn't pick this component
...
This commit was SVN r25307.
2011-10-17 22:28:28 +00:00
Nathan Hjelm
ad9005820f
fixed typo in last commit
...
This commit was SVN r25306.
2011-10-17 21:35:22 +00:00