Ralph Castain
bf09133631
Correctly track the number of debugger daemons being spawned
...
This commit was SVN r25741.
2012-01-19 18:17:07 +00:00
Ralph Castain
9d556e2f17
Allow daemons to use PMI to get their name where PMI support is available while using the standard grpcomm and other capabilities. Remove the GNI code from the alps ess component as that component should only be for alps/cnos installations.
...
This commit was SVN r25737.
2012-01-18 20:56:53 +00:00
Ralph Castain
6235a355de
Correctly handle co-spawning of daemons when attaching to a running job. We cannot use the general process mappers as we only want debugger daemons spawned on nodes where application procs already exist. So custom build the map for the debugger daemon job, and have the plm just launch that job without doing its usual vm-spawn step.
...
This commit was SVN r25736.
2012-01-18 00:19:49 +00:00
Ralph Castain
11a37d3978
Fix the default
...
This commit was SVN r25733.
2012-01-17 21:09:27 +00:00
Ralph Castain
12d163293b
Yeah, I know it's the middle of the afternoon. I'm bound to forget and commit this in with something else if I don't. Per request from LANL, if PMI support is requested on an ALPS machine, add a couple of libs in the right ordering so that static builds will work correctly.
...
This commit was SVN r25732.
2012-01-17 20:41:50 +00:00
Ralph Castain
fd0d9f73c6
Make preload_binaries an MCA param so it can be set in the default MCA parameters for a system
...
This commit was SVN r25728.
2012-01-17 17:16:05 +00:00
Shiqing Fan
f57f873404
Disable the debugger support for Windows.
...
This commit was SVN r25725.
2012-01-17 16:21:33 +00:00
Nathan Hjelm
a2437feba7
removed debug message
...
This commit was SVN r25722.
2012-01-12 20:23:59 +00:00
Nathan Hjelm
5ab1674138
fixed de bruijn copyrights
...
This commit was SVN r25720.
2012-01-12 17:18:08 +00:00
Nathan Hjelm
c57f18999d
added Debruijn routed component
...
This commit was SVN r25717.
2012-01-12 17:11:03 +00:00
Ralph Castain
477582abef
Grrrr....fix ALL the cases where the membind warning occurs.
...
This commit was SVN r25715.
2012-01-11 23:51:18 +00:00
Ralph Castain
ce7ddd0e10
Create the debugger attach fifo unless the user requests that we periodically poll instead.
...
This commit was SVN r25714.
2012-01-11 19:44:22 +00:00
Ralph Castain
bf103de66c
My apologies for doing this outside of the usual time restrictions, but we need to get this in so we can make progress.
...
Move the ORTE-level debugger code back into orterun and out of the ORTE library to resolve symbol conflicts.
This commit was SVN r25713.
2012-01-11 15:53:09 +00:00
Ralph Castain
167ad944c4
Surprise, surprise - hwloc treats memory binding as at the thread, not process, level. Thus, hwloc always sets the membind proc-level support flag to false, and indicates actual memory binding support via the thread-level flag. So...just to be safe, test -both- flags and issue the "no support" warning ONLY if both are false.
...
This commit was SVN r25709.
2012-01-11 01:12:57 +00:00
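The "test both flags" rule from r25709 can be sketched as below. This is only an illustration, not the committed code: `membind_support_t` and `should_warn_no_membind` are hypothetical names, and the struct is a local stand-in whose two fields mirror the process-level and thread-level support flags the message refers to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the two memory-binding support flags (hypothetical type). */
typedef struct {
    unsigned char set_thisproc_membind;   /* process-level flag */
    unsigned char set_thisthread_membind; /* thread-level flag */
} membind_support_t;

/* Issue the "no support" warning only if BOTH flags are false. */
bool should_warn_no_membind(const membind_support_t *support)
{
    assert(support != NULL);
    return !support->set_thisproc_membind && !support->set_thisthread_membind;
}
```

With this rule, a platform that reports only thread-level support does not trigger the warning.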
Shiqing Fan
e3dfc49ced
make correct use of the newly updated structures in the Windows module.
...
This commit was SVN r25699.
2012-01-09 11:08:34 +00:00
Ralph Castain
840841bb8f
Missed a couple
...
This commit was SVN r25686.
2011-12-29 23:30:19 +00:00
Ralph Castain
af7fb68cfb
If we forward envars in rsh, then we have to be very careful about both duplicate entries and disallowed characters on the cmd line. To aid with detecting duplicates, make all cmd line options be given in their mca variant. Check anything we might add for semi-colons and protect those values with quotes.
...
This commit was SVN r25685.
2011-12-29 23:25:25 +00:00
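The quoting step mentioned in r25685 might look roughly like this. `protect_value` is a hypothetical helper written for illustration; the actual rsh launcher code may differ, and this sketch does not handle values that already contain quote characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy value into out, wrapping it in double quotes
 * when it contains a semicolon. Returns 0 on success, -1 if out is too small. */
int protect_value(const char *value, char *out, size_t outlen)
{
    int n;

    assert(value != NULL && out != NULL);
    if (strchr(value, ';') != NULL) {
        n = snprintf(out, outlen, "\"%s\"", value);  /* protect with quotes */
    } else {
        n = snprintf(out, outlen, "%s", value);      /* pass through unchanged */
    }
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```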
Jeff Squyres
a4c8bb27fa
Pull in the MPIR_Breakpoint symbol via a dummy function in
...
debuggers_base_fns.c: orte_debugger_base_pull_mpir_breakpoint().
This commit was SVN r25660.
2011-12-15 18:39:34 +00:00
Ralph Castain
2dd2694f25
Fix comm_spawn in oversubscribed conditions. If oversubscription is allowed, let nodes flow into the mapper even if they are oversubscribed, constrained by the slots_max absolute ceiling. Clean up error messages when comm_spawn fails so it correctly and succinctly reports the error.
...
This commit was SVN r25659.
2011-12-15 18:04:48 +00:00
Ralph Castain
437c52d2bf
Routing must be enabled by default
...
This commit was SVN r25657.
2011-12-15 17:13:52 +00:00
Ralph Castain
1adefcc176
When routing is not enabled, all routes must go direct
...
This commit was SVN r25656.
2011-12-15 15:32:09 +00:00
Ralph Castain
a309c53bf2
Set the lifeline when we are tree spawning under rsh so that the orted can self-terminate when its parent dies
...
This commit was SVN r25655.
2011-12-15 15:29:53 +00:00
Nathan Hjelm
9dec101043
fix totalview launch through --debug
...
This commit was SVN r25654.
2011-12-15 15:19:13 +00:00
Ralph Castain
e683b2f9c7
Minor touchup - reset the pointer to the end of the list each time to ensure we get the nodes in correct daemon order
...
This commit was SVN r25651.
2011-12-14 22:16:52 +00:00
Ralph Castain
912abe8a6c
Catch one more use-case
...
This commit was SVN r25649.
2011-12-14 21:03:19 +00:00
Ralph Castain
f531b09a8d
Correctly handle -host and -hostfile options. Ensure the initial vm launch constrains itself to the union of specified hosts if those options are given. Get oversubscribe set correctly for that case.
...
This commit was SVN r25648.
2011-12-14 20:01:15 +00:00
George Bosilca
ac26f58bd7
I guess this wasn't yet ready for prime time.
...
This commit was SVN r25624.
2011-12-12 23:55:11 +00:00
Nathan Hjelm
885d5cbcf8
enable ptmalloc when using uGNI
...
This commit was SVN r25621.
2011-12-12 20:52:51 +00:00
Nathan Hjelm
be11acf727
bug fix. don't add node to allocated_nodes twice
...
This commit was SVN r25619.
2011-12-12 19:14:41 +00:00
Ralph Castain
3f1ae5d89b
No longer need this include
...
This commit was SVN r25606.
2011-12-09 00:40:07 +00:00
Ralph Castain
44094cd5b3
Remove compiler warning
...
This commit was SVN r25601.
2011-12-08 16:35:41 +00:00
Samuel Gutierrez
0a922dcb3e
fixes XE6 build.
...
This commit was SVN r25600.
2011-12-08 16:13:58 +00:00
Samuel Gutierrez
0588e9ba36
add Cray XK6 support to ras alps. the configuration file is a different format and is in a different place.
...
This commit was SVN r25599.
2011-12-08 14:05:02 +00:00
Ralph Castain
7180ad40ad
Fix a couple of minor buglets
...
This commit was SVN r25589.
2011-12-07 21:08:35 +00:00
Ralph Castain
3e7ab1212a
Since this has come up a number of times, have the rsh launcher add MCA params from the environment by default. If it finds that the cmd line is too long, error out with a message directing the user to set a param to ignore the environmental MCA params.
...
This commit was SVN r25581.
2011-12-07 01:24:36 +00:00
Ralph Castain
7510339725
Remove stale orte_vm_launch param. Add a param that allows users to specify envars to forward/set so they can do it in the MCA param file instead of only via mpirun cmd line.
...
This commit was SVN r25580.
2011-12-06 21:31:22 +00:00
Ralph Castain
15facc4ba6
Fix comm_spawn yet again...add another test
...
This commit was SVN r25579.
2011-12-06 20:15:40 +00:00
Ralph Castain
90b7f2a7bf
The rest of the multi app_context fix. Remove the restriction on number of app_contexts that can have zero np specified as multiple mappers now support that use-case. Update the ranking algorithms to respect and track bookmarks. Ensure we properly set the oversubscribed flag on a per-node basis.
...
This commit was SVN r25578.
2011-12-06 17:28:29 +00:00
Ralph Castain
d9c7764e9b
Remove some debug
...
This commit was SVN r25575.
2011-12-05 22:04:50 +00:00
Ralph Castain
df2f594aa8
Some cleanup associated with multiple app_contexts. Ensure nodes only get entered once into the map. Correctly handle bookmarks. Cleanup tracking of slots_inuse and correct detection of oversubscription.
...
Still need to resolve the ranking issue so it starts at the bookmark, but that will come next.
This commit was SVN r25574.
2011-12-05 22:01:08 +00:00
Abhishek Kulkarni
0b7c51fae2
Correct an invalid reference to a missing help file.
...
This commit was SVN r25573.
2011-12-05 21:29:07 +00:00
Josh Hursey
b5ac320826
* If not able to checkpoint at this time (say because we are already checkpointing or restarting) then make sure to re-set the listener so that we can checkpoint later.
...
* Work around duplicate node names in the map. It should not happen normally, but if the rmaps component gets this wrong provide a workaround. Ralph is working on an rmaps fix for this, so we will likely remove/comment out the fix later.
This commit was SVN r25572.
2011-12-05 19:29:26 +00:00
Josh Hursey
cc57840b53
Fix ess/tool so that it does not segv when using the rsh PLM. Just have it use the base function directly to avoid similar problems with finalizing other components.
...
This commit was SVN r25571.
2011-12-05 15:40:46 +00:00
Ralph Castain
6fefe236a4
Warn users if they set opal_paffinity_alone, either to true or false, that this parameter is no longer functional - they must use the --bind-to option and its corresponding mca param.
...
This commit was SVN r25567.
2011-12-03 01:10:52 +00:00
Ralph Castain
6cbd8fa6c9
Keep everyone in sync with new job state
...
This commit was SVN r25563.
2011-12-02 14:12:40 +00:00
Ralph Castain
07655e2945
Handle the case where the allocator "fibs" to us about the node names. In some cases (ahem...you know who you are!), the allocator will tell us a node number (e.g., "16"). However, the daemon will return a node name (e.g., "nid0016") - leaving us not recognizing its location.
...
So provide a new parameter (can't have too many!) that handles this situation by stripping the prefix from the returned node name. Also do a little cleanup to ensure we cleanly exit from errors, without generating too many annoying messages.
This commit was SVN r25562.
2011-12-02 14:10:08 +00:00
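The prefix stripping described in r25562 can be illustrated as follows. The commit does not say exactly how the prefix is identified, so this sketch makes an assumption: it skips leading non-digit characters and then any zero padding. `strip_node_prefix` and `node_matches` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: return a pointer into name past its non-digit
 * prefix and zero padding, so that "nid0016" yields "16". */
const char *strip_node_prefix(const char *name)
{
    const char *p = name;

    assert(name != NULL);
    /* skip the alphabetic (non-digit) prefix */
    while (*p != '\0' && !isdigit((unsigned char)*p)) {
        p++;
    }
    /* skip zero padding, but keep a final lone '0' */
    while (*p == '0' && isdigit((unsigned char)p[1])) {
        p++;
    }
    return p;
}

/* Compare the allocator's node number against the daemon's node name. */
int node_matches(const char *alloc_name, const char *daemon_name)
{
    return strcmp(alloc_name, strip_node_prefix(daemon_name)) == 0;
}
```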
Jeff Squyres
ecf6ba910c
Silence a few icc warnings about mixing enums with other types.
...
This commit was SVN r25560.
2011-12-02 13:18:54 +00:00
Ralph Castain
641e17f26c
A better way of handling fqdn allocations. Prior method was wrong as it equated "node1" with "node10", which definitely caused problems.
...
Detect the addition of fqdn nodes in the allocation. If not found, then strip all incoming hostnames from daemons of any domain info when matching those names against the names in the node pool.
Leave some protection and "live" diagnostic output in place so we can continue to detect problems across all environments.
This commit was SVN r25557.
2011-12-01 14:24:43 +00:00
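The "node1" versus "node10" confusion noted in r25557 is what a length-limited prefix comparison produces. The commit does not show the prior code, so `match_by_prefix` below is only an assumed example of that class of bug, and `match_stripped` is a hypothetical sketch of matching after the domain part of the daemon's hostname has been stripped.

```c
#include <assert.h>
#include <string.h>

/* Flawed: compares only the first strlen(pool_name) characters,
 * so "node1" appears to match "node10.example.com". */
int match_by_prefix(const char *pool_name, const char *daemon_name)
{
    assert(pool_name != NULL && daemon_name != NULL);
    return strncmp(pool_name, daemon_name, strlen(pool_name)) == 0;
}

/* Strip the domain from the daemon's hostname (everything from the first
 * '.'), then require the remaining short name to match exactly. */
int match_stripped(const char *pool_name, const char *daemon_name)
{
    size_t short_len;

    assert(pool_name != NULL && daemon_name != NULL);
    short_len = strcspn(daemon_name, ".");
    return strlen(pool_name) == short_len &&
           strncmp(pool_name, daemon_name, short_len) == 0;
}
```

As in the commit text, the stripped comparison is only appropriate when the node pool itself holds no fully qualified names.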
Ralph Castain
512aea79bc
Print the right nodename value, fix the strange case
...
This commit was SVN r25556.
2011-12-01 02:31:56 +00:00
Ralph Castain
44394c6b34
Add a little more protection
...
This commit was SVN r25555.
2011-12-01 00:30:56 +00:00
Ralph Castain
c4ea7a252a
Add a little protection against badly formed node names so we don't segfault if they are encountered
...
This commit was SVN r25554.
2011-11-30 23:33:59 +00:00
Ralph Castain
fa9e99454a
Don't divide by cpus-per-task - we'll deal with that at binding time.
...
This commit was SVN r25552.
2011-11-30 21:35:25 +00:00
Ralph Castain
c56acf60ca
Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order.
...
Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use.
Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact.
Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code.
This commit was SVN r25551.
2011-11-30 19:58:24 +00:00
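The reordering described in r25551 amounts to sorting nodes by the vpid of the daemon that reported from each one. The sketch below uses a plain array and `qsort`; the real code works on ORTE's own list and node objects, and `node_entry_t` and `order_nodes_by_daemon` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified node record. */
typedef struct {
    const char  *name;         /* node name reported by the daemon */
    unsigned int daemon_vpid;  /* vpid of the daemon running there */
} node_entry_t;

/* qsort comparator: ascending daemon vpid. */
static int compare_daemon_vpid(const void *a, const void *b)
{
    const node_entry_t *na = a;
    const node_entry_t *nb = b;
    return (na->daemon_vpid > nb->daemon_vpid) -
           (na->daemon_vpid < nb->daemon_vpid);
}

/* Put nodes in daemon-vpid order rather than allocation order. */
void order_nodes_by_daemon(node_entry_t *nodes, size_t count)
{
    assert(nodes != NULL || count == 0);
    if (count > 1) {
        qsort(nodes, count, sizeof(nodes[0]), compare_daemon_vpid);
    }
}
```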
George Bosilca
25476c7e54
buffer is not yet initialized, so there is no reason to release it.
...
This commit was SVN r25549.
2011-11-29 23:50:18 +00:00
Jeff Squyres
6fbbfd0f7a
Gah! r25545 accidentally included ''waaaay'' more stuff than it was
...
supposed to. I.e., half-baked/not complete stuff.
This commit backs out all of r25545. Sorry folks!
This commit was SVN r25546.
The following SVN revision numbers were found above:
r25545 --> open-mpi/ompi@7f9ae11faf
2011-11-29 23:24:52 +00:00
Jeff Squyres
7f9ae11faf
Per http://www.open-mpi.org/community/lists/users/2011/11/17862.php ,
...
to make MPI_IN_PLACE (and other sentinel Fortran constants) work on OS
X, we need to use the following compiler (linker) flag:
-Wl,-commons,use_dylibs
So if we're compiling on OS X, test to see if that flag works with the
compiler. If so, add it to the wrapper FFLAGS and FCFLAGS (note that
per a future update, we'll only have one Fortran compiler anyway).
Fixes trac:1982.
This commit was SVN r25545.
The following Trac tickets were found above:
Ticket 1982 --> https://svn.open-mpi.org/trac/ompi/ticket/1982
2011-11-29 23:05:54 +00:00
George Bosilca
7a238933b6
Silence a compiler warning.
...
This commit was SVN r25543.
2011-11-29 20:53:08 +00:00
Terry Dontje
b1bb339d23
fix r25507 rationalization of rsh support by removing include of plm_base_rsh_support.h from tm module
...
This commit was SVN r25519.
The following SVN revision numbers were found above:
r25507 --> open-mpi/ompi@b475421c16
2011-11-29 11:49:41 +00:00
Ralph Castain
0d55a3d739
Missed one spot...
...
This commit was SVN r25517.
2011-11-28 22:30:53 +00:00
Ralph Castain
237c79b6d7
Fix daemon collectives - missed the one spot where returning orte_routed_tree_t was required. Sigh. Change the routed components to return that type on the list of children when get_routing_tree is called.
...
This commit was SVN r25516.
2011-11-28 22:24:49 +00:00
Ralph Castain
89e5bd27a2
Fix copyright date
...
This commit was SVN r25512.
2011-11-28 15:54:04 +00:00
Ralph Castain
70ab8422b1
Per the internal comments, the delay between ssh invocations is not there for debugging purposes, but rather to allow for NIS authentication times. We have seen that problem in the past, so don't just do the delay when we are debugging - use the delay for the intended purpose. Also, allow for shorter than second-level delays as it doesn't always have to be so long.
...
This commit was SVN r25510.
2011-11-27 01:49:42 +00:00
Ralph Castain
b173316b74
Don't induce a delay between spawns unless specifically asked to do so
...
This commit was SVN r25509.
2011-11-26 16:50:31 +00:00
Ralph Castain
b475421c16
As promised, rationalize the rsh support. Remove rshbase and the base rsh support, centralizing all rsh support into the rsh component. Remove the "slave" launch support as that experiment is complete. Fix tree spawn and make that the default method for rsh launch, turning it "off" for qrsh as that system does not support tree spawn.
...
This commit was SVN r25507.
2011-11-26 02:33:05 +00:00
Ralph Castain
9b59d8de6f
This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build.
...
Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations.
Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way.
This commit was SVN r25497.
2011-11-22 21:24:35 +00:00
George Bosilca
1000af1c48
No need to abort there, returning an error triggers the
...
abort at the upper level.
This commit was SVN r25494.
2011-11-18 19:07:26 +00:00
Ralph Castain
866edf6a89
Now that George has found his problem, we no longer need the bozo check. Interesting how these platform-specific issues surface...
...
This commit was SVN r25493.
2011-11-18 17:43:14 +00:00
George Bosilca
b613c7eacb
Fix the issue with the round robin mapper. When mixing
...
different precisions, one should manually promote the
participants to the expected type. In this particular
example as opal_list_get_size returns an unsigned long,
the computation on the left side is translated to an
unsigned. If the hostfile contains more nodes than what is
required (via the -np), this leads to a gigantic value
for the balance, and breaks the round robin algorithm.
This commit was SVN r25492.
2011-11-18 17:03:35 +00:00
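The promotion problem described in r25492 can be reproduced in a few lines. The exact expression in the round robin mapper is not shown in the commit, so the subtraction below is an assumed stand-in; `balance_wrapping` and `balance_promoted` are hypothetical names, and `num_nodes` plays the role of the unsigned long value returned by opal_list_get_size.

```c
#include <assert.h>

/* Mixed int/unsigned long arithmetic: np is converted to unsigned long,
 * so the result wraps around when there are more nodes than procs. */
unsigned long balance_wrapping(int np, unsigned long num_nodes)
{
    assert(np >= 0);
    return np - num_nodes;
}

/* Manually promote both operands to a signed type before subtracting. */
long balance_promoted(int np, unsigned long num_nodes)
{
    assert(np >= 0);
    return (long)np - (long)num_nodes;
}
```

With two procs and four nodes, the first version yields a value near the top of the unsigned long range, while the second yields a small negative number that the caller can test for.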
Ralph Castain
1e5e9bde77
Add protection against a bozo case where we could end up in an infinite loop while calculating ranks
...
This commit was SVN r25491.
2011-11-18 15:35:55 +00:00
George Bosilca
88d32312d6
The bind_level should be initialized to zero or weird things happen. I'm
...
not yet sure how and why, but packing a uint8_t with opal_dss leads to
weird values during unpack (except if the original value is already
set to zero).
This commit was SVN r25490.
2011-11-18 10:22:58 +00:00
George Bosilca
61f273b987
Do not tolerate uninitialized variables.
...
This commit was SVN r25489.
2011-11-18 10:19:24 +00:00
Ralph Castain
b34acd0476
Grrr....get the correct number too!
...
This commit was SVN r25478.
2011-11-15 11:11:47 +00:00
Ralph Castain
593fc388a9
Make it a little more obvious as to which nodes are from each topology by labeling them with a letter.
...
This commit was SVN r25477.
2011-11-15 11:10:39 +00:00
Ralph Castain
6310361532
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
...
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
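The new oversubscription default from item 1 of r25476 can be summarized as a small decision function. This is a sketch of the stated policy only; the enum and function names are hypothetical and do not correspond to actual ORTE flags or parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the node allocation came from (hypothetical enum). */
typedef enum { ALLOC_FROM_HOSTFILE, ALLOC_FROM_RM } alloc_source_t;

/* Explicit user override, if any (hypothetical enum). */
typedef enum { OVERRIDE_NONE, OVERRIDE_ALLOW, OVERRIDE_FORBID } oversub_override_t;

/* An explicit override always wins. Otherwise hostfile/dash-host
 * allocations allow oversubscription and RM allocations do not. */
bool oversubscribe_allowed(alloc_source_t source, oversub_override_t user_override)
{
    assert(source == ALLOC_FROM_HOSTFILE || source == ALLOC_FROM_RM);
    if (user_override == OVERRIDE_ALLOW) {
        return true;
    }
    if (user_override == OVERRIDE_FORBID) {
        return false;
    }
    return source == ALLOC_FROM_HOSTFILE;
}
```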
Ralph Castain
c8e105bd8c
Remove stale code
...
This commit was SVN r25475.
2011-11-14 23:39:23 +00:00
Ralph Castain
793f4c688f
Extend capability to support heterogeneous clusters with multiple topologies
...
This commit was SVN r25474.
2011-11-13 23:23:09 +00:00
Ralph Castain
6b5e1b89cf
Turn off tree spawn as it doesn't currently work - will fix shortly. Add topology collection
...
This commit was SVN r25472.
2011-11-11 23:42:36 +00:00
Ralph Castain
d008aeb531
Silence debug
...
This commit was SVN r25471.
2011-11-11 16:42:45 +00:00
George Bosilca
3d318a4c26
Put the interface of our MPIR support in sync with the document accepted by the MPI
...
Forum (http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf ).
This commit was SVN r25456.
2011-11-08 01:24:16 +00:00
George Bosilca
85a18dab74
MPIR_partial_attach_ok is not a volatile, but a constant.
...
This commit was SVN r25455.
2011-11-08 01:00:38 +00:00
Ralph Castain
a3ce355a60
Revert r25453 and r25450 until we can fix the libevent2013 configure code - still not getting the includedir to eval correctly.
...
This commit was SVN r25454.
The following SVN revision numbers were found above:
r25450 --> open-mpi/ompi@7f7d5c4f1f
r25453 --> open-mpi/ompi@c9fe8c32e2
2011-11-07 16:23:44 +00:00
Samuel Gutierrez
3ea59cce96
minor cleanup to getenv_pmi.c.
...
This commit was SVN r25449.
2011-11-07 03:18:07 +00:00
Samuel Gutierrez
e03bc93fb7
only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code.
...
This commit was SVN r25447.
2011-11-06 17:28:40 +00:00
Ralph Castain
34f0a27cb6
Initialize the locality info - at time of pmap creation, we at least know node locality
...
This commit was SVN r25446.
2011-11-06 17:06:41 +00:00
Ralph Castain
729935dffb
Minor cleanups, mirroring what Jeff did to ompi_info
...
This commit was SVN r25438.
2011-11-05 00:42:49 +00:00
Ralph Castain
fcee46b063
Add an option for printing a diffable process map for testing mappers
...
This commit was SVN r25428.
2011-11-03 14:22:07 +00:00
Samuel Gutierrez
3fe7b3ee54
add PMI support to ess alps module. xt system guys: please yell at me if i missed something in cnos.
...
This commit was SVN r25423.
2011-11-03 04:04:32 +00:00
Samuel Gutierrez
27b9bcfafd
update ess alps configuration file to include CNOS and PMI checks. some of the features committed here aren't being used, but they will be. also update orte_check_pmi.m4 to include missing call to action-if-not-found if --with-pmi is not specified or is disabled.
...
This commit was SVN r25422.
2011-11-03 02:14:47 +00:00
Jeff Squyres
7f6f7bd0eb
Remove this component; twitter long ago switched to the oauth
...
authentication, and no one has ever updated this component to match.
It can be revived out of history if anyone cares.
This commit was SVN r25421.
2011-11-02 21:04:49 +00:00
Ralph Castain
891027c10d
Cleanup error reports
...
This commit was SVN r25420.
2011-11-02 18:34:19 +00:00
Ralph Castain
b2e2d24726
As in the rsh module, report failed daemons to the errmgr for proper cleanup
...
This commit was SVN r25419.
2011-11-02 18:30:22 +00:00
Ralph Castain
3e4165fd8d
Cleanup includes
...
This commit was SVN r25418.
2011-11-02 18:28:28 +00:00
Ralph Castain
b77552c45d
Cleanup some include files, return a silent error in open/select as the complaining component already output a message
...
This commit was SVN r25416.
2011-11-02 17:42:06 +00:00
Ralph Castain
198e001554
Add another test
...
This commit was SVN r25415.
2011-11-02 15:59:16 +00:00
Ralph Castain
55b996678e
Minor indentation changes
...
This commit was SVN r25414.
2011-11-02 15:56:56 +00:00
Ralph Castain
f00753881e
Handle the case where mpirun -is- of the same topology as the compute nodes.
...
This commit was SVN r25412.
2011-11-01 22:26:03 +00:00
Ralph Castain
d28dd55d33
Minimize the amount of topology info returned by the daemons. Most clusters, especially at scale, use the same node topology on every node, so there is no reason to return the topology from every daemon. Borrow a page from the --hetero-apps page and let users indicate that the node topology differs by adding a --hetero-nodes option to mpirun. If the option is set, then every daemon returns topology info. If not set, then only daemon vpid=1 returns it.
...
We always want one daemon to return the topology as the head node is often different from the compute nodes. Having one daemon return the compute node topology allows us to detect any such difference. All compute nodes are then set to the same topology.
This commit was SVN r25408.
2011-11-01 18:43:10 +00:00
Ralph Castain
14966e0f8f
Cleanup PMI startup - if a component isn't selected, it should finalize PMI IFF it started it. Otherwise, components that aren't selected can finalize PMI when it is in use by other parts of the system.
...
This commit was SVN r25407.
2011-11-01 16:25:12 +00:00
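The "finalize PMI IFF it started it" rule from r25407 can be modeled with a per-component ownership flag. The sketch below simulates PMI state with a plain struct instead of calling a real PMI library; every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated process-wide PMI state (stand-in for the real library). */
typedef struct { bool initialized; } pmi_state_t;

/* Each component remembers whether it was the one that started PMI. */
typedef struct { bool started_pmi; } component_t;

/* On open: start PMI only if nobody has done so yet, and record ownership. */
void component_open(component_t *c, pmi_state_t *pmi)
{
    assert(c != NULL && pmi != NULL);
    c->started_pmi = !pmi->initialized;
    if (c->started_pmi) {
        pmi->initialized = true;
    }
}

/* When not selected: finalize PMI only if this component started it. */
void component_close_unselected(component_t *c, pmi_state_t *pmi)
{
    assert(c != NULL && pmi != NULL);
    if (c->started_pmi) {
        pmi->initialized = false;
        c->started_pmi = false;
    }
}
```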
Ralph Castain
71ed8e3cd3
Bring back the local node's binding capabilities along with its topology. Clean up indentation.
...
This commit was SVN r25399.
2011-10-30 13:20:16 +00:00
Ralph Castain
d492b20975
Bozo check for topology info
...
This commit was SVN r25398.
2011-10-30 11:49:38 +00:00
Ralph Castain
4232115a98
Ensure pruning remains within the current job/app being mapped.
...
This commit was SVN r25397.
2011-10-30 00:02:20 +00:00
Ralph Castain
648c85b41b
Add a simple pattern mapper as an example of how to use the topology info to create desired mappings. Let the user specify a pattern based on resource types, and map that pattern across all available nodes as resources permit.
...
Don't automatically display the topology for each node when --display-devel-map is set as it can overwhelm the reader. Use a separate flag --display-topo to get it.
This commit was SVN r25396.
2011-10-29 15:12:45 +00:00
Ralph Castain
12a589130a
Add some debug
...
This commit was SVN r25395.
2011-10-29 15:07:58 +00:00
Ralph Castain
965b04d1a5
Use the new utilities to get a topology that reflects available cpus
...
This commit was SVN r25394.
2011-10-29 15:07:36 +00:00
Ralph Castain
e50bcbf028
Add the ability to specify a topology-containing xml file to describe the simulated nodes to support mapping tests against arbitrary topologies
...
This commit was SVN r25388.
2011-10-29 02:01:11 +00:00
Ralph Castain
7fa5f82d70
Add simulator component to support testing of large scale mapping methods. Automatically sets do-not-resolve and do-not-launch, and creates however many nodes the user wants to simulate in the system.
...
This commit was SVN r25386.
2011-10-28 23:48:53 +00:00
Ralph Castain
e2eb8d5f78
Remove bad param registration - that param was already registered as an int_name in another location.
...
This commit was SVN r25381.
2011-10-28 19:14:43 +00:00
Josh Hursey
6726590b1c
Remove the 'ess_node_rank' accessor from here. This caused running under 'tm' to segv at the orteds.
...
It just looks like this part of the component was not updated during r25331. It was removed from the 'env' and 'slurm' environments in that patch. It looks like 'tm' was updated, but did not get this particular piece.
This commit was SVN r25380.
The following SVN revision numbers were found above:
r25331 --> open-mpi/ompi@b44f8d4b28
2011-10-28 17:41:35 +00:00
Josh Hursey
59ff1dbbfb
Fix indentation problem that caused a segv when running without regex.
...
This was introduced in r25063.
This commit was SVN r25379.
The following SVN revision numbers were found above:
r25063 --> open-mpi/ompi@e58623cd5b
2011-10-28 13:39:32 +00:00
Samuel Gutierrez
922e41a318
fix typo. use PMI_Initialized for init status instead of PMI_Init.
...
This commit was SVN r25377.
2011-10-27 22:27:30 +00:00
Ralph Castain
951d72692c
Reverse the #if direction so we report daemon failure to the errmgr - otherwise, we just hang if a daemon fails to start.
...
Reviewed with Josh.
This commit was SVN r25366.
2011-10-25 19:09:52 +00:00
Ralph Castain
c55cba55a7
Totally trivial spelling fix
...
This commit was SVN r25361.
2011-10-24 14:06:33 +00:00
Ralph Castain
955d8e7d46
Allow apps to use pmi when launched by mpirun, if desired, without affecting daemons
...
This commit was SVN r25359.
2011-10-23 15:57:13 +00:00
Nathan Hjelm
e8af0d8589
don't use alps paffinity
...
This commit was SVN r25358.
2011-10-21 22:52:03 +00:00
Abhishek Kulkarni
46952e9008
Fix C/R functionality in trunk. Intra-node checkpointing of a job now works as expected.
...
Signed-off-by: Abhishek Kulkarni <adkulkar@osl.iu.edu>
This commit was SVN r25357.
2011-10-21 22:07:35 +00:00
Nathan Hjelm
7b1172b346
need a terminating character in the decoded string
...
This commit was SVN r25355.
2011-10-21 16:46:28 +00:00
Nathan Hjelm
cd257ac707
fixed typo in pmi grpcomm
...
This commit was SVN r25353.
2011-10-21 16:28:36 +00:00
Shiqing Fan
5711414eb7
Fix Windows build
...
This commit was SVN r25351.
2011-10-21 14:46:58 +00:00
Ralph Castain
53ef085567
Fix a minor issue seen by Jeff in specific failure pathway
...
This commit was SVN r25350.
2011-10-21 14:44:48 +00:00
Ralph Castain
3e72fccacf
Cray's PMI implementation is quite different from slurm's - they extended PMI-1 by adding some, but not all, of the PMI-2 APIs. So you can't just switch to using PMI-2 functions as it isn't a complete implementation. Instead, you have to selectively figure out which ones they have in PMI-2, and use any missing ones from PMI-1. What fun.
...
Modify the configure logic and the PMI components to accommodate Cray's approach. Refactor the PMI error reporting code so it resides in only one place. Cray actually decided -not- to define the PMI-2 error codes, so we have to use the PMI-1 codes instead. More fun.
This commit was SVN r25348.
2011-10-21 04:54:38 +00:00
Nathan Hjelm
beb8d8ce32
pmi return code wtf
...
This commit was SVN r25336.
2011-10-20 17:51:24 +00:00
Ralph Castain
84713d5a84
Fix singletons again - must have been broken for a very long time, which only shows how little anyone cares about this capability.
...
This commit was SVN r25332.
2011-10-19 20:19:08 +00:00
Ralph Castain
b44f8d4b28
Complete implementation of the ess.proc_get_locality API. Up to this point, the API was only capable of telling if the specified proc was sharing a node with you. However, the returned value was capable of telling you much more detailed info - e.g., if the proc shares a socket, a cache, or numa node. We just didn't have the data to provide that detail.
...
Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times.
Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h
Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun.
This commit was SVN r25331.
2011-10-19 20:18:14 +00:00
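Testing the 16-bit locality bitmask described in r25331 might look like the following. The bit values and the `EX_` macro names are invented for illustration; the real definitions and test macros live in opal/mca/paffinity/paffinity.h and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t locality_t;

/* Hypothetical bit assignments, one bit per shared resource level. */
#define EX_ON_NODE    0x0001
#define EX_ON_NUMA    0x0002
#define EX_ON_SOCKET  0x0004
#define EX_ON_CACHE   0x0008

/* True if the given resource bit is set in the locality mask. */
#define EX_SHARES(loc, bit) (((loc) & (bit)) != 0)

/* Does the peer share a socket with us? */
bool shares_socket(locality_t loc)
{
    /* a socket-level bit without the node-level bit would be malformed */
    assert(!(loc & EX_ON_SOCKET) || (loc & EX_ON_NODE));
    return EX_SHARES(loc, EX_ON_SOCKET);
}
```

A caller would fetch the mask once per peer and then test individual levels as needed, which fits the commit's note that the value is cached in the pmap object.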
Ralph Castain
2958f3de34
Add some clarifying comments and a small efficiency improvement
...
This commit was SVN r25322.
2011-10-18 18:30:43 +00:00
Ralph Castain
b771114086
Fix the fix :-)
...
If the errmgr is going to try and hold the orted until all routes and children are gone, then the exit cmd must do the same. Otherwise, the orted exits immediately without waiting for routes to be dismantled, which is why we don't see the connections close.
Also clean up some diagnostics and add some debug output to show more clearly what's going on.
This commit was SVN r25321.
2011-10-18 17:56:37 +00:00
Ralph Castain
ae8e556d14
Okay, once again let's fix the vpid calculator. Identified problem with prior commit (some rmaps components already place their procs in the jdata->procs array, and others don't), so account for those variations.
...
This commit was SVN r25315.
2011-10-18 15:50:11 +00:00
George Bosilca
749b63c09d
Provide a generic fix for the termination issue instead of r25248. The
...
termination condition is to be checked at the daemon/HNP level not down
in the routing.
This commit was SVN r25313.
The following SVN revision numbers were found above:
r25248 --> open-mpi/ompi@b42ccc89b8
2011-10-18 03:07:37 +00:00
George Bosilca
f28890fbb7
Revert r25302 as it breaks the --bynode option.
...
This commit was SVN r25311.
The following SVN revision numbers were found above:
r25302 --> open-mpi/ompi@d7a8553179
2011-10-18 02:48:17 +00:00
Ralph Castain
2fdd9c6dea
Ensure mpirun doesn't pick this component
...
This commit was SVN r25307.
2011-10-17 22:28:28 +00:00
Ralph Castain
8f0ef54130
Complete implementation of pmi support. Ensure we support both mpirun and direct launch within same configuration to avoid requiring separate builds. Add support for generic pmi, not just under slurm. Add publish/subscribe support, although slurm's pmi implementation will just return an error as it hasn't been done yet.
...
This commit was SVN r25303.
2011-10-17 20:51:22 +00:00
Ralph Castain
d7a8553179
Fix the mapping algo for computing vpids - it was borked for bynode operations when using nperxxx directives
...
This commit was SVN r25302.
2011-10-17 19:49:04 +00:00
Ralph Castain
f1a5a26ba0
Minor cleanups
...
This commit was SVN r25289.
2011-10-14 18:46:03 +00:00
Ralph Castain
89a20de474
Remove unused includes. Ensure that the error log is at least always available as we otherwise segfault when reporting errors that occur prior to opening the errmgr framework
...
This commit was SVN r25288.
2011-10-14 18:45:11 +00:00
Ralph Castain
07dbbc6513
Sorry for mid-day correction - but folks are trying to test this, and we didn't realize it was still ignored :-(
...
This commit was SVN r25287.
2011-10-14 16:19:20 +00:00
Ralph Castain
7bb294f917
Fix debug flags - thanks Terry!
...
This commit was SVN r25286.
2011-10-14 16:10:21 +00:00
Ralph Castain
054c485dcf
Clean up a race condition and an unreliable method that caused us to not properly handle procs that trapped sigterm for cleanup purposes while ORTE was trying to kill them. Thanks to Rick Payne and Ian Wells of Cisco for spending weeks chasing this down.
...
Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-(
This commit was SVN r25285.
2011-10-14 15:39:54 +00:00
Ralph Castain
08fa9e1c6a
Correct include path
...
This commit was SVN r25282.
2011-10-13 23:46:52 +00:00
Ralph Castain
b96ef2161d
Complete the PMI support. Generalize PMI operations to support both slurm and non-slurm environments. Correct some configuration issues - we really only want the PMI integration at the individual component level. Ensure that the pmi grpcomm component doesn't get selected when launching via mpirun by setting its priority below the bad component.
...
Only verified in a slurm environment as that's all I have access to...
This commit was SVN r25275.
2011-10-12 20:59:25 +00:00
Ralph Castain
634f83fc52
Fix the routed components. All had errors, some completely broken. You cannot test
...
0 == ORTE_EPOCH_CMP(target->epoch,ORTE_EPOCH_INVALID)
when epoch is not configured as this will always return true. This caused get_route to return an error in all non-binomial routed modules, and caused all components to return an error when delete_route was called.
So protect the checks with ORTE_ENABLE_EPOCH so we get the correct behavior.
This commit was SVN r25274.
2011-10-12 20:18:57 +00:00
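The problem described in r25274 can be sketched as follows. This is a simplified illustration, assuming the comparison macro collapses to a constant 0 when epoch support is compiled out; all `EX_`/`ex_` names are hypothetical and only mirror the shape of the ORTE macros mentioned in the commit message.

```c
#include <stdbool.h>

/* Hypothetical build-time switch standing in for ORTE_ENABLE_EPOCH. */
#ifndef EX_ENABLE_EPOCH
#define EX_ENABLE_EPOCH 0
#endif

#if EX_ENABLE_EPOCH
#define EX_EPOCH_INVALID   (-1)
#define EX_EPOCH_CMP(a, b) ((a) - (b))
#else
/* Assumed behavior without epochs: the comparison is always "equal". */
#define EX_EPOCH_INVALID   0
#define EX_EPOCH_CMP(a, b) 0
#endif

typedef struct {
    int vpid;
    int epoch;
} ex_name_t;

/* Reject a target only for reasons that are meaningful in this build. */
static bool ex_target_is_invalid(const ex_name_t *target)
{
#if EX_ENABLE_EPOCH
    /* Without this guard, "0 == EX_EPOCH_CMP(...)" would always be true
     * in non-epoch builds and every target would be rejected. */
    if (0 == EX_EPOCH_CMP(target->epoch, EX_EPOCH_INVALID)) {
        return true;
    }
#endif
    return target->vpid < 0;
}
```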
Ralph Castain
24a46f2acb
These were missed by prior commit - need to remove lingering references to OPAL_HWLOC_HAVE_XML
...
This commit was SVN r25272.
2011-10-12 16:54:03 +00:00
George Bosilca
872d377021
Tell what the update status is.
...
This commit was SVN r25259.
2011-10-11 19:49:12 +00:00
Brian Barrett
98e98ce2c5
* opal_atomic_trylock is documented to return 0 if the lock was acquired,
...
1 otherwise. It was doing the opposite, so this patch fixes the
return values. All uses (all in ORTE) used the actual return values,
not the documented values, so fix them as well.
This commit was SVN r25257.
2011-10-11 18:43:45 +00:00
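To make the documented convention from r25257 concrete (0 if the lock was acquired, 1 otherwise), here is a minimal sketch built on C11 `atomic_flag`. It is not the opal_atomic implementation; the `ex_` names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical lock type for illustration only. */
typedef struct {
    atomic_flag held;
} ex_lock_t;

#define EX_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns 0 if the lock was acquired, 1 if it was already held. */
static int ex_trylock(ex_lock_t *lock)
{
    /* test_and_set returns the previous value: false means it was free. */
    return atomic_flag_test_and_set_explicit(&lock->held,
                                             memory_order_acquire) ? 1 : 0;
}

static void ex_unlock(ex_lock_t *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

Under this convention a caller writes `if (0 == ex_trylock(&lock)) { /* critical section */ ex_unlock(&lock); }`, which is the sense the commit says existing call sites had inverted.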
Ralph Castain
2f38ff5e54
Ensure we don't try to build this module unless pmi is specifically requested
...
This commit was SVN r25252.
2011-10-11 06:12:04 +00:00
Ralph Castain
baefdabd98
Add some debug. Now confirmed to work correctly (prior problem was with odin tcp connection, not code).
...
This commit was SVN r25249.
2011-10-11 02:15:17 +00:00
Ralph Castain
b42ccc89b8
Although this didn't solve the earlier termination problem, the code will be required once we get connection terminations properly detected. If a daemon (or HNP) is trying to terminate, then we need to check for termination conditions whenever a route is lost - when all child connections are gone, then we are free to finalize.
...
This commit was SVN r25248.
2011-10-10 21:41:49 +00:00
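The termination check described in r25248 might look roughly like the sketch below: each time a route is lost, a terminating daemon re-checks whether any child connections remain. The struct and function names are hypothetical and the bookkeeping is reduced to a single counter.

```c
#include <stdbool.h>

/* Hypothetical daemon state; the real ORTE bookkeeping is richer. */
typedef struct {
    bool terminating;   /* set once an exit has been requested */
    int  num_children;  /* child connections still open        */
} ex_daemon_t;

/* Called when a child route is lost; returns true when it is safe to
 * finalize, i.e. we are terminating and no children remain. */
static bool ex_on_route_lost(ex_daemon_t *d)
{
    if (d->num_children > 0) {
        d->num_children--;
    }
    return d->terminating && 0 == d->num_children;
}
```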
Ralph Castain
1aa1c2e9b4
Get the slurm pmi support working. Cannot use infiniband, of course, as the oob can't make the connection - may try other existing methods. Modex may not quite be working right yet
...
as odin was having trouble making TCP connections, but at least the configure now works so things build, so save that for now
This commit was SVN r25247.
2011-10-10 21:39:10 +00:00
Swen Boehm
08b4322a1a
patched the lex files to not issue the following compiler warning:
...
'yyunput' defined but not used
This commit was SVN r25246.
2011-10-10 18:13:04 +00:00
Ralph Castain
f1a3a35fcd
Cannot rely on detection of connection terminations for deciding when to exit as they don't always go away immediately. There is no info coming back anyway, so it's okay to just exit once the relay has been sent. The relay is sent via a blocking API, so just go ahead and quit.
...
This commit was SVN r25245.
2011-10-10 16:38:46 +00:00
George Bosilca
649af6c925
Enumerated mixed with another type (int) is tolerated but
...
easily fixable.
This commit was SVN r25241.
2011-10-09 03:54:52 +00:00
Terry Dontje
c6691b4122
Clean up local procs when an abort or abort signal occurs
...
This commit was SVN r25237.
2011-10-06 19:19:55 +00:00