Commit graph

2548 commits

Author SHA1 Message Date
George Bosilca
7a238933b6 Silence a compiler warning.
This commit was SVN r25543.
2011-11-29 20:53:08 +00:00
Terry Dontje
b1bb339d23 fix r25507 rationalization of rsh support by removing include of plm_base_rsh_support.h from tm module
This commit was SVN r25519.

The following SVN revision numbers were found above:
  r25507 --> open-mpi/ompi@b475421c16
2011-11-29 11:49:41 +00:00
Ralph Castain
0d55a3d739 Missed one spot...
This commit was SVN r25517.
2011-11-28 22:30:53 +00:00
Ralph Castain
237c79b6d7 Fix daemon collectives - missed the one spot where returning orte_routed_tree_t was required. Sigh. Change the routed components to return that type on the list of children when get_routing_tree is called.
This commit was SVN r25516.
2011-11-28 22:24:49 +00:00
Ralph Castain
89e5bd27a2 Fix copyright date
This commit was SVN r25512.
2011-11-28 15:54:04 +00:00
Ralph Castain
70ab8422b1 Per the internal comments, the delay between ssh invocations is not there for debugging purposes, but rather to allow for NIS authentication times. We have seen that problem in the past, so don't just do the delay when we are debugging - use the delay for the intended purpose. Also, allow for shorter than second-level delays as it doesn't always have to be so long.
This commit was SVN r25510.
2011-11-27 01:49:42 +00:00
Ralph Castain
b173316b74 Don't induce a delay between spawns unless specifically asked to do so
This commit was SVN r25509.
2011-11-26 16:50:31 +00:00
Ralph Castain
b475421c16 As promised, rationalize the rsh support. Remove rshbase and the base rsh support, centralizing all rsh support into the rsh component. Remove the "slave" launch support as that experiment is complete. Fix tree spawn and make that the default method for rsh launch, turning it "off" for qrsh as that system does not support tree spawn.
This commit was SVN r25507.
2011-11-26 02:33:05 +00:00
Ralph Castain
9b59d8de6f This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build.
Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations.

Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way.

This commit was SVN r25497.
2011-11-22 21:24:35 +00:00
George Bosilca
1000af1c48 No need to abort there; returning an error triggers the
abort at the upper level.

This commit was SVN r25494.
2011-11-18 19:07:26 +00:00
Ralph Castain
866edf6a89 Now that George has found his problem, we no longer need the bozo check. Interesting how these platform-specific issues surface...
This commit was SVN r25493.
2011-11-18 17:43:14 +00:00
George Bosilca
b613c7eacb Fix the issue with the round robin mapper. When mixing
different precisions, one should manually promote the
participants to the expected type. In this particular
example as opal_list_get_size returns an unsigned long,
the computation on the left side is translated to an
unsigned. If the hostfile contains more nodes than what is
required (via the -np), this leads to a gigantic value
for the balance, and breaks the round robin algorithm.

This commit was SVN r25492.
2011-11-18 17:03:35 +00:00
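The promotion problem described in b613c7eacb can be illustrated with a minimal sketch. The function names and the exact expression are hypothetical (the commit message does not show the mapper's real computation); the only assumption carried over is that a signed process count is combined with an unsigned list size.

```c
#include <stddef.h>

/* Hypothetical buggy form: num_procs is converted to the unsigned type of
 * num_nodes before the subtraction, so a hostfile with more nodes than
 * requested processes wraps around to a huge value instead of going
 * negative. */
static size_t balance_unsigned(int num_procs, size_t num_nodes)
{
    return num_procs - num_nodes;
}

/* Hypothetical fixed form: promote both operands to a signed type by hand
 * before subtracting, as the commit message recommends. */
static long balance_signed(int num_procs, size_t num_nodes)
{
    return (long)num_procs - (long)num_nodes;
}
```

With 2 processes and 8 nodes, the signed version yields -6, while the unsigned version wraps around to a value close to the maximum of `size_t`.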
Ralph Castain
1e5e9bde77 Add protection against a bozo case where we could end up in an infinite loop while calculating ranks
This commit was SVN r25491.
2011-11-18 15:35:55 +00:00
George Bosilca
61f273b987 Do not tolerate uninitialized variables.
This commit was SVN r25489.
2011-11-18 10:19:24 +00:00
Ralph Castain
b34acd0476 Grrr....get the correct number too!
This commit was SVN r25478.
2011-11-15 11:11:47 +00:00
Ralph Castain
593fc388a9 Make it a little more obvious as to which nodes are from each topology by labeling them with a letter.
This commit was SVN r25477.
2011-11-15 11:10:39 +00:00
Ralph Castain
6310361532 At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement

The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.

In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:

1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.

2. Both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.

3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.

As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.

This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
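The oversubscription default described in point 1 of 6310361532 can be sketched as a small decision function. All type and function names here are hypothetical illustrations, not ORTE identifiers, and the override is modeled as a simple tri-state because the commit message does not name the actual flags.

```c
#include <stdbool.h>

/* Hypothetical labels for where the node list came from. */
typedef enum {
    ALLOC_HOSTFILE,  /* nodes given via a hostfile or -host */
    ALLOC_RM         /* nodes allocated by a resource manager */
} alloc_source_t;

/* Hypothetical tri-state for a user-supplied override flag. */
typedef enum {
    OVERRIDE_NONE,
    OVERRIDE_ALLOW,
    OVERRIDE_FORBID
} oversub_override_t;

/* An explicit override wins; otherwise hostfile/-host nodes default to
 * allowing oversubscription and RM-allocated nodes default to forbidding it. */
static bool oversubscribe_allowed(alloc_source_t src, oversub_override_t ovr)
{
    if (ovr == OVERRIDE_ALLOW) {
        return true;
    }
    if (ovr == OVERRIDE_FORBID) {
        return false;
    }
    return src == ALLOC_HOSTFILE;
}

/* A node can take one more process if it still has a free slot or if
 * oversubscription is permitted. */
static bool can_place(int procs_on_node, int slots, bool oversub_ok)
{
    return oversub_ok || procs_on_node < slots;
}
```

Keeping the override separate from the allocation source mirrors the statement that the change only affects the default behavior.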
Ralph Castain
c8e105bd8c Remove stale code
This commit was SVN r25475.
2011-11-14 23:39:23 +00:00
Ralph Castain
793f4c688f Extend capability to support heterogeneous clusters with multiple topologies
This commit was SVN r25474.
2011-11-13 23:23:09 +00:00
Ralph Castain
6b5e1b89cf Turn off tree spawn as it doesn't currently work - will fix shortly. Add topology collection
This commit was SVN r25472.
2011-11-11 23:42:36 +00:00
Ralph Castain
d008aeb531 Silence debug
This commit was SVN r25471.
2011-11-11 16:42:45 +00:00
George Bosilca
3d318a4c26 Put the interface of our MPIR support in sync with the document accepted by the MPI
Forum (http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf).

This commit was SVN r25456.
2011-11-08 01:24:16 +00:00
George Bosilca
85a18dab74 MPIR_partial_attach_ok is not a volatile, but a constant.
This commit was SVN r25455.
2011-11-08 01:00:38 +00:00
Ralph Castain
a3ce355a60 Revert r25453 and r25450 until we can fix the libevent2013 configure code - still not getting the includedir to eval correctly.
This commit was SVN r25454.

The following SVN revision numbers were found above:
  r25450 --> open-mpi/ompi@7f7d5c4f1f
  r25453 --> open-mpi/ompi@c9fe8c32e2
2011-11-07 16:23:44 +00:00
Samuel Gutierrez
e03bc93fb7 only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code.
This commit was SVN r25447.
2011-11-06 17:28:40 +00:00
Ralph Castain
fcee46b063 Add an option for printing a diffable process map for testing mappers
This commit was SVN r25428.
2011-11-03 14:22:07 +00:00
Samuel Gutierrez
3fe7b3ee54 add PMI support to ess alps module. xt system guys: please yell at me if i missed something in cnos.
This commit was SVN r25423.
2011-11-03 04:04:32 +00:00
Samuel Gutierrez
27b9bcfafd update ess alps configuration file to include CNOS and PMI checks. some of the features committed here aren't being used, but they will be. also update orte_check_pmi.m4 to include missing call to action-if-not-found if --with-pmi is not specified or is disabled.
This commit was SVN r25422.
2011-11-03 02:14:47 +00:00
Jeff Squyres
7f6f7bd0eb Remove this component; Twitter long ago switched to OAuth
authentication, and no one has ever updated this component to match.
It can be revived out of history if anyone cares.

This commit was SVN r25421.
2011-11-02 21:04:49 +00:00
Ralph Castain
b2e2d24726 As in the rsh module, report failed daemons to the errmgr for proper cleanup
This commit was SVN r25419.
2011-11-02 18:30:22 +00:00
Ralph Castain
3e4165fd8d Cleanup includes
This commit was SVN r25418.
2011-11-02 18:28:28 +00:00
Ralph Castain
b77552c45d Cleanup some include files, return a silent error in open/select as the complaining component already output a message
This commit was SVN r25416.
2011-11-02 17:42:06 +00:00
Ralph Castain
55b996678e Minor indentation changes
This commit was SVN r25414.
2011-11-02 15:56:56 +00:00
Ralph Castain
f00753881e Handle the case where mpirun -is- of the same topology as the compute nodes.
This commit was SVN r25412.
2011-11-01 22:26:03 +00:00
Ralph Castain
d28dd55d33 Minimize the amount of topology info returned by the daemons. Most clusters, especially at scale, use the same node topology on every node, so there is no reason to return the topology from every daemon. Borrow a page from the --hetero-apps page and let users indicate that the node topology differs by adding a --hetero-nodes option to mpirun. If the option is set, then every daemon returns topology info. If not set, then only daemon vpid=1 returns it.

We always want one daemon to return the topology as the head node is often different from the compute nodes. Having one daemon return the compute node topology allows us to detect any such difference. All compute nodes are then set to the same topology.

This commit was SVN r25408.
2011-11-01 18:43:10 +00:00
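The reporting rule from d28dd55d33 reduces to a one-line predicate. This is a sketch only: the function name is hypothetical, and it covers just the launched daemons the commit message talks about.

```c
#include <stdbool.h>

/* Hypothetical helper: decide whether a daemon should send its node
 * topology back. With --hetero-nodes set, every daemon reports; otherwise
 * only the daemon with vpid 1 does, and its topology is applied to all
 * compute nodes. */
static bool daemon_reports_topology(unsigned vpid, bool hetero_nodes)
{
    return hetero_nodes || vpid == 1;
}
```

In the homogeneous case this cuts the topology traffic from one message per daemon to a single message, regardless of cluster size.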
Ralph Castain
14966e0f8f Cleanup PMI startup - if a component isn't selected, it should finalize PMI IFF it started it. Otherwise, components that aren't selected can finalize PMI when it is in use by other parts of the system.
This commit was SVN r25407.
2011-11-01 16:25:12 +00:00
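The "finalize PMI only if you started it" rule from 14966e0f8f can be sketched with an ownership flag per component. Everything below is hypothetical: the `stub_pmi_*` functions stand in for a real PMI library so the example is self-contained, and `component_t` is not an ORTE type.

```c
#include <stdbool.h>

/* Stub PMI state, standing in for a real PMI library (assumption). */
static bool stub_pmi_running = false;
static int stub_pmi_finalize_calls = 0;

static void stub_pmi_init(void)
{
    stub_pmi_running = true;
}

static void stub_pmi_finalize(void)
{
    stub_pmi_running = false;
    stub_pmi_finalize_calls++;
}

/* Hypothetical component record: remembers whether it started PMI. */
typedef struct {
    bool started_pmi;
} component_t;

/* On open, start PMI only if nobody else has, and record ownership. */
static void component_open(component_t *c)
{
    c->started_pmi = false;
    if (!stub_pmi_running) {
        stub_pmi_init();
        c->started_pmi = true;
    }
}

/* When a component is not selected, it finalizes PMI if and only if it was
 * the one that started it, so PMI stays up for other users. */
static void component_not_selected(component_t *c)
{
    if (c->started_pmi) {
        stub_pmi_finalize();
        c->started_pmi = false;
    }
}
```

If two components open in sequence and the second is rejected, PMI stays up because the second never owned it; only rejecting the first one shuts PMI down.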
Ralph Castain
d492b20975 Bozo check for topology info
This commit was SVN r25398.
2011-10-30 11:49:38 +00:00
Ralph Castain
4232115a98 Ensure pruning remains within the current job/app being mapped.
This commit was SVN r25397.
2011-10-30 00:02:20 +00:00
Ralph Castain
648c85b41b Add a simple pattern mapper as an example of how to use the topology info to create desired mappings. Let the user specify a pattern based on resource types, and map that pattern across all available nodes as resources permit.
Don't automatically display the topology for each node when --display-devel-map is set as it can overwhelm the reader. Use a separate flag --display-topo to get it.

This commit was SVN r25396.
2011-10-29 15:12:45 +00:00
Ralph Castain
12a589130a Add some debug
This commit was SVN r25395.
2011-10-29 15:07:58 +00:00
Ralph Castain
965b04d1a5 Use the new utilities to get a topology that reflects available cpus
This commit was SVN r25394.
2011-10-29 15:07:36 +00:00
Ralph Castain
e50bcbf028 Add the ability to specify a topology-containing xml file to describe the simulated nodes to support mapping tests against arbitrary topologies
This commit was SVN r25388.
2011-10-29 02:01:11 +00:00
Ralph Castain
7fa5f82d70 Add simulator component to support testing of large scale mapping methods. Automatically sets do-not-resolve and do-not-launch, and creates however many nodes the user wants to simulate in the system.
This commit was SVN r25386.
2011-10-28 23:48:53 +00:00
Ralph Castain
e2eb8d5f78 Remove bad param registration - that param was already registered as an int_name in another location.
This commit was SVN r25381.
2011-10-28 19:14:43 +00:00
Josh Hursey
6726590b1c Remove the 'ess_node_rank' accessor from here. This caused running under 'tm' to segv at the orteds.
It just looks like this part of the component was not updated during r25331. It was removed from the 'env' and 'slurm' environments in that patch. It looks like 'tm' was updated, but did not get this particular piece.

This commit was SVN r25380.

The following SVN revision numbers were found above:
  r25331 --> open-mpi/ompi@b44f8d4b28
2011-10-28 17:41:35 +00:00
Josh Hursey
59ff1dbbfb Fix indentation problem that caused a segv when running without regex.
This was introduced in r25063.

This commit was SVN r25379.

The following SVN revision numbers were found above:
  r25063 --> open-mpi/ompi@e58623cd5b
2011-10-28 13:39:32 +00:00
Samuel Gutierrez
922e41a318 fix typo. use PMI_Initialized for init status instead of PMI_Init.
This commit was SVN r25377.
2011-10-27 22:27:30 +00:00
Ralph Castain
951d72692c Reverse the #if direction so we report daemon failure to the errmgr - otherwise, we just hang if a daemon fails to start.
Reviewed with Josh.

This commit was SVN r25366.
2011-10-25 19:09:52 +00:00
Ralph Castain
c55cba55a7 Totally trivial spelling fix
This commit was SVN r25361.
2011-10-24 14:06:33 +00:00
Ralph Castain
955d8e7d46 Allow apps to use pmi when launched by mpirun, if desired, without affecting daemons
This commit was SVN r25359.
2011-10-23 15:57:13 +00:00