1
1
Граф коммитов

2902 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
b5e26a90ea Singletons should insist that the spawned orted HNP use the novm state machine so that any subsequent comm_spawn occurs only on required nodes
This commit was SVN r27220.
2012-09-04 01:09:01 +00:00
Shiqing Fan
ddbd542732 Remove one .windows file.
Add a macro definition for isblank function.

This commit was SVN r27217.
2012-09-03 09:51:44 +00:00
Ralph Castain
66c3f5d18d When getting target nodes for mapping, there is a difference between not finding any nodes that match the required constraints (either in hostfile or dash-host filtering) and finding at least one such node, but all its slots are busy. Make the return code reflect this difference so the caller can take appropriate action.
This commit was SVN r27213.
2012-09-01 10:30:40 +00:00
Ralph Castain
95019cc310 Fix a few places where we weren't completely identifying hostfile-based operations against "localhost" entries. Tell the mapper base to be silent when we don't want errors announced because nodes aren't available for mapping (something it is okay if they are fully used). Fix an infinite loop in the file prepositioning code.
This commit was SVN r27210.
2012-08-31 21:28:49 +00:00
Jeff Squyres
da00d281e6 Oops -- we want the priority to be low, not high (for now). :-)
This commit was SVN r27208.
2012-08-31 21:08:35 +00:00
Jeff Squyres
fcc1c7e33c = Overview =
First revision of the Locatation Aware Mapping Algorithm (LAMA) RMAPS
component.  This component is used to effect many different types of
regular of process/processor affinity patterns.  Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.  

Inspiried by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.).  The LAMA core algorithm is described here:

  http://www.open-mpi.org/papers/cluster-2011-lama/

= LAMA Usage Levels =

LAMA allows specifying affinity multiple different ways:

 1. None: Speciying no affinity options to mpirun results in exactly
    the same behavior as today: no affinity is used.
 1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-to
    <LEVEL>" to indicate how "wide" each process should be bound
    (i.e., bind to a processor core, or to a processor socket, etc.)
    and how to lay out the processes (i.e., round robin by cores,
    sockets, etc.).  
 1. Expert: Using four new MCA parameters to effect process mapping
    and binding to processors.  These options are a bit complex, and
    are not for the faint at heart, but offer a high degree of
    (regular pattern) flexibility (each of these are described more
    fully below): 
    * rmaps_lama_map: a sequence of characters describing how to lay
      out processes
    * rmaps_lama_bind: a sequence of characters describing the
      resources to bind to each process
    * rmaps_lama_mppr: a sequence of characters describing the maximum
      number of processes to allow per resource (i.e., a specific
      definition of "oversubscription")
    * rmaps_lama_ordering: once all processes are in place, how to
      order the ranks in MPI_COMM_WORLD

We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.  

The Expert level was designed for two purposes:

 1. To provide a precise definition for the "Simple" level (i.e.,
 every
    --bind-to/--map-by option in the "Simple" level has a
 corresponding
    precise specification in the "Expert" level)
 1. As modern computing platforms become more complex, we simply
    cannot predict what application developers will need in terms of
    processor affinity.  LAMA is an attempt to provide a highly
    flexible mechanism that allows applications to utilize a variety
    of complex, unique affinity patterns beyond the common "bind to
    core" and "bind to socket" patterns.

= LAMA Simple Level =

The "Simple" level is pretty much the same as what Open MPI has
offered for years.  It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.

Specifically, the following options are available for both --bind-to
and --map-by:

 * slot
 * hwthread
 * core
 * l1cache
 * l2cache
 * l3cache
 * socket
 * numa
 * board
 * node

= LAMA Expert Level =

The "Expert" level requires some explanation.  I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek.  It is
flexible, but complex.  '''Most users won't need the Expert level.'''

LAMA works in three phases: mapping, binding, and ordering.  Each is
described below.

== Expert: Mapping ==

Processes are paired with sets of resources.  For example, each
process may be paired with a single processor core.  Or each process
may be paired with an entire processor socket.  LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits.  More on MPPR, below.

Mapping can be performed across multiple hardware levels:

 * h: Hardware thread
 * c: Processor core
 * s: Processor socket
 * L1: L1 cache
 * L2: L2 cache
 * L3: L3 cache
 * N: NUMA node
 * b: Processor board
 * n: Server node

If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.

But to effect different process process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.''  That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as an
nine-dimensional space, you can imagine nine nested loops to traverse
all of them.  And you can imagine that changing the order of nesting
would change the traversal pattern.

LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.

For example, consider a "simple" traversal: csL1L2L3Nbnh.  Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.

Wait... what?  That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads.  Why are they
tacked on to the end?

In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core).  When there are no more resources indicated by that
token, it goes on to the next token.

Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.

Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).  

If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core.  And the next
process will be mapped to the second hyperthread on the second core.
And so on.

Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.

As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).

The sequence of letter can be combined in many, many different ways to
produce many different regular mapping patterns.

=== Max Processes Per Resource (MPPR) ===

The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource.  In effect, it
defines the concept of "oversubscription."  Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.

This conventional defintion is expressed in a MPPR string of "1:c"
(one process per core).  

But what if your MPI processes are multi-threaded, and they need
multiple processes per core?  You'd need a different description of
"oversubscription" in this case.  Perhaps you want to have one MPI
process per socket.  This would be expressed in a MPPR string of
"1:s".

The general form of an individual MPPR specification is an integer
follow by a colon, followed by any of the tokens from mapping can be
used in the MPPR specification.  For example "1:c" is pronounced "one
process per core."

Multiple MPPR specifications can be strung together into a
comma-delimited list, too.  All of these MPPR values and then taken
into account when mapping.  Here's some examples:

 * 1:c -- allow, at most, one process per processor core (i.e., don't
   schedule by hyperthread)
 * 1:s -- allow, at most, one process per processor socket (e.g., 
   that process may be multithreaded, or wants exclusive use of the
   socket's caches)
 * 1:s,2:n -- only allow one process per processor socket, but, at
   most, two processes per server node (e.g., if the two MPI processes
   will consume all the RAM on the server node, even if there are more
   processor cores available)

If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed.  If --oversubscribe was specified
on the mpirun command line, the job continues.  Otherwise, LAMA will
abort the job.

Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string untill all processes have been
mapped.

== Expert: Binding == 

Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources.  For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.

To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.  

With binding, however, processes are bound to a set of hardware
threads.  The number of threads to which the process is bound is
sometimes referred to as the "binding width".  For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.

(note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket.  BIOS ordering of hardware threads can be
wonky; so we only refer to "sets of hardware threads")

Bindings are expressed as an integer and a token from the mapping
string.  For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).

Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).

== Expert: Ordering ==

Finally, processes are assigned a rank in MPI_COMM_WORLD.  LAMA
currently offers two ordering modes: sequential or natural:

 * Sequential: if you laid out all the hardware resources in a single
   line, and then overlaid all the MPI processes on top of them, they
   are ordered from 0 to (N-1) from left-to-right.
 * Natural: the ordering of ranks follows the mapping ordering.  For
   example, consider a server node with two processor sockets, each
   containing four cores.  The command line "mpirun -np 8 --bind-to
   core --map-by socket --order n a.out" would result in MCW ranks
   that look like this: [0 2 4 6] [1 3 5 7].

= Execution =

At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered.  It now starts its execution.

= Final Notes =

Note that at this point, lama is not the default mapper.  It must be
activiated with "--mca rmaps lama".  We'll continue to do further
testing and comparitive analysis with the current set of ORTE mappers.

Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets).  For lack of a longer
explanation (this commit message already long enough!), LAMA considers
each server node individually during mapping and binding.

See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/

This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
Jeff Squyres
ca5bd85364 Add some help messages to the RAS simulator for PEBKAC issues.
This commit was SVN r27203.
2012-08-31 16:33:29 +00:00
Ralph Castain
38ce23db43 Add some protection to allow NULL bytes in byte objects and NULL strings to be handled cleanly in nidmaps and modex entries. Ensure there is a valid nidmap available for the HNP to pass down to any local procs when it is operating alone.
This commit was SVN r27188.
2012-08-31 01:07:36 +00:00
Ralph Castain
efc4a40c8a It is okay for a key not to be found
This commit was SVN r27187.
2012-08-30 15:12:23 +00:00
Ralph Castain
6dbb7a8493 Just because a peer didn't post a particular pmi key, that doesn't mean it is an immediate irrecoverable error - could be they just don't have a matching interface. Let the upper layer decide what to do about it.
This commit was SVN r27186.
2012-08-30 14:12:09 +00:00
Ralph Castain
1b659de132 Get staged execution working on multi-node setups. Improve efficiency by only remapping if all procs not yet mapped in the job.
This commit was SVN r27181.
2012-08-29 20:35:52 +00:00
Ralph Castain
a3b08f5800 Fix a few things relating to comm_spawn that causes new daemons to be launched. Ensure that all new daemons receive a full pidmap. Properly mark the daemon job as "updated" when daemons are added
This commit was SVN r27177.
2012-08-29 03:11:37 +00:00
Ralph Castain
f0077820f2 Silence warning
This commit was SVN r27175.
2012-08-28 22:27:41 +00:00
Ralph Castain
a414ffdf4c Remove debug
This commit was SVN r27174.
2012-08-28 22:18:00 +00:00
Ralph Castain
30fd9d7abc MPI procs should definitely not be trapping SIGCHLD - only ORTE tools need to do so
This commit was SVN r27166.
2012-08-28 21:39:06 +00:00
Ralph Castain
98580c117b Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine.
Remove some stale configure.m4's we no longer need.

Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.

This commit was SVN r27165.
2012-08-28 21:20:17 +00:00
Ralph Castain
aadfe1b61e Fix a missing test that breaks novm operation.
CMR:v1.7

This commit was SVN r27163.
2012-08-28 21:13:57 +00:00
Ralph Castain
d310dd8c58 Fix a strange race condition by creating a separate buffer for each send - apparently, just a retain isn't enough protection on some systems
This commit was SVN r27161.
2012-08-28 17:17:34 +00:00
Ralph Castain
11c68e2299 Correct the count in the pmi key
This commit was SVN r27156.
2012-08-28 15:05:02 +00:00
Ralph Castain
6e8c97c77c Per Sam's eagle-eyed review, free the malloc'd memory if getcwd fails for some strange reason.
This commit was SVN r27150.
2012-08-27 19:15:16 +00:00
Ralph Castain
bccc20d13e Deal with one last corner case of positioning a dot-file
This commit was SVN r27144.
2012-08-26 03:49:31 +00:00
Ralph Castain
63d41c643d Minor cleanup
This commit was SVN r27143.
2012-08-25 14:24:45 +00:00
Ralph Castain
05f0b4c653 Couple of minor cleanups
This commit was SVN r27135.
2012-08-24 21:14:40 +00:00
Ralph Castain
d6cbff6d4e Since the preload flags are at the app_context level, we need to link only those files/exe's that pertain to each app_context to the corresponding procs. Also, gain a little optimization by checking to ensure we only send files once - this probably won't work when daemons are created on-the-fly, but that's for some other day
This commit was SVN r27134.
2012-08-24 16:16:30 +00:00
Jeff Squyres
20612c4194 Don't close the IOF stdin if we happen to read less than a full
buffer's worth of data -- interactive stdin will have that behavior
frequently. 

This commit was SVN r27131.
2012-08-24 14:29:19 +00:00
Ralph Castain
e0c39c94e8 Complete the cleanup of the preload files system. Remove the dest_dir option as moving things to arbitrary locations - especially absolute paths - can prove disastrous. Remove the preload_libs option as these can be treated as just files. Cleanup some of the pack/unpack code as the dss handles NULL strings just fine. Deal a little better with absolute paths, noting that tar now strips the leading '/' for us (showing my age as it didn't used to do so).
Remove the odls_base_state.c file as that code is now covered by the new broadcast form of preload_files.

This commit was SVN r27127.
2012-08-24 02:28:29 +00:00
Ralph Castain
b4a544ad2a Per discussion with Josh, use the --preload-xxx cmd line options to broadcast files to all nodes. Add --set-cwd-to-session-dir option to start procs in their session directories. Add OMPI_FILE_LOCATION envar to tell procs where their prepositioned files went.
This commit was SVN r27125.
2012-08-23 21:28:05 +00:00
Ralph Castain
855c9ae6cf Support archives .tar, .bz[2,zip], and .gz[ip]
This commit was SVN r27123.
2012-08-23 15:38:39 +00:00
Ralph Castain
286c610712 Protect us against the scenario where filem is included in enable-mca-no-build
This commit was SVN r27122.
2012-08-23 13:52:06 +00:00
Shiqing Fan
d141d94bd7 Include the new .windows files into the tarball.
This commit was SVN r27121.
2012-08-23 12:50:51 +00:00
Ralph Castain
7237a938bf Extend the filem interface to support prepositioning and linking required local files for execution. Create a new "raw" module that uses xcast to send the files to all nodes as this is faster than doing an scp in a linear pattern
This commit was SVN r27118.
2012-08-22 21:43:20 +00:00
Ralph Castain
ed4b354846 Ensure we pass along user-specified mca params from the cmd line when doing a tree spawn, but don't extend the cmd line with duplicates or things that shouldn't be there
This commit was SVN r27117.
2012-08-22 21:41:50 +00:00
Ralph Castain
5d7872fd68 Cleanup the tag list
This commit was SVN r27115.
2012-08-22 21:37:58 +00:00
Ralph Castain
7bcf2f8b5c Stop leaving droppings behind us
This commit was SVN r27111.
2012-08-22 17:39:22 +00:00
Shiqing Fan
95b9552546 include several components for Windows build.
This commit was SVN r27108.
2012-08-22 14:46:49 +00:00
Jeff Squyres
c8cee23ee7 Priorities really shouldn't be less than 0.
This commit was SVN r27098.
2012-08-21 15:47:15 +00:00
Ralph Castain
dacb07000d Turn udcm and ud oob off by default, but allow them to build and be used if someone wants to test them
cmr:v1.7

This commit was SVN r27097.
2012-08-21 15:18:34 +00:00
Ralph Castain
64cf75cec5 Add some debug
This commit was SVN r27087.
2012-08-17 02:19:26 +00:00
Ralph Castain
35fef87202 Make the "no virtual machine" selection more intuitive by providing a --novm option to mpirun.
This commit was SVN r27048.
2012-08-15 14:55:03 +00:00
Nathan Hjelm
8e03f77004 update alps configure scripts
This commit was SVN r27026.
2012-08-13 22:57:55 +00:00
Ralph Castain
589acf550c Improve the new MPI_INFO_ENV to better handle Java applications and to correctly report the info for singletons.
This commit was SVN r27025.
2012-08-13 22:13:49 +00:00
Ralph Castain
49a757e0bd Silly me - now that all daemons are stripping their prefix on the backend, we no longer need to do it as they report
This commit was SVN r27023.
2012-08-13 20:48:13 +00:00
Ralph Castain
b9b41d8662 For cases where the alpha+non-zero prefix must be removed from a node name, be sure to do it everywhere we access node names - otherwise, modex methods such as pmi will fail to correctly identify procs on the same node
This commit was SVN r27022.
2012-08-13 20:44:56 +00:00
Ralph Castain
cb48fd52d4 Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions:
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context

Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error is someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.

This commit was SVN r27005.
2012-08-12 01:28:23 +00:00
Ralph Castain
c4ee297a60 Cleanup the pmi grpcomm module so it passes non-btl modex data correctly.
This commit was SVN r26992.
2012-08-10 20:35:50 +00:00
Ralph Castain
e3e9b7345d First cut at updating the ccp launcher to use the state machine
This commit was SVN r26986.
2012-08-10 17:09:33 +00:00
Shiqing Fan
e304c19920 This is also used on Windows.
This commit was SVN r26975.
2012-08-08 16:44:00 +00:00
George Bosilca
ba879c2c51 Remove the unused map.
This commit was SVN r26960.
2012-08-07 12:06:13 +00:00
Shiqing Fan
2f442799f8 fix several typecasts
This commit was SVN r26957.
2012-08-07 10:41:53 +00:00
Ralph Castain
53b1a1c976 Cleanly error out when someone asks to map-to <object> if that object doesn't exist on a node.
This commit was SVN r26950.
2012-08-04 21:52:36 +00:00
Ralph Castain
61b09a132b Fix bynode mapping of multiple app-contexts
This commit was SVN r26949.
2012-08-03 21:45:40 +00:00
Ralph Castain
96f6f94c24 Ensure we don't get trapped in an infinite loop when ranking bynode if something isn't right
This commit was SVN r26948.
2012-08-03 21:45:10 +00:00
Ralph Castain
0d878937fe If a callback is set in the state machine, and the state doesn't yet exist, create it
This commit was SVN r26947.
2012-08-03 21:43:36 +00:00
Ralph Castain
431d5361ed For those who really preferred our prior mode of operation that mapped procs and only launched daemons on the nodes that had procs on them, introduce the "novm" state machine component. This recreates the old mode of operation by re-ordering the launch sequence so that we allocate, then map, and then launch daemons only on the reqd nodes (instead of across the entire allocation).
This commit was SVN r26946.
2012-08-03 16:30:05 +00:00
Ralph Castain
dc22ea5cde A little cleaner on the message about repeated ctrl-c, and re-enable the event so we can abort if we see multiple ctrl-c's that don't meet the time requirement
This commit was SVN r26945.
2012-08-03 01:26:18 +00:00
Ralph Castain
e6c72bfd53 Ensure we can forcibly exit even when we are stuck inside of an event by replacing the libevent signal handler with a POSIX one that (a) attempts to trip a libevent termination event and (b) if anothe ctrl-c hits within 5 seconds, just calls exit.
This commit was SVN r26943.
2012-08-02 21:15:35 +00:00
Ralph Castain
d818c9d407 Includes a patch from Jeff and Josh: update the simulator module to allow specification of multiple slot and max_slot counts for each node group (but don't require it). Remove the requirement that each node group provide its own topology. Adjust verbosities to allow showing some light debug output to see what nodes have been added without getting a bunch of other stuff.
This commit was SVN r26936.
2012-08-02 04:57:13 +00:00
Jeff Squyres
62c2ff7ee7 It's actually ''not'' an error to exit if all routes and children are
gone.  So exit with 0, not ORTE_ERROR_DEFAULT_EXIT_CODE (which is 1).

This fixes a race condition in the rsh launcher upon termination,
where ORTE would sometimes think that a daemon failed to launch.

This commit was SVN r26935.
2012-08-01 19:49:19 +00:00
Nathan Hjelm
4557e15c18 oob/ud fix compile error
This commit was SVN r26933.
2012-07-31 21:50:34 +00:00
Ralph Castain
6ee35e4977 Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well
This commit was SVN r26931.
2012-07-31 21:21:50 +00:00
Jeff Squyres
88cbe9c780 .ompi_ignore this component until it can be fixed.
This commit was SVN r26930.
2012-07-31 21:02:06 +00:00
Nathan Hjelm
980692804d oob/ud: don't start listening for ud requests unless we have one usable port
This commit was SVN r26929.
2012-07-31 19:00:18 +00:00
Ralph Castain
23c2a315a9 Add missing line to set flag indicating at least one port found
This commit was SVN r26914.
2012-07-30 17:54:38 +00:00
Ralph Castain
6285f7d8c0 Per request of Shiqing, restore the ccp components
This commit was SVN r26904.
2012-07-29 23:49:59 +00:00
Ralph Castain
94d11e04fd Add an intermediate state when the VM is ready so that third party tools can take action prior to mapping/launching apps
This commit was SVN r26902.
2012-07-28 15:33:09 +00:00
Ralph Castain
8bc6694a62 Ensure the daemons don't incorrectly declare a failed launch
This commit was SVN r26875.
2012-07-26 19:05:06 +00:00
Ralph Castain
07846f12ae Reconnect the rsh/ssh error reporting code for remote spawns to report failure to launch. Ensure the HNP correctly reports non-zero exit status when ssh encounters a problem.
Thanks to Terry for spotting it!

This commit was SVN r26868.
2012-07-25 21:46:45 +00:00
Jeff Squyres
e5cfad0c1a This variable is only used in FT builds.
This commit was SVN r26854.
2012-07-24 12:48:47 +00:00
Shiqing Fan
12d99a9ebb Update the hwloc build on Windows and related files.
This commit was SVN r26818.
2012-07-20 12:14:28 +00:00
Abhishek Kulkarni
1ce378b5c6 Make C/R work with nodes > 1. This fix makes sure that the app coordinators send
the "ready-to-checkpoint" signal to the global coordinator only after ORTE has
initialized.

This commit was SVN r26795.
2012-07-13 23:37:29 +00:00
Abhishek Kulkarni
1878f276cd Replace the pattern while(flag) { opal_progress() }; in the C/R code
with the ORTE_WAIT_FOR_COMPLETION macro.

This commit was SVN r26794.
2012-07-13 23:31:56 +00:00
George Bosilca
772ec212eb Fix another compiler warning.
This commit was SVN r26775.
2012-07-10 15:57:42 +00:00
Abhishek Kulkarni
eec5a28aa4 More C/R fixes.
* Fix a typo introduced by the removal of the notifier framework
* Fix to flush the modex cached data correctly using the orte DB API.

This commit was SVN r26773.
2012-07-10 01:19:46 +00:00
Abhishek Kulkarni
5c58a1c9c1 Fix C/R support in the trunk.
Among other things, this patch deals with the following issues:
* fix ompi-checkpoint argument parsing
* ompi-restart -showme prints an extraneous "Restarted child with PID" 
  message. Move around the debug statement to avoid this.
* fixes for the state machine changes

This commit was SVN r26770.
2012-07-09 23:34:13 +00:00
George Bosilca
ec760454a6 Cleaning ...
This commit was SVN r26747.
2012-07-04 21:22:13 +00:00
Ralph Castain
6ae5776904 Cleanup IPV6 build
This commit was SVN r26738.
2012-07-04 00:03:50 +00:00
Ralph Castain
1a90471374 Drat - missed the other one
This commit was SVN r26718.
2012-07-02 22:18:31 +00:00
Ralph Castain
9a6a969f60 Remove debug
This commit was SVN r26717.
2012-07-02 22:18:08 +00:00
Ralph Castain
b83fc41d54 Add a state that allows mpirun or other tools to be notified of a job completion prior to terminating so that alternative actions can be performed.
This commit was SVN r26716.
2012-07-02 22:16:32 +00:00
Ralph Castain
8bebf2fa47 Ensure we don't build the MR iof components unless hadoop support is enabled
This commit was SVN r26694.
2012-06-28 18:20:15 +00:00
Ralph Castain
9aa821d8b4 Add missing file to tarball
This commit was SVN r26688.
2012-06-28 02:57:10 +00:00
Ralph Castain
0dfe29b1a6 Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required.
Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework.

Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one.

This commit was SVN r26678.
2012-06-27 14:53:55 +00:00
Brian Barrett
b22faedd9d Remove the Portals4 SHMEM reference implementation runtime support, as we're
no longer using the runtime provided by the reference implementation.

Remove the Catamount support from ORTE, since we're no longer supporting
Catamount.  Left the Catamount timer component, because I'm not sure whether
it's used on the XTs running CNL.

This commit was SVN r26677.
2012-06-27 14:17:43 +00:00
Josh Hursey
28681deffa Backout the ORCA commit. :(
There is a linking issue on Mac OSX that needs to be addressed before this is able to come back into the trunk.

This commit was SVN r26676.
2012-06-27 01:28:28 +00:00
Josh Hursey
542330e3a7 Commit of ORCA: Open MPI Runtime Collaborative Abstraction
This is a runtime interposition project that sits between the OMPI and ORTE layers in Open MPI.

The project is described on the wiki:
  https://svn.open-mpi.org/trac/ompi/wiki/Runtime_Interposition

And on this email thread:
  http://www.open-mpi.org/community/lists/devel/2012/06/11109.php

This commit was SVN r26670.
2012-06-26 21:42:16 +00:00
Ralph Castain
a34f09e67a Ensure common port is off when not being used
This commit was SVN r26666.
2012-06-26 16:09:58 +00:00
Ralph Castain
92527da4e3 Remove unused component
This commit was SVN r26660.
2012-06-26 00:49:28 +00:00
Ralph Castain
0103f82918 Turn off the common port for slurm for now
This commit was SVN r26656.
2012-06-25 21:55:51 +00:00
Shiqing Fan
6f746cdb33 remove a unused file.
This commit was SVN r26645.
2012-06-25 10:17:21 +00:00
Jeff Squyres
148ae6d6e3 This commit unifies the configury of some verbs-lovin' components.
* Add new configure command line options and deprecate some old ones:
   * --with-verbs replaces --with-openib
   * --with-verbs-libdir replaces --with-openib-libdir
 * If you specify --with-openib[-libdir] without
   --with-verbs[-libdir], you'll get a "these options have been
   deprecated!" warning, but then they'll act just like
   --with-verbs[--libdir]. 

  '''Sidenote:''' Note that we are not renaming any components at this
  time, nor are we renaming the top-level OMPI_CHECK_OPENIB m4 macro
  (which is pretty strongly tied to the openib BTL and is bastaridzed
  by the ofud BTL).  Note that there will likely be more changes in
  this area coming soon (next week?) when some long-standing changes
  move to the SVN trunk: some openib BTL infrastructure will move to
  ompi/mca/common, and its configury gets split up / refactored.

We extend our philosophy of other --with-<foo> configure options of
--with-verbs to ''all'' verbs-lovin components:

 * If you specify --with-verbs, then all verbs-lovin' components must
   configure successfully (or abort).  This currently means: OOB ud,
   BTL ofud, BTL openib.
 * If you specify --with-verbs=DIR, then all verbs-lovin' component
   must configure successfully (or abort), and will use DIR to find
   verbs headers and libraries.
 * If you specify --without-verbs, then all verbs-lovin' components
   will be ignored.

This commit also fixes a problem where the --with-openib=DIR form
would not use DIR for ''all'' verbs-lovin' components (I think only
BTL openib and BTL ofud used that DIR).  Now all of them do, as does
hwloc (because hwloc has some !OpenFabrics helper functions that
require ibv types from verbs.h).

There's a little new m4 infrastructure worth mentioning:

 * If you create a new verbs-lovin' component (i.e., a component that
   need verbs), your configure.m4 should
   AC_REQUIRE([OPAL_CHECK_VERBS_DIR]). 
 * You can then use three global shell variables: $opal_want_verbs,
   $opal_verbs_dir, $opal_verbs_libdir, which will be set as follows:
   * opal_want_verbs will be "yes" and opal_verbs_dir and
     opal_verbs_libdir will both be set to directory values, '''OR'''
   * opal_want_verbs will be "no" and opal_verbs_dir and
     opal_verbs_libdir will both be set empty

This commit was SVN r26640.
2012-06-22 19:53:56 +00:00
Ralph Castain
e6f3586415 Remove the orte notifier framework, per discussion at the devel meeting and follow-up with Jeff (who took the action item)
This commit was SVN r26637.
2012-06-22 18:09:23 +00:00
Ralph Castain
60758faa55 Fix data type
This commit was SVN r26633.
2012-06-21 23:48:55 +00:00
Ralph Castain
e9591f2563 Fix tree spawn in the rsh/qrsh environment
This commit was SVN r26631.
2012-06-21 21:29:28 +00:00
Ralph Castain
0a713cd27e Add database framework to ORTE and refactor modex code to utilize it. Create the "hash" db component from the prior modex db code. Leave the other components ignored for now - will activate them later.
Modex is still a blocking operation at this point.

This commit was SVN r26618.
2012-06-19 13:38:42 +00:00
Ralph Castain
9e0bb6ae28 Revert r26600 and r26601 for a couple of reasons:
1. they modified the OMPI-ORTE interface, which is something I promised to avoid doing unless absolutely necessary, and

2. the framework ident is already in the component name key provided to the modex db. What is missing is the project ident, but as Jeff and I discussed last week, we really need to add that field to the component struct anyway to avoid multi-project collisions on framework names. That will be done over the next couple of weeks as a separate effort.

This commit was SVN r26613.

The following SVN revision numbers were found above:
  r26600 --> open-mpi/ompi@5ba4deff07
  r26601 --> open-mpi/ompi@0e3094c318
2012-06-16 09:11:03 +00:00
Ralph Castain
3c2a03b16d Update the other routed components to use common ports. Per conversation with Josh, remove the "cm" component.
This commit was SVN r26608.
2012-06-15 15:36:08 +00:00
Ralph Castain
96c778656a Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP.
Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build.

This commit was SVN r26606.
2012-06-15 10:15:07 +00:00
Ralph Castain
0e3094c318 Update the other grpcomm modules to new API
This commit was SVN r26601.
2012-06-14 03:28:48 +00:00
Ralph Castain
5ba4deff07 Extend the modex database to support multiple projects and frameworks that might have duplicate component names. No visible API change in the BTL's as it was executed solely in the ompi modex code.
This commit was SVN r26600.
2012-06-14 02:55:06 +00:00
Ralph Castain
ecc51d8583 Add missing endif
This commit was SVN r26596.
2012-06-12 15:07:09 +00:00
Ralph Castain
078a4667e4 Some more cleanup on direct routed when daemons are involved
This commit was SVN r26594.
2012-06-11 23:46:22 +00:00
Ralph Castain
269cb2b8d9 Some cleanup to remove calls to opal_progress when running with orte progress threads, and to ensure that all orte-related events are in the orte event base.
This commit was SVN r26591.
2012-06-11 19:59:53 +00:00
Ralph Castain
75e66ad51e Restore the direct routed component
This commit was SVN r26590.
2012-06-11 17:16:02 +00:00
Brian Barrett
7406ef1241 Make all the PMI components depend on the common pmi library and properly
install the common pmi library

This commit was SVN r26588.
2012-06-11 15:58:09 +00:00
Ralph Castain
2812579246 Just because we find an IB device does not mean we can get a QP on it. Check to see if we can before we select the UD OOB module for use.
This commit was SVN r26587.
2012-06-10 01:42:51 +00:00
Ralph Castain
0442a807c0 Default the OOB to the "ud" component IFF the HNP finds itself on a node with a supported Infiniband device. Ensure that the daemons all pick the matching component by dictating the selection via mca param on the orted cmd line.
This commit was SVN r26582.
2012-06-08 01:23:08 +00:00
Ralph Castain
05122a2f93 Make debruijn the default routed component. Update the radix component to "short-circuit" the tree when the job size permits
This commit was SVN r26580.
2012-06-08 00:35:36 +00:00
Ralph Castain
ffcca0185a Remove no longer needed component
This commit was SVN r26578.
2012-06-08 00:18:59 +00:00
Ralph Castain
980768965f Remove unused and unsupported component
This commit was SVN r26577.
2012-06-07 23:48:06 +00:00
Ralph Castain
350900f70e Remove unused and unsupported component
This commit was SVN r26576.
2012-06-07 23:47:35 +00:00
Nathan Hjelm
625c8078c3 oob/ud: fix typo
This commit was SVN r26569.
2012-06-07 19:21:23 +00:00
Ralph Castain
7a94a52420 No reason not to build this
This commit was SVN r26568.
2012-06-07 19:11:44 +00:00
Shiqing Fan
2abf783fa0 Remove a unnecessary definition before the real one.
This commit was SVN r26562.
2012-06-06 14:15:39 +00:00
Ralph Castain
166d254d4e Add new routed component
This commit was SVN r26557.
2012-06-06 11:53:12 +00:00
Ralph Castain
d6279fc971 Fix the debugger daemon launch support to fit the new state machine. Treat debugger daemons just like any other job, except that we map them only to nodes where an app process currently exists (as opposed to every node in the system). Trigger breakpoint and rank0 release only after the debugger daemons are in position.
This commit was SVN r26556.
2012-06-06 02:01:23 +00:00
Jeff Squyres
0b8849e2c4 Make "mpirun --report-bindings" have a user-friendly output (i.e.,
readable by normal human beings, vs. having a bitmap of physical
PU's).  Use the new hwloc base prettyprint functions to generate the
output. 

This commit was SVN r26533.
2012-06-01 16:35:31 +00:00
Jeff Squyres
99c5afb397 Remove clang compiler warnings.
This commit was SVN r26523.
2012-05-29 23:36:06 +00:00
Ralph Castain
b0938a254e Dont use mutex where it isn't needed
This commit was SVN r26521.
2012-05-29 20:21:11 +00:00
Ralph Castain
32b66c166b Missed one blasted spot
This commit was SVN r26520.
2012-05-29 20:20:10 +00:00
Ralph Castain
9bedb25dda Cleanup some compiler warnings, some of which are actual logic errors
This commit was SVN r26519.
2012-05-29 20:11:51 +00:00
Ralph Castain
d7ac424d8d Silence optimized build warnings
This commit was SVN r26518.
2012-05-29 19:55:47 +00:00
Ralph Castain
bf5ec1ac0c Silence optimized build warnings
This commit was SVN r26517.
2012-05-29 19:55:31 +00:00
Ralph Castain
be6ed9c2df Allow partial use of allocations by specifying the max number of daemons (i.e., max VM size) for the job
This commit was SVN r26499.
2012-05-27 16:48:19 +00:00
Ralph Castain
7fb49b1559 Silence warning
This commit was SVN r26480.
2012-05-23 13:59:41 +00:00
Ralph Castain
da28a4b0e6 Silence warning
This commit was SVN r26479.
2012-05-23 13:59:22 +00:00
Nathan Hjelm
b9959a95cd ack! one more
This commit was SVN r26472.
2012-05-22 20:52:52 +00:00
Nathan Hjelm
f2d4e95429 doh! add missing include
This commit was SVN r26471.
2012-05-22 20:49:13 +00:00
Nathan Hjelm
cdc3c87ba6 move pmi init/finalize into a common component
This commit was SVN r26470.
2012-05-22 15:15:39 +00:00
Nathan Hjelm
78b8b3cf76 bug fix: actually close ess components
This commit was SVN r26469.
2012-05-22 15:09:18 +00:00
Nathan Hjelm
6eeca66475 add an option to enable static ports. diabled by default
This commit was SVN r26462.
2012-05-21 19:56:15 +00:00
Ralph Castain
83d69b6c95 Enable the ORTE progress thread for apps (not needed in the tools as they already continuously loop in the event lib). This appears to be working, at least for MPI apps that only use shared memory (a simple "hello"). More testing is required to identify where problems will occur - this is only intended to allow further development.
In order to use the progress thread, you must configure with:

--enable-orte-progress-threads --enable-event-thread-support

This commit was SVN r26457.
2012-05-20 15:14:43 +00:00
Ralph Castain
c4f8043064 Per Nathan, with a little cleanup by me: update the PMI support to aggregate modex info, thus reducing the number of keys required so it fits within Cray default constraints
This commit was SVN r26456.
2012-05-19 16:12:52 +00:00
Jeff Squyres
cab31eafce Revert r26413: it was causing too much confusion. When an MPI proc
exits with status 77, the whole job will be killed, but mpirun will
still return an exit status of 77, so MTT will report it as a skip
anyway. 

This commit was SVN r26445.

The following SVN revision numbers were found above:
  r26413 --> open-mpi/ompi@02aa36f2e5
2012-05-16 14:45:58 +00:00
Jeff Squyres
02aa36f2e5 ORTE defaults to killing the entire job when any process exits with a
nonzero status (we polled other MPI implementations since one one in
the OMPI community had a concrete opinion on what behavior to do here
-- all other MPI's seem to adhere to this behavior, too).

This commit adds an MCA parameter that allows us to tell ORTE to
''not'' kill jobs when a process exits with a status of 77, meaning
the GNU testing standard of "this test was skipped".  In all the OMPI
tests, all procs will either return 77 or not.  So if they all return
77, mpirun won't consider it an error, but will still return an exit
status of 77 (so that MTT can know that the test was cleanly skipped).

This commit was SVN r26413.
2012-05-08 21:49:05 +00:00
Ralph Castain
70a106fa71 Fix binding on remote nodes - need to pass the binding bitmap!
This commit was SVN r26403.
2012-05-08 03:52:39 +00:00
Jeff Squyres
2ba10c37fe Per RFC, bring in the following changes:
* Remove paffinity, maffinity, and carto frameworks -- they've been
   wholly replaced by hwloc.
 * Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
 * Update sm, smcuda, wv, and openib components to no longer use carto.
   Instead, use hwloc data.  There are still optimizations possible in
   the sm/smcuda BTLs (i.e., making multiple mpools).  Also, the old
   carto-based code found out how many NUMA nodes were ''available''
   -- not how many were used ''in this job''.  The new hwloc-using
   code computes the same value -- it was not updated to calculate how
   many NUMA nodes are used ''by this job.''
   * Note that I cannot compile the smcuda and wv BTLs -- I ''think''
     they're right, but they need to be verified by their owners.
 * The openib component now does a bunch of stuff to figure out where
   "near" OpenFabrics devices are.  '''THIS IS A CHANGE IN DEFAULT
   BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
   (I do not have a NUMA machine with an OpenFabrics device that is a
   non-uniform distance from multiple different NUMA nodes).
 * Completely rewrite the OMPI_Affinity_str() routine from the
   "affinity" mpiext extension.  This extension now understands
   hyperthreads; the output format of it has changed a bit to reflect
   this new information.
 * Bunches of minor changes around the code base to update names/types
   from maffinity/paffinity-based names to hwloc-based names.
 * Add some helper functions into the hwloc base, mainly having to do
   with the fact that we have the hwloc data reporting ''all''
   topology information, but sometimes you really only want the
   (online | available) data.

This commit was SVN r26391.
2012-05-07 14:52:54 +00:00
Ralph Castain
44b8608f0a Convert debug to verbose
This commit was SVN r26384.
2012-05-05 17:46:10 +00:00
Ralph Castain
96bfeb591c Ensure flag is passed to remote daemons
This commit was SVN r26383.
2012-05-03 22:31:25 +00:00
Ralph Castain
45fee2b491 Resolve the case where only the HNP is in the system (i.e., single-node operation)
This commit was SVN r26382.
2012-05-03 18:00:01 +00:00
Ralph Castain
c352ca36c2 Minor cleanup
This commit was SVN r26381.
2012-05-02 21:23:37 +00:00
Ralph Castain
b2f77bf08f Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications.
Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well.

This commit was SVN r26380.
2012-05-02 21:00:22 +00:00
Ralph Castain
c5da4f24d7 Fix stupid singletons - get the pidmap message correct
This commit was SVN r26378.
2012-05-02 17:48:02 +00:00
Ralph Castain
9f724db182 Remove duplicate event assignment
This commit was SVN r26360.
2012-04-30 16:06:20 +00:00
Ralph Castain
289f9f41ec From long-term discussions, have the daemons use the node_t and proc_t structs and arrays instead of the pidmap and nidmap arrays. Sets the stage for future work.
This commit was SVN r26359.
2012-04-29 00:10:01 +00:00
Ralph Castain
f3e3704c9e Per request from Brian, enable mapping of stddiag output (output from opal_output calls) to stderr of the local process. This allows you to obtain that output in a local window (for example, when using xterm for each process) instead of having it automatically forwarded to mpirun. Turn this on automatically whenever someone uses the -xterm option, and to be set manually using the orte_map_stddiag_to_stderr mca param.
This commit was SVN r26352.
2012-04-27 14:39:34 +00:00
Jeff Squyres
46f47e08b6 Remove typo/extra brackets and parens.
This commit was SVN r26351.
2012-04-27 13:48:43 +00:00
Jeff Squyres
9d0df5a9a6 Update configury in the new oob ud component: actually check to see if
it succeeds and run $1 or $2, accordingly.  This allows "make dist" to
run properly on machines that do not have OpenFabrics stuff installed
(e.g., the nightly tarball build machine).

There's still more to be done here -- it doesn't check for non-uniform
directories where the OpenFabrics headers/libraries might be
installed.  We might need to re-tool/combine
ompi/config/ompi_check_openib.m4 (which checks for way more than
oob/ud needs) and move it up to config/ompi_check_ofa.m4, or
something...?

This commit was SVN r26350.
2012-04-27 11:32:56 +00:00
Jeff Squyres
9829d2279f System-level includes should be at the top of the file, before most
OPAL/ORTE/OMPI includes.

This commit was SVN r26349.
2012-04-27 11:29:22 +00:00
Ralph Castain
38af7db183 Ensure the progress message comes out right away. Otherwise, on a large system where proc state messages are arriving frequently, the message doesn't get printed until the launch is done!
This commit was SVN r26346.
2012-04-26 23:41:03 +00:00
Nathan Hjelm
e1e0d466e5 Merge ssh://ct-fe1/usr/projects/hpctools/hjelmn/ompi-trunk-git into HEAD
This commit was SVN r26344.
2012-04-26 22:06:12 +00:00