config/ directory. We split them apart a while ago in the hopes that
it would simplify things, but it didn't really (e.g., because there
were still some ompi/opal .m4 files in the top-level config/
directory, resulting in developer confusion about where any given m4
macro was defined).
So this commit consolidates them back into the top-level directory for
simplicity.
There are still (at least) two changes that would be nice to make:
1. Split any generated .m4 file (e.g., autogen-generated .m4 files)
into a separate directory somewhere so that a top-level -Iconfig/
will only get our explicitly defined macros, not the autogen stuff
(e.g., with libevent2019 needing to get the visibility macro, but
NOT all the autogen-generated inclusion of component configure.m4
files).
1. Change configure to be of the form:
{{{
# ...a small amount of preamble/setup...
OPAL_SETUP
m4_ifdef([project_orte], [ORTE_SETUP])
m4_ifdef([project_ompi], [OMPI_SETUP])
# ...a small amount of finishing stuff...
}}}
I doubt we'll ever get anything as clean as that, but that would be
the goal to shoot for.
This commit was SVN r27704.
Fix bug reported by FreyGuy19713: in cases where HNP node has multiple entries in a hostfile or other allocation, we need to track the total slots allocated to that node.
This commit was SVN r27673.
The following Trac tickets were found above:
Ticket 3429 --> https://svn.open-mpi.org/trac/ompi/ticket/3429
* Add some comments in the *-wrapper-data-txt.in files just so that
someone doesn't forget in the future why we link in what we do in
the MPI and ORTE wrapper compilers.
* Update ompi_wrapper_script.in to match the new behavior.
* Update orte_wrapper_script.in to support --openmpi:linkall (which
is a no-op in this case)
This commit was SVN r27672.
The following Trac tickets were found above:
Ticket 3422 --> https://svn.open-mpi.org/trac/ompi/ticket/3422
additional functionality. Rationale (refs trac:3422):
* Normal MPI applications only ever use the MPI API. Hence, -lmpi is
sufficient (they'll never directly call ORTE or OPAL
functions). This is arguably the most common case.
* That being said, we do have some test programs (e.g., those in
orte/test/mpi) that call MPI functions but also call ORTE/OPAL
functions. I've also written the occasional MPI test program that
calls opal_output, for example (there even might be a few tests in
the IBM test suite that directly call ORTE/OPAL functions).
* Even though this is not a common case, these applications should
also compile/link with mpicc.
* So we should add a --openmpi:linkall option that will also link
in whatever is necessary to call ORTE/OPAL functions
* Yes, we could hard-code "-lopen-rte -lopen-pal" in Makefiles, but
we do reserve the right to change those library names and/or add
others someday, so it's better to abstract out the names and let
the wrapper supply whatever is necessary.
* ORTE programs, however, are different. They almost always call OPAL
functions (e.g., if they want to send a message, they must use the
OPAL DSS). As such, it seems like the ORTE programs should always
link in OPAL.
Therefore:
* Add undocumented --openmpi:linkall flag to the wrapper compilers.
See the comment in opal_wrapper.c for an explanation of what it
does. This flag is only intended for Open MPI developers -- not
end users. That's why it's undocumented.
* Update orte/test/mpi/Makefile.am to add --openmpi:linkall
* Make ortecc/ortec++'s wrapper data text files always explicitly
link in libopen-pal
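
As a rough illustration of the intended behavior (not the actual
opal_wrapper.c logic), the effect of --openmpi:linkall on an mpicc-style
link line could be sketched as follows. The function name
build_link_libs is hypothetical, and the sketch assumes the flag is
consumed by the wrapper rather than passed to the underlying compiler.

```python
LINKALL_FLAG = "--openmpi:linkall"

def build_link_libs(argv):
    """Return (remaining_args, libs) for an mpicc-style wrapper.

    Hypothetical helper: the library names come from the rationale above,
    everything else is an assumption made for illustration.
    """
    link_all = LINKALL_FLAG in argv
    # Assumption: the wrapper strips its own flag before calling the compiler.
    remaining = [arg for arg in argv if arg != LINKALL_FLAG]
    libs = ["-lmpi"]                          # enough for normal MPI applications
    if link_all:
        libs += ["-lopen-rte", "-lopen-pal"]  # tests that call ORTE/OPAL directly
    return remaining, libs
```

Keeping the library names inside the wrapper is the point of the flag:
Makefiles only ever mention --openmpi:linkall, so the names can change
later without touching them.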
This commit was SVN r27670.
The following SVN revision numbers were found above:
r27668 --> open-mpi/ompi@cf845897aa
The following Trac tickets were found above:
Ticket 3422 --> https://svn.open-mpi.org/trac/ompi/ticket/3422
1. Restore libopen-pal.la, libopen-rte.la, and libmpi.la to be
separate entities (i.e., don't have libopen-rte.la include
libopen-pal.la, and don't have libmpi.la include libopen-pal.la).
Yay!
1. Consequently, make the wrapper compilers look for flags indicating
that the user wants to compile statically (currently: -static,
!--static, -Bstatic, and "-Wl," in front of all of those). If it
is, follow a 6-way matrix for determining which libraries to
list on the underlying command line.
1. To support that, add the name of a token static and dynamic
library to look for in each of the wrapper compiler data files.
1. Fix a long-standing typo in the opalcc wrapper data file.
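
A minimal sketch of the static-flag detection described in item 2,
assuming the check is a simple scan of the wrapper's arguments. The
names STATIC_FLAGS and wants_static are hypothetical. The 6-way library
matrix is not spelled out in this message, so the sketch stops at
detection.

```python
STATIC_FLAGS = ("-static", "--static", "-Bstatic")

def wants_static(argv):
    """Return True if any argument asks for static linking."""
    for arg in argv:
        # Accept the bare flag, or the same flag handed to the linker
        # with a "-Wl," prefix (e.g. "-Wl,-Bstatic").
        if arg.startswith("-Wl,"):
            arg = arg[len("-Wl,"):]
        if arg in STATIC_FLAGS:
            return True
    return False
```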
This commit was SVN r27662.
Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream, so it is a little unexpected that mca_base_components_close closes one. To add to this, several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function, and others closed their output a second time. This change is an improvement to the semantics of mca_base_components_open/close, as they are now symmetric in their functionality.
This commit was SVN r27570.
It appears the problem was not with the command line parser but with the rsh plm. I don't know why this problem was not occurring before the command line parser changes, but it appears to be resolved now.
This commit was SVN r27527.
The following SVN revision numbers were found above:
r27451 --> open-mpi/ompi@d59034e6ef
r27456 --> open-mpi/ompi@ecdbf34937
ompi/mca/sbgp/basesmsocket
orte/mca/rmaps/lama
Remove stale configure.params files from the sbgp framework as the OMPI build system no longer looks at those files.
This commit was SVN r27377.
debruijn when launching fewer processes than are actually available within an
allocation. When this is fixed, please revert this change.
This commit was SVN r27376.
This now results in the procs being bound within their assigned location. It also causes us to use only the 0th HT on a core unless --use-hwthread-cpus has been specified (in which case, we use all the HTs in a core). Bind to core binds you to all HTs regardless - the --use-hwthread-cpus only impacts the oversubscribed determination and when binding to HT.
cmr:v1.7
This commit was SVN r27342.
As a secondary cleanup, the HNP doesn't need to update its nidmap during an xcast as it already has an up-to-date picture of the situation. So just dump that data and move along.
This commit was SVN r27318.
We ran into a case where the OMPI SVN trunk grew a new acceptable MCA
parameter value, but this new value was not accepted on the v1.6
branch (hwloc_base_mem_bind_failure_action -- on the trunk it accepts
the value "silent", but on the older v1.6 branch, it doesn't). If you
set "hwloc_base_mem_bind_failure_action=silent" in the default MCA
params file and then accidentally ran with the v1.6 branch, every OMPI
executable (including ompi_info) just failed because hwloc_base_open()
would say "hey, 'silent' is not a valid value for
hwloc_base_mem_bind_failure_action!". Kaboom.
The only problem is that it didn't give you any indication of where
this value was being set. Quite maddening, from a user perspective.
So we changed how ompi_info handles this case. If any framework open
function returns OMPI_ERR_BAD_PARAM (either because its base MCA params
got a bad value or because one of its component register/open
functions returned OMPI_ERR_BAD_PARAM), ompi_info will stop, print out
a warning that it received an error, and then dump out the parameters
that it has received so far in the framework that had a problem.
At a minimum, this will show the user the MCA param that had an error
(it's usually the last one), and ''where it was set from'' (so that
they can go fix it).
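
The control flow described above can be sketched roughly as follows.
This is an illustration only: open_frameworks, the BAD_PARAM stand-in,
and the (name, value, source) tuples are hypothetical and do not mirror
the real ompi_info data structures.

```python
BAD_PARAM = "ERR_BAD_PARAM"  # stand-in for OMPI_ERR_BAD_PARAM

def open_frameworks(frameworks):
    """Open frameworks in order; stop and report on the first bad parameter.

    frameworks: list of (framework_name, open_fn) pairs, where open_fn()
    returns (status, params) and params lists the (name, value, source)
    tuples registered so far. All of this is a hypothetical shape.
    """
    report = []
    for fw_name, open_fn in frameworks:
        status, params = open_fn()
        if status == BAD_PARAM:
            report.append(f"Warning: framework {fw_name} reported a bad parameter")
            # Dump what was seen so far; the offender is usually the last one.
            for pname, value, source in params:
                report.append(f"  {pname}={value} (set from {source})")
            return False, report
    return True, report
```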
We updated ompi_info to check for O???_ERR_BAD_PARAM from each of
the framework opens. Also updated the doxygen docs in mca.h for this
O???_BAD_PARAM behavior. And we noticed that mca.h had MCA_SUCCESS
and MCA_ERR_??? codes. Why? I think we used them in exactly one
place in the code base (mca_base_components_open.c). So we deleted
those and just used the normal OPAL_* codes instead.
While we were doing this, we also cleaned up a little memory
management during ompi_info/orte-info/opal-info finalization.
Valgrind still reports a truckload of memory still in use at ompi_info
termination, but they mostly look to be components not freeing
memory/resources properly (and outside the scope of this fix).
This commit was SVN r27306.
The following Trac tickets were found above:
Ticket 3275 --> https://svn.open-mpi.org/trac/ompi/ticket/3275
If orte_sstore_base_global_snapshot_ref is NULL, then it will default appropriately when it is used. When prelaunching we always specify this parameter, but if we are not prelaunching it is possible to allow this to be NULL and it will initialize when used. However, we set up the prelaunching variable in both situations, and in the latter case that would result in a NULL reference. This patch protects that code segment.
This commit was SVN r27289.
So (finally) consolidate these two fields into one "slots" field. Add a field in orte_job_t to indicate when all the procs for a job will be launched together, so that staged operations can know when MPI operations are allowed.
This commit was SVN r27239.
For those nodes (and *only* those nodes) where the user does *not* specify a slot count, we will set the number of slots according to their direction: either to the number of cores, numas, sockets, or hwthreads. Otherwise, the slot count is set to 1.
Note that the default behavior remains unchanged: in the absence of any value for #slots, and in the absence of any directive to set #slots, we will set #slots=1.
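
A small sketch of the slot-count rule above, assuming each node is a
dict with an optional "slots" entry plus hardware counts. The function
name, the dict layout, and the directive strings are all hypothetical.

```python
SLOT_DIRECTIVES = ("cores", "numas", "sockets", "hwthreads")

def resolve_slots(node, directive=None):
    """Return the slot count for one node.

    node: dict with an optional 'slots' value and hardware counts keyed
    by the directive names above (a hypothetical layout).
    """
    if node.get("slots") is not None:
        return node["slots"]       # a user-specified count always wins
    if directive is None:
        return 1                   # default behavior: #slots=1
    if directive not in SLOT_DIRECTIVES:
        raise ValueError(f"unknown slot directive: {directive!r}")
    return node[directive]
```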
This commit was SVN r27236.
1. if a hostfile is given, then add the nodes found in that hostfile to our list - i.e., the resulting allocation contains the UNION of all nodes specified in hostfiles from across all apps.
2. any app that has no hostfile but has a dash-host, will have those nodes added to the list
3. any app that fails to have a hostfile or a dash-host will be given the default hostfile, if we have it
Each app will subsequently be filtered using their hostfile and/or dash-host data to ensure that the app only has access to the hosts it specified
Note that any relative node syntax found in the hostfiles or dash-host data will generate an error in this scenario, so only non-relative syntax can be present
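
The three rules above can be sketched as follows. This is an
illustration, not the ORTE implementation: build_allocation and the
app dict keys are hypothetical, and treating any entry that starts with
"+" as relative node syntax is an assumption. The per-app filtering
step is left out because its exact semantics are not given here.

```python
def build_allocation(apps, default_hostfile_nodes=None):
    """Return the ordered union of nodes contributed by all apps.

    apps: list of dicts with optional 'hostfile' and 'dash_host' node lists.
    """
    allocation = []
    for app in apps:
        if app.get("hostfile"):            # rule 1: hostfile nodes are added
            nodes = app["hostfile"]
        elif app.get("dash_host"):         # rule 2: dash-host only, no hostfile
            nodes = app["dash_host"]
        else:                              # rule 3: fall back to the default
            nodes = default_hostfile_nodes or []
        for node in nodes:
            if node.startswith("+"):       # assumed marker for relative syntax
                raise ValueError(f"relative node syntax not allowed: {node!r}")
            if node not in allocation:
                allocation.append(node)
    return allocation
```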
This commit was SVN r27223.
First revision of the Locality Aware Mapping Algorithm (LAMA) RMAPS
component. This component is used to effect many different types of
regular process/processor affinity patterns. Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.
Inspired by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.). The LAMA core algorithm is described here:
http://www.open-mpi.org/papers/cluster-2011-lama/
= LAMA Usage Levels =
LAMA allows specifying affinity in multiple different ways:
1. None: Specifying no affinity options to mpirun results in exactly
the same behavior as today: no affinity is used.
1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by
<LEVEL>" to indicate how "wide" each process should be bound
(i.e., bind to a processor core, or to a processor socket, etc.)
and how to lay out the processes (i.e., round robin by cores,
sockets, etc.).
1. Expert: Using four new MCA parameters to effect process mapping
and binding to processors. These options are a bit complex, and
are not for the faint of heart, but offer a high degree of
(regular pattern) flexibility (each of these is described more
fully below):
* rmaps_lama_map: a sequence of characters describing how to lay
out processes
* rmaps_lama_bind: a sequence of characters describing the
resources to bind to each process
* rmaps_lama_mppr: a sequence of characters describing the maximum
number of processes to allow per resource (i.e., a specific
definition of "oversubscription")
* rmaps_lama_ordering: once all processes are in place, how to
order the ranks in MPI_COMM_WORLD
We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.
The Expert level was designed for two purposes:
1. To provide a precise definition for the "Simple" level (i.e.,
every --bind-to/--map-by option in the "Simple" level has a
corresponding precise specification in the "Expert" level)
1. As modern computing platforms become more complex, we simply
cannot predict what application developers will need in terms of
processor affinity. LAMA is an attempt to provide a highly
flexible mechanism that allows applications to utilize a variety
of complex, unique affinity patterns beyond the common "bind to
core" and "bind to socket" patterns.
= LAMA Simple Level =
The "Simple" level is pretty much the same as what Open MPI has
offered for years. It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.
Specifically, the following options are available for both --bind-to
and --map-by:
* slot
* hwthread
* core
* l1cache
* l2cache
* l3cache
* socket
* numa
* board
* node
= LAMA Expert Level =
The "Expert" level requires some explanation. I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek. It is
flexible, but complex. '''Most users won't need the Expert level.'''
LAMA works in three phases: mapping, binding, and ordering. Each is
described below.
== Expert: Mapping ==
Processes are paired with sets of resources. For example, each
process may be paired with a single processor core. Or each process
may be paired with an entire processor socket. LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits. More on MPPR, below.
Mapping can be performed across multiple hardware levels:
* h: Hardware thread
* c: Processor core
* s: Processor socket
* L1: L1 cache
* L2: L2 cache
* L3: L3 cache
* N: NUMA node
* b: Processor board
* n: Server node
If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.
But to effect different process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.'' That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as a
nine-dimensional space, you can imagine nine nested loops to traverse
all of them. And you can imagine that changing the order of nesting
would change the traversal pattern.
LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.
For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.
Wait... what? That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads. Why are they
tacked on to the end?
In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core). When there are no more resources indicated by that
token, it goes on to the next token.
Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.
Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).
If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core. And the next
process will be mapped to the second hyperthread on the second core.
And so on.
Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.
As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).
The sequence of letters can be combined in many, many different ways to
produce many different regular mapping patterns.
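
To make the "nested loops in a chosen order" idea concrete, here is a
simplified sketch. It is not the LAMA implementation: it assumes a
homogeneous topology in which each level has a fixed number of
instances inside its parent, and it treats any level missing from the
sizes table as having a single instance. The names tokenize, traverse,
and sizes are hypothetical.

```python
import itertools
import re

# The nine mapping tokens: h, c, s, L1, L2, L3, N, b, n.
TOKEN_RE = re.compile(r"L[123]|[hcsNbn]")

def tokenize(layout):
    """Split a mapping string such as 'csL1L2L3Nbnh' into tokens."""
    tokens = TOKEN_RE.findall(layout)
    if "".join(tokens) != layout:
        raise ValueError(f"unrecognized token in {layout!r}")
    return tokens

def traverse(layout, sizes):
    """Yield one {token: index} dict per slot, leftmost token varying fastest.

    sizes maps a token to how many of that resource exist inside its
    parent; levels missing from sizes count as one instance (assumption).
    """
    tokens = tokenize(layout)
    # itertools.product varies its LAST argument fastest, so feed it the
    # tokens in reverse and flip each coordinate tuple back afterwards.
    ranges = [range(sizes.get(tok, 1)) for tok in reversed(tokens)]
    for coords in itertools.product(*ranges):
        yield dict(zip(tokens, reversed(coords)))
```

With sizes = {"s": 2, "c": 2, "h": 2}, traversing "csL1L2L3Nbnh" visits
the first hardware thread of every core before touching any second
hardware thread, while "sL1L2L3Nbnch" alternates between sockets first.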
=== Max Processes Per Resource (MPPR) ===
The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource. In effect, it
defines the concept of "oversubscription." Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.
This conventional definition is expressed in a MPPR string of "1:c"
(one process per core).
But what if your MPI processes are multi-threaded, and they need
multiple cores per process? You'd need a different description of
"oversubscription" in this case. Perhaps you want to have one MPI
process per socket. This would be expressed in a MPPR string of
"1:s".
The general form of an individual MPPR specification is an integer
followed by a colon, followed by any one of the mapping tokens
listed above. For example, "1:c" is pronounced "one
process per core."
Multiple MPPR specifications can be strung together into a
comma-delimited list, too. All of these MPPR values are then taken
into account when mapping. Here are some examples:
* 1:c -- allow, at most, one process per processor core (i.e., don't
schedule by hyperthread)
* 1:s -- allow, at most, one process per processor socket (e.g.,
that process may be multithreaded, or wants exclusive use of the
socket's caches)
* 1:s,2:n -- only allow one process per processor socket, but, at
most, two processes per server node (e.g., if the two MPI processes
will consume all the RAM on the server node, even if there are more
processor cores available)
If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed. If --oversubscribe was specified
on the mpirun command line, the job continues. Otherwise, LAMA will
abort the job.
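
A minimal sketch of parsing an MPPR string and testing a set of
placements against it. The names parse_mppr and is_oversubscribed, and
the representation of a placement as a {token: resource id} dict, are
hypothetical. The real component works on the hardware topology rather
than on plain dicts.

```python
import re
from collections import Counter

MPPR_ITEM = re.compile(r"(\d+):(L[123]|[hcsNbn])")

def parse_mppr(spec):
    """Parse e.g. '1:s,2:n' into {'s': 1, 'n': 2}."""
    limits = {}
    for item in spec.split(","):
        match = MPPR_ITEM.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"bad MPPR item: {item!r}")
        limits[match.group(2)] = int(match.group(1))
    return limits

def is_oversubscribed(placements, limits):
    """placements: one {token: resource_id} dict per mapped process."""
    usage = Counter()
    for placement in placements:
        for token in limits:
            # Count processes per individual resource instance.
            usage[(token, placement[token])] += 1
    return any(count > limits[token] for (token, _), count in usage.items())
```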
Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string until all processes have been
mapped.
== Expert: Binding ==
Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources. For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.
To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.
With binding, however, processes are bound to a set of hardware
threads. The number of threads to which the process is bound is
sometimes referred to as the "binding width". For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.
(Note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket. BIOS ordering of hardware threads can be
wonky, so we only refer to "sets of hardware threads.")
Bindings are expressed as an integer and a token from the mapping
string. For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).
Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
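
For illustration, a binding string such as "1s" could be parsed like
this. The name parse_binding is hypothetical, and the sketch assumes
the string is exactly one integer followed by one mapping token.

```python
import re

BIND_RE = re.compile(r"(\d+)(L[123]|[hcsNbn])")

def parse_binding(spec):
    """Parse e.g. '1s' into (1, 's'): bind each process to one socket."""
    match = BIND_RE.fullmatch(spec)
    if match is None:
        # Rejects MPPR-style strings ("1:s") and lists ("1s,2c") alike.
        raise ValueError(f"bad binding specification: {spec!r}")
    return int(match.group(1)), match.group(2)
```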
== Expert: Ordering ==
Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA
currently offers two ordering modes: sequential or natural:
* Sequential: if you laid out all the hardware resources in a single
line, and then overlaid all the MPI processes on top of them, they
are ordered from 0 to (N-1) from left-to-right.
* Natural: the ordering of ranks follows the mapping ordering. For
example, consider a server node with two processor sockets, each
containing four cores. The command line "mpirun -np 8 --bind-to
core --map-by socket --order n a.out" would result in MCW ranks
that look like this: [0 2 4 6] [1 3 5 7].
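
The two ordering modes can be sketched for the example above (two
sockets, four cores each, mapped by socket). The function names and the
(socket, core) tuple representation are hypothetical.

```python
def map_by_socket(num_sockets, cores_per_socket, nprocs):
    """Return (socket, core) slots in round-robin-by-socket mapping order."""
    slots = [(s, c) for c in range(cores_per_socket) for s in range(num_sockets)]
    return slots[:nprocs]

def assign_ranks(mapped_slots, mode):
    """Return {slot: MPI_COMM_WORLD rank} for 'natural' or 'sequential'."""
    if mode == "natural":
        order = list(mapped_slots)      # ranks follow the mapping order
    elif mode == "sequential":
        order = sorted(mapped_slots)    # ranks follow the hardware laid in a line
    else:
        raise ValueError(f"unknown ordering mode: {mode!r}")
    return {slot: rank for rank, slot in enumerate(order)}

def ranks_by_socket(ranks, num_sockets, cores_per_socket):
    """Group ranks per socket, in core order (assumes every core is used)."""
    return [[ranks[(s, c)] for c in range(cores_per_socket)]
            for s in range(num_sockets)]
```

For eight processes, natural ordering reproduces the
[0 2 4 6] [1 3 5 7] layout shown above, while sequential ordering
gives [0 1 2 3] [4 5 6 7].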
= Execution =
At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered. It now starts its execution.
= Final Notes =
Note that at this point, lama is not the default mapper. It must be
activated with "--mca rmaps lama". We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets). For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.
See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/
This commit was SVN r27206.
Remove some stale configure.m4's we no longer need.
Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.
This commit was SVN r27165.
following:
* Provides a fixed number of resource slots (i.e., "hotel rooms").
* Allows one thing to occupy a resource slot at a time (i.e., each
hotel room can have an occupant check in to that room).
* Resource slots can be vacated at any time (i.e., occupants can
voluntarily check out of their hotel room).
* Resource slots can be occupied for a specific maximum amount of
time. If that time expires, the occupant is forcibly evicted and
the upper layer is notified via (libevent) callback (i.e., the maid
will kick an occupant out of their room when their reservation
is over).
This class can be used for things like retransmission schemes
for unreliable transports. For example, a message sent on an
unreliable transport can be checked in to a hotel room. If an ACK for
that message is received, the message can be checked out. But if the
ACK is never received, the message will eventually be evicted from its
room and the upper layer will be notified that the message failed to
check out in time (i.e., that an ACK for that message was not received
in time).
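
The hotel idea can be sketched in a few lines. This is an illustration
of the concept only, not the OPAL class: the Hotel class and its method
names are hypothetical, and the sketch replaces the libevent timer with
an explicit evict_expired(now) call so that it stays self-contained.

```python
class Hotel:
    """Fixed number of rooms; occupants are evicted after a timeout."""

    def __init__(self, num_rooms, eviction_timeout, on_evict):
        self._rooms = [None] * num_rooms   # each entry: None or (occupant, deadline)
        self._timeout = eviction_timeout
        self._on_evict = on_evict          # callback(room, occupant)

    def checkin(self, occupant, now):
        """Place occupant in the first free room and return its number."""
        for room, entry in enumerate(self._rooms):
            if entry is None:
                self._rooms[room] = (occupant, now + self._timeout)
                return room
        raise RuntimeError("hotel is full")

    def checkout(self, room):
        """Voluntarily vacate a room (e.g., the ACK arrived)."""
        entry = self._rooms[room]
        if entry is None:
            raise KeyError(f"room {room} is empty")
        self._rooms[room] = None
        return entry[0]

    def evict_expired(self, now):
        """Evict every occupant whose reservation has run out."""
        for room, entry in enumerate(self._rooms):
            if entry is not None and entry[1] <= now:
                self._rooms[room] = None
                self._on_evict(room, entry[0])
```

In a retransmission scheme, checkin would be called when a message is
sent, checkout when its ACK arrives, and the eviction callback would
trigger the resend or failure handling.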
Code using this class is currently being developed off-trunk, but will
be coming to SVN soon.
This commit was SVN r27067.
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context
Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.
This commit was SVN r27005.
r26951. The feeling is that fixing the actual problem of the command
line parser not always identifying when invalid command line options
were specified (i.e., r26953) was a better solution.
This commit was SVN r26979.
The following SVN revision numbers were found above:
r26951 --> open-mpi/ompi@1f8df92c3c
r26953 --> open-mpi/ompi@0b7b3feba9
gone. So exit with 0, not ORTE_ERROR_DEFAULT_EXIT_CODE (which is 1).
This fixes a race condition in the rsh launcher upon termination,
where ORTE would sometimes think that a daemon failed to launch.
This commit was SVN r26935.