Commit graph

829 commits

Author SHA1 Message Date
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “pmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
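
The signature idea in the list above can be sketched roughly as follows. This is an illustration only, not the ORTE data structure: `proc_name_t`, `coll_signature_t`, and `signature_equal` are hypothetical names, and the sketch assumes that each participant list is stored in a canonical (sorted) order so that two signatures for the same group compare equal element by element.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical process name: (job id, rank within the job). */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} proc_name_t;

/* Hypothetical collective signature: the participating procs themselves
 * identify the collective, so no global collective id is needed. */
typedef struct {
    const proc_name_t *procs;  /* participants, assumed to be in sorted order */
    size_t             nprocs; /* number of participants */
} coll_signature_t;

/* Two collectives refer to the same group only if their participant lists
 * match exactly. A tracker could use this to enforce "one active collective
 * per unique combination of procs". */
bool signature_equal(const coll_signature_t *a, const coll_signature_t *b)
{
    if (a->nprocs != b->nprocs) {
        return false;
    }
    for (size_t i = 0; i < a->nprocs; i++) {
        if (a->procs[i].jobid != b->procs[i].jobid ||
            a->procs[i].vpid != b->procs[i].vpid) {
            return false;
        }
    }
    return true;
}
```

Under these assumptions a proc may appear in many signatures at once; only an exact match of the whole participant list counts as the same collective.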

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
Gilles Gouaillardet
f96d382d1d Fix typo.
Thanks to Christopher Samuel for reporting it

This commit was SVN r32520.
2014-08-13 05:54:59 +00:00
Gilles Gouaillardet
e184733ef6 check-help-strings cleanup
This commit was SVN r32496.
2014-08-11 03:26:21 +00:00
Jeff Squyres
4da3c85b54 fortran: revert Absoft-based fixes
Revert r32246, r32254, and r32255 -- they were fixing side-effects of
the real bug.  Real fix coming after this one.

This commit was SVN r32286.

The following SVN revision numbers were found above:
  r32246 --> open-mpi/ompi@08d2a1a48d
  r32254 --> open-mpi/ompi@232d4dbb7b
2014-07-22 21:49:22 +00:00
Jeff Squyres
6cc538ae16 help-orterun.txt: wrap long messages, clarify new messages
Clarify the new -x/mca_base_env_list help messages.

This commit was SVN r32199.
2014-07-10 17:24:52 +00:00
Ralph Castain
796f57f709 Protect against problems if someone passes us thru a pipe and then abnormally terminates the pipe early
This commit was SVN r32189.
2014-07-09 22:41:53 +00:00
Joshua Ladd
801e2cb544 Fix error and warning messages after reverting
the mca_base_env_list to being semicolon delimited.

This commit was SVN r32179.
2014-07-09 14:46:19 +00:00
Joshua Ladd
30da6d3a17 Opal: add a new MCA parameter that allows the user to specify a list of environment variables. This parameter will become the standard mechanism by which environment variables are set for OMPI applications replacing the -x option.
mpirun ... -x env_foo1=val1 -x env_foo2 -x env_foo3=val3  should now be expressed as

mpirun ... -mca mca_base_env_list env_foo1=val1+env_foo2+env_foo3=val3. 

The motivation for doing this is so that a list of environment variables may be set via standard MCA mechanisms such as mca parameter files, amca lists, etc. 

This feature was developed by Elena Shipunova and was reviewed by Josh Ladd.

This commit was SVN r32163.
2014-07-09 00:38:25 +00:00
Adrian Reber
cabf1d4e68 use the orte attributes in the FT code to fix compile errors
This commit was SVN r32093.
2014-06-26 03:19:17 +00:00
Ralph Castain
5f6be06b54 Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended.
cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load

This commit was SVN r32072.
2014-06-24 18:11:39 +00:00
Ralph Castain
8db76e9c6f Ensure that we change to the session dir if we preload binaries so we'll use the loaded one
Special patch created for v1.8 and CMR filed

This commit was SVN r31963.
2014-06-06 21:43:23 +00:00
Ralph Castain
f1978fba7c Cleanup a set of typos on the orte_get_attribute call
This commit was SVN r31942.
2014-06-03 20:36:38 +00:00
Ralph Castain
8736a1c138 Per RFC:
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php

Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).

This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Oscar Vega-Gisbert
83bdebbf81 Java bindings for OSHMEM.
This commit was SVN r31810.
2014-05-18 21:48:09 +00:00
Ralph Castain
5602156a1c Use the correct abstraction layer name for the data dirs
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091 The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
4def94900a Per RFC: OMPI_INSTALL_BINARIES -> OPAL_INSTALL_BINARIES
This commit was SVN r31634.
2014-05-05 21:43:05 +00:00
Ralph Castain
7a79b25577 Ensure we cleanup some files so session dirs can be rolled up
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31569.
2014-04-30 17:52:10 +00:00
Ralph Castain
c4c9bc1573 As per the RFC:
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php

Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM

This commit was SVN r31557.
2014-04-29 21:49:23 +00:00
Jeff Squyres
e1655ae68d opal/util/fd.c: add new convenience function for setting FD_CLOEXEC
Paul Hargrove pointed out that Stevens tells us that we should
F_GETFD before F_SETFD.  And so we shall.

Make a new convenience function to do this (opal_fd_set_cloexec()),
just so that we don't have to litter this 2-step process throughout
the code.
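
The 2-step process can be sketched as follows. This is a minimal illustration under stated assumptions, not the source of opal_fd_set_cloexec(): the name `set_cloexec` and the 0/-1 return convention are hypothetical.

```c
#include <fcntl.h>

/* Hypothetical helper showing the get-then-set pattern for FD_CLOEXEC.
 * Returns 0 on success, -1 on failure (errno is set by fcntl). */
int set_cloexec(int fd)
{
    /* Step 1: read the current file descriptor flags. */
    int flags = fcntl(fd, F_GETFD, 0);
    if (flags < 0) {
        return -1;
    }

    /* Step 2: write them back with FD_CLOEXEC added, so that any other
     * descriptor flags already set are preserved. */
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        return -1;
    }
    return 0;
}
```

Reading the flags first matters because F_SETFD replaces the whole flag word; writing FD_CLOEXEC alone would silently clear anything else that was set.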

Refs trac:4550

This commit was SVN r31513.

The following Trac tickets were found above:
  Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 13:04:49 +00:00
Jeff Squyres
87e6232e67 orterun.c: set an fd to be close-on-exec
Make sure the debugger attach fifo is marked as close-on-exec so that
children procs don't inherit it.  For example, if you salloc a SLURM
allocation and run "mpirun ..." in there (i.e., mpirun is running on
the head node, and launching on to back-end nodes), the forked srun's
will inherit this fd if it is still open.

Refs trac:4550

This commit was SVN r31499.

The following Trac tickets were found above:
  Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-22 21:55:09 +00:00
Jeff Squyres
63b7ef4103 orterun.1in: Document --allow-run-as-root option
Add some verbiage about how mpirun now defaults to disallowing running
as root, but you can use the --allow-run-as-root option to override
this default behavior.

Refs trac:4536

This commit was SVN r31477.

The following Trac tickets were found above:
  Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-22 14:34:32 +00:00
Jeff Squyres
482b465c05 Trivial format change: use the same length of lines and \n offsets as
opal_show_help().

Refs trac:4536

This commit was SVN r31437.

The following Trac tickets were found above:
  Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:14:45 +00:00
Ralph Castain
12094eb7b2 Add some further protections after discussion with Jeff
Refs trac:4536

This commit was SVN r31422.

The following Trac tickets were found above:
  Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 16:21:55 +00:00
Ralph Castain
7c4fa3446c Per the telecon, revert r31302 for now pending an RFC review on the idea of setting app proc envar's using an MCA param
This commit was SVN r31345.

The following SVN revision numbers were found above:
  r31302 --> open-mpi/ompi@6a1b78e26b
2014-04-08 15:47:12 +00:00
Mike Dubman
6a1b78e26b opal: add mca param to control ranks env variables
add -mca base_env_list "var1=val1 var2=val2 ..." mca parameter that can be used in mca param files
or with -am app.conf mpirun commandline to set rank env variables with mca mechanism

fixed by Elena, reviewed by Miked

cmr=v1.8.1:reviewer=ompi-rm1.8

This commit was SVN r31302.
2014-04-01 21:14:31 +00:00
Jeff Squyres
173c046617 build: add Automake-like silent/verbose macros for "ln -s ..." operations
Also, since I put some of the macros for these silent/verbose rules up
in the top-level Makefile.man-page-rules file, I renamed it to
Makefile.ompi-rules.

I've had this sitting around for a while; now seems like as good a
time as any to commit it.

This commit was SVN r31271.
2014-03-28 18:24:32 +00:00
Ralph Castain
f7df960198 Silence warning
This commit was SVN r31139.
2014-03-18 23:15:29 +00:00
Ralph Castain
518ba55cf4 Ensure MPIEXEC_TIMEOUT calls the correct state to exit
cmr=v1.7.5:reviewer=dgoodell

This commit was SVN r31125.
2014-03-18 20:12:02 +00:00
Ralph Castain
38e02890aa ORTE doesn't care about cxx flags
cmr=v1.8:reviewer=jsquyres

This commit was SVN r31086.
2014-03-17 21:21:54 +00:00
Ralph Castain
7869402f5f Sigh - looks like I did too good a job of turning things off. Back some of it out in favor of trying again when more time is available
Refs trac:4368

This commit was SVN r31017.

The following Trac tickets were found above:
  Ticket 4368 --> https://svn.open-mpi.org/trac/ompi/ticket/4368
2014-03-12 02:10:35 +00:00
Ralph Castain
9c66c4f439 Correctly implement --disable-oshmem and --without-orte so we don't build the disabled section of code. Fix a bunch of code rot in the PMI rte component, and add several missing headers when building --without-orte.
NOTE: I transferred the oshmem-disabled-by-default from the 1.7 branch to the trunk to minimize future disruption if/when we change that option.

cmr=v1.8:reviewer=jsquyres

This commit was SVN r31006.
2014-03-11 22:02:40 +00:00
Adrian Reber
e5bef82ee1 OPAL_ENABLE_FT_CR: remove compiler warnings
When compiling --with-ft there are a few compiler warnings about
unused variables. This patch fixes those compiler warnings.

This commit was SVN r30927.
2014-03-04 15:28:07 +00:00
Ralph Castain
0ac97761cc Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named.
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= parameter. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.

In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.
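
The new default can be summarized in a few lines. This is a sketch of the rule as described above, not the ORTE code: `node_info_t` and `effective_slots` are hypothetical names, and a `slots_given` of 0 is assumed to mean that the hostfile had no slots= entry for the node.

```c
#include <stdbool.h>

/* Hypothetical description of one hostfile entry plus sensed hardware. */
typedef struct {
    int slots_given;   /* value of slots= in the hostfile, 0 if absent */
    int num_cores;     /* cores detected on the node */
    int num_hwthreads; /* hardware threads detected on the node */
} node_info_t;

/* An explicit slots= value always wins. Otherwise fall back to the sensed
 * hardware: #hwthreads if use_hwthreads_as_cpus is set, else #cores. */
int effective_slots(const node_info_t *node, bool use_hwthreads_as_cpus)
{
    if (node->slots_given > 0) {
        return node->slots_given;
    }
    return use_hwthreads_as_cpus ? node->num_hwthreads : node->num_cores;
}
```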

Also clean up a couple of issues in the mapping/binding system:

* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed

* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch

* minor cleanup to the warning message when oversubscribed and binding was requested

cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system

This commit was SVN r30909.
2014-03-03 16:46:37 +00:00
Ralph Castain
1565816988 Do a little better job of cleaning up the session directory left by mpirun by ensuring we delete the event associated with debugger attachment and unlinking the pipe used for that purpose. Also, we no longer leave "abort" files around, so remove that check when deleting session directory trees
cmr=v1.7.5:reviewer=jsquyres:subject=cleanup session directories better

This commit was SVN r30689.
2014-02-11 22:16:17 +00:00
Ralph Castain
a49e0db8dd We haven't supported a c++ wrapper for ORTE in quite some time
cmr=v1.7.5:reviewer=ompi-gk1.7:subject=remove c++ cruft

This commit was SVN r30653.
2014-02-10 17:16:30 +00:00
Ralph Castain
bc7cc09749 After a lot of pain, I've managed to resolve the problem of conflicting mapping directives caused by mismatched MCA params - i.e., where someone has one variant of an MCA param (e.g., rmaps_base_mapping_policy) in their default MCA param file, and then specifies another variant (e.g., --npernode) on the command line. I can't fully resolve the problem as there is no way to know precisely what the user meant - we can only guess which param was really intended since the MCA param system
can't apply its normal precedence rules.

So...print a big "deprecated" warning for the old params and error out if a conflict is detected. I know that isn't what people really wanted, but it's the best we can do. If only the old style param is given, then process it after the warning.

Extend the current map-by param to add support for ppr and cpus-per-proc, adding the latter to the list of allowed modifiers using "pe=n" for processing elements/proc. Thus, you can map-by socket:pe=2,oversubscribe to map by socket, binding 2 processing elements/process, with oversubscription allowed. Or you can map-by ppr:2:socket:pe=4 to map two processes to every socket in the allocation, binding each process to 4 processing elements.

For those wondering, a processing element is defined as a hwthread if --use-hwthreads-as-cpus is given, or else as a core.
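
To make the syntax of the two map-by examples concrete, here is a small parsing sketch. It is not the rmaps parser: `map_spec_t` and `parse_map_by` are hypothetical, and the sketch only understands the forms quoted above (an object name, an optional leading `ppr:N:`, and the `pe=n` and `oversubscribe` modifiers).

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing a map-by value. */
typedef struct {
    char object[32];     /* mapping object, e.g. "socket" */
    int  ppr;            /* procs per object, 0 if "ppr:" was not used */
    int  pe;             /* processing elements per proc, 0 if not given */
    bool oversubscribe;  /* true if the oversubscribe modifier was given */
} map_spec_t;

/* Parse "object[:mod,mod]" or "ppr:N:object[:mod,mod]".
 * Returns 0 on success, -1 on anything this sketch does not recognize. */
int parse_map_by(const char *spec, map_spec_t *out)
{
    char buf[128];
    if (strlen(spec) >= sizeof(buf)) {
        return -1;
    }
    strcpy(buf, spec);
    memset(out, 0, sizeof(*out));

    char *rest = buf;
    if (strncmp(rest, "ppr:", 4) == 0) {
        char *end;
        long n = strtol(rest + 4, &end, 10);
        if (n <= 0 || *end != ':') {
            return -1;
        }
        out->ppr = (int)n;
        rest = end + 1;
    }

    /* Everything up to the next ':' is the object; the rest are modifiers. */
    char *mods = strchr(rest, ':');
    if (mods != NULL) {
        *mods++ = '\0';
    }
    if (*rest == '\0' || strlen(rest) >= sizeof(out->object)) {
        return -1;
    }
    strcpy(out->object, rest);

    /* Modifiers are comma-separated. */
    for (char *m = mods; m != NULL && *m != '\0'; ) {
        char *next = strchr(m, ',');
        if (next != NULL) {
            *next++ = '\0';
        }
        if (strncmp(m, "pe=", 3) == 0) {
            out->pe = atoi(m + 3);
            if (out->pe <= 0) {
                return -1;
            }
        } else if (strcmp(m, "oversubscribe") == 0) {
            out->oversubscribe = true;
        } else {
            return -1;
        }
        m = next;
    }
    return 0;
}
```

With this sketch, `socket:pe=2,oversubscribe` yields object "socket" with pe 2 and oversubscription allowed, and `ppr:2:socket:pe=4` yields two procs per socket with pe 4.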

Refs trac:4117

This commit was SVN r30620.

The following Trac tickets were found above:
  Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-07 21:25:40 +00:00
Jeff Squyres
4edeb229cc Add MPIEXEC_TIMEOUT environment variable to the man page.
cmr=v1.7.4:reviewer=rhc

This commit was SVN r30455.
2014-01-28 14:40:17 +00:00
Jeff Squyres
21ffddbbd0 Addendum to r30408: if we're going to remove stale kruft, let's remove
all of it.  :-)

Refs trac:4175.

This commit was SVN r30417.

The following SVN revision numbers were found above:
  r30408 --> open-mpi/ompi@31acdb15bc

The following Trac tickets were found above:
  Ticket 4175 --> https://svn.open-mpi.org/trac/ompi/ticket/4175
2014-01-24 22:19:36 +00:00
Ralph Castain
e3cb4b4a5b Grant Nathan his wish - add a --disable-getpwuid to the configure options and protect all users of that code so it disappears if disabled.
cmr=v1.7.5:reviewer=hjelmn:subject=disable getpwuid if requested

This commit was SVN r30413.
2014-01-24 19:18:37 +00:00
Ralph Castain
31acdb15bc We haven't really supported orteCC in a long time, so let's remove the stale cruft. Thanks to Paul Hargrove for noticing!
cmr=v1.7.4:reviewer=jsquyres:subject=remove stale orteCC cruft

This commit was SVN r30408.
2014-01-24 17:26:54 +00:00
Adrian Reber
0af2897c12 removed trailing whitespaces in orte-checkpoint.c
This commit was SVN r30407.
2014-01-24 17:23:49 +00:00
Adrian Reber
659eb1b10a silence two compiler warnings
This commit was SVN r30406.
2014-01-24 17:22:28 +00:00
Adrian Reber
919260a0d2 fix communication between orte-checkpoint and orterun
Right after starting the communication with orterun the buffer
containing the message is deleted. This patch removes the deletion
of the buffer which is now done by orte_rml_send_callback(). This is
now also the callback function used by orte_rml.send_buffer_nb().
The previous callback hnp_receiver() was introduced by an
earlier patch which only was trying to get the code to compile again.

This commit was SVN r30405.
2014-01-24 17:18:28 +00:00
Jeff Squyres
87e476ebd8 Clean up many references to "rank": usually change to "process" and/or
specifically delineate that we're referring to the process' rank in
MPI_COMM_WORLD.

Refs trac:4068

This commit was SVN r30181.

The following Trac tickets were found above:
  Ticket 4068 --> https://svn.open-mpi.org/trac/ompi/ticket/4068
2014-01-09 16:37:49 +00:00
Ralph Castain
2a0e4b5e62 Update the orterun help messages and man page to reflect new map/rank/bind options and defaults. Thanks to Paul Hargrove for reporting it.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30173.
2014-01-09 04:44:28 +00:00
Jeff Squyres
13b29cff2c This commit complements/completes r30140. r30140 made all the
configury/Makefile.am changes; this commit renames the internal
installdirs.h framework struct field names to match the configury macro
names:

 * pkgdatadir -> ompidatadir
 * pkglibdir -> ompilibdir
 * pkgincludedir -> ompiincludedir

This commit was SVN r30145.

The following SVN revision numbers were found above:
  r30140 --> open-mpi/ompi@8b778903d8
2014-01-07 23:36:33 +00:00
Brian Barrett
8b778903d8 Fix longstanding issue with our multi-project support. Rather than using
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi.  This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.

This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Ralph Castain
d5a5caa7e0 Restore the bycore mpirun option for backward compatibility
Refs trac:4044

cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30103.

The following Trac tickets were found above:
  Ticket 4044 --> https://svn.open-mpi.org/trac/ompi/ticket/4044
2014-01-02 04:16:43 +00:00
Adrian Reber
53a70fe87f Trying to get the C/R code to compile again. (send_*_nb)
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.

Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Changes from V2:
* just replace the blocking calls with the non-blocking calls
* all #ifdef's introduced in V1 are gone
* send_* returns error code or ORTE_SUCCESS (not the number of bytes)

This commit was SVN r30036.
2013-12-20 21:58:28 +00:00