
824 commits

Author SHA1 Message Date
Joshua Ladd
b3f88c4a1d Per the RFC schedule, this commit adds Mellanox OpenSHMEM to the trunk. It does not yet run on OSX or with CM PML for an MTL other than MXM. Mellanox is aware of these issues and is in the process of resolving them. This should be added to cmr=v1.7.4:subject=Move OSHMEM to 1.7.4:reviewer=rhc
This commit was SVN r29153.
2013-09-10 15:34:09 +00:00
Ralph Castain
0da3968ade Update this script to output diff files
This commit was SVN r29146.
2013-09-06 19:08:12 +00:00
Ralph Castain
a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.
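
The wait loop described above can be pictured with a small Python analogue. This is only a sketch of the idea: the real OMPI_WAIT_FOR_COMPLETION is a C macro in ompi/mca/rte/rte.h, and the names `wait_for_completion`, `flag`, and `progress` below are hypothetical stand-ins, with `progress` playing the role of opal_progress.

```python
def wait_for_completion(flag, progress):
    """Call `progress` repeatedly until flag["active"] becomes False.

    `flag` is a mutable dict so that an event handler driven by
    `progress` can clear it (assumption: the C macro spins on a
    plain boolean flag in the same way).
    """
    while flag["active"]:
        progress()


# Example: a fake progress engine that completes the event on its third call.
calls = {"count": 0}
pending = {"active": True}


def fake_progress():
    calls["count"] += 1
    if calls["count"] == 3:
        pending["active"] = False


wait_for_completion(pending, fake_progress)
```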

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in  https://bitbucket.org/rhc/ompi-oob2


WHAT:    Rewrite of ORTE OOB

WHY:       Support asynchronous progress and a host of other features

WHEN:    Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module can reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.
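
The subnet-based reachability test mentioned in the list above can be sketched roughly as follows. This is an illustration in Python using the standard `ipaddress` module, not the OOB/TCP code itself; the function names and the `modules` layout (one entry per NIC, each with its address/mask pairs) are assumptions made for the example.

```python
import ipaddress


def reachable(nic_interfaces, peer_addresses):
    """Return True if any peer address lies inside any address/mask
    range configured on the NIC.

    nic_interfaces: strings such as "192.168.1.7/24" or "fd00::7/64".
    peer_addresses: address strings taken from the peer's contact info.
    """
    networks = [ipaddress.ip_interface(i).network for i in nic_interfaces]
    for addr in peer_addresses:
        ip = ipaddress.ip_address(addr)
        for net in networks:
            # Only compare like with like (IPv4 vs IPv4, IPv6 vs IPv6).
            if ip.version == net.version and ip in net:
                return True
    return False


def pick_module(modules, peer_addresses):
    """Return the name of the first module (NIC) that can reach the peer,
    or None so the caller can fall through to other components."""
    for name, interfaces in modules.items():
        if reachable(interfaces, peer_addresses):
            return name
    return None


# Hypothetical two-NIC node; eth1 carries both an IPv4 and an IPv6 address.
modules = {
    "eth0": ["10.0.0.5/24"],
    "eth1": ["192.168.1.7/24", "fd00::7/64"],
}
```

Returning `None` from `pick_module` corresponds to the case where no TCP module reaches the peer and the message is handed to other components. As the known limitations below note, a pure subnet comparison does not handle routing via gateways.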


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC
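
The per-peer sequence numbering described in the last item could look roughly like this. It is a minimal Python sketch of the idea, not code from the RML (the commit lists this as something still needed); the `Sender` and `OrderedReceiver` classes are hypothetical.

```python
from collections import defaultdict


class Sender:
    """Assigns an increasing sequence number per destination peer."""

    def __init__(self):
        self._next_seq = defaultdict(int)

    def stamp(self, peer, payload):
        seq = self._next_seq[peer]
        self._next_seq[peer] += 1
        return seq, payload


class OrderedReceiver:
    """Holds early arrivals and releases messages in sequence order."""

    def __init__(self):
        self._expected = defaultdict(int)   # next sequence number per peer
        self._held = defaultdict(dict)      # peer -> {seq: payload}

    def receive(self, peer, seq, payload):
        """Record a message and return the payloads that are now
        deliverable, in order (possibly an empty list)."""
        self._held[peer][seq] = payload
        delivered = []
        while self._expected[peer] in self._held[peer]:
            delivered.append(self._held[peer].pop(self._expected[peer]))
            self._expected[peer] += 1
        return delivered
```

A message that overtakes an earlier one, for instance because it travelled over a different transport, is held until the gap is filled.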

This commit was SVN r29058.
2013-08-22 16:37:40 +00:00
Ralph Castain
991e59a58a Update MCA param in platform file
This commit was SVN r29039.
2013-08-16 22:18:22 +00:00
Ralph Castain
b0a98b2b16 Update platform files
This commit was SVN r28994.
2013-08-03 11:23:44 +00:00
Ralph Castain
72dc8f1f6e Blasted typo
This commit was SVN r28991.
2013-08-02 19:18:33 +00:00
Ralph Castain
f81cbad3e3 Fix platform files so trunk tarball can build
This commit was SVN r28989.
2013-08-02 16:22:51 +00:00
Nathan Hjelm
ba8bfeded0 lanl: clean up tlcc platform files
No review necessary.

cmr=v1.7.3:reviewer=ompi-gk1.7

This commit was SVN r28976.
2013-08-01 19:54:29 +00:00
Ralph Castain
37db1727a2 Refs trac:3710
Simplify the whole stripping of prefix method by consolidating it into a single MCA param. Allow for multiple prefixes to be stripped, each separated in the param by a comma. If no prefix is given, or the specified prefix isn't in the nodename, then just use the hostname itself.
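
The rule described above can be sketched in a few lines of Python. This is an illustration only, not the ORTE implementation: the function name is hypothetical, and it assumes a prefix must match at the start of the node name and that the first matching prefix wins.

```python
def strip_prefix(hostname, param):
    """Strip the first matching prefix from `hostname`.

    `param` is the comma-separated prefix list from the MCA param
    (None or "" when no prefix is given). If nothing matches, the
    hostname is returned unchanged.
    """
    if not param:
        return hostname
    for prefix in param.split(","):
        if prefix and hostname.startswith(prefix):
            return hostname[len(prefix):]
    return hostname
```

For example, with the param set to `nid,c`, a node named `nid00012` would become `00012`, while `login1` would be left as is.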

This commit was SVN r28974.

The following Trac tickets were found above:
  Ticket 3710 --> https://svn.open-mpi.org/trac/ompi/ticket/3710
2013-08-01 00:32:10 +00:00
Nathan Hjelm
83a3fc2fd2 Add an option to control which hostnames orte_strip_prefix_from_node_names works
on.

This corrects a problem with Cray systems where the login node's hostname
was being stripped causing the login node to be used as a compute node by
mpirun.

cmr=v1.7.3:reviewer=rhc

This commit was SVN r28970.
2013-07-31 18:42:02 +00:00
Nathan Hjelm
278522d8e8 Update LANL platform files for changes in linux memory hook configuration.
No review necessary

cmr=v1.7.3:reviewer=ompi-gk1.7

This commit was SVN r28969.
2013-07-31 17:56:22 +00:00
Ralph Castain
7a73c5dd0b Platform file update
This commit was SVN r28963.
2013-07-30 16:01:55 +00:00
Ralph Castain
d64e45cfa3 Add utility for comparing two code trees
This commit was SVN r28883.
2013-07-20 21:48:23 +00:00
Dave Goodell
94977f9501 add authors-to-cvsimport.pl script
Helpful when creating a git-svn clone of the OMPI repository.

Reviewed by jsquyres@

cmr:v1.7:reviewer=jsquyres

This commit was SVN r28825.
2013-07-17 21:21:15 +00:00
Ralph Castain
10ca1c1b04 Turns out that there was exactly ONE place in all of the OMPI code base that still referred to OPAL_TRACE, though a few places retained the include file for no reason. So no point in letting this sit as it is clearly an unused "feature".
This commit was SVN r28789.
2013-07-14 18:57:20 +00:00
Ralph Castain
5bdc4082ea Remove stale platform directory
This commit was SVN r28777.
2013-07-12 18:49:08 +00:00
Ralph Castain
ee3e3a3dd8 Fix typo for make tarball
This commit was SVN r28776.
2013-07-12 18:48:38 +00:00
Ralph Castain
0230a9d4f8 Hmmm...try and remove these again
This commit was SVN r28775.
2013-07-12 18:47:55 +00:00
Ralph Castain
af950eb8ac Update platform files
This commit was SVN r28774.
2013-07-12 18:45:03 +00:00
Ralph Castain
e093187ff6 Cleanup and rename of platform files
This commit was SVN r28773.
2013-07-12 18:42:16 +00:00
Dave Goodell
5626371196 update-my-copyright.pl now works with Git
This script now takes command line options:
```
./update-my-copyright.pl [options]

--help | -h          This help message
--quiet | -q         Only output critical messages to stdout
--check-only         exit(111) if there are files with copyrights to edit
--search-name=NAME   Set search name to NAME
--formal-name=NAME   Set formal name to NAME
```

The `--check-only` and `--quiet` options are suitable for use in a git
pre-commit script to check for out of date copyright headers.
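
A pre-commit hook built on these two options might look something like the sketch below. It relies only on the documented exit status 111 from `--check-only`; the script path, the function name `check_copyrights`, and the injectable `run` parameter are assumptions made so the example is self-contained.

```python
import subprocess
import sys

NEEDS_UPDATE = 111  # exit status of --check-only when headers need editing


def check_copyrights(run=subprocess.run, script="./update-my-copyright.pl"):
    """Run the copyright check and return an exit status for the hook.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without the real script (assumption).
    """
    result = run([script, "--check-only", "--quiet"])
    if result.returncode == NEEDS_UPDATE:
        print("Copyright headers are out of date; run update-my-copyright.pl",
              file=sys.stderr)
        return 1
    # Pass through success (0) or any other failure unchanged.
    return result.returncode
```

A hook file would then end with `sys.exit(check_copyrights())`, so that a non-zero status aborts the commit.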

Reviewed by jsquyres

This commit was SVN r28742.
2013-07-09 14:39:41 +00:00
George Bosilca
c9e5ab9ed1 Our macros for the OMPI-level free list had one extra argument, a possible return
value to signal that the operation of retrieving the element from the free list
failed. However in this case the returned pointer was set to NULL as well, so the
error code was redundant. Moreover, this was a continuous source of warnings when
the picky mode is on.

The attached patch removes the rc argument from the OMPI_FREE_LIST_GET and
OMPI_FREE_LIST_WAIT macros, and changes callers to check whether the item is
NULL instead of using the return code.

This commit was SVN r28722.
2013-07-04 08:34:37 +00:00
Joshua Ladd
0b5c1f2ea8 Add 'generic' support for PMI2 (previously, we checked for PMI2 only on Cray systems.) If your resource manager (e.g. SLURM) has support for PMI2, then the --with-pmi configure flag will enable its usage. If you don't have PMI2, then you will fall back to regular old PMI1. This patch was submitted by Ralph Castain and reviewed and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd
This commit was SVN r28666.
2013-06-21 15:28:14 +00:00
Nathan Hjelm
e61a1aa865 Update LANL XE-6 platform files
This commit was SVN r28574.
2013-05-30 18:33:27 +00:00
Jeff Squyres
f85dca0285 Fix spelling mistake that has been there for a long, long time...
This commit was SVN r28562.
2013-05-24 22:28:15 +00:00
Ralph Castain
850dbe77ec Update platform files
This commit was SVN r28448.
2013-05-05 14:35:13 +00:00
Ralph Castain
700034cda3 Update platform files
This commit was SVN r28406.
2013-04-27 00:09:58 +00:00
Nathan Hjelm
bdd6d35eeb update LANL platform files
This commit was SVN r28375.
2013-04-24 15:46:44 +00:00
Ralph Castain
7a5172a280 Update platform files
This commit was SVN r28339.
2013-04-16 20:40:09 +00:00
Jeff Squyres
dc47473e6d Update the update-my-copyright.pl script to also be able to handle git
checkouts.

This commit was SVN r28318.
2013-04-09 22:08:03 +00:00
Ralph Castain
112fd70da1 Update platform file
This commit was SVN r28307.
2013-04-08 23:30:28 +00:00
Ralph Castain
a8fa2bd1dd Update platform files
This commit was SVN r28304.
2013-04-08 13:20:01 +00:00
Ralph Castain
6909346306 Update platform file
This commit was SVN r28301.
2013-04-07 15:49:32 +00:00
Ralph Castain
1c26a6e5b8 For some reason, tree spawn isn't working on my cluster after reboot - so turn it off so MTT doesn't bomb
This commit was SVN r28296.
2013-04-06 17:33:12 +00:00
Ralph Castain
701a170387 Update platform files
This commit was SVN r28277.
2013-04-03 17:06:07 +00:00
Jeff Squyres
aa8e45367a Use the proper git command to revert a file ("git checkout VERSION")
This commit was SVN r28271.
2013-04-02 12:05:30 +00:00
Ralph Castain
db77484ceb Update show_load_errors param
This commit was SVN r28263.
2013-03-28 16:52:34 +00:00
Nathan Hjelm
cf377db823 MCA/base: Add new MCA variable system
Features:
 - Support for an override parameter file (openmpi-mca-param-override.conf).
   Variable values in this file can not be overridden by any file or environment
   value.
 - Support for boolean, unsigned, and unsigned long long variables.
 - Support for true/false values.
 - Support for enumerations on integer variables.
 - Support for MPIT scope, verbosity, and binding.
 - Support for command line source.
 - Support for setting variable source via the environment using
   OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
 - Cleaner API.
 - Support for variable groups (equivalent to MPIT categories).

Notes:
 - Variables must be created with a backing store (char **, int *, or bool *)
   that must live at least as long as the variable.
 - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
   mca_base_var_set_value() to change the value.
 - String values are duplicated when the variable is registered. It is up to
   the caller to free the original value if necessary. The new value will be
   freed by the mca_base_var system and must not be freed by the user.
 - Variables with constant scope may not be settable.
 - Variable groups (and all associated variables) are deregistered when the
   component is closed or the component repository item is freed. This
   prevents a segmentation fault from accessing a variable after its component
   is unloaded.
 - After some discussion we decided we should remove the automatic registration
   of component priority variables. Few components actually made use of this
   feature.
 - The enumerator interface was updated to be general enough to handle
   future uses of the interface.
 - The code to generate ompi_info output has been moved into the MCA variable
   system. See mca_base_var_dump().
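
The effect of the override file on value resolution can be sketched as a simple precedence lookup. This Python fragment is illustrative only: the commit message states that override-file values cannot be overridden by any file or environment value, but the relative order of the remaining sources shown here, and the override file outranking the command line, are assumptions, as are all the names.

```python
# Lowest to highest precedence. Only the top position of "override" over
# "file" and "env" comes from the commit message; the rest is assumed.
PRECEDENCE = ["default", "file", "env", "command_line", "override"]


def resolve(settings):
    """Pick the value from the highest-precedence source present.

    settings: dict mapping a source name from PRECEDENCE to a value.
    Returns (value, source).
    """
    for source in reversed(PRECEDENCE):
        if source in settings:
            return settings[source], source
    raise KeyError("no value available from any source")
```

For instance, a value set in both the environment and the override file resolves to the override-file value.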

opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system

This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.

This commit was SVN r28236.
2013-03-27 21:09:41 +00:00
Nathan Hjelm
9d1041b058 fix typo. should have been field 1 not 2
This commit was SVN r28235.
2013-03-27 17:51:11 +00:00
Nathan Hjelm
4719e545f0 Add support for git to make_dist_tarball
This commit was SVN r28234.
2013-03-27 17:48:09 +00:00
Brian Barrett
1aa9e74767 s/openib/verbs/
This commit was SVN r28151.
2013-03-06 19:00:54 +00:00
Ralph Castain
5b09cccacc Revise the build-ignore script for git:
1. remove the "die if not dual repo" and automatic "git add" for the .gitignore as we might want to run this script outside of a dual repo.

2. put the results in a single .gitignore file at the top so it mimics the mercurial script and is easier to copy to a git repo

3. don't prefix the entries with "./" as git doesn't recognize the entry if you do

This commit was SVN r28148.
2013-03-06 14:53:51 +00:00
Brian Barrett
04ac6c4d6f fix typo
This commit was SVN r28144.
2013-03-01 22:21:48 +00:00
Brian Barrett
be361cf91b First take at kitten config file
This commit was SVN r28143.
2013-03-01 22:21:22 +00:00
Ralph Castain
40e0b7be36 Update platform file
This commit was SVN r28138.
2013-02-28 20:18:42 +00:00
Ralph Castain
a4b6fb241f Remove all remaining vestiges of the Windows integration
This commit was SVN r28137.
2013-02-28 17:31:47 +00:00
Jeff Squyres
a951fde1ec Run a final "svn up" at the end of a successful gkcommit so that we
get a stable svnversion number (thereby allowing the next merge).

This commit was SVN r28128.
2013-02-27 15:04:51 +00:00
Ralph Castain
8d2fa3693b First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package.
This commit was SVN r28116.
2013-02-26 20:44:56 +00:00
Joshua Ladd
70ad711337 Backing out the Open SHMEM project
This commit was SVN r28050.
2013-02-12 17:45:27 +00:00
Mike Dubman
ff384daab4 Added new project: oshmem.
This commit was SVN r28048.
2013-02-12 15:33:21 +00:00