Commit graph

25 commits

Author SHA1 Message Date
Greg Koenig
60485ff95f This is a very large change to rename several #define values from
OMPI_* to OPAL_*.  This allows opal layer to be used more independent
from the whole of ompi.

NOTE: 9 "svn mv" operations immediately follow this commit.

This commit was SVN r21180.
2009-05-06 20:11:28 +00:00
Ralph Castain
54b2cf747e These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.

This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:

As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.

In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.

The incoming changes revamp these procedures in three ways:

1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.

The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.

Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
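
For readers unfamiliar with the collective named in item 1, here is a toy, single-process model of what an allgather delivers: every rank contributes one entry and every rank ends up with the full set. It is only a sketch. The function name, the entry strings, and the gather-then-copy approach are illustrative assumptions, not the orte_grpcomm API or the algorithm used by the "basic" component.

```python
# Toy model of an allgather result (hypothetical names, not the orte_grpcomm API).

def allgather(contributions):
    """Return, for each rank, the complete list of all ranks' contributions."""
    # Step 1: collect every rank's entry in rank order.
    gathered = list(contributions)
    # Step 2: give every rank its own copy of the complete list.
    return [list(gathered) for _ in contributions]


# Hypothetical modex entries: one opaque blob of contact info per rank.
modex_entries = [f"contact-info-for-rank-{rank}" for rank in range(4)]
views = allgather(modex_entries)
```

Because every rank must hold every entry at the end, the total data delivered grows with the square of the process count, which is why the size of each entry matters so much in item 2.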


2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.

The size of this data has been reduced in three ways:

(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.

To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.

(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.

(c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.

While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
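
The per-entry reductions in (a) and (b) can be illustrated with Python's struct module. This is a sketch under stated assumptions: the field layouts, the use of signed integers (inferred from the "32k" limit quoted for 16-bit fields), and the example hostname are all hypothetical and are not taken from the ORTE source.

```python
import struct

# Hypothetical layouts for a process name: (jobid, vpid) as two signed integers.
NAME_32BIT = struct.Struct("<ii")  # previous default: two 32-bit fields
NAME_16BIT = struct.Struct("<hh")  # new default: two 16-bit fields


def max_signed(bits):
    """Largest value a signed integer of the given width can hold."""
    return 2 ** (bits - 1) - 1


bytes_saved_per_name = NAME_32BIT.size - NAME_16BIT.size

# (b) An integer nodeid versus a string nodename (the hostname is made up).
example_nodename = "node0123.cluster.example.org"
nodename_bytes = len(example_nodename.encode("ascii"))
nodeid_bytes = struct.calcsize("<h")
```

Under these assumptions the name shrinks from 8 bytes to 4, max_signed(16) gives the 32k ceiling mentioned in (a), and the nodeid takes 2 bytes where the example hostname takes 28.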


3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.

It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.

Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
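
The routed path and the savings estimate in item 3 can be sketched as follows. This is a toy model, not ORTE code: the function name, the tuple representation of hops, and the node names are hypothetical, and delivering same-node messages through the single local orted is an assumption. "32k" is read as 32,000 so the result matches the rounded figures quoted above.

```python
def routed_path(src_proc, dst_proc, node_of):
    """Hops a message takes when the orteds act as routers (toy model)."""
    src_daemon = ("orted", node_of[src_proc])
    dst_daemon = ("orted", node_of[dst_proc])
    path = [("proc", src_proc), src_daemon]
    # Assumption: if both procs share a node, the local orted delivers directly.
    if dst_daemon != src_daemon:
        path.append(dst_daemon)
    path.append(("proc", dst_proc))
    return path


# Hypothetical placement of three procs on two nodes.
node_of = {0: "nodeA", 1: "nodeA", 2: "nodeB"}
cross_node = routed_path(0, 2, node_of)
same_node = routed_path(0, 1, node_of)

# Savings from no longer sharing OOB contact info with every proc.
PROCS = 32_000                 # "32k", rounded as in the message
CONTACT_BYTES_PER_PROC = 50    # "~50 bytes/process"
bytes_per_startup_message = PROCS * CONTACT_BYTES_PER_PROC
total_bytes_saved = bytes_per_startup_message * PROCS
```

With these numbers bytes_per_startup_message is 1.6 million bytes and total_bytes_saved is 51.2 billion bytes, in line with the 1.6MBytes and 51GBytes quoted in the message.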


There are a few minor additional changes in the commit that I'll just note in passing:

* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.

* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.

* cleanup of some stale header files

This commit was SVN r16364.
2007-10-05 19:48:23 +00:00
Brian Barrett
6f8b366acb Rename liborte to libopen-rte and libopal to libopen-pal per telecon today
and bug #632.

Refs trac:632

This commit was SVN r12762.

The following Trac tickets were found above:
  Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632
2006-12-05 18:27:24 +00:00
Brian Barrett
1398169700 * forgot to fix up includes in the test directory with yesterday's commit.
This commit was SVN r8996.
2006-02-12 19:51:24 +00:00
Brian Barrett
566a050c23 Next step in the project split, mainly source code re-arranging
- move files out of toplevel include/ and etc/, moving it into the
    sub-projects
  - rather than including config headers with <project>/include, 
    have them as <project>
  - require all headers to be included with a project prefix, with
    the exception of the config headers ({opal,orte,ompi}_config.h
    mpi.h, and mpif.h)

This commit was SVN r8985.
2006-02-12 01:33:29 +00:00
Ralph Castain
4b9f015c0b Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list.
This commit was SVN r8912.
2006-02-07 03:32:36 +00:00
Jeff Squyres
42ec26e640 Update the copyright notices for IU and UTK.
This commit was SVN r7999.
2005-11-05 19:57:48 +00:00
Brian Barrett
ed56e743b7 * update configure.ac to use the modern version of AC_INIT and
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
  number to be set at autoconf time (instead of at configure time, as
  it was before).  Set the version number, minus the subversion r number,
  at autoconf time.  Override the internal variables to include the r
  number (if needed) at configure time.  Basically, the right thing
  should always happen.  The only place it might not is the version
  reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE takes a list of options, no need to specify
  them in all the Makefile.am files.
* Adds support for subdir-objects, meaning that object files are put
  in the directory containing source files, even if the Makefile.am is
  in another directory.  This should start making it feasible to
  reduce the number of Makefile.am files we have in the tree, which
  will greatly reduce the time to run autogen and configure.

This commit was SVN r7211.
2005-09-07 05:54:53 +00:00
Brian Barrett
07b589100e * add test for init_finalize of orte (useful for memory leak checks)
* update ORTE tests to cope with change in prototype for orte_init()

This commit was SVN r7081.
2005-08-29 19:32:46 +00:00
Brian Barrett
e2e18d49a3 * add trivial opal init/finalize app
This commit was SVN r7011.
2005-08-24 20:20:21 +00:00
Jeff Squyres
9dab81d86b A bunch of updates to the unit tests
- Update svn:ignore's to match new executable names
- Consolidate the unit test Makefile.am flags into a testing
  Makefile.options 
- Remove a bunch of SUBDIRS from test/mca/Makefile so that they don't
  run by default, but can be invoked manually (they're still in
  DIST_SUBDIRS) 

This commit was SVN r6598.
2005-07-23 11:11:19 +00:00
Brian Barrett
b04c726ad1 Fix up tests so that they all compile and (mostly) run
This commit was SVN r6338.
2005-07-04 14:53:10 +00:00
Brian Barrett
ccd2624e3f * rename ompi_progress to opal_progress
This commit was SVN r6326.
2005-07-03 21:57:43 +00:00
Ralph Castain
689a290711 Add one further degree of separation between opal and orte - allow separate init of the two systems. This allows the restart capability to avoid hitting opal utilities (e.g., mca_base_open, ompi_output_init) repeatedly.
Clean up the ignores as well.

This commit was SVN r5811.
2005-05-22 18:40:03 +00:00
Ralph Castain
7b6db8a18f Can now start/finalize/restart the run-time without crashing.
Add a unit test for that functionality - will test more fully next week.

This commit was SVN r5806.
2005-05-22 03:11:33 +00:00
Brian Barrett
de128a69fb Skip test when on old LinuxThreads machines and using progress threads
since you can't fork() in one thread and waitpid() on the child in another,
which is what this test expects you to do.  If Linux would just implement
the stupid POSIX standard already, this wouldn't be a problem.

This commit was SVN r5482.
2005-04-21 19:33:18 +00:00
Ralph Castain
c52a21e1b3 Fix a couple of minor bugs that were preventing the session directory system from completely cleaning up. Unit test now shows that it will cleanup all session directory levels IF no files are present.
This commit was SVN r5210.
2005-04-07 19:19:48 +00:00
Jeff Squyres
1701010301 Back out this patch; it appears to break in at least 64 bit
environments.  Working on the fix, but don't break everyone's unit
tests while I'm working on it -- will re-commit once ompi_setenv() and
ompi_unsetenv() are fixed.

This commit was SVN r5166.
2005-04-04 22:55:26 +00:00
Jeff Squyres
38b814b0cc Convert to use portable ompi_setenv() and ompi_unsetenv()
This commit was SVN r5164.
2005-04-04 22:17:48 +00:00
Jeff Squyres
65017ac13c Add some printf's just so that one can see what is going on with the
test.  :-)

This commit was SVN r5039.
2005-03-26 13:09:21 +00:00
Jeff Squyres
3f5541349a Add UC copyright
This commit was SVN r5009.
2005-03-24 12:43:37 +00:00
Brian Barrett
9b2b3ec078 * Make the sigchld test work again
This commit was SVN r4987.
2005-03-22 15:24:45 +00:00
Brian Barrett
4b6aecf82d * Fix up the test directory so everything uses Automake's "make check" to
build / run.  Only things that actually build / run right now are the
  asm and class tests.  The mca tests probably will with a static build
  but that hasn't been verified

This commit was SVN r4979.
2005-03-22 04:25:01 +00:00
Brian Barrett
aa70a35fea * Sync trunk to r4977 of the tim branch
This commit was SVN r4978.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r4977
2005-03-22 00:31:17 +00:00
Brian Barrett
6822a519bb * results from initial merge of the tim branch into the trunk. Compiles and
ompi_info works, but that's all that has been tested.

This commit was SVN r4827.
2005-03-14 20:57:21 +00:00