openmpi

Author	SHA1	Message	Date
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. 
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. 
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k * 50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Brian Barrett	1398169700	* forgot to fix up includes in the test directory with yesterday's commit. This commit was SVN r8996.	2006-02-12 19:51:24 +00:00
Brian Barrett	566a050c23	Next step in the project split, mainly source code re-arranging - move files out of toplevel include/ and etc/, moving them into the sub-projects - rather than including config headers with <project>/include, have them as <project> - require all headers to be included with a project prefix, with the exception of the config headers ({opal,orte,ompi}_config.h, mpi.h, and mpif.h) This commit was SVN r8985.	2006-02-12 01:33:29 +00:00
Ralph Castain	4b9f015c0b	Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list. This commit was SVN r8912.	2006-02-07 03:32:36 +00:00
Jeff Squyres	42ec26e640	Update the copyright notices for IU and UTK. This commit was SVN r7999.	2005-11-05 19:57:48 +00:00
Brian Barrett	ed56e743b7	* update configure.ac to use the modern version of AC_INIT and AM_INIT_AUTOMAKE, instead of the deprecated version. * Work around dumbness in modern AC_INIT that requires the version number to be set at autoconf time (instead of at configure time, as it was before). Set the version number, minus the subversion r number, at autoconf time. Override the internal variables to include the r number (if needed) at configure time. Basically, the right thing should always happen. The only place it might not is the version reported as part of configure --help will not have an r number. * Since AM_INIT_AUTOMAKE takes a list of options, no need to specify them in all the Makefile.am files. * Adds support for subdir-objects, meaning that object files are put in the directory containing source files, even if the Makefile.am is in another directory. This should start making it feasible to reduce the number of Makefile.am files we have in the tree, which will greatly reduce the time to run autogen and configure. This commit was SVN r7211.	2005-09-07 05:54:53 +00:00
Brian Barrett	07b589100e	* add test for init_finalize of orte (useful for memory leak checks) * update ORTE tests to cope with change in prototype for orte_init() This commit was SVN r7081.	2005-08-29 19:32:46 +00:00
Brian Barrett	e2e18d49a3	* add trivial opal init/finalize app This commit was SVN r7011.	2005-08-24 20:20:21 +00:00
Jeff Squyres	9dab81d86b	A bunch of updates to the unit tests - Update svn:ignore's to match new executable names - Consolidate the unit test Makefile.am flags into a testing Makefile.options - Remove a bunch of SUBDIRS from test/mca/Makefile so that they don't run by default, but can be invoked manually (they're still in DIST_SUBDIRS) This commit was SVN r6598.	2005-07-23 11:11:19 +00:00
Brian Barrett	b04c726ad1	Fix up tests so that they all compile and (mostly) run This commit was SVN r6338.	2005-07-04 14:53:10 +00:00
Brian Barrett	ccd2624e3f	* rename ompi_progress to opal_progress This commit was SVN r6326.	2005-07-03 21:57:43 +00:00
Ralph Castain	689a290711	Add one further degree of separation between opal and orte - allow separate init of the two systems. This allows the restart capability to avoid hitting opal utilities (e.g., mca_base_open, ompi_output_init) repeatedly. Clean up the ignores as well. This commit was SVN r5811.	2005-05-22 18:40:03 +00:00
Ralph Castain	7b6db8a18f	Can now start/finalize/restart the run-time without crashing. Add a unit test for that functionality - will test more fully next week. This commit was SVN r5806.	2005-05-22 03:11:33 +00:00
Brian Barrett	de128a69fb	Skip test when on old LinuxThreads machines and using progress threads since you can't fork() in one thread and waitpid() on the child in another, which is what this test expects you to do. If Linux would just implement the stupid POSIX standard already, this wouldn't be a problem. This commit was SVN r5482.	2005-04-21 19:33:18 +00:00
Ralph Castain	c52a21e1b3	Fix a couple of minor bugs that were preventing the session directory system from completely cleaning up. Unit test now shows that it will cleanup all session directory levels IF no files are present. This commit was SVN r5210.	2005-04-07 19:19:48 +00:00
Jeff Squyres	1701010301	Back out this patch; it appears to break in at least 64 bit environments. Working on the fix, but don't break everyone's unit tests while I'm working on it -- will re-commit once ompi_setenv() and ompi_unsetenv() are fixed. This commit was SVN r5166.	2005-04-04 22:55:26 +00:00
Jeff Squyres	38b814b0cc	Convert to use portable ompi_setenv() and ompi_unsetenv() This commit was SVN r5164.	2005-04-04 22:17:48 +00:00
Jeff Squyres	65017ac13c	Add some printf's just so that one can see what is going on with the test. :-) This commit was SVN r5039.	2005-03-26 13:09:21 +00:00
Jeff Squyres	3f5541349a	Add UC copyright This commit was SVN r5009.	2005-03-24 12:43:37 +00:00
Brian Barrett	9b2b3ec078	* Make the sigchld test work again This commit was SVN r4987.	2005-03-22 15:24:45 +00:00
Brian Barrett	4b6aecf82d	* Fix up the test directory so everything uses Automake's "make check" to build / run. Only things that actually build / run right now are the asm and class tests. The mca tests probably will with a static build but that hasn't been verified This commit was SVN r4979.	2005-03-22 04:25:01 +00:00
Brian Barrett	aa70a35fea	* Sync trunk to r4977 of the tim branch This commit was SVN r4978. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r4977	2005-03-22 00:31:17 +00:00
Brian Barrett	6822a519bb	* results from initial merge of the tim branch into the trunk. Compiles and ompi_info works, but that's all that has been tested. This commit was SVN r4827.	2005-03-14 20:57:21 +00:00

24 commits