openmpi

Author	SHA1	Message	Date
George Bosilca	1299ed433e	Don't release the ODLS twice. This commit was SVN r16430.	2007-10-11 17:30:03 +00:00
Jeff Squyres	3500376d9e	Remove a warning about an unused label. This commit was SVN r16429.	2007-10-11 16:38:37 +00:00
Ralph Castain	3dbd4d9be7	Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by: 1. taking advantage of the fact that we no longer create the launch message via a GPR trigger. In earlier times, we had the GPR create the launch message based on a subscription. In that mode of operation, we could not guarantee the order in which the data was stored in the message - hence, we had no choice but to parse the message in a loop that checked each value against a list of possible "keys" until the corresponding value was found. Now, however, we construct the message "by hand", so we know precisely what data is in each location in the message. Thus, we no longer need to send the character string "keys" for each data value any more. This represents a rather large savings in the message size - to give you an example, we typically would use a 30-char "key" for a 2-byte data value. As you can see, the overhead can become very large. 2. sending node-specific data only once. Again, because we used to construct the message via subscriptions that were done on a per-proc basis, the data for each node (e.g., the daemon's name, whether or not the node was oversubscribed) would be included in the data for each proc. Thus, the node-specific data was repeated for every proc. Now that we construct the message "by hand", there is no reason to do this any more. Instead, we can insert the data for a specific node only once, and then provide the per-proc data for that node. We therefore not only save all that extra data in the message, but we also only need to parse the per-node data once. The savings become significant at scale. 
Here is a comparison between the revised trunk and the trunk prior to this commit (all data was taken on odin, using openib, 64 nodes, unity message routing, tested with application consisting of mpi_init/mpi_barrier/mpi_finalize, all execution times given in seconds, all launch message sizes in bytes):

Per-node scaling, taken at 1ppn:

#nodes    original trunk     revised trunk
          time    size       time    size
1         0.10    819        0.09    564
2         0.14    1070       0.14    677
3         0.15    1321       0.14    790
4         0.15    1572       0.15    903
8         0.17    2576       0.20    1355
16        0.25    4584       0.21    2259
32        0.28    8600       0.27    4067
64        0.50    16632      0.39    7683

Per-proc scaling, taken at 64 nodes:

ppn       original trunk     revised trunk
          time    size       time    size
1         0.50    16669      0.40    7720
2         0.55    32733      0.54    11048
3         0.87    48797      0.81    14376
4         1.0     64861      0.85    17704

Condensing those numbers, it appears we gained:

per-node message size: 251 bytes/node -> 113 bytes/node
per-proc message size: 251 bytes/proc -> 52 bytes/proc
per-job message size: 568 bytes/job -> 399 bytes/job (job-specific data such as jobid, override oversubscribe flag, total #procs in job, total slots allocated)

The fact that the two pre-commit trunk numbers are the same confirms the fact that each proc was containing the node data as well. It isn't quite the 10x message reduction I had hoped to get, but it is significant and gives much better scaling. Note that the timing info was, as usual, pretty chaotic - the numbers cited here were typical across several runs taken after the initial one to avoid NFS file positioning influences. Also note that this commit removes the orte_process_info.vpid_start field and the handful of places that passed that useless value. By definition, all jobs start at vpid=0, so all we were doing is passing "0" around. In fact, many places simply hardwired it to "0" anyway rather than deal with it. This commit was SVN r16428.	2007-10-11 15:57:26 +00:00
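As a rough cross-check of the per-node and per-proc figures quoted in r16428, the slopes can be recomputed from the message sizes listed in that commit message. This is only an illustrative sketch: the variable names and the `slope` helper are made up here and are not part of Open MPI, and only the sizes (not the timings) are used.

```python
# Launch message sizes in bytes, copied from the r16428 commit message.
NODES = [1, 2, 3, 4, 8, 16, 32, 64]            # per-node scaling, 1 proc per node
ORIGINAL_BY_NODES = [819, 1070, 1321, 1572, 2576, 4584, 8600, 16632]
REVISED_BY_NODES = [564, 677, 790, 903, 1355, 2259, 4067, 7683]

PPN = [1, 2, 3, 4]                             # per-proc scaling, fixed node count
ORIGINAL_BY_PPN = [16669, 32733, 48797, 64861]
REVISED_BY_PPN = [7720, 11048, 14376, 17704]
NODES_IN_PPN_RUN = 64


def slope(xs, ys):
    """Average growth in y per unit of x between the first and last sample."""
    return (ys[-1] - ys[0]) / (xs[-1] - xs[0])


# Bytes added per extra node when each node runs a single proc.
per_node_original = slope(NODES, ORIGINAL_BY_NODES)
per_node_revised = slope(NODES, REVISED_BY_NODES)

# Bytes added per extra proc: growth per ppn step, spread over the 64 nodes.
per_proc_original = slope(PPN, ORIGINAL_BY_PPN) / NODES_IN_PPN_RUN
per_proc_revised = slope(PPN, REVISED_BY_PPN) / NODES_IN_PPN_RUN

print(f"per node: {per_node_original:.0f} -> {per_node_revised:.0f} bytes")
print(f"per proc: {per_proc_original:.0f} -> {per_proc_revised:.0f} bytes")
```

The recomputed slopes agree with the 251 -> 113 bytes/node and 251 -> 52 bytes/proc figures given in the message.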
Rolf vandeVaart	25c95c9ee9	Fix build on solaris. Need to include sys/wait.h. This commit was SVN r16426.	2007-10-11 15:04:30 +00:00
Jeff Squyres	e2df42eea3	Move the <sys/wait.h> below "orte_config.h" This commit was SVN r16424.	2007-10-11 11:31:09 +00:00
George Bosilca	7cc9f588a8	Decorate the base functions with ORTE_DECLSPEC. This commit was SVN r16423.	2007-10-11 00:02:49 +00:00
Ralph Castain	53af94fd87	Modify the configure system so that gridengine support is only built in specific conditions: 1. --with-sge, always builds 2. --without-sge, never builds 3. if neither is specified, build if and only if either SGE_ROOT is set or "qrsh" is found in the path This commit was SVN r16422.	2007-10-10 21:39:16 +00:00
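The three build conditions listed in r16422 amount to a small decision rule. The sketch below restates that rule in Python purely for illustration; the real logic lives in the configure scripts, and the function name, its parameters, and the use of `shutil.which` to stand in for "found in the path" are assumptions made here.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def should_build_gridengine(
    with_sge: Optional[bool],
    environ: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Hypothetical restatement of the rule described in r16422.

    with_sge is True for --with-sge, False for --without-sge,
    and None when neither option was given.
    """
    if with_sge is not None:
        # An explicit option always wins.
        return with_sge
    # Otherwise build if and only if SGE_ROOT is set or qrsh is on the path.
    return "SGE_ROOT" in environ or which("qrsh") is not None


# Neither option given, SGE_ROOT set: gridengine support would be built.
print(should_build_gridengine(None, {"SGE_ROOT": "/opt/sge"}, lambda name: None))
```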
Josh Hursey	6e5341c659	Forgot to move a header in the code movement. This commit was SVN r16420.	2007-10-10 15:39:40 +00:00
Ralph Castain	82a8e2d10d	Reorganize the odls framework to place common functionality in the base, thus making maintenance easier. We still need this to be a framework as some environments (e.g., bproc) require significantly different functionality. However, there is quite a bit of commonality across the components, so this ensures that fixes in one get propagated across the others. This patch also fixes a minor bug discovered along the way: we had "lost" the passing of the oversubscribed condition flag from the mapper to the orteds. Thus, we were not setting sched_yield correctly when in oversubscribed conditions (except when a hostfile was specified - different logic there because we treat the number of slots allocated on the node as "uncertain") I did not modify the process component in this patch - I will send a proposed patch to the maintainers of that component so they can review it first. This commit was SVN r16418.	2007-10-10 15:02:10 +00:00
Josh Hursey	7f833a9cb2	silence a warning that is triggered on restart This commit was SVN r16417.	2007-10-10 14:25:49 +00:00
Ethan Mallove	d0b61db65c	Add in a missing #include for Solaris builds. This commit was SVN r16416.	2007-10-10 12:49:15 +00:00
George Bosilca	e3105a85be	Don't require a progress function from the PML. If there is one, then the PML base will take care of the registration with the event library. Otherwise (and this applies to the CM case), the MTLs are in charge of registering their own progress function. This commit was SVN r16415.	2007-10-09 23:28:53 +00:00
Josh Hursey	aa8391f888	Local and global coordinators should be the only ones involved in the movement of checkpoint files. This reduces the overhead on the application. This commit was SVN r16412.	2007-10-09 19:52:47 +00:00
Galen Shipman	6a25a635de	that shouldn't have slipped through.. This commit was SVN r16411.	2007-10-09 19:07:23 +00:00
Galen Shipman	fda1306807	revert my stupidity.. This commit was SVN r16410.	2007-10-09 19:01:20 +00:00
Galen Shipman	6b051e255e	already checked size.. no need to do it again.. This commit was SVN r16409.	2007-10-09 18:59:10 +00:00
Nysal Jan	b51d85fb3f	Fix assertion failure "assert( 0 == btl_endpoint->endpoint_cache_length )" while executing mt_coll testcase. This commit was SVN r16408.	2007-10-09 18:00:01 +00:00
Galen Shipman	62ade993ca	Separate finalize and close for the PML; this gives the PML a chance to complete any outstanding operations prior to close. Before this change we just called pml_finalize in pml_close, which causes problems if there are outstanding events that a BTL/MTL needs to progress during finalize. The problem is that MPI_COMM_WORLD and others were destroyed prior to closing the PML, pml_close would call pml_finalize, events would progress in the BTL, and these events expected MPI_COMM_WORLD to still be around. This commit was SVN r16405.	2007-10-09 15:28:56 +00:00
Andrew Friedley	c15047b264	Add LLNL copyright to the file I modified yesterday. This commit was SVN r16404.	2007-10-09 15:18:23 +00:00
Josh Hursey	8fe2ef5647	a missing include This commit was SVN r16402.	2007-10-09 14:32:36 +00:00
Josh Hursey	06a30e7f3a	Add a quick check to make sure the BLCR being used has a working cr_request. If it doesn't (version < 0.6.0), then fall back to fork/exec of the cr_checkpoint command. This commit was SVN r16400.	2007-10-09 13:51:28 +00:00
Andrew Friedley	fd51d9cf28	The call to opal_list_insert() had an off by one error (I think), causing selected components to get lost with certain load orderings. I went ahead and rewrote the code to use opal_list_insert_pos() instead, which gives a cleaner flow and more speed. This commit was SVN r16392.	2007-10-08 23:01:36 +00:00
Josh Hursey	7437f37e96	This commit contains the following: * Fix some missing includes in a few places. * Add the cr_request() functionality to the BLCR CRS component. We are now dependent upon the 0.6.* series of BLCR. * Made the CR notification mechanism a registered function. This way we can have an OPAL-only version and it can be replaced at runtime with the ORTE version. * Add an 'opal_cr_allow_opal_only' parameter that will enable OPAL-only CR functionality when the user wants it. Default: Disabled. * Fix the placement of a checkpoint request check in MPI_Init * Pull the OPAL notification mechanism into the SnapC framework. * We no longer fork/exec the 'opal-checkpoint' command for local checkpointing, the Local coordinator in the orted does this directly. * The Local and Application coordinator talk together bypassing the OPAL notification mechanism. * Optimized the Local <-> App Coordinator communication. * Improved the structure used to track vpid_snapshots in the local coord. * Fix a race condition in which an application under heavy communication load may produce an inconsistent global checkpoint. This commit was SVN r16389.	2007-10-08 20:53:02 +00:00
Galen Shipman	1c1b9d5480	make cray happy This commit was SVN r16377.	2007-10-08 14:31:59 +00:00
Jeff Squyres	f92d9097d8	Some more changes to update to coll v1.1.0 that were missed yesterday. This actually exposed a very, very long-standing bug where part of the coll base was incorrectly checking the coll API version against the MCA API version. When coll went to v1.1 (yesterday) and was no longer the same as the MCA v1.0, the test started failing. This commit fixes to check for v1.1 everywhere in the coll base, and to ensure to check coll framework/API version numbers against coll framework/API version numbers (vs. against the MCA API version number). This commit was SVN r16373.	2007-10-07 12:20:22 +00:00
Jeff Squyres	3d34bff596	No technical/functional changes: simply change the name of the "data" parameter to "module" everywhere, just to be a little more clear what the purpose of that parameter is. This commit was SVN r16372.	2007-10-07 08:36:45 +00:00
Jeff Squyres	b5abb12c98	Commit Ralph's fix for MPI_APPNUM. This commit was SVN r16371.	2007-10-06 18:54:43 +00:00
Torsten Hoefler	e985812e1f	fixing a comment to be more detailed about opal_output_open functionality ... This commit was SVN r16370.	2007-10-06 17:33:57 +00:00
Jeff Squyres	74fd678de8	Fix a help message to also show the default value. This commit was SVN r16369.	2007-10-06 14:25:38 +00:00
Jeff Squyres	fc2b4376e9	Update forgotten macro. This commit was SVN r16368.	2007-10-06 14:11:35 +00:00
Jeff Squyres	fff1057597	We do ''not'' want the orted to yield the processor more than necessary (it already uses blocking semantics while waiting for events on fd's, so it's not taking more cycles from MPI applications than necessary), or this can/will cause lengthy delays in orted processing. The problem we most recently saw was with routed OOB messages getting significantly delayed before being delivered to the target MPI processes. This was problematic for BTLs that use OOB messaging for wireup. There is a lengthy comment in orted_main.c that describes this in more detail. Setting the orted to not voluntarily call yield() prevents this bad behavior. The orted only runs in small bursts anyway (and blocks the rest of the time), so this is not harmful to application performance. This commit was SVN r16365.	2007-10-06 09:24:51 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. 
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. 
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
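The contact-info savings estimate near the end of r16364 is simple arithmetic, spelled out below as a sketch. It assumes "32k" is read as 32,000 procs, which is the reading that reproduces the quoted totals, and it takes the ~50 bytes/process figure from the message at face value; the constant names are made up for illustration.

```python
# Assumption: "32k" is read as 32,000, which matches the totals in the message.
PROCS = 32_000
# Approximate OOB contact info per process, as quoted in the commit message.
CONTACT_BYTES_PER_PROC = 50

# Contact info for every proc that was carried in one STG1 startup message.
bytes_per_startup_message = PROCS * CONTACT_BYTES_PER_PROC

# That message was sent to every proc in the job.
total_bytes = bytes_per_startup_message * PROCS

print(f"per startup message: {bytes_per_startup_message / 1e6:.1f} MBytes")
print(f"across the job:      {total_bytes / 1e9:.1f} GBytes")
```

With those inputs the per-message figure is 1.6 MBytes and the job-wide figure is 51.2 GBytes, in line with the roughly 51GBytes cited.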
Jelena Pjesivac-Grbovic	ada43fef9e	This fixes bug #1157 in coll/self module. All vector functions had incorrect handling of the offset. This commit was SVN r16360.	2007-10-05 17:40:16 +00:00
Jeff Squyres	f92154fc72	Gah -- ompi_info doesn't setup the connect pseudo component, so it'll be NULL. Ensure to protect for this. This commit was SVN r16333.	2007-10-04 18:03:56 +00:00
Jeff Squyres	13fa7ae93e	It's not necessary to link against all 3 libs (in fact, we shouldn't do it -- let libtool pull them in via the .la file if it needs to) This commit was SVN r16332.	2007-10-04 18:01:30 +00:00
Jeff Squyres	80ce974291	Fixes trac:1156: ensure to finalize the "connect" sub-component. This commit was SVN r16330. The following Trac tickets were found above: Ticket 1156 --> https://svn.open-mpi.org/trac/ompi/ticket/1156	2007-10-04 17:36:12 +00:00
Andrew Friedley	2e66590993	Fix mistakes in the basic component.. can't call collectives on the communicator and always pass the basic module.. have to give them the module off the communicator. This commit was SVN r16329.	2007-10-04 16:29:24 +00:00
Galen Shipman	77f080575f	fix for the cray.. This commit was SVN r16317.	2007-10-03 19:25:23 +00:00
Jeff Squyres	7b0fe8b152	Revert r15900; the variable was already named correctly. This fixes static builds for OMPI components that required extra LIBS or LDFLAGS (e.g., the openib BTL). Fixes trac:1155. This commit was SVN r16314. The following SVN revision numbers were found above: r15900 --> open-mpi/ompi@50941ec389 The following Trac tickets were found above: Ticket 1155 --> https://svn.open-mpi.org/trac/ompi/ticket/1155	2007-10-03 06:46:39 +00:00
Andrew Friedley	5be7f5e2dc	fixes trac:1154 Check if an exclusion string (i.e. '-mca btl ^sm') was provided; if so OFUD just disables itself. This commit was SVN r16307. The following Trac tickets were found above: Ticket 1154 --> https://svn.open-mpi.org/trac/ompi/ticket/1154	2007-10-02 20:37:16 +00:00
Tim Mattox	ed7fd5ad90	Put in a 1.2.5 NEWS section on the trunk. This commit was SVN r16301.	2007-10-02 14:33:46 +00:00
Tim Prins	34966edaf1	remove unneeded and never-initialized lock. The orte_ns.assign_tag function does all the locking we need for us. This commit was SVN r16299.	2007-10-02 14:22:29 +00:00
Gleb Natapov	60af46d541	We have QP description in component structure, module structure and endpoint. Each one of them has a field to store QP type, but this is redundant. Store qp type only in one structure (the component one). This commit was SVN r16272.	2007-09-30 16:14:17 +00:00
Gleb Natapov	9c04b127f5	Forgot to put this fix in the previous commit. This commit was SVN r16271.	2007-09-30 15:33:20 +00:00
Gleb Natapov	3a15d645be	Remove lcl_qp_attr from endpoint qp description. It is used during init only. This commit was SVN r16270.	2007-09-30 15:29:35 +00:00
Brian Barrett	3a0067249c	The previous hack to deal with Libtool not speaking Objective C stopped working with Automake 1.10. This is a new hack, which should be much more flexible. The ras doesn't contain any Objective C, so remove the hack entirely from that Makefile.am. This commit was SVN r16269.	2007-09-30 03:40:25 +00:00
Rolf vandeVaart	a87267ef92	Fix a build error on Solaris. MAXHOSTNAMELEN is defined in netdb.h. This commit was SVN r16268.	2007-09-28 20:15:28 +00:00
Brian Barrett	48c49cb89c	Handle case where modex_recv_string() isn't implemented (ie, the Cray) This commit was SVN r16267.	2007-09-28 18:50:37 +00:00
Tim Prins	1d1d0f6d4c	Fix segfault when user provides a working directory for comm_spawn. Thanks to Murat Knecht for reporting this and suggesting a fix. This commit was SVN r16266.	2007-09-27 23:30:40 +00:00
Tim Prins	b161732af9	Be sure to restore the library flags in case of error. Thanks to Ake Sandgren for pointing this out. This commit was SVN r16263.	2007-09-27 21:35:52 +00:00

10565 commits