openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
George Bosilca	2e2bf472ff	Mark the orte_abort function as noreturn and change the return value from int to void. This function call exit at the end, so there is no way to return from there. Apply the same thing to the errmsg_abort function and update all components. This commit was SVN r15704.	2007-07-31 16:09:52 +00:00
Ralph Castain	4fff584a68	Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that did start. The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system. Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed. Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief. With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn. Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put". This commit was SVN r14711.	2007-05-21 18:31:28 +00:00
Ralph Castain	18b2dca51c	Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly. There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd really have to try). This also involved a slight change to the oob.xcast API, so propagated that as required. Note: this has only been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-) Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately. This commit was SVN r14475.	2007-04-23 18:41:04 +00:00
Rolf vandeVaart	bf5113198d	Update to orte-clean so it will remove files on local and remote nodes. It will also kill off rogue orteds and orterun processes. The killing of processes is ifdef'ed out for Windows since I do not know how to do it there. Note that this change will requite an autogen. This commit was SVN r13477.	2007-02-03 00:25:42 +00:00
Rolf vandeVaart	fdf44cc4ab	Add the ability to not only report broken files and directories, but remove them also. This current set of changes will affect nothing as no one is making use of this ability. However, orte-clean will be changed soon to utilize this new feature. This commit was SVN r12996.	2007-01-04 21:48:34 +00:00
Ralph Castain	a3be8261fb	Fix a bug that had us generate an error message and abort startup when there were stale universe directories around. Now, we just ignore them. This commit was SVN r12472.	2006-11-07 21:34:57 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
George Bosilca	f52c10d18e	And ORTE is ready for prime-time. All Windows tricks are in: - use the OPAL functions for PATH and environment variables - make all headers C++ friendly - no unamed structures - no implicit cast. Plus a full implementation for the orte_wait functions. This commit was SVN r11347.	2006-08-23 03:32:36 +00:00
George Bosilca	6afa4c6c64	Windows friendly version. We have to split the OMPI_DECLSPEC in at least 3 different macros, one for each project. Therefore, now we have OPAL_DECLSPEC, ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project. This commit was SVN r11270.	2006-08-20 15:54:04 +00:00
Josh Hursey	0a931f9fad	Brining over the session directory and universe changes from the tmp/jjhursey-ft-cr branch. In this commit we change the way universe names are created. Before we by default first created "default-universe" then if there was a conflict we created "default-universe-PID" where PID is the PID of the HNP. Now we create "default-universe-PID" all the time (when a default universe name is used). This makes it much easier when trying to find a HNP from an outside app (e.g. orte-ps, orteconsole, ...) This also adds a "search" function to find all of the universes on the machine. This is useful in many contexts when trying to find a persistent daemon or when trying to connect to a HNP. This commit also makes orte_universe_t an opal_object_t, which is something that needed to happen, and only effected the SDS in one of it's base functions. I was asked to bring this over to aid in fixing orteconsole and orteprobe. Due to the change of orte_universe_t to an object orteprobe may need to be updated to reflect this change. Since orteprobe needs to be looked at anyway I'll leave this to Ralph to take care of. Note: These changes do not depend upon any of the FT work (but the FT work does depend upon them). These were brought over to help in fixing some of the ORTE tool set that require the functionality layed out in this patch. Testing: Ran the 'ibm' tests before and after this change, and all was as well as before the change. If anyone notices additional irregularities in the system let me know. But none are expected. This commit was SVN r10550.	2006-06-28 21:03:31 +00:00
Brian Barrett	566a050c23	Next step in the project split, mainly source code re-arranging - move files out of toplevel include/ and etc/, moving it into the sub-projects - rather than including config headers with <project>/include, have them as <project> - require all headers to be included with a project prefix, with the exception of the config headers ({opal,orte,ompi}_config.h mpi.h, and mpif.h) This commit was SVN r8985.	2006-02-12 01:33:29 +00:00
Jeff Squyres	42ec26e640	Update the copyright notices for IU and UTK. This commit was SVN r7999.	2005-11-05 19:57:48 +00:00
Josh Hursey	4eefb33182	Some param changes: - Change orte_base_infrastructre to orte_infrastructre to conform with ompi_info's needs - Move MCA Param registration in ORTE to a centralized function that is called first in orte_init_stage1 - Set the infrastructre flag as an argument to orte_init - Adjust initalization functions to properly pass down the infrastructre flag. This commit was SVN r7053.	2005-08-26 20:13:35 +00:00
Brian Barrett	9f44b80291	* rename ompi_argv to opal_argv * rename ompi_basename to opal_basename * rename ompi bitop functions to opal * rename ompi_cmd_line to opal_cmd_line * rename ompi_sizet2int to opal_sizet2int * rename orte_daemon_init to opal_daemon_init * rename ompi_few to opal_few This commit was SVN r6330.	2005-07-04 00:13:44 +00:00
Jeff Squyres	35c141aef6	While we're moving directories around, move ompi/mpi/runtime -> ompi/runtime, for consistency and parallel-ness will orte/runtime. Also remove a few useless #includes along the way. This commit was SVN r6317.	2005-07-03 12:07:29 +00:00
Jeff Squyres	1b18979f79	Initial population of orte tree This commit was SVN r6266.	2005-07-02 13:42:54 +00:00

18 Коммитов