openmpi

Author	SHA1	Message	Date
Josh Hursey	2c736873bb	Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors. The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge. The fix is to have the OMPI level shutdown tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit. Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it. * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level. * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components. * Update ft_event functions in PML and BML to handle the new restart state. * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging. This commit was SVN r18276.	2008-04-24 17:54:22 +00:00
Josh Hursey	cc83d41ad9	Merge in tmp/jjh-scratch {{{ svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch . }}} Contains: * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart. * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P. * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry * Some other sundry cleanup items all dealing with C/R functionality in the trunk. This commit was SVN r18241.	2008-04-23 00:17:12 +00:00
Adrian Knoth	84e4013530	Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set. This commit was SVN r18164.	2008-04-16 09:31:15 +00:00
Adrian Knoth	0ddfff4ffe	Added new oob-tcp parameter oob_tcp_disable_family. Like btl_tcp_disable_family, this parameter more or less disables a whole address family. Though the sockets are still created, the corresponding information isn't added to the connection strings. Likewise, we don't try to connect to addresses matching the disabled address family. This is particularly important for multidomain clusters, where IPv4 is often filtered (firewalled), sometimes by simply dropping the packets instead of rejecting them (thus causing a connection timeout instead of a quick "no route to host"). This commit was SVN r18163.	2008-04-16 09:22:00 +00:00
Ralph Castain	11c6773c83	Commit a patch from Brian that fixes potential segfaults in systems where IPv6 include files are found, but the kernel doesn't actually support IPv6. This commit was SVN r18106.	2008-04-09 12:53:24 +00:00
Adrian Knoth	a56b9b1df1	Fix broken build with --disable-ipv6. This commit was SVN r18071.	2008-04-02 10:53:48 +00:00
Ralph Castain	39c2680e9a	Silence warning This commit was SVN r18057.	2008-04-01 13:42:16 +00:00
Ralph Castain	3e8846d685	Some code cleanups from Brian to clarify port selection and opening logic This commit was SVN r18055.	2008-04-01 12:39:02 +00:00
Ralph Castain	60d931217f	Modify the routed framework to allow greater control/flexibility over response to lost routes and initial wireup of jobs as required by several soon-to-come new modules. Specifically, add two new APIs: 1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job. Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections. 2. get_wireup_info: returns (in a provided buffer) info required to wireup connections for the specified job. Some routed modules do not need to return any info as they can wireup via alternative means, while some need to xchg data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job. The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module. This commit was SVN r17969.	2008-03-26 01:00:24 +00:00
Ralph Castain	f8642e9390	Add debug to tell us when we opened a socket and to whom This commit was SVN r17911.	2008-03-21 15:47:47 +00:00
Ralph Castain	19ffdfef42	Add some debugging output to tell us what interfaces were considered and used by OOB This commit was SVN r17909.	2008-03-21 15:35:40 +00:00
Ralph Castain	27a73ad9ee	Fix a race condition between the orteds and HNP that can cause the orteds to output the "lost lifeline" message. This has been a long-time problem. I tried to reduce the problem by having the orteds tell the HNP they were finalizing, and having the HNP wait until all orteds had reported or we timed out. What was observed was that all the orteds were correctly reporting that they are leaving, but the HNP is able to exit before the orteds, thus closing the orteds lifeline socket and generating the error output. This is caused by the fact that the orteds have to whack all remaining session directories, which includes that blasted monster shared memory file! Cleaning up the SM file can take quite a while. The HNP doesn't have that problem as there is no SM file there! So it gets out first. What we had done in the past to resolve that problem was put a little test in the OOB that checks to see if we are finalizing. If we are, then we ignore the lifeline connection being lost. That check was still in the code - however, we had lost the line in orte_finalize that set the flag!! This commit was SVN r17893.	2008-03-20 13:30:51 +00:00
Ralph Castain	ec64bf3da8	Clarify the error output so we can understand if it was a daemon or process that lost its lifeline This commit was SVN r17880.	2008-03-19 19:06:52 +00:00
Ralph Castain	ff99aa054f	In order to prevent orphaned processes when using non-unity routing methods, the procs need to realize that their local daemon is a critical connection - if that connection unexpectedly closes, they need to terminate. This commit adds definition for a "lifeline" connection. For an HNP, there is no lifeline, so the lifeline proc is NULL. For a daemon, the lifeline is the HNP - the daemon should abort if it loses that connection. For a proc using unity routed, the lifeline is the HNP since it connects directly to the HNP. For a proc using tree routed, the lifeline is the local daemon. Adjusted OOB to call abort if the lifeline (as opposed to HNP) connection is lost. This commit was SVN r17761.	2008-03-06 15:30:44 +00:00
Tim Prins	5de3e1965e	Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte. Everything should work, however I am unable to compile and test the sctp BTL. This commit was SVN r17751.	2008-03-05 22:44:35 +00:00
Ralph Castain	6450962d59	Add some debugging to the message event object. Cleanup some no-longer-used values This commit was SVN r17671.	2008-02-29 20:10:31 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
George Bosilca	eb71a634c6	Don't forget to initialize the msg_origin field. This commit was SVN r17055.	2008-01-04 23:24:49 +00:00
George Bosilca	48f5a26e8c	Cast to keep VC happy (quiet). This commit was SVN r17054.	2008-01-04 23:13:32 +00:00
Adrian Knoth	42d5fe62f9	Fixed misplaced #endif This commit was SVN r17028.	2008-01-01 11:02:38 +00:00
Jeff Squyres	213b5d5c6e	Per long threads on the mailing list and much confusion and discussion about linkers, have all OPAL, ORTE, and OMPI components '''not''' link against the OPAL, ORTE, or OMPI libraries. See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a better-formatted version of the same info). This commit was SVN r16968.	2007-12-15 13:32:02 +00:00
Jeff Squyres	c20350b943	Patch submitted by Brian Barrett, inspired by this thread: http://www.open-mpi.org/community/lists/users/2007/11/4547.php. - Better handling of ECONNABORTED from connect on Linux. - Reduce extraneous output from OOB when TCP connections must be retried. This commit was SVN r16808.	2007-11-30 21:42:15 +00:00
George Bosilca	d67c0eefb4	Remove a compilation warning about using uninitialized variables. This commit was SVN r16589.	2007-10-26 20:15:28 +00:00
George Bosilca	b1b5cb6453	Looks like SO_REUSEPORT is not defined on some platforms. Switch to the conventional SO_REUSEADDR instead. This commit was SVN r16588.	2007-10-26 19:56:21 +00:00
George Bosilca	337f78a4a8	Restrict the port range for the OOB and the BTL. Each protocol (v4 and v6) has its own range, which is defined by a min value and a range. By default there is no limitation on the port range, which is exactly the same behavior as before. This commit was SVN r16584.	2007-10-26 16:36:51 +00:00
Jeff Squyres	5637c7a5a0	In addition to r16513, this commit fixes trac:1170. If we cannot resolve the route to the peer that we're trying to send to, don't queue up the message in the TCP OOB -- instead, return it to the upper layer (e.g., the RML) and let it decide what to do. In the case of the routed RML, the tree component will queue it up for later transmission. Hence, we don't want the message queued up both here in the TCP OOB and the tree routed. Also see some more discussion / explanation in #1171. This commit was SVN r16540. The following SVN revision numbers were found above: r16513 --> open-mpi/ompi@7ae9589d70 The following Trac tickets were found above: Ticket 1170 --> https://svn.open-mpi.org/trac/ompi/ticket/1170	2007-10-22 13:46:57 +00:00
Jeff Squyres	abf1b728b9	Minor code maintenance fix -- put the THREAD_UNLOCK outside the if statement so that you only have to have it once. This commit was SVN r16512.	2007-10-19 12:36:26 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. 
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. 
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
George Bosilca	e5d316dba6	Coverity: fix issues with using a string after it gets freed. The problem is that mca_base_register_string doesn't set the result to NULL if an error occurs. This commit was SVN r16108.	2007-09-12 18:16:53 +00:00
Shiqing Fan	548a4fe943	- Use IOVBASE_TYPE instead of char to avoid warnings on some systems. This commit was SVN r16092.	2007-09-11 16:24:23 +00:00
Shiqing Fan	c1065d8262	- Some more type casts. This commit was SVN r16087.	2007-09-11 11:28:43 +00:00
Brian Barrett	59524a9009	Fix issue where we set state to SHUTDOWN rather than CONNECTING when we had to switch socket types. This commit was SVN r15784.	2007-08-06 22:55:41 +00:00
Rainer Keller	2c5d07217d	- Coverity: use snprintf, instead of sprintf.... This commit was SVN r15669.	2007-07-29 11:23:23 +00:00
Brian Barrett	f06b61cff9	Don't use the OOB TCP key for contact information, remove the need to include a not so public header file. Fixes a compile error on the Cray. This commit was SVN r15613.	2007-07-25 15:12:07 +00:00
George Bosilca	00796cfdab	Make sure the oob_tcp_windows_progress_callback is registered in all cases. This is now done in the oob tcp open function. As a result, the unregistering has to be done in the close function. This commit was SVN r15603.	2007-07-25 05:55:14 +00:00
George Bosilca	c961cb5749	The Windows support is now back in business. This commit was SVN r15599.	2007-07-25 03:55:34 +00:00
Brian Barrett	4e23c7c5a2	Fixes for case where IPv6 support is disabled. Fixes trac:1102. This commit was SVN r15584. The following Trac tickets were found above: Ticket 1102 --> https://svn.open-mpi.org/trac/ompi/ticket/1102	2007-07-24 17:01:39 +00:00
Brian Barrett	5b9fa7e998	reapply r15517 and r15520, which were removed in r15527 so that I could get the RML/OOB merge in slightly easier This commit was SVN r15530. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95 r15520 --> open-mpi/ompi@9cbc9df1b8 r15527 --> open-mpi/ompi@2d17dd9516	2007-07-20 02:34:29 +00:00
Brian Barrett	39a6057fc6	A number of improvements / changes to the RML/OOB layers: * General TCP cleanup for OPAL / ORTE * Simplifying the OOB by moving much of the logic into the RML * Allowing the OOB RML component to do routing of messages * Adding a component framework for handling routing tables * Moving the xcast functionality from the OOB base to its own framework Includes merge from tmp/bwb-oob-rml-merge revisions: r15506, r15507, r15508, r15510, r15511, r15512, r15513 This commit was SVN r15528. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r15506 r15507 r15508 r15510 r15511 r15512 r15513	2007-07-20 01:34:02 +00:00
Brian Barrett	2d17dd9516	temporarily back out r15517 and r15520 so that I can get the RML / OOB changes to cleanly apply This commit was SVN r15527. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95	2007-07-20 01:10:34 +00:00
Ralph Castain	41977fcc95	Remove the cellid field from the orte_process_name_t structure. This only affects a handful of files in itself, but... Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point. Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings. This commit was SVN r15517.	2007-07-19 20:56:46 +00:00
Ralph Castain	bd65f8ba88	Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. Short description: major changes include - 1. singletons now fork/exec a local daemon to manage their operations. 2. the orte daemon code now resides in libopen-rte 3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided. I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine. This commit was SVN r15390.	2007-07-12 19:53:18 +00:00
Brian Barrett	1d02b9e7b5	Fix a bunch of issues exposed by Ken Cain in getting Open MPI to work with VxWorks. Still some issues remaining, I'm sure. Refs trac:1010 This commit was SVN r15320. The following Trac tickets were found above: Ticket 1010 --> https://svn.open-mpi.org/trac/ompi/ticket/1010	2007-07-10 03:46:57 +00:00
Brian Barrett	f8fb1e9720	Fix some compile failures on Solaris 9 because it doesn't have V6ONLY. This commit was SVN r15237.	2007-06-28 18:52:15 +00:00
Ralph Castain	e653da1d11	Where or where did that patch go??? Ah - there it went! ;-) Fix singleton operations - allow multiple xcasts to be queued. This commit was SVN r15097.	2007-06-15 13:45:29 +00:00
George Bosilca	a4d99ddef6	More synchronizations for the Windows version. The problem came from the multiple threads accessing the OOB/registry asynchronously via the callbacks. The quickest solution (but definitely not the cleanest) is to serialize these callbacks in such a way that at any given time only one thread can execute a callback. This commit was SVN r15086.	2007-06-14 22:35:38 +00:00
George Bosilca	fb9ff5cc75	Don't remove the tcp events from the list, they will remove themselves in the destructor. This commit was SVN r15085.	2007-06-14 22:33:09 +00:00
George Bosilca	95a607b945	A more Windows friendly version. As the socket event will be generated through the win dll using multiple threads, we have to ensure that the oob callbacks happen only in a synchronous way or really bad things happen with the current design (blocking messages from a receive callback). This commit was SVN r15069.	2007-06-14 04:38:06 +00:00
Ralph Castain	5adef03179	Clean up a diagnostic so it only outputs when requested This commit was SVN r15048.	2007-06-13 15:53:10 +00:00
George Bosilca	715f6012cf	The DSS pack function can use the const attribute for the src field as it is never modified by the pack functions directly. Enforce it all over the code base. This commit was SVN r15026.	2007-06-12 22:47:14 +00:00
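
The messaging estimate quoted in the r16364 commit message (about 50 bytes of OOB contact info per process, in a 32k-process job) can be checked with a short calculation. This is only a sketch: it assumes "32k" means 32,000 and that the units are decimal, and the variable names are illustrative rather than taken from the Open MPI sources.

```python
# Assumption: "32k" is read as 32,000 processes; units are decimal (1 MB = 1e6 bytes).
procs = 32_000
contact_bytes_per_proc = 50  # "~50 bytes/process" as stated in the commit message

# Contact info for every process, carried in one startup message.
message_bytes = procs * contact_bytes_per_proc

# That message is delivered to each of the processes in the job.
total_bytes = message_bytes * procs

print(f"{message_bytes / 1e6:.1f} MB per startup message")
print(f"{total_bytes / 1e9:.1f} GB across the whole job")
```

Under these assumptions the figures come out at 1.6 MB per message and 51.2 GB in total, which matches the rounded numbers in the commit message.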


203 commits