openmpi

Author	SHA1	Message	Date
Brian Barrett	8b778903d8	Fix longstanding issue with our multi-project support. Rather than using pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is always set to {datadir,libdir,includedir}/openmpi. This will keep us from having help files in prefix/share/open-rte when building without Open MPI, but in prefix/share/openmpi when building with Open MPI. This commit was SVN r30140.	2014-01-07 22:11:15 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. 
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. 
If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. 
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Ralph Castain	8d2fa3693b	First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package. This commit was SVN r28116.	2013-02-26 20:44:56 +00:00
Brian Barrett	312f37706e	In talking about this with Jeff and Ralph, we don't actually need ompi_show_help, because opal_show_help is replaced with an aggregating version when using ORTE, so there's no reason to directly call orte_show_help. This commit was SVN r28051.	2013-02-12 21:10:11 +00:00
Brian Barrett	f42783ae1a	Move the RTE framework change into the trunk. With this change, all non-CR runtime code goes through one of the rte, dpm, or pubsub frameworks. This commit was SVN r27934.	2013-01-27 23:25:10 +00:00
Samuel Gutierrez	4c28c8cbd0	New sm BTL initialization take two. This approach is pretty simple. Instead of using the modex or RML to share sm initialization information, have node rank 0 create a file containing initialization information in a well-known place. Then during add_procs, the rest of the node processes requiring sm BTL initialization will just read from that file to complete their initialization. This commit was SVN r27789.	2013-01-11 16:24:56 +00:00
Samuel Gutierrez	c4acd20eb9	Backout r27739. This commit was SVN r27745. The following SVN revision numbers were found above: r27739 --> open-mpi/ompi@a159bfaf25	2013-01-05 01:54:23 +00:00
Samuel Gutierrez	a159bfaf25	sm BTL initialization via modex, as discussed at last year's meeting. This commit was SVN r27739.	2013-01-03 21:52:20 +00:00
Samuel Gutierrez	6188d97e1a	Getting out of bed this morning was a bad idea... Reverting the sm update once more because it breaks direct launch. Will address this issue and commit the update once it has all been tested. Sorry everyone! This commit was SVN r27001.	2012-08-10 22:20:38 +00:00
Samuel Gutierrez	159bd2e62e	Let's try this again: sm BTL initialization via modex. This commit was SVN r26989.	2012-08-10 20:12:36 +00:00
Samuel Gutierrez	6a70063812	Yikes - that's not right! Back out 26987. I'll try again in a bit... Sorry! This commit was SVN r26988.	2012-08-10 19:57:51 +00:00
Samuel Gutierrez	2c80273246	sm BTL initialization via modex. This commit was SVN r26987.	2012-08-10 19:51:41 +00:00
Shiqing Fan	204fbfe4b1	update the wv btl component. This commit was SVN r26872.	2012-07-26 15:35:01 +00:00
Samuel Gutierrez	76d94bf9bf	Plug leak. Thanks, Nathan. This commit was SVN r26846.	2012-07-23 21:11:21 +00:00
Samuel Gutierrez	8096852a16	Towards RML-less shared-memory initialization (primarily for eventual BTL move). Extended common sm API with: mca_common_sm_module_create and mca_common_sm_module_attach. Please note that the new routines aren't currently used -- but will be... This commit was SVN r26845.	2012-07-23 19:38:13 +00:00
Josh Hursey	28681deffa	Backout the ORCA commit. :( There is a linking issue on Mac OSX that needs to be addressed before this is able to come back into the trunk. This commit was SVN r26676.	2012-06-27 01:28:28 +00:00
Josh Hursey	542330e3a7	Commit of ORCA: Open MPI Runtime Collaborative Abstraction This is a runtime interposition project that sits between the OMPI and ORTE layers in Open MPI. The project is described on the wiki: https://svn.open-mpi.org/trac/ompi/wiki/Runtime_Interposition And on this email thread: http://www.open-mpi.org/community/lists/devel/2012/06/11109.php This commit was SVN r26670.	2012-06-26 21:42:16 +00:00
Samuel Gutierrez	63869c431b	init seg_num_procs_inited to zero before the atomic add. This commit was SVN r25710.	2012-01-11 03:37:23 +00:00
Samuel Gutierrez	d1a44ecd34	send packed buffers instead of using iovecs in common sm rml. this commit will hopefully resolve the periodic bus errors that some mtt tests have been encountering. This commit was SVN r25692.	2012-01-05 00:11:59 +00:00
Samuel Gutierrez	519f71ab7e	silences valgrind warning in common sm (Syscall param writev(vector[...]) points to uninitialised byte(s)). probably also silences a large stack allocation warning in coverity. This commit was SVN r25666.	2011-12-16 23:17:48 +00:00
Samuel Gutierrez	0ca6603fa0	remove some unused cruft in shmem. minor common sm cleanup. This commit was SVN r25665.	2011-12-16 22:43:55 +00:00
Samuel Gutierrez	375162c693	this commit fixes a few things. 1. silence warning in common sm. 2. remove unneeded config code in common sm. 3. move opal_shmem_base_close to a better place in opal_finalize. 4. fix opal_path_nfs output. This commit was SVN r25518.	2011-11-28 23:41:19 +00:00
Samuel Gutierrez	b4edf0ff5c	getting ready for 1.5 port of the shared memory enhancements. remove some unused/unneeded stuff and minor style update. This commit was SVN r25513.	2011-11-28 16:08:32 +00:00
George Bosilca	efd88e10d7	Cleanup the error codes. Get rid of all the useless ones, and mark the distinction between ORTE and OMPI errors. This commit was SVN r25323.	2011-10-19 03:51:53 +00:00
Samuel Gutierrez	81f38b258a	commit of new shared memory backing facility framework (shmem) and its components. This commit was SVN r24795.	2011-06-21 15:41:57 +00:00
Samuel Gutierrez	0867454a06	Fixes CID #1665 . This commit was SVN r24519.	2011-03-12 03:41:49 +00:00
Samuel Gutierrez	5cff21842a	a friday night in sf, nm. fixes CID 1666. This commit was SVN r24517.	2011-03-12 02:39:31 +00:00
Jeff Squyres	ec90a3ba6d	Fix a few memory leaks, and ensure that coll sm is also registering the common SM MCA params. This commit was SVN r24497.	2011-03-08 17:36:59 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Samuel Gutierrez	c25945ce48	remove one more extra semicolon This commit was SVN r23954.	2010-10-26 17:30:34 +00:00
Samuel Gutierrez	e1589a2a28	remove an extra semi-colon This commit was SVN r23953.	2010-10-26 17:23:30 +00:00
Jeff Squyres	73bcc4a36b	Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers This commit was SVN r23801. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-24 22:53:28 +00:00
Samuel Gutierrez	90a132b0a2	disable system v shared memory support when checkpoint/restart is enabled. this combo could presumably work properly someday. This commit was SVN r23792.	2010-09-22 22:05:07 +00:00
Samuel Gutierrez	1c8f3e1add	fix common sm segf when used with cr - thanks to Ananda for finding this issue. This commit was SVN r23781.	2010-09-20 22:20:43 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example. a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Samuel Gutierrez	3b572e14ce	Fix build issues on Windows. Thanks to Shiqing for pointing this out. This commit was SVN r23646.	2010-08-24 14:01:05 +00:00
Shiqing Fan	a987eafc90	Add another sm definition for ignoring posix sm on Windows, and exclude those source files. This commit was SVN r23640.	2010-08-24 09:28:56 +00:00
Samuel Gutierrez	3b162593e6	New POSIX shared memory component and other common sm enhancements. NOTE: mmap is still the default. Some highlights: o Silent component failover. o The sysv component will only be queried for selection if it is placed before the mmap component (for example, -mca mpi_common_sm sysv,posix,mmap). In the default case, sysv will never be queried/selected. o Per some on-list discussion, now unlinking mmaped file in both mmap and posix components (see: "System V Shared Memory for Open MPI: Request for Community Input and Testing" thread). o Assuming local process homogeneity with respect to all utilized shared memory facilities. That is, if one local process deems a particular shared memory facility acceptable, then ALL local processes should be able to utilize that facility. As it stands, this is an important point because one process dictates to all other local processes which common sm component will be selected based on its own, local run-time test. o Addressed some of George's code reuse concerns. This commit was SVN r23633.	2010-08-23 16:04:13 +00:00
Shiqing Fan	d391c57b0f	A more proper fix for the HANDLE definition. This commit was SVN r23269.	2010-06-14 14:17:07 +00:00
Samuel Gutierrez	2fb7c344fc	Added a new System V (sysv) shared memory component for Open MPI. Configure Option: --enable-sysv MCA Parameter: mpi_common_sm mpi_common_sm accepts a comma delimited list of: [sysv],mmap (order dependent). The first component that is successfully selected is used. For example, -mca mpi_common_sm sysv,mmap will first try sysv. If sysv is not successfully selected, then mmap will be used. mmap will be used if mpi_common_sm is not provided. Notes: Please make certain that your system's shmmax limit, or equivalent, is larger than mpool_sm_min_size. Otherwise, shmget may fail. This commit was SVN r23260.	2010-06-09 16:58:52 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns back the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Samuel Gutierrez	7654b39349	Fix segfault in two error paths. This commit was SVN r22978.	2010-04-15 15:51:57 +00:00
Josh Hursey	3db01f0795	Add the process name to the error message resulting from a failed mmap(), open(), or ftruncate() so that it is slightly easier to figure out which process in the system caused the problem with sm. This commit was SVN r22803.	2010-03-10 00:18:04 +00:00
Samuel Gutierrez	dcb5a2331f	Fixed some typos in comments. This commit was SVN r22801.	2010-03-09 20:41:25 +00:00
Eugene Loh	316892b49f	Fix spelling of "degradation". This commit was SVN r22714.	2010-02-25 19:41:59 +00:00
Jeff Squyres	583394e30b	This help message got a little jumbled. This commit was SVN r22689.	2010-02-23 21:09:16 +00:00
Rainer Keller	548d6f7c61	- Incorporated a rewording proposal by Jeff. This commit was SVN r22670.	2010-02-19 14:37:09 +00:00
Rainer Keller	ea4de16561	- Check whether file is opened on network file-system. If file does not exist, check the directory it lives in... Maybe used by caller, trying to open mmap() on NFS, Lustre or Panasas (thanks Sam). For now, this is used to warn about the usage of mmap on such FS. Please note, that Ralph mentioned the orte_no_session_dir parameter. The help message includes a reference to this. Tested on NFS and Lustre on Linux on smoky: mpirun --mca orte_tmpdir_base $HOME/tmp -np 2 ./mpi_stub jaguar: mpirun ... --mca orte_tmpdir_base /tmp/work/$USER ... Fixes trac:1354 This should cmr:v1.5 once it has soaked and is shown to work on Solaris This commit was SVN r22604. The following Trac tickets were found above: Ticket 1354 --> https://svn.open-mpi.org/trac/ompi/ticket/1354	2010-02-10 23:18:29 +00:00
Jeff Squyres	a6c1fe888f	We also need .so versioning of the OMPI "common" components since they are installed as standalone libraries in $libdir. This commit was SVN r22148.	2009-10-27 20:58:34 +00:00
Jeff Squyres	0d1e177453	Remove 2 extraneous ORTE_ERROR_LOGs and 1 extraneous opal_output. This commit was SVN r22071.	2009-10-07 20:12:37 +00:00
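
The r29058 entry above says that OOB reachability is decided by comparing a peer's contact address against the address/mask pairs of each NIC, and lists as a known limitation that only a shared subnet counts as reachable. The sketch below illustrates that idea only; it is not the Open MPI implementation (which is written in C). The function name `reachable_nics`, the NIC names, and all addresses are hypothetical, and Python's standard `ipaddress` module is used here purely for brevity.

```python
import ipaddress


def reachable_nics(peer_addr, nics):
    """Return the names of NICs whose address/mask range contains peer_addr.

    `nics` maps a (hypothetical) NIC name to a list of "address/prefix"
    strings, because one NIC may carry several addresses, e.g. IPv4 and IPv6.
    """
    peer = ipaddress.ip_address(peer_addr)
    matches = []
    for name, addrs in nics.items():
        for addr in addrs:
            # The network covered by this address/mask pair.
            network = ipaddress.ip_interface(addr).network
            # Compare only within the same IP version, then test membership.
            if peer.version == network.version and peer in network:
                matches.append(name)
                break  # one matching address is enough for this NIC
    return matches


# Hypothetical node with two NICs; eth0 has both an IPv4 and an IPv6 address.
nics = {
    "eth0": ["10.0.0.5/24", "fe80::1/64"],
    "eth1": ["192.168.1.7/24"],
}

print(reachable_nics("10.0.0.42", nics))      # peer on eth0's IPv4 subnet
print(reachable_nics("192.168.1.200", nics))  # peer on eth1's subnet
print(reachable_nics("172.16.0.1", nics))     # no shared subnet
```

A peer that is only reachable through a gateway returns an empty list in this sketch, which mirrors the limitation the commit message states for subnet-only matching.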


118 commits