openmpi

Author	SHA1	Message	Date
Nathan Hjelm	59d09ad9de	orte: fix several small memory leaks grpcomm: fix memory leaks We were leaking the caddy object used to pass data to the callback function. This commit fixes these leaks. oob,rml: fix memory leaks This commit fixes several leaks: - Both the oob/base and oob/tcp were leaking objects on their peer hash tables. Iterate on the hash tables and free any objects. - Leaked sent messages because of missing OBJ_RELEASE. I placed the release in ORTE_RML_SEND_COMPLETE to catch all the possible paths. ess/base: close the state framework cmr=v1.8.2:reviewer=rhc This commit was SVN r31776.	2014-05-15 15:06:27 +00:00
Ralph Castain	11faab1091	The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees. This commit was SVN r31679.	2014-05-08 02:01:35 +00:00
Adrian Reber	34625b360b	use the newly created JOB_STATE_FT_* events This commit was SVN r31021.	2014-03-12 12:37:14 +00:00
Ralph Castain	956aab03a7	Track the origin of a message so it can be passed across transports Refs trac:4184 This commit was SVN r30433. The following Trac tickets were found above: Ticket 4184 --> https://svn.open-mpi.org/trac/ompi/ticket/4184	2014-01-26 21:09:26 +00:00
Ralph Castain	657796f9e0	Revert r30327 - turns out it isn't quite right just yet. :-( Closes trac:4138 This commit was SVN r30328. The following SVN revision numbers were found above: r30327 --> open-mpi/ompi@87d5f86025 The following Trac tickets were found above: Ticket 4138 --> https://svn.open-mpi.org/trac/ompi/ticket/4138	2014-01-18 23:38:39 +00:00
Ralph Castain	87d5f86025	Enable use of unix domain sockets for local OOB communications, thereby removing the requirement for an active network interface when running strictly on a single node. Update the overall OOB system to support cross-transport movement of messages so that the OOB can move a received message to another transport for transmission. cmr=v1.7.5:reviewer=jsquyres:subject=Enable use of unix domain sockets for local OOB communications This commit was SVN r30327.	2014-01-18 21:36:49 +00:00
Ralph Castain	286ff6d552	For large scale systems, we would like to avoid doing a full modex during MPI_Init so that launch will scale a little better. At the moment, our options are somewhat limited as only a few BTLs don't immediately call modex_recv on all procs during startup. However, for those situations where someone can take advantage of it, add the ability to do a "modex on demand" retrieval of data from remote procs when we launch via mpirun. NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message! Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place. This behavior will only take effect under the following conditions: 1. launched via mpirun 2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX 3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to want this behavior to get it. The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes: 1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data". 2. adjust the SM rendezvous logic to loop until the required file has been created Creating a placeholder to bring this over to 1.7.5 when ready. 
cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale This commit was SVN r30259.	2014-01-11 17:36:06 +00:00
Brian Barrett	8b778903d8	Fix longstanding issue with our multi-project support. Rather than using pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is always set to {datadir,libdir,includedir}/openmpi. This will keep us from having help files in prefix/share/open-rte when building without Open MPI, but in prefix/share/openmpi when building with Open MPI. This commit was SVN r30140.	2014-01-07 22:11:15 +00:00
George Bosilca	38cbaeaa82	Try to impose a little bit of consistency on how we parse lists of modules by enforcing the use of OPAL list accessors. This commit was SVN r30045.	2013-12-21 23:23:33 +00:00
Adrian Reber	53a70fe87f	Trying to get the C/R code to compile again. (send_*_nb) This patch changes all send/send_buffer occurrences in the C/R code to send_nb/send_buffer_nb. The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for later re-design) * marked the broken C/R code with ENABLE_FT_FIXED Changes from V2: * just replace the blocking calls with the non-blocking calls * all #ifdef's introduced in V1 are gone * send_* returns error code or ORTE_SUCCESS (not the number of bytes) This commit was SVN r30036.	2013-12-20 21:58:28 +00:00
Adrian Reber	a3813d37c7	Trying to get the C/R code to compile again. (recv_*_nb) This patch changes all recv/recv_buffer occurrences in the C/R code to recv_nb/recv_buffer_nb. The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED). The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for later re-design) * marked the broken C/R code with ENABLE_FT_FIXED Changes from V2: * only #ifdef out the code where the behaviour is changed (used to be blocking; now non-blocking) This commit was SVN r30035.	2013-12-20 21:05:40 +00:00
Adrian Reber	b42aad44a3	Trying to get the C/R code to compile again. This patch includes various fixes all over the C/R code which are hard to group like the other patches. Changes from V1: * explain why mca_base_component_distill_checkpoint_ready no longer works * compare return result of opal functions with OPAL_* values Changes from V2: * use orte_rml_oob_ft_event() instead of referencing through the modules * properly protect variable (thanks to --enable-picky) This commit was SVN r29922.	2013-12-16 15:35:28 +00:00
Ralph Castain	1ff12362da	Cleanup merge conflict that was incorrectly committed This commit was SVN r29851.	2013-12-09 20:20:14 +00:00
Jeff Squyres	ed9aba3896	This patch fixes error: void value not ignored as it ought to be in the C/R code by ignoring the return value of functions which no longer return a value (only void). Signed-off-by: Adrian Reber <adrian.reber@hs-esslingen.de> This commit was SVN r29816.	2013-12-06 14:40:10 +00:00
Ralph Castain	7c23a5ad65	Fix headers when building with ft enabled. Thanks to Adrian Reber for the patch! This commit was SVN r29743.	2013-11-23 22:58:32 +00:00
Ralph Castain	99611ac1d2	Revert r29166 in favor of a better solution from George This commit was SVN r29199. The following SVN revision numbers were found above: r29166 --> open-mpi/ompi@497c7e6abb	2013-09-18 01:41:26 +00:00
Ralph Castain	497c7e6abb	Fixes trac:2904 The intercomm "merge" function can create a linkage between procs that was not reflected anywhere in a modex, and so at least some of the procs in the resulting communicator don't know how to talk to some of the new communicator's peers. For example, consider the case where: 1. parent job A comm_spawns a process (job B) - these processes exchange modex and can communicate 2. parent job A now comm_spawns another process (job C) - again, these can communicate, but the proc in C knows nothing of B 3. do an intercomm merge across the communicators created by the two comm_spawns. This puts B and C into the same communicator, but they know nothing about how to talk to each other as they were not involved in any exchange of contact info. Hence, collectives on that communicator now fail. This fix adds an API to the ompi/dpm framework that (a) exchanges the modex info across the procs in the merge to ensure all procs know how to communicate, and (b) calls add_procs to give the btl's a chance to select transports to any new procs. cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29166. The following Trac tickets were found above: Ticket 2904 --> https://svn.open-mpi.org/trac/ompi/ticket/2904	2013-09-15 15:00:40 +00:00
Ralph Castain	13ae51a91b	Protect against possible race conditions and threads by ensuring that rml send always occurs inside an event. cmr:v1.7.4:reviewer=jsquyres:subject=Protect against race conditions in rml send This commit was SVN r29128.	2013-09-05 01:16:32 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. 
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. 
If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. 
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Ralph Castain	611d7f9f6b	When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require. This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times. Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes: * upon first request for data, have the OPAL db pmi component fetch and decode all the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally * reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test * reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. 
Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued). Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time. This commit was SVN r29040.	2013-08-17 00:49:18 +00:00
Ralph Castain	285429a1c6	Remove release of buffer - non-blocking send callback will do it This commit was SVN r28985.	2013-08-02 03:49:17 +00:00
Jeff Squyres	089c632cce	Remove a bunch of dead code: gcc 4.7 warns of set-but-unused variables. So get rid of them. This commit was SVN r28538.	2013-05-17 21:45:49 +00:00
Ralph Castain	c52b94af8b	Revert r28453 and r28452 - wrong fix This commit was SVN r28454. The following SVN revision numbers were found above: r28452 --> open-mpi/ompi@756ee4b5e0 r28453 --> open-mpi/ompi@6da24143a2	2013-05-06 21:52:17 +00:00
Ralph Castain	6da24143a2	Minor performance improvement This commit was SVN r28453.	2013-05-06 20:27:16 +00:00
Ralph Castain	756ee4b5e0	Update the rml_uri for each proc so debuggers can attach This commit was SVN r28452.	2013-05-06 20:18:14 +00:00
Nathan Hjelm	c041156f60	Update ORTE frameworks to use the MCA framework system. This commit was SVN r28240.	2013-03-27 21:14:43 +00:00
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file can not be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char **, int *, or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE flag enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few components actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump().
opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
Ralph Castain	cf9796accd	Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes This commit was SVN r28134.	2013-02-28 01:35:55 +00:00
Ralph Castain	8d2fa3693b	First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package. This commit was SVN r28116.	2013-02-26 20:44:56 +00:00
Ralph Castain	beddf3b379	Add required rml tag This commit was SVN r27751.	2013-01-05 06:32:20 +00:00
George Bosilca	994d1aba50	Nothing. This commit was SVN r27626.	2012-11-21 20:07:20 +00:00
Nathan Hjelm	bdedd8b0d3	Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1. Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the semantics of mca_base_components_open/close as they are now symmetric in their functionality. This commit was SVN r27570.	2012-11-06 19:09:26 +00:00
Ralph Castain	df642f1508	Add an API to get a remote file's size. Separate dfs cmds from returned data messages so daemons don't get confused. This commit was SVN r27487.	2012-10-25 22:23:08 +00:00
Ralph Castain	094d6f3143	Add a new "distributed file system" capability to support file access operations across nodes that do not have a network file system attached to them. Add a set of URI create/parse utilities This commit was SVN r27483.	2012-10-25 17:15:17 +00:00
Ralph Castain	5d7872fd68	Cleanup the tag list This commit was SVN r27115.	2012-08-22 21:37:58 +00:00
Ralph Castain	e6f3586415	Remove the orte notifier framework, per discussion at the devel meeting and follow-up with Jeff (who took the action item) This commit was SVN r26637.	2012-06-22 18:09:23 +00:00
Ralph Castain	96c778656a	Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP. Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build. This commit was SVN r26606.	2012-06-15 10:15:07 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Jeff Squyres	81dc6a11ee	Fix typo in copyright notice, found by Paul Hargrove This commit was SVN r26070.	2012-02-29 02:02:54 +00:00
Jeff Squyres	b6a90434e4	Fix some include file header ordering issues for some BSDs, suggested by Paul Hargrove. This commit was SVN r25984.	2012-02-21 13:32:14 +00:00
Ralph Castain	9b59d8de6f	This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build. Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations. Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way. This commit was SVN r25497.	2011-11-22 21:24:35 +00:00
Wesley Bland	4e7ff0bd5e	By popular demand the epoch code is now disabled by default. To enable the epochs and the resilient orte code, use the configure flag: --enable-resilient-orte This will define both: ORTE_ENABLE_EPOCH ORTE_RESIL_ORTE This commit was SVN r25093.	2011-08-26 22:16:14 +00:00
Ralph Castain	1c08a4006c	Refactor some code to remove a few API handles from errmgr. Reviewed/tested by Wes. This commit was SVN r25064.	2011-08-18 16:24:45 +00:00
Wesley Bland	09274cd047	Make sure that the epoch is initialized everywhere so we don't get weird output during valgrind. This shouldn't have caused any problems with any actual execution. Just extra warnings in valgrind. This commit was SVN r25015.	2011-08-08 15:11:55 +00:00
Jeff Squyres	3893a5a1de	Fix compile error introduced in r24888. This commit was SVN r24889. The following SVN revision numbers were found above: r24888 --> open-mpi/ompi@e5253647ea	2011-07-13 14:18:00 +00:00
Shiqing Fan	e5253647ea	Fix a type cast. This commit was SVN r24888.	2011-07-13 09:00:17 +00:00
Nathan Hjelm	3f4e5d7dd6	add missing thread lock/unlock around condition_broadcast This commit was SVN r24885.	2011-07-12 15:43:56 +00:00
Nathan Hjelm	c3ec2e2614	fix a potential race condition in rml This commit was SVN r24884.	2011-07-12 15:43:12 +00:00
Ralph Castain	1ee7c39982	Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long. Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again. Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines. This commit was SVN r24860.	2011-07-07 18:54:30 +00:00
Wesley Bland	84be81df95	Standardize the initialization of the EPOCH's. Everyone will be starting at MIN anyway (until we implement restart of course) so there's no reason to set the epoch to INVALID and then immediately reset them to MIN. This way there's less room to make mistakes later. This commit was SVN r24829.	2011-06-28 14:20:33 +00:00
Wesley Bland	e1ba09ad51	Add resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Ralph Castain	391074cde6	Add a tag This commit was SVN r24813.	2011-06-23 15:12:25 +00:00
Ralph Castain	f3cae3d6f3	Cleanup the handling of if_include and if_exclude arguments based on CIDR notation. Fix a bug in the new code that prevented the system from correctly matching addresses. Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide. Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result. Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages. NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing. This commit was SVN r24755.	2011-06-07 02:09:11 +00:00
Ralph Castain	b47ec2ee87	Remove lingering references to opal_profile option This commit was SVN r24709.	2011-05-18 18:27:29 +00:00
George Bosilca	7f34a28c8f	Correct a comment. This commit was SVN r24504.	2011-03-10 00:41:41 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Ralph Castain	b09f57b03d	Update the multicast subsystem - ported from Cisco branch This commit was SVN r24246.	2011-01-13 01:54:05 +00:00
Ralph Castain	2dc5cbb483	Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago. This commit was SVN r24153.	2010-12-05 15:58:21 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	9ea2b196ce	Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component. Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work. This commit was SVN r23966.	2010-10-28 15:22:46 +00:00
Ralph Castain	86c7365e8e	Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem. Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution. This commit was SVN r23943.	2010-10-26 02:41:42 +00:00
Ralph Castain	fceabb2498	Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects. Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems. Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct. I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things: 1. 
compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new) 2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it. There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do. This commit was SVN r23925.	2010-10-24 18:35:54 +00:00
Jeff Squyres	73bcc4a36b	Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers This commit was SVN r23801. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-24 22:53:28 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Josh Hursey	e12ca48cd9	A number of C/R enhancements per RFC below: http://www.open-mpi.org/community/lists/devel/2010/07/8240.php Documentation: http://osl.iu.edu/research/ft/ Major Changes: -------------- * Added C/R-enabled Debugging support. Enabled with the --enable-crdebug flag. See the following website for more information: http://osl.iu.edu/research/ft/crdebug/ * Added Stable Storage (SStore) framework for checkpoint storage * 'central' component does a direct to central storage save * 'stage' component stages checkpoints to central storage while the application continues execution. * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) * Added Compression (compress) framework to support * Add two new ErrMgr recovery policies * {{{crmig}}} C/R Process Migration * {{{autor}}} C/R Automatic Recovery * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) * {{{OMPI_CR_Restart}}} * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) * {{{OMPI_CR_Quiesce_start}}} * {{{OMPI_CR_Quiesce_checkpoint}}} * {{{OMPI_CR_Quiesce_end}}} * {{{OMPI_CR_self_register_checkpoint_callback}}} * {{{OMPI_CR_self_register_restart_callback}}} * {{{OMPI_CR_self_register_continue_callback}}} * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
* Add a progress meter to: * FileM rsh (filem_rsh_process_meter) * SnapC full (snapc_full_progress_meter) * SStore stage (sstore_stage_progress_meter) * Added 2 new command line options to ompi-restart * --showme : Display the full command line that would have been exec'ed. * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) * Deprecated some MCA params: * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared * snapc_base_store_in_place deprecated, replaced with different components of SStore * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref * snapc_base_establish_global_snapshot_dir deprecated, never well supported * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem Minor Changes: -------------- * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}} * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. * Cleanup the CRS framework and components to work with the SStore framework. * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). * Add 'quiesce' hook to CRCP for a future enhancement. 
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. * Add optional application level INC callbacks (registered through the CR MPI Ext interface). * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. * {{{opal-restart}}} also supports local decompression before restarting * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata * {{{orte-restart}}} now uses the SStore framework to work with the metadata * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. * Improve the checks for 'already checkpointing' error path. * Add a recovery output timer, to show how long it takes to restart a job * Do a better job of cleaning up the old session directory on restart. * Add a local module to the autor and crmig ErrMgr components. 
These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. This commit was SVN r23587. The following Trac tickets were found above: Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924 Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097 Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161 Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192 Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208 Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342 Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413	2010-08-10 20:51:11 +00:00
Ralph Castain	d8ec83f939	Remove an unneeded tag This commit was SVN r23533.	2010-07-29 02:13:06 +00:00
Ralph Castain	f3e13b9766	Improve the efficiency by making the check for uniqueness of incoming hnp contact info much faster by including the hnp_uri in the job_family tracker object. Replace the global buffer storage with a quick routine to build the buffer from the jobfams array This commit was SVN r23443.	2010-07-20 08:30:47 +00:00
Ralph Castain	248320b91a	Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started. Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them. Add a test provided by Philippe that tests this ability. This commit was SVN r23438.	2010-07-20 04:22:45 +00:00
Ralph Castain	12cd07c9a9	Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases. The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang. This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is: * pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place * modified the errmgr to directly call the new routines when termination is detected * removed the grpcomm.onesided_barrier and its associated RML tag * add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree * use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial. This commit was SVN r23429.	2010-07-17 21:03:27 +00:00
Ralph Castain	8bb0c16c2f	Add new tag This commit was SVN r23382.	2010-07-13 06:29:13 +00:00
Ralph Castain	dd85689560	Cleanup pointer array addressing This commit was SVN r23329.	2010-07-01 19:33:10 +00:00
Ralph Castain	b60c369489	Add missing rml tag This commit was SVN r23232.	2010-06-01 22:58:23 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Ralph Castain	2ff1ae13e1	Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set. This commit was SVN r23102.	2010-05-05 00:48:43 +00:00
Ralph Castain	55889934d8	After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon. Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly. This commit was SVN r23045.	2010-04-27 03:39:32 +00:00
Ralph Castain	efbb5c9b7c	Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base: * add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway. * update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states. * remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails. * update the plm_base_receive.c code to call the errmgr with state updates from remote daemons * update the routed modules to reflect that process state is updated in the errmgr * ensure that the orted's open the errmgr and select their appropriate module * add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place * define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired. * update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios This commit was SVN r23023.	2010-04-23 04:44:41 +00:00
Josh Hursey	62f8d3c471	r22885 missed a few symbol updates when it changed ompi_want_ft to opal_want_ft This commit was SVN r22916. The following SVN revision numbers were found above: r22885 --> open-mpi/ompi@522a23d6a3	2010-03-30 16:47:39 +00:00
Josh Hursey	e4f2d03d28	ErrMgr Framework redesign to better support fault tolerance development activities. Explained in more detail in the following RFC: http://www.open-mpi.org/community/lists/devel/2010/03/7589.php This commit was SVN r22872.	2010-03-23 21:28:02 +00:00
Josh Hursey	e9b5162d79	Fix the configure logic for --with-ft so that it properly takes a comma separated list. Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those. The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed. This commit was SVN r22824.	2010-03-12 23:57:50 +00:00
Ralph Castain	e88627a7ca	Ensure we don't go through rml open/select more than once. Open the rml to get the uri when bootstrapping daemons This commit was SVN r22538.	2010-02-03 15:38:32 +00:00
Shiqing Fan	ad763c327d	Restore several linked libraries that were deleted by mistake in r22405. This commit was SVN r22415. The following SVN revision numbers were found above: r22405 --> open-mpi/ompi@872a4047ba	2010-01-14 21:50:42 +00:00
Shiqing Fan	872a4047ba	Fix the bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions. In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically, but in CMake 2.8 this behavior has changed: it only adds the dependencies and does not link, which causes linking errors at build time. This commit was SVN r22405.	2010-01-14 18:10:20 +00:00
Ralph Castain	5faf857840	Add a new tag for pnp/multicast send of direct messages This commit was SVN r22352.	2009-12-31 20:34:58 +00:00
Ralph Castain	db2cbd3166	Okay, okay - do it at destruct time too. This commit was SVN r22331.	2009-12-17 20:08:49 +00:00
Ralph Castain	a56e09c874	Per suggestion from Josh, init the sender field of the msg_packet object to INVALID This commit was SVN r22330.	2009-12-17 20:03:35 +00:00
Ralph Castain	9acec283af	Add a new TCP module to the reliable multicast framework. This module uses ORTE's grpcomm.xcast functionality to "fake" multicasts for environments where regular multicast isn't reliable. Modify the startup logic to allow for this use-case. This commit was SVN r22310.	2009-12-15 01:18:27 +00:00
Ralph Castain	84cc847be8	Next phase of auto-wireup using multicast. Enable use of multicast groups to separate comm from different application groups. Have the orted bootstrap message go to a different rml tag so the node can be added to the pool. This commit was SVN r22083.	2009-10-10 01:19:56 +00:00
Rainer Keller	8e1b23779f	- Replace combinations of #if defined (c_plusplus) || defined (__cplusplus) followed by extern "C" { and the closing counterpart by BEGIN_C_DECLS and END_C_DECLS. Notable exceptions are: - opal/include/opal_config_bottom.h: This is our generated code, that itself defines BEGIN_C_DECLS and END_C_DECLS - ompi/mpi/cxx/mpicxx.h: Here we do not include opal_config_bottom.h: - Belongs to external code: opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.c opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.h - opal/include/opal/prefetch.h: Has C++ specific macros that are protected: - Had #if ... } #endif _and_ END_C_DECLS (aka end up with 2x END_C_DECLS) ompi/mca/btl/openib/btl_openib.h - opal/event/event.h has #ifdef __cplusplus as BEGIN_C_DECLS... - opal/win32/ompi_process.h: had extern "C"\n {... opal/win32/ompi_process.h: ditto - ompi/mca/btl/pcie/btl_pcie_lex.l: needed to add *_C_DECLS ompi/mpi/f90/test/align_c.c: ditto - ompi/debuggers/msgq_interface.h: used #ifdef __cplusplus - ompi/mpi/f90/xml/common-C.xsl: Amend Tested on linux using --with-openib and --with-mx The following do not contain either opal_config.h, orte_config.h or ompi_config.h (but possibly other header files, that include one of the above): ompi/mca/bml/r2/bml_r2_ft.h ompi/mca/btl/gm/btl_gm_endpoint.h ompi/mca/btl/gm/btl_gm_proc.h ompi/mca/btl/mx/btl_mx_endpoint.h ompi/mca/btl/ofud/btl_ofud_endpoint.h ompi/mca/btl/ofud/btl_ofud_frag.h ompi/mca/btl/ofud/btl_ofud_proc.h ompi/mca/btl/openib/btl_openib_mca.h ompi/mca/btl/portals/btl_portals_endpoint.h ompi/mca/btl/portals/btl_portals_frag.h ompi/mca/btl/sctp/btl_sctp_endpoint.h ompi/mca/btl/sctp/btl_sctp_proc.h ompi/mca/btl/tcp/btl_tcp_endpoint.h ompi/mca/btl/tcp/btl_tcp_ft.h ompi/mca/btl/tcp/btl_tcp_proc.h ompi/mca/btl/template/btl_template_endpoint.h ompi/mca/btl/template/btl_template_proc.h ompi/mca/btl/udapl/btl_udapl_eager_rdma.h ompi/mca/btl/udapl/btl_udapl_endpoint.h 
ompi/mca/btl/udapl/btl_udapl_mca.h ompi/mca/btl/udapl/btl_udapl_proc.h ompi/mca/mtl/mx/mtl_mx_endpoint.h ompi/mca/mtl/mx/mtl_mx.h ompi/mca/mtl/psm/mtl_psm_endpoint.h ompi/mca/mtl/psm/mtl_psm.h ompi/mca/pml/cm/pml_cm_component.h ompi/mca/pml/csum/pml_csum_comm.h ompi/mca/pml/dr/pml_dr_comm.h ompi/mca/pml/dr/pml_dr_component.h ompi/mca/pml/dr/pml_dr_endpoint.h ompi/mca/pml/dr/pml_dr_recvfrag.h ompi/mca/pml/example/pml_example.h ompi/mca/pml/ob1/pml_ob1_comm.h ompi/mca/pml/ob1/pml_ob1_component.h ompi/mca/pml/ob1/pml_ob1_endpoint.h ompi/mca/pml/ob1/pml_ob1_rdmafrag.h ompi/mca/pml/ob1/pml_ob1_recvfrag.h ompi/mca/pml/v/pml_v_output.h opal/include/opal/prefetch.h opal/mca/timer/aix/timer_aix.h opal/util/qsort.h test/support/components.h This commit was SVN r21855. The following SVN revision numbers were found above: r2 --> open-mpi/ompi@58fdc18855	2009-08-20 11:42:18 +00:00
Shiqing Fan	bce2f44154	Update related .windows files with proper compiling properties, in order to have a successful DSO build. This commit was SVN r21805.	2009-08-12 08:55:58 +00:00
Shiqing Fan	3e24d3df70	An ORTE event fix for Windows, i.e. using socket pairs instead of pipes on Windows. This commit was SVN r21726.	2009-07-22 07:39:52 +00:00
Ralph Castain	1a5f7245c8	Create a new message handling method for serializing responses. Place recvd messages on a list, using a file descriptor and the event library to trigger processing. This is identical in design to what is used in the IOF. Use it first in the plm_base_receive to serialize multiple comm_spawn and update_proc requests. This commit was SVN r21717.	2009-07-19 18:07:04 +00:00
George Bosilca	3e971e61f3	The system headers are supposed to be protected by #ifdef and not by #if. This commit was SVN r21700.	2009-07-16 18:27:33 +00:00
Rainer Keller	b98a095d22	- Similar to r21229, check for return code from orte_rml_base_update_contact_info This commit was SVN r21233. The following SVN revision numbers were found above: r21229 --> open-mpi/ompi@9ad9b20847	2009-05-14 00:36:51 +00:00
Rainer Keller	9b7c4c2354	- Update comment This commit was SVN r21232.	2009-05-14 00:32:43 +00:00
Ralph Castain	c45ff0d59f	Take the next step towards fully utilizing static ports for the daemons to eliminate the initial "phone home" to mpirun by modifying the orted termination procedure to eliminate the need for a full barrier-like operation. Instead, we add a "onesided" barrier to the grpcomm framework API that releases the orted once it has completed its own contribution to the barrier - i.e., the orteds now exit as the "ack" message rolls up towards mpirun instead of sending the "ack" directly to mpirun. This causes the orteds in the routing tree to remain alive until all termination "acks" from orteds below them have passed through. Thus, if we use static ports, we no longer require a direct orted-to-mpirun connection. Also modify the binomial routed module so it conforms to what all the other routed modules do and have all messages pass along the routing tree instead of short-circuiting between orteds. This further reduces the number of ports being opened on backend nodes. This commit was SVN r21203.	2009-05-11 14:11:44 +00:00
Shiqing Fan	cd565923d3	Completely remove ltdl support for Windows build. This commit was SVN r21170.	2009-05-05 18:59:13 +00:00
Ralph Castain	4be24521aa	Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans. Minor mod to the ompi layer to use the new #define's - just one-line name replacements. This commit was SVN r21144.	2009-05-04 11:07:40 +00:00
Rainer Keller	221fb9dbca	... Delayed due to notifier commits earlier this day ... - Delete unnecessary header files using contrib/check_unnecessary_headers.sh after applying patches, that include headers, being "lost" due to inclusion in one of the now deleted headers... In total 817 files are touched. In ompi/mpi/c/ header files are moved up into the actual c-file, where necessary (these are the only additional #include), otherwise it is only deletions of #include (apart from the above additions required due to notifier...) - To get different MCAs (OpenIB, TM, ALPS), an earlier version was successfully compiled (yesterday) on: Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled This commit was SVN r21096.	2009-04-29 01:32:14 +00:00
Rainer Keller	6c1cce8761	- For the upcoming header cleanup commit, several header files (previously included by header-files) now have to be moved "upward". This is mainly system headers such as string.h, stdio.h and for networking, but also some orte headers. This commit was SVN r21095.	2009-04-29 00:49:23 +00:00
Shiqing Fan	3d4e0472d6	Add windows support files into the tarball, including .windows, CMakeLists.txt files, and CMake modules. Thanks to Jeff for testing it on Linux. This commit was SVN r21069.	2009-04-24 16:39:33 +00:00
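
Commit 4be24521aa above replaces individual booleans in orte_process_info with a 32-bit bitmap, a set of #define's for the bits, and matching test macros. A minimal sketch of that pattern follows. All names here (proc_type_t, PROC_TYPE_*, PROC_IS_*, process_info_t, add_role) are hypothetical and chosen for illustration only; they are not the identifiers or bit values used in the ORTE code base.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the actual ORTE definitions. */
typedef uint32_t proc_type_t;

/* One bit per process role, so a process can hold several roles at once. */
#define PROC_TYPE_HNP     0x0001u
#define PROC_TYPE_DAEMON  0x0002u
#define PROC_TYPE_TOOL    0x0004u
#define PROC_TYPE_MPI     0x0008u

/* Matching test macros: evaluate to 1 when the bit is set, 0 otherwise. */
#define PROC_IS_HNP(t)    (((t) & PROC_TYPE_HNP) != 0)
#define PROC_IS_DAEMON(t) (((t) & PROC_TYPE_DAEMON) != 0)
#define PROC_IS_TOOL(t)   (((t) & PROC_TYPE_TOOL) != 0)
#define PROC_IS_MPI(t)    (((t) & PROC_TYPE_MPI) != 0)

/* Stand-in for a process-info structure: one bitmap instead of
 * a separate boolean field per role. */
typedef struct {
    proc_type_t proc_type;
} process_info_t;

/* Return a copy of info with the given role bit(s) added. */
static inline process_info_t add_role(process_info_t info, proc_type_t role)
{
    info.proc_type |= role;
    return info;
}
```

With this layout, adding a new process type costs one #define and one test macro rather than a new structure field, and callers that only use the macros need no changes.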


283 Commits