openmpi

Author	SHA1	Message	Date
Abhishek Kulkarni	1ce378b5c6	Make C/R work with nodes > 1. This fix makes sure that the app coordinators send the "ready-to-checkpoint" signal to the global coordinator only after ORTE has initialized. This commit was SVN r26795.	2012-07-13 23:37:29 +00:00
Abhishek Kulkarni	eec5a28aa4	More C/R fixes. * Fix a typo introduced by the removal of the notifier framework * Fix to flush the modex cached data correctly using the orte DB API. This commit was SVN r26773.	2012-07-10 01:19:46 +00:00
Abhishek Kulkarni	5c58a1c9c1	Fix C/R support in the trunk. Among other things, this patch deals with the following issues: * fix ompi-checkpoint argument parsing * ompi-restart -showme prints an extraneous "Restarted child with PID" message. Move around the debug statement to avoid this. * fixes for the state machine changes This commit was SVN r26770.	2012-07-09 23:34:13 +00:00
Ralph Castain	0dfe29b1a6	Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required. Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework. Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one. This commit was SVN r26678.	2012-06-27 14:53:55 +00:00
Brian Barrett	b22faedd9d	Remove the Portals4 SHMEM reference implementation runtime support, as we're no longer using the runtime provided by the reference implementation. Remove the Catamount support from ORTE, since we're no longer supporting Catamount. Left the Catamount timer component, because I'm not sure whether it's used on the XTs running CNL. This commit was SVN r26677.	2012-06-27 14:17:43 +00:00
Ralph Castain	92527da4e3	Remove unused component This commit was SVN r26660.	2012-06-26 00:49:28 +00:00
Ralph Castain	e6f3586415	Remove the orte notifier framework, per discussion at the devel meeting and follow-up with Jeff (who took the action item) This commit was SVN r26637.	2012-06-22 18:09:23 +00:00
Ralph Castain	0a713cd27e	Add database framework to ORTE and refactor modex code to utilize it. Create the "hash" db component from the prior modex db code. Leave the other components ignored for now - will activate them later. Modex is still a blocking operation at this point. This commit was SVN r26618.	2012-06-19 13:38:42 +00:00
Ralph Castain	96c778656a	Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP. Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build. This commit was SVN r26606.	2012-06-15 10:15:07 +00:00
Ralph Castain	269cb2b8d9	Some cleanup to remove calls to opal_progress when running with orte progress threads, and to ensure that all orte-related events are in the orte event base. This commit was SVN r26591.	2012-06-11 19:59:53 +00:00
Brian Barrett	7406ef1241	Make all the PMI components depend on the common pmi library and properly install the common pmi library This commit was SVN r26588.	2012-06-11 15:58:09 +00:00
Nathan Hjelm	f2d4e95429	doh! add missing include This commit was SVN r26471.	2012-05-22 20:49:13 +00:00
Nathan Hjelm	cdc3c87ba6	move pmi init/finalize into a common component This commit was SVN r26470.	2012-05-22 15:15:39 +00:00
Nathan Hjelm	78b8b3cf76	bug fix: actually close ess components This commit was SVN r26469.	2012-05-22 15:09:18 +00:00
Nathan Hjelm	6eeca66475	add an option to enable static ports. disabled by default This commit was SVN r26462.	2012-05-21 19:56:15 +00:00
Ralph Castain	70a106fa71	Fix binding on remote nodes - need to pass the binding bitmap! This commit was SVN r26403.	2012-05-08 03:52:39 +00:00
Jeff Squyres	2ba10c37fe	Per RFC, bring in the following changes: * Remove paffinity, maffinity, and carto frameworks -- they've been wholly replaced by hwloc. * Move ompi_mpi_init() affinity-setting/checking code down to ORTE. * Update sm, smcuda, wv, and openib components to no longer use carto. Instead, use hwloc data. There are still optimizations possible in the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old carto-based code found out how many NUMA nodes were ''available'' -- not how many were used ''in this job''. The new hwloc-using code computes the same value -- it was not updated to calculate how many NUMA nodes are used ''by this job.'' * Note that I cannot compile the smcuda and wv BTLs -- I ''think'' they're right, but they need to be verified by their owners. * The openib component now does a bunch of stuff to figure out where "near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors (I do not have a NUMA machine with an OpenFabrics device that is a non-uniform distance from multiple different NUMA nodes). * Completely rewrite the OMPI_Affinity_str() routine from the "affinity" mpiext extension. This extension now understands hyperthreads; the output format of it has changed a bit to reflect this new information. * Bunches of minor changes around the code base to update names/types from maffinity/paffinity-based names to hwloc-based names. * Add some helper functions into the hwloc base, mainly having to do with the fact that we have the hwloc data reporting ''all'' topology information, but sometimes you really only want the (online | available) data. This commit was SVN r26391.	2012-05-07 14:52:54 +00:00
Ralph Castain	45fee2b491	Resolve the case where only the HNP is in the system (i.e., single-node operation) This commit was SVN r26382.	2012-05-03 18:00:01 +00:00
Ralph Castain	9f724db182	Remove duplicate event assignment This commit was SVN r26360.	2012-04-30 16:06:20 +00:00
Ralph Castain	289f9f41ec	From long-term discussions, have the daemons use the node_t and proc_t structs and arrays instead of the pidmap and nidmap arrays. Sets the stage for future work. This commit was SVN r26359.	2012-04-29 00:10:01 +00:00
Ralph Castain	5d14fa7546	Fix mpi_abort, minimize error output. This commit was SVN r26266.	2012-04-11 14:37:08 +00:00
Ralph Castain	14d5525fb1	Some minor cleanups. Get singletons working. Cleanup abort handling so it gets properly identified. This commit was SVN r26261.	2012-04-10 19:08:54 +00:00
Ralph Castain	a34be856aa	Now that we have PMI support, this is no longer needed This commit was SVN r26254.	2012-04-07 13:36:24 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Ralph Castain	46b040c79f	Fix typo This commit was SVN r26189.	2012-03-24 00:31:05 +00:00
Ralph Castain	2bd75ec7e3	Fix Cray XE builds - the priority here needs to equal that of the HNP component so that both build. Otherwise, mpirun tries to use PMI for its basis, and that doesn't work! This commit was SVN r26188.	2012-03-23 20:06:34 +00:00
Josh Hursey	4dd9f89a99	Create an MCA parameter (ess_base_stream_buffering) that allows the user to override the system default for buffering of stdout/stderr streams. See 'man setvbuf' for more information. Note: I am working on a system that buffered all output until the application finished due to a default of 'fully buffered.' This makes debugging painful. This switch fixed the problem by allowing me to adjust the buffering. This commit was SVN r26119.	2012-03-08 22:02:28 +00:00
Ralph Castain	834a86420b	Ensure we use the slurm module for slurm environments, and correct init order in pmi module when used by daemons This commit was SVN r26089.	2012-03-02 23:10:48 +00:00
Ralph Castain	a83da303c5	When using PMI, we know the ranks that share our node and their relative local/node ranks. Save that info in the pidmap array so that BTLs that require early knowledge of local ranks can access it. This commit was SVN r25992.	2012-02-21 16:43:17 +00:00
Jeff Squyres	b295a01d8e	Fix another configury error found by Paul Hargrove. Thanks, Paul! This commit was SVN r25971.	2012-02-20 21:38:27 +00:00
Ralph Castain	a0edae52f2	Ensure the wrapper flags get entered in the right order, with -lpmi coming before the alps util libs This commit was SVN r25809.	2012-01-27 20:56:21 +00:00
Ralph Castain	be3dfb6a1a	Ensure that we only add -lpmi once to the wrapper compilers, no matter how many components might use it. This commit was SVN r25753.	2012-01-20 04:56:38 +00:00
Ralph Castain	d7fe1615b6	Add missing dollar sign on variable This commit was SVN r25745.	2012-01-19 20:45:22 +00:00
Ralph Castain	0d20f745e2	Remove stale function def This commit was SVN r25744.	2012-01-19 20:40:48 +00:00
Nathan Hjelm	6d0e7a0a0e	don't enable ess/alps unless cnos is available This commit was SVN r25743.	2012-01-19 19:36:00 +00:00
Ralph Castain	9d556e2f17	Allow daemons to use PMI to get their name where PMI support is available while using the standard grpcomm and other capabilities. Remove the GNI code from the alps ess component as that component should only be for alps/cnos installations. This commit was SVN r25737.	2012-01-18 20:56:53 +00:00
Ralph Castain	bf103de66c	My apologies for doing this outside of the usual time restrictions, but we need to get this in so we can make progress. Move the ORTE-level debugger code back into orterun and out of the ORTE library to resolve symbol conflicts. This commit was SVN r25713.	2012-01-11 15:53:09 +00:00
Ralph Castain	7180ad40ad	Fix a couple of minor buglets This commit was SVN r25589.	2011-12-07 21:08:35 +00:00
Abhishek Kulkarni	0b7c51fae2	Correct an invalid reference to a missing help file. This commit was SVN r25573.	2011-12-05 21:29:07 +00:00
Josh Hursey	cc57840b53	Fix ess/tool so that it does not segv when using the rsh PLM. Just have it use the base function directly to avoid similar problems with finalizing other components. This commit was SVN r25571.	2011-12-05 15:40:46 +00:00
Ralph Castain	c56acf60ca	Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order. Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use. Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact. Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code. This commit was SVN r25551.	2011-11-30 19:58:24 +00:00
Ralph Castain	b475421c16	As promised, rationalize the rsh support. Remove rshbase and the base rsh support, centralizing all rsh support into the rsh component. Remove the "slave" launch support as that experiment is complete. Fix tree spawn and make that the default method for rsh launch, turning it "off" for qrsh as that system does not support tree spawn. This commit was SVN r25507.	2011-11-26 02:33:05 +00:00
Ralph Castain	9b59d8de6f	This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build. Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations. Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way. This commit was SVN r25497.	2011-11-22 21:24:35 +00:00
George Bosilca	61f273b987	Do not tolerate uninitialized variables. This commit was SVN r25489.	2011-11-18 10:19:24 +00:00
Samuel Gutierrez	e03bc93fb7	only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code. This commit was SVN r25447.	2011-11-06 17:28:40 +00:00
Samuel Gutierrez	3fe7b3ee54	add PMI support to ess alps module. xt system guys: please yell at me if i missed something in cnos. This commit was SVN r25423.	2011-11-03 04:04:32 +00:00
Samuel Gutierrez	27b9bcfafd	update ess alps configuration file to include CNOS and PMI checks. some of the features committed here aren't being used, but they will be. also update orte_check_pmi.m4 to include missing call to action-if-not-found if --with-pmi is not specified or is disabled. This commit was SVN r25422.	2011-11-03 02:14:47 +00:00
Ralph Castain	b77552c45d	Cleanup some include files, return a silent error in open/select as the complaining component already output a message This commit was SVN r25416.	2011-11-02 17:42:06 +00:00
Ralph Castain	55b996678e	Minor indentation changes This commit was SVN r25414.	2011-11-02 15:56:56 +00:00
Ralph Castain	d28dd55d33	Minimize the amount of topology info returned by the daemons. Most clusters, especially at scale, use the same node topology on every node, so there is no reason to return the topology from every daemon. Borrow a page from the --hetero-apps page and let users indicate that the node topology differs by adding a --hetero-nodes option to mpirun. If the option is set, then every daemon returns topology info. If not set, then only daemon vpid=1 returns it. We always want one daemon to return the topology as the head node is often different from the compute nodes. Having one daemon return the compute node topology allows us to detect any such difference. All compute nodes are then set to the same topology. This commit was SVN r25408.	2011-11-01 18:43:10 +00:00
Ralph Castain	14966e0f8f	Cleanup PMI startup - if a component isn't selected, it should finalize PMI IFF it started it. Otherwise, components that aren't selected can finalize PMI when it is in use by other parts of the system. This commit was SVN r25407.	2011-11-01 16:25:12 +00:00
Ralph Castain	965b04d1a5	Use the new utilities to get a topology that reflects available cpus This commit was SVN r25394.	2011-10-29 15:07:36 +00:00
Ralph Castain	e2eb8d5f78	Remove bad param registration - that param was already registered as an int_name in another location. This commit was SVN r25381.	2011-10-28 19:14:43 +00:00
Josh Hursey	6726590b1c	Remove the 'ess_node_rank' accessor from here. This caused running under 'tm' to segv at the orteds. It just looks like this part of the component was not updated during r25331. It was removed from the 'env' and 'slurm' environments in that patch. It looks like 'tm' was updated, but did not get this particular piece. This commit was SVN r25380. The following SVN revision numbers were found above: r25331 --> open-mpi/ompi@b44f8d4b28	2011-10-28 17:41:35 +00:00
Josh Hursey	59ff1dbbfb	Fix indentation problem that caused a segv when running without regex. This was introduced in r25063. This commit was SVN r25379. The following SVN revision numbers were found above: r25063 --> open-mpi/ompi@e58623cd5b	2011-10-28 13:39:32 +00:00
Ralph Castain	3e72fccacf	Cray's PMI implementation is quite different from slurm's - they extended PMI-1 by adding some, but not all, of the PMI-2 APIs. So you can't just switch to using PMI-2 functions as it isn't a complete implementation. Instead, you have to selectively figure out which ones they have in PMI-2, and use any missing ones from PMI-1. What fun. Modify the configure logic and the PMI components to accommodate Cray's approach. Refactor the PMI error reporting code so it resides in only one place. Cray actually decided -not- to define the PMI-2 error codes, so we have to use the PMI-1 codes instead. More fun. This commit was SVN r25348.	2011-10-21 04:54:38 +00:00
Nathan Hjelm	beb8d8ce32	pmi return code wtf This commit was SVN r25336.	2011-10-20 17:51:24 +00:00
Ralph Castain	b44f8d4b28	Complete implementation of the ess.proc_get_locality API. Up to this point, the API was only capable of telling if the specified proc was sharing a node with you. However, the returned value was capable of telling you much more detailed info - e.g., if the proc shares a socket, a cache, or numa node. We just didn't have the data to provide that detail. Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times. Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun. This commit was SVN r25331.	2011-10-19 20:18:14 +00:00
Ralph Castain	2fdd9c6dea	Ensure mpirun doesn't pick this component This commit was SVN r25307.	2011-10-17 22:28:28 +00:00
Ralph Castain	8f0ef54130	Complete implementation of pmi support. Ensure we support both mpirun and direct launch within same configuration to avoid requiring separate builds. Add support for generic pmi, not just under slurm. Add publish/subscribe support, although slurm's pmi implementation will just return an error as it hasn't been done yet. This commit was SVN r25303.	2011-10-17 20:51:22 +00:00
Ralph Castain	08fa9e1c6a	Correct include path This commit was SVN r25282.	2011-10-13 23:46:52 +00:00
Ralph Castain	b96ef2161d	Complete the PMI support. Generalize PMI operations to support both slurm and non-slurm environments. Correct some configuration issues - we really only want the PMI integration at the individual component level. Ensure that the pmi grpcomm component doesn't get selected when launching via mpirun by setting its priority below the bad component. Only verified in a slurm environment as that's all I have access to... This commit was SVN r25275.	2011-10-12 20:59:25 +00:00
Brian Barrett	98e98ce2c5	* opal_atomic_trylock is documented to return 0 if the lock was acquired, 1 otherwise. It was doing the opposite, so this patch fixes the return values. All uses (all in ORTE) used the actual return values, not the documented values, so fix them as well. This commit was SVN r25257.	2011-10-11 18:43:45 +00:00
Ralph Castain	3c4f04f4d9	Ensure opal_hwloc_topology is NULL after being destroyed This commit was SVN r25138.	2011-09-13 19:21:10 +00:00
Ralph Castain	556a05566e	Silence warning This commit was SVN r25130.	2011-09-12 16:21:51 +00:00
Ralph Castain	92c7372e20	Per the RFC from Jeff, move hwloc from opal/mca/common to its own static framework ala libevent. Have ORTE daemons collect the topology info at startup and, if --enable-hwloc-xml is set, send that info back to the HNP for later use. The HNP only retains unique topology "templates" to reduce memory footprint. Have the daemon include the local topology info in the nidmap buffer sent to each app so the apps don't all hammer the local system to discover it for themselves. Remove the sysinfo framework as hwloc replaces that functionality. This commit was SVN r25124.	2011-09-11 19:02:24 +00:00
Wesley Bland	f8740e5478	Correct a typo reported by Pasha. This commit was SVN r25109.	2011-08-30 18:44:52 +00:00
Wesley Bland	4e7ff0bd5e	By popular demand the epoch code is now disabled by default. To enable the epochs and the resilient orte code, use the configure flag: --enable-resilient-orte This will define both: ORTE_ENABLE_EPOCH ORTE_RESIL_ORTE This commit was SVN r25093.	2011-08-26 22:16:14 +00:00
Ralph Castain	1c08a4006c	Refactor some code to remove a few API handles from errmgr. Reviewed/tested by Wes. This commit was SVN r25064.	2011-08-18 16:24:45 +00:00
Ralph Castain	e58623cd5b	Bring alps back to full operations by correctly computing daemon names. Unfortunately, alps doesn't assign cnos rank in node-based order - i.e., cnos rank=0 isn't necessarily on the first node of the execution. So adjust when using static ports. Add some debug to nidmap Ensure that the HNP's node name is not included in the regex when launching via rshbase as that node is automatically included in the daemon map. This commit was SVN r25063.	2011-08-18 14:59:18 +00:00
Ralph Castain	ca3d29a1e6	Extend regex support to a bigger audience This commit was SVN r25046.	2011-08-12 21:02:48 +00:00
Wesley Bland	09274cd047	Make sure that the epoch is initialized everywhere so we don't get weird output during valgrind. This shouldn't have caused any problems with any actual execution. Just extra warnings in valgrind. This commit was SVN r25015.	2011-08-08 15:11:55 +00:00
Ralph Castain	c86bfb4e90	Need to copy the string This commit was SVN r25001.	2011-08-05 19:03:28 +00:00
Wesley Bland	5fde3e0e00	Move the resilient orte errmgr code into a separate errmgr for now while it's still unstable. Reverted errmgr modules back to the original errmgr (with the updates since the resilient code was brought into the trunk). This commit was SVN r24958.	2011-07-28 21:24:34 +00:00
Ralph Castain	6c879f87fb	Add a new param "orte_remote_tmpdir_base" for those situations where the compute nodes require a different session directory head than the head node. This commit was SVN r24956.	2011-07-27 19:37:17 +00:00
Ralph Castain	361bcef253	Close multicast before rml This commit was SVN r24925.	2011-07-23 20:19:15 +00:00
Brian Barrett	3bd66a5932	* Remove unused Portals3.3 reference implementation support This commit was SVN r24906.	2011-07-20 23:30:29 +00:00
Ralph Castain	8853e0e80a	Fix regular expression analyzer for slurmd - use a slurm-specific version Fix multi-node routing for daemon startup when static ports are not set This commit was SVN r24898.	2011-07-13 22:49:56 +00:00
Ralph Castain	1ee7c39982	Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long. Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again. Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines. This commit was SVN r24860.	2011-07-07 18:54:30 +00:00
Wesley Bland	0628963506	Fix a return code when a process isn't found. This commit was SVN r24845.	2011-06-30 15:22:54 +00:00
Brian Barrett	e52fef28ca	Do something rational for the disable full support case This commit was SVN r24844.	2011-06-30 14:48:19 +00:00
Ralph Castain	8ac35a8496	Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate. This commit was SVN r24843.	2011-06-30 14:11:56 +00:00
Wesley Bland	e1ba09ad51	Add resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Ralph Castain	b95ede99d5	Ensure we use the pmi grpcomm when using pmi This commit was SVN r24793.	2011-06-20 21:57:47 +00:00
Ralph Castain	92a65f21bf	Restore slurm pmi support from long, long ago. Since we already have the ability to directly srun an MPI job, just conditionally add the PMI support for key values and provide a grpcomm module that uses PMI for barriers and modex. Currently ompi_ignored, and unignored only for me (others to soon follow). This commit was SVN r24792.	2011-06-20 21:04:46 +00:00
Josh Hursey	0eb3b3b7b0	Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9335.php This commit was SVN r24775.	2011-06-15 13:10:13 +00:00
Ralph Castain	c78531ce8a	Don't free the envar that gets putenv'd as that messes up the environ This commit was SVN r24660.	2011-04-29 08:50:29 +00:00
Ralph Castain	859aaab93d	In the case of direct-launched processes running under slurm, psm requires that the pre_condition_transports MCA param be set. This is normally computed by mpirun and inserted into each proc's environ, but that doesn't work here. So separate out the printing of that key, and let the individual procs generate it in a way that ensures they all get the same result. This commit was SVN r24646.	2011-04-28 13:54:33 +00:00
Ralph Castain	dc6f616599	Enable VM launch. For some time, ORTE has had the ability to launch daemons on all nodes prior to launching an application. It has largely been used outside of the OMPI community, and so was never explicitly turned "on" inside OMPI releases. Nevertheless, the code has been there. Allowing VM launches does not require ANY changes to existing PLM components. All that was required was to have orterun launch the daemons as a separate call to orte_plm.spawn -prior- to launching the applications. The rest of the VM support code resides in the rmaps framework: (a) a check when asked to map a job to see if it is the daemon job, and (b) a separate "setup_virtual_machine" mapper in the rmaps base that creates the required map so the PLM's will do the right thing. In order to support those users who have no RM allocation but like to give the allocation in the form of a -host or -hostfile argument to their application, there is a little more code in orterun and the setup_virtual_machine mapper to capture information passed in that manner. This has been tested with rsh and slurm environments, and, since there is nothing environment-specific in the implementation, should work in others as well - but needs to be proven. This commit was SVN r24524.	2011-03-12 22:50:53 +00:00
Jeff Squyres	06d5c59115	Fix a few valgrind-reported memory leaks This commit was SVN r24498.	2011-03-08 17:37:28 +00:00
George Bosilca	9bbe00bdc3	Set the return code from the processes upstream. This commit was SVN r24483.	2011-03-03 00:02:21 +00:00
George Bosilca	c6a5f9706a	Thomas's patch: Assume we won't fail unless notified by a child. This commit was SVN r24482.	2011-03-02 23:50:01 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	71669720a3	Just get the output once on sigpipe error, and include the fd This commit was SVN r24092.	2010-11-25 15:32:48 +00:00
Ralph Castain	30c635fd4d	Don't endlessly output sigpipe errors. Count the number of times we trap it, and abort if we get more than 10 of them. This commit was SVN r24091.	2010-11-25 15:25:24 +00:00
Ralph Castain	9ea2b196ce	Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component. Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work. This commit was SVN r23966.	2010-10-28 15:22:46 +00:00
Brian Barrett	3ed00ba148	More fixes to make OMPI compile with minimal ORTE support again This commit was SVN r23962.	2010-10-27 20:40:39 +00:00
Shiqing Fan	a3d9c91ff7	Exclude stdbool.h for Windows, and use the definition in opal. Immigrate the socket pair support from libevent. Fix other minor things and make it compile. This commit was SVN r23951.	2010-10-26 14:53:50 +00:00
Ralph Castain	894230b121	This stuff is soooo out-of-date that a complete rewrite would be required - thankfully, nobody cares This commit was SVN r23944.	2010-10-26 06:22:31 +00:00
Ralph Castain	86c7365e8e	Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem. Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution. This commit was SVN r23943.	2010-10-26 02:41:42 +00:00
Ralph Castain	fceabb2498	Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects. Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems. Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct. I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things: 1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new) 2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it. There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do. This commit was SVN r23925.	2010-10-24 18:35:54 +00:00
Brian Barrett	9febaa475e	* Add shell of functionality required for supporting Portals4 * Update places where orte-free builds have failed This commit was SVN r23891.	2010-10-14 22:49:09 +00:00
Jeff Squyres	73bcc4a36b	Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers This commit was SVN r23801. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-24 22:53:28 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Rainer Keller	5eb571c458	- As suggested in CMR #2558, attribute-macros should be tested on function pointers and assigned accordingly, instead of using the pre-processor in the header files. A functional change is (re-) specifying __opal_attribute_noreturn__ on orte_errmgr_base_abort(): All modules in the errmgr framework either use this function, or define their own abort function, which sets __opal_attribute_noreturn__. This attribute was taken out with the errmgr overhaul in r22872. This commit was SVN r23689. The following SVN revision numbers were found above: r22872 --> open-mpi/ompi@e4f2d03d28	2010-08-31 10:28:51 +00:00
Rainer Keller	4abcf5a0d7	- The Sun-compiler 12 update 1 complains about noreturn-attributes assigned to function-declarations. Check this case and mark the currently only case existing in trunk. Thanks to Paul Hargrove for bringing this up. Let's test the svn commit msg CMR:v1.5 This commit was SVN r23676.	2010-08-27 09:18:30 +00:00
Ralph Castain	2886da5669	Ensure that the local daemon vpid gets defined so that the locality procedures work when using the ess generic module. This commit was SVN r23638.	2010-08-24 04:38:21 +00:00
Josh Hursey	e12ca48cd9	A number of C/R enhancements per RFC below: http://www.open-mpi.org/community/lists/devel/2010/07/8240.php Documentation: http://osl.iu.edu/research/ft/ Major Changes: -------------- * Added C/R-enabled Debugging support. Enabled with the --enable-crdebug flag. See the following website for more information: http://osl.iu.edu/research/ft/crdebug/ * Added Stable Storage (SStore) framework for checkpoint storage * 'central' component does a direct to central storage save * 'stage' component stages checkpoints to central storage while the application continues execution. * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) * Added Compression (compress) framework to support * Add two new ErrMgr recovery policies * {{{crmig}}} C/R Process Migration * {{{autor}}} C/R Automatic Recovery * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) * {{{OMPI_CR_Restart}}} * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) * {{{OMPI_CR_Quiesce_start}}} * {{{OMPI_CR_Quiesce_checkpoint}}} * {{{OMPI_CR_Quiesce_end}}} * {{{OMPI_CR_self_register_checkpoint_callback}}} * {{{OMPI_CR_self_register_restart_callback}}} * {{{OMPI_CR_self_register_continue_callback}}} * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
* Add a progress meter to: * FileM rsh (filem_rsh_process_meter) * SnapC full (snapc_full_progress_meter) * SStore stage (sstore_stage_progress_meter) * Added 2 new command line options to ompi-restart * --showme : Display the full command line that would have been exec'ed. * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) * Deprecated some MCA params: * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared * snapc_base_store_in_place deprecated, replaced with different components of SStore * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref * snapc_base_establish_global_snapshot_dir deprecated, never well supported * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem Minor Changes: -------------- * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}} * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. * Cleanup the CRS framework and components to work with the SStore framework. * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). * Add 'quiesce' hook to CRCP for a future enhancement. 
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. * Add optional application level INC callbacks (registered through the CR MPI Ext interface). * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. * {{{opal-restart}}} also supports local decompression before restarting * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata * {{{orte-restart}}} now uses the SStore framework to work with the metadata * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. * Improve the checks for 'already checkpointing' error path. * Add a recovery output timer, to show how long it takes to restart a job * Do a better job of cleaning up the old session directory on restart. * Add a local module to the autor and crmig ErrMgr components. 
These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. This commit was SVN r23587. The following Trac tickets were found above: Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924 Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097 Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161 Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192 Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208 Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342 Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413	2010-08-10 20:51:11 +00:00
Terry Dontje	b74ef351b7	Added new solaris sysinfo module. Also added code to assign orte_local_chip_type and orte_local_chip_model in MPI processes if the appropriate sysinfo module found the values on the machine. This commit was SVN r23581.	2010-08-09 19:28:56 +00:00
Ralph Castain	e16ec40283	Check the correct envar. Thanks to Philippe for pointing out the error! This commit was SVN r23499.	2010-07-27 03:49:03 +00:00
Ralph Castain	6158cab2e1	Add missing header This commit was SVN r23497.	2010-07-27 01:37:41 +00:00
Ralph Castain	e858b029aa	Re-establish support for IBM's POE environment by resurrecting the poe launcher from the v1.2 series, updated for the current trunk. Modify the hnp errmgr module to support batch-operated jobs. Update the new ess generic module to support a wider range of mapping options. This commit was SVN r23493.	2010-07-24 23:35:20 +00:00
Ralph Castain	dfdb1eca0c	Remove windows dist file requirement This commit was SVN r23480.	2010-07-23 01:23:15 +00:00
Ralph Castain	67027cfce4	Add a new "generic" ess module that supports direct launched MPI jobs so long as certain minimum envars are provided, regardless of launch environment. This commit was SVN r23478.	2010-07-22 21:51:43 +00:00
Ralph Castain	12cd07c9a9	Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases. The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang. This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is: * pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place * modified the errmgr to directly call the new routines when termination is detected * removed the grpcomm.onesided_barrier and its associated RML tag * add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree * use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial. This commit was SVN r23429.	2010-07-17 21:03:27 +00:00
Ralph Castain	570d19106b	Allow singletons to use ompi-server for rendezvous via pubsub as well as comm_spawn without starting their own local daemons This commit was SVN r23384.	2010-07-13 06:33:07 +00:00
Ralph Castain	31295e8dc2	As discussed on today's telecon, reorganize the debugger attachment code in orte to better support efforts within the tool community aimed at exploring alternative methods. Move the debugger attachment code from the orterun directory to a new debugger framework. Organize the existing standard support code into an "mpir" component. Organize the current extensions for co-spawning debugger daemons into a separate "mpirx" component. Since the MPIR symbols are now included in the ORTE library, remove duplicate declarations in OMPI and replace them with extern references to their ORTE instantiations. This commit was SVN r23360.	2010-07-06 23:35:42 +00:00
Ralph Castain	98e7bc94f5	This belongs more properly in the ORCM repo now that we have MCA capability over there This commit was SVN r23346.	2010-07-05 18:25:03 +00:00
Shiqing Fan	681df0089b	Add a few new files into the tarball. This commit was SVN r23297.	2010-06-22 16:45:56 +00:00
Shiqing Fan	2e5e9f0a03	Fix a wrong Windows path in hnp_contact, which causes problems when looking up in the session directories. Add two more ess modules for Windows. This commit was SVN r23286.	2010-06-21 09:47:33 +00:00
Ralph Castain	1e90b91b84	Unset envars we set during initialization so we leave environ intact after orte_finalize. Thanks to Damien Gunter for pointing it out. This commit was SVN r23277.	2010-06-17 13:42:21 +00:00
Ralph Castain	bd045468e5	Let apps use the ess cm module too... This commit was SVN r23246.	2010-06-07 14:16:34 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns back the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Ralph Castain	7ce34223f1	Per off-list discussion, implement the new OMPI exit status policy (soon to be on wiki) and further cleanup error reporting to cover new cases. Implement process migration when failed nodes are detected. Some testing still required This commit was SVN r23121.	2010-05-12 18:11:58 +00:00
Ralph Castain	4bd25f587c	Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework. Add an "app" component to the errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so. Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort. This commit was SVN r23112.	2010-05-11 00:34:12 +00:00
Ralph Castain	2ff1ae13e1	Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set. This commit was SVN r23102.	2010-05-05 00:48:43 +00:00
Ralph Castain	9dfb5c7c62	Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set This commit was SVN r23082.	2010-05-03 04:11:03 +00:00
Ralph Castain	55889934d8	After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon. Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly. This commit was SVN r23045.	2010-04-27 03:39:32 +00:00
Ralph Castain	b9893aacc5	Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here: 1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified. 2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set. This support must be enabled by configuring --enable-sensors. ompi_info and orte-info have been updated to include the new framework. Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized. Implementation continues. This commit was SVN r23043.	2010-04-26 22:15:57 +00:00
George Bosilca	95394456ca	No Windows EOF. This commit was SVN r23024.	2010-04-23 05:51:29 +00:00
Ralph Castain	efbb5c9b7c	Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base: * add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway. * update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states. * remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails. * update the plm_base_receive.c code to call the errmgr with state updates from remote daemons * update the routed modules to reflect that process state is updated in the errmgr * ensure that the orted's open the errmgr and select their appropriate module * add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place * define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired. * update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios This commit was SVN r23023.	2010-04-23 04:44:41 +00:00
Ralph Castain	4308922f59	Ensure that any application-specific selection of ess module doesn't get overridden by what is given to the orted or orterun Cleanup tool name determination for CM This commit was SVN r22980.	2010-04-15 18:10:50 +00:00
Ralph Castain	eeccf2f15c	Some minor changes to support vm's This commit was SVN r22975.	2010-04-14 01:20:43 +00:00
Ralph Castain	8da781af84	Continue developing support for distributed virtual machines - minor changes to ensure correct jobid gets used and that dvm's can communicate with tools This commit was SVN r22958.	2010-04-12 22:33:09 +00:00
Shiqing Fan	96b20a29b5	An easy solution to make singleton work on Windows. This commit was SVN r22952.	2010-04-10 16:30:59 +00:00
Ralph Castain	75e99e6118	Do a better job of selecting cm ess component, handle tool and daemon issues This commit was SVN r22942.	2010-04-07 18:59:21 +00:00
Ralph Castain	8e29a6858a	Properly handle the case when a daemon is given both parts of its name This commit was SVN r22935.	2010-04-06 22:41:18 +00:00
Josh Hursey	e4f2d03d28	ErrMgr Framework redesign to better support fault tolerance development activities. Explained in more detail in the following RFC: http://www.open-mpi.org/community/lists/devel/2010/03/7589.php This commit was SVN r22872.	2010-03-23 21:28:02 +00:00
Ralph Castain	0b9552cd4e	Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly. This commit was SVN r22870.	2010-03-23 20:47:41 +00:00
Ralph Castain	d49f93b743	Cleanup the initialization handshake for multicast apps This commit was SVN r22855.	2010-03-19 20:15:01 +00:00
Ralph Castain	abbdc2b527	Pass the job family to tools that need to connect to specific HNPs This commit was SVN r22853.	2010-03-19 04:01:33 +00:00
Ralph Castain	ffd5be6aa1	Add a new framework to ORTE for saving and recovering state information. Two components are included that use the db or dbm library for storing the data, with a distributed hash table component coming later. Note that each of these components will only be selected if specifically requested - otherwise, a "NULL" component will be used. The framework is only opened by the HNP and orteds, though neither is currently coded to save/restore state This commit was SVN r22839.	2010-03-16 20:59:48 +00:00
Josh Hursey	e9b5162d79	Fix the configure logic for --with-ft so that it properly takes a comma separated list. Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those. The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed. This commit was SVN r22824.	2010-03-12 23:57:50 +00:00
Ralph Castain	c88fe1ea54	Create a new mca parameter to control creation of session directories. Defaults to true so that the current behavior of always creating them is preserved. If set to false (0), then don't create session directories. Helps in those environments where session directories are a problem. Tell the sm btl that it cannot run if no session directories were created. This commit was SVN r22756.	2010-03-02 15:18:33 +00:00
Ralph Castain	18c7aaff08	Update the grpcomm framework to be more thread-friendly. Modify the orte configure options to specify --enable-multicast such that it directs components to build or not instead of littering the code base with #if's. Remove those #if's where they used to occur. Add a new grpcomm "mcast" module to support multicast operations. Still some work required to properly perform daemon collectives for comm_spawn operations. New module only builds when --enable-multicast is provided, and when specifically selected. This commit was SVN r22709.	2010-02-25 01:11:29 +00:00
Ralph Castain	9a5fdbb622	Continue development of reliable multicast This commit was SVN r22616.	2010-02-14 19:20:56 +00:00
Ralph Castain	16b7bc7a82	Sigh...get the order right to match unpack This commit was SVN r22539.	2010-02-03 15:50:43 +00:00
Ralph Castain	e88627a7ca	Ensure we don't go through rml open/select more than once. Open the rml to get the uri when bootstrapping daemons This commit was SVN r22538.	2010-02-03 15:38:32 +00:00
Ralph Castain	cb1007b5a9	Pass back the number of daemons in the system This commit was SVN r22537.	2010-02-03 14:31:16 +00:00
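
The file-activity sensor described in r23043 above notifies the errmgr when a file's size, access time, and modification time fail to change over a specified number of sampling iterations. The following is a minimal Python sketch of that stall-detection idea only, not ORTE code: the class name `FileActivitySensor`, the `limit` parameter, and the `notify` callback are hypothetical stand-ins for the sensor module, its MCA parameters, and the errmgr notification.

```python
import os
from typing import Callable, Optional, Tuple


class FileActivitySensor:
    """Hypothetical stand-in for a file-activity sensor (not the ORTE module)."""

    def __init__(self, path: str, limit: int, notify: Callable[[str], None]) -> None:
        self.path = path
        self.limit = limit      # consecutive unchanged samples tolerated (assumed knob)
        self.notify = notify    # stand-in for the errmgr notification
        self._last: Optional[Tuple[int, int, int]] = None
        self._unchanged = 0

    def _snapshot(self) -> Tuple[int, int, int]:
        # Size, access time, and modification time, as named in the commit message.
        st = os.stat(self.path)
        return (st.st_size, st.st_atime_ns, st.st_mtime_ns)

    def sample(self) -> bool:
        """Take one sample; return True if the notification fired."""
        current = self._snapshot()
        if current == self._last:
            self._unchanged += 1
        else:
            self._unchanged = 0
        self._last = current
        if self._unchanged >= self.limit:
            self.notify(self.path)
            self._unchanged = 0  # start counting again after reporting
            return True
        return False
```

In a real monitor `sample()` would be driven by a timer at the configured sampling rate; here the caller invokes it directly so the behavior is deterministic.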


403 commits
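
The comparison idiom from r23162 above can be illustrated with a small Python sketch. It assumes a made-up encoding in which extra information is packed above the low 8 bits of the error magnitude; the real SOS layout is not given in the log, and `sos_encode`, `sos_get_err_code`, and the constant values are hypothetical stand-ins rather than the OPAL API.

```python
SUCCESS = 0                 # illustrative values, not taken from the OPAL headers
ERR_GENERIC = -1
ERR_OUT_OF_RESOURCE = -2


def sos_encode(err: int, tag: int) -> int:
    """Pack a tag above the native error code (hypothetical layout)."""
    if err == SUCCESS:
        return SUCCESS      # success is preserved by the encoding
    return -((tag << 8) | (-err))


def sos_get_err_code(ret: int) -> int:
    """Recover the native error code from an encoded or plain value."""
    return -((-ret) & 0xFF)


def is_out_of_resource(ret: int) -> bool:
    # A direct (ERR_OUT_OF_RESOURCE == ret) check would miss encoded values.
    return sos_get_err_code(ret) == ERR_OUT_OF_RESOURCE


def failed(ret: int) -> bool:
    # Success survives encoding, so no decoding is needed for this check.
    return ret != SUCCESS
```

The second helper mirrors the commit's other point: testing against success avoids decoding `ret` at all.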