openmpi

Author	SHA1	Message	Date
Ralph Castain	df642f1508	Add an API to get a remote file's size. Separate dfs cmds from returned data messages so daemons don't get confused. This commit was SVN r27487.	2012-10-25 22:23:08 +00:00
Ralph Castain	79e36413c2	There was some discussion of this at an earlier time, but we never got around to doing it - so make orte behave more like a regular library, counting the number of times init is called, and executing finalize when all those are exhausted. This commit was SVN r27484.	2012-10-25 18:39:37 +00:00
Ralph Castain	094d6f3143	Add a new "distributed file system" capability to support file access operations across nodes that do not have a network file system attached to them. Add a set of URI create/parse utilities This commit was SVN r27483.	2012-10-25 17:15:17 +00:00
Ralph Castain	32c185f730	Set a priority for output of forwarded IO so it can effectively compete against inbound messages This commit was SVN r27480.	2012-10-24 23:34:50 +00:00
Ralph Castain	e06c330635	Add the ability to set a backlog limit on forwarded output waiting at mpirun - helps to avoid crashing systems during debug. Note that we default to "unlimited" to maintain current behavior. This commit was SVN r27479.	2012-10-24 23:21:40 +00:00
Ralph Castain	e6014bf2e1	Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter This commit was SVN r27477. The following SVN revision numbers were found above: r27451 --> open-mpi/ompi@d59034e6ef r27456 --> open-mpi/ompi@ecdbf34937	2012-10-24 18:38:44 +00:00
Ralph Castain	7574d6673b	If someone provides the launch_agent cmd, then don't prefix it cmr:v1.7 This commit was SVN r27473.	2012-10-24 16:14:04 +00:00
Ralph Castain	5c0534a7ad	Ensure that comm_spawn launches procs on the nodes specified by add-host and add-hostfile This commit was SVN r27452.	2012-10-18 00:40:44 +00:00
Nathan Hjelm	d59034e6ef	MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions. cmr:v1.7 This commit was SVN r27451.	2012-10-17 20:17:37 +00:00
Ralph Castain	4028ce7a5d	Silence warnings by making types match This commit was SVN r27446.	2012-10-14 03:45:28 +00:00
Ralph Castain	285a3b168d	Add an ability to specify the max number of simultaneous procs/node for an application when operating in staged mode. Change some debug statements from OPAL_OUTPUT_VERBOSE to opal_output_verbose so they are available in optimized builds. This commit was SVN r27445.	2012-10-14 03:31:32 +00:00
Ralph Castain	04304c186f	Remove the setup_hadoop configure script as it is no longer required - the hadoop support components can build without accessing hadoop itself. This commit was SVN r27385.	2012-09-29 18:30:35 +00:00
Ralph Castain	9daaa001d9	Remove tools that are no longer required This commit was SVN r27383.	2012-09-29 17:33:16 +00:00
Ralph Castain	54db4c35eb	Get the trunk to build again when --without-hwloc is specified. Move a couple of key type definitions and utilities out from under the HAVE_HWLOC test so they are always available as they don't really depend on hwloc's presence. Tell two components not to build if hwloc is disabled: ompi/mca/sbgp/basesmsocket orte/mca/rmaps/lama Remove stale configure.params files from the sbgp framework as the OMPI build system no longer looks at those files. This commit was SVN r27377.	2012-09-26 23:24:27 +00:00
Samuel Gutierrez	42280e2af5	Temporarily make routed binomial the default. We are experiencing issues with debruijn when launching fewer processes than are actually available within an allocation. When this is fixed, please revert this change. This commit was SVN r27376.	2012-09-26 16:08:12 +00:00
Jeff Squyres	cb65a44c6c	Fix the component priority assignment. Thanks to Alex Margolin for the patch. This commit was SVN r27363.	2012-09-25 07:13:23 +00:00
George Bosilca	6ec41400b3	Fix the error message in case a daemon does not succeed at killing the local offspring. This commit was SVN r27362.	2012-09-24 15:25:21 +00:00
Ralph Castain	d5279b0dc8	Make an attempt to protect hwloc cset2str from segfaulting in weird scenario This commit was SVN r27361.	2012-09-23 16:51:51 +00:00
Ralph Castain	1ddb334a52	Put the specified JDK path first so we find that one This commit was SVN r27360.	2012-09-22 19:08:52 +00:00
Ralph Castain	d95025f53a	Ensure we clear the usage numbers when binding on multiple nodes so we don't "carry over" info from one node to the next. Use the same tracking mechanism for binding upwards and in-place to avoid doing a bunch of mallocs. Refs trac:3322 This commit was SVN r27356. The following Trac tickets were found above: Ticket 3322 --> https://svn.open-mpi.org/trac/ompi/ticket/3322	2012-09-20 15:16:06 +00:00
Ralph Castain	90d7b5fdca	Update test This commit was SVN r27354.	2012-09-20 02:51:27 +00:00
Ralph Castain	445161cd2e	Correctly count the total number of allocated slots This commit was SVN r27353.	2012-09-20 02:50:14 +00:00
Ralph Castain	f592967685	Add missing retain to maintain correct accounting on nodes This commit was SVN r27352.	2012-09-20 02:30:53 +00:00
Ralph Castain	e309db0be9	Ensure file descriptors are closed upon completion of transfer This commit was SVN r27349.	2012-09-18 18:39:29 +00:00
Ralph Castain	11305109e1	Track positioned files so we avoid re-positioning them across jobs This commit was SVN r27347.	2012-09-18 15:56:21 +00:00
Ralph Castain	a3060cdd15	Fix the bind_downward code - it was incorrectly looking across the entire node instead of only looking below the locale to which the proc had been assigned. In other words, if the proc was mapped to a core, then the only hwthreads that should be considered for binding are those directly below that core. The binding algo was incorrectly looking at ALL hwthreads in that scenario, causing the proc to be bound to an HT outside of the mapped location. This now results in the procs being bound within their assigned location. It also causes us to use only the 0th HT on a core unless --use-hwthread-cpus has been specified (in which case, we use all the HTs in a core). Bind to core binds you to all HTs regardless - the --use-hwthread-cpus only impacts the oversubscribed determination and when binding to HT. cmr:v1.7 This commit was SVN r27342.	2012-09-14 22:01:19 +00:00
Ralph Castain	9057e84ec1	Correct test statement This commit was SVN r27321.	2012-09-12 14:30:03 +00:00
Ralph Castain	c4fd3df2df	Remove unused variables This commit was SVN r27319.	2012-09-12 12:03:24 +00:00
Ralph Castain	c82cfecc1c	Cleanup comm_spawn for the multi-node case where at least one new process isn't spawned on every node. Avoid the complexities of trying to execute a daemon collective across the dynamic spawn as it becomes too hard to ensure that all daemons participate or are accounted for - instead, use a less scalable but workable solution of sending the data directly between the participating procs. Ensure that singletons get their collectives properly defined at startup so the spawned "HNP" is ready for them. As a secondary cleanup, the HNP doesn't need to update its nidmap during an xcast as it already has an up-to-date picture of the situation. So just dump that data and move along. This commit was SVN r27318.	2012-09-12 11:31:36 +00:00
Ralph Castain	6b5f9d7767	Some cleanups for staged execution This commit was SVN r27317.	2012-09-12 09:15:33 +00:00
Ralph Castain	5f7a5c4793	Update test to include all keys This commit was SVN r27311.	2012-09-12 05:02:51 +00:00
Jeff Squyres	fb2e543a57	Refs trac:3275. We ran into a case where the OMPI SVN trunk grew a new acceptable MCA parameter value, but this new value was not accepted on the v1.6 branch (hwloc_base_mem_bind_failure_action -- on the trunk it accepts the value "silent", but on the older v1.6 branch, it doesn't). If you set "hwloc_base_mem_bind_failure_action=silent" in the default MCA params file and then accidentally ran with the v1.6 branch, every OMPI executable (including ompi_info) just failed because hwloc_base_open() would say "hey, 'silent' is not a valid value for hwloc_base_mem_bind_failure_action!". Kaboom. The only problem is that it didn't give you any indication of where this value was being set. Quite maddening, from a user perspective. So we changed how ompi_info handles this case. If any framework open function returns OMPI_ERR_BAD_PARAM (either because its base MCA params got a bad value or because one of its component register/open functions returns OMPI_ERR_BAD_PARAM), ompi_info will stop, print out a warning that it received an error, and then dump out the parameters that it has received so far in the framework that had a problem. At a minimum, this will show the user the MCA param that had an error (it's usually the last one), and ''where it was set from'' (so that they can go fix it). We updated ompi_info to check for O???_ERR_BAD_PARAM from each of the framework opens. Also updated the doxygen docs in mca.h for this O???_BAD_PARAM behavior. And we noticed that mca.h had MCA_SUCCESS and MCA_ERR_??? codes. Why? I think we used them in exactly one place in the code base (mca_base_components_open.c). So we deleted those and just used the normal OPAL_* codes instead. While we were doing this, we also cleaned up a little memory management during ompi_info/orte-info/opal-info finalization.
Valgrind still reports a truckload of memory still in use at ompi_info termination, but they mostly look to be components not freeing memory/resources properly (and outside the scope of this fix). This commit was SVN r27306. The following Trac tickets were found above: Ticket 3275 --> https://svn.open-mpi.org/trac/ompi/ticket/3275	2012-09-11 20:47:24 +00:00
Ralph Castain	a0ffeb205a	Add an orted component for staged operations and rename the staged component to "staged_hnp". This commit was SVN r27305.	2012-09-11 20:35:46 +00:00
Ralph Castain	387f657fc2	Nuts - forgot to include this with the MPI Ticket 313 stuff. Set some of the envars needed for MPI_INFO_ENV This commit was SVN r27304.	2012-09-11 20:35:09 +00:00
Ralph Castain	cd8aff675b	Update test This commit was SVN r27303.	2012-09-11 20:32:43 +00:00
Ralph Castain	e8ecd67d53	Once again, bloody SLURM changes the envars and breaks things. Try and track their changes so we get a correct allocation. This commit was SVN r27302.	2012-09-11 20:31:33 +00:00
Jeff Squyres	a8f8064d8b	Add a missing free(). Refs trac:3292. This commit was SVN r27298. The following Trac tickets were found above: Ticket 3292 --> https://svn.open-mpi.org/trac/ompi/ticket/3292	2012-09-11 17:59:40 +00:00
Josh Hursey	40132f1874	Protect a potentially uninitialized variable (orte_sstore_base_global_snapshot_ref). If orte_sstore_base_global_snapshot_ref is null, then it will default appropriately when it is used. When prelaunching we always specify this parameter, but if we are not prelaunching it is possible to allow this to be null and it will initialize when used. However, we set up the prelaunching variable in both situations and in the latter that would result in a NULL reference. This patch protects that code segment. This commit was SVN r27289.	2012-09-11 15:14:28 +00:00
Ralph Castain	ca40cb5f1c	Fix comm_spawn by mpirun This commit was SVN r27285.	2012-09-10 17:09:25 +00:00
Jeff Squyres	8585920e49	Add a note in the default hostfile that it is not used in managed environments. This commit was SVN r27264.	2012-09-07 14:41:19 +00:00
Ralph Castain	4ca495c7e3	In managed allocations, we need to ensure that all nodes are flagged as having their slots "given" in case the user incorrectly asks us to change them. Also need to update the HNP node's "slots_given" flag as we don't replace the orte_node_t object when inserting the node info for that node. This commit was SVN r27258.	2012-09-07 04:08:17 +00:00
Ralph Castain	2110fb7f95	Add some debug This commit was SVN r27257.	2012-09-07 04:06:37 +00:00
Ralph Castain	78ccb097f0	Fix vm setup in unmanaged environments - needs to construct a node list in the same way we now do for mapping This commit was SVN r27256.	2012-09-07 01:53:19 +00:00
Ralph Castain	876d78f36a	If JAVA_HOME is present on a Linux system, use it to find Java support This commit was SVN r27255.	2012-09-07 00:41:14 +00:00
Ralph Castain	36acbe4ca6	Multiple apps might want the same files, so instead of using the app_idx to determine who gets what, use the actual file names as they are sent anyway This commit was SVN r27254.	2012-09-06 22:02:05 +00:00
Ralph Castain	e9e52fc78f	Gain some efficiency in the staged mapper - if soft locations are in use and get_nodes returns busy, then no need to continue cycling thru the remaining apps as all nodes are occupied This commit was SVN r27253.	2012-09-06 22:01:18 +00:00
Ralph Castain	67f34c3be6	Record the bind_level recvd by the daemon for each job so it can be correctly sent to the procs. Add test in get_relative_locality to avoid descending into an infinite loop if the level is NODE (==0). This commit was SVN r27252.	2012-09-06 20:50:07 +00:00
Ralph Castain	efa50346c8	Error out if we are filtering a hostfile and encounter a node that is not in the resource-managed allocation, giving an error message identifying the file and the node. Don't filter managed allocations thru a default hostfile as this can lead to "hidden" errors. Don't use dash-host info on managed allocations if we are using soft locations This commit was SVN r27245.	2012-09-05 19:42:00 +00:00
Ralph Castain	d772e0fc3d	Add an option to treat dash-host specifications as "requested, but not required". So-called "soft" location requests can allow an application to execute even if the ideal allocation isn't available. This commit was SVN r27242.	2012-09-05 18:42:09 +00:00
Ralph Castain	6d29cecce1	Fix the help message warning of multiple prefixes so it correctly prints out the info, and fix a typo. cmr:v1.7 This commit was SVN r27241.	2012-09-05 16:28:36 +00:00
Ralph Castain	64ccf789f2	Ensure the final output is printed cmr:v1.7 This commit was SVN r27240.	2012-09-05 16:25:15 +00:00
Ralph Castain	fde83a44ab	This confusion has been around for awhile, caused by a long-ago decision to track slots allocated to a specific job as opposed to allocated to the overall mpirun instance. We eliminated that quite a while ago, but never consolidated the "slots_alloc" and "slots" fields in orte_node_t. As a result, confusion has grown in the code base as to which field to look at and/or update. So (finally) consolidate these two fields into one "slots" field. Add a field in orte_job_t to indicate when all the procs for a job will be launched together, so that staged operations can know when MPI operations are allowed. This commit was SVN r27239.	2012-09-05 01:30:39 +00:00
Ralph Castain	bae5dab916	If (and only if) a user requests, set the default number of slots on any node to the number of objects of the specified type. This only takes effect in an unmanaged environment - i.e., if an external resource manager assigns us a number of slots, then that is what we use. However, if we are using a hostfile, then the user may or may not have given us a value for the number of slots on each node. For those nodes (and only those nodes) where the user does not specify a slot count, we will set the number of slots according to their direction: either to the number of cores, numas, sockets, or hwthreads. Otherwise, the slot count is set to 1. Note that the default behavior remains unchanged: in the absence of any value for #slots, and in the absence of any directive to set #slots, we will set #slots=1. This commit was SVN r27236.	2012-09-04 20:58:26 +00:00
Ralph Castain	18d2f75b56	Ensure we don't re-link files when staging execution as we may be executing more members of the same app. Allow the user to ask that directory trees be "flattened" so that all files appear in the proc's session directory itself. This commit was SVN r27232.	2012-09-04 17:52:12 +00:00
Ralph Castain	86a16edd5a	Move the debug output to the right place This commit was SVN r27231.	2012-09-04 17:22:17 +00:00
Ralph Castain	11de735e8a	Complete the revamp of hostfile support in non-managed environments. Working at the app level, ensure that we utilize only those nodes specified for that app, but fall back to the default hostfile (if available) for those with no specification, further falling back to the local host if the default hostfile is not present or is empty. This commit was SVN r27230.	2012-09-04 16:34:05 +00:00
Jeff Squyres	b23a6b8eda	Shiqing removed this file in r27217 (but neglected to remove it from the Makefile.am). This commit was SVN r27226. The following SVN revision numbers were found above: r27217 --> open-mpi/ompi@ddbd542732	2012-09-04 13:06:39 +00:00
Ralph Castain	42d58f17bf	Provide a more user-expected way of handling hostfile and dash-host allocations in unmanaged environments. First look for an RM-managed allocation. If nothing is found, or no active module is alive, proceed to look for hostfile and dash-host assignments. These are provided on a per-app basis, so we have to cycle across the apps using the following algorithm: 1. if a hostfile is given, then add the nodes found in that hostfile to our list - i.e., the resulting allocation contains the UNION of all nodes specified in hostfiles from across all apps. 2. any app that has no hostfile but has a dash-host, will have those nodes added to the list 3. any app that fails to have a hostfile or a dash-host will be given the default hostfile, if we have it Each app will subsequently be filtered using their hostfile and/or dash-host data to ensure that the app only has access to the hosts it specified Note that any relative node syntax found in the hostfiles or dash-host data will generate an error in this scenario, so only non-relative syntax can be present This commit was SVN r27223.	2012-09-04 11:50:52 +00:00
Ralph Castain	3894179e2f	Add missing file This commit was SVN r27222.	2012-09-04 01:16:58 +00:00
Ralph Castain	fa6a18f05a	If the orted HNP is spawned by a singleton, then it needs to harvest all the MCA params from its environment to ensure they are passed on to any subsequently spawned daemons. Otherwise, the singleton could be directed to select options that other apps miss. This commit was SVN r27221.	2012-09-04 01:10:26 +00:00
Ralph Castain	b5e26a90ea	Singletons should insist that the spawned orted HNP use the novm state machine so that any subsequent comm_spawn occurs only on required nodes This commit was SVN r27220.	2012-09-04 01:09:01 +00:00
Shiqing Fan	ddbd542732	Remove one .windows file. Add a macro definition for isblank function. This commit was SVN r27217.	2012-09-03 09:51:44 +00:00
Ralph Castain	66c3f5d18d	When getting target nodes for mapping, there is a difference between not finding any nodes that match the required constraints (either in hostfile or dash-host filtering) and finding at least one such node, but all its slots are busy. Make the return code reflect this difference so the caller can take appropriate action. This commit was SVN r27213.	2012-09-01 10:30:40 +00:00
Ralph Castain	95019cc310	Fix a few places where we weren't completely identifying hostfile-based operations against "localhost" entries. Tell the mapper base to be silent when we don't want errors announced because nodes aren't available for mapping (sometimes it is okay if they are fully used). Fix an infinite loop in the file prepositioning code. This commit was SVN r27210.	2012-08-31 21:28:49 +00:00
Jeff Squyres	da00d281e6	Oops -- we want the priority to be low, not high (for now). :-) This commit was SVN r27208.	2012-08-31 21:08:35 +00:00
Jeff Squyres	fcc1c7e33c	= Overview = First revision of the Location Aware Mapping Algorithm (LAMA) RMAPS component. This component is used to effect many different types of regular process/processor affinity patterns. Although quite flexible in the patterns that it provides, it is ''not'' a fully-arbitrary, rankfile-like solution for process/processor affinity. Inspired by BlueGene-like network specifications, LAMA has a core algorithm that is quite good at specifying regular patterns in multiple "dimensions" (where "dimensions" are expressed in terms of different hardware elements: processor hardware threads, cores, sockets, ...etc.). The LAMA core algorithm is described here: http://www.open-mpi.org/papers/cluster-2011-lama/ = LAMA Usage Levels = LAMA allows specifying affinity in multiple different ways: 1. None: Specifying no affinity options to mpirun results in exactly the same behavior as today: no affinity is used. 1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by <LEVEL>" to indicate how "wide" each process should be bound (i.e., bind to a processor core, or to a processor socket, etc.) and how to lay out the processes (i.e., round robin by cores, sockets, etc.). 1. Expert: Using four new MCA parameters to effect process mapping and binding to processors.
These options are a bit complex, and are not for the faint of heart, but offer a high degree of (regular pattern) flexibility (each of these is described more fully below): * rmaps_lama_map: a sequence of characters describing how to lay out processes * rmaps_lama_bind: a sequence of characters describing the resources to bind to each process * rmaps_lama_mppr: a sequence of characters describing the maximum number of processes to allow per resource (i.e., a specific definition of "oversubscription") * rmaps_lama_ordering: once all processes are in place, how to order the ranks in MPI_COMM_WORLD We anticipate that most users will utilize the "None" and "Simple" levels of affinity, and they continue to work just as they do with the v1.6 series and SVN trunk. The Expert level was designed for two purposes: 1. To provide a precise definition for the "Simple" level (i.e., every --bind-to/--map-by option in the "Simple" level has a corresponding precise specification in the "Expert" level) 1. As modern computing platforms become more complex, we simply cannot predict what application developers will need in terms of processor affinity. LAMA is an attempt to provide a highly flexible mechanism that allows applications to utilize a variety of complex, unique affinity patterns beyond the common "bind to core" and "bind to socket" patterns. = LAMA Simple Level = The "Simple" level is pretty much the same as what Open MPI has offered for years. It supports the same --bind-to and --map-by options that Open MPI has supported for a while, but expands their scope a bit. Specifically, the following options are available for both --bind-to and --map-by: * slot * hwthread * core * l1cache * l2cache * l3cache * socket * numa * board * node = LAMA Expert Level = The "Expert" level requires some explanation. I'll repeat my disclaimer here: the LAMA Expert level is not for the meek. It is flexible, but complex.
'''Most users won't need the Expert level.''' LAMA works in three phases: mapping, binding, and ordering. Each is described below. == Expert: Mapping == Processes are paired with sets of resources. For example, each process may be paired with a single processor core. Or each process may be paired with an entire processor socket. LAMA performs this mapping, obeying the Max Processes Per Resource ("MPPR", pronounced "mipper") limits. More on MPPR, below. Mapping can be performed across multiple hardware levels: * h: Hardware thread * c: Processor core * s: Processor socket * L1: L1 cache * L2: L2 cache * L3: L3 cache * N: NUMA node * b: Processor board * n: Server node If the act of mapping is that of pairing MPI processes to the resources that have been allocated to a job, one can easily imagine looping through all the resources and assigning processes to them. But to effect different process layout patterns across those resources, one may want to loop over those resources ''in a different order.'' That is, if the above-mentioned nine hardware resources (hardware thread, processor core, etc.) can be thought of as a nine-dimensional space, you can imagine nine nested loops to traverse all of them. And you can imagine that changing the order of nesting would change the traversal pattern. LAMA accepts a sequence of tokens representing the above-mentioned nine hardware resources to specify the order of looping when mapping resources to processes. For example, consider a "simple" traversal: csL1L2L3Nbnh. Reading that sequence of letters from left-to-right, it specifies mapping by processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA node, processor board, server node, and finally hardware thread. Wait... what? That string specifies resources from "smallest" to "largest" -- with the exception of hardware threads. Why are they tacked on to the end?
In short, this string of letters means "map by round robin by core" -- (indeed, it exactly corresponds to the Simple level "--map-by core"). Specifically, LAMA traverses the string from left-to-right and maps processes to all the resources indicated by that token (e.g., "c" for processor core). When there are no more resources indicated by that token, it goes on to the next token. Hence, in this case, LAMA will map the first process to the first core, then it will map the second process to the second core, and so on. Once all the cores are exhausted, LAMA effectively ignores all the other letters until "h" (because all the other resources are made up of cores; when cores are exhausted, those resources are exhausted, too). If there are still more processes to be mapped, LAMA will then traverse all the hyperthreads -- meaning that the next process will be mapped to the second hyperthread on the first core. And the next process will be mapped to the second hyperthread on the second core. And so on. Keep in mind that the cores involved may span many server nodes; we're not just talking about the cores (etc.) in a single machine. As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent to "--map-by socket" (i.e., LAMA maps the first process to the first socket, the second process to the second socket, and so on). The sequence of letters can be combined in many, many different ways to produce many different regular mapping patterns. === Max Processes Per Resource (MPPR) === The MPPR is an expression that precisely defines the maximum number of processes that can be mapped to any single resource. In effect, it defines the concept of "oversubscription." Specifically, traditional HPC wisdom is that "oversubscription" is when there is more than one MPI process per processor core. This conventional definition is expressed in a MPPR string of "1:c" (one process per core). But what if your MPI processes are multi-threaded, and they need multiple processes per core?
You'd need a different description of "oversubscription" in this case. Perhaps you want to have one MPI process per socket. This would be expressed in a MPPR string of "1:s". The general form of an individual MPPR specification is an integer followed by a colon, followed by any one of the tokens from mapping. For example "1:c" is pronounced "one process per core." Multiple MPPR specifications can be strung together into a comma-delimited list, too. All of these MPPR values are then taken into account when mapping. Here are some examples: * 1:c -- allow, at most, one process per processor core (i.e., don't schedule by hyperthread) * 1:s -- allow, at most, one process per processor socket (e.g., that process may be multithreaded, or wants exclusive use of the socket's caches) * 1:s,2:n -- only allow one process per processor socket, but, at most, two processes per server node (e.g., if the two MPI processes will consume all the RAM on the server node, even if there are more processor cores available) If mapping all processes to resources would exceed a MPPR limit, this job is ruled to be oversubscribed. If --oversubscribe was specified on the mpirun command line, the job continues. Otherwise, LAMA will abort the job. Additionally, if --oversubscribe is specified, LAMA will endlessly cycle through the mapping token string until all processes have been mapped. == Expert: Binding == Once processes have been paired with resources during the Mapping stage, they are optionally bound to a (potentially different) set of resources. For example, processes may be mapped round robin by processor socket, but bound to an individual processor core. To be clear: if binding is not used, then mapping is effectively reduced to "counting how many processes end up on each server node." Without binding, there's no enforcement that a process will stay where LAMA thinks it was placed. With binding, however, processes are bound to a set of hardware threads.
The number of threads to which the process is bound is sometimes referred to as the "binding width". For example, if a process is bound to all the hardware threads in a processor socket, its "width" is the processor socket. (Note that we specifically do not say that the hardware threads are sequential, even if they are all within a single resource such as a processor core or socket. BIOS ordering of hardware threads can be wonky; so we only refer to "sets of hardware threads".) Bindings are expressed as an integer and a token from the mapping string. For example "1s" means "bind each process to one processor socket" (there is no ":" in the binding string because the ":" is pronounced as "per" when reading the MPPR string). Note that it only makes sense to bind processes to a single resource specification (unlike the MPPR specification, where multiple limits can be specified). == Expert: Ordering == Finally, processes are assigned a rank in MPI_COMM_WORLD. LAMA currently offers two ordering modes: sequential or natural: * Sequential: if you laid out all the hardware resources in a single line, and then overlaid all the MPI processes on top of them, they are ordered from 0 to (N-1) from left-to-right. * Natural: the ordering of ranks follows the mapping ordering. For example, consider a server node with two processor sockets, each containing four cores. The command line "mpirun -np 8 --bind-to core --map-by socket --order n a.out" would result in MCW ranks that look like this: [0 2 4 6] [1 3 5 7]. = Execution = At this point, the job is fully mapped, optionally bound, and its ranks in MPI_COMM_WORLD are ordered. It now starts its execution. = Final Notes = Note that at this point, lama is not the default mapper. It must be activated with "--mca rmaps lama". We'll continue to do further testing and comparative analysis with the current set of ORTE mappers.
Also, note that the LAMA algorithm can handle heterogeneity between hardware resources (e.g., an MPI job spanning server nodes with differing numbers of processor sockets). For lack of a longer explanation (this commit message is already long enough!), LAMA considers each server node individually during mapping and binding. See the LAMA paper for more details: http://www.open-mpi.org/papers/cluster-2011-lama/ This commit was SVN r27206.	2012-08-31 19:57:53 +00:00
Jeff Squyres	ca5bd85364	Add some help messages to the RAS simulator for PEBKAC issues. This commit was SVN r27203.	2012-08-31 16:33:29 +00:00
Ralph Castain	38ce23db43	Add some protection to allow NULL bytes in byte objects and NULL strings to be handled cleanly in nidmaps and modex entries. Ensure there is a valid nidmap available for the HNP to pass down to any local procs when it is operating alone. This commit was SVN r27188.	2012-08-31 01:07:36 +00:00
Ralph Castain	efc4a40c8a	It is okay for a key not to be found This commit was SVN r27187.	2012-08-30 15:12:23 +00:00
Ralph Castain	6dbb7a8493	Just because a peer didn't post a particular pmi key, that doesn't mean it is an immediate irrecoverable error - could be they just don't have a matching interface. Let the upper layer decide what to do about it. This commit was SVN r27186.	2012-08-30 14:12:09 +00:00
Ralph Castain	05c0464dcb	Add missing protections This commit was SVN r27183.	2012-08-30 12:17:29 +00:00
Ralph Castain	1b659de132	Get staged execution working on multi-node setups. Improve efficiency by remapping only if not all procs in the job have been mapped yet. This commit was SVN r27181.	2012-08-29 20:35:52 +00:00
Ralph Castain	a3b08f5800	Fix a few things relating to comm_spawn that cause new daemons to be launched. Ensure that all new daemons receive a full pidmap. Properly mark the daemon job as "updated" when daemons are added This commit was SVN r27177.	2012-08-29 03:11:37 +00:00
Ralph Castain	f0077820f2	Silence warning This commit was SVN r27175.	2012-08-28 22:27:41 +00:00
Ralph Castain	a414ffdf4c	Remove debug This commit was SVN r27174.	2012-08-28 22:18:00 +00:00
Ralph Castain	30fd9d7abc	MPI procs should definitely not be trapping SIGCHLD - only ORTE tools need to do so This commit was SVN r27166.	2012-08-28 21:39:06 +00:00
Ralph Castain	98580c117b	Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine. Remove some stale configure.m4's we no longer need. Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages. This commit was SVN r27165.	2012-08-28 21:20:17 +00:00
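The staged-execution idea in the commit above (SVN r27165) can be sketched as a small simulation. This is an illustration only, not the ORTE mapper or state machine: `staged_execution`, its parameters, and the use of fixed run times are all assumptions made for the example.

```python
import heapq
from collections import deque


def staged_execution(durations, slots):
    """Simulate staged launch; return the (start, end) time of each proc.

    durations: run time of each proc, in launch order.
    slots: number of procs that may run at once (no oversubscription).
    """
    if slots < 1:
        raise ValueError("need at least one slot")
    pending = deque(enumerate(durations))
    running = []  # min-heap of (end_time, proc_id)
    schedule = {}
    now = 0

    while pending or running:
        # (a) start as many procs as there are free slots
        while pending and len(running) < slots:
            proc, duration = pending.popleft()
            schedule[proc] = (now, now + duration)
            heapq.heappush(running, (now + duration, proc))
        # (b) advance to the next completion, which frees a slot
        now, _ = heapq.heappop(running)

    return [schedule[p] for p in range(len(durations))]


print(staged_execution([3, 1, 2, 2], slots=2))
# [(0, 3), (0, 1), (1, 3), (3, 5)]
```

With two slots, the third proc starts only when the second one finishes at time 1, and the fourth starts when a slot frees up at time 3.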
Ralph Castain	aadfe1b61e	Fix a missing test that breaks novm operation. CMR:v1.7 This commit was SVN r27163.	2012-08-28 21:13:57 +00:00
Ralph Castain	d310dd8c58	Fix a strange race condition by creating a separate buffer for each send - apparently, just a retain isn't enough protection on some systems This commit was SVN r27161.	2012-08-28 17:17:34 +00:00
Ralph Castain	11c68e2299	Correct the count in the pmi key This commit was SVN r27156.	2012-08-28 15:05:02 +00:00
Ralph Castain	6e8c97c77c	Per Sam's eagle-eyed review, free the malloc'd memory if getcwd fails for some strange reason. This commit was SVN r27150.	2012-08-27 19:15:16 +00:00
Ralph Castain	bccc20d13e	Deal with one last corner case of positioning a dot-file This commit was SVN r27144.	2012-08-26 03:49:31 +00:00
Ralph Castain	63d41c643d	Minor cleanup This commit was SVN r27143.	2012-08-25 14:24:45 +00:00
Ralph Castain	0e1dbe8711	Remove non-existent files This commit was SVN r27136.	2012-08-25 01:29:17 +00:00
Ralph Castain	05f0b4c653	Couple of minor cleanups This commit was SVN r27135.	2012-08-24 21:14:40 +00:00
Ralph Castain	d6cbff6d4e	Since the preload flags are at the app_context level, we need to link only those files/exe's that pertain to each app_context to the corresponding procs. Also, gain a little optimization by checking to ensure we only send files once - this probably won't work when daemons are created on-the-fly, but that's for some other day This commit was SVN r27134.	2012-08-24 16:16:30 +00:00
Jeff Squyres	20612c4194	Don't close the IOF stdin if we happen to read less than a full buffer's worth of data -- interactive stdin will have that behavior frequently. This commit was SVN r27131.	2012-08-24 14:29:19 +00:00
Ralph Castain	e0c39c94e8	Complete the cleanup of the preload files system. Remove the dest_dir option as moving things to arbitrary locations - especially absolute paths - can prove disastrous. Remove the preload_libs option as these can be treated as just files. Cleanup some of the pack/unpack code as the dss handles NULL strings just fine. Deal a little better with absolute paths, noting that tar now strips the leading '/' for us (showing my age as it didn't use to do so). Remove the odls_base_state.c file as that code is now covered by the new broadcast form of preload_files. This commit was SVN r27127.	2012-08-24 02:28:29 +00:00
Ralph Castain	c8b511d18a	Remove stale tests This commit was SVN r27126.	2012-08-24 02:22:11 +00:00
Ralph Castain	b4a544ad2a	Per discussion with Josh, use the --preload-xxx cmd line options to broadcast files to all nodes. Add --set-cwd-to-session-dir option to start procs in their session directories. Add OMPI_FILE_LOCATION envar to tell procs where their prepositioned files went. This commit was SVN r27125.	2012-08-23 21:28:05 +00:00
Ralph Castain	855c9ae6cf	Support archives .tar, .bz[2,zip], and .gz[ip] This commit was SVN r27123.	2012-08-23 15:38:39 +00:00
Ralph Castain	286c610712	Protect us against the scenario where filem is included in enable-mca-no-build This commit was SVN r27122.	2012-08-23 13:52:06 +00:00
Shiqing Fan	d141d94bd7	Include the new .windows files into the tarball. This commit was SVN r27121.	2012-08-23 12:50:51 +00:00
Ralph Castain	7237a938bf	Extend the filem interface to support prepositioning and linking required local files for execution. Create a new "raw" module that uses xcast to send the files to all nodes as this is faster than doing an scp in a linear pattern This commit was SVN r27118.	2012-08-22 21:43:20 +00:00
Ralph Castain	ed4b354846	Ensure we pass along user-specified mca params from the cmd line when doing a tree spawn, but don't extend the cmd line with duplicates or things that shouldn't be there This commit was SVN r27117.	2012-08-22 21:41:50 +00:00
Ralph Castain	5d7872fd68	Cleanup the tag list This commit was SVN r27115.	2012-08-22 21:37:58 +00:00
Ralph Castain	3c13176aa7	Remove test code This commit was SVN r27114.	2012-08-22 21:36:54 +00:00
Ralph Castain	7bcf2f8b5c	Stop leaving droppings behind us This commit was SVN r27111.	2012-08-22 17:39:22 +00:00
Shiqing Fan	95b9552546	include several components for Windows build. This commit was SVN r27108.	2012-08-22 14:46:49 +00:00
Jeff Squyres	c8cee23ee7	Priorities really shouldn't be less than 0. This commit was SVN r27098.	2012-08-21 15:47:15 +00:00
Ralph Castain	dacb07000d	Turn udcm and ud oob off by default, but allow them to build and be used if someone wants to test them cmr:v1.7 This commit was SVN r27097.	2012-08-21 15:18:34 +00:00
Nathan Hjelm	0061ac066b	orte/alps: add support for --with-alps=yes on CLE 5.0 and clean out tabs This commit was SVN r27096.	2012-08-20 15:26:58 +00:00
Ralph Castain	64cf75cec5	Add some debug This commit was SVN r27087.	2012-08-17 02:19:26 +00:00
Ralph Castain	a572b6fa9f	Pick the right place This commit was SVN r27085.	2012-08-17 00:28:28 +00:00
Ralph Castain	b2cd2b1289	Allow developers to enable OMPI progress threads for debugging purposes. Warn and error out if ORTE progress threads are enabled, but they forgot to enable the libevent thread support. This commit was SVN r27071.	2012-08-16 17:50:52 +00:00
Ralph Castain	335c0eafcf	Add a filem test program and set ignores This commit was SVN r27069.	2012-08-16 17:46:46 +00:00
Jeff Squyres	96f640a762	Add new "opal_hotel" class. Abstractly speaking, this class does the following: * Provides a fixed number of resource slots (i.e., "hotel rooms"). * Allows one thing to occupy a resource slot at a time (i.e., each hotel room can have an occupant check in to that room). * Resource slots can be vacated at any time (i.e., occupants can voluntarily check out of their hotel room). * Resource slots can be occupied for a specific maximum amount of time. If that time expires, the occupant is forcibly evicted and the upper layer is notified via (libevent) callback (i.e., the maid will kick an occupant out of their room when their reservation is over). This class can be used for things like retransmission schemes for unreliable transports. For example, a message sent on an unreliable transport can be checked in to a hotel room. If an ACK for that message is received, the message can be checked out. But if the ACK is never received, the message will eventually be evicted from its room and the upper layer will be notified that the message failed to check out in time (i.e., that an ACK for that message was not received in time). Code using this class is currently being developed off-trunk, but will be coming to SVN soon. This commit was SVN r27067.	2012-08-16 17:29:55 +00:00
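The "hotel" abstraction described in the commit above (SVN r27067) can be sketched as follows. This is not the opal_hotel API: the `Hotel` class and its method names are hypothetical, and the sketch replaces the libevent timer callback with an explicit `tick(now)` call so that the behavior is deterministic.

```python
class Hotel:
    """Fixed number of rooms; occupants are evicted after a timeout."""

    def __init__(self, num_rooms, eviction_timeout, on_evict):
        # Each entry is either None (vacant) or (occupant, deadline).
        self._rooms = [None] * num_rooms
        self._timeout = eviction_timeout
        self._on_evict = on_evict

    def checkin(self, occupant, now):
        """Place occupant in a free room; return its number, or None if full."""
        for room, entry in enumerate(self._rooms):
            if entry is None:
                self._rooms[room] = (occupant, now + self._timeout)
                return room
        return None

    def checkout(self, room):
        """Vacate a room voluntarily; return the occupant (None if vacant)."""
        entry = self._rooms[room]
        self._rooms[room] = None
        return entry[0] if entry else None

    def tick(self, now):
        """Evict every occupant whose deadline has passed; notify the caller."""
        for room, entry in enumerate(self._rooms):
            if entry is not None and now >= entry[1]:
                self._rooms[room] = None
                self._on_evict(room, entry[0])


# Usage: retransmission bookkeeping for an unreliable transport.
timed_out = []
hotel = Hotel(num_rooms=2, eviction_timeout=5,
              on_evict=lambda room, msg: timed_out.append(msg))
room_a = hotel.checkin("msg-A", now=0)
room_b = hotel.checkin("msg-B", now=0)
full = hotel.checkin("msg-C", now=0)  # no free room
hotel.checkout(room_a)                # ACK for msg-A arrived
hotel.tick(now=5)                     # msg-B was never ACKed, so it is evicted
print(full, timed_out)                # None ['msg-B']
```

In this usage, the eviction callback is where an upper layer would decide whether to retransmit the message.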
Ralph Castain	e4d82b8912	Turn off the common port by default for now until we get rollup working properly on ALL platforms This commit was SVN r27060.	2012-08-15 22:13:04 +00:00
Ralph Castain	35fef87202	Make the "no virtual machine" selection more intuitive by providing a --novm option to mpirun. This commit was SVN r27048.	2012-08-15 14:55:03 +00:00
Ralph Castain	229e3f9b2a	This will break systems like orcm, but we aren't trying to support those any more - so put the nodes back in their daemon-indexed position. Will continue working to reduce search requirements in other parts of the code This commit was SVN r27038.	2012-08-14 22:26:40 +00:00
Ralph Castain	481ed4e292	Only one equal sign, if you please... This commit was SVN r27037.	2012-08-14 22:08:19 +00:00
Ralph Castain	8c890b1c46	Fix the alps configury so it doesn't attempt to build alps by default, even if --with-alps wasn't given. This commit was SVN r27036.	2012-08-14 22:04:39 +00:00
Ralph Castain	3cb8d55c8b	We can't just lookup the node in the node pool by daemon vpid as the daemons aren't stored that way - this was done because when holes exist in daemon vpids, we can generate huge orte_node_pool arrays even when only a few daemons actually exist. So we have to search for the vpid in the array This commit was SVN r27035.	2012-08-14 18:17:59 +00:00
Nathan Hjelm	d5824f7800	add missing test This commit was SVN r27028.	2012-08-14 03:14:07 +00:00
Nathan Hjelm	3d03d8f08b	fix typo in orte_check_alps.m4 This commit was SVN r27027.	2012-08-13 23:00:06 +00:00
Nathan Hjelm	8e03f77004	update alps configure scripts This commit was SVN r27026.	2012-08-13 22:57:55 +00:00
Ralph Castain	589acf550c	Improve the new MPI_INFO_ENV to better handle Java applications and to correctly report the info for singletons. This commit was SVN r27025.	2012-08-13 22:13:49 +00:00
Ralph Castain	3938ec5361	Remove debug This commit was SVN r27024.	2012-08-13 21:35:21 +00:00
Ralph Castain	49a757e0bd	Silly me - now that all daemons are stripping their prefix on the backend, we no longer need to do it as they report This commit was SVN r27023.	2012-08-13 20:48:13 +00:00
Ralph Castain	b9b41d8662	For cases where the alpha+non-zero prefix must be removed from a node name, be sure to do it everywhere we access node names - otherwise, modex methods such as pmi will fail to correctly identify procs on the same node This commit was SVN r27022.	2012-08-13 20:44:56 +00:00
Ralph Castain	c90b7380c1	Sigh - of course, they changed the name of the silly MPI_Info object in the final standard, but not in the proposal. So change to the new MPI_INFO_ENV name. Also, don't set unknown values to "N/A", but just leave them unset. This commit was SVN r27012.	2012-08-12 05:00:57 +00:00
Ralph Castain	cb48fd52d4	Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_Info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions: "num_app_ctx" - the number of app_contexts in the job; "first_rank" - the MPI rank of the first process in each app_context; "np" - the number of procs in each app_context. Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it. This commit was SVN r27005.	2012-08-12 01:28:23 +00:00
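A worked example may help with the three extra keys listed in the commit above (SVN r27005). The sketch below only illustrates how the values relate to each other. `info_env_summary` is a hypothetical helper, it assumes ranks are assigned contiguously in app_context order, and it returns Python values rather than the strings an MPI_Info object would hold.

```python
def info_env_summary(np_per_app):
    """Compute the three extra values, given the proc count per app_context."""
    first_rank = []
    next_rank = 0
    for count in np_per_app:
        first_rank.append(next_rank)
        next_rank += count
    return {
        "num_app_ctx": len(np_per_app),
        "first_rank": first_rank,
        "np": list(np_per_app),
    }


# e.g. a job launched as "mpirun -np 2 a.out : -np 3 b.out : -np 4 c.out"
print(info_env_summary([2, 3, 4]))
# {'num_app_ctx': 3, 'first_rank': [0, 2, 5], 'np': [2, 3, 4]}
```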
Ralph Castain	c4ee297a60	Cleanup the pmi grpcomm module so it passes non-btl modex data correctly. This commit was SVN r26992.	2012-08-10 20:35:50 +00:00
Ralph Castain	e3e9b7345d	First cut at updating the ccp launcher to use the state machine This commit was SVN r26986.	2012-08-10 17:09:33 +00:00
Jeff Squyres	3719b6c68b	After some further discussion between Jeff, Ralph, and Josh, revert r26951. The feeling is that fixing the actual problem of the command line parser not always identifying when invalid command line options were specified (i.e., r26953) was a better solution. This commit was SVN r26979. The following SVN revision numbers were found above: r26951 --> open-mpi/ompi@1f8df92c3c r26953 --> open-mpi/ompi@0b7b3feba9	2012-08-09 20:56:01 +00:00
Shiqing Fan	e304c19920	This is also used on Windows. This commit was SVN r26975.	2012-08-08 16:44:00 +00:00
George Bosilca	ba879c2c51	Remove the unused map. This commit was SVN r26960.	2012-08-07 12:06:13 +00:00
Shiqing Fan	2f442799f8	fix several typecasts This commit was SVN r26957.	2012-08-07 10:41:53 +00:00
Ralph Castain	1f8df92c3c	Remove the confusion over which options are "to" and which are "by" by creating synonyms so that either spelling works. This commit was SVN r26951.	2012-08-05 14:40:38 +00:00
Ralph Castain	53b1a1c976	Cleanly error out when someone asks to map-to <object> if that object doesn't exist on a node. This commit was SVN r26950.	2012-08-04 21:52:36 +00:00
Ralph Castain	61b09a132b	Fix bynode mapping of multiple app-contexts This commit was SVN r26949.	2012-08-03 21:45:40 +00:00
Ralph Castain	96f6f94c24	Ensure we don't get trapped in an infinite loop when ranking bynode if something isn't right This commit was SVN r26948.	2012-08-03 21:45:10 +00:00
Ralph Castain	0d878937fe	If a callback is set in the state machine, and the state doesn't yet exist, create it This commit was SVN r26947.	2012-08-03 21:43:36 +00:00
Ralph Castain	431d5361ed	For those who really preferred our prior mode of operation that mapped procs and only launched daemons on the nodes that had procs on them, introduce the "novm" state machine component. This recreates the old mode of operation by re-ordering the launch sequence so that we allocate, then map, and then launch daemons only on the required nodes (instead of across the entire allocation). This commit was SVN r26946.	2012-08-03 16:30:05 +00:00
Ralph Castain	dc22ea5cde	A little cleaner on the message about repeated ctrl-c, and re-enable the event so we can abort if we see multiple ctrl-c's that don't meet the time requirement This commit was SVN r26945.	2012-08-03 01:26:18 +00:00
Ralph Castain	e6c72bfd53	Ensure we can forcibly exit even when we are stuck inside of an event by replacing the libevent signal handler with a POSIX one that (a) attempts to trip a libevent termination event and (b) if another ctrl-c hits within 5 seconds, just calls exit. This commit was SVN r26943.	2012-08-02 21:15:35 +00:00
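The two-step ctrl-c handling described in the commit above (SVN r26943) comes down to a small decision rule, sketched here. This is not the mpirun signal handler: `InterruptPolicy` and the action names are hypothetical, and timestamps are passed in explicitly so that no real signals are involved.

```python
class InterruptPolicy:
    """Decide what to do on each ctrl-c, given its timestamp in seconds."""

    def __init__(self, window=5.0):
        self._window = window
        self._last = None

    def on_interrupt(self, now):
        """Return "exit" for a second ctrl-c inside the window, else "terminate"."""
        if self._last is not None and now - self._last <= self._window:
            return "exit"  # (b) a repeated ctrl-c in time: just exit
        self._last = now
        return "terminate"  # (a) try an orderly termination first


policy = InterruptPolicy()
actions = [policy.on_interrupt(t) for t in (0.0, 10.0, 12.0)]
print(actions)  # ['terminate', 'terminate', 'exit']
```

A second ctrl-c that arrives too late simply restarts the window, which matches the later commit (SVN r26945) that re-enables the event when the time requirement is not met.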
Ralph Castain	d818c9d407	Includes a patch from Jeff and Josh: update the simulator module to allow specification of multiple slot and max_slot counts for each node group (but don't require it). Remove the requirement that each node group provide its own topology. Adjust verbosities to allow showing some light debug output to see what nodes have been added without getting a bunch of other stuff. This commit was SVN r26936.	2012-08-02 04:57:13 +00:00
Jeff Squyres	62c2ff7ee7	It's actually ''not'' an error to exit if all routes and children are gone. So exit with 0, not ORTE_ERROR_DEFAULT_EXIT_CODE (which is 1). This fixes a race condition in the rsh launcher upon termination, where ORTE would sometimes think that a daemon failed to launch. This commit was SVN r26935.	2012-08-01 19:49:19 +00:00
Nathan Hjelm	4557e15c18	oob/ud fix compile error This commit was SVN r26933.	2012-07-31 21:50:34 +00:00
Ralph Castain	6ee35e4977	Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well This commit was SVN r26931.	2012-07-31 21:21:50 +00:00
Jeff Squyres	88cbe9c780	.ompi_ignore this component until it can be fixed. This commit was SVN r26930.	2012-07-31 21:02:06 +00:00
Nathan Hjelm	980692804d	oob/ud: don't start listening for ud requests unless we have one usable port This commit was SVN r26929.	2012-07-31 19:00:18 +00:00
Ralph Castain	23c2a315a9	Add missing line to set flag indicating at least one port found This commit was SVN r26914.	2012-07-30 17:54:38 +00:00
Ralph Castain	6285f7d8c0	Per request of Shiqing, restore the ccp components This commit was SVN r26904.	2012-07-29 23:49:59 +00:00
Ralph Castain	c7f9a0fa34	Check for recursive use of mpirun - issue error message and abort if detected This commit was SVN r26903.	2012-07-28 21:50:56 +00:00
Ralph Castain	94d11e04fd	Add an intermediate state when the VM is ready so that third party tools can take action prior to mapping/launching apps This commit was SVN r26902.	2012-07-28 15:33:09 +00:00
Shiqing Fan	660188307c	fix an export declaration name This commit was SVN r26895.	2012-07-27 13:26:24 +00:00
Shiqing Fan	42dfbc7d2f	Another CMake scripts update to: correctly generate the hwloc library; automatically define OMPI/OPAL/ORTE_OMPORTS for user applications; update the f77 bindings. This commit was SVN r26893.	2012-07-27 11:49:09 +00:00
Ralph Castain	8bc6694a62	Ensure the daemons don't incorrectly declare a failed launch This commit was SVN r26875.	2012-07-26 19:05:06 +00:00
Ralph Castain	07846f12ae	Reconnect the rsh/ssh error reporting code for remote spawns to report failure to launch. Ensure the HNP correctly reports non-zero exit status when ssh encounters a problem. Thanks to Terry for spotting it! This commit was SVN r26868.	2012-07-25 21:46:45 +00:00
Jeff Squyres	e5cfad0c1a	This variable is only used in FT builds. This commit was SVN r26854.	2012-07-24 12:48:47 +00:00
Shiqing Fan	8c4a3e1269	correct the symbol dllexports for windows build This commit was SVN r26827.	2012-07-22 08:54:50 +00:00
Shiqing Fan	12d99a9ebb	Update the hwloc build on Windows and related files. This commit was SVN r26818.	2012-07-20 12:14:28 +00:00
Abhishek Kulkarni	1ce378b5c6	Make C/R work with nodes > 1. This fix makes sure that the app coordinators send the "ready-to-checkpoint" signal to the global coordinator only after ORTE has initialized. This commit was SVN r26795.	2012-07-13 23:37:29 +00:00
Abhishek Kulkarni	1878f276cd	Replace the pattern while(flag) { opal_progress() }; in the C/R code with the ORTE_WAIT_FOR_COMPLETION macro. This commit was SVN r26794.	2012-07-13 23:31:56 +00:00
George Bosilca	772ec212eb	Fix another compiler warning. This commit was SVN r26775.	2012-07-10 15:57:42 +00:00
Abhishek Kulkarni	eec5a28aa4	More C/R fixes. * Fix a typo introduced by the removal of the notifier framework * Fix to flush the modex cached data correctly using the orte DB API. This commit was SVN r26773.	2012-07-10 01:19:46 +00:00
Abhishek Kulkarni	5c58a1c9c1	Fix C/R support in the trunk. Among other things, this patch deals with the following issues: * fix ompi-checkpoint argument parsing * ompi-restart -showme prints an extraneous "Restarted child with PID" message. Move around the debug statement to avoid this. * fixes for the state machine changes This commit was SVN r26770.	2012-07-09 23:34:13 +00:00
George Bosilca	ec760454a6	Cleaning ... This commit was SVN r26747.	2012-07-04 21:22:13 +00:00
Ralph Castain	cf4606cdd5	Add debug of nidmap subsystem This commit was SVN r26739.	2012-07-04 00:04:16 +00:00
Ralph Castain	6ae5776904	Cleanup IPV6 build This commit was SVN r26738.	2012-07-04 00:03:50 +00:00
Ralph Castain	1a90471374	Drat - missed the other one This commit was SVN r26718.	2012-07-02 22:18:31 +00:00
Ralph Castain	9a6a969f60	Remove debug This commit was SVN r26717.	2012-07-02 22:18:08 +00:00
Ralph Castain	b83fc41d54	Add a state that allows mpirun or other tools to be notified of a job completion prior to terminating so that alternative actions can be performed. This commit was SVN r26716.	2012-07-02 22:16:32 +00:00
Ralph Castain	e335de3564	Refactor ompi_info, splitting it into parts according to the layer involved. Thus, we call down to the opal layer to get those frameworks and components, and down to the orte layer to get those. Still some abstraction breaks, but they mostly involve renaming of OMPI_foo labels that have been around since before we split the build system by layer. This commit was SVN r26695.	2012-06-28 18:23:34 +00:00
Ralph Castain	8bebf2fa47	Ensure we don't build the MR iof components unless hadoop support is enabled This commit was SVN r26694.	2012-06-28 18:20:15 +00:00
Ralph Castain	9aa821d8b4	Add missing file to tarball This commit was SVN r26688.	2012-06-28 02:57:10 +00:00
Ralph Castain	0dfe29b1a6	Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required. Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework. Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one. This commit was SVN r26678.	2012-06-27 14:53:55 +00:00
Brian Barrett	b22faedd9d	Remove the Portals4 SHMEM reference implementation runtime support, as we're no longer using the runtime provided by the reference implementation. Remove the Catamount support from ORTE, since we're no longer supporting Catamount. Left the Catamount timer component, because I'm not sure whether it's used on the XTs running CNL. This commit was SVN r26677.	2012-06-27 14:17:43 +00:00
Josh Hursey	28681deffa	Backout the ORCA commit. :( There is a linking issue on Mac OSX that needs to be addressed before this is able to come back into the trunk. This commit was SVN r26676.	2012-06-27 01:28:28 +00:00
Josh Hursey	542330e3a7	Commit of ORCA: Open MPI Runtime Collaborative Abstraction This is a runtime interposition project that sits between the OMPI and ORTE layers in Open MPI. The project is described on the wiki: https://svn.open-mpi.org/trac/ompi/wiki/Runtime_Interposition And on this email thread: http://www.open-mpi.org/community/lists/devel/2012/06/11109.php This commit was SVN r26670.	2012-06-26 21:42:16 +00:00
Ralph Castain	a34f09e67a	Ensure common port is off when not being used This commit was SVN r26666.	2012-06-26 16:09:58 +00:00
Ralph Castain	92527da4e3	Remove unused component This commit was SVN r26660.	2012-06-26 00:49:28 +00:00
Ralph Castain	0103f82918	Turn off the common port for slurm for now This commit was SVN r26656.	2012-06-25 21:55:51 +00:00
Ralph Castain	b990c65a53	Remove another antiquated dss function - the 'size' API isn't used anywhere since the GPR went away This commit was SVN r26646.	2012-06-25 13:33:45 +00:00
Shiqing Fan	6f746cdb33	remove an unused file. This commit was SVN r26645.	2012-06-25 10:17:21 +00:00
Ralph Castain	abe7dd8274	Cleanup the dss by removing unused functions This commit was SVN r26644.	2012-06-23 21:20:09 +00:00
Ralph Castain	9680c52f5e	Add mrplus examples to tarball This commit was SVN r26643.	2012-06-23 02:40:46 +00:00
Jeff Squyres	148ae6d6e3	This commit unifies the configury of some verbs-lovin' components. * Add new configure command line options and deprecate some old ones: * --with-verbs replaces --with-openib * --with-verbs-libdir replaces --with-openib-libdir * If you specify --with-openib[-libdir] without --with-verbs[-libdir], you'll get a "these options have been deprecated!" warning, but then they'll act just like --with-verbs[-libdir]. '''Sidenote:''' Note that we are not renaming any components at this time, nor are we renaming the top-level OMPI_CHECK_OPENIB m4 macro (which is pretty strongly tied to the openib BTL and is bastardized by the ofud BTL). Note that there will likely be more changes in this area coming soon (next week?) when some long-standing changes move to the SVN trunk: some openib BTL infrastructure will move to ompi/mca/common, and its configury gets split up / refactored. We extend the philosophy of our other --with-<foo> configure options to --with-verbs, for ''all'' verbs-lovin' components: * If you specify --with-verbs, then all verbs-lovin' components must configure successfully (or abort). This currently means: OOB ud, BTL ofud, BTL openib. * If you specify --with-verbs=DIR, then all verbs-lovin' components must configure successfully (or abort), and will use DIR to find verbs headers and libraries. * If you specify --without-verbs, then all verbs-lovin' components will be ignored. This commit also fixes a problem where the --with-openib=DIR form would not use DIR for ''all'' verbs-lovin' components (I think only BTL openib and BTL ofud used that DIR). Now all of them do, as does hwloc (because hwloc has some OpenFabrics helper functions that require ibv types from verbs.h). There's a little new m4 infrastructure worth mentioning: * If you create a new verbs-lovin' component (i.e., a component that needs verbs), your configure.m4 should AC_REQUIRE([OPAL_CHECK_VERBS_DIR]).
* You can then use three global shell variables: $opal_want_verbs, $opal_verbs_dir, $opal_verbs_libdir, which will be set as follows: * opal_want_verbs will be "yes" and opal_verbs_dir and opal_verbs_libdir will both be set to directory values, '''OR''' * opal_want_verbs will be "no" and opal_verbs_dir and opal_verbs_libdir will both be set empty This commit was SVN r26640.	2012-06-22 19:53:56 +00:00
Ralph Castain	e6f3586415	Remove the orte notifier framework, per discussion at the devel meeting and follow-up with Jeff (who took the action item) This commit was SVN r26637.	2012-06-22 18:09:23 +00:00
Ralph Castain	60758faa55	Fix data type This commit was SVN r26633.	2012-06-21 23:48:55 +00:00
Ralph Castain	e9591f2563	Fix tree spawn in the rsh/qrsh environment This commit was SVN r26631.	2012-06-21 21:29:28 +00:00
Brian Barrett	9af72072a3	Use MKDIR_P instead of mkdir_p in Makefiles, as MKDIR_P is the only one defined in recent versions of AC/AM. This commit was SVN r26625.	2012-06-21 16:52:37 +00:00
Ralph Castain	019857b616	Ensure that we don't attempt to use common ports if --disable-static was specified. This commit was SVN r26620.	2012-06-20 03:14:11 +00:00
Ralph Castain	0a713cd27e	Add database framework to ORTE and refactor modex code to utilize it. Create the "hash" db component from the prior modex db code. Leave the other components ignored for now - will activate them later. Modex is still a blocking operation at this point. This commit was SVN r26618.	2012-06-19 13:38:42 +00:00
Ralph Castain	9e0bb6ae28	Revert r26600 and r26601 for a couple of reasons: 1. they modified the OMPI-ORTE interface, which is something I promised to avoid doing unless absolutely necessary, and 2. the framework ident is already in the component name key provided to the modex db. What is missing is the project ident, but as Jeff and I discussed last week, we really need to add that field to the component struct anyway to avoid multi-project collisions on framework names. That will be done over the next couple of weeks as a separate effort. This commit was SVN r26613. The following SVN revision numbers were found above: r26600 --> open-mpi/ompi@5ba4deff07 r26601 --> open-mpi/ompi@0e3094c318	2012-06-16 09:11:03 +00:00
Ralph Castain	9b026c6695	For now, run MTT with the use_common_port option enabled. This would be the desirable scenario for users, especially at scale, so let's see if it creates any issues. This commit was SVN r26609.	2012-06-15 15:46:38 +00:00
Ralph Castain	3c2a03b16d	Update the other routed components to use common ports. Per conversation with Josh, remove the "cm" component. This commit was SVN r26608.	2012-06-15 15:36:08 +00:00
Ralph Castain	96c778656a	Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP. Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build. This commit was SVN r26606.	2012-06-15 10:15:07 +00:00
Ralph Castain	0e3094c318	Update the other grpcomm modules to new API This commit was SVN r26601.	2012-06-14 03:28:48 +00:00
Ralph Castain	5ba4deff07	Extend the modex database to support multiple projects and frameworks that might have duplicate component names. No visible API change in the BTL's as it was executed solely in the ompi modex code. This commit was SVN r26600.	2012-06-14 02:55:06 +00:00
Ralph Castain	ecc51d8583	Add missing endif This commit was SVN r26596.	2012-06-12 15:07:09 +00:00
Ralph Castain	078a4667e4	Some more cleanup on direct routed when daemons are involved This commit was SVN r26594.	2012-06-11 23:46:22 +00:00
Ralph Castain	cee5a75d19	Revert the default configuration to no orte progress thread and no libevent thread support until we can get more of the kinks ironed out. This commit was SVN r26593.	2012-06-11 20:52:28 +00:00
Ralph Castain	9506ac1617	Remove debug This commit was SVN r26592.	2012-06-11 20:02:53 +00:00
Ralph Castain	269cb2b8d9	Some cleanup to remove calls to opal_progress when running with orte progress threads, and to ensure that all orte-related events are in the orte event base. This commit was SVN r26591.	2012-06-11 19:59:53 +00:00
Ralph Castain	75e66ad51e	Restore the direct routed component This commit was SVN r26590.	2012-06-11 17:16:02 +00:00
Brian Barrett	7406ef1241	Make all the PMI components depend on the common pmi library and properly install the common pmi library This commit was SVN r26588.	2012-06-11 15:58:09 +00:00
Ralph Castain	2812579246	Just because we find an IB device does not mean we can get a QP on it. Check to see if we can before we select the UD OOB module for use. This commit was SVN r26587.	2012-06-10 01:42:51 +00:00
Ralph Castain	0442a807c0	Default the OOB to the "ud" component IFF the HNP finds itself on a node with a supported Infiniband device. Ensure that the daemons all pick the matching component by dictating the selection via mca param on the orted cmd line. This commit was SVN r26582.	2012-06-08 01:23:08 +00:00
Ralph Castain	05122a2f93	Make debruijn the default routed component. Update the radix component to "short-circuit" the tree when the job size permits This commit was SVN r26580.	2012-06-08 00:35:36 +00:00
Ralph Castain	ffcca0185a	Remove no longer needed component This commit was SVN r26578.	2012-06-08 00:18:59 +00:00
Ralph Castain	980768965f	Remove unused and unsupported component This commit was SVN r26577.	2012-06-07 23:48:06 +00:00
Ralph Castain	350900f70e	Remove unused and unsupported component This commit was SVN r26576.	2012-06-07 23:47:35 +00:00
Nathan Hjelm	625c8078c3	oob/ud: fix typo This commit was SVN r26569.	2012-06-07 19:21:23 +00:00
Ralph Castain	7a94a52420	No reason not to build this This commit was SVN r26568.	2012-06-07 19:11:44 +00:00
Ralph Castain	5876496f4c	Enable orte progress threads and libevent thread support by default This commit was SVN r26565.	2012-06-07 04:25:00 +00:00
Shiqing Fan	2abf783fa0	Remove an unnecessary definition before the real one. This commit was SVN r26562.	2012-06-06 14:15:39 +00:00
Ralph Castain	166d254d4e	Add new routed component This commit was SVN r26557.	2012-06-06 11:53:12 +00:00
Ralph Castain	d6279fc971	Fix the debugger daemon launch support to fit the new state machine. Treat debugger daemons just like any other job, except that we map them only to nodes where an app process currently exists (as opposed to every node in the system). Trigger breakpoint and rank0 release only after the debugger daemons are in position. This commit was SVN r26556.	2012-06-06 02:01:23 +00:00
Jeff Squyres	0b8849e2c4	Make "mpirun --report-bindings" have a user-friendly output (i.e., readable by normal human beings, vs. having a bitmap of physical PU's). Use the new hwloc base prettyprint functions to generate the output. This commit was SVN r26533.	2012-06-01 16:35:31 +00:00
Jeff Squyres	99c5afb397	Remove clang compiler warnings. This commit was SVN r26523.	2012-05-29 23:36:06 +00:00
Ralph Castain	b0938a254e	Don't use mutex where it isn't needed This commit was SVN r26521.	2012-05-29 20:21:11 +00:00
Ralph Castain	32b66c166b	Missed one blasted spot This commit was SVN r26520.	2012-05-29 20:20:10 +00:00
Ralph Castain	9bedb25dda	Cleanup some compiler warnings, some of which are actual logic errors This commit was SVN r26519.	2012-05-29 20:11:51 +00:00
Ralph Castain	d7ac424d8d	Silence optimized build warnings This commit was SVN r26518.	2012-05-29 19:55:47 +00:00
Ralph Castain	bf5ec1ac0c	Silence optimized build warnings This commit was SVN r26517.	2012-05-29 19:55:31 +00:00
Shiqing Fan	08d553d7bf	Add a file to the installation list. This commit was SVN r26507.	2012-05-29 13:58:23 +00:00
Ralph Castain	9883f42caf	Add missing commit This commit was SVN r26501.	2012-05-28 02:20:20 +00:00
Ralph Castain	e705de1ce6	Complete nidmap cleanup - we don't know our node until we have unpacked all the jobs since our job is always the last one, so wait until all jobs are unpacked before assigning locality This commit was SVN r26500.	2012-05-27 18:37:57 +00:00
Ralph Castain	be6ed9c2df	Allow partial use of allocations by specifying the max number of daemons (i.e., max VM size) for the job This commit was SVN r26499.	2012-05-27 16:48:19 +00:00
Ralph Castain	c69a04e16b	Cleanup the pidmap decoding for apps to avoid confusion This commit was SVN r26498.	2012-05-27 16:21:38 +00:00
Ralph Castain	31beff6362	Oops - if we don't want the Java bindings, then we really shouldn't be building them :-/ Also ensure we don't try to build them if no Java support was found, and error out if the user requests the bindings and we didn't find Java support. Add a configure flag to skip the Java tests and just force-set the Java support to "disabled" This commit was SVN r26484.	2012-05-23 19:51:27 +00:00
Ralph Castain	7fb49b1559	Silence warning This commit was SVN r26480.	2012-05-23 13:59:41 +00:00
Ralph Castain	da28a4b0e6	Silence warning This commit was SVN r26479.	2012-05-23 13:59:22 +00:00
Jeff Squyres	7969faf372	Fixes trac:3057: minor update to the man page to state that slot locations in rankfiles use ''physical'' device indexes (vs. logical indexes). This commit was SVN r26478. The following Trac tickets were found above: Ticket 3057 --> https://svn.open-mpi.org/trac/ompi/ticket/3057	2012-05-23 11:43:33 +00:00
Nathan Hjelm	b9959a95cd	ack! one more This commit was SVN r26472.	2012-05-22 20:52:52 +00:00
Nathan Hjelm	f2d4e95429	doh! add missing include This commit was SVN r26471.	2012-05-22 20:49:13 +00:00
Nathan Hjelm	cdc3c87ba6	move pmi init/finalize into a common component This commit was SVN r26470.	2012-05-22 15:15:39 +00:00
Nathan Hjelm	78b8b3cf76	bug fix: actually close ess components This commit was SVN r26469.	2012-05-22 15:09:18 +00:00
Ralph Castain	b217124bd8	Symlink instead of copy This commit was SVN r26464.	2012-05-21 23:07:48 +00:00
Ralph Castain	da3873af6f	Rename the mapreduce tool to "mr+" per the marketing types This commit was SVN r26463.	2012-05-21 21:17:44 +00:00
Nathan Hjelm	6eeca66475	add an option to enable static ports. disabled by default This commit was SVN r26462.	2012-05-21 19:56:15 +00:00
Ralph Castain	83d69b6c95	Enable the ORTE progress thread for apps (not needed in the tools as they already continuously loop in the event lib). This appears to be working, at least for MPI apps that only use shared memory (a simple "hello"). More testing is required to identify where problems will occur - this is only intended to allow further development. In order to use the progress thread, you must configure with: --enable-orte-progress-threads --enable-event-thread-support This commit was SVN r26457.	2012-05-20 15:14:43 +00:00
Ralph Castain	c4f8043064	Per Nathan, with a little cleanup by me: update the PMI support to aggregate modex info, thus reducing the number of keys required so it fits within Cray default constraints This commit was SVN r26456.	2012-05-19 16:12:52 +00:00
Ralph Castain	a526afae92	Ensure we always cleanup local procs, no matter how we exited. This commit was SVN r26454.	2012-05-18 23:37:40 +00:00
Ralph Castain	12ebc0e269	Don't need this to be a bin program as the class is captured in the jar This commit was SVN r26453.	2012-05-18 23:37:18 +00:00
Ralph Castain	b16e43f489	Silence a warning on Mac This commit was SVN r26449.	2012-05-18 15:27:04 +00:00
Ralph Castain	ca1b325738	Tweak the java setup so it works better on Mac. Only build mapreduce and allocators if hadoop support was requested. This commit was SVN r26448.	2012-05-18 01:02:01 +00:00
Jeff Squyres	cab31eafce	Revert r26413: it was causing too much confusion. When an MPI proc exits with status 77, the whole job will be killed, but mpirun will still return an exit status of 77, so MTT will report it as a skip anyway. This commit was SVN r26445. The following SVN revision numbers were found above: r26413 --> open-mpi/ompi@02aa36f2e5	2012-05-16 14:45:58 +00:00
Jeff Squyres	dab7d36a81	Fix location of the default hostfile. Thanks to Götz Waschk for identifying the problem. This commit was SVN r26441.	2012-05-15 16:13:39 +00:00
Jeff Squyres	2d78728d38	Fix the macro name in the comment: it's EXTRA_DIST, not EXTRA_SOURCES. This commit was SVN r26429.	2012-05-10 14:07:36 +00:00
Jeff Squyres	b325c17c72	It's a little weird to put in a blank _SOURCES line for the HDFSFileFinder PROGRAM, but if we don't put in a _SOURCES line at all, Automake will default to "HDFSFileFinder_class_SOURCES = HDFSFileFinder.c", which clearly will cause problems. But we don't want to put the .java file in _SOURCES, either, because we haven't configured Automake to handle Java (because current versions of Automake only have GCJ, not other Java compilers). So set HDFSFileFinder_class_SOURCES to blank and list the .java file in EXTRA_SOURCES (so that they get picked up for "make dist"). This commit was SVN r26424.	2012-05-10 13:54:51 +00:00
Ralph Castain	b9d560263f	Ensure we properly handle systems that do not have a jdk installed This commit was SVN r26421.	2012-05-10 12:06:59 +00:00
Ralph Castain	b143633593	Fix java config This commit was SVN r26420.	2012-05-10 01:51:02 +00:00
Ralph Castain	640f0610aa	Fix the makefile to install the perl scripts properly This commit was SVN r26416.	2012-05-09 14:06:02 +00:00
Ralph Castain	fd796cce0a	Add an allocator tool for finding HDFS file locations and obtaining allocations for those nodes (supports both Hadoop 1 and 2). Split the Java support into two parts: detection of Java support and request for Java bindings. This commit was SVN r26414.	2012-05-09 01:13:49 +00:00
Jeff Squyres	02aa36f2e5	ORTE defaults to killing the entire job when any process exits with a nonzero status (we polled other MPI implementations since no one in the OMPI community had a concrete opinion on what behavior to do here -- all other MPI's seem to adhere to this behavior, too). This commit adds an MCA parameter that allows us to tell ORTE to ''not'' kill jobs when a process exits with a status of 77, meaning the GNU testing standard of "this test was skipped". In all the OMPI tests, all procs will either return 77 or not. So if they all return 77, mpirun won't consider it an error, but will still return an exit status of 77 (so that MTT can know that the test was cleanly skipped). This commit was SVN r26413.	2012-05-08 21:49:05 +00:00
Ralph Castain	84d031d6c1	Add daemon object to job array after creation This commit was SVN r26406.	2012-05-08 13:39:20 +00:00
Ralph Castain	70a106fa71	Fix binding on remote nodes - need to pass the binding bitmap! This commit was SVN r26403.	2012-05-08 03:52:39 +00:00
Jeff Squyres	2ba10c37fe	Per RFC, bring in the following changes: * Remove paffinity, maffinity, and carto frameworks -- they've been wholly replaced by hwloc. * Move ompi_mpi_init() affinity-setting/checking code down to ORTE. * Update sm, smcuda, wv, and openib components to no longer use carto. Instead, use hwloc data. There are still optimizations possible in the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old carto-based code found out how many NUMA nodes were ''available'' -- not how many were used ''in this job''. The new hwloc-using code computes the same value -- it was not updated to calculate how many NUMA nodes are used ''by this job.'' * Note that I cannot compile the smcuda and wv BTLs -- I ''think'' they're right, but they need to be verified by their owners. * The openib component now does a bunch of stuff to figure out where "near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors (I do not have a NUMA machine with an OpenFabrics device that is a non-uniform distance from multiple different NUMA nodes). * Completely rewrite the OMPI_Affinity_str() routine from the "affinity" mpiext extension. This extension now understands hyperthreads; the output format of it has changed a bit to reflect this new information. * Bunches of minor changes around the code base to update names/types from maffinity/paffinity-based names to hwloc-based names. * Add some helper functions into the hwloc base, mainly having to do with the fact that we have the hwloc data reporting ''all'' topology information, but sometimes you really only want the (online \| available) data. This commit was SVN r26391.	2012-05-07 14:52:54 +00:00
Ralph Castain	44b8608f0a	Convert debug to verbose This commit was SVN r26384.	2012-05-05 17:46:10 +00:00
Ralph Castain	96bfeb591c	Ensure flag is passed to remote daemons This commit was SVN r26383.	2012-05-03 22:31:25 +00:00
Ralph Castain	45fee2b491	Resolve the case where only the HNP is in the system (i.e., single-node operation) This commit was SVN r26382.	2012-05-03 18:00:01 +00:00
Ralph Castain	c352ca36c2	Minor cleanup This commit was SVN r26381.	2012-05-02 21:23:37 +00:00
Ralph Castain	b2f77bf08f	Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications. Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well. This commit was SVN r26380.	2012-05-02 21:00:22 +00:00
Ralph Castain	40c2fc5f55	Update the tests, add a couple This commit was SVN r26379.	2012-05-02 19:00:05 +00:00
Ralph Castain	c5da4f24d7	Fix stupid singletons - get the pidmap message correct This commit was SVN r26378.	2012-05-02 17:48:02 +00:00
Ralph Castain	8f7bf3344a	Update test This commit was SVN r26370.	2012-05-01 18:38:44 +00:00
Ralph Castain	4542070cf2	Add event priority inversion test This commit was SVN r26369.	2012-05-01 16:42:22 +00:00
Ralph Castain	a8db2fc95f	Add procs to each node's map on the daemons This commit was SVN r26368.	2012-05-01 16:41:35 +00:00
Ralph Castain	a927318ea1	Add -N option as synonym for "npernode" This commit was SVN r26367.	2012-05-01 16:18:14 +00:00
Ralph Castain	9f724db182	Remove duplicate event assignment This commit was SVN r26360.	2012-04-30 16:06:20 +00:00
Ralph Castain	289f9f41ec	From long-term discussions, have the daemons use the node_t and proc_t structs and arrays instead of the pidmap and nidmap arrays. Sets the stage for future work. This commit was SVN r26359.	2012-04-29 00:10:01 +00:00
Ralph Castain	47a5e30095	Ensure debug output levels if we are debugging This commit was SVN r26358.	2012-04-29 00:03:28 +00:00
Ralph Castain	f3e3704c9e	Per request from Brian, enable mapping of stddiag output (output from opal_output calls) to stderr of the local process. This allows you to obtain that output in a local window (for example, when using xterm for each process) instead of having it automatically forwarded to mpirun. Turn this on automatically whenever someone uses the -xterm option, and to be set manually using the orte_map_stddiag_to_stderr mca param. This commit was SVN r26352.	2012-04-27 14:39:34 +00:00
Jeff Squyres	46f47e08b6	Remove typo/extra brackets and parens. This commit was SVN r26351.	2012-04-27 13:48:43 +00:00
Jeff Squyres	9d0df5a9a6	Update configury in the new oob ud component: actually check to see if it succeeds and run $1 or $2, accordingly. This allows "make dist" to run properly on machines that do not have OpenFabrics stuff installed (e.g., the nightly tarball build machine). There's still more to be done here -- it doesn't check for non-uniform directories where the OpenFabrics headers/libraries might be installed. We might need to re-tool/combine ompi/config/ompi_check_openib.m4 (which checks for way more than oob/ud needs) and move it up to config/ompi_check_ofa.m4, or something...? This commit was SVN r26350.	2012-04-27 11:32:56 +00:00
Jeff Squyres	9829d2279f	System-level includes should be at the top of the file, before most OPAL/ORTE/OMPI includes. This commit was SVN r26349.	2012-04-27 11:29:22 +00:00
Ralph Castain	38af7db183	Ensure the progress message comes out right away. Otherwise, on a large system where proc state messages are arriving frequently, the message doesn't get printed until the launch is done! This commit was SVN r26346.	2012-04-26 23:41:03 +00:00
Nathan Hjelm	e1e0d466e5	Merge ssh://ct-fe1/usr/projects/hpctools/hjelmn/ompi-trunk-git into HEAD This commit was SVN r26344.	2012-04-26 22:06:12 +00:00
Ralph Castain	3461809341	Fix reporting of launch progress so the numbers are correct and appear when they should This commit was SVN r26342.	2012-04-26 00:10:09 +00:00
Ralph Castain	3b5b185c86	Don't double free timer events This commit was SVN r26341.	2012-04-25 17:36:12 +00:00
Jeff Squyres	501a86afe1	No need to include the generated files in the tarball. Thanks to Eugene for pointing this out. This commit was SVN r26339.	2012-04-25 14:19:18 +00:00
Ralph Castain	71805bf7e4	Clear out the startup_timeout event if the job did in fact start. Have ORTE_TERMINATE use the job state macro so debug will show where it was called This commit was SVN r26334.	2012-04-25 01:05:17 +00:00
Jeff Squyres	708b497968	Ensure to unset the iof "active" flag after the libevent read callback fires (it's already reset once we queue up the read event again). Failure to unset the active flag would cause other logic to not queue up the read event again, because it thought the read event was still active. This commit was SVN r26311.	2012-04-23 15:58:12 +00:00
Ralph Castain	7999266f99	Silence warning by removing unused var This commit was SVN r26275.	2012-04-17 22:34:48 +00:00
Ralph Castain	f68487016c	Add test code from Terry. Properly terminate if we don't abort on non-zero exit This commit was SVN r26271.	2012-04-16 16:44:23 +00:00
Ralph Castain	ddfbde587f	Change the default to "abort" the job when any process exits with a non-zero status. Add the required code to ensure the orted tells the HNP about the problem. This commit was SVN r26270.	2012-04-13 21:19:46 +00:00
Ralph Castain	7741ba47be	Fix comm_spawn that spans multiple nodes This commit was SVN r26268.	2012-04-13 01:59:07 +00:00
Ralph Castain	4d16790836	Fix collectives for jobs running across partial allocations This commit was SVN r26267.	2012-04-13 00:38:47 +00:00
Ralph Castain	5d14fa7546	Fix mpi_abort, minimize error output. This commit was SVN r26266.	2012-04-11 14:37:08 +00:00
Ralph Castain	d3dfba3872	Fix the scenario where an MPI error handler causes a proc to exit after finalize, but with non-zero status to indicate an error occurred. This commit was SVN r26265.	2012-04-11 02:23:46 +00:00
Ralph Castain	9cd4c06488	Get things to build and run when --disable-orte is specified This commit was SVN r26263.	2012-04-10 21:50:01 +00:00
Ralph Castain	14d5525fb1	Some minor cleanups. Get singletons working. Cleanup abort handling so it gets properly identified. This commit was SVN r26261.	2012-04-10 19:08:54 +00:00
Ralph Castain	53bbcf4b5b	Plug slot allocation leak This commit was SVN r26260.	2012-04-10 14:56:24 +00:00
Ralph Castain	f5cd996b91	Fix the case where n=1 This commit was SVN r26258.	2012-04-09 22:44:56 +00:00
Ralph Castain	a34be856aa	Now that we have PMI support, this is no longer needed This commit was SVN r26254.	2012-04-07 13:36:24 +00:00
Ralph Castain	71f9e69c62	Remove stale code This commit was SVN r26253.	2012-04-07 13:34:12 +00:00
Ralph Castain	19630ca28d	Remove stale code This commit was SVN r26252.	2012-04-07 13:33:40 +00:00
Ralph Castain	93bbeabc55	Remove stale code This commit was SVN r26251.	2012-04-07 13:33:30 +00:00
Ralph Castain	b6cde9a8d1	Remove stale code This commit was SVN r26250.	2012-04-07 13:33:18 +00:00
Ralph Castain	48de3a2501	Minor fixes so orte_progress_thread can work This commit was SVN r26248.	2012-04-06 15:50:49 +00:00
George Bosilca	319f76d66a	Low hanging fruit. Remove a declared but not defined function. This commit was SVN r26245.	2012-04-06 15:43:28 +00:00
Ralph Castain	ed197acaa2	Eliminate stale code This commit was SVN r26244.	2012-04-06 15:31:13 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Josh Hursey	1941f6b3b1	Cleanup some compiler warnings when doing an optimized/non-debug build. This commit was SVN r26236.	2012-04-04 20:40:16 +00:00
Ralph Castain	ca3ff58c76	Ensure we get a non-zero exit status when we can't find the specified fork agent. Output a better error message, and ensure we don't multiply report the problem. This commit was SVN r26191.	2012-03-24 00:49:38 +00:00
Ralph Castain	6dc44dc4b8	Look at the basename of the appname for the "java" keyword This commit was SVN r26190.	2012-03-24 00:38:18 +00:00
Ralph Castain	46b040c79f	Fix typo This commit was SVN r26189.	2012-03-24 00:31:05 +00:00

4000 commits