openmpi

Author	SHA1	Message	Date
Nathan Hjelm	9dec101043	fix totalview launch through --debug This commit was SVN r25654.	2011-12-15 15:19:13 +00:00
Ralph Castain	e683b2f9c7	Minor touchup - reset the pointer to the end of the list each time to ensure we get the nodes in correct daemon order This commit was SVN r25651.	2011-12-14 22:16:52 +00:00
Ralph Castain	912abe8a6c	Catch one more use-case This commit was SVN r25649.	2011-12-14 21:03:19 +00:00
Ralph Castain	f531b09a8d	Correctly handle -host and -hostfile options. Ensure the initial vm launch constrains itself to the union of specified hosts if those options are given. Get oversubscribe set correctly for that case. This commit was SVN r25648.	2011-12-14 20:01:15 +00:00
George Bosilca	ac26f58bd7	I guess this wasn't yet ready for prime time. This commit was SVN r25624.	2011-12-12 23:55:11 +00:00
Nathan Hjelm	885d5cbcf8	enable ptmalloc with using uGNI This commit was SVN r25621.	2011-12-12 20:52:51 +00:00
Nathan Hjelm	be11acf727	bug fix. don't add node to allocated_nodes twice This commit was SVN r25619.	2011-12-12 19:14:41 +00:00
Ralph Castain	3f1ae5d89b	No longer need this include This commit was SVN r25606.	2011-12-09 00:40:07 +00:00
Ralph Castain	44094cd5b3	Remove compiler warning This commit was SVN r25601.	2011-12-08 16:35:41 +00:00
Samuel Gutierrez	0a922dcb3e	fixes XE6 build. This commit was SVN r25600.	2011-12-08 16:13:58 +00:00
Samuel Gutierrez	0588e9ba36	add Cray XK6 support to ras alps. the configuration file is a different format and is in a different place. This commit was SVN r25599.	2011-12-08 14:05:02 +00:00
Ralph Castain	7180ad40ad	Fix a couple of minor buglets This commit was SVN r25589.	2011-12-07 21:08:35 +00:00
Ralph Castain	3e7ab1212a	Since this has come up a number of times, have the rsh launcher add MCA params from the environment by default. If it finds that the cmd line is too long, error out with a message directing the user to set a param to ignore the environmental MCA params. This commit was SVN r25581.	2011-12-07 01:24:36 +00:00
Ralph Castain	7510339725	Remove stale orte_vm_launch param. Add a param that allows users to specify envars to forward/set so they can do it in the MCA param file instead of only via mpirun cmd line. This commit was SVN r25580.	2011-12-06 21:31:22 +00:00
Ralph Castain	15facc4ba6	Fix comm_spawn yet again...add another test This commit was SVN r25579.	2011-12-06 20:15:40 +00:00
Ralph Castain	90b7f2a7bf	The rest of the multi app_context fix. Remove the restriction on number of app_contexts that can have zero np specified as multiple mappers now support that use-case. Update the ranking algorithms to respect and track bookmarks. Ensure we properly set the oversubscribed flag on a per-node basis. This commit was SVN r25578.	2011-12-06 17:28:29 +00:00
Ralph Castain	d9c7764e9b	Remove some debug This commit was SVN r25575.	2011-12-05 22:04:50 +00:00
Ralph Castain	df2f594aa8	Some cleanup associated with multiple app_contexts. Ensure nodes only get entered once into the map. Correctly handle bookmarks. Cleanup tracking of slots_inuse and correct detection of oversubscription. Still need to resolve the ranking issue so it starts at the bookmark, but that will come next. This commit was SVN r25574.	2011-12-05 22:01:08 +00:00
Abhishek Kulkarni	0b7c51fae2	Correct an invalid reference to a missing help file. This commit was SVN r25573.	2011-12-05 21:29:07 +00:00
Josh Hursey	b5ac320826	* If not able to checkpoint at this time (say because we are already checkpointing or restarting) then make sure to re-set the listener so that we can checkpoint later. * Work around duplicate node names in the map. It should not happen normally, but if the rmaps component gets this wrong provide a work around. Ralph is working on a rmaps fix for this, so we will likely remove/comment out the fix later. This commit was SVN r25572.	2011-12-05 19:29:26 +00:00
Josh Hursey	cc57840b53	Fix ess/tool so that it does not segv when using the rsh PLM. Just have it use the base function directly to avoid similar problems with finalizing other components. This commit was SVN r25571.	2011-12-05 15:40:46 +00:00
Ralph Castain	6fefe236a4	Warn users if they set opal_paffinity_alone, either to true or false, that this parameter is no longer functional - they must use the --bind-to option and its corresponding mca param. This commit was SVN r25567.	2011-12-03 01:10:52 +00:00
Ralph Castain	6cbd8fa6c9	Keep everyone in sync with new job state This commit was SVN r25563.	2011-12-02 14:12:40 +00:00
Ralph Castain	07655e2945	Handle the case where the allocator "fibs" to us about the node names. In some cases (ahem...you know who you are!), the allocator will tell us a node number (e.g., "16"). However, the daemon will return a node name (e.g., "nid0016") - leaving us not recognizing its location. So provide a new parameter (can't have too many!) that handles this situation by stripping the prefix from the returned node name. Also do a little cleanup to ensure we cleanly exit from errors, without generating too many annoying messages. This commit was SVN r25562.	2011-12-02 14:10:08 +00:00
Jeff Squyres	ecf6ba910c	Silence a few icc warnings about mixing enums with other types. This commit was SVN r25560.	2011-12-02 13:18:54 +00:00
Ralph Castain	641e17f26c	A better way of handling fqdn allocations. Prior method was wrong as it equated "node1" with "node10", which definitely caused problems. Detect the addition of fqdn nodes in the allocation. If not found, then strip all incoming hostnames from daemons of any domain info when matching those names against the names in the node pool. Leave some protection and "live" diagnostic output in place so we can continue to detect problems across all environments. This commit was SVN r25557.	2011-12-01 14:24:43 +00:00
Ralph Castain	512aea79bc	Print the right nodename value, fix the strange case This commit was SVN r25556.	2011-12-01 02:31:56 +00:00
Ralph Castain	44394c6b34	Add a little more protection This commit was SVN r25555.	2011-12-01 00:30:56 +00:00
Ralph Castain	c4ea7a252a	Add a little protection against badly formed node names so we don't segfault if they are encountered This commit was SVN r25554.	2011-11-30 23:33:59 +00:00
Ralph Castain	fa9e99454a	Don't divide by cpus-per-task - we'll deal with that at binding time. This commit was SVN r25552.	2011-11-30 21:35:25 +00:00
Ralph Castain	c56acf60ca	Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order. Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use. Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact. Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code. This commit was SVN r25551.	2011-11-30 19:58:24 +00:00
George Bosilca	25476c7e54	buffer is not yet initialized, so there is no reason to release it. This commit was SVN r25549.	2011-11-29 23:50:18 +00:00
Jeff Squyres	6fbbfd0f7a	Gah! r25545 accidentally included ''waaaay'' more stuff than it was supposed to. I.e., half-baked/not complete stuff. This commit backs out all of r25545. Sorry folks! This commit was SVN r25546. The following SVN revision numbers were found above: r25545 --> open-mpi/ompi@7f9ae11faf	2011-11-29 23:24:52 +00:00
Jeff Squyres	7f9ae11faf	Per http://www.open-mpi.org/community/lists/users/2011/11/17862.php , to make MPI_IN_PLACE (and other sentinel Fortran constants) work on OS X, we need to use the following compiler (linker) flag: -Wl,-commons,use_dylibs So if we're compiling on OS X, test to see if that flag works with the compiler. If so, add it to the wrapper FFLAGS and FCFLAGS (note that per a future update, we'll only have one Fortran compiler anyway). Fixes trac:1982. This commit was SVN r25545. The following Trac tickets were found above: Ticket 1982 --> https://svn.open-mpi.org/trac/ompi/ticket/1982	2011-11-29 23:05:54 +00:00
George Bosilca	7a238933b6	Silence a compiler warning. This commit was SVN r25543.	2011-11-29 20:53:08 +00:00
Terry Dontje	b1bb339d23	fix r25507 rationalization of rsh support by removing include of plm_base_rsh_support.h from tm module This commit was SVN r25519. The following SVN revision numbers were found above: r25507 --> open-mpi/ompi@b475421c16	2011-11-29 11:49:41 +00:00
Ralph Castain	0d55a3d739	Missed one spot... This commit was SVN r25517.	2011-11-28 22:30:53 +00:00
Ralph Castain	237c79b6d7	Fix daemon collectives - missed the one spot where returning orte_routed_tree_t was required. Sigh. Change the routed components to return that type on the list of children when get_routing_tree is called. This commit was SVN r25516.	2011-11-28 22:24:49 +00:00
Ralph Castain	89e5bd27a2	Fix copyright date This commit was SVN r25512.	2011-11-28 15:54:04 +00:00
Ralph Castain	70ab8422b1	Per the internal comments, the delay between ssh invocations is not there for debugging purposes, but rather to allow for NIS authentication times. We have seen that problem in the past, so don't just do the delay when we are debugging - use the delay for the intended purpose. Also, allow for shorter than second-level delays as it doesn't always have to be so long. This commit was SVN r25510.	2011-11-27 01:49:42 +00:00
Ralph Castain	b173316b74	Don't induce a delay between spawns unless specifically asked to do so This commit was SVN r25509.	2011-11-26 16:50:31 +00:00
Ralph Castain	b475421c16	As promised, rationalize the rsh support. Remove rshbase and the base rsh support, centralizing all rsh support into the rsh component. Remove the "slave" launch support as that experiment is complete. Fix tree spawn and make that the default method for rsh launch, turning it "off" for qrsh as that system does not support tree spawn. This commit was SVN r25507.	2011-11-26 02:33:05 +00:00
Ralph Castain	9b59d8de6f	This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build. Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations. Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way. This commit was SVN r25497.	2011-11-22 21:24:35 +00:00
George Bosilca	1000af1c48	No need to abort there, returning an error triggers the abort at the upper level. This commit was SVN r25494.	2011-11-18 19:07:26 +00:00
Ralph Castain	866edf6a89	Now that George has found his problem, we no longer need the bozo check. Interesting how these platform-specific issues surface... This commit was SVN r25493.	2011-11-18 17:43:14 +00:00
George Bosilca	b613c7eacb	Fix the issue with the round robin mapper. When mixing different precisions, one should manually promote the participants to the expected type. In this particular example, as opal_list_get_size returns an unsigned long, the computation on the left side is translated to an unsigned. If the hostfile contains more nodes than what is required (via the -np), this leads to a gigantic value for the balance, and breaks the round robin algorithm. This commit was SVN r25492.	2011-11-18 17:03:35 +00:00
Ralph Castain	1e5e9bde77	Add protection against a bozo case where we could end up in an infinite loop while calculating ranks This commit was SVN r25491.	2011-11-18 15:35:55 +00:00
George Bosilca	88d32312d6	The bind_level should be initialized to zero or weird things happen. I'm not yet sure how and why, but packing a uint8_t with opal_dss leads to weird values during unpack (except if the original value is already set to zero). This commit was SVN r25490.	2011-11-18 10:22:58 +00:00
George Bosilca	61f273b987	Do not tolerate uninitialized variables. This commit was SVN r25489.	2011-11-18 10:19:24 +00:00
Ralph Castain	b34acd0476	Grrr....get the correct number too! This commit was SVN r25478.	2011-11-15 11:11:47 +00:00
Ralph Castain	593fc388a9	Make it a little more obvious as to which nodes are from each topology by labeling them with a letter. This commit was SVN r25477.	2011-11-15 11:10:39 +00:00
Ralph Castain	6310361532	At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here: https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation. In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions: 1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior. 2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation. 3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so. As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes. This commit was SVN r25476.	2011-11-15 03:40:11 +00:00
Ralph Castain	c8e105bd8c	Remove stale code This commit was SVN r25475.	2011-11-14 23:39:23 +00:00
Ralph Castain	793f4c688f	Extend capability to support heterogeneous clusters with multiple topologies This commit was SVN r25474.	2011-11-13 23:23:09 +00:00
Ralph Castain	6b5e1b89cf	Turn off tree spawn as it doesn't currently work - will fix shortly. Add topology collection This commit was SVN r25472.	2011-11-11 23:42:36 +00:00
Ralph Castain	d008aeb531	Silence debug This commit was SVN r25471.	2011-11-11 16:42:45 +00:00
George Bosilca	3d318a4c26	Put the interface of our MPIR support in sync with the document accepted by the MPI Forum (http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf). This commit was SVN r25456.	2011-11-08 01:24:16 +00:00
George Bosilca	85a18dab74	MPIR_partial_attach_ok is not a volatile, but a constant. This commit was SVN r25455.	2011-11-08 01:00:38 +00:00
Ralph Castain	a3ce355a60	Revert r25453 and r25450 until we can fix the libevent2013 configure code - still not getting the includedir to eval correctly. This commit was SVN r25454. The following SVN revision numbers were found above: r25450 --> open-mpi/ompi@7f7d5c4f1f r25453 --> open-mpi/ompi@c9fe8c32e2	2011-11-07 16:23:44 +00:00
Samuel Gutierrez	3ea59cce96	minor cleanup to getenv_pmi.c. This commit was SVN r25449.	2011-11-07 03:18:07 +00:00
Samuel Gutierrez	e03bc93fb7	only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code. This commit was SVN r25447.	2011-11-06 17:28:40 +00:00
Ralph Castain	34f0a27cb6	Initialize the locality info - at time of pmap creation, we at least know node locality This commit was SVN r25446.	2011-11-06 17:06:41 +00:00
Ralph Castain	729935dffb	Minor cleanups, mirroring what Jeff did to ompi_info This commit was SVN r25438.	2011-11-05 00:42:49 +00:00
Ralph Castain	fcee46b063	Add an option for printing a diffable process map for testing mappers This commit was SVN r25428.	2011-11-03 14:22:07 +00:00
Samuel Gutierrez	3fe7b3ee54	add PMI support to ess alps module. xt system guys: please yell at me if i missed something in cnos. This commit was SVN r25423.	2011-11-03 04:04:32 +00:00
Samuel Gutierrez	27b9bcfafd	update ess alps configuration file to include CNOS and PMI checks. some of the features committed here aren't being used, but they will be. also update orte_check_pmi.m4 to include missing call to action-if-not-found if --with-pmi is not specified or is disabled. This commit was SVN r25422.	2011-11-03 02:14:47 +00:00
Jeff Squyres	7f6f7bd0eb	Remove this component; twitter long ago switched to the oauth authentication, and no one has ever updated this component to match. It can be revived out of history if anyone cares. This commit was SVN r25421.	2011-11-02 21:04:49 +00:00
Ralph Castain	891027c10d	Cleanup error reports This commit was SVN r25420.	2011-11-02 18:34:19 +00:00
Ralph Castain	b2e2d24726	As in the rsh module, report failed daemons to the errmgr for proper cleanup This commit was SVN r25419.	2011-11-02 18:30:22 +00:00
Ralph Castain	3e4165fd8d	Cleanup includes This commit was SVN r25418.	2011-11-02 18:28:28 +00:00
Ralph Castain	b77552c45d	Cleanup some include files, return a silent error in open/select as the complaining component already output a message This commit was SVN r25416.	2011-11-02 17:42:06 +00:00
Ralph Castain	198e001554	Add another test This commit was SVN r25415.	2011-11-02 15:59:16 +00:00
Ralph Castain	55b996678e	Minor indentation changes This commit was SVN r25414.	2011-11-02 15:56:56 +00:00
Ralph Castain	f00753881e	Handle the case where mpirun -is- of the same topology as the compute nodes. This commit was SVN r25412.	2011-11-01 22:26:03 +00:00
Ralph Castain	d28dd55d33	Minimize the amount of topology info returned by the daemons. Most clusters, especially at scale, use the same node topology on every node, so there is no reason to return the topology from every daemon. Borrow a page from the --hetero-apps page and let users indicate that the node topology differs by adding a --hetero-nodes option to mpirun. If the option is set, then every daemon returns topology info. If not set, then only daemon vpid=1 returns it. We always want one daemon to return the topology as the head node is often different from the compute nodes. Having one daemon return the compute node topology allows us to detect any such difference. All compute nodes are then set to the same topology. This commit was SVN r25408.	2011-11-01 18:43:10 +00:00
Ralph Castain	14966e0f8f	Cleanup PMI startup - if a component isn't selected, it should finalize PMI IFF it started it. Otherwise, components that aren't selected can finalize PMI when it is in use by other parts of the system. This commit was SVN r25407.	2011-11-01 16:25:12 +00:00
Ralph Castain	71ed8e3cd3	Bring back the local node's binding capabilities along with its topology. Clean up indentation. This commit was SVN r25399.	2011-10-30 13:20:16 +00:00
Ralph Castain	d492b20975	Bozo check for topology info This commit was SVN r25398.	2011-10-30 11:49:38 +00:00
Ralph Castain	4232115a98	Ensure pruning remains within the current job/app being mapped. This commit was SVN r25397.	2011-10-30 00:02:20 +00:00
Ralph Castain	648c85b41b	Add a simple pattern mapper as an example of how to use the topology info to create desired mappings. Let the user specify a pattern based on resource types, and map that pattern across all available nodes as resources permit. Don't automatically display the topology for each node when --display-devel-map is set as it can overwhelm the reader. Use a separate flag --display-topo to get it. This commit was SVN r25396.	2011-10-29 15:12:45 +00:00
Ralph Castain	12a589130a	Add some debug This commit was SVN r25395.	2011-10-29 15:07:58 +00:00
Ralph Castain	965b04d1a5	Use the new utilities to get a topology that reflects available cpus This commit was SVN r25394.	2011-10-29 15:07:36 +00:00
Ralph Castain	e50bcbf028	Add the ability to specify a topology-containing xml file to describe the simulated nodes to support mapping tests against arbitrary topologies This commit was SVN r25388.	2011-10-29 02:01:11 +00:00
Ralph Castain	7fa5f82d70	Add simulator component to support testing of large scale mapping methods. Automatically sets do-not-resolve and do-not-launch, and creates however many nodes the user wants to simulate in the system. This commit was SVN r25386.	2011-10-28 23:48:53 +00:00
Ralph Castain	e2eb8d5f78	Remove bad param registration - that param was already registered as an int_name in another location. This commit was SVN r25381.	2011-10-28 19:14:43 +00:00
Josh Hursey	6726590b1c	Remove the 'ess_node_rank' accessor from here. This caused running under 'tm' to segv at the orteds. It just looks like this part of the component was not updated during r25331. It was removed from the 'env' and 'slurm' environments in that patch. It looks like 'tm' was updated, but did not get this particular piece. This commit was SVN r25380. The following SVN revision numbers were found above: r25331 --> open-mpi/ompi@b44f8d4b28	2011-10-28 17:41:35 +00:00
Josh Hursey	59ff1dbbfb	Fix indentation problem that caused a segv when running without regex. This was introduced in r25063. This commit was SVN r25379. The following SVN revision numbers were found above: r25063 --> open-mpi/ompi@e58623cd5b	2011-10-28 13:39:32 +00:00
Samuel Gutierrez	922e41a318	fix typo. use PMI_Initialized for init status instead of PMI_Init. This commit was SVN r25377.	2011-10-27 22:27:30 +00:00
Ralph Castain	951d72692c	Reverse the #if direction so we report daemon failure to the errmgr - otherwise, we just hang if a daemon fails to start. Reviewed with Josh. This commit was SVN r25366.	2011-10-25 19:09:52 +00:00
Ralph Castain	c55cba55a7	Totally trivial spelling fix This commit was SVN r25361.	2011-10-24 14:06:33 +00:00
Ralph Castain	955d8e7d46	Allow apps to use pmi when launched by mpirun, if desired, without affecting daemons This commit was SVN r25359.	2011-10-23 15:57:13 +00:00
Nathan Hjelm	e8af0d8589	don't use alps paffinity This commit was SVN r25358.	2011-10-21 22:52:03 +00:00
Abhishek Kulkarni	46952e9008	Fix C/R functionality in trunk. Intra-node checkpointing of a job now works as expected. Signed-off-by: Abhishek Kulkarni <adkulkar@osl.iu.edu> This commit was SVN r25357.	2011-10-21 22:07:35 +00:00
Nathan Hjelm	7b1172b346	need a terminating character in the decoded string This commit was SVN r25355.	2011-10-21 16:46:28 +00:00
Nathan Hjelm	cd257ac707	fixed typo in pmi grpcomm This commit was SVN r25353.	2011-10-21 16:28:36 +00:00
Shiqing Fan	5711414eb7	Fix Windows build This commit was SVN r25351.	2011-10-21 14:46:58 +00:00
Ralph Castain	53ef085567	Fix a minor issue seen by Jeff in specific failure pathway This commit was SVN r25350.	2011-10-21 14:44:48 +00:00
Ralph Castain	3e72fccacf	Cray's PMI implementation is quite different from slurm's - they extended PMI-1 by adding some, but not all, of the PMI-2 APIs. So you can't just switch to using PMI-2 functions as it isn't a complete implementation. Instead, you have to selectively figure out which ones they have in PMI-2, and use any missing ones from PMI-1. What fun. Modify the configure logic and the PMI components to accommodate Cray's approach. Refactor the PMI error reporting code so it resides in only one place. Cray actually decided -not- to define the PMI-2 error codes, so we have to use the PMI-1 codes instead. More fun. This commit was SVN r25348.	2011-10-21 04:54:38 +00:00
Nathan Hjelm	beb8d8ce32	pmi return code wtf This commit was SVN r25336.	2011-10-20 17:51:24 +00:00
Ralph Castain	84713d5a84	Fix singletons again - must have been broken for a very long time, which only shows how little anyone cares about this capability. This commit was SVN r25332.	2011-10-19 20:19:08 +00:00
Ralph Castain	b44f8d4b28	Complete implementation of the ess.proc_get_locality API. Up to this point, the API was only capable of telling if the specified proc was sharing a node with you. However, the returned value was capable of telling you much more detailed info - e.g., if the proc shares a socket, a cache, or numa node. We just didn't have the data to provide that detail. Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times. Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun. This commit was SVN r25331.	2011-10-19 20:18:14 +00:00
Ralph Castain	2958f3de34	Add some clarifying comments and a small efficiency improvement This commit was SVN r25322.	2011-10-18 18:30:43 +00:00
Ralph Castain	b771114086	Fix the fix :-) If the errmgr is going to try and hold the orted until all routes and children are gone, then the exit cmd must do the same. Otherwise, the orted exits immediately without waiting for routes to be dismantled, which is why we don't see the connections close. Also cleanup some diagnostics and add some debug to more clearly see what's going on. This commit was SVN r25321.	2011-10-18 17:56:37 +00:00
Ralph Castain	ae8e556d14	Okay, once again let's fix the vpid calculator. Identified problem with prior commit (some rmaps components already place their procs in the jdata->procs array, and others don't), so account for those variations. This commit was SVN r25315.	2011-10-18 15:50:11 +00:00
George Bosilca	749b63c09d	Provide a generic fix for the termination issue instead of r25248. The termination condition is to be checked at the daemon/HNP level not down in the routing. This commit was SVN r25313. The following SVN revision numbers were found above: r25248 --> open-mpi/ompi@b42ccc89b8	2011-10-18 03:07:37 +00:00
George Bosilca	f28890fbb7	Revert r25302 as it breaks the --bynode option. This commit was SVN r25311. The following SVN revision numbers were found above: r25302 --> open-mpi/ompi@d7a8553179	2011-10-18 02:48:17 +00:00
Ralph Castain	2fdd9c6dea	Ensure mpirun doesn't pick this component This commit was SVN r25307.	2011-10-17 22:28:28 +00:00
Ralph Castain	8f0ef54130	Complete implementation of pmi support. Ensure we support both mpirun and direct launch within same configuration to avoid requiring separate builds. Add support for generic pmi, not just under slurm. Add publish/subscribe support, although slurm's pmi implementation will just return an error as it hasn't been done yet. This commit was SVN r25303.	2011-10-17 20:51:22 +00:00
Ralph Castain	d7a8553179	Fix the mapping algo for computing vpids - it was borked for bynode operations when using nperxxx directives This commit was SVN r25302.	2011-10-17 19:49:04 +00:00
Ralph Castain	f1a5a26ba0	Minor cleanups This commit was SVN r25289.	2011-10-14 18:46:03 +00:00
Ralph Castain	89a20de474	Remove unused includes. Ensure that the error log is at least always available as we otherwise segfault when reporting errors that occur prior to opening the errmgr framework This commit was SVN r25288.	2011-10-14 18:45:11 +00:00
Ralph Castain	07dbbc6513	Sorry for mid-day correction - but folks are trying to test this, and we didn't realize it was still ignored :-( This commit was SVN r25287.	2011-10-14 16:19:20 +00:00
Ralph Castain	7bb294f917	Fix debug flags - thanks Terry! This commit was SVN r25286.	2011-10-14 16:10:21 +00:00
Ralph Castain	054c485dcf	Cleanup a race condition and an unreliable method that caused us to not properly handle procs that trapped sigterm for cleanup purposes while ORTE was trying to kill them. Thanks to Rick Payne and Ian Wells of Cisco for spending weeks chasing this down. Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-( This commit was SVN r25285.	2011-10-14 15:39:54 +00:00
Ralph Castain	08fa9e1c6a	Correct include path This commit was SVN r25282.	2011-10-13 23:46:52 +00:00
Ralph Castain	b96ef2161d	Complete the PMI support. Generalize PMI operations to support both slurm and non-slurm environments. Correct some configuration issues - we really only want the PMI integration at the individual component level. Ensure that the pmi grpcomm component doesn't get selected when launching via mpirun by setting its priority below the bad component. Only verified in a slurm environment as that's all I have access to... This commit was SVN r25275.	2011-10-12 20:59:25 +00:00
Ralph Castain	634f83fc52	Fix the routed components. All had errors, some completely broken. You cannot test 0 == ORTE_EPOCH_CMP(target->epoch,ORTE_EPOCH_INVALID) when epoch is not configured as this will always return true. This caused get_route to return an error in all non-binomial routed modules, and caused all components to return an error when delete_route was called. So protect the checks with ORTE_ENABLE_EPOCH so we get the correct behavior. This commit was SVN r25274.	2011-10-12 20:18:57 +00:00
Ralph Castain	24a46f2acb	These were missed by prior commit - need to remove lingering references to OPAL_HWLOC_HAVE_XML This commit was SVN r25272.	2011-10-12 16:54:03 +00:00
George Bosilca	872d377021	Tell what the update status is. This commit was SVN r25259.	2011-10-11 19:49:12 +00:00
Brian Barrett	98e98ce2c5	* opal_atomic_trylock is documented to return 0 if the lock was acquired, 1 otherwise. It was doing the opposite, so this patch fixes the return values. All uses (all in ORTE) used the actual return values, not the documented values, so fix them as well. This commit was SVN r25257.	2011-10-11 18:43:45 +00:00
Ralph Castain	2f38ff5e54	Ensure we don't try to build this module unless pmi is specifically requested This commit was SVN r25252.	2011-10-11 06:12:04 +00:00
Ralph Castain	baefdabd98	Add some debug. Now confirmed to work correctly (prior problem was with odin tcp connection, not code). This commit was SVN r25249.	2011-10-11 02:15:17 +00:00
Ralph Castain	b42ccc89b8	Although this didn't solve the earlier termination problem, the code will be required once we get connection terminations properly detected. If a daemon (or HNP) is trying to terminate, then we need to check for termination conditions whenever a route is lost - when all child connections are gone, then we are free to finalize. This commit was SVN r25248.	2011-10-10 21:41:49 +00:00
Ralph Castain	1aa1c2e9b4	Get the slurm pmi support working. Cannot use infiniband, of course, as the oob can't make the connection - may try other existing methods. Modex may not quite be working right yet as odin was having trouble making TCP connections, but at least the configure now works so things build, so save that for now This commit was SVN r25247.	2011-10-10 21:39:10 +00:00
Swen Boehm	08b4322a1a	patched the lex files to not issue the following compiler warning: 'yyunput' defined but not used This commit was SVN r25246.	2011-10-10 18:13:04 +00:00
Ralph Castain	f1a3a35fcd	Cannot rely on detection of connection terminations for deciding when to exit as they don't always go away immediately. There is no info coming back anyway, so it's okay to just exit once the relay has been sent. The relay is sent via a blocking API, so just go ahead and quit. This commit was SVN r25245.	2011-10-10 16:38:46 +00:00
George Bosilca	649af6c925	Enumerated mixed with another type (int) is tolerated but easily fixable. This commit was SVN r25241.	2011-10-09 03:54:52 +00:00
Terry Dontje	c6691b4122	clean up local procs when abort or abort signal happens This commit was SVN r25237.	2011-10-06 19:19:55 +00:00
Nathan Hjelm	79b14fc3b1	removed licensing warning This commit was SVN r25235.	2011-10-05 20:31:27 +00:00
Nathan Hjelm	34afb5a0fa	first cut at general pmi check This commit was SVN r25234.	2011-10-05 17:14:24 +00:00
George Bosilca	80c02647c8	Each level (OPAL/ORTE/OMPI) should only return its own constants, instead of the current mismatch. This commit was SVN r25230.	2011-10-04 14:50:31 +00:00
George Bosilca	c6d6c9aece	Remove some #if by using the correct macro (aka. ORTE_EPOCH_CMP). This commit was SVN r25229.	2011-10-04 14:42:40 +00:00
Samuel Gutierrez	25cbf79592	modifications to ras alps. this commit allows users to mpirun without having to set id environment variables (BASIL_RESERVATION_ID, OMPI_ALPS_RESID). note, however, that we preserved the old behavior. if an id environment variable is set, it will be obeyed and our new code path is essentially bypassed. if we missed something, please yell at us. with this commit, the use of ras-alps-command.sh is no longer needed... at least that is our hope. This commit was SVN r25181.	2011-09-26 21:31:08 +00:00
Ralph Castain	8347385630	Fix the radix routed component. This commit was SVN r25175.	2011-09-22 09:32:53 +00:00
Jeff Squyres	ecd603256a	* Rename opal_hwloc_components to opal_hwloc_base_components * Fix some comments This commit was SVN r25150.	2011-09-17 11:54:36 +00:00
Ralph Castain	1cd7b02df3	Add a set of default errmgr components that support solely the default "everything dies on error" behavior. Set their priority to be selected by default, but provide params to adjust those priorities to allow other component selection. This commit was SVN r25139.	2011-09-13 22:03:45 +00:00
Ralph Castain	3c4f04f4d9	Ensure opal_hwloc_topology is NULL after being destroyed This commit was SVN r25138.	2011-09-13 19:21:10 +00:00
Nathan Hjelm	079ccdf8b1	fix debugger co-location launching This commit was SVN r25136.	2011-09-13 15:08:03 +00:00
Ralph Castain	ca7638553f	Remove stale code This commit was SVN r25133.	2011-09-12 23:00:41 +00:00
Ralph Castain	556a05566e	Silence warning This commit was SVN r25130.	2011-09-12 16:21:51 +00:00
Shiqing Fan	0aea775837	Set the compiler flags in a better way. This commit was SVN r25125.	2011-09-12 08:24:27 +00:00
Ralph Castain	92c7372e20	Per the RFC from Jeff, move hwloc from opal/mca/common to its own static framework ala libevent. Have ORTE daemons collect the topology info at startup and, if --enable-hwloc-xml is set, send that info back to the HNP for later use. The HNP only retains unique topology "templates" to reduce memory footprint. Have the daemon include the local topology info in the nidmap buffer sent to each app so the apps don't all hammer the local system to discover it for themselves. Remove the sysinfo framework as hwloc replaces that functionality. This commit was SVN r25124.	2011-09-11 19:02:24 +00:00
Ralph Castain	2091e39bee	Record the file descriptor on the read event when building optimized This commit was SVN r25123.	2011-09-11 18:57:14 +00:00
Rainer Keller	9d5afc58c6	- Fix breakage of the epoch changes with PGI: Don't just include pre-processor macros between two strings ("s1" #if 0 ... "s2")... Rather print out the epoch as 0 always... This commit was SVN r25110.	2011-08-31 08:40:31 +00:00
Wesley Bland	f8740e5478	Correct a typo reported by Pasha. This commit was SVN r25109.	2011-08-30 18:44:52 +00:00
Ralph Castain	03ddf8520b	Resolve not-used warnings This commit was SVN r25101.	2011-08-27 14:27:15 +00:00
Ralph Castain	56ebfa23cc	ORTE configure options belong in orte/config, not opal. This commit was SVN r25100.	2011-08-27 14:23:49 +00:00
Wesley Bland	f542ecd578	Fix a couple of problems with the resil code not compiling. This commit was SVN r25099.	2011-08-27 03:21:00 +00:00
George Bosilca	a4245b8d63	Remove some warnings related to the resilience patch. This commit was SVN r25097.	2011-08-27 00:15:34 +00:00
George Bosilca	67ec5a0556	While doing cleanup remove pending warnings. This commit was SVN r25095.	2011-08-26 23:38:06 +00:00
George Bosilca	fc1184c41f	Remove a warning introduced by the epoch commit. This commit was SVN r25094.	2011-08-26 23:36:52 +00:00
Wesley Bland	4e7ff0bd5e	By popular demand the epoch code is now disabled by default. To enable the epochs and the resilient orte code, use the configure flag: --enable-resilient-orte This will define both: ORTE_ENABLE_EPOCH ORTE_RESIL_ORTE This commit was SVN r25093.	2011-08-26 22:16:14 +00:00
Ralph Castain	1c08a4006c	Refactor some code to remove a few API handles from errmgr. Reviewed/tested by Wes. This commit was SVN r25064.	2011-08-18 16:24:45 +00:00
Ralph Castain	e58623cd5b	Bring alps back to full operations by correctly computing daemon names. Unfortunately, alps doesn't assign cnos rank in node-based order - i.e., cnos rank=0 isn't necessarily on the first node of the execution. So adjust when using static ports. Add some debug to nidmap Ensure that the HNP's node name is not included in the regex when launching via rshbase as that node is automatically included in the daemon map. This commit was SVN r25063.	2011-08-18 14:59:18 +00:00
Wesley Bland	a2a20c3766	I believe this should fix the race condition that Terry is seeing in the MTT tests. It appears that nothing in the errmgr was using the mutexes to protect the odls child list. This commit was SVN r25062.	2011-08-18 14:52:30 +00:00
Shiqing Fan	6d0ab9bd6c	One library was missing for linking orterun on Windows. This commit was SVN r25057.	2011-08-18 09:33:41 +00:00
Ralph Castain	23f47295a8	Add even more debug This commit was SVN r25053.	2011-08-16 16:41:33 +00:00
Ralph Castain	d624d43f69	Add more debug This commit was SVN r25052.	2011-08-16 15:47:37 +00:00
Ralph Castain	3d96497581	Add debug This commit was SVN r25050.	2011-08-16 12:22:05 +00:00
Shiqing Fan	7292ee2387	One .windows file is missing in the tarball. This commit was SVN r25049.	2011-08-15 10:21:25 +00:00
Shiqing Fan	3af7c9f7bb	Complete the MinGW build support on Windows. This commit was SVN r25048.	2011-08-15 09:47:23 +00:00
Shiqing Fan	627f1dd351	Correct several export declarations. This commit was SVN r25047.	2011-08-15 09:45:51 +00:00
Ralph Castain	ca3d29a1e6	Extend regex support to a bigger audience This commit was SVN r25046.	2011-08-12 21:02:48 +00:00
Ralph Castain	ea4e2c2db4	Unused variables This commit was SVN r25045.	2011-08-12 21:02:09 +00:00
Jeff Squyres	1cbfb53801	r24976 wasn't quite right -- you now actually get a warning if you specify btl_tcp_if_include because btl_tcp_if_exclude is defaulted to the loopback devices. This commit does a few things: * Introduce a new OPAL MCA base function: mca_base_param_check_exclusive_string(). It checks to see that the ''user'' does not set two MCA parameters that are mutually exclusive by checking the source of those MCA param values. * Use the above function in many BTLs (and the OOB TCP) to ensure that <foo>_if_include and <foo>_if_exclude are not both specified ''by the user''. * Re-arrange many of these BTLs to move their MCA registration code into a separate component_register() function (vs. the component_open() function). This code has been nominally reviewed and checked by Ralph, George, Terry, and Shiqing. This commit was SVN r25043. The following SVN revision numbers were found above: r24976 --> open-mpi/ompi@8f4ac54336	2011-08-10 17:24:36 +00:00
Ralph Castain	b360c98afd	Per request from Pasha, revert r25004 - but modified a touch to reflect fact that opal_argv_append copies the provided string, so we don't need to print it and then free it. This commit was SVN r25037. The following SVN revision numbers were found above: r25004 --> open-mpi/ompi@2418831bea	2011-08-09 22:42:27 +00:00
Nathan Hjelm	aa3d302a05	use persistent rml_recv in iof This commit was SVN r25035.	2011-08-09 21:30:12 +00:00
Ralph Castain	f1951e7ccd	If we are abnormally terminating, then don't wait for orteds to report back. Send them a "halt_vm" command, which instructs them to kill their local procs and immediately terminate, doing their best to cleanup on the way out. Also do a little cleanup on debug output in rshbase. This commit was SVN r25033.	2011-08-09 17:42:19 +00:00
Wesley Bland	67feeb6aca	Move the errmgr code back. This shouldn't cause the svn problems that I apparently caused last time. Sorry about that. This one will just be a big changelog. This commit was SVN r25016.	2011-08-08 16:01:08 +00:00
Wesley Bland	09274cd047	Make sure that the epoch is initialized everywhere so we don't get weird output during valgrind. This shouldn't have caused any problems with any actual execution. Just extra warnings in valgrind. This commit was SVN r25015.	2011-08-08 15:11:55 +00:00
Ralph Castain	8014e3429e	Don't double-count procs as they are launched This commit was SVN r25011.	2011-08-08 06:05:23 +00:00
Ralph Castain	7b9f958dcf	Add some missing error strings. Update test to show silent errors This commit was SVN r25010.	2011-08-08 04:21:02 +00:00
Ralph Castain	4083dc617f	Fix computation of number of required files and file descriptors - it only depends on the total number of local procs, not on the number of procs in the entire job! This commit was SVN r25008.	2011-08-08 04:09:40 +00:00
Ralph Castain	590ac70e88	Add a simple test program for error string output This commit was SVN r25007.	2011-08-07 21:32:25 +00:00
Ralph Castain	8b3c562b84	Adjust verbosity levels to make it easier to debug at scale This commit was SVN r25006.	2011-08-07 21:14:21 +00:00
Ralph Castain	2418831bea	Pass the nodelist to the aprun command even when using all nodes This commit was SVN r25004.	2011-08-06 04:19:41 +00:00
Ralph Castain	bd8e43a2de	Correct debug output so it doesn't falsely report the module This commit was SVN r25003.	2011-08-05 20:30:34 +00:00
Ralph Castain	d603c79ab4	Fix the FAILED_TO_START scenario so orted doesn't segfault This commit was SVN r25002.	2011-08-05 20:29:50 +00:00
Ralph Castain	c86bfb4e90	Need to copy the string This commit was SVN r25001.	2011-08-05 19:03:28 +00:00
Ralph Castain	7b307d5bf0	Cleanup handling of all-numerical node names This commit was SVN r25000.	2011-08-05 14:59:14 +00:00
Ralph Castain	157bad5435	If we can't compress the name, that's fine - but still have to move to next posn This commit was SVN r24999.	2011-08-05 14:43:36 +00:00
Ralph Castain	3199663613	Correctly handle the case of mixes of character-based names and all-number names This commit was SVN r24998.	2011-08-05 14:37:36 +00:00
Ralph Castain	066022126e	Sort the nodes to be in numerically increasing order so the regex has a chance of working right. This commit was SVN r24993.	2011-08-05 03:37:13 +00:00
Ralph Castain	5a634caad9	Cleanly handle the case where the node "name" is just a number, and avoid the N-N output when the number is not part of a sequence. This commit was SVN r24992.	2011-08-05 03:36:30 +00:00
Jeff Squyres	294e1f50cd	Remove compiler warning about nested comment This commit was SVN r24984.	2011-08-03 18:30:56 +00:00
Wesley Bland	87a96da99c	Should fix some of the shutdown woes of the errmgr. Correctly checks that the orted's job is completed. Correctly tests to make sure that there is shutdown going on (doesn't rely on orte_orteds_term_ordered). Adds a patch from Ralph to correctly check the status of processes. This commit was SVN r24962.	2011-08-01 14:00:41 +00:00
Ralph Castain	42b125ef35	Move the debug so it more accurately reports This commit was SVN r24961.	2011-07-29 20:48:46 +00:00
Ralph Castain	70bca4691f	Add a new "sensor" module that supports fault tolerance tests - randomly kills local procs and/or the daemon itself This commit was SVN r24960.	2011-07-29 20:48:22 +00:00
Wesley Bland	5fde3e0e00	Move the resilient orte errmgr code into a separate errmgr for now while it's still unstable. Reverted errmgr modules back to the original errmgr (with the updates since the resilient code was brought into the trunk). This commit was SVN r24958.	2011-07-28 21:24:34 +00:00
Ralph Castain	6c879f87fb	Add a new param "orte_remote_tmpdir_base" for those situations where the compute nodes require a different session directory head than the head node. This commit was SVN r24956.	2011-07-27 19:37:17 +00:00
Ralph Castain	decab98fb2	Do a little better job of catching up on missed mcast messages, and provide a way out of scenarios where catch-up is impossible. This commit was SVN r24955.	2011-07-27 14:58:30 +00:00
Ralph Castain	c3bc33b3fb	Don't be so restrictive - accept "slots" as well as "slot" in rank file This commit was SVN r24954.	2011-07-27 00:45:30 +00:00
Wesley Bland	b972fd84e1	No longer sends extra FAILED_NOTIFICATION messages in the non-failure case. Should reduce finalize complexity and avoid a race condition that has been detected by a few users. This commit was SVN r24952.	2011-07-26 20:47:44 +00:00
Ralph Castain	715f871605	Ignore the daemon job when reporting parseable output This commit was SVN r24944.	2011-07-25 20:44:08 +00:00
Ralph Castain	db193555c2	Use non-blocking sends for recovering from lost multicast messages This commit was SVN r24943.	2011-07-25 18:49:47 +00:00
Ralph Castain	199804fc35	complete implementation of parseable output This commit was SVN r24929.	2011-07-23 22:23:24 +00:00
Ralph Castain	ffe6f5f40e	Fix map pack/unpack so they match This commit was SVN r24928.	2011-07-23 22:23:05 +00:00
Ralph Castain	00647fa342	Update orte-ps to add parseable output - not fully tested because I couldn't get other parts of the system to work. This commit was SVN r24927.	2011-07-23 20:20:31 +00:00
Ralph Castain	869024f1c6	You have to initialize the daemon param -before- using it to get epoch!! This commit was SVN r24926.	2011-07-23 20:19:43 +00:00
Ralph Castain	361bcef253	Close multicast before rml This commit was SVN r24925.	2011-07-23 20:19:15 +00:00
Shiqing Fan	cc4403a863	Remove two unused windows files. This commit was SVN r24913.	2011-07-21 12:53:32 +00:00
Brian Barrett	3bd66a5932	* Remove unused Portals3.3 reference implementation support This commit was SVN r24906.	2011-07-20 23:30:29 +00:00
Eugene Loh	921852e1e5	Clean up the computations of num_procs_alive. Do some code refactoring to improve readability and to compute num_procs_alive correctly and to remove the use of loop iteration variables for two loops nested one inside another (causing MPI_Comm_spawn_multiple to fail). This commit was SVN r24903.	2011-07-14 20:10:48 +00:00
Ralph Castain	8853e0e80a	Fix regular expression analyzer for slurmd - use a slurm-specific version Fix multi-node routing for daemon startup when static ports are not set This commit was SVN r24898.	2011-07-13 22:49:56 +00:00
Ralph Castain	8d1b31b887	Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly. Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list. This commit was SVN r24894.	2011-07-13 20:11:14 +00:00
Ralph Castain	1405bacd85	Ensure we don't segfault if we report an error This commit was SVN r24890.	2011-07-13 15:00:22 +00:00
Jeff Squyres	3893a5a1de	Fix compile error introduced in r24888. This commit was SVN r24889. The following SVN revision numbers were found above: r24888 --> open-mpi/ompi@e5253647ea	2011-07-13 14:18:00 +00:00
Shiqing Fan	e5253647ea	Fix a type cast. This commit was SVN r24888.	2011-07-13 09:00:17 +00:00
Ralph Castain	1ad110d2e9	After a nice, calm, rational discussion between Brian, Jeff, and myself, we decided to revert r24864 and r24862 to restore the reference counters in opal_init/finalize. The rationale was that we should instead change orte_init/finalize to also use reference counters to support multi-embedded libraries. Jeff and Brian will discuss proposing a similar change to mpi_init/finalize to the MPI Forum so that all three libraries will behave in similar manners. It was agreed that opal_init_util had wound up being used in unintended ways, which raised the problem of getting reference counts to work right. However, fixing it would involve more pain than it was worth - and so long as the other layers are made to behave similarly, I have no preference either way. Complete implementation will follow - for now, this just reverts the prior changes. This commit was SVN r24886. The following SVN revision numbers were found above: r24862 --> open-mpi/ompi@aa92e0c4eb r24864 --> open-mpi/ompi@a5062385c2	2011-07-12 17:07:41 +00:00
Nathan Hjelm	3f4e5d7dd6	add missing thread lock/unlock around condition_broadcast This commit was SVN r24885.	2011-07-12 15:43:56 +00:00
Nathan Hjelm	c3ec2e2614	fix a potential race condition in rml This commit was SVN r24884.	2011-07-12 15:43:12 +00:00
Abhishek Kulkarni	6bf02d1344	Fixes (after the new ORTE resiliency layer was merged) to make the trunk build with C/R flags turned on. This commit was SVN r24872.	2011-07-10 23:36:26 +00:00
Ralph Castain	a5062385c2	Fix singletons This commit was SVN r24864.	2011-07-08 14:38:33 +00:00
Jeff Squyres	3affb8403e	Remove extra output This commit was SVN r24863.	2011-07-08 13:01:30 +00:00
Ralph Castain	aa92e0c4eb	Replace a useless counter with a boolean check to see if we have already passed thru opal_finalize so we don't call finalize, and then don't pass thru it (as was happening on several tools) This commit was SVN r24862.	2011-07-08 06:43:19 +00:00
Ralph Castain	05f4926bfe	Remove some remaining cruft re regular expressions - caused the trunk to fail if regex wasn't being used This commit was SVN r24861.	2011-07-08 06:42:12 +00:00
Ralph Castain	1ee7c39982	Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long. Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again. Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines. This commit was SVN r24860.	2011-07-07 18:54:30 +00:00
Ralph Castain	6496b2f845	Ensure we terminate properly on non-zero exit status This commit was SVN r24859.	2011-07-07 14:33:49 +00:00
Wesley Bland	0628963506	Fix a return code when a process isn't found. This commit was SVN r24845.	2011-06-30 15:22:54 +00:00
Brian Barrett	e52fef28ca	Do something rational for the disable full support case This commit was SVN r24844.	2011-06-30 14:48:19 +00:00
Ralph Castain	8ac35a8496	Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate. This commit was SVN r24843.	2011-06-30 14:11:56 +00:00
Ralph Castain	c449871ade	Add an mca param to set the "fork agent" - i.e., a program to be run when forking off a process (e.g., valgrind). While you could specify this by "mpirun -n N fork_agent ./my_app", not everyone launches procs with ORTE from mpirun. Provide the ability to store recent stat histories using the ring_buffer class This commit was SVN r24842.	2011-06-30 03:12:38 +00:00
Ralph Castain	2e1fa3e08e	Don't error out if the recv.cancel comes back not found as this is just a race condition This commit was SVN r24841.	2011-06-30 01:19:50 +00:00
Ralph Castain	418229c71c	Define a new error constant This commit was SVN r24833.	2011-06-28 19:47:16 +00:00
Wesley Bland	84be81df95	Standardize the initialization of the EPOCH's. Everyone will be starting at MIN anyway (until we implement restart of course) so there's no reason to set the epoch to INVALID and then immediately reset them to MIN. This way there's less room to make mistakes later. This commit was SVN r24829.	2011-06-28 14:20:33 +00:00
Ralph Castain	c203eee223	Since process names now have three fields, be sure to initialize all three of them This commit was SVN r24828.	2011-06-27 20:50:08 +00:00
Ralph Castain	d316701d3c	Remove unnecessary mcast channel This commit was SVN r24816.	2011-06-23 20:44:22 +00:00
Wesley Bland	e1ba09ad51	Add resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Ralph Castain	391074cde6	Add a tag This commit was SVN r24813.	2011-06-23 15:12:25 +00:00
Ralph Castain	afceaaa8e4	Add support for detecting rapid failures so the errmgr can respond accordingly This commit was SVN r24796.	2011-06-21 16:08:41 +00:00
Samuel Gutierrez	81f38b258a	commit of new shared memory backing facility framework (shmem) and its components. This commit was SVN r24795.	2011-06-21 15:41:57 +00:00
Ralph Castain	9491fbb60c	Remove two stale modules This commit was SVN r24794.	2011-06-21 05:57:39 +00:00
Ralph Castain	b95ede99d5	Ensure we use the pmi grpcomm when using pmi This commit was SVN r24793.	2011-06-20 21:57:47 +00:00
Ralph Castain	92a65f21bf	Restore slurm pmi support from long, long ago. Since we already have the ability to directly srun an MPI job, just conditionally add the PMI support for key values and provide a grpcomm module that uses PMI for barriers and modex. Currently ompi_ignored, and unignored only for me (others to soon follow). This commit was SVN r24792.	2011-06-20 21:04:46 +00:00
Jeff Squyres	9531e205e1	Minor fix to a comment This commit was SVN r24789.	2011-06-20 17:51:01 +00:00
Ralph Castain	d563e58348	Remove another lingering bproc reference This commit was SVN r24785.	2011-06-17 18:01:23 +00:00
Ralph Castain	042ee3ec48	Support the option of outputting error_log messages with something other than the process name This commit was SVN r24784.	2011-06-17 14:50:00 +00:00
Ralph Castain	61dd7f4588	Need to pass identification of input/output channels This commit was SVN r24783.	2011-06-17 14:48:59 +00:00
Ralph Castain	93dcfc15d0	Let the upper layer set the channels to be opened This commit was SVN r24779.	2011-06-16 20:31:52 +00:00
Ralph Castain	7f2d2e3de7	Track the app_context rank - will equal overall rank for single app_context jobs This commit was SVN r24778.	2011-06-16 20:31:30 +00:00
Josh Hursey	0eb3b3b7b0	Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9335.php This commit was SVN r24775.	2011-06-15 13:10:13 +00:00
Ralph Castain	033cbbed31	Don't automatically assign group channels if not given - let the layer above figure it out. This commit was SVN r24771.	2011-06-10 16:28:18 +00:00
Ralph Castain	e039c7b7ea	Avoid crashing when debugging rmaps and a non-string resource constraint is given This commit was SVN r24770.	2011-06-10 16:27:30 +00:00
Josh Hursey	6539a31b23	Cleanup configure checks for C/R functionality. Add a WANT_FT_CR flag different from WANT_FT so tools like *-checkpoint are not built when a different FT technique is requested. Also fix the C/R thread check so that it is only enabled if C/R is enabled, not generally when threads are enabled. This commit was SVN r24769.	2011-06-09 19:45:29 +00:00
Josh Hursey	9080eaedf3	Fully initialize the orte_errmgr_base_component_t structure for the app. This commit was SVN r24765.	2011-06-09 14:22:25 +00:00
Josh Hursey	20339a7900	Minor coding style and indentation fixes. This commit was SVN r24764.	2011-06-09 14:16:06 +00:00
Ralph Castain	fc8d920c56	Don't monitor resource usage unless requested, even if sensors are enabled during configure This commit was SVN r24760.	2011-06-08 10:47:25 +00:00
Ralph Castain	906eb925f1	Update the resource usage sensor to initiate support for time analysis of measurements This commit was SVN r24759.	2011-06-07 23:22:51 +00:00
Ralph Castain	f3cae3d6f3	Cleanup the handling of if_include and if_exclude arguments based on CIDR notation. Fix a bug in the new code that prevented the system from correctly matching addresses. Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide. Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result. Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages. NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing. This commit was SVN r24755.	2011-06-07 02:09:11 +00:00
George Bosilca	6b52d8f519	The paffinity is apparently needed. This commit was SVN r24749.	2011-06-06 01:20:01 +00:00
Ralph Castain	bd8d9a943a	Add diagnostics This commit was SVN r24748.	2011-06-05 19:17:56 +00:00
Ralph Castain	1491d52bd7	Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3"). This commit was SVN r24747.	2011-06-05 19:16:42 +00:00
George Bosilca	454519842e	Report bindings if requested. This commit was SVN r24743.	2011-06-02 17:17:10 +00:00
George Bosilca	1eccadbd87	No need for the paffinity here. This commit was SVN r24742.	2011-06-02 17:16:25 +00:00
Ralph Castain	8f401a0563	Enable the ability to constrain applications to hosts on the basis of resources. This commit was SVN r24736.	2011-05-28 22:18:19 +00:00
Brian Barrett	beb1bc70b2	* Add support for using modex to exchange NID/PID pairs when using Portals4. Rather than try to support a bunch of lightweight environments like I did with the Portals3 code, always use the "modex" and hack the grpcomm for the SHMEM implementation to return the right nid/pid for a remote process by "magic". This commit was SVN r24733.	2011-05-25 22:10:27 +00:00
Ralph Castain	81b6c50daa	Correct stale typo This commit was SVN r24725.	2011-05-23 17:34:22 +00:00
Ralph Castain	661f508e62	Fix typo This commit was SVN r24723.	2011-05-22 00:20:42 +00:00
Ralph Castain	b2331113a5	Add some debug This commit was SVN r24722.	2011-05-21 21:09:47 +00:00
Ralph Castain	8c08ee9c3d	Remove stale tool This commit was SVN r24720.	2011-05-21 00:38:35 +00:00
Ralph Castain	1b5ca323c6	Always followup with sigkill when killing local procs as procs can trap sigterm and get stuck This commit was SVN r24719.	2011-05-20 22:40:10 +00:00
Ralph Castain	c5686ecfca	Don't sample stats for pid=0 children This commit was SVN r24717.	2011-05-20 14:33:23 +00:00
Ralph Castain	b03e4481a3	Plug a couple of additional places in case orte_iof is not opened This commit was SVN r24716.	2011-05-20 13:42:53 +00:00
Ralph Castain	dc0bb0571b	Record the number of heartbeats recvd each period for diag purposes This commit was SVN r24714.	2011-05-20 00:21:33 +00:00
Ralph Castain	69dce0ec10	Minor heartbeat cleanups This commit was SVN r24713.	2011-05-19 21:27:44 +00:00
Ralph Castain	c3df95dd13	Prevent failure due to race condition during abnormal term This commit was SVN r24712.	2011-05-19 21:27:05 +00:00
Ralph Castain	b0f47e6f59	Allow orte_iof to not be opened This commit was SVN r24711.	2011-05-19 21:26:30 +00:00
Ralph Castain	1f3911cc8b	Add a new proc state This commit was SVN r24710.	2011-05-19 21:25:58 +00:00
Ralph Castain	b47ec2ee87	Remove lingering references to opal_profile option This commit was SVN r24709.	2011-05-18 18:27:29 +00:00
Ralph Castain	9678e62613	Fix possible corruption of environ. Thanks to Ariel Burton and Peter Thompson for finding it! This commit was SVN r24708.	2011-05-18 16:25:35 +00:00
Ralph Castain	d34bab541d	Remove the ompi-profiler tool and its attendant ompi-probe program. Also remove the grpcomm basic component since its only function was to support profiled clusters, which nobody was doing. :-( This commit was SVN r24704.	2011-05-17 03:30:25 +00:00
Ralph Castain	a3e43594a4	Extend node stats to include additional memory info. Change "darwin" pstat module to "test" as we don't really know how to get all the stat info for darwin. Add a new OPAL_ERROR_LOG macro similar to the ORTE_ERROR_LOG one. This commit was SVN r24692.	2011-05-08 14:45:16 +00:00
Ralph Castain	c160f5d5a2	Add ability to specify mcast interfaces by name This commit was SVN r24691.	2011-05-08 14:42:48 +00:00
Thomas Herault	fb3fd8fd0e	items belonging to peer_send_queue are mca_oob_tcp_msg_t *, which are obtained through an opal_freelist. They shouldn't be released, but returned to the freelist. This commit was SVN r24679.	2011-05-03 21:03:09 +00:00
Ralph Castain	9df207aa51	Fix the case where a user supplies the -xterm option, which requires that we leave ssh sessions attached. This commit was SVN r24668.	2011-05-02 12:39:55 +00:00
Ralph Castain	138928fcf4	Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces. This commit was SVN r24666.	2011-04-29 18:46:40 +00:00
Shiqing Fan	9e90ade864	Missed one file from the last commit. This commit was SVN r24664.	2011-04-29 14:44:02 +00:00
Shiqing Fan	4490fdbd34	Add the initial support for MinGW and MSYS. Correctly check the dependencies of MSYS env. Set up configure include and lib path for building the package. update a few more CMake scripts. This commit was SVN r24663.	2011-04-29 14:42:07 +00:00
Ralph Castain	c78531ce8a	Don't free the envar that gets putenv'd as that messes up the environ This commit was SVN r24660.	2011-04-29 08:50:29 +00:00
Ralph Castain	0ff0d20e72	Grr...get the prefix right - need to strip the bin out of absolute path to mpirun. This commit was SVN r24658.	2011-04-28 22:20:55 +00:00
Ralph Castain	6af2677fb8	Check for both absolute-path-to-mpirun and -prefix being specified. If the two differ, print out a warning and ignore -prefix. If they are the same, or only one was given, then proceed as directed. This commit was SVN r24657.	2011-04-28 22:12:41 +00:00
Ralph Castain	b586f2952e	Arggg...revert r24645. I knew those fields were there for a reason...sigh. This commit was SVN r24647. The following SVN revision numbers were found above: r24645 --> open-mpi/ompi@e4732110da	2011-04-28 15:07:00 +00:00
Ralph Castain	859aaab93d	In the case of direct-launched processes running under slurm, psm requires that the pre_condition_transports MCA param be set. This is normally computed by mpirun and inserted into each proc's environ, but that doesn't work here. So separate out the printing of that key, and let the individual procs generate it in a way that ensures they all get the same result. This commit was SVN r24646.	2011-04-28 13:54:33 +00:00
Ralph Castain	e4732110da	Remove a couple more stale fields This commit was SVN r24645.	2011-04-28 00:26:38 +00:00
Ralph Castain	39369f8807	Remove stale fields from global objects - have been moved to the layer that actually uses them This commit was SVN r24644.	2011-04-28 00:20:49 +00:00
Ralph Castain	8858d9a40e	Add a marker for other layers to use in defining data types This commit was SVN r24643.	2011-04-28 00:19:35 +00:00
Ralph Castain	9988b97b97	Extend/update how we handle process stats. Add the ability to collect node-level stats separate from the process stats. Update the process stat memory fields to report in MBytes instead of KBytes as I can't find any process that runs in KBytes nowadays. Rename the memusage sensor plugin to "resusage" as it will soon be updated to include full process stat monitoring. Extend the heartbeat sensor to report node and process stats in the heartbeat. Store the process and node stats in their respective orte_xxx_t object. This commit was SVN r24629.	2011-04-21 22:55:45 +00:00
Ralph Castain	5f64b830f9	Ensure we only kill threads once This commit was SVN r24620.	2011-04-18 14:47:09 +00:00
Ralph Castain	8014e7432c	Send recovery defined flag in app_contexts, include recovery flags in debug prints This commit was SVN r24619.	2011-04-18 14:46:42 +00:00
Ralph Castain	89501e6e24	Don't try to politely end threads when abnormally terminating as we can hang if the thread is in a stuck callback. This commit was SVN r24618.	2011-04-18 12:21:47 +00:00
Ralph Castain	3a28556472	Expand our handling of non-zero exit status. If a process exits with non-zero status, pass that info along to the user in case it means something to them, even if the process also exited without calling MPI_Finalize. If the process calls MPI_Abort, that trumps the exit status question. Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI. This commit was SVN r24614.	2011-04-14 15:04:21 +00:00
Ralph Castain	30fb002524	Take the first small step towards rationalizing rsh support. Create a new "rshbase" component that contains a simple rsh module - no tree spawn, uses all the base functions for launch support. Extend the base rsh support functions to include those functions in common across all rsh modules. Only a minor change made to the current rsh module to avoid a naming conflict. Otherwise, left it alone to avoid creating conflicts with other external work. The current rsh module remains the default for rsh/ssh support, and continues to contain the support for SGE and Loadleveler. This commit was SVN r24593.	2011-03-30 01:15:07 +00:00
Nysal Jan	866ae8b43a	Close the file descriptor This commit was SVN r24580.	2011-03-29 08:42:49 +00:00
Nysal Jan	c8c6b0edab	Improve LoadLeveler integration with Open MPI. Add support for LL native rsh agent - llspawn This commit was SVN r24579.	2011-03-29 07:46:59 +00:00
Nathan Hjelm	8634b6394f	fixed plm/tm component This commit was SVN r24577.	2011-03-25 22:20:15 +00:00
Ralph Castain	d7e029cb40	Convert heartbeat to multicast basis This commit was SVN r24570.	2011-03-24 19:05:39 +00:00
Ralph Castain	90698a2c02	Ensure that blocking recvs wait until the data is actually recvd This commit was SVN r24558.	2011-03-22 18:45:54 +00:00
Ralph Castain	888472f671	Do not release recv as the calling function needs that data and will release it later This commit was SVN r24557.	2011-03-22 18:44:56 +00:00
Ralph Castain	30981de200	Minor cleanups courtesy of Nysal - thanks! This commit was SVN r24552.	2011-03-22 13:48:58 +00:00
Ralph Castain	c1396b278c	Resolve the rsh confusion by splitting the initial search for a launch agent from the actual setup of the launch agent values in the plm base globals. Have each aspiring rsh-clone call lookup to see if their desired launch agent is available - if not, then reject that plm component. If so, then setup the actual launch agent values only when the module init function is called. This resolves the current conflict between the rsh and rshd components. Hopefully, it may avoid future problems in this area -provided- any new uses of rsh-like launchers abide by the lookup-and-then-setup rule. This commit was SVN r24550.	2011-03-22 02:23:09 +00:00
Ralph Castain	d17b50e1ff	Add the appropriate hooks to tell Totalview to display the user's main program upon startup. Apparently, this hook got lost somewhere after the 1.2 series :-( Thanks to David Turner and the TV folks for passing this along. This commit was SVN r24549.	2011-03-21 17:40:58 +00:00
Ralph Castain	795ca2cff2	Complete implementation of the multicast-based grpcomm module This commit was SVN r24548.	2011-03-20 01:18:06 +00:00
Ralph Castain	fa40f5d7c3	Fix bad formatting This commit was SVN r24547.	2011-03-20 01:17:29 +00:00
Ralph Castain	281116ddc5	A max_restarts value of -1 is now valid and indicates infinite restarts, so correct the validity check This commit was SVN r24546.	2011-03-20 01:17:00 +00:00
Eugene Loh	2770a12beb	Continue clean up of thread options started in r22841, 22842, and 22849. No need for any CMRs to 1.5... that was already done in CMR 2728. This commit was SVN r24545. The following SVN revision numbers were found above: r22841 --> open-mpi/ompi@b400b84162	2011-03-18 21:36:35 +00:00
Ralph Castain	ee68cd102c	Fix the hier grpcomm module so modex results in correct data. The prior implementation stored the modex data as node-based attributes. This worked fine for BTL's such as openib where the interfaces were associated with the node. However, BTL's such as TCP have interfaces associated with a specific process, not a node. Thus, store the data in the modex database so it is correctly indexed. This commit was SVN r24536.	2011-03-17 02:22:23 +00:00
Ralph Castain	d5dfe05521	Remove stale code associated with OPAL_THREADS_HAVE_DIFFERENT_PIDS. In the past, we have supported the case of really, really old Linux kernels where threads have different pids. However, when we updated the event library, we didn't also update that support code. In addition, when we dropped progress thread support, we didn't remove areas of the code that could no longer be compiled (i.e., were protected by "if progress thread && if have different pids). There was no compelling reason to support such old kernels. Accordingly, convert the test to print a nice error message indicating we no longer support old kernels (but indicate that earlier OMPI versions do) and error out. Remove all code that was protected by "if have different pids" since it can no longer be compiled. This commit was SVN r24531.	2011-03-15 21:05:03 +00:00
Ralph Castain	de092af8ef	Add a little more debug This commit was SVN r24526.	2011-03-14 18:43:49 +00:00
Ralph Castain	ebabe9c83a	Forgot that Terry wanted to control the vm launch with an mca param - set one up for that purpose This commit was SVN r24525.	2011-03-13 00:46:42 +00:00
Ralph Castain	dc6f616599	Enable VM launch. For some time, ORTE has had the ability to launch daemons on all nodes prior to launching an application. It has largely been used outside of the OMPI community, and so was never explicitly turned "on" inside OMPI releases. Nevertheless, the code has been there. Allowing VM launches does not require ANY changes to existing PLM components. All that was required was to have orterun launch the daemons as a separate call to orte_plm.spawn -prior- to launching the applications. The rest of the VM support code resides in the rmaps framework: (a) a check when asked to map a job to see if it is the daemon job, and (b) a separate "setup_virtual_machine" mapper in the rmaps base that creates the required map so the PLM's will do the right thing. In order to support those users who have no RM allocation but like to give the allocation in the form of a -host or -hostfile argument to their application, there is a little more code in orterun and the setup_virtual_machine mapper to capture information passed in that manner. This has been tested with rsh and slurm environments, and, since there is nothing environment-specific in the implementation, should work in others as well - but needs to be proven. This commit was SVN r24524.	2011-03-12 22:50:53 +00:00
Ralph Castain	80265b472e	Avoid direct reference of pointer_array elements This commit was SVN r24523.	2011-03-12 20:18:51 +00:00
Ralph Castain	df82e4cd36	Plug a memory leak This commit was SVN r24521.	2011-03-12 15:37:33 +00:00
Ralph Castain	1297acde13	George raised some valid concerns about the extensibility of the revised rmaps framework. Address those by: 1. removing the enum of mapper values 2. change the req_mapper and last_mapper fields to char* so they can hold the component name instead of a mapper flag 3. revise the selection logic in the mapper components to reflect the change. Components now look for their name in the req_mapper field, or to see if other criteria (e.g., npernode) are set that mandate their doing the mapping Several MCA params resided in the rmaps base for historical reasons - they have been in the base since at least the original 1.2 release (and perhaps earlier). However, George correctly pointed out that they really should reside in their respective components. Accordingly, move them to the components, but register synonyms to the old names to avoid breaking backward compatibility. These revisions retain the current functionality of allowing comm_spawn'd jobs to use different mappers than the original job, and for the errmgr to utilize the resilient mapper to recover processes regardless of how they were originally mapped. Given the large number of possible combinations, I am sure that someone will find a corner-case combination of values and selection criteria that cause either no mapper to be selected, or one other than the intended to be used. No one can test all the ways people will use this system, so I expect debugging to continue for awhile. The ability of comm_spawn'd jobs to exploit this functionality relies on changes to the orte_dpm component - this will be committed separately. This commit was SVN r24520.	2011-03-12 05:30:09 +00:00
Samuel Gutierrez	830c7c66dc	fixes CID #1667 This commit was SVN r24518.	2011-03-12 03:09:01 +00:00
Ralph Castain	e6a76cc923	Fixes CID #1954 This commit was SVN r24516.	2011-03-11 23:00:27 +00:00
Ralph Castain	2ccd514b9a	Add version string to app This commit was SVN r24514.	2011-03-11 20:38:37 +00:00
Samuel Gutierrez	2a2319d23a	when orte_timing is enabled, always record daemon launch start time before starting the real work. This commit was SVN r24513.	2011-03-11 00:09:23 +00:00
Ralph Castain	f9a9fac76b	Minor typo This commit was SVN r24506.	2011-03-10 16:09:31 +00:00
George Bosilca	80fe617cd2	If we don't release the OPAL utils explicitly there will be a memory leak. This commit was SVN r24505.	2011-03-10 00:42:28 +00:00
George Bosilca	7f34a28c8f	Correct a comment. This commit was SVN r24504.	2011-03-10 00:41:41 +00:00
George Bosilca	d2502b14f9	Destruct the OOB TCP internal objects. This commit was SVN r24503.	2011-03-10 00:40:54 +00:00
Ralph Castain	3b4421d8e3	Separately track requested and last-used mapper so we don't lose that info This commit was SVN r24502.	2011-03-09 18:51:36 +00:00
Jeff Squyres	06d5c59115	Fix a few valgrind-reported memory leaks This commit was SVN r24498.	2011-03-08 17:37:28 +00:00
Jeff Squyres	0586612bd5	Fix another minor memory leak This commit was SVN r24495.	2011-03-08 15:46:13 +00:00
Jeff Squyres	79cf382ff3	Fix a few issues with error messages: * If something goes wrong during ompi_mpi_init, don't erroneously report that it is illegal to invoke MPI_INIT* before MPI_INIT * Aggregate help messages when possible when something goes wrong during ompi_mpi_init This commit was SVN r24492.	2011-03-07 16:45:45 +00:00
Ralph Castain	63f38e38bb	Fix ompi-server: remove extra command flag in buffer being sent to mpirun, ensure that tools route messages thru a remote HNP This commit was SVN r24491.	2011-03-05 17:12:46 +00:00
Ralph Castain	d764e7a398	We want uid/gid support at the individual application level. Ensure the values get initialized and packed/unpacked for transfer. This commit was SVN r24489.	2011-03-04 18:46:43 +00:00
George Bosilca	9bbe00bdc3	Set the return code from the processes upstream. This commit was SVN r24483.	2011-03-03 00:02:21 +00:00
George Bosilca	c6a5f9706a	Thomas's patch: Assume we won't fail unless notified by a child. This commit was SVN r24482.	2011-03-02 23:50:01 +00:00
Josh Hursey	62bba1bf12	Name the enum so that it represents as an actual symbol in gdb, instead of just a number. This commit was SVN r24472.	2011-03-01 21:00:03 +00:00
Nysal Jan	4030111478	Add missing copyright and fix the year This commit was SVN r24446.	2011-02-23 15:52:06 +00:00
Nysal Jan	42a73bb887	POE is supported on both AIX and Linux. Build POE PLM only if we find the poe binary. Fix hostfile creation and POE command line arguments. This commit was SVN r24444.	2011-02-23 15:38:41 +00:00
Ralph Castain	f014284f91	Update resilient recovery mapping algorithm to be a bit more sophisticated. Track the prior node a proc was on so we avoid ricochet effect. Also avoid putting recovering proc onto node that is already occupied by a peer as this degrades fault tolerance. This commit was SVN r24417.	2011-02-20 18:46:21 +00:00
Ralph Castain	a8cf19a7bc	Ensure heartbeat only started once and only for daemon job This commit was SVN r24416.	2011-02-18 20:33:54 +00:00
Ralph Castain	ef56e6d78b	Helps to move the pointer This commit was SVN r24414.	2011-02-18 14:01:25 +00:00
Ralph Castain	7b35ada7fc	Fix ricochet effect - move failed procs to next on list instead of loadbalancing This commit was SVN r24413.	2011-02-18 13:11:55 +00:00
Ralph Castain	b98a2917ff	Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it. This commit was SVN r24412.	2011-02-18 02:48:12 +00:00
Ralph Castain	a0f6e153c7	Add missing fields to copy of app_context object This commit was SVN r24411.	2011-02-17 23:55:05 +00:00
Ralph Castain	51cf0a16c3	Some minor cleanups to support VM and CM operations This commit was SVN r24408.	2011-02-16 23:03:08 +00:00
Ralph Castain	9b48c07599	CM daemons handle their own output This commit was SVN r24407.	2011-02-16 23:02:23 +00:00
Ralph Castain	65ba6af44d	Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM. Have each mapper flag it did the map so we can see who did it later. Ensure procs are flagged as "ready to launch". This commit was SVN r24406.	2011-02-16 23:01:57 +00:00
Jeff Squyres	3f4d4886f2	Minor update for something that has been bugging me for quite a while: OMPI supports multiple different repository systems (SVN, hg, git). But the VERSION file has listed "want_svn" and "svn_r" as fields, even though the actual repo system and version may not be SVN. So search/replace those fields (and derivative values that come from those fields) with "want_repo_rev" and "repo_rev", respectively. This commit was SVN r24405.	2011-02-16 22:53:23 +00:00
Ralph Castain	a32a7d9a82	Update heartbeat system This commit was SVN r24404.	2011-02-16 18:50:51 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Ralph Castain	5120e6aec3	Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param. When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests. Also remove the stale and never completed topo mapper. This commit was SVN r24393.	2011-02-15 23:24:31 +00:00
Ralph Castain	c1da94a444	Dead children have no pid This commit was SVN r24387.	2011-02-15 13:30:51 +00:00
Ralph Castain	a3607ff35d	Make it easier to send a kill-local-procs command for an arbitrary number of procs This commit was SVN r24386.	2011-02-15 13:26:11 +00:00
Ralph Castain	bf1cff3711	Plug a couple of additional memory leaks - try to highlight a little better that strings returned from reg_string_name must be freed by caller This commit was SVN r24383.	2011-02-14 20:58:22 +00:00
Ralph Castain	a9dca25ca5	Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do. Cleanup the heartbeat handling so it is associated with the proc, not a node. Cleanup handling of recovery options so that defaults do not override user values iff they are provided. This commit was SVN r24382.	2011-02-14 20:49:12 +00:00
Ralph Castain	6ee69c670e	Add a minor utility This commit was SVN r24379.	2011-02-14 19:45:59 +00:00
Ralph Castain	b5de068533	Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address. Add some new proc/job states Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed. Add pretty-print of signals for "proc aborted due to signal" reports. This commit was SVN r24378. The following SVN revision numbers were found above: r24371 --> open-mpi/ompi@93d28a5792	2011-02-14 19:29:09 +00:00
Abhishek Kulkarni	93d28a5792	Change opal_err2str_fn_t to return the error string as an argument. This means that the converters (opal_err2str, orte_err2str) can now return NULL as a "silent error". The return value of opal_err2str_fn_t is the status of the operation (OPAL_SUCCESS or OPAL_ERROR). This fixes the "Unknown error" message issues on the trunk. This commit was SVN r24371.	2011-02-13 16:09:17 +00:00
Ralph Castain	33b68132cc	Update the rmcast framework This commit was SVN r24370.	2011-02-12 16:52:03 +00:00
Josh Hursey	a9335ea423	Make sure to initialize the 'update_state' function for the default module. This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB. This commit was SVN r24365.	2011-02-08 20:42:32 +00:00
Nysal Jan	3a8d251daa	vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog This commit was SVN r24339.	2011-02-02 10:05:57 +00:00
Josh Hursey	fa3f6485d8	Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around. This commit was SVN r24326.	2011-01-31 19:09:47 +00:00
Josh Hursey	5b58ff0663	Fix a C/R checkpoint->restart->checkpoint->restart case. The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint. This seems to be fixed now. Thanks to Alex Brick for the bug report. This commit was SVN r24325.	2011-01-28 21:25:14 +00:00
Jeff Squyres	ec3d18dc9f	As noted on the mailing list by Gabriele Fatigati (http://www.open-mpi.org/community/lists/users/2011/01/15427.php), the --tv (and friends) switches to mpirun would effectively munge the orterun command line together and then split it apart again before exec'ing the underlying debugger. We would therefore lose multi-token argv[x] values and split them into multiple tokens. For example: mpirun --tv -np 2 a.out "foo bar" would get launched with "foo" and "bar" as separate arguments; not one argument. This was due to the underlying code joining the argv into a single string and then re-splitting it. This commit removed the argv join; it now does the parsing and re-jiggering of the argv by only looking at each individual argv item; multi-word tokens like "foo bar" will never be split into separate tokens. This commit was SVN r24322.	2011-01-28 13:01:06 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting this/these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Josh Hursey	81fd41f811	Return an informative error message if the user requests a migration of a job that is not capable of it. C/R Functionality cleanup This commit was SVN r24307.	2011-01-26 15:36:34 +00:00
Josh Hursey	8f45fcb429	More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. * Fix the checkpoint-restart-checkpoint case which would previously reject the checkpoint of the newly restarted process. Making sure to re-enable checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation). * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer. * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process. * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario. This commit was SVN r24306.	2011-01-26 14:56:35 +00:00
Nathan Hjelm	8a3179cdcb	removed c99 test code This commit was SVN r24297.	2011-01-25 23:02:35 +00:00
Josh Hursey	e4d13d338f	Fix a couple of compiler warnings This commit was SVN r24295.	2011-01-25 22:22:32 +00:00
George Bosilca	09f645f9a9	There is no need for the byte variable. This commit was SVN r24293.	2011-01-24 22:41:04 +00:00
Nathan Hjelm	e2126512a9	test c99 struct initialization with mtt. remove on jan 20, 2011 This commit was SVN r24271.	2011-01-19 22:21:21 +00:00
Abhishek Kulkarni	fd7ef7a1f1	Fixes broken trunk compile: call process status notify only when ft-enable-cr is selected. This commit was SVN r24255.	2011-01-14 18:37:07 +00:00
Abhishek Kulkarni	87d2c9b31d	Few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) * Improve the FTB notifier to publish (C/R, process/communication failure) events to the FTB with the OMPI jobid as the associated payload. * Add notifier calls for C/R events and process status events in SnapC and ErrMgr components. * Fix a bug where the SnapC states and process states collide before being thrown out over the notifier. This commit was SVN r24251.	2011-01-13 20:13:49 +00:00
Ralph Castain	b09f57b03d	Update the multicast subsystem - ported from Cisco branch This commit was SVN r24246.	2011-01-13 01:54:05 +00:00
Terry Dontje	f3aaa885a3	corrected a couple places in orte where it said cpu_model when it should have been cpu_type. This commit was SVN r24221.	2011-01-11 19:56:26 +00:00
Abhishek Kulkarni	11ffa854ff	Update the FTB notifier * fix indentation issues * update the name of one of the fault events published to the FTB (per the FTB MPI standard) This commit was SVN r24213.	2011-01-10 18:58:31 +00:00
Ralph Castain	80ef1af8ba	Add psm key generator program This commit was SVN r24197.	2010-12-30 20:54:58 +00:00
Nathan Hjelm	c082d05ecb	Reset the timer on MPIR_being_debugged only if MPIR_being_debugged is not set. Fix typo in return code. This commit was SVN r24187.	2010-12-20 21:00:49 +00:00
Jeff Squyres	a525e70f46	Convert "opal_show_help" to be a global variable pointer. It is statically initialized to the real back-end OPAL show_help function. During orte_show_help_init(), the variable is re-assigned with the value of the back-end ORTE show_help function (the one that does error message aggregation). Therefore, anything that calls opal_show_help() after a certain point in orte_init() will have their show_help messages be aggregated. w00t! Even code down in OPAL -- that has no knowledge of ORTE -- will have their messages aggregated. '''Double w00t!''' During orte_show_help_finalize(), we restore the original pointer value so that if something calls opal_show_help() after orte_finalize(), it'll still work properly (but it won't be aggregated). This commit was SVN r24185.	2010-12-16 23:00:25 +00:00
Jeff Squyres	de97962aac	Fixes trac:2651. Fix off-by-one error when /dev/urandom doesn't exist. Thanks to "pth" for the patch. This commit was SVN r24170. The following Trac tickets were found above: Ticket 2651 --> https://svn.open-mpi.org/trac/ompi/ticket/2651	2010-12-14 14:52:51 +00:00
Ralph Castain	b251a59cdf	Cleanup nidmap finalize This commit was SVN r24164.	2010-12-11 16:42:06 +00:00
Ralph Castain	2dc5cbb483	Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago. This commit was SVN r24153.	2010-12-05 15:58:21 +00:00
Rolf vandeVaart	b67d3398da	It is convention to have orte_config.h included at top of file. This commit was SVN r24146.	2010-12-03 16:13:31 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	aaad8ae891	Remove unused var This commit was SVN r24136.	2010-12-02 02:38:13 +00:00
Ralph Castain	f9ffff59f8	Ensure clean termination of threads and tcp multicast This commit was SVN r24134.	2010-12-02 00:23:42 +00:00
Nathan Hjelm	75605faa75	added support for reattaching a debugger using the MPIR_attach_fifo This commit was SVN r24132.	2010-12-01 20:13:58 +00:00
Ralph Castain	ad814f26cd	One more time, into the breach! Restore the use of override_oversubscribe to indicate that the data source for resources on the backend nodes used in mapping is unreliable. In this situation (e.g., data came from hostfile, or we are just using localhost because nothing was provided), we don't trust the oversubscribe condition passed by the mapper. Instead, we check locally to ensure we set sched_yield correctly. This commit was SVN r24130.	2010-12-01 15:15:26 +00:00
Ralph Castain	eba65e97f3	Extend the rmcast APIs to allow enable/disable of comm, required for clean termination by upper layer users. Point the recv thread event base to the right place so it can wakeup when required. Add a new error code for "comm disabled" when attempting to communicate after disabling comm. This commit was SVN r24129.	2010-12-01 13:41:19 +00:00
Ralph Castain	9224302c10	Remove debug This commit was SVN r24128.	2010-12-01 13:12:24 +00:00
Ralph Castain	4f5625d699	Not totally necessary, but good form - init the oversubscribed field in the orte_nid_t object This commit was SVN r24127.	2010-12-01 12:58:37 +00:00
Ralph Castain	30c37ea536	Ensure that the oversubscribed condition of nodes is accurately reported by the mapper, and that the results are communicated and used by the backend orteds when setting sched_yield on local procs. Restores prior behavior that was somehow lost along the way. Includes a patch from Damien Guinier to fix vpid assignments when cpus-per-task is specified. This commit was SVN r24126.	2010-12-01 12:51:39 +00:00
Ralph Castain	85a974b0de	Better check for NULL before using the value This commit was SVN r24122.	2010-12-01 04:48:50 +00:00
Ralph Castain	c56185887b	Change the event base "wakeup" support to enable the passing of events to the central thread for add/del. Add a macro OPAL_UPDATE_EVBASE for this purpose as it will likely be widely used. Update the ORTE thread support to utilize this capability. Update the rmcast framework to track the change. This commit was SVN r24121.	2010-12-01 04:26:43 +00:00
Ralph Castain	963336ee5a	Remove the test for libevent internal threads This commit was SVN r24120.	2010-12-01 04:24:10 +00:00
Ralph Castain	0441e81882	Oops - ensure that multicast msgs get circulated properly with the tcp module This commit was SVN r24118.	2010-11-30 21:13:53 +00:00
Ralph Castain	d20c023348	Checkpoint the threading support for multicast - will be revised shortly, but this version currently works. This commit was SVN r24117.	2010-11-30 17:30:16 +00:00
Ralph Castain	0465605a9c	Cleanup condition check for a param so it doesn't show if not usable. This commit was SVN r24116.	2010-11-30 17:28:53 +00:00
Ralph Castain	09f02b3087	Update the ORTE thread acquire/release/wakeup macros to trigger release from event_loop so that conditions can be checked. Add macro versions of condition_wait and friends for debug use. This commit was SVN r24115.	2010-11-30 17:27:58 +00:00
Ralph Castain	d2547e84a3	MPI procs never use orte progress threads This commit was SVN r24093.	2010-11-29 03:52:46 +00:00
Ralph Castain	71669720a3	Just get the output once on sigpipe error, and include the fd This commit was SVN r24092.	2010-11-25 15:32:48 +00:00
Ralph Castain	30c635fd4d	Don't endlessly output sigpipe errors. Count the number of times we trap it, and abort if we get more than 10 of them. This commit was SVN r24091.	2010-11-25 15:25:24 +00:00
Ralph Castain	b9b2d101dc	Add an mca param to indicate if orte progress threads are to be enabled. Error out if this is given and libevent thread support was not built. This commit was SVN r24089.	2010-11-24 23:28:00 +00:00
Rolf vandeVaart	1d62542c23	Fix another Sun Studio warning. jobid and vpid need to be uint32_t. This commit was SVN r24074.	2010-11-19 18:12:46 +00:00
Rolf vandeVaart	09fdd5cc23	Include fcntl.h, not sys/fcntl.h so we get the definition of the open system call. That is what man page says to do. Fixes warning on Solaris. This commit was SVN r24073.	2010-11-19 17:40:02 +00:00
Shiqing Fan	358b4a5cba	Add an option to enable the debug postfix for executables. This commit was SVN r24070.	2010-11-19 15:54:13 +00:00
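
Commit ec3d18dc9f above explains that joining argv into one string and then re-splitting it breaks multi-word arguments such as "foo bar". The following is a minimal sketch of that failure mode and of the per-item alternative. It is written in Python for brevity rather than in Open MPI's C, the function names are hypothetical, and it is not the actual orterun code; the only rewriting it performs is dropping an assumed debugger switch.

```python
def launch_argv_joined(argv):
    """Buggy approach: join argv into one string, then re-split on whitespace.

    Any argument that itself contains whitespace is broken into pieces.
    """
    joined = " ".join(argv)
    return joined.split()


def launch_argv_per_item(argv, debugger_flags=("--tv",)):
    """Per-item approach: look at each argv entry on its own, never re-split.

    The debugger_flags default is an assumption made for this illustration.
    """
    return [arg for arg in argv if arg not in debugger_flags]


# The example command line from the commit message, already tokenized
# the way a shell would hand it to the program.
argv = ["mpirun", "--tv", "-np", "2", "a.out", "foo bar"]

print(launch_argv_joined(argv))    # "foo bar" ends up as two tokens
print(launch_argv_per_item(argv))  # "foo bar" stays a single token
```

The per-item version never turns the argument vector back into a flat string, so token boundaries supplied by the shell are preserved.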


3720 commits