openmpi

Author	SHA1	Message	Date
Jeff Squyres	9531e205e1	Minor fix to a comment This commit was SVN r24789.	2011-06-20 17:51:01 +00:00
Ralph Castain	d563e58348	Remove another lingering bproc reference This commit was SVN r24785.	2011-06-17 18:01:23 +00:00
Ralph Castain	042ee3ec48	Support the option of outputting error_log messages with something other than the process name This commit was SVN r24784.	2011-06-17 14:50:00 +00:00
Ralph Castain	61dd7f4588	Need to pass identification of input/output channels This commit was SVN r24783.	2011-06-17 14:48:59 +00:00
Ralph Castain	93dcfc15d0	Let the upper layer set the channels to be opened This commit was SVN r24779.	2011-06-16 20:31:52 +00:00
Ralph Castain	7f2d2e3de7	Track the app_context rank - will equal overall rank for single app_context jobs This commit was SVN r24778.	2011-06-16 20:31:30 +00:00
Josh Hursey	0eb3b3b7b0	Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9335.php This commit was SVN r24775.	2011-06-15 13:10:13 +00:00
Ralph Castain	033cbbed31	Don't automatically assign group channels if not given - let the layer above figure it out. This commit was SVN r24771.	2011-06-10 16:28:18 +00:00
Ralph Castain	e039c7b7ea	Avoid crashing when debugging rmaps and a non-string resource constraint is given This commit was SVN r24770.	2011-06-10 16:27:30 +00:00
Josh Hursey	6539a31b23	Cleanup configure checks for C/R functionality. Add a WANT_FT_CR flag different from WANT_FT so tools like *-checkpoint are not built when a different FT technique is requested. Also fix the C/R thread check so that it is only enabled if C/R is enabled, not generally when threads are enabled. This commit was SVN r24769.	2011-06-09 19:45:29 +00:00
Josh Hursey	9080eaedf3	Fully initialize the orte_errmgr_base_component_t structure for the app. This commit was SVN r24765.	2011-06-09 14:22:25 +00:00
Josh Hursey	20339a7900	Minor coding style and indentation fixes. This commit was SVN r24764.	2011-06-09 14:16:06 +00:00
Ralph Castain	fc8d920c56	Don't monitor resource usage unless requested, even if sensors are enabled during configure This commit was SVN r24760.	2011-06-08 10:47:25 +00:00
Ralph Castain	906eb925f1	Update the resource usage sensor to initiate support for time analysis of measurements This commit was SVN r24759.	2011-06-07 23:22:51 +00:00
Ralph Castain	f3cae3d6f3	Cleanup the handling of if_include and if_exclude arguments based on CIDR notation. Fix a bug in the new code that prevented the system from correctly matching addresses. Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide. Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result. Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages. NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing. This commit was SVN r24755.	2011-06-07 02:09:11 +00:00
George Bosilca	6b52d8f519	The paffinity is apparently needed. This commit was SVN r24749.	2011-06-06 01:20:01 +00:00
Ralph Castain	bd8d9a943a	Add diagnostics This commit was SVN r24748.	2011-06-05 19:17:56 +00:00
Ralph Castain	1491d52bd7	Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3"). This commit was SVN r24747.	2011-06-05 19:16:42 +00:00
George Bosilca	454519842e	Report bindings if requested. This commit was SVN r24743.	2011-06-02 17:17:10 +00:00
George Bosilca	1eccadbd87	No need for the paffinity here. This commit was SVN r24742.	2011-06-02 17:16:25 +00:00
Ralph Castain	8f401a0563	Enable the ability to constrain applications to hosts on the basis of resources. This commit was SVN r24736.	2011-05-28 22:18:19 +00:00
Brian Barrett	beb1bc70b2	* Add support for using modex to exchange NID/PID pairs when using Portals4. Rather than try to support a bunch of lightweight environments like I did with the Portals3 code, always use the "modex" and hack the grpcomm for the SHMEM implementation to return the right nid/pid for a remote process by "magic". This commit was SVN r24733.	2011-05-25 22:10:27 +00:00
Ralph Castain	81b6c50daa	Correct stale typo This commit was SVN r24725.	2011-05-23 17:34:22 +00:00
Ralph Castain	661f508e62	Fix typo This commit was SVN r24723.	2011-05-22 00:20:42 +00:00
Ralph Castain	b2331113a5	Add some debug This commit was SVN r24722.	2011-05-21 21:09:47 +00:00
Ralph Castain	8c08ee9c3d	Remove stale tool This commit was SVN r24720.	2011-05-21 00:38:35 +00:00
Ralph Castain	1b5ca323c6	Always followup with sigkill when killing local procs as procs can trap sigterm and get stuck This commit was SVN r24719.	2011-05-20 22:40:10 +00:00
Ralph Castain	c5686ecfca	Don't sample stats for pid=0 children This commit was SVN r24717.	2011-05-20 14:33:23 +00:00
Ralph Castain	b03e4481a3	Plug a couple of additional places in case orte_iof is not opened This commit was SVN r24716.	2011-05-20 13:42:53 +00:00
Ralph Castain	dc0bb0571b	Record the number of heartbeats recvd each period for diag purposes This commit was SVN r24714.	2011-05-20 00:21:33 +00:00
Ralph Castain	69dce0ec10	Minor heartbeat cleanups This commit was SVN r24713.	2011-05-19 21:27:44 +00:00
Ralph Castain	c3df95dd13	Prevent failure due to race condition during abnormal term This commit was SVN r24712.	2011-05-19 21:27:05 +00:00
Ralph Castain	b0f47e6f59	Allow orte_iof to not be opened This commit was SVN r24711.	2011-05-19 21:26:30 +00:00
Ralph Castain	1f3911cc8b	Add a new proc state This commit was SVN r24710.	2011-05-19 21:25:58 +00:00
Ralph Castain	b47ec2ee87	Remove lingering references to opal_profile option This commit was SVN r24709.	2011-05-18 18:27:29 +00:00
Ralph Castain	9678e62613	Fix possible corruption of environ. Thanks to Ariel Burton and Peter Thompson for finding it! This commit was SVN r24708.	2011-05-18 16:25:35 +00:00
Ralph Castain	d34bab541d	Remove the ompi-profiler tool and its attendant ompi-probe program. Also remove the grpcomm basic component since its only function was to support profiled clusters, which nobody was doing. :-( This commit was SVN r24704.	2011-05-17 03:30:25 +00:00
Ralph Castain	a3e43594a4	Extend node stats to include additional memory info. Change "darwin" pstat module to "test" as we don't really know how to get all the stat info for darwin. Add a new OPAL_ERROR_LOG macro similar to the ORTE_ERROR_LOG one. This commit was SVN r24692.	2011-05-08 14:45:16 +00:00
Ralph Castain	c160f5d5a2	Add ability to specify mcast interfaces by name This commit was SVN r24691.	2011-05-08 14:42:48 +00:00
Thomas Herault	fb3fd8fd0e	Items belonging to peer_send_queue are mca_oob_tcp_msg_t *, which are obtained through an opal_freelist. They shouldn't be released, but returned to the freelist. This commit was SVN r24679.	2011-05-03 21:03:09 +00:00
Ralph Castain	9df207aa51	Fix the case where a user supplies the -xterm option, which requires that we leave ssh sessions attached. This commit was SVN r24668.	2011-05-02 12:39:55 +00:00
Ralph Castain	138928fcf4	Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces. This commit was SVN r24666.	2011-04-29 18:46:40 +00:00
Shiqing Fan	9e90ade864	Missed one file from the last commit. This commit was SVN r24664.	2011-04-29 14:44:02 +00:00
Shiqing Fan	4490fdbd34	Add the initial support for MinGW and MSYS. Correctly check the dependencies of MSYS env. Set up configure include and lib path for building the package. update a few more CMake scripts. This commit was SVN r24663.	2011-04-29 14:42:07 +00:00
Ralph Castain	c78531ce8a	Don't free the envar that gets putenv'd as that messes up the environ This commit was SVN r24660.	2011-04-29 08:50:29 +00:00
Ralph Castain	0ff0d20e72	Grr...get the prefix right - need to strip the bin out of absolute path to mpirun. This commit was SVN r24658.	2011-04-28 22:20:55 +00:00
Ralph Castain	6af2677fb8	Check for both absolute-path-to-mpirun and -prefix being specified. If the two differ, print out a warning and ignore -prefix. If they are the same, or only one was given, then proceed as directed. This commit was SVN r24657.	2011-04-28 22:12:41 +00:00
Ralph Castain	b586f2952e	Arggg...revert r24645. I knew those fields were there for a reason...sigh. This commit was SVN r24647. The following SVN revision numbers were found above: r24645 --> open-mpi/ompi@e4732110da	2011-04-28 15:07:00 +00:00
Ralph Castain	859aaab93d	In the case of direct-launched processes running under slurm, psm requires that the pre_condition_transports MCA param be set. This is normally computed by mpirun and inserted into each proc's environ, but that doesn't work here. So separate out the printing of that key, and let the individual procs generate it in a way that ensures they all get the same result. This commit was SVN r24646.	2011-04-28 13:54:33 +00:00
Ralph Castain	e4732110da	Remove a couple more stale fields This commit was SVN r24645.	2011-04-28 00:26:38 +00:00
Ralph Castain	39369f8807	Remove stale fields from global objects - have been moved to the layer that actually uses them This commit was SVN r24644.	2011-04-28 00:20:49 +00:00
Ralph Castain	8858d9a40e	Add a marker for other layers to use in defining data types This commit was SVN r24643.	2011-04-28 00:19:35 +00:00
Ralph Castain	9988b97b97	Extend/update how we handle process stats. Add the ability to collect node-level stats separate from the process stats. Update the process stat memory fields to report in MBytes instead of KBytes as I can't find any process that runs in KBytes nowadays. Rename the memusage sensor plugin to "resusage" as it will soon be updated to include full process stat monitoring. Extend the heartbeat sensor to report node and process stats in the heartbeat. Store the process and node stats in their respective orte_xxx_t object. This commit was SVN r24629.	2011-04-21 22:55:45 +00:00
Ralph Castain	5f64b830f9	Ensure we only kill threads once This commit was SVN r24620.	2011-04-18 14:47:09 +00:00
Ralph Castain	8014e7432c	Send recovery defined flag in app_contexts, include recovery flags in debug prints This commit was SVN r24619.	2011-04-18 14:46:42 +00:00
Ralph Castain	89501e6e24	Don't try to politely end threads when abnormally terminating as we can hang if the thread is in a stuck callback. This commit was SVN r24618.	2011-04-18 12:21:47 +00:00
Ralph Castain	3a28556472	Expand our handling of non-zero exit status. If a process exits with non-zero status, pass that info along to the user in case it means something to them, even if the process also exited without calling MPI_Finalize. If the process calls MPI_Abort, that trumps the exit status question. Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI. This commit was SVN r24614.	2011-04-14 15:04:21 +00:00
Ralph Castain	30fb002524	Take the first small step towards rationalizing rsh support. Create a new "rshbase" component that contains a simple rsh module - no tree spawn, uses all the base functions for launch support. Extend the base rsh support functions to include those functions in common across all rsh modules. Only a minor change made to the current rsh module to avoid a naming conflict. Otherwise, left it alone to avoid creating conflicts with other external work. The current rsh module remains the default for rsh/ssh support, and continues to contain the support for SGE and Loadleveler. This commit was SVN r24593.	2011-03-30 01:15:07 +00:00
Nysal Jan	866ae8b43a	Close the file descriptor This commit was SVN r24580.	2011-03-29 08:42:49 +00:00
Nysal Jan	c8c6b0edab	Improve LoadLeveler integration with Open MPI. Add support for LL native rsh agent - llspawn This commit was SVN r24579.	2011-03-29 07:46:59 +00:00
Nathan Hjelm	8634b6394f	fixed plm/tm component This commit was SVN r24577.	2011-03-25 22:20:15 +00:00
Ralph Castain	d7e029cb40	Convert heartbeat to multicast basis This commit was SVN r24570.	2011-03-24 19:05:39 +00:00
Ralph Castain	90698a2c02	Ensure that blocking recvs wait until the data is actually recvd This commit was SVN r24558.	2011-03-22 18:45:54 +00:00
Ralph Castain	888472f671	Do not release recv as the calling function needs that data and will release it later This commit was SVN r24557.	2011-03-22 18:44:56 +00:00
Ralph Castain	30981de200	Minor cleanups courtesy of Nysal - thanks! This commit was SVN r24552.	2011-03-22 13:48:58 +00:00
Ralph Castain	c1396b278c	Resolve the rsh confusion by splitting the initial search for a launch agent from the actual setup of the launch agent values in the plm base globals. Have each aspiring rsh-clone call lookup to see if their desired launch agent is available - if not, then reject that plm component. If so, then setup the actual launch agent values only when the module init function is called. This resolves the current conflict between the rsh and rshd components. Hopefully, it may avoid future problems in this area -provided- any new uses of rsh-like launchers abide by the lookup-and-then-setup rule. This commit was SVN r24550.	2011-03-22 02:23:09 +00:00
Ralph Castain	d17b50e1ff	Add the appropriate hooks to tell Totalview to display the user's main program upon startup. Apparently, this hook got lost somewhere after the 1.2 series :-( Thanks to David Turner and the TV folks for passing this along. This commit was SVN r24549.	2011-03-21 17:40:58 +00:00
Ralph Castain	795ca2cff2	Complete implementation of the multicast-based grpcomm module This commit was SVN r24548.	2011-03-20 01:18:06 +00:00
Ralph Castain	fa40f5d7c3	Fix bad formatting This commit was SVN r24547.	2011-03-20 01:17:29 +00:00
Ralph Castain	281116ddc5	A max_restarts value of -1 is now valid and indicates infinite restarts, so correct the validity check This commit was SVN r24546.	2011-03-20 01:17:00 +00:00
Eugene Loh	2770a12beb	Continue clean up of thread options started in r22841, 22842, and 22849. No need for any CMRs to 1.5... that was already done in CMR 2728. This commit was SVN r24545. The following SVN revision numbers were found above: r22841 --> open-mpi/ompi@b400b84162	2011-03-18 21:36:35 +00:00
Ralph Castain	ee68cd102c	Fix the hier grpcomm module so modex results in correct data. The prior implementation stored the modex data as node-based attributes. This worked fine for BTL's such as openib where the interfaces were associated with the node. However, BTL's such as TCP have interfaces associated with a specific process, not a node. Thus, store the data in the modex database so it is correctly indexed. This commit was SVN r24536.	2011-03-17 02:22:23 +00:00
Ralph Castain	d5dfe05521	Remove stale code associated with OPAL_THREADS_HAVE_DIFFERENT_PIDS. In the past, we have supported the case of really, really old Linux kernels where threads have different pids. However, when we updated the event library, we didn't also update that support code. In addition, when we dropped progress thread support, we didn't remove areas of the code that could no longer be compiled (i.e., were protected by "if progress thread && if have different pids"). There was no compelling reason to support such old kernels. Accordingly, convert the test to print a nice error message indicating we no longer support old kernels (but indicate that earlier OMPI versions do) and error out. Remove all code that was protected by "if have different pids" since it can no longer be compiled. This commit was SVN r24531.	2011-03-15 21:05:03 +00:00
Ralph Castain	de092af8ef	Add a little more debug This commit was SVN r24526.	2011-03-14 18:43:49 +00:00
Ralph Castain	ebabe9c83a	Forgot that Terry wanted to control the vm launch with an mca param - set one up for that purpose This commit was SVN r24525.	2011-03-13 00:46:42 +00:00
Ralph Castain	dc6f616599	Enable VM launch. For some time, ORTE has had the ability to launch daemons on all nodes prior to launching an application. It has largely been used outside of the OMPI community, and so was never explicitly turned "on" inside OMPI releases. Nevertheless, the code has been there. Allowing VM launches does not require ANY changes to existing PLM components. All that was required was to have orterun launch the daemons as a separate call to orte_plm.spawn -prior- to launching the applications. The rest of the VM support code resides in the rmaps framework: (a) a check when asked to map a job to see if it is the daemon job, and (b) a separate "setup_virtual_machine" mapper in the rmaps base that creates the required map so the PLM's will do the right thing. In order to support those users who have no RM allocation but like to give the allocation in the form of a -host or -hostfile argument to their application, there is a little more code in orterun and the setup_virtual_machine mapper to capture information passed in that manner. This has been tested with rsh and slurm environments, and, since there is nothing environment-specific in the implementation, should work in others as well - but needs to be proven. This commit was SVN r24524.	2011-03-12 22:50:53 +00:00
Ralph Castain	80265b472e	Avoid direct reference of pointer_array elements This commit was SVN r24523.	2011-03-12 20:18:51 +00:00
Ralph Castain	df82e4cd36	Plug a memory leak This commit was SVN r24521.	2011-03-12 15:37:33 +00:00
Ralph Castain	1297acde13	George raised some valid concerns about the extensibility of the revised rmaps framework. Address those by: 1. removing the enum of mapper values 2. change the req_mapper and last_mapper fields to char* so they can hold the component name instead of a mapper flag 3. revise the selection logic in the mapper components to reflect the change. Components now look for their name in the req_mapper field, or to see if other criteria (e.g., npernode) are set that mandate their doing the mapping Several MCA params resided in the rmaps base for historical reasons - they have been in the base since at least the original 1.2 release (and perhaps earlier). However, George correctly pointed out that they really should reside in their respective components. Accordingly, move them to the components, but register synonyms to the old names to avoid breaking backward compatibility. These revisions retain the current functionality of allowing comm_spawn'd jobs to use different mappers than the original job, and for the errmgr to utilize the resilient mapper to recover processes regardless of how they were originally mapped. Given the large number of possible combinations, I am sure that someone will find a corner-case combination of values and selection criteria that cause either no mapper to be selected, or one other than the intended to be used. No one can test all the ways people will use this system, so I expect debugging to continue for awhile. The ability of comm_spawn'd jobs to exploit this functionality relies on changes to the orte_dpm component - this will be committed separately. This commit was SVN r24520.	2011-03-12 05:30:09 +00:00
Samuel Gutierrez	830c7c66dc	fixes CID #1667 This commit was SVN r24518.	2011-03-12 03:09:01 +00:00
Ralph Castain	e6a76cc923	Fixes CID #1954 This commit was SVN r24516.	2011-03-11 23:00:27 +00:00
Ralph Castain	2ccd514b9a	Add version string to app This commit was SVN r24514.	2011-03-11 20:38:37 +00:00
Samuel Gutierrez	2a2319d23a	when orte_timing is enabled, always record daemon launch start time before starting the real work. This commit was SVN r24513.	2011-03-11 00:09:23 +00:00
Ralph Castain	f9a9fac76b	Minor typo This commit was SVN r24506.	2011-03-10 16:09:31 +00:00
George Bosilca	80fe617cd2	If we don't release the OPAL utils explicitly there will be a memory leak. This commit was SVN r24505.	2011-03-10 00:42:28 +00:00
George Bosilca	7f34a28c8f	Correct a comment. This commit was SVN r24504.	2011-03-10 00:41:41 +00:00
George Bosilca	d2502b14f9	Destruct the OOB TCP internal objects. This commit was SVN r24503.	2011-03-10 00:40:54 +00:00
Ralph Castain	3b4421d8e3	Separately track requested and last-used mapper so we don't lose that info This commit was SVN r24502.	2011-03-09 18:51:36 +00:00
Jeff Squyres	06d5c59115	Fix a few valgrind-reported memory leaks This commit was SVN r24498.	2011-03-08 17:37:28 +00:00
Jeff Squyres	0586612bd5	Fix another minor memory leak This commit was SVN r24495.	2011-03-08 15:46:13 +00:00
Jeff Squyres	79cf382ff3	Fix a few issues with error messages: * If something goes wrong during ompi_mpi_init, don't erroneously report that it is illegal to invoke MPI_INIT* before MPI_INIT * Aggregate help messages when possible when something goes wrong during ompi_mpi_init This commit was SVN r24492.	2011-03-07 16:45:45 +00:00
Ralph Castain	63f38e38bb	Fix ompi-server: remove extra command flag in buffer being sent to mpirun, ensure that tools route messages thru a remote HNP This commit was SVN r24491.	2011-03-05 17:12:46 +00:00
Ralph Castain	d764e7a398	We want uid/gid support at the individual application level. Ensure the values get initialized and packed/unpacked for transfer. This commit was SVN r24489.	2011-03-04 18:46:43 +00:00
George Bosilca	9bbe00bdc3	Set the return code from the processes upstream. This commit was SVN r24483.	2011-03-03 00:02:21 +00:00
George Bosilca	c6a5f9706a	Thomas's patch: Assume we won't fail unless notified by a child. This commit was SVN r24482.	2011-03-02 23:50:01 +00:00
Josh Hursey	62bba1bf12	Name the enum so that it represents as an actual symbol in gdb, instead of just a number. This commit was SVN r24472.	2011-03-01 21:00:03 +00:00
Nysal Jan	4030111478	Add missing copyright and fix the year This commit was SVN r24446.	2011-02-23 15:52:06 +00:00
Nysal Jan	42a73bb887	POE is supported on both AIX and Linux. Build POE PLM only if we find the poe binary. Fix hostfile creation and POE command line arguments. This commit was SVN r24444.	2011-02-23 15:38:41 +00:00
Ralph Castain	f014284f91	Update resilient recovery mapping algorithm to be a bit more sophisticated. Track the prior node a proc was on so we avoid ricochet effect. Also avoid putting recovering proc onto node that is already occupied by a peer as this degrades fault tolerance. This commit was SVN r24417.	2011-02-20 18:46:21 +00:00
Ralph Castain	a8cf19a7bc	Ensure heartbeat only started once and only for daemon job This commit was SVN r24416.	2011-02-18 20:33:54 +00:00
Ralph Castain	ef56e6d78b	Helps to move the pointer This commit was SVN r24414.	2011-02-18 14:01:25 +00:00
Ralph Castain	7b35ada7fc	Fix ricochet effect - move failed procs to next on list instead of loadbalancing This commit was SVN r24413.	2011-02-18 13:11:55 +00:00
Ralph Castain	b98a2917ff	Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it. This commit was SVN r24412.	2011-02-18 02:48:12 +00:00
Ralph Castain	a0f6e153c7	Add missing fields to copy of app_context object This commit was SVN r24411.	2011-02-17 23:55:05 +00:00
Ralph Castain	51cf0a16c3	Some minor cleanups to support VM and CM operations This commit was SVN r24408.	2011-02-16 23:03:08 +00:00
Ralph Castain	9b48c07599	CM daemons handle their own output This commit was SVN r24407.	2011-02-16 23:02:23 +00:00
Ralph Castain	65ba6af44d	Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM. Have each mapper flag it did the map so we can see who did it later. Ensure procs are flagged as "ready to launch". This commit was SVN r24406.	2011-02-16 23:01:57 +00:00
Jeff Squyres	3f4d4886f2	Minor update for something that has been bugging me for quite a while: OMPI supports multiple different repository systems (SVN, hg, git). But the VERSION file has listed "want_svn" and "svn_r" as fields, even though the actual repo system and version may not be SVN. So search/replace those fields (and derivative values that come from those fields) with "want_repo_rev" and "repo_rev", respectively. This commit was SVN r24405.	2011-02-16 22:53:23 +00:00
Ralph Castain	a32a7d9a82	Update heartbeat system This commit was SVN r24404.	2011-02-16 18:50:51 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Ralph Castain	5120e6aec3	Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param. When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests. Also remove the stale and never completed topo mapper. This commit was SVN r24393.	2011-02-15 23:24:31 +00:00
Ralph Castain	c1da94a444	Dead children have no pid This commit was SVN r24387.	2011-02-15 13:30:51 +00:00
Ralph Castain	a3607ff35d	Make it easier to send a kill-local-procs command for an arbitrary number of procs This commit was SVN r24386.	2011-02-15 13:26:11 +00:00
Ralph Castain	bf1cff3711	Plug a couple of additional memory leaks - try to highlight a little better that strings returned from reg_string_name must be freed by caller This commit was SVN r24383.	2011-02-14 20:58:22 +00:00
Ralph Castain	a9dca25ca5	Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do. Cleanup the heartbeat handling so it is associated with the proc, not a node. Cleanup handling of recovery options so that defaults do not override user values iff they are provided. This commit was SVN r24382.	2011-02-14 20:49:12 +00:00
Ralph Castain	6ee69c670e	Add a minor utility This commit was SVN r24379.	2011-02-14 19:45:59 +00:00
Ralph Castain	b5de068533	Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address. Add some new proc/job states Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed. Add pretty-print of signals for "proc aborted due to signal" reports. This commit was SVN r24378. The following SVN revision numbers were found above: r24371 --> open-mpi/ompi@93d28a5792	2011-02-14 19:29:09 +00:00
Abhishek Kulkarni	93d28a5792	Change opal_err2str_fn_t to return the error string as an argument. This means that the converters (opal_err2str, orte_err2str) can now return NULL as a "silent error". The return value of opal_err2str_fn_t is the status of the operation (OPAL_SUCCESS or OPAL_ERROR). This fixes the "Unknown error" message issues on the trunk. This commit was SVN r24371.	2011-02-13 16:09:17 +00:00
Ralph Castain	33b68132cc	Update the rmcast framework This commit was SVN r24370.	2011-02-12 16:52:03 +00:00
Josh Hursey	a9335ea423	Make sure to initialize the 'update_state' function for the default module. This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB. This commit was SVN r24365.	2011-02-08 20:42:32 +00:00
Nysal Jan	3a8d251daa	vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog This commit was SVN r24339.	2011-02-02 10:05:57 +00:00
Josh Hursey	fa3f6485d8	Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around. This commit was SVN r24326.	2011-01-31 19:09:47 +00:00
Josh Hursey	5b58ff0663	Fix a C/R checkpoint->restart->checkpoint->restart case. The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint. This seems to be fixed now. Thanks to Alex Brick for the bug report. This commit was SVN r24325.	2011-01-28 21:25:14 +00:00
Jeff Squyres	ec3d18dc9f	As noted on the mailing list by Gabriele Fatigati (http://www.open-mpi.org/community/lists/users/2011/01/15427.php), the --tv (and friends) switches to mpirun would effectively munge the orterun command line together and then split it apart again before exec'ing the underlying debugger. We would therefore lose multi-token argv[x] values and split them into multiple tokens. For example: mpirun --tv -np 2 a.out "foo bar" would get launched with "foo" and "bar" as separate arguments; not one argument. This was due to the underlying code joining the argv into a single string and then re-splitting it. This commit removed the argv join; it now does the parsing and re-jiggering of the argv by only looking at each individual argv item; multi-word tokens like "foo bar" will never be split into separate tokens. This commit was SVN r24322.	2011-01-28 13:01:06 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Josh Hursey	81fd41f811	Return an informative error message if the user requests a migration of a job that is not capable of it. C/R Functionality cleanup This commit was SVN r24307.	2011-01-26 15:36:34 +00:00
Josh Hursey	8f45fcb429	More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. * Fix the checkpoint-restart-checkpoint case which would previously reject the checkpoint of the newly restarted process. Making sure to re-enable checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation). * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer. * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process. * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario. This commit was SVN r24306.	2011-01-26 14:56:35 +00:00
Nathan Hjelm	8a3179cdcb	removed c99 test code This commit was SVN r24297.	2011-01-25 23:02:35 +00:00
Josh Hursey	e4d13d338f	Fix a couple of compiler warnings This commit was SVN r24295.	2011-01-25 22:22:32 +00:00
George Bosilca	09f645f9a9	There is no need for the byte variable. This commit was SVN r24293.	2011-01-24 22:41:04 +00:00
Nathan Hjelm	e2126512a9	test c99 struct initialization with mtt. remove on jan 20, 2011 This commit was SVN r24271.	2011-01-19 22:21:21 +00:00
Abhishek Kulkarni	fd7ef7a1f1	Fixes broken trunk compile: call process status notify only when ft-enable-cr is selected. This commit was SVN r24255.	2011-01-14 18:37:07 +00:00
Abhishek Kulkarni	87d2c9b31d	A few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) * Improve the FTB notifier to publish (C/R, process/communication failure) events to the FTB with the OMPI jobid as the associated payload. * Add notifier calls for C/R events and process status events in SnapC and ErrMgr components. * Fix a bug where the SnapC states and process states collide before being thrown out over the notifier. This commit was SVN r24251.	2011-01-13 20:13:49 +00:00
Ralph Castain	b09f57b03d	Update the multicast subsystem - ported from Cisco branch This commit was SVN r24246.	2011-01-13 01:54:05 +00:00
Terry Dontje	f3aaa885a3	corrected a couple places in orte where it said cpu_model when it should have been cpu_type. This commit was SVN r24221.	2011-01-11 19:56:26 +00:00
Abhishek Kulkarni	11ffa854ff	Update the FTB notifier * fix indentation issues * update the name of one of the fault events published to the FTB (per the FTB MPI standard) This commit was SVN r24213.	2011-01-10 18:58:31 +00:00
Ralph Castain	80ef1af8ba	Add psm key generator program This commit was SVN r24197.	2010-12-30 20:54:58 +00:00
Nathan Hjelm	c082d05ecb	Reset the timer on MPIR_being_debugged only if MPIR_being_debugged is not set. Fix typo in return code. This commit was SVN r24187.	2010-12-20 21:00:49 +00:00
Jeff Squyres	a525e70f46	Convert "opal_show_help" to be a global variable pointer. It is statically initialized to the real back-end OPAL show_help function. During orte_show_help_init(), the variable is re-assigned with the value of the back-end ORTE show_help function (the one that does error message aggregation). Therefore, anything that calls opal_show_help() after a certain point in orte_init() will have their show_help messages be aggregated. w00t! Even code down in OPAL -- that has no knowledge of ORTE -- will have their messages aggregated. '''Double w00t!''' During orte_show_help_finalize(), we restore the original pointer value so that if something calls opal_show_help() after orte_finalize(), it'll still work properly (but it won't be aggregated). This commit was SVN r24185.	2010-12-16 23:00:25 +00:00
Jeff Squyres	de97962aac	Fixes trac:2651. Fix off-by-one error when /dev/urandom doesn't exist. Thanks to "pth" for the patch. This commit was SVN r24170. The following Trac tickets were found above: Ticket 2651 --> https://svn.open-mpi.org/trac/ompi/ticket/2651	2010-12-14 14:52:51 +00:00
Ralph Castain	b251a59cdf	Cleanup nidmap finalize This commit was SVN r24164.	2010-12-11 16:42:06 +00:00
Ralph Castain	2dc5cbb483	Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago. This commit was SVN r24153.	2010-12-05 15:58:21 +00:00
Rolf vandeVaart	b67d3398da	It is convention to have orte_config.h included at top of file. This commit was SVN r24146.	2010-12-03 16:13:31 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	aaad8ae891	Remove unused var This commit was SVN r24136.	2010-12-02 02:38:13 +00:00
Ralph Castain	f9ffff59f8	Ensure clean termination of threads and tcp multicast This commit was SVN r24134.	2010-12-02 00:23:42 +00:00
Nathan Hjelm	75605faa75	added support for reattaching a debugger using the MPIR_attach_fifo This commit was SVN r24132.	2010-12-01 20:13:58 +00:00
Ralph Castain	ad814f26cd	One more time, into the breach! Restore the use of override_oversubscribe to indicate that the data source for resources on the backend nodes used in mapping is unreliable. In this situation (e.g., data came from hostfile, or we are just using localhost because nothing was provided), we don't trust the oversubscribe condition passed by the mapper. Instead, we check locally to ensure we set sched_yield correctly. This commit was SVN r24130.	2010-12-01 15:15:26 +00:00
Ralph Castain	eba65e97f3	Extend the rmcast APIs to allow enable/disable of comm, required for clean termination by upper layer users. Point the recv thread event base to the right place so it can wakeup when required. Add a new error code for "comm disabled" when attempting to communicate after disabling comm. This commit was SVN r24129.	2010-12-01 13:41:19 +00:00
Ralph Castain	9224302c10	Remove debug This commit was SVN r24128.	2010-12-01 13:12:24 +00:00

3236 commits