openmpi

Author	SHA1	Message	Date
Ralph Castain	888472f671	Do not release recv as the calling function needs that data and will release it later This commit was SVN r24557.	2011-03-22 18:44:56 +00:00
Ralph Castain	30981de200	Minor cleanups courtesy of Nysal - thanks! This commit was SVN r24552.	2011-03-22 13:48:58 +00:00
Ralph Castain	c1396b278c	Resolve the rsh confusion by splitting the initial search for a launch agent from the actual setup of the launch agent values in the plm base globals. Have each aspiring rsh-clone call lookup to see if their desired launch agent is available - if not, then reject that plm component. If so, then setup the actual launch agent values only when the module init function is called. This resolves the current conflict between the rsh and rshd components. Hopefully, it may avoid future problems in this area -provided- any new uses of rsh-like launchers abide by the lookup-and-then-setup rule. This commit was SVN r24550.	2011-03-22 02:23:09 +00:00
Ralph Castain	d17b50e1ff	Add the appropriate hooks to tell Totalview to display the user's main program upon startup. Apparently, this hook got lost somewhere after the 1.2 series :-( Thanks to David Turner and the TV folks for passing this along. This commit was SVN r24549.	2011-03-21 17:40:58 +00:00
Ralph Castain	795ca2cff2	Complete implementation of the multicast-based grpcomm module This commit was SVN r24548.	2011-03-20 01:18:06 +00:00
Eugene Loh	2770a12beb	Continue clean up of thread options started in r22841, 22842, and 22849. No need for any CMRs to 1.5... that was already done in CMR 2728. This commit was SVN r24545. The following SVN revision numbers were found above: r22841 --> open-mpi/ompi@b400b84162	2011-03-18 21:36:35 +00:00
Ralph Castain	ee68cd102c	Fix the hier grpcomm module so modex results in correct data. The prior implementation stored the modex data as node-based attributes. This worked fine for BTL's such as openib where the interfaces were associated with the node. However, BTL's such as TCP have interfaces associated with a specific process, not a node. Thus, store the data in the modex database so it is correctly indexed. This commit was SVN r24536.	2011-03-17 02:22:23 +00:00
Ralph Castain	d5dfe05521	Remove stale code associated with OPAL_THREADS_HAVE_DIFFERENT_PIDS. In the past, we have supported the case of really, really old Linux kernels where threads have different pids. However, when we updated the event library, we didn't also update that support code. In addition, when we dropped progress thread support, we didn't remove areas of the code that could no longer be compiled (i.e., were protected by "if progress thread && if have different pids"). There was no compelling reason to support such old kernels. Accordingly, convert the test to print a nice error message indicating we no longer support old kernels (but indicate that earlier OMPI versions do) and error out. Remove all code that was protected by "if have different pids" since it can no longer be compiled. This commit was SVN r24531.	2011-03-15 21:05:03 +00:00
Ralph Castain	de092af8ef	Add a little more debug This commit was SVN r24526.	2011-03-14 18:43:49 +00:00
Ralph Castain	dc6f616599	Enable VM launch. For some time, ORTE has had the ability to launch daemons on all nodes prior to launching an application. It has largely been used outside of the OMPI community, and so was never explicitly turned "on" inside OMPI releases. Nevertheless, the code has been there. Allowing VM launches does not require ANY changes to existing PLM components. All that was required was to have orterun launch the daemons as a separate call to orte_plm.spawn -prior- to launching the applications. The rest of the VM support code resides in the rmaps framework: (a) a check when asked to map a job to see if it is the daemon job, and (b) a separate "setup_virtual_machine" mapper in the rmaps base that creates the required map so the PLM's will do the right thing. In order to support those users who have no RM allocation but like to give the allocation in the form of a -host or -hostfile argument to their application, there is a little more code in orterun and the setup_virtual_machine mapper to capture information passed in that manner. This has been tested with rsh and slurm environments, and, since there is nothing environment-specific in the implementation, should work in others as well - but needs to be proven. This commit was SVN r24524.	2011-03-12 22:50:53 +00:00
Ralph Castain	80265b472e	Avoid direct reference of pointer_array elements This commit was SVN r24523.	2011-03-12 20:18:51 +00:00
Ralph Castain	df82e4cd36	Plug a memory leak This commit was SVN r24521.	2011-03-12 15:37:33 +00:00
Ralph Castain	1297acde13	George raised some valid concerns about the extensibility of the revised rmaps framework. Address those by: 1. removing the enum of mapper values 2. change the req_mapper and last_mapper fields to char* so they can hold the component name instead of a mapper flag 3. revise the selection logic in the mapper components to reflect the change. Components now look for their name in the req_mapper field, or to see if other criteria (e.g., npernode) are set that mandate their doing the mapping Several MCA params resided in the rmaps base for historical reasons - they have been in the base since at least the original 1.2 release (and perhaps earlier). However, George correctly pointed out that they really should reside in their respective components. Accordingly, move them to the components, but register synonyms to the old names to avoid breaking backward compatibility. These revisions retain the current functionality of allowing comm_spawn'd jobs to use different mappers than the original job, and for the errmgr to utilize the resilient mapper to recover processes regardless of how they were originally mapped. Given the large number of possible combinations, I am sure that someone will find a corner-case combination of values and selection criteria that cause either no mapper to be selected, or one other than the intended to be used. No one can test all the ways people will use this system, so I expect debugging to continue for awhile. The ability of comm_spawn'd jobs to exploit this functionality relies on changes to the orte_dpm component - this will be committed separately. This commit was SVN r24520.	2011-03-12 05:30:09 +00:00
Samuel Gutierrez	830c7c66dc	fixes CID #1667 This commit was SVN r24518.	2011-03-12 03:09:01 +00:00
Ralph Castain	e6a76cc923	Fixes CID #1954 This commit was SVN r24516.	2011-03-11 23:00:27 +00:00
Samuel Gutierrez	2a2319d23a	when orte_timing is enabled, always record daemon launch start time before starting the real work. This commit was SVN r24513.	2011-03-11 00:09:23 +00:00
George Bosilca	7f34a28c8f	Correct a comment. This commit was SVN r24504.	2011-03-10 00:41:41 +00:00
George Bosilca	d2502b14f9	Destruct the OOB TCP internal objects. This commit was SVN r24503.	2011-03-10 00:40:54 +00:00
Ralph Castain	3b4421d8e3	Separately track requested and last-used mapper so we don't lose that info This commit was SVN r24502.	2011-03-09 18:51:36 +00:00
Jeff Squyres	06d5c59115	Fix a few valgrind-reported memory leaks This commit was SVN r24498.	2011-03-08 17:37:28 +00:00
Jeff Squyres	0586612bd5	Fix another minor memory leak This commit was SVN r24495.	2011-03-08 15:46:13 +00:00
Ralph Castain	63f38e38bb	Fix ompi-server: remove extra command flag in buffer being sent to mpirun, ensure that tools route messages thru a remote HNP This commit was SVN r24491.	2011-03-05 17:12:46 +00:00
George Bosilca	9bbe00bdc3	Set the return code from the processes upstream. This commit was SVN r24483.	2011-03-03 00:02:21 +00:00
George Bosilca	c6a5f9706a	Thomas's patch: Assume we won't fail unless notified by a child. This commit was SVN r24482.	2011-03-02 23:50:01 +00:00
Josh Hursey	62bba1bf12	Name the enum so that it represents as an actual symbol in gdb, instead of just a number. This commit was SVN r24472.	2011-03-01 21:00:03 +00:00
Nysal Jan	4030111478	Add missing copyright and fix the year This commit was SVN r24446.	2011-02-23 15:52:06 +00:00
Nysal Jan	42a73bb887	POE is supported on both AIX and Linux. Build POE PLM only if we find the poe binary. Fix hostfile creation and POE command line arguments. This commit was SVN r24444.	2011-02-23 15:38:41 +00:00
Ralph Castain	f014284f91	Update resilient recovery mapping algorithm to be a bit more sophisticated. Track the prior node a proc was on so we avoid ricochet effect. Also avoid putting recovering proc onto node that is already occupied by a peer as this degrades fault tolerance. This commit was SVN r24417.	2011-02-20 18:46:21 +00:00
Ralph Castain	a8cf19a7bc	Ensure heartbeat only started once and only for daemon job This commit was SVN r24416.	2011-02-18 20:33:54 +00:00
Ralph Castain	ef56e6d78b	Helps to move the pointer This commit was SVN r24414.	2011-02-18 14:01:25 +00:00
Ralph Castain	7b35ada7fc	Fix ricochet effect - move failed procs to next on list instead of loadbalancing This commit was SVN r24413.	2011-02-18 13:11:55 +00:00
Ralph Castain	b98a2917ff	Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it. This commit was SVN r24412.	2011-02-18 02:48:12 +00:00
Ralph Castain	51cf0a16c3	Some minor cleanups to support VM and CM operations This commit was SVN r24408.	2011-02-16 23:03:08 +00:00
Ralph Castain	9b48c07599	CM daemons handle their own output This commit was SVN r24407.	2011-02-16 23:02:23 +00:00
Ralph Castain	65ba6af44d	Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM. Have each mapper flag it did the map so we can see who did it later. Ensure procs are flagged as "ready to launch". This commit was SVN r24406.	2011-02-16 23:01:57 +00:00
Jeff Squyres	3f4d4886f2	Minor update for something that has been bugging me for quite a while: OMPI supports multiple different repository systems (SVN, hg, git). But the VERSION file has listed "want_svn" and "svn_r" as fields, even though the actual repo system and version may not be SVN. So search/replace those fields (and derivative values that come from those fields) with "want_repo_rev" and "repo_rev", respectively. This commit was SVN r24405.	2011-02-16 22:53:23 +00:00
Ralph Castain	a32a7d9a82	Update heartbeat system This commit was SVN r24404.	2011-02-16 18:50:51 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Ralph Castain	5120e6aec3	Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param. When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests. Also remove the stale and never completed topo mapper. This commit was SVN r24393.	2011-02-15 23:24:31 +00:00
Ralph Castain	c1da94a444	Dead children have no pid This commit was SVN r24387.	2011-02-15 13:30:51 +00:00
Ralph Castain	a3607ff35d	Make it easier to send a kill-local-procs command for an arbitrary number of procs This commit was SVN r24386.	2011-02-15 13:26:11 +00:00
Ralph Castain	bf1cff3711	Plug a couple of additional memory leaks - try to highlight a little better that strings returned from reg_string_name must be freed by caller This commit was SVN r24383.	2011-02-14 20:58:22 +00:00
Ralph Castain	a9dca25ca5	Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do. Cleanup the heartbeat handling so it is associated with the proc, not a node. Cleanup handling of recovery options so that defaults do not override user values iff they are provided. This commit was SVN r24382.	2011-02-14 20:49:12 +00:00
Ralph Castain	b5de068533	Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address. Add some new proc/job states Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed. Add pretty-print of signals for "proc aborted due to signal" reports. This commit was SVN r24378. The following SVN revision numbers were found above: r24371 --> open-mpi/ompi@93d28a5792	2011-02-14 19:29:09 +00:00
Abhishek Kulkarni	93d28a5792	Change opal_err2str_fn_t to return the error string as an argument. This means that the converters (opal_err2str, orte_err2str) can now return NULL as a "silent error". The return value of opal_err2str_fn_t is the status of the operation (OPAL_SUCCESS or OPAL_ERROR). This fixes the "Unknown error" message issues on the trunk. This commit was SVN r24371.	2011-02-13 16:09:17 +00:00
Ralph Castain	33b68132cc	Update the rmcast framework This commit was SVN r24370.	2011-02-12 16:52:03 +00:00
Josh Hursey	a9335ea423	Make sure to initialize the 'update_state' function for the default module. This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB. This commit was SVN r24365.	2011-02-08 20:42:32 +00:00
Nysal Jan	3a8d251daa	vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog This commit was SVN r24339.	2011-02-02 10:05:57 +00:00
Josh Hursey	fa3f6485d8	Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around. This commit was SVN r24326.	2011-01-31 19:09:47 +00:00
Josh Hursey	5b58ff0663	Fix a C/R checkpoint->restart->checkpoint->restart case. The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint. This seems to be fixed now. Thanks to Alex Brick for the bug report. This commit was SVN r24325.	2011-01-28 21:25:14 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Josh Hursey	81fd41f811	Return an informative error message if the user requests a migration of a job that is not capable of it. C/R Functionality cleanup This commit was SVN r24307.	2011-01-26 15:36:34 +00:00
Josh Hursey	8f45fcb429	More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. * Fix the checkpoint-restart-checkpoint case which would previously reject the checkpoint of the newly restarted process. Re-enabling checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation). * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer. * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process. * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario. This commit was SVN r24306.	2011-01-26 14:56:35 +00:00
Nathan Hjelm	8a3179cdcb	removed c99 test code This commit was SVN r24297.	2011-01-25 23:02:35 +00:00
Nathan Hjelm	e2126512a9	test c99 struct initialization with mtt. remove on jan 20, 2011 This commit was SVN r24271.	2011-01-19 22:21:21 +00:00
Abhishek Kulkarni	fd7ef7a1f1	Fixes broken trunk compile: call process status notify only when ft-enable-cr is selected. This commit was SVN r24255.	2011-01-14 18:37:07 +00:00
Abhishek Kulkarni	87d2c9b31d	Few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) * Improve the FTB notifier to publish (C/R, process/communication failure) events to the FTB with the OMPI jobid as the associated payload. * Add notifier calls for C/R events and process status events in SnapC and ErrMgr components. * Fix a bug where the SnapC states and process states collide before being thrown out over the notifier. This commit was SVN r24251.	2011-01-13 20:13:49 +00:00
Ralph Castain	b09f57b03d	Update the multicast subsystem - ported from Cisco branch This commit was SVN r24246.	2011-01-13 01:54:05 +00:00
Terry Dontje	f3aaa885a3	corrected a couple places in orte where it said cpu_model when it should have been cpu_type. This commit was SVN r24221.	2011-01-11 19:56:26 +00:00
Abhishek Kulkarni	11ffa854ff	Update the FTB notifier * fix indentation issues * update the name of one of the fault events published to the FTB (per the FTB MPI standard) This commit was SVN r24213.	2011-01-10 18:58:31 +00:00
Nathan Hjelm	c082d05ecb	Reset the timer on MPIR_being_debugged only if MPIR_being_debugged is not set. Fix typo in return code. This commit was SVN r24187.	2010-12-20 21:00:49 +00:00
Ralph Castain	2dc5cbb483	Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago. This commit was SVN r24153.	2010-12-05 15:58:21 +00:00
Rolf vandeVaart	b67d3398da	It is convention to have orte_config.h included at top of file. This commit was SVN r24146.	2010-12-03 16:13:31 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	aaad8ae891	Remove unused var This commit was SVN r24136.	2010-12-02 02:38:13 +00:00
Ralph Castain	f9ffff59f8	Ensure clean termination of threads and tcp multicast This commit was SVN r24134.	2010-12-02 00:23:42 +00:00
Nathan Hjelm	75605faa75	added support for reattaching a debugger using the MPIR_attach_fifo This commit was SVN r24132.	2010-12-01 20:13:58 +00:00
Ralph Castain	ad814f26cd	One more time, into the breach! Restore the use of override_oversubscribe to indicate that the data source for resources on the backend nodes used in mapping is unreliable. In this situation (e.g., data came from hostfile, or we are just using localhost because nothing was provided), we don't trust the oversubscribe condition passed by the mapper. Instead, we check locally to ensure we set sched_yield correctly. This commit was SVN r24130.	2010-12-01 15:15:26 +00:00
Ralph Castain	eba65e97f3	Extend the rmcast APIs to allow enable/disable of comm, required for clean termination by upper layer users. Point the recv thread event base to the right place so it can wakeup when required. Add a new error code for "comm disabled" when attempting to communicate after disabling comm. This commit was SVN r24129.	2010-12-01 13:41:19 +00:00
Ralph Castain	9224302c10	Remove debug This commit was SVN r24128.	2010-12-01 13:12:24 +00:00
Ralph Castain	30c37ea536	Ensure that the oversubscribed condition of nodes is accurately reported by the mapper, and that the results are communicated and used by the backend orteds when setting sched_yield on local procs. Restores prior behavior that was somehow lost along the way. Includes a patch from Damien Guinier to fix vpid assignments when cpus-per-task is specified. This commit was SVN r24126.	2010-12-01 12:51:39 +00:00
Ralph Castain	85a974b0de	Better check for NULL before using the value This commit was SVN r24122.	2010-12-01 04:48:50 +00:00
Ralph Castain	c56185887b	Change the event base "wakeup" support to enable the passing of events to the central thread for add/del. Add a macro OPAL_UPDATE_EVBASE for this purpose as it will likely be widely used. Update the ORTE thread support to utilize this capability. Update the rmcast framework to track the change. This commit was SVN r24121.	2010-12-01 04:26:43 +00:00
Ralph Castain	0441e81882	Oops - ensure that multicast msgs get circulated properly with the tcp module This commit was SVN r24118.	2010-11-30 21:13:53 +00:00
Ralph Castain	d20c023348	Checkpoint the threading support for multicast - will be revised shortly, but this version currently works. This commit was SVN r24117.	2010-11-30 17:30:16 +00:00
Ralph Castain	71669720a3	Just get the output once on sigpipe error, and include the fd This commit was SVN r24092.	2010-11-25 15:32:48 +00:00
Ralph Castain	30c635fd4d	Don't endlessly output sigpipe errors. Count the number of times we trap it, and abort if we get more than 10 of them. This commit was SVN r24091.	2010-11-25 15:25:24 +00:00
Rolf vandeVaart	09fdd5cc23	Include fcntl.h, not sys/fcntl.h so we get the definition of the open system call. That is what man page says to do. Fixes warning on Solaris. This commit was SVN r24073.	2010-11-19 17:40:02 +00:00
Ethan Mallove	66f2301170	Just plain "Grid Engine" instead of "Sun Grid Engine" This commit was SVN r24068.	2010-11-18 19:30:04 +00:00
Abhishek Kulkarni	78a67654d4	add notifier events for process migration This commit was SVN r24058.	2010-11-16 17:57:44 +00:00
Abhishek Kulkarni	6e6ccae082	Update the checkpoint notification events that we throw out over the FTB with a payload embedded in {} This commit was SVN r24057.	2010-11-16 17:55:57 +00:00
Jeff Squyres	e4744b4ed5	Per http://www.open-mpi.org/community/lists/devel/2010/11/8671.php , change a bunch of OMPI_<foo> names to OPAL_<foo>. This commit was SVN r24046.	2010-11-12 23:22:11 +00:00
Ralph Castain	bb521c6b7e	Properly count local procs to set oversubscribed condition This commit was SVN r24037.	2010-11-10 21:59:35 +00:00
Ralph Castain	021bd77bf1	Don't free the event base if we aren't using progress threads This commit was SVN r24036.	2010-11-10 21:58:58 +00:00
Ralph Castain	57257ab9b4	Use the right event base if threads are disabled. Always update the seq num This commit was SVN r24034.	2010-11-10 21:26:04 +00:00
Ralph Castain	cbb758c4fb	Allow mcast threads to be disabled This commit was SVN r24032.	2010-11-10 20:16:41 +00:00
Ralph Castain	22e40d92a0	Cleanup thread termination This commit was SVN r24031.	2010-11-10 19:36:44 +00:00
Ralph Castain	01347926d1	Be a little more thorough about cleaning up during finalize This commit was SVN r24014.	2010-11-09 14:56:27 +00:00
Shiqing Fan	d3701ccba8	type casts. This commit was SVN r24013.	2010-11-09 09:17:22 +00:00
Ralph Castain	f2f41d1ca9	Be nice to those who don't enable-multicast...poor wretches. This commit was SVN r24011.	2010-11-09 05:08:55 +00:00
Ralph Castain	a47b33678b	Add orte-level thread support to avoid some of the opal_if_threads protection used solely for ompi. Use threads to help process multicast messages. This commit was SVN r24009.	2010-11-08 19:09:23 +00:00
Ralph Castain	bf665692c3	Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system. This commit was SVN r24004.	2010-11-07 23:29:52 +00:00
Abhishek Kulkarni	e0660101d3	Throw notifier events for checkpointing status (success or failure) This commit was SVN r24003.	2010-11-07 22:12:09 +00:00
Abhishek Kulkarni	8cd3759f21	use the saved value of PID, saving some calls to getpid() This commit was SVN r24002.	2010-11-07 22:09:49 +00:00
Abhishek Kulkarni	d1a4cc33dd	Update the FTB notifier wrt events decided by the CIFTS working group This commit was SVN r24001.	2010-11-07 22:06:32 +00:00
Abhishek Kulkarni	ac2768ca7c	LOG_SYSLOG is a syslog facility. take it off the syslog options This commit was SVN r24000.	2010-11-06 22:05:45 +00:00
Ralph Castain	875a6d61a4	Return correct status code This commit was SVN r23969.	2010-10-29 00:43:50 +00:00
Ralph Castain	9ea2b196ce	Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component. Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work. This commit was SVN r23966.	2010-10-28 15:22:46 +00:00
Ralph Castain	c13b0bb668	Update some debugger attachment code per LLNL request This commit was SVN r23965.	2010-10-28 03:06:20 +00:00
Brian Barrett	3ed00ba148	More fixes to make OMPI compile with minimal ORTE support again This commit was SVN r23962.	2010-10-27 20:40:39 +00:00
Shiqing Fan	199df1eadf	Rename a few var names. This commit was SVN r23959.	2010-10-27 11:52:57 +00:00
Nathan Hjelm	e7bfbe1d1a	added missing object initialization/destruction of mca_oob_tcp_component.tcp_listen_thread_event This commit was SVN r23958.	2010-10-26 22:09:37 +00:00
Shiqing Fan	a3d9c91ff7	Exclude stdbool.h for Windows, and use the definition in opal. Import the socket pair support from libevent. Fix other minor things and make it compile. This commit was SVN r23951.	2010-10-26 14:53:50 +00:00
Ralph Castain	894230b121	This stuff is soooo out-of-date that a complete rewrite would be required - thankfully, nobody cares This commit was SVN r23944.	2010-10-26 06:22:31 +00:00
Ralph Castain	86c7365e8e	Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem. Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution. This commit was SVN r23943.	2010-10-26 02:41:42 +00:00
Ralph Castain	fc46dfa78a	Remove stale code This commit was SVN r23942.	2010-10-26 02:37:56 +00:00
George Bosilca	5882290cdd	We need a default value or the compiler will whine. This commit was SVN r23940.	2010-10-25 19:05:45 +00:00
Abhishek Kulkarni	c671ec52d1	Fix broken trunk compile after the libevent changes. This commit was SVN r23929.	2010-10-25 14:11:48 +00:00
Ralph Castain	fceabb2498	Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects. Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems. Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct. I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things: 1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new) 2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it. There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do. This commit was SVN r23925.	2010-10-24 18:35:54 +00:00
Ralph Castain	8a37b3071a	Cleanup mpirx co-spawn of debugger daemons. Resolve block-until-both-ends-are-opened behavior for fifos. Add some debugging output. Make the fifo event be non-persistent and read the value in the fifo to avoid spinning on the event. This commit was SVN r23922.	2010-10-22 20:46:07 +00:00
Ralph Castain	2c1a658232	Fix debugger attach This commit was SVN r23921.	2010-10-22 20:07:24 +00:00
Ralph Castain	1e93437cd4	To help with debugging, add a new mca param that instructs ORTE_ERROR_LOG to output "silent" errors. Helps to track down silent errors that don't have an associated error message (e.g., via show_help). This commit was SVN r23893.	2010-10-16 03:29:47 +00:00
Brian Barrett	9febaa475e	* Add shell of functionality required for supporting Portals4 * Update places where orte-free builds have failed This commit was SVN r23891.	2010-10-14 22:49:09 +00:00
Ralph Castain	37f566bf1e	Cancel recvs when finalizing This commit was SVN r23871.	2010-10-07 22:02:12 +00:00
Jeff Squyres	eaab8d0062	* Ensure to set paffinity_enabled in all cases * Ensure to set the mask value before we use it This commit was SVN r23861.	2010-10-07 15:48:49 +00:00
Ralph Castain	92d69effc9	Read the data from the fifo to clear the event so it doesn't immediately re-trigger This commit was SVN r23856.	2010-10-07 02:41:02 +00:00
Ralph Castain	871b685e89	Ensure all debugger interface symbols are present in orterun This commit was SVN r23823.	2010-09-30 21:36:00 +00:00
Josh Hursey	c8692198a2	Fix an issue migrating/autorecovering processes that mask SIGTERM using the C/R functionality. I did not want to make this change globally since there could be good reason to keep the check before calling SIGKILL that I am not seeing at the moment. This commit was SVN r23821.	2010-09-30 20:55:12 +00:00
Ralph Castain	dd959f5ab6	Silence an idiotic warning This commit was SVN r23819.	2010-09-30 17:54:13 +00:00
Jeff Squyres	73bcc4a36b	Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers This commit was SVN r23801. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-24 22:53:28 +00:00
Ralph Castain	3631e4e936	Revert remaining svn kruft from r23764 This commit was SVN r23786. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-22 01:11:40 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Abhishek Kulkarni	a143622b54	Remove unused code. Notifier events are aggregated on a per-event basis by the HNP notifier. This commit was SVN r23711.	2010-09-02 16:00:22 +00:00
Ralph Castain	f75437f5a3	Add the ability to receive notifier output when job completes. Set the notification level to INFO for normal job completion, and to ALERT for abnormal termination. This commit was SVN r23710.	2010-09-02 14:42:41 +00:00
Ralph Castain	b982f908e8	Fixed some newly-induced warnings This commit was SVN r23694.	2010-08-31 14:51:19 +00:00
Rainer Keller	97511912ec	- Fixup several functions, that cannot return - Add one instance where we do not use a parameter in a function - Fix a buglet in commit r23689, where the attribute-for-function ptrs was applied. This commit was SVN r23690. The following SVN revision numbers were found above: r23689 --> open-mpi/ompi@5eb571c458	2010-08-31 12:21:13 +00:00
Rainer Keller	5eb571c458	- As suggested in CMR #2558, attribute-macros should be tested on function pointers and assigned accordingly, instead of using the pre-processor in the header files. A functional change is (re-) specifying __opal_attribute_noreturn__ on orte_errmgr_base_abort(): All modules in the errmgr framework either use this function, or define their own abort function, which sets __opal_attribute_noreturn__. This attribute was taken out with the errmgr overhaul in r22872. This commit was SVN r23689. The following SVN revision numbers were found above: r22872 --> open-mpi/ompi@e4f2d03d28	2010-08-31 10:28:51 +00:00
Ralph Castain	b81358815c	Add some debug This commit was SVN r23686.	2010-08-29 13:45:10 +00:00
Ralph Castain	554aede041	Fix a situation where we were unlocking a thread that isn't locked for the main launch - it is only used for dynamic spawns. This commit was SVN r23682.	2010-08-28 14:03:17 +00:00
Rainer Keller	4abcf5a0d7	- The Sun-compiler 12 update 1 complains about noreturn-attributes assigned to function-declarations. Check this case and mark the currently only case existing in trunk. Thanks to Paul Hargrove for bringing this up. Let's test the svn commit msg CMR:v1.5 This commit was SVN r23676.	2010-08-27 09:18:30 +00:00
Rainer Keller	12ed573e5e	- Include <strings.h> for rindex(3). Thanks to Paul Hargrove. Please CMR:v1.5 This commit was SVN r23671.	2010-08-26 13:42:36 +00:00
Shiqing Fan	9911797867	Rename a few odls help files for Windows installation. This commit was SVN r23668.	2010-08-26 09:31:18 +00:00
Ralph Castain	2e223abe33	Restore the auto-poll method for detecting debugger attachment, but only in the mpirx debugger module and only if the corresponding rate mca param is set. Guess we missed it before, but add the debugger framework to the orte-info and ompi_info tools This commit was SVN r23667.	2010-08-25 22:52:33 +00:00
Ralph Castain	4ecd9a0bbe	Protect against an obscure race condition that AFAICT only occurs when we are in a loop waiting to recv a message from a peer who is then killed by signal. This commit was SVN r23662.	2010-08-25 15:35:01 +00:00
Ralph Castain	f1a00c9a21	Per Jeff's inquiry, play chicken and don't assume herror exists everywhere. This commit was SVN r23656.	2010-08-24 20:46:41 +00:00
Jeff Squyres	207ca2d928	This commit is the first of several steps in a paffinity makeover extravaganza. = Short version = This commit does several things, but the short version is that it re-orients the error message creation of the ODLS default module to generate error strings in the child process for errors that occur after the fork but before the exec (such errors are ''usually'' related to paffinity). A show_help string is rendered in the child and then IPC'ed up to the parent, who displays the string through normal ORTE show_help aggregation mechanisms. We also broke up the ginormous paffinity-setting logic into a few separate functions, both to help us understand the code, and hopefully to ease future maintenance. The logic for the ODLS default binding should not have changed -- this is mainly a code reshuffle and improvement on error reporting. = Rationale = The reasoning for this commit is complex. As mentioned above, it's the first step in some paffinity cleanup. Here's the line of dominoes that must fall (in this order): 1. Add hwloc paffinity component (already done). 1. While testing hwloc, we discovered that the error reporting from the ODLS default module was abysmal. So we fixed it. 1. Further, we reorganized the code in the odls_default_module.c a bit to help our understanding of it. 1. We also discovered a few bugs in the original ODLS default module logic that existed before this code shuffle; separate tickets will be filed to fix them. 1. Next up will be some improvements to paffinity / odls default to make the act of binding to a core ensure to bind to ''all'' hardware threads contained in that core (similar for sockets: binding to a socket will bind to ''all'' hardware threads in that socket). 1. Next will be improvements to paffinity to expose binding to hardware threads through the paffinity framework API. 1. Finally, we'll expose these binding controls to the user (e.g., through mpirun command line arguments, MCA parameters, etc.). 
This commit represents the first few bullets; the last 4 bullets are being worked on right now, but there is no definite timeline for completion. = Miscellaneous = A few points worth mentioning: * We have tested this new code a bunch; we're pretty sure it behaves just like the trunk -- but with better / more precise error reporting. More testing is needed on a wider array of platforms, however. * A big comment at the top of odls_default_module.c explains the (new) general scheme for the error reporting. * The error reporting in the parent process is now really dumb; almost all the intelligence about creating error messages is in the child. * The show_help file was renamed to be more consistent with other help files (help-odls-default.txt -> help-orte-odls-default.txt) * Removed the use of sched_yield() because of recent changes in the Linux 2.6.3x kernels. We already had an #else clause for select()'ing for 1us if we didn't have sched_yield() -- that is now the only code path. This is not a performance-critical section of the code, so this shouldn't be controversial. * Replaced the macro-based error reporting with function-based reporting. It's a bit more bulky, but it helped us understand the code and saved us multiple times with compile-time parameter checking, etc. * Cleaned up the use of several show_help messages to ensure that they mapped to real messages in help*.txt files. This commit was SVN r23652.	2010-08-24 19:38:29 +00:00
Ralph Castain	3b3cd67d07	If we are using static ports and cannot resolve a hostname, then see if the proc is on the local host. If so, then attempt to use a loopback interface to complete the connection. Only implemented for IPv4 because the if.c code has been so hashed I couldn't figure out how to do this cleanly for all cases. This commit was SVN r23647.	2010-08-24 14:14:59 +00:00
Ralph Castain	7608513158	Cleanup the code and add some comments to make it easier to understand. Add a bozo error check This commit was SVN r23639.	2010-08-24 04:46:59 +00:00
Ralph Castain	2886da5669	Ensure that the local daemon vpid gets defined so that the locality procedures work when using the ess generic module. This commit was SVN r23638.	2010-08-24 04:38:21 +00:00
Josh Hursey	4ffc2d6f68	fix a couple of missed prefixes This commit was SVN r23629.	2010-08-19 13:26:33 +00:00
Josh Hursey	fabd5cc153	Simplification of the ErrMgr framework by removing the 'stack'/composite functionality. The composite functionality was becoming difficult to maintain, so we removed it for now which simplifies the framework design considerably. Since the 'crmig' and 'autor' components were -very- similar to the 'hnp' component, this commit also merges them together. By moving the 'crmig' and 'autor' to a separate file under the 'hnp' component we are able to isolate the C/R logic to a large extent, thus being only minimally hooked into the previous 'hnp' component. So other than some name changes, the functionality is all still in place. I will update the C/R documentation later this morning. This commit was SVN r23628.	2010-08-19 13:09:20 +00:00
Josh Hursey	77792c937d	When we checkpoint with the --stop option, be sure to write out all the metadata before clearing the storage handle. Here we tried to write out the session directory marker after we sync the directory, which happens early in the case of --stop. Thanks to Ananda Mudar for noticing the bug. This commit was SVN r23627.	2010-08-18 20:44:03 +00:00
Brian Barrett	13c827dda8	Make trunk compile on Red Storm again This commit was SVN r23622.	2010-08-17 21:51:38 +00:00
Ralph Castain	bbf84fd92b	Refine the protection from cross-dvm communications This commit was SVN r23615.	2010-08-16 16:33:39 +00:00
Ralph Castain	930f7adb0f	Check the return status and report any error This commit was SVN r23611.	2010-08-13 15:04:59 +00:00
Ralph Castain	4491a0e5dc	Add a channel for reporting errors, fix a bug in the tcp module This commit was SVN r23610.	2010-08-13 15:04:22 +00:00
Ralph Castain	ace1f60429	Rename an mca param to something more intuitive and set its default to 0 so the module only runs if a non-zero value is provided This commit was SVN r23609.	2010-08-13 15:03:45 +00:00
Rainer Keller	fc4cb0c0c1	- Allow changing ALPS run command - Fix misnomer This commit was SVN r23601.	2010-08-12 14:41:35 +00:00
Ralph Castain	5715a5b421	Let VM-based mappings include the updated nidmap This commit was SVN r23596.	2010-08-11 21:04:28 +00:00
Josh Hursey	e12ca48cd9	A number of C/R enhancements per RFC below: http://www.open-mpi.org/community/lists/devel/2010/07/8240.php Documentation: http://osl.iu.edu/research/ft/ Major Changes: -------------- * Added C/R-enabled Debugging support. Enabled with the --enable-crdebug flag. See the following website for more information: http://osl.iu.edu/research/ft/crdebug/ * Added Stable Storage (SStore) framework for checkpoint storage * 'central' component does a direct to central storage save * 'stage' component stages checkpoints to central storage while the application continues execution. * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) * Added Compression (compress) framework to support * Add two new ErrMgr recovery policies * {{{crmig}}} C/R Process Migration * {{{autor}}} C/R Automatic Recovery * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) * {{{OMPI_CR_Restart}}} * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) * {{{OMPI_CR_Quiesce_start}}} * {{{OMPI_CR_Quiesce_checkpoint}}} * {{{OMPI_CR_Quiesce_end}}} * {{{OMPI_CR_self_register_checkpoint_callback}}} * {{{OMPI_CR_self_register_restart_callback}}} * {{{OMPI_CR_self_register_continue_callback}}} * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
* Add a progress meter to: * FileM rsh (filem_rsh_process_meter) * SnapC full (snapc_full_progress_meter) * SStore stage (sstore_stage_progress_meter) * Added 2 new command line options to ompi-restart * --showme : Display the full command line that would have been exec'ed. * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) * Deprecated some MCA params: * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared * snapc_base_store_in_place deprecated, replaced with different components of SStore * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref * snapc_base_establish_global_snapshot_dir deprecated, never well supported * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem Minor Changes: -------------- * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}} * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. * Cleanup the CRS framework and components to work with the SStore framework. * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). * Add 'quiesce' hook to CRCP for a future enhancement. 
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. * Add optional application level INC callbacks (registered through the CR MPI Ext interface). * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. * {{{opal-restart}}} also supports local decompression before restarting * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata * {{{orte-restart}}} now uses the SStore framework to work with the metadata * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary\|files] options. * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. * Improve the checks for 'already checkpointing' error path. * Add a recovery output timer, to show how long it takes to restart a job * Do a better job of cleaning up the old session directory on restart. * Add a local module to the autor and crmig ErrMgr components. 
These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. This commit was SVN r23587. The following Trac tickets were found above: Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924 Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097 Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161 Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192 Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208 Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342 Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413	2010-08-10 20:51:11 +00:00


2433 Commits