openmpi

Author	SHA1	Message	Date
Nysal Jan	4030111478	Add missing copyright and fix the year This commit was SVN r24446.	2011-02-23 15:52:06 +00:00
Nysal Jan	42a73bb887	POE is supported on both AIX and Linux. Build POE PLM only if we find the poe binary. Fix hostfile creation and POE command line arguments. This commit was SVN r24444.	2011-02-23 15:38:41 +00:00
Ralph Castain	f014284f91	Update resilient recovery mapping algorithm to be a bit more sophisticated. Track the prior node a proc was on so we avoid ricochet effect. Also avoid putting recovering proc onto node that is already occupied by a peer as this degrades fault tolerance. This commit was SVN r24417.	2011-02-20 18:46:21 +00:00
Ralph Castain	a8cf19a7bc	Ensure heartbeat only started once and only for daemon job This commit was SVN r24416.	2011-02-18 20:33:54 +00:00
Ralph Castain	ef56e6d78b	Helps to move the pointer This commit was SVN r24414.	2011-02-18 14:01:25 +00:00
Ralph Castain	7b35ada7fc	Fix ricochet effect - move failed procs to next on list instead of loadbalancing This commit was SVN r24413.	2011-02-18 13:11:55 +00:00
Ralph Castain	b98a2917ff	Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it. This commit was SVN r24412.	2011-02-18 02:48:12 +00:00
Ralph Castain	a0f6e153c7	Add missing fields to copy of app_context object This commit was SVN r24411.	2011-02-17 23:55:05 +00:00
Ralph Castain	51cf0a16c3	Some minor cleanups to support VM and CM operations This commit was SVN r24408.	2011-02-16 23:03:08 +00:00
Ralph Castain	9b48c07599	CM daemons handle their own output This commit was SVN r24407.	2011-02-16 23:02:23 +00:00
Ralph Castain	65ba6af44d	Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM. Have each mapper flag it did the map so we can see who did it later. Ensure procs are flagged as "ready to launch". This commit was SVN r24406.	2011-02-16 23:01:57 +00:00
Jeff Squyres	3f4d4886f2	Minor update for something that has been bugging me for quite a while: OMPI supports multiple different repository systems (SVN, hg, git). But the VERSION file has listed "want_svn" and "svn_r" as fields, even though the actual repo system and version may not be SVN. So search/replace those fields (and derivative values that come from those fields) with "want_repo_rev" and "repo_rev", respectively. This commit was SVN r24405.	2011-02-16 22:53:23 +00:00
Ralph Castain	a32a7d9a82	Update heartbeat system This commit was SVN r24404.	2011-02-16 18:50:51 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Ralph Castain	5120e6aec3	Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param. When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests. Also remove the stale and never completed topo mapper. This commit was SVN r24393.	2011-02-15 23:24:31 +00:00
Ralph Castain	c1da94a444	Dead children have no pid This commit was SVN r24387.	2011-02-15 13:30:51 +00:00
Ralph Castain	a3607ff35d	Make it easier to send a kill-local-procs command for an arbitrary number of procs This commit was SVN r24386.	2011-02-15 13:26:11 +00:00
Ralph Castain	bf1cff3711	Plug a couple of additional memory leaks - try to highlight a little better that strings returned from reg_string_name must be freed by caller This commit was SVN r24383.	2011-02-14 20:58:22 +00:00
Ralph Castain	a9dca25ca5	Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do. Cleanup the heartbeat handling so it is associated with the proc, not a node. Cleanup handling of recovery options so that defaults do not override user values iff they are provided. This commit was SVN r24382.	2011-02-14 20:49:12 +00:00
Ralph Castain	6ee69c670e	Add a minor utility This commit was SVN r24379.	2011-02-14 19:45:59 +00:00
Ralph Castain	b5de068533	Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address. Add some new proc/job states. Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed. Add pretty-print of signals for "proc aborted due to signal" reports. This commit was SVN r24378. The following SVN revision numbers were found above: r24371 --> open-mpi/ompi@93d28a5792	2011-02-14 19:29:09 +00:00
Abhishek Kulkarni	93d28a5792	Change opal_err2str_fn_t to return the error string as an argument. This means that the converters (opal_err2str, orte_err2str) can now return NULL as a "silent error". The return value of opal_err2str_fn_t is the status of the operation (OPAL_SUCCESS or OPAL_ERROR). This fixes the "Unknown error" message issues on the trunk. This commit was SVN r24371.	2011-02-13 16:09:17 +00:00
Ralph Castain	33b68132cc	Update the rmcast framework This commit was SVN r24370.	2011-02-12 16:52:03 +00:00
Josh Hursey	a9335ea423	Make sure to initialize the 'update_state' function for the default module. This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB. This commit was SVN r24365.	2011-02-08 20:42:32 +00:00
Nysal Jan	3a8d251daa	vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog This commit was SVN r24339.	2011-02-02 10:05:57 +00:00
Josh Hursey	fa3f6485d8	Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around. This commit was SVN r24326.	2011-01-31 19:09:47 +00:00
Josh Hursey	5b58ff0663	Fix a C/R checkpoint->restart->checkpoint->restart case. The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint. This seems to be fixed now. Thanks to Alex Brick for the bug report. This commit was SVN r24325.	2011-01-28 21:25:14 +00:00
Jeff Squyres	ec3d18dc9f	As noted on the mailing list by Gabriele Fatigati (http://www.open-mpi.org/community/lists/users/2011/01/15427.php), the --tv (and friends) switches to mpirun would effectively munge the orterun command line together and then split it apart again before exec'ing the underlying debugger. We would therefore lose multi-token argv[x] values and split them into multiple tokens. For example: mpirun --tv -np 2 a.out "foo bar" would get launched with "foo" and "bar" as separate arguments; not one argument. This was due to the underlying code joining the argv into a single string and then re-splitting it. This commit removed the argv join; it now does the parsing and re-jiggering of the argv by only looking at each individual argv item; multi-word tokens like "foo bar" will never be split into separate tokens. This commit was SVN r24322.	2011-01-28 13:01:06 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Josh Hursey	81fd41f811	Return an informative error message if the user requests a migration of a job that is not capable of it. C/R Functionality cleanup This commit was SVN r24307.	2011-01-26 15:36:34 +00:00
Josh Hursey	8f45fcb429	More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. * Fix the checkpoint-restart-checkpoint case which would previously reject the checkpoint of the newly restarted process. Making sure to re-enable checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation). * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer. * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process. * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario. This commit was SVN r24306.	2011-01-26 14:56:35 +00:00
Nathan Hjelm	8a3179cdcb	removed c99 test code This commit was SVN r24297.	2011-01-25 23:02:35 +00:00
Josh Hursey	e4d13d338f	Fix a couple of compiler warnings This commit was SVN r24295.	2011-01-25 22:22:32 +00:00
George Bosilca	09f645f9a9	There is no need for the byte variable. This commit was SVN r24293.	2011-01-24 22:41:04 +00:00
Nathan Hjelm	e2126512a9	test c99 struct initialization with mtt. remove on jan 20, 2011 This commit was SVN r24271.	2011-01-19 22:21:21 +00:00
Abhishek Kulkarni	fd7ef7a1f1	Fixes broken trunk compile: call process status notify only when ft-enable-cr is selected. This commit was SVN r24255.	2011-01-14 18:37:07 +00:00
Abhishek Kulkarni	87d2c9b31d	Few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) * Improve the FTB notifier to publish (C/R, process/communication failure) events to the FTB with the OMPI jobid as the associated payload. * Add notifier calls for C/R events and process status events in SnapC and ErrMgr components. * Fix a bug where the SnapC states and process states collide before being thrown out over the notifier. This commit was SVN r24251.	2011-01-13 20:13:49 +00:00
Ralph Castain	b09f57b03d	Update the multicast subsystem - ported from Cisco branch This commit was SVN r24246.	2011-01-13 01:54:05 +00:00
Terry Dontje	f3aaa885a3	corrected a couple places in orte where it said cpu_model when it should have been cpu_type. This commit was SVN r24221.	2011-01-11 19:56:26 +00:00
Abhishek Kulkarni	11ffa854ff	Update the FTB notifier * fix indentation issues * update the name of one of the fault events published to the FTB (per the FTB MPI standard) This commit was SVN r24213.	2011-01-10 18:58:31 +00:00
Ralph Castain	80ef1af8ba	Add psm key generator program This commit was SVN r24197.	2010-12-30 20:54:58 +00:00
Nathan Hjelm	c082d05ecb	Reset the timer on MPIR_being_debugged only if MPIR_being_debugged is not set. Fix typo in return code. This commit was SVN r24187.	2010-12-20 21:00:49 +00:00
Jeff Squyres	a525e70f46	Convert "opal_show_help" to be a global variable pointer. It is statically initialized to the real back-end OPAL show_help function. During orte_show_help_init(), the variable is re-assigned with the value of the back-end ORTE show_help function (the one that does error message aggregation). Therefore, anything that calls opal_show_help() after a certain point in orte_init() will have their show_help messages be aggregated. w00t! Even code down in OPAL -- that has no knowledge of ORTE -- will have their messages aggregated. '''Double w00t!''' During orte_show_help_finalize(), we restore the original pointer value so that if something calls opal_show_help() after orte_finalize(), it'll still work properly (but it won't be aggregated). This commit was SVN r24185.	2010-12-16 23:00:25 +00:00
Jeff Squyres	de97962aac	Fixes trac:2651. Fix off-by-one error when /dev/urandom doesn't exist. Thanks to "pth" for the patch. This commit was SVN r24170. The following Trac tickets were found above: Ticket 2651 --> https://svn.open-mpi.org/trac/ompi/ticket/2651	2010-12-14 14:52:51 +00:00
Ralph Castain	b251a59cdf	Cleanup nidmap finalize This commit was SVN r24164.	2010-12-11 16:42:06 +00:00
Ralph Castain	2dc5cbb483	Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago. This commit was SVN r24153.	2010-12-05 15:58:21 +00:00
Rolf vandeVaart	b67d3398da	It is convention to have orte_config.h included at top of file. This commit was SVN r24146.	2010-12-03 16:13:31 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	aaad8ae891	Remove unused var This commit was SVN r24136.	2010-12-02 02:38:13 +00:00
Ralph Castain	f9ffff59f8	Ensure clean termination of threads and tcp multicast This commit was SVN r24134.	2010-12-02 00:23:42 +00:00


3040 commits