openmpi

Author	SHA1	Message	Date
Ralph Castain	ef56e6d78b	Helps to move the pointer This commit was SVN r24414.	2011-02-18 14:01:25 +00:00
Ralph Castain	7b35ada7fc	Fix ricochet effect - move failed procs to next on list instead of loadbalancing This commit was SVN r24413.	2011-02-18 13:11:55 +00:00
Ralph Castain	b98a2917ff	Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it. This commit was SVN r24412.	2011-02-18 02:48:12 +00:00
Ralph Castain	a0f6e153c7	Add missing fields to copy of app_context object This commit was SVN r24411.	2011-02-17 23:55:05 +00:00
Ralph Castain	51cf0a16c3	Some minor cleanups to support VM and CM operations This commit was SVN r24408.	2011-02-16 23:03:08 +00:00
Ralph Castain	9b48c07599	CM daemons handle their own output This commit was SVN r24407.	2011-02-16 23:02:23 +00:00
Ralph Castain	65ba6af44d	Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM. Have each mapper flag it did the map so we can see who did it later. Ensure procs are flagged as "ready to launch". This commit was SVN r24406.	2011-02-16 23:01:57 +00:00
Jeff Squyres	3f4d4886f2	Minor update for something that has been bugging me for quite a while: OMPI supports multiple different repository systems (SVN, hg, git). But the VERSION file has listed "want_svn" and "svn_r" as fields, even though the actual repo system and version may not be SVN. So search/replace those fields (and derivative values that come from those fields) with "want_repo_rev" and "repo_rev", respectively. This commit was SVN r24405.	2011-02-16 22:53:23 +00:00
Ralph Castain	a32a7d9a82	Update heartbeat system This commit was SVN r24404.	2011-02-16 18:50:51 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Ralph Castain	5120e6aec3	Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param. When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests. Also remove the stale and never completed topo mapper. This commit was SVN r24393.	2011-02-15 23:24:31 +00:00
Ralph Castain	c1da94a444	Dead children have no pid This commit was SVN r24387.	2011-02-15 13:30:51 +00:00
Ralph Castain	a3607ff35d	Make it easier to send a kill-local-procs command for an arbitrary number of procs This commit was SVN r24386.	2011-02-15 13:26:11 +00:00
Ralph Castain	bf1cff3711	Plug a couple of additional memory leaks - try to highlight a little better that strings returned from reg_string_name must be freed by caller This commit was SVN r24383.	2011-02-14 20:58:22 +00:00
Ralph Castain	a9dca25ca5	Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do. Cleanup the heartbeat handling so it is associated with the proc, not a node. Cleanup handling of recovery options so that defaults do not override user values iff they are provided. This commit was SVN r24382.	2011-02-14 20:49:12 +00:00
Ralph Castain	6ee69c670e	Add a minor utility This commit was SVN r24379.	2011-02-14 19:45:59 +00:00
Ralph Castain	b5de068533	Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address. Add some new proc/job states Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed. Add pretty-print of signals for "proc aborted due to signal" reports. This commit was SVN r24378. The following SVN revision numbers were found above: r24371 --> open-mpi/ompi@93d28a5792	2011-02-14 19:29:09 +00:00
Abhishek Kulkarni	93d28a5792	Change opal_err2str_fn_t to return the error string as an argument. This means that the converters (opal_err2str, orte_err2str) can now return NULL as a "silent error". The return value of opal_err2str_fn_t is the status of the operation (OPAL_SUCCESS or OPAL_ERROR). This fixes the "Unknown error" message issues on the trunk. This commit was SVN r24371.	2011-02-13 16:09:17 +00:00
Ralph Castain	33b68132cc	Update the rmcast framework This commit was SVN r24370.	2011-02-12 16:52:03 +00:00
Josh Hursey	a9335ea423	Make sure to initialize the 'update_state' function for the default module. This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB. This commit was SVN r24365.	2011-02-08 20:42:32 +00:00
Nysal Jan	3a8d251daa	vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog This commit was SVN r24339.	2011-02-02 10:05:57 +00:00
Josh Hursey	fa3f6485d8	Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around. This commit was SVN r24326.	2011-01-31 19:09:47 +00:00
Josh Hursey	5b58ff0663	Fix a C/R checkpoint->restart->checkpoint->restart case. The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint. This seems to be fixed now. Thanks to Alex Brick for the bug report. This commit was SVN r24325.	2011-01-28 21:25:14 +00:00
Jeff Squyres	ec3d18dc9f	As noted on the mailing list by Gabriele Fatigati (http://www.open-mpi.org/community/lists/users/2011/01/15427.php), the --tv (and friends) switches to mpirun would effectively munge the orterun command line together and then split it apart again before exec'ing the underlying debugger. We would therefore lose multi-token argv[x] values and split them into multiple tokens. For example: mpirun --tv -np 2 a.out "foo bar" would get launched with "foo" and "bar" as separate arguments, not one argument. This was due to the underlying code joining the argv into a single string and then re-splitting it. This commit removed the argv join; it now does the parsing and rejiggering of the argv by only looking at each individual argv item; multi-word tokens like "foo bar" will never be split into separate tokens. This commit was SVN r24322.	2011-01-28 13:01:06 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Josh Hursey	81fd41f811	Return an informative error message if the user requests a migration of a job that is not capable of it. C/R Functionality cleanup This commit was SVN r24307.	2011-01-26 15:36:34 +00:00
Josh Hursey	8f45fcb429	More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. * Fix the checkpoint-restart-checkpoint case which would previously reject the checkpoint of the newly restarted process. Making sure to re-enable checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation). * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer. * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process. * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario. This commit was SVN r24306.	2011-01-26 14:56:35 +00:00
Nathan Hjelm	8a3179cdcb	removed c99 test code This commit was SVN r24297.	2011-01-25 23:02:35 +00:00
Josh Hursey	e4d13d338f	Fix a couple of compiler warnings This commit was SVN r24295.	2011-01-25 22:22:32 +00:00
George Bosilca	09f645f9a9	There is no need for the byte variable. This commit was SVN r24293.	2011-01-24 22:41:04 +00:00
Nathan Hjelm	e2126512a9	test c99 struct initialization with mtt. remove on jan 20, 2011 This commit was SVN r24271.	2011-01-19 22:21:21 +00:00
Abhishek Kulkarni	fd7ef7a1f1	Fixes broken trunk compile: call process status notify only when ft-enable-cr is selected. This commit was SVN r24255.	2011-01-14 18:37:07 +00:00
Abhishek Kulkarni	87d2c9b31d	Few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) * Improve the FTB notifier to publish (C/R, process/communication failure) events to the FTB with the OMPI jobid as the associated payload. * Add notifier calls for C/R events and process status events in SnapC and ErrMgr components. * Fix a bug where the SnapC states and process states collide before being thrown out over the notifier. This commit was SVN r24251.	2011-01-13 20:13:49 +00:00
Ralph Castain	b09f57b03d	Update the multicast subsystem - ported from Cisco branch This commit was SVN r24246.	2011-01-13 01:54:05 +00:00
Terry Dontje	f3aaa885a3	corrected a couple places in orte where it said cpu_model when it should have been cpu_type. This commit was SVN r24221.	2011-01-11 19:56:26 +00:00
Abhishek Kulkarni	11ffa854ff	Update the FTB notifier * fix indentation issues * update the name of one of the fault events published to the FTB (per the FTB MPI standard) This commit was SVN r24213.	2011-01-10 18:58:31 +00:00
Ralph Castain	80ef1af8ba	Add psm key generator program This commit was SVN r24197.	2010-12-30 20:54:58 +00:00
Nathan Hjelm	c082d05ecb	Reset the timer on MPIR_being_debugged only if MPIR_being_debugged is not set. Fix typo in return code. This commit was SVN r24187.	2010-12-20 21:00:49 +00:00
Jeff Squyres	a525e70f46	Convert "opal_show_help" to be a global variable pointer. It is statically initialized to the real back-end OPAL show_help function. During orte_show_help_init(), the variable is re-assigned with the value of the back-end ORTE show_help function (the one that does error message aggregation). Therefore, anything that calls opal_show_help() after a certain point in orte_init() will have its show_help messages aggregated. w00t! Even code down in OPAL -- that has no knowledge of ORTE -- will have its messages aggregated. '''Double w00t!''' During orte_show_help_finalize(), we restore the original pointer value so that if something calls opal_show_help() after orte_finalize(), it'll still work properly (but it won't be aggregated). This commit was SVN r24185.	2010-12-16 23:00:25 +00:00
Jeff Squyres	de97962aac	Fixes trac:2651. Fix off-by-one error when /dev/urandom doesn't exist. Thanks to "pth" for the patch. This commit was SVN r24170. The following Trac tickets were found above: Ticket 2651 --> https://svn.open-mpi.org/trac/ompi/ticket/2651	2010-12-14 14:52:51 +00:00
Ralph Castain	b251a59cdf	Cleanup nidmap finalize This commit was SVN r24164.	2010-12-11 16:42:06 +00:00
Ralph Castain	2dc5cbb483	Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago. This commit was SVN r24153.	2010-12-05 15:58:21 +00:00
Rolf vandeVaart	b67d3398da	It is convention to have orte_config.h included at top of file. This commit was SVN r24146.	2010-12-03 16:13:31 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	aaad8ae891	Remove unused var This commit was SVN r24136.	2010-12-02 02:38:13 +00:00
Ralph Castain	f9ffff59f8	Ensure clean termination of threads and tcp multicast This commit was SVN r24134.	2010-12-02 00:23:42 +00:00
Nathan Hjelm	75605faa75	added support for reattaching a debugger using the MPIR_attach_fifo This commit was SVN r24132.	2010-12-01 20:13:58 +00:00
Ralph Castain	ad814f26cd	One more time, into the breach! Restore the use of override_oversubscribe to indicate that the data source for resources on the backend nodes used in mapping is unreliable. In this situation (e.g., data came from hostfile, or we are just using localhost because nothing was provided), we don't trust the oversubscribe condition passed by the mapper. Instead, we check locally to ensure we set sched_yield correctly. This commit was SVN r24130.	2010-12-01 15:15:26 +00:00
Ralph Castain	eba65e97f3	Extend the rmcast APIs to allow enable/disable of comm, required for clean termination by upper layer users. Point the recv thread event base to the right place so it can wakeup when required. Add a new error code for "comm disabled" when attempting to communicate after disabling comm. This commit was SVN r24129.	2010-12-01 13:41:19 +00:00
Ralph Castain	9224302c10	Remove debug This commit was SVN r24128.	2010-12-01 13:12:24 +00:00
Ralph Castain	4f5625d699	Not totally necessary, but good form - init the oversubscribed field in the orte_nid_t object This commit was SVN r24127.	2010-12-01 12:58:37 +00:00
Ralph Castain	30c37ea536	Ensure that the oversubscribed condition of nodes is accurately reported by the mapper, and that the results are communicated and used by the backend orteds when setting sched_yield on local procs. Restores prior behavior that was somehow lost along the way. Includes a patch from Damien Guinier to fix vpid assignments when cpus-per-task is specified. This commit was SVN r24126.	2010-12-01 12:51:39 +00:00
Ralph Castain	85a974b0de	Better check for NULL before using the value This commit was SVN r24122.	2010-12-01 04:48:50 +00:00
Ralph Castain	c56185887b	Change the event base "wakeup" support to enable the passing of events to the central thread for add/del. Add a macro OPAL_UPDATE_EVBASE for this purpose as it will likely be widely used. Update the ORTE thread support to utilize this capability. Update the rmcast framework to track the change. This commit was SVN r24121.	2010-12-01 04:26:43 +00:00
Ralph Castain	963336ee5a	Remove the test for libevent internal threads This commit was SVN r24120.	2010-12-01 04:24:10 +00:00
Ralph Castain	0441e81882	Oops - ensure that multicast msgs get circulated properly with the tcp module This commit was SVN r24118.	2010-11-30 21:13:53 +00:00
Ralph Castain	d20c023348	Checkpoint the threading support for multicast - will be revised shortly, but this version currently works. This commit was SVN r24117.	2010-11-30 17:30:16 +00:00
Ralph Castain	0465605a9c	Cleanup condition check for a param so it doesn't show if not usable. This commit was SVN r24116.	2010-11-30 17:28:53 +00:00
Ralph Castain	09f02b3087	Update the ORTE thread acquire/release/wakeup macros to trigger release from event_loop so that conditions can be checked. Add macro versions of condition_wait and friends for debug use. This commit was SVN r24115.	2010-11-30 17:27:58 +00:00
Ralph Castain	d2547e84a3	MPI procs never use orte progress threads This commit was SVN r24093.	2010-11-29 03:52:46 +00:00
Ralph Castain	71669720a3	Just get the output once on sigpipe error, and include the fd This commit was SVN r24092.	2010-11-25 15:32:48 +00:00
Ralph Castain	30c635fd4d	Don't endlessly output sigpipe errors. Count the number of times we trap it, and abort if we get more than 10 of them. This commit was SVN r24091.	2010-11-25 15:25:24 +00:00
Ralph Castain	b9b2d101dc	Add an mca param to indicate if orte progress threads are to be enabled. Error out if this is given and libevent thread support was not built. This commit was SVN r24089.	2010-11-24 23:28:00 +00:00
Rolf vandeVaart	1d62542c23	Fix another Sun Studio warning. jobid and vpid need to be uint32_t. This commit was SVN r24074.	2010-11-19 18:12:46 +00:00
Rolf vandeVaart	09fdd5cc23	Include fcntl.h, not sys/fcntl.h so we get the definition of the open system call. That is what man page says to do. Fixes warning on Solaris. This commit was SVN r24073.	2010-11-19 17:40:02 +00:00
Shiqing Fan	358b4a5cba	Add an option to enable the debug postfix for executables. This commit was SVN r24070.	2010-11-19 15:54:13 +00:00
Ethan Mallove	66f2301170	Just plain "Grid Engine" instead of "Sun Grid Engine" This commit was SVN r24068.	2010-11-18 19:30:04 +00:00
Abhishek Kulkarni	78a67654d4	add notifier events for process migration This commit was SVN r24058.	2010-11-16 17:57:44 +00:00
Abhishek Kulkarni	6e6ccae082	Update the checkpoint notification events that we throw out over the FTB with a payload embedded in {} This commit was SVN r24057.	2010-11-16 17:55:57 +00:00
Ralph Castain	58e711a412	Update a test and add two new ones for testing event lib thread support This commit was SVN r24051.	2010-11-13 15:39:28 +00:00
Jeff Squyres	e4744b4ed5	Per http://www.open-mpi.org/community/lists/devel/2010/11/8671.php , change a bunch of OMPI_<foo> names to OPAL_<foo>. This commit was SVN r24046.	2010-11-12 23:22:11 +00:00
Ralph Castain	703684e071	Output the mca params for debug purposes This commit was SVN r24042.	2010-11-11 20:06:29 +00:00
Shiqing Fan	c03ea1a5f3	A cleaner way to build on Windows. It's not possible to combine two shared libraries on Windows, so we have to do it a bit differently. First generate a small event static library by just linking the object files, and link it into other libraries that need the libevent API. This commit was SVN r24039.	2010-11-11 12:02:54 +00:00
Ralph Castain	bb521c6b7e	Properly count local procs to set oversubscribed condition This commit was SVN r24037.	2010-11-10 21:59:35 +00:00
Ralph Castain	021bd77bf1	Don't free the event base if we aren't using progress threads This commit was SVN r24036.	2010-11-10 21:58:58 +00:00
Ralph Castain	9c72737414	Send the recovery flag This commit was SVN r24035.	2010-11-10 21:26:28 +00:00
Ralph Castain	57257ab9b4	Use the right event base if threads are disabled. Always update the seq num This commit was SVN r24034.	2010-11-10 21:26:04 +00:00
Ralph Castain	cbb758c4fb	Allow mcast threads to be disabled This commit was SVN r24032.	2010-11-10 20:16:41 +00:00
Ralph Castain	22e40d92a0	Cleanup thread termination This commit was SVN r24031.	2010-11-10 19:36:44 +00:00
Ralph Castain	f5e50abab2	Make class visible This commit was SVN r24022.	2010-11-09 19:07:45 +00:00
Nathan Hjelm	986265fc6e	fixed crash in orte-ps caused by calls to OBJ_RELEASE on an opal_event_t object. This commit was SVN r24020.	2010-11-09 18:41:43 +00:00
Shiqing Fan	482a621e31	Change the behavior of exporting/importing symbols on Windows to fit the new build procedure, i.e. import statically linked opal/orte libraries for other libraries/binaries. There are several use cases when creating dll libraries: 1. create DLL A, export symbols of A, import nothing (A normally is OPAL) should define _USRDLL, A_EXPORT 2. create DLL B, export symbols of B, import A.lib (B could be ORTE, OMPI or other ompi tools) should define _USRDLL, B_EXPORT 3. create DLL C, import B.dll (C could be external libs or apps) should define B_IMPORT This commit was SVN r24016.	2010-11-09 16:13:30 +00:00
Ralph Castain	01347926d1	Be a little more thorough about cleaning up during finalize This commit was SVN r24014.	2010-11-09 14:56:27 +00:00
Shiqing Fan	d3701ccba8	type casts. This commit was SVN r24013.	2010-11-09 09:17:22 +00:00
Shiqing Fan	7bac326920	Fix Windows build, add custom command to generate static libraries (opal and orte) for shared build. This commit was SVN r24012.	2010-11-09 08:32:45 +00:00
Ralph Castain	f2f41d1ca9	Be nice to those who don't enable-multicast...poor wretches. This commit was SVN r24011.	2010-11-09 05:08:55 +00:00
Ralph Castain	a47b33678b	Add orte-level thread support to avoid some of the opal_if_threads protection used solely for ompi. Use threads to help process multicast messages. This commit was SVN r24009.	2010-11-08 19:09:23 +00:00
Ralph Castain	bf665692c3	Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system. This commit was SVN r24004.	2010-11-07 23:29:52 +00:00
Abhishek Kulkarni	e0660101d3	Throw notifier events for checkpointing status (success or failure) This commit was SVN r24003.	2010-11-07 22:12:09 +00:00
Abhishek Kulkarni	8cd3759f21	use the saved value of PID, saving some calls to getpid() This commit was SVN r24002.	2010-11-07 22:09:49 +00:00
Abhishek Kulkarni	d1a4cc33dd	Update the FTB notifier wrt events decided by the CIFTS working group This commit was SVN r24001.	2010-11-07 22:06:32 +00:00
Abhishek Kulkarni	ac2768ca7c	LOG_SYSLOG is a syslog facility. take it off the syslog options This commit was SVN r24000.	2010-11-06 22:05:45 +00:00
Abhishek Kulkarni	132c8d1b00	removing some unneeded calls to ORTE_ERROR_LOG This commit was SVN r23999.	2010-11-06 22:00:18 +00:00
Brian Barrett	a94caae625	More lightweight build changes This commit was SVN r23989.	2010-11-03 15:39:10 +00:00
Shiqing Fan	505efbaa27	Update the CMake scripts, solve a few export symbols for Windows. This commit was SVN r23976.	2010-11-02 16:39:27 +00:00
Ralph Castain	875a6d61a4	Return correct status code This commit was SVN r23969.	2010-10-29 00:43:50 +00:00
Ralph Castain	9ea2b196ce	Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component. Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work. This commit was SVN r23966.	2010-10-28 15:22:46 +00:00
Ralph Castain	c13b0bb668	Update some debugger attachment code per LLNL request This commit was SVN r23965.	2010-10-28 03:06:20 +00:00
Brian Barrett	3ed00ba148	More fixes to make OMPI compile with minimal ORTE support again This commit was SVN r23962.	2010-10-27 20:40:39 +00:00
Shiqing Fan	199df1eadf	Rename a few var names. This commit was SVN r23959.	2010-10-27 11:52:57 +00:00
Nathan Hjelm	e7bfbe1d1a	added missing object initialization/destruction of mca_oob_tcp_component.tcp_listen_thread_event This commit was SVN r23958.	2010-10-26 22:09:37 +00:00
Shiqing Fan	a3d9c91ff7	Exclude stdbool.h for Windows, and use the definition in opal. Migrate the socket pair support from libevent. Fix other minor things and make it compile. This commit was SVN r23951.	2010-10-26 14:53:50 +00:00
Ralph Castain	ebb4962072	fix typo This commit was SVN r23949.	2010-10-26 14:48:20 +00:00
Ralph Castain	8d5045de16	Make what is going on more obvious This commit was SVN r23948.	2010-10-26 14:39:09 +00:00
Ralph Castain	7f103c8a9d	Supply missing argument This commit was SVN r23945.	2010-10-26 07:24:55 +00:00
Ralph Castain	894230b121	This stuff is soooo out-of-date that a complete rewrite would be required - thankfully, nobody cares This commit was SVN r23944.	2010-10-26 06:22:31 +00:00
Ralph Castain	86c7365e8e	Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem. Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution. This commit was SVN r23943.	2010-10-26 02:41:42 +00:00
Ralph Castain	fc46dfa78a	Remove stale code This commit was SVN r23942.	2010-10-26 02:37:56 +00:00
George Bosilca	17379a6097	tmp is on the stack, therefore storing its address for later usage is FORBIDDEN. Set it to NULL for now, and hope the cbdata is never used by the timer callbacks. This commit was SVN r23941.	2010-10-25 19:07:13 +00:00
George Bosilca	5882290cdd	We need a default value or the compiler will whine. This commit was SVN r23940.	2010-10-25 19:05:45 +00:00
Ralph Castain	aaec8ec426	Fix orte-ps so it correctly reports out on processes within a job This commit was SVN r23933.	2010-10-25 17:53:53 +00:00
Abhishek Kulkarni	c671ec52d1	Fix broken trunk compile after the libevent changes. This commit was SVN r23929.	2010-10-25 14:11:48 +00:00
Ralph Castain	fceabb2498	Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects. Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems. Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct. I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things: 1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new) 2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it. There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do. This commit was SVN r23925.	2010-10-24 18:35:54 +00:00
Ralph Castain	8a37b3071a	Cleanup mpirx co-spawn of debugger daemons. Resolve block-until-both-ends-are-opened behavior for fifos. Add some debugging output. Make the fifo event be non-persistent and read the value in the fifo to avoid spinning on the event. This commit was SVN r23922.	2010-10-22 20:46:07 +00:00
Ralph Castain	2c1a658232	Fix debugger attach This commit was SVN r23921.	2010-10-22 20:07:24 +00:00
Jeff Squyres	4322a78e60	Update wrapper compiler scripts to search for perl during configure, per request from BSD maintainers. This commit was SVN r23914.	2010-10-19 22:45:54 +00:00
Ralph Castain	1e93437cd4	To help with debugging, add a new mca param that instructs ORTE_ERROR_LOG to output "silent" errors. Helps to track down silent errors that don't have an associated error message (e.g., via show_help). This commit was SVN r23893.	2010-10-16 03:29:47 +00:00
Brian Barrett	9febaa475e	* Add shell of functionality required for supporting Portals4 * Update places where orte-free builds have failed This commit was SVN r23891.	2010-10-14 22:49:09 +00:00
Jeff Squyres	c891ed34e2	More verbatim escaping. This commit was SVN r23873.	2010-10-07 22:26:51 +00:00
Ralph Castain	37f566bf1e	Cancel recvs when finalizing This commit was SVN r23871.	2010-10-07 22:02:12 +00:00
Jeff Squyres	eaab8d0062	* Ensure to set paffinity_enabled in all cases * Ensure to set the mask value before we use it This commit was SVN r23861.	2010-10-07 15:48:49 +00:00
Ralph Castain	92d69effc9	Read the data from the fifo to clear the event so it doesn't immediately re-trigger This commit was SVN r23856.	2010-10-07 02:41:02 +00:00
Ralph Castain	871b685e89	Ensure all debugger interface symbols are present in orterun This commit was SVN r23823.	2010-09-30 21:36:00 +00:00
Josh Hursey	c8692198a2	Fix an issue migrating/autorecovering processes that mask SIGTERM using the C/R functionality. I did not want to make this change globally since there could be good reason to keep the check before calling SIGKILL that I am not seeing at the moment. This commit was SVN r23821.	2010-09-30 20:55:12 +00:00
Ralph Castain	dd959f5ab6	Silence an idiotic warning This commit was SVN r23819.	2010-09-30 17:54:13 +00:00
Ralph Castain	162734731d	Sigh - fix one more place where r23764 had a problem This commit was SVN r23816. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-29 20:47:52 +00:00
Jeff Squyres	73bcc4a36b	Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers This commit was SVN r23801. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-24 22:53:28 +00:00
Ralph Castain	b76278085e	Sigh. Since 0,0 is a valid process, this just doesn't work (duh) This commit was SVN r23795.	2010-09-23 15:03:05 +00:00
Ralph Castain	7aed412798	Add definitions to tell us when jobid and vpid haven't been defined This commit was SVN r23793.	2010-09-23 01:33:42 +00:00
Ralph Castain	3631e4e936	Revert remaining svn kruft from r23764 This commit was SVN r23786. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-22 01:11:40 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Ethan Mallove	8052c9dd46	Prevent SEGV when using odls_base_verbose This commit was SVN r23758.	2010-09-15 20:17:50 +00:00
Ralph Castain	e96b5f486f	Reorganize the opal interface code in opal/util/if.c per prior emails and telecon discussions. Move the interface discovery code into a framework so that configuration logic can separate it out (instead of the prior #if-#else confusion). All interface APIs for accessing the info remain unchanged in opal/util/if.c. This has been tested on Mac, Linux, and NetBSD. Nobody else seemed interested in testing it, so there may be some future problems revealed as people try it on other OSs. This commit was SVN r23743.	2010-09-13 01:58:51 +00:00
Abhishek Kulkarni	a143622b54	Remove unused code. Notifier events are aggregated on a per-event basis by the HNP notifier. This commit was SVN r23711.	2010-09-02 16:00:22 +00:00
Ralph Castain	f75437f5a3	Add the ability to receive notifier output when job completes. Set the notification level to INFO for normal job completion, and to ALERT for abnormal termination. This commit was SVN r23710.	2010-09-02 14:42:41 +00:00
Rolf vandeVaart	14e7bcc383	Create new entries in the wrapper data files so the administrator can specify compiler flags that get inserted into the command before the user's flags. These flags can be specified at configure time. Reviewed by Jeff Squyres. This fixes ticket #2474. This commit was SVN r23709.	2010-09-02 10:47:55 +00:00
Ralph Castain	b982f908e8	Fixed some newly-induced warnings This commit was SVN r23694.	2010-08-31 14:51:19 +00:00
Rainer Keller	97511912ec	- Fixup several functions, that cannot return - Add one instance where we do not use a parameter in a function - Fix a buglet in commit r23689, where the attribute-for-function ptrs was applied. This commit was SVN r23690. The following SVN revision numbers were found above: r23689 --> open-mpi/ompi@5eb571c458	2010-08-31 12:21:13 +00:00
Rainer Keller	5eb571c458	- As suggested in CMR #2558, attribute-macros should be tested on function pointers and assigned accordingly, instead of using the pre-processor in the header files. A functional change is (re-) specifying __opal_attribute_noreturn__ on orte_errmgr_base_abort(): All modules in the errmgr framework either use this function, or define their own abort function, which sets __opal_attribute_noreturn__. This attribute was taken out with the errmgr overhaul in r22872. This commit was SVN r23689. The following SVN revision numbers were found above: r22872 --> open-mpi/ompi@e4f2d03d28	2010-08-31 10:28:51 +00:00
Ralph Castain	b81358815c	Add some debug This commit was SVN r23686.	2010-08-29 13:45:10 +00:00
Ralph Castain	554aede041	Fix a situation where we were unlocking a thread that isn't locked for the main launch - it is only used for dynamic spawns. This commit was SVN r23682.	2010-08-28 14:03:17 +00:00
Rainer Keller	4abcf5a0d7	- The Sun-compiler 12 update 1 complains about noreturn-attributes assigned to function-declarations. Check this case and mark the currently only case existing in trunk. Thanks to Paul Hargrove for bringing this up. Let's test the svn commit msg CMR:v1.5 This commit was SVN r23676.	2010-08-27 09:18:30 +00:00
Rainer Keller	12ed573e5e	- Include <strings.h> for rindex(3). Thanks to Paul Hargrove. Please CMR:v1.5 This commit was SVN r23671.	2010-08-26 13:42:36 +00:00
Shiqing Fan	9911797867	Rename a few odls help files for Windows installation. This commit was SVN r23668.	2010-08-26 09:31:18 +00:00
Ralph Castain	2e223abe33	Restore the auto-poll method for detecting debugger attachment, but only in the mpirx debugger module and only if the corresponding rate mca param is set. Guess we missed it before, but add the debugger framework to the orte-info and ompi_info tools This commit was SVN r23667.	2010-08-25 22:52:33 +00:00
Ralph Castain	f72cdc4160	Update the compare_name_fields function to allow the caller to specify that wildcard values are to be treated as wildcards This commit was SVN r23663.	2010-08-25 15:35:41 +00:00
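A sketch of what "the caller specifies that wildcard values are treated as wildcards" could look like. Everything here is an illustrative assumption: the type `proc_name_t`, the flag names, and the wildcard value are hypothetical and are not the ORTE definitions.

```c
#include <stdint.h>

#define NAME_WILDCARD UINT32_MAX  /* hypothetical wildcard marker */
#define CMP_JOBID     0x1u        /* compare the jobid field */
#define CMP_VPID      0x2u        /* compare the vpid field */
#define CMP_WILD      0x4u        /* caller opts in to wildcard matching */

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} proc_name_t;

/* Compare one field. A wildcard matches anything only when CMP_WILD is set;
 * otherwise it is compared as an ordinary value. */
static int cmp_field(uint32_t a, uint32_t b, unsigned flags)
{
    if ((flags & CMP_WILD) && (a == NAME_WILDCARD || b == NAME_WILDCARD)) {
        return 0;
    }
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}

/* Compare only the fields selected in flags; returns <0, 0 or >0. */
static int compare_name_fields(unsigned flags,
                               const proc_name_t *a, const proc_name_t *b)
{
    if (flags & CMP_JOBID) {
        int rc = cmp_field(a->jobid, b->jobid, flags);
        if (rc != 0) return rc;
    }
    if (flags & CMP_VPID) {
        int rc = cmp_field(a->vpid, b->vpid, flags);
        if (rc != 0) return rc;
    }
    return 0;
}
```

Making wildcard handling opt-in keeps exact comparisons (for example, when sorting or deduplicating names) separate from pattern-style matching.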
Ralph Castain	4ecd9a0bbe	Protect against an obscure race condition that AFAICT only occurs when we are in a loop waiting to recv a message from a peer who is then killed by signal. This commit was SVN r23662.	2010-08-25 15:35:01 +00:00
Ralph Castain	f1a00c9a21	Per Jeff's inquiry, play chicken and don't assume herror exists everywhere. This commit was SVN r23656.	2010-08-24 20:46:41 +00:00
Jeff Squyres	207ca2d928	This commit is the first of several steps in a paffinity makeover extravaganza. = Short version = This commit does several things, but the short version is that it re-orients the error message creation of the ODLS default module to generate error strings in the child process for errors that occur after the fork but before the exec (such errors are ''usually'' related to paffinity). A show_help string is rendered in the child and then IPC'ed up to the parent, who displays the string through normal ORTE show_help aggregation mechanisms. We also broke up the ginormous paffinity-setting logic into a few separate functions, both to help us understand the code, and hopefully to ease future maintenance. The logic for the ODLS default binding should not have changed -- this is mainly a code reshuffle and improvement on error reporting. = Rationale = The reasoning for this commit is complex. As mentioned above, it's the first step in some paffinity cleanup. Here's the line of dominoes that must fall (in this order): 1. Add hwloc paffinity component (already done). 1. While testing hwloc, we discovered that the error reporting from the ODLS default module was abysmal. So we fixed it. 1. Further, we reorganized the code in the odls_default_module.c a bit to help our understanding of it. 1. We also discovered a few bugs in the original ODLS default module logic that existed before this code shuffle; separate tickets will be filed to fix them. 1. Next up will be some improvements to paffinity / odls default to make the act of binding to a core ensure to bind to ''all'' hardware threads contained in that core (similar for sockets: binding to a socket will bind to ''all'' hardware threads in that socket). 1. Next will be improvements to paffinity to expose binding to hardware threads through the paffinity framework API. 1. Finally, we'll expose these binding controls to the user (e.g., through mpirun command line arguments, MCA parameters, etc.).
This commit represents the first few bullets; the last 4 bullets are being worked on right now, but there is no definite timeline for completion. = Miscellaneous = A few points worth mentioning: * We have tested this new code a bunch; we're pretty sure it behaves just like the trunk -- but with better / more precise error reporting. More testing is needed on a wider array of platforms, however. * A big comment at the top of odls_default_module.c explains the (new) general scheme for the error reporting. * The error reporting in the parent process is now really dumb; almost all the intelligence about creating error messages is in the child. * The show_help file was renamed to be more consistent with other help files (help-odls-default.txt -> help-orte-odls-default.txt) * Removed the use of sched_yield() because of recent changes in the Linux 2.6.3x kernels. We already had an #else clause for select()'ing for 1us if we didn't have sched_yield() -- that is now the only code path. This is not a performance-critical section of the code, so this shouldn't be controversial. * Replaced the macro-based error reporting with function-based reporting. It's a bit more bulky, but it helped us understand the code and saved us multiple times with compile-time parameter checking, etc. * Cleaned up the use of several show_help messages to ensure that they mapped to real messages in help*.txt files. This commit was SVN r23652.	2010-08-24 19:38:29 +00:00
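The "select()'ing for 1us" fallback mentioned in r23652 can be sketched as below. This is a minimal illustration of the general POSIX idiom, not the code from odls_default_module.c; the function name `yield_via_select` is hypothetical.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper: give up the CPU briefly without sched_yield() by
 * calling select() with no descriptors and a 1 microsecond timeout.
 * Returns 0 when the timeout expires, -1 on error (e.g. EINTR). */
static int yield_via_select(void)
{
    struct timeval tv;

    tv.tv_sec = 0;
    tv.tv_usec = 1;
    /* nfds == 0 and NULL sets: select() acts purely as a short sleep. */
    return select(0, NULL, NULL, NULL, &tv);
}
```

The actual pause is usually longer than 1 microsecond because of timer granularity, which is acceptable for a code path the commit describes as not performance-critical.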
Jeff Squyres	2c03554fe7	Add new function: orte_show_help_norender(). It is exactly the same as orte_show_help(), but it takes a fully-rendered string instead of a varargs list that must be rendered. This function is useful in cases where one entity renders the "show help" string and a different entity sends the string via the normal orte "show help" mechanisms for aggregation, etc. Example usage: errors occur in the ODLS after forking but before exec'ing. In such cases, it makes sense for the child process to render the "show help" string because it has all the details about the error. But the child process can't call orte_show_help() itself because it is not an ORTE process -- it can't OOB send the message to the HNP, etc. After rendering the help string, the child sends the rendered string to its parent via normal IPC (e.g., via a pipe) and the parent can then invoke orte_show_help_norender() with the ready-to-go string. The message then displays out via the normal mechanisms (i.e., out via the HNP, aggregated/coalesced, etc.). This commit was SVN r23651.	2010-08-24 19:12:57 +00:00
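The render-in-child, report-in-parent flow described in r23651 can be sketched with plain POSIX calls. This is an illustration under assumptions, not ORTE code: `run_child_and_collect` and the message format are hypothetical, and the point where the parent would hand the string to orte_show_help_norender() is only marked by a comment because the real signature is not shown in the log.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: fork a child that simulates a failure between fork
 * and exec, renders its own error text, and pipes it to the parent.
 * The parent stores the rendered string in out (NUL-terminated).
 * Returns the number of bytes received, or -1 on setup failure. */
static ssize_t run_child_and_collect(const char *program, int err_code,
                                     char *out, size_t out_len)
{
    int fds[2];

    if (out_len == 0 || pipe(fds) != 0) {
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: it knows the error details, so it renders the message. */
        char msg[256];
        close(fds[0]);
        int n = snprintf(msg, sizeof(msg), "could not launch %s: error %d",
                         program, err_code);
        if (n < 0) {
            _exit(2);
        }
        if ((size_t)n >= sizeof(msg)) {
            n = (int)sizeof(msg) - 1;   /* message was truncated */
        }
        ssize_t w = write(fds[1], msg, (size_t)n);
        _exit(w == n ? 1 : 2);
    }

    /* Parent: read the ready-to-display string until the child closes. */
    close(fds[1]);
    size_t used = 0;
    while (used < out_len - 1) {
        ssize_t n = read(fds[0], out + used, out_len - 1 - used);
        if (n <= 0) {
            break;
        }
        used += (size_t)n;
    }
    out[used] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);

    /* A real parent would now pass `out` to its normal reporting path. */
    return (ssize_t)used;
}
```

Keeping the parent side a plain read loop matches the commit's remark that the parent no longer needs any knowledge of how the message was built.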
Ralph Castain	3b3cd67d07	If we are using static ports and cannot resolve a hostname, then see if the proc is on the local host. If so, then attempt to use a loopback interface to complete the connection. Only implemented for IPv4 because the if.c code has been so hashed I couldn't figure out how to do this cleanly for all cases. This commit was SVN r23647.	2010-08-24 14:14:59 +00:00
Ralph Castain	7608513158	Cleanup the code and add some comments to make it easier to understand. Add a bozo error check This commit was SVN r23639.	2010-08-24 04:46:59 +00:00
Ralph Castain	2886da5669	Ensure that the local daemon vpid gets defined so that the locality procedures work when using the ess generic module. This commit was SVN r23638.	2010-08-24 04:38:21 +00:00
Shiqing Fan	c110edbf44	Use exclude lists for non-ordinary sub directories check. This commit was SVN r23631.	2010-08-23 09:43:05 +00:00
Rolf vandeVaart	07604d74b8	We were incorrectly handling some of the --with-wrapper-XXX variables resulting in the same flag being installed twice for the OMPI wrappers. This fixes trac:2539. This commit was SVN r23630. The following Trac tickets were found above: Ticket 2539 --> https://svn.open-mpi.org/trac/ompi/ticket/2539	2010-08-20 18:01:43 +00:00
Josh Hursey	4ffc2d6f68	fix a couple of missed prefixes This commit was SVN r23629.	2010-08-19 13:26:33 +00:00
Josh Hursey	fabd5cc153	Simplification of the ErrMgr framework by removing the 'stack'/composite functionality. The composite functionality was becoming difficult to maintain, so we removed it for now which simplifies the framework design considerably. Since the 'crmig' and 'autor' components were -very- similar to the 'hnp' component, this commit also merges them together. By moving the 'crmig' and 'autor' to a separate file under the 'hnp' component we are able to isolate the C/R logic to a large extent, thus being only minimally hooked into the previous 'hnp' component. So other than some name changes, the functionality is all still in place. I will update the C/R documentation later this morning. This commit was SVN r23628.	2010-08-19 13:09:20 +00:00
Josh Hursey	77792c937d	When we checkpoint with the --stop option, be sure to write out all the metadata before clearing the storage handle. Here we tried to write out the session directory marker after we sync the directory, which happens early in the case of --stop. Thanks to Ananda Mudar for noticing the bug. This commit was SVN r23627.	2010-08-18 20:44:03 +00:00
Brian Barrett	13c827dda8	Make trunk compile on Red Storm again This commit was SVN r23622.	2010-08-17 21:51:38 +00:00
Ralph Castain	bbf84fd92b	Refine the protection from cross-dvm communications This commit was SVN r23615.	2010-08-16 16:33:39 +00:00
Ralph Castain	930f7adb0f	Check the return status and report any error This commit was SVN r23611.	2010-08-13 15:04:59 +00:00
Ralph Castain	4491a0e5dc	Add a channel for reporting errors, fix a bug in the tcp module This commit was SVN r23610.	2010-08-13 15:04:22 +00:00
Ralph Castain	ace1f60429	Rename an mca param to something more intuitive and set its default to 0 so the module only runs if a non-zero value is provided This commit was SVN r23609.	2010-08-13 15:03:45 +00:00
Rainer Keller	fc4cb0c0c1	- Allow changing ALPS run command - Fix misnomer This commit was SVN r23601.	2010-08-12 14:41:35 +00:00
Ralph Castain	5715a5b421	Let VM-based mappings include the updated nidmap This commit was SVN r23596.	2010-08-11 21:04:28 +00:00
Josh Hursey	e12ca48cd9	A number of C/R enhancements per RFC below: http://www.open-mpi.org/community/lists/devel/2010/07/8240.php Documentation: http://osl.iu.edu/research/ft/ Major Changes: -------------- * Added C/R-enabled Debugging support. Enabled with the --enable-crdebug flag. See the following website for more information: http://osl.iu.edu/research/ft/crdebug/ * Added Stable Storage (SStore) framework for checkpoint storage * 'central' component does a direct to central storage save * 'stage' component stages checkpoints to central storage while the application continues execution. * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) * Added Compression (compress) framework to support * Add two new ErrMgr recovery policies * {{{crmig}}} C/R Process Migration * {{{autor}}} C/R Automatic Recovery * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) * {{{OMPI_CR_Restart}}} * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) * {{{OMPI_CR_Quiesce_start}}} * {{{OMPI_CR_Quiesce_checkpoint}}} * {{{OMPI_CR_Quiesce_end}}} * {{{OMPI_CR_self_register_checkpoint_callback}}} * {{{OMPI_CR_self_register_restart_callback}}} * {{{OMPI_CR_self_register_continue_callback}}} * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
* Add a progress meter to: * FileM rsh (filem_rsh_process_meter) * SnapC full (snapc_full_progress_meter) * SStore stage (sstore_stage_progress_meter) * Added 2 new command line options to ompi-restart * --showme : Display the full command line that would have been exec'ed. * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) * Deprecated some MCA params: * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared * snapc_base_store_in_place deprecated, replaced with different components of SStore * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref * snapc_base_establish_global_snapshot_dir deprecated, never well supported * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem Minor Changes: -------------- * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}} * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. * Cleanup the CRS framework and components to work with the SStore framework. * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). * Add 'quiesce' hook to CRCP for a future enhancement.
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. * Add optional application level INC callbacks (registered through the CR MPI Ext interface). * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. * {{{opal-restart}}} also supports local decompression before restarting * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata * {{{orte-restart}}} now uses the SStore framework to work with the metadata * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. * Improve the checks for 'already checkpointing' error path. * Add a recovery output timer to show how long it takes to restart a job * Do a better job of cleaning up the old session directory on restart. * Add a local module to the autor and crmig ErrMgr components.
These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. This commit was SVN r23587. The following Trac tickets were found above: Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924 Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097 Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161 Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192 Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208 Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342 Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413	2010-08-10 20:51:11 +00:00
Terry Dontje	b74ef351b7	Added new Solaris sysinfo module. Also added code to assign orte_local_chip_type and orte_local_chip_model in MPI processes if the appropriate sysinfo module found the values on the machine. This commit was SVN r23581.	2010-08-09 19:28:56 +00:00
Eugene Loh	5e1f40ba12	Clean up English and clarify a point regarding process binding in the mpirun man page. This commit was SVN r23558.	2010-08-05 23:10:44 +00:00
Eugene Loh	1c12192c93	Fix grammatical typo in orterun man page. This commit was SVN r23555.	2010-08-04 20:37:28 +00:00
Ralph Castain	9fbd7c1949	Fix a bug in tcp multicast This commit was SVN r23547.	2010-08-04 01:37:54 +00:00
Ralph Castain	94f046e7f3	Fix an "oops" where we missed some instances of a name change. Thanks to Greg Koenig for spotting it! This commit was SVN r23545.	2010-08-03 16:36:08 +00:00
Ralph Castain	8ccd14d508	Correct the -h output to reflect that --bind-to-none is the default behavior This commit was SVN r23537.	2010-08-01 22:31:34 +00:00
Josh Hursey	01fbb729e3	Need to get the nidmap so that the apps have something to look at for wireup. Without this we get errors like the following since the nidmap is empty in the apps: [[13094,1],1] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 234 [[13094,1],1] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 281 This commit was SVN r23536.	2010-07-31 16:25:13 +00:00
Josh Hursey	ba7e94dd89	Some relatively minor C/R related cleanup * Fix a configure warning for checking --enable-ft-thread * In hnp and orted ErrMgr components check to see if other components have already recovered this process before trying to recover it again. * Fix 'npernode' for restarting using the resilient rmaps component * export ompi_info_set, so that internal functionality can use it. This commit was SVN r23535.	2010-07-30 18:59:34 +00:00
Ralph Castain	d8ec83f939	Remove an unneeded tag This commit was SVN r23533.	2010-07-29 02:13:06 +00:00
Ralph Castain	94d4887452	Cleanup some race conditions at job start This commit was SVN r23515.	2010-07-27 18:24:11 +00:00
Ralph Castain	97cfbdfc8c	Add debug This commit was SVN r23514.	2010-07-27 16:55:16 +00:00
Ralph Castain	cfbfbb75a2	Don't abort if a race condition causes a nid not to be found This commit was SVN r23513.	2010-07-27 16:20:58 +00:00
Ralph Castain	06fe2c4c20	Add debug This commit was SVN r23512.	2010-07-27 16:20:29 +00:00
Ralph Castain	e16ec40283	Check the correct envar. Thanks to Philippe for pointing out the error! This commit was SVN r23499.	2010-07-27 03:49:03 +00:00
Ralph Castain	0c201486e4	Allow for multiple channels for the same channel "name" - could be both input and output channels This commit was SVN r23498.	2010-07-27 01:38:39 +00:00
Ralph Castain	6158cab2e1	Add missing header This commit was SVN r23497.	2010-07-27 01:37:41 +00:00
Ralph Castain	1ba8bbe1a9	Don't require the existence of a multicast interface if --enable-multicast wasn't specified. This commit was SVN r23494.	2010-07-26 15:09:57 +00:00
Ralph Castain	e858b029aa	Re-establish support for IBM's POE environment by resurrecting the poe launcher from the v1.2 series, updated for the current trunk. Modify the hnp errmgr module to support batch-operated jobs. Update the new ess generic module to support a wider range of mapping options. This commit was SVN r23493.	2010-07-24 23:35:20 +00:00
Ralph Castain	ff2d573f7e	Adjust some default values, and ensure we don't start sending too soon This commit was SVN r23492.	2010-07-23 19:37:16 +00:00
Ralph Castain	140e427a79	Ensure that wildcard recvs go to the end of the matching list so that recvs for specific tags take precedence. Ensure we don't try to send tcp mcast messages to procs that haven't reported back yet This commit was SVN r23491.	2010-07-23 19:31:34 +00:00
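The ordering rule from r23491 (wildcard receives at the end of the matching list, so tag-specific receives win) can be sketched with a singly linked list. The types and names below are hypothetical, and inserting specific receives ahead of the first wildcard is an assumption made here to keep the invariant; the commit only states that wildcards go to the end.

```c
#include <stddef.h>

#define TAG_WILDCARD (-1)   /* hypothetical wildcard tag value */

typedef struct posted_recv {
    int tag;                    /* tag to match, or TAG_WILDCARD */
    int id;                     /* identifies the receive in this sketch */
    struct posted_recv *next;
} posted_recv_t;

/* Post a receive. Wildcards are appended at the tail; tag-specific
 * receives are placed before the first wildcard already on the list. */
static void post_recv(posted_recv_t **head, posted_recv_t *r)
{
    posted_recv_t **pp = head;

    if (r->tag == TAG_WILDCARD) {
        while (*pp != NULL) {
            pp = &(*pp)->next;
        }
    } else {
        while (*pp != NULL && (*pp)->tag != TAG_WILDCARD) {
            pp = &(*pp)->next;
        }
    }
    r->next = *pp;
    *pp = r;
}

/* Return the first posted receive that accepts the incoming tag. */
static posted_recv_t *match_recv(posted_recv_t *head, int tag)
{
    for (posted_recv_t *r = head; r != NULL; r = r->next) {
        if (r->tag == tag || r->tag == TAG_WILDCARD) {
            return r;
        }
    }
    return NULL;
}
```

With this ordering a first-match scan is enough: a wildcard receive only catches messages that no tag-specific receive claimed, regardless of the order in which the receives were posted.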
Ralph Castain	95a908cb27	Provide run-time control over heartbeat This commit was SVN r23490.	2010-07-23 17:11:12 +00:00
Ralph Castain	dfdb1eca0c	Remove windows dist file requirement This commit was SVN r23480.	2010-07-23 01:23:15 +00:00
Ralph Castain	67027cfce4	Add a new "generic" ess module that supports direct launched MPI jobs so long as certain minimum envars are provided, regardless of launch environment. This commit was SVN r23478.	2010-07-22 21:51:43 +00:00
Ralph Castain	e7719f0aa4	Update platform files, adjust sensor heartbeat module selection rules This commit was SVN r23477.	2010-07-22 21:50:46 +00:00
Jeff Squyres	5027915ead	Hostname is not used in this function. This commit was SVN r23454.	2010-07-21 11:07:28 +00:00
Jeff Squyres	64cb8f5d7f	Another round of man page cleanups from Debian maintainer Manuel Prinz. Many thanks! This commit was SVN r23445.	2010-07-20 14:07:18 +00:00
Ralph Castain	f3e13b9766	Improve the efficiency by making the check for uniqueness of incoming hnp contact info much faster by including the hnp_uri in the job_family tracker object. Replace the global buffer storage with a quick routine to build the buffer from the jobfams array This commit was SVN r23443.	2010-07-20 08:30:47 +00:00
Ralph Castain	f85e69b64b	Continue enabling connect_accept across large numbers of independent jobs by replacing the hash tables with pointer_arrays to store routes to remote hnps to gain flexibility we'll need in the future. This commit was SVN r23439.	2010-07-20 04:47:31 +00:00
Ralph Castain	248320b91a	Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started. Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them. Add a test provided by Philippe that tests this ability. This commit was SVN r23438.	2010-07-20 04:22:45 +00:00
Ralph Castain	d9f7947a42	Arg - the other half of the prior commit: have the orted properly parse the resulting data to hand it to orte_odls. Also, reindent per emacs (sigh). This commit was SVN r23437.	2010-07-20 04:06:46 +00:00
Ralph Castain	72525a5850	Allow the passing of NULL to orte_plm.kill_local_procs to match what we allow for the equivalent orte_odls call This commit was SVN r23436.	2010-07-20 04:05:41 +00:00
Ralph Castain	ad5eaee4c6	Protect against NULL and provide additional resource check/error report This commit was SVN r23432.	2010-07-19 18:33:32 +00:00
Ralph Castain	0355da6335	Cleanup pointer array addressing and protect against NULL This commit was SVN r23431.	2010-07-19 18:30:04 +00:00
Ralph Castain	3f7f8df40f	Fix singleton comm_spawn by having the host orted create a map object for the singleton job. Even though this isn't filled in, the pidmap function will ignore any job that doesn't have a map object. Thus, this ensures that the singleton's location is included in the pidmap, thereby allowing messages to be routed to/from it. This commit was SVN r23430.	2010-07-18 02:48:17 +00:00
Ralph Castain	12cd07c9a9	Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases. The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang. This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is: * pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place * modified the errmgr to directly call the new routines when termination is detected * removed the grpcomm.onesided_barrier and its associated RML tag * add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree * use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial. This commit was SVN r23429.	2010-07-17 21:03:27 +00:00
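The termination scheme outlined in r23429 (each daemon tracks its dependent routes and exits once none remain) can be sketched as a small counter over a child list. The data structure and function bodies are illustrative assumptions; only the names `num_routes` and `route_lost` come from the commit message, and the real routed framework API is not shown in the log.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 8          /* arbitrary limit for this sketch */

typedef struct {
    int children[MAX_CHILDREN]; /* ids of daemons routed through us */
    size_t num_children;
} route_table_t;

/* Report how many dependent routes are still alive. */
static size_t num_routes(const route_table_t *t)
{
    return t->num_children;
}

/* Called when the connection to a peer goes away. If the peer was one of
 * our children, drop it. Returns true when no dependent routes remain,
 * meaning this daemon may terminate. */
static bool route_lost(route_table_t *t, int child)
{
    for (size_t i = 0; i < t->num_children; i++) {
        if (t->children[i] == child) {
            /* Order does not matter: fill the hole with the last entry. */
            t->children[i] = t->children[t->num_children - 1];
            t->num_children--;
            break;
        }
    }
    return num_routes(t) == 0;
}
```

Because the decision depends only on locally observed connection closures, no barrier message is needed, which is the property the commit relies on to avoid the hang.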
Ralph Castain	23b166cbd4	Initialize pointers to NULL This commit was SVN r23422.	2010-07-15 16:33:57 +00:00
Ralph Castain	a99e8cf132	Re-issue the read event so that multiple debugger attachments can occur This commit was SVN r23421.	2010-07-15 15:35:26 +00:00
Ralph Castain	4af771a681	Update the job object pack-unpack fns This commit was SVN r23392.	2010-07-13 21:23:14 +00:00
Jeff Squyres	c98729694a	Fix a compile error in the xgrid plm. This does nothing to make it work; it just now compiles on os x. This commit was SVN r23387.	2010-07-13 11:56:20 +00:00
Ralph Castain	570d19106b	Allow singletons to use ompi-server for rendezvous via pubsub as well as comm_spawn without starting their own local daemons This commit was SVN r23384.	2010-07-13 06:33:07 +00:00
Ralph Castain	eee7541ae7	Don't overwrite a prior setting for create_session_dirs This commit was SVN r23383.	2010-07-13 06:30:09 +00:00
Ralph Castain	8bb0c16c2f	Add new tag This commit was SVN r23382.	2010-07-13 06:29:13 +00:00
Ralph Castain	0b4081b162	The --output-filename option currently mixes the output from all processes with the same rank, but different jobids. Thus, the output from comm_spawn'd processes gets intermingled with their parents and each other. Adding the jobid to the output filename solves the problem. Thanks to Jody for pointing this out! This commit was SVN r23380.	2010-07-12 21:43:49 +00:00
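The fix in r23380 amounts to making the jobid part of the per-process output filename. A minimal sketch follows; the `base.jobid.rank` layout and the function name are assumptions for illustration, since the log does not state the exact naming scheme.

```c
#include <stdio.h>

/* Hypothetical helper: build "<base>.<jobid>.<rank>" so that processes with
 * the same rank in different jobs write to different files.
 * Returns 0 on success, -1 if the name did not fit in buf. */
static int build_output_filename(char *buf, size_t len, const char *base,
                                 unsigned jobid, unsigned rank)
{
    int n = snprintf(buf, len, "%s.%u.%u", base, jobid, rank);

    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

With only the rank in the name, rank 0 of a parent job and rank 0 of a comm_spawn'd child job would map to the same file; adding the jobid separates them.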
Ralph Castain	4c5ea3d1ef	Allow the ALPS allocator to run if the BASIL_RESERVATION_ID has been set. Update show_help messages and comments to reflect this new scenario. Thanks to Jerome Soumagne for the patch! This commit was SVN r23374.	2010-07-12 15:42:25 +00:00
Ralph Castain	2babebf9c3	Revert some of a prior commit - there is no need for another flag to indicate one-sided termination of the orteds. This commit was SVN r23372.	2010-07-10 03:33:01 +00:00
Ralph Castain	da61b69b15	Ensure we don't incorrectly return a non-zero exit code when normally terminating a slurm job. Slurm, of course, must always be different... This commit was SVN r23371.	2010-07-09 19:14:10 +00:00
Ralph Castain	5d2233c950	Add some new tags This commit was SVN r23369.	2010-07-09 17:51:28 +00:00
Ralph Castain	2193ace464	Ensure the debugger symbols are loaded into orterun prior to orte_init so they are found by attaching debuggers. This commit was SVN r23368.	2010-07-09 17:51:00 +00:00
Rolf vandeVaart	cc9b768fdb	Fix typo and missing header - needed by Solaris This commit was SVN r23367.	2010-07-09 14:33:13 +00:00
Shiqing Fan	c51c262e67	Relevant Windows fixes for r23360. This commit was SVN r23363. The following SVN revision numbers were found above: r23360 --> open-mpi/ompi@31295e8dc2	2010-07-07 16:58:16 +00:00
Ralph Castain	62a8b73f1a	Correctly handle output sent to the group input channel This commit was SVN r23362.	2010-07-07 14:17:48 +00:00
Ralph Castain	31295e8dc2	As discussed on today's telecon, reorganize the debugger attachment code in orte to better support efforts within the tool community aimed at exploring alternative methods. Move the debugger attachment code from the orterun directory to a new debugger framework. Organize the existing standard support code into an "mpir" component. Organize the current extensions for co-spawning debugger daemons into a separate "mpirx" component. Since the MPIR symbols are now included in the ORTE library, remove duplicate declarations in OMPI and replace them with extern references to their ORTE instantiations. This commit was SVN r23360.	2010-07-06 23:35:42 +00:00
Ralph Castain	98e7bc94f5	This belongs more properly in the ORCM repo now that we have MCA capability over there This commit was SVN r23346.	2010-07-05 18:25:03 +00:00
Ralph Castain	dac1b9976e	Make the default group channels bidirectional This commit was SVN r23345.	2010-07-05 14:45:57 +00:00
Shiqing Fan	f4a5f7c7d6	Update ORTE source files. This commit was SVN r23344.	2010-07-05 08:45:41 +00:00
Jeff Squyres	5bebdb97fa	Need these header files for NetBSD. Thanks for the heads-up from Aleksej Saushev. This commit was SVN r23343.	2010-07-02 17:38:57 +00:00
Jeff Squyres	8fef296b8a	Updates about thread support levels. This commit was SVN r23341.	2010-07-02 13:14:09 +00:00
Ralph Castain	ee4564c13b	Add some useful debug This commit was SVN r23339.	2010-07-02 03:35:47 +00:00
Ralph Castain	f3d90dfb8d	Fully restore fault recovery, both at the individual process and daemon level. NOTE: MPI fault recovery remains unavailable pending merge from Josh. This only covers ORTE-level processes. This commit was SVN r23335.	2010-07-01 19:45:43 +00:00
Ralph Castain	7190415977	Fix JEFF's mistake - we cannot use orte_show_help if execv fails because we already closed all the file descriptors! This commit was SVN r23334.	2010-07-01 19:41:26 +00:00
Ralph Castain	510ade9503	Do not use nodes that are flagged as down or do-not-use for this map. Modify error output to reflect possible reasons no nodes would be available This commit was SVN r23333.	2010-07-01 19:39:31 +00:00
Ralph Castain	81a65f2c67	Define a new node state This commit was SVN r23332.	2010-07-01 19:38:23 +00:00
Ralph Castain	f5548b8e0f	remove a potential locking conflict, and let emacs go ahead and reformat the function (sigh) This commit was SVN r23331.	2010-07-01 19:37:53 +00:00
Ralph Castain	d463aec2f6	Don't try to send to dead daemons, keep accounting straight so we don't hang This commit was SVN r23330.	2010-07-01 19:37:02 +00:00
Ralph Castain	dd85689560	Cleanup pointer array addressing This commit was SVN r23329.	2010-07-01 19:33:10 +00:00
Ralph Castain	26fbae447e	Don't try to forward input when we already ordered shutdown. Check return codes on sends This commit was SVN r23328.	2010-07-01 19:32:08 +00:00
Ralph Castain	628936a99f	Provide a convenience option to disable fault recovery (as opposed to setting three separate, long-named mca params) This commit was SVN r23327.	2010-07-01 19:31:11 +00:00
Ralph Castain	3237b9ec87	Print a nice error message when a daemon fails, and exit with a non-zero status This commit was SVN r23314.	2010-06-28 16:38:54 +00:00
Ralph Castain	a1ea6bc130	Ignore debugger daemon termination status - we don't care how they died. This commit was SVN r23306.	2010-06-26 03:08:50 +00:00
Ralph Castain	099c3aad97	Fix a major foopah that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0. This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all. Thanks to Dong Ahn (LLNL) for catching this problem! Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm. This commit was SVN r23300.	2010-06-24 05:13:53 +00:00
Ralph Castain	e9f4c84d7e	Add another name field to the job object This commit was SVN r23299.	2010-06-24 01:57:27 +00:00
Shiqing Fan	681df0089b	Add a few new files into the tarball. This commit was SVN r23297.	2010-06-22 16:45:56 +00:00
Ralph Castain	8b2a682fba	Return a silent error when -do-not-launch is given This commit was SVN r23291.	2010-06-22 01:06:10 +00:00
Shiqing Fan	2e5e9f0a03	Fix a wrong windows path in hpn_contack, which causes problems when looking up in the session directories. Add two more ess module for Windows. This commit was SVN r23286.	2010-06-21 09:47:33 +00:00
Ralph Castain	ae746a390f	Debugger daemons spawned upon attachment to a running job need to be treated just like a regular job - they are not "piggybacking" onto an existing launch, and so the orte daemons need to report them just like a regular job launch in order to release from spawn. Modify the debugger control flag to include "do not monitor" so orterun will not take debugger daemon termination into account when deciding that all jobs are done. This commit was SVN r23282.	2010-06-19 15:22:36 +00:00
Ethan Mallove	fc37e408c2	Avoid SEGV in case rsh/ssh is not in `PATH` (refs trac:1490) This commit was SVN r23278. The following Trac tickets were found above: Ticket 1490 --> https://svn.open-mpi.org/trac/ompi/ticket/1490	2010-06-17 14:58:09 +00:00
Ralph Castain	1e90b91b84	Unset envars we set during initialization so we leave environ intact after orte_finalize. Thanks to Damien Gunter for pointing it out. This commit was SVN r23277.	2010-06-17 13:42:21 +00:00
Ralph Castain	628ffd1d6e	Make the mcast channel assignments unsigned ints so they can be used as array indices. Assign input/output channels for apps. Cleanup some bugs in open_channel This commit was SVN r23275.	2010-06-16 19:40:59 +00:00
Ralph Castain	6cbe947810	Modify the multicast scheme so that applications have separate input and output channels to avoid cross-talk. Update the multicast test to conform. This commit was SVN r23271.	2010-06-15 03:50:31 +00:00
Ralph Castain	bb602694e6	Add a new example program, update cisco platform file This commit was SVN r23262.	2010-06-09 18:21:06 +00:00
Ralph Castain	da43547983	Don't define the active_jobid until -after- the job has been setup. Cleanup references to pointer_array objects This commit was SVN r23250.	2010-06-09 02:16:05 +00:00
Jeff Squyres	f1a7b5cc33	Make "processor affinity not supported" error message a little better: * Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic OPAL_ERR_NOT_SUPPORTED case. * When odls_default detects that processor affinity is not supported, it prints a specific message about it, and then it suppressed a generic HNP help message that would normally follow it (i.e., it's easier to have the "processor affinity is not supported" show_help message last). * Use some symbolic names in odls_default instead of fixed int's, just for slight readability improvements in the code. * Introduce orte_show_help_suppress(), which gives the ability to suppress any future showings of any arbitrary show_help() message. This is useful if you display message X and want to suppress message Y. This suppression only works in environments where orte_show_help() does coalescing. This commit was SVN r23249.	2010-06-08 20:16:07 +00:00
Ralph Castain	e52a54183f	Let max restarts be associated with an app_context instead of a job so that individual apps can have different values. Default to a single job-level value This commit was SVN r23248.	2010-06-07 14:21:08 +00:00
Ralph Castain	799a77a187	Some updates to the routed-cm module so it properly supports the tcp rmcast module This commit was SVN r23247.	2010-06-07 14:19:32 +00:00


3236 commits