openmpi

Author	SHA1	Message	Date
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be an SOS-encoded error. OPAL_SOS_GET_ERR_CODE() takes an SOS error and returns the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Ralph Castain	7e6985edbf	Cleanup warnings This commit was SVN r23150.	2010-05-16 20:23:26 +00:00
Ralph Castain	88f5217a12	Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired). Add new mca params to test: orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch orte_debugger_test_attach: Test debugger colaunch after debugger attachment To test co-launch at job start, just set the orte_debugger_test_daemon param. To test co-launch upon attach: set orte_debugger_test_daemon set orte_debugger_test_attach=1 set orte_enable_debug_cospawn_while_running=1 set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon. This commit was SVN r23144.	2010-05-14 18:44:49 +00:00
Ralph Castain	d6a1d7a082	Little more cleanup on paffinity. Provide a specific error code for affinity not supported so we can better report the problem. Move the error reporting to orterun so we only get one error message. Update the darwin paffinity module to return the correct new error codes. This commit was SVN r23107.	2010-05-07 14:04:55 +00:00
Ralph Castain	d4f56cff61	More cleanup on paffinity....groan It is okay to not have a paffinity module IF you aren't using paffinity anyway. So don't error out of MPI_Init because a paffinity module wasn't selected. Cleanup error reporting in the odls default module to (once and for all!) eliminate messages originating in the fork'd process. Create some new error codes to allow us to pass enough info back to the parent process to provide useful error messages. This commit was SVN r23106.	2010-05-06 20:57:17 +00:00
Ralph Castain	2ff1ae13e1	Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set. This commit was SVN r23102.	2010-05-05 00:48:43 +00:00
Ralph Castain	f994a7edf4	Add recovery data to the jobdat object This commit was SVN r23078.	2010-05-03 04:06:13 +00:00
Ralph Castain	319758e3e0	Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params: 1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis 2. orte_max_local_restarts - default max number of local restarts, can be overridden 3. orte_max_global_restarts - default max number of relocates, can be overridden Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code. This commit was SVN r23057.	2010-04-28 04:06:57 +00:00
Ralph Castain	55889934d8	After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon. Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly. This commit was SVN r23045.	2010-04-27 03:39:32 +00:00
Ralph Castain	b9893aacc5	Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here: 1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified. 2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set. This support must be enabled by configuring --enable-sensors. ompi_info and orte-info have been updated to include the new framework. Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized. Implementation continues. This commit was SVN r23043.	2010-04-26 22:15:57 +00:00
Ralph Castain	efbb5c9b7c	Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base: * add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway. * update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states. * remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails. * update the plm_base_receive.c code to call the errmgr with state updates from remote daemons * update the routed modules to reflect that process state is updated in the errmgr * ensure that the orted's open the errmgr and select their appropriate module * add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place * define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired. * update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios This commit was SVN r23023.	2010-04-23 04:44:41 +00:00
Ralph Castain	86228aee38	Provide two new opal paffinity utilities for printing a hex representation of the cpu set and parsing that string back into a cpu set on the other end. Also add a new MCA param for passing the cpu set applied to a process during launch down to that process so it can know what we attempted to do. All to be used in some new MPI extensions provided by Jeff so that users can easily query their binding situation. This commit was SVN r22998.	2010-04-19 22:16:35 +00:00
Ralph Castain	4d06125a33	Establish a method by which a process knows if it has been bound by mpirun. This helps resolve a problem where a process gets "bound" to all available resources, which looks to the opal paffinity system as "not bound". This can cause mpi_init to attempt to "bind" the process itself, causing unintended behavior. This commit was SVN r22985.	2010-04-17 01:58:26 +00:00
Ralph Castain	41428e6b61	Issue a warning if a requested binding operation results in processes being bound to all available processes, which is the equivalent of not being bound at all. See the following email thread for further details: http://www.open-mpi.org/community/lists/devel/2010/04/7745.php This commit was SVN r22984.	2010-04-17 01:02:41 +00:00
Ralph Castain	81329c637e	Indentation corrections This commit was SVN r22971.	2010-04-13 17:47:34 +00:00
Ralph Castain	8da781af84	Continue developing support for distributed virtual machines - minor changes to ensure correct jobid gets used and that dvm's can communicate with tools This commit was SVN r22958.	2010-04-12 22:33:09 +00:00
Shiqing Fan	96b20a29b5	An easy solution to make singleton work on Windows. This commit was SVN r22952.	2010-04-10 16:30:59 +00:00
Ralph Castain	4f8279df3d	Enable substitution of the communication calls in the orted when sending messages back to the HNP by creating a function for this purpose and saving the pointer to it in orte_odls_base. Higher level libraries can then override the default function to use their own method. This commit was SVN r22950.	2010-04-09 18:50:10 +00:00
Terry Dontje	282a537cf7	This commit fixes 2370, by having the solaris paffinity module return error codes for get_physical_processor_id and having odls_default_fork_local_proc check get_physical_processor_id for OPAL_ERROR This commit was SVN r22948.	2010-04-09 15:10:46 +00:00
Terry Dontje	929c58e38d	This commit fixes trac:2073 This commit was SVN r22946. The following Trac tickets were found above: Ticket 2073 --> https://svn.open-mpi.org/trac/ompi/ticket/2073	2010-04-08 18:17:44 +00:00
Ralph Castain	ed0f42fa49	Fix a bug courtesy of Jeff - since check_job_complete removes the child object and releases it, preserve the pointer to the next item on the list prior to working with it This commit was SVN r22924.	2010-04-02 07:08:34 +00:00
Ralph Castain	f6bfaa76ba	Add some debug output to job_complete. If no session dirs were created, then cannot check for abort file - which wouldn't be created anyway This commit was SVN r22903.	2010-03-29 23:21:03 +00:00
Ralph Castain	2603bd8a47	Eliminate a race condition (first reported by Josh) when deliberately killing procs. Need to cancel the waitpid callback for the proc, then properly flag it as dead (both not-alive and waitpid-fired) so that the system cleans up properly. This commit was SVN r22900.	2010-03-28 16:08:05 +00:00
Ralph Castain	0b9552cd4e	Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly. This commit was SVN r22870.	2010-03-23 20:47:41 +00:00
Rainer Keller	814fb9399f	- Further patches for support on NetBSD (and DragonFly) by Aleksej Saushev. Don't use bash or bashisms in shell scripts. We should use POSIX setpgid(0,0), which is equivalent to setpgrp(). This commit was SVN r22829.	2010-03-15 05:33:42 +00:00
Josh Hursey	e9b5162d79	Fix the configure logic for --with-ft so that it properly takes a comma separated list. Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those. The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed. This commit was SVN r22824.	2010-03-12 23:57:50 +00:00
Ralph Castain	17936e6e5f	Ensure we cleanly terminate if an executable cannot be found This commit was SVN r22805.	2010-03-10 16:45:08 +00:00
Ralph Castain	9e7f621a98	Port Brad's paffinity change to the 1.4 branch over to the trunk so we don't lose it going forward. This commit was SVN r22794.	2010-03-07 18:44:22 +00:00
Ralph Castain	2a0f7e95ee	Don't double account for the killed local proc - only adjust num_local_procs when the proc actually dies. This commit was SVN r22787.	2010-03-05 13:53:18 +00:00
Ralph Castain	f2c65dc70f	Ensure that the errmgr does not take action if the process was terminated by a "kill_procs" command as this can lead to circular logic. Cleanup the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits. This commit was SVN r22783.	2010-03-05 13:22:12 +00:00
Ralph Castain	ef6c432e22	Fix a nasty bug where we would hang if an application trapped signals such as SIGTERM - a permissible thing to do. In such cases, we removed the process from the waitpid system and then sent it a SIGTERM. If the application trapped that and attempted to cleanly terminate, it would send us a sync message - and the daemon would then add it back to its local child list, causing both the daemon and the process to hang. In this revision, we let the process terminate/exit however it can, and then pick it up via the usual waitpid. This commit was SVN r22781.	2010-03-05 04:14:56 +00:00
Ralph Castain	5514d9c673	Fix the stupid rankfile mapper again, hopefully not breaking everything else to accommodate it. Looks like the round-robin mappers still work, at least... This commit was SVN r22746.	2010-03-01 20:40:47 +00:00
Ralph Castain	2541aa98ab	Change the app_idx type to uint32_t to support users who use large numbers of app_contexts. Set it up as a new typedef so we can change it later without as much effort. This commit was SVN r22727.	2010-02-27 17:37:34 +00:00
Ralph Castain	18c7aaff08	Update the grpcomm framework to be more thread-friendly. Modify the orte configure options to specify --enable-multicast such that it directs components to build or not instead of littering the code base with #if's. Remove those #if's where they used to occur. Add a new grpcomm "mcast" module to support multicast operations. Still some work required to properly perform daemon collectives for comm_spawn operations. New module only builds when --enable-multicast is provided, and when specifically selected. This commit was SVN r22709.	2010-02-25 01:11:29 +00:00
Jeff Squyres	f65eebf53d	More changes for NetBSD. Thanks to Aleksej Saushev for this patch. This commit was SVN r22680.	2010-02-22 15:05:09 +00:00
Ralph Castain	65a8ab4267	Cleanup the kill_procs command. Send a SIGTERM initially to allow C/R operations, and to be polite. Correctly update proc state if there is a problem so we don't hang. The change to just using SIGKILL was originally done due to problems whereby waitpid thought a proc had died, but it hadn't. We'll continue debugging that problem separately, but SIGTERM is required for C/R to work properly. This commit was SVN r22674.	2010-02-21 19:35:32 +00:00
Iain Bason	28f03a2d86	Suspend/resume enhancements: Have orte call setpgrp after forking (but before exec) when orte_forward_job_control is set. Then have it send signals to the child's process group. This allows suspending jobs that fork. If a SIGTSTP arrives before the processes have been launched, then record it and suspend them right after launching. This commit was SVN r22557.	2010-02-04 15:47:20 +00:00
Shiqing Fan	872a4047ba	Fix the bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions. In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically, but in CMake 2.8 this behavior has changed: it only adds the dependencies and does not link, which causes linking errors at compilation time. This commit was SVN r22405.	2010-01-14 18:10:20 +00:00
Ralph Castain	cec840f6b9	The ability to add procs to a running job was unfortunately borked when we added the detection of a proc exiting before calling init. Re-enable it here, ensuring that procs that are being restarted and/or added to a job do -not- call barrier during orte_init. This commit was SVN r22404.	2010-01-14 17:59:42 +00:00
Ralph Castain	ef1bfaa823	Add the ability to track how many times a process has been restarted, and to communicate that value to a process when it is restarted in case it needs to take action when it is restarted as opposed to being started for the first time. This commit was SVN r22377.	2010-01-07 01:19:44 +00:00
Ralph Castain	aaf1119f40	Garrr...ensure we accurately know when to update the contact info so we don't do it incorrectly as procs terminate, thus causing the system to think that perfectly good apps are incorrectly terminating. Thanks to George for pointing out the problem This commit was SVN r22332.	2009-12-17 20:40:21 +00:00
Ralph Castain	8ab962411c	Detect the scenario where one or more procs fail to call orte/ompi_init while others in the job do. This scenario can cause the job to hang as MPI_Init contains a barrier operation that will not complete. Although ORTE does not contain such a barrier, it still will be considered as an error scenario so that we can detect the MPI case - otherwise, ORTE has no knowledge of OMPI and wouldn't know how to differentiate the use-cases. Take advantage of the changes to update the routed_base_receive code to avoid message overlap. This commit was SVN r22329.	2009-12-17 19:39:53 +00:00
George Bosilca	501d1cc4ad	Set default values to avoid using these variables uninitialized. This commit was SVN r22279.	2009-12-08 18:42:22 +00:00
Ralph Castain	4ec9c4b532	Do a better job of ensuring session directories are removed when procs abnormally terminate and/or we order "kill local procs" This commit was SVN r22258.	2009-12-03 04:46:17 +00:00
Ralph Castain	a0d5c80ce0	Add a new framework for discovering local resource information such as cpu type/model, #cpus, available physical memory, etc. Two initial components (darwin and linux) are provided. This is needed to support bootstrap operations where daemons are started at node boot, and applications where initial knowledge of cpu identification is needed to guide framework component selection. Add orte configuration option to control the use of the framework in the system. Although the code will build, it will not be active unless configured with --enable-bootstrap. If bootstrap is enabled and the new opal_sysinfo framework can successfully determine the cpu model, pass that info to the application as an MCA param to support some work at Sun. Also, have daemons report back the resources they find to guide process mapping in bootstrap operations (i.e., where the daemon starts at node boot as opposed to being launched at application start). Adjust some platform files to enable these capabilities. This commit was SVN r22244.	2009-11-30 23:11:25 +00:00
Ralph Castain	a401f05ea3	Add some diagnostics to chase down forced termination of procs. Ensure that procs are removed from the local data list upon termination This commit was SVN r22223.	2009-11-19 19:43:10 +00:00
Ralph Castain	1a44b84b25	If a process is in certain states (e.g., polling for messages in the event lib), then it can blissfully ignore SIGTERM when we try to order it to die. Unfortunately, the OS thinks the process actually did die, leading us to leave orphaned procs around. The only sure way to kill the thing is with SIGKILL. After hours spent trying to debug this bizarre situation with a reliable reproducer, I finally tracked it down and fixed it. Go figure...I sure can't. This commit was SVN r22220.	2009-11-19 17:25:15 +00:00
Rainer Keller	7dfe709ac1	- Initialize n before usage. This commit was SVN r22169.	2009-10-29 15:52:53 +00:00
Ralph Castain	13d86e100b	Courtesy of Ralph and Jeff: Continue the reorganization of the configure system. Move files from the main config directory to their appropriate level-specific config directories. Modify the configure system to correctly handle compiler detection, test, and setup so that all things pertaining to opal and orte are done at the lower level, with the ompi configure system only looking at mpi-specific options. Ensure the wrapper compilers for orte and ompi only get built when appropriate. Add support for c++ to the orte wrapper compilers, both script and non-script versions. This commit was SVN r22138.	2009-10-24 01:04:35 +00:00
Ralph Castain	2665825693	Correct an error that causes the system to "bounce" when we order a job killed. We didn't used to discriminate between a process being ordered to die, and a process that was aborted by an external signal. Unfortunately, that means the error mgr gets called and told a process abnormally aborted when we order termination, thus causing the errmgr to send out a "kill procs" command again. Wouldn't be so bad, except...the errmgr orders the termination of ALL procs, which kills any other job that should have been left alone. Add a new proc and job state indicating "killed_by_cmd" so we can tell the difference between a proc/job that was deliberately terminated by us vs one that is killed by external signal. This change was tested to ensure it didn't interfere with ctrl-c operation (it doesn't - we order termination of all jobs when we get a ctrl-c). This commit was SVN r22100.	2009-10-14 22:49:56 +00:00

330 commits