openmpi

Author	SHA1	Message	Date
Ralph Castain	8ccd14d508	Correct the -h output to reflect that --bind-to-none is the default behavior This commit was SVN r23537.	2010-08-01 22:31:34 +00:00
Ralph Castain	12cd07c9a9	Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases. The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang. This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is: * pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place * modified the errmgr to directly call the new routines when termination is detected * removed the grpcomm.onesided_barrier and its associated RML tag * add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree * use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial. This commit was SVN r23429.	2010-07-17 21:03:27 +00:00
Ralph Castain	23b166cbd4	Initialize pointers to NULL This commit was SVN r23422.	2010-07-15 16:33:57 +00:00
Ralph Castain	2193ace464	Ensure the debugger symbols are loaded into orterun prior to orte_init so they are found by attaching debuggers. This commit was SVN r23368.	2010-07-09 17:51:00 +00:00
Ralph Castain	31295e8dc2	As discussed on today's telecon, reorganize the debugger attachment code in orte to better support efforts within the tool community aimed at exploring alternative methods. Move the debugger attachment code from the orterun directory to a new debugger framework. Organize the existing standard support code into an "mpir" component. Organize the current extensions for co-spawning debugger daemons into a separate "mpirx" component. Since the MPIR symbols are now included in the ORTE library, remove duplicate declarations in OMPI and replace them with extern references to their ORTE instantiations. This commit was SVN r23360.	2010-07-06 23:35:42 +00:00
Ralph Castain	628936a99f	Provide a convenience option to disable fault recovery (as opposed to setting three separate, long-named mca params) This commit was SVN r23327.	2010-07-01 19:31:11 +00:00
Ralph Castain	099c3aad97	Fix a major foopah that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0. This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all. Thanks to Dong Ahn (LLNL) for catching this problem! Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm. This commit was SVN r23300.	2010-06-24 05:13:53 +00:00
Jeff Squyres	f1a7b5cc33	Make "processor affinity not supported" error message a little better: * Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic OPAL_ERR_NOT_SUPPORTED case. * When odls_default detects that processor affinity is not supported, it prints a specific message about it, and then it suppresses a generic HNP help message that would normally follow it (i.e., it's easier to have the "processor affinity is not supported" show_help message last). * Use some symbolic names in odls_default instead of fixed ints, just for slight readability improvements in the code. * Introduce orte_show_help_suppress(), which gives the ability to suppress any future showings of any arbitrary show_help() message. This is useful if you display message X and want to suppress message Y. This suppression only works in environments where orte_show_help() does coalescing. This commit was SVN r23249.	2010-06-08 20:16:07 +00:00
Abhishek Kulkarni	260eaa24dd	Add a missing header. Bleh. This commit was SVN r23165.	2010-05-17 23:55:53 +00:00
Abhishek Kulkarni	8335ff3893	Fixing a branch merge issue. Looks like we picked the wrong branch when resolving a conflict. This commit was SVN r23164.	2010-05-17 23:49:46 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns back the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Ralph Castain	88f5217a12	Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired). Add new mca params to test: orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch orte_debugger_test_attach: Test debugger colaunch after debugger attachment To test co-launch at job start, just set the orte_debugger_test_daemon param. To test co-launch upon attach: set orte_debugger_test_daemon set orte_debugger_test_attach=1 set orte_enable_debug_cospawn_while_running=1 set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon. This commit was SVN r23144.	2010-05-14 18:44:49 +00:00
Ralph Castain	7ce34223f1	Per off-list discussion, implement the new OMPI exit status policy (soon to be on wiki) and further cleanup error reporting to cover new cases. Implement process migration when failed nodes are detected. Some testing still required This commit was SVN r23121.	2010-05-12 18:11:58 +00:00
Ralph Castain	4bd25f587c	Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework. Add an "app" component to the errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so. Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort. This commit was SVN r23112.	2010-05-11 00:34:12 +00:00
Ralph Castain	d6a1d7a082	Little more cleanup on paffinity. Provide a specific error code for affinity not supported so we can better report the problem. Move the error reporting to orterun so we only get one error message. Update the darwin paffinity module to return the correct new error codes. This commit was SVN r23107.	2010-05-07 14:04:55 +00:00
Ralph Castain	319758e3e0	Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params: 1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis 2. orte_max_local_restarts - default max number of local restarts, can be overridden 3. orte_max_global_restarts - default max number of relocates, can be overridden Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code. This commit was SVN r23057.	2010-04-28 04:06:57 +00:00
Ralph Castain	b9893aacc5	Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here: 1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified. 2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set. This support must be enabled by configuring --enable-sensors. ompi_info and orte-info have been updated to include the new framework. Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized. Implementation continues. This commit was SVN r23043.	2010-04-26 22:15:57 +00:00
Ralph Castain	efbb5c9b7c	Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base: * add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway. * update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states. * remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails. * update the plm_base_receive.c code to call the errmgr with state updates from remote daemons * update the routed modules to reflect that process state is updated in the errmgr * ensure that the orted's open the errmgr and select their appropriate module * add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place * define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired. * update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios This commit was SVN r23023.	2010-04-23 04:44:41 +00:00
Ralph Castain	4308922f59	Ensure that any application-specific selection of ess module doesn't get overridden by what is given to the orted or orterun Cleanup tool name determination for CM This commit was SVN r22980.	2010-04-15 18:10:50 +00:00
Ralph Castain	c32f046d7c	Tiny cleanup - when the user kills us with a ctrl-c, there really isn't a need to tell him "your procs died and we don't know why". Just shaddup and die. This commit was SVN r22949.	2010-04-09 18:47:35 +00:00
Ralph Castain	de6679dbd3	Truly respect the -quiet option. Make it an mca param so someone doesn't have to put it solely on the cmd line. Tell show_help to shaddup as well. This commit was SVN r22926.	2010-04-02 14:19:38 +00:00
Ralph Castain	25a9a195b0	When the user requests "quiet", mpirun isn't supposed to output "helpful error messages" - so shaddup! This commit was SVN r22925.	2010-04-02 07:13:11 +00:00
Jeff Squyres	c57a8fba5a	Improve the help message when mpirun cannot find an executable. Refs trac:2035 This commit was SVN r22922. The following Trac tickets were found above: Ticket 2035 --> https://svn.open-mpi.org/trac/ompi/ticket/2035	2010-04-01 13:26:29 +00:00
Josh Hursey	e4f2d03d28	ErrMgr Framework redesign to better support fault tolerance development activities. Explained in more detail in the following RFC: http://www.open-mpi.org/community/lists/devel/2010/03/7589.php This commit was SVN r22872.	2010-03-23 21:28:02 +00:00
Josh Hursey	9e967a3a9b	Revert this change, since even in the CR case we want to reset this var to NULL. Thanks to Jeff for the catch. This commit was SVN r22868.	2010-03-23 19:55:21 +00:00
Josh Hursey	e9b5162d79	Fix the configure logic for --with-ft so that it properly takes a comma separated list. Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those. The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed. This commit was SVN r22824.	2010-03-12 23:57:50 +00:00
Ralph Castain	f2c65dc70f	Ensure that the errmgr does not take action if the process was terminated by a "kill_procs" command as this can lead to circular logic. Cleanup the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits. This commit was SVN r22783.	2010-03-05 13:22:12 +00:00
Ralph Castain	6c0d7940c7	Add a new MCA param (and corresponding mpirun cmd line option) to output the debugger proctable info after launch. The output is just the job map with the process pid included, so you get a node-by-node list of the process ranks on that node and their pids. Works for initial launch and comm_spawn. xml and non-xml output is available This commit was SVN r22725.	2010-02-27 08:32:25 +00:00
Ralph Castain	00b493e10d	Make the sigpipe message be a verbose output This commit was SVN r22464.	2010-01-22 03:46:11 +00:00
Ralph Castain	2517799102	Report, but ignore, SIGPIPE events. The odls already resets this signal handler when spawning local procs. This commit was SVN r22463.	2010-01-21 05:01:06 +00:00
Jeff Squyres	16b100219d	A patch from UTK to allow orte_init(), opal_init(), and associated friends to also receive &argc and &argv (George asked Jeff to Ralph to review before committing). The thought is that passing argv and argc to opal/orte_init might be useful to other projects outside of OMPI that are using OPAL and/or ORTE (especially in conjunction with some other bootstrapping code where it is helpful to modify argv). It's such a small thing that it's easy to apply here to make others' lives a little easier. Ask George for more details; I'm just the messenger. :-) Judging by the copyrights on this patch, it's been around for a while. :-) This commit was SVN r22260.	2009-12-04 00:51:15 +00:00
Ralph Castain	67625d5df6	Ensure that mca params have a chance to be used in the standard hierarchy - in this case, allowing a choice of mapping policy to be made at the rmaps mca level. This commit was SVN r22089.	2009-10-11 03:44:39 +00:00
Ralph Castain	19a13039aa	Strange - remove duplicated comment This commit was SVN r22035.	2009-09-30 14:38:13 +00:00
Ralph Castain	c749fefbd0	Instead of an odls-base mca param, make report_bindings a global param so that we can (a) detect it was set in the plm, and then (b) ensure it gets passed along to remote orteds so they will comply with the request. This commit was SVN r22021.	2009-09-28 03:17:15 +00:00
Ralph Castain	dff0d01673	Yet another paffinity cleanup...sigh. 1. ensure that orte_rmaps_base_schedule_policy does not override cmd line settings 2. when you try to bind to more cores than we have, generate a not-enough-processors error message 3. allow npersocket -bind-to-core combination - because, yes, somebody actually wants to do it. This commit was SVN r21996.	2009-09-22 18:44:53 +00:00
Terry Dontje	0ccf2d87b6	rename do-not-bind to bind-to-none and clean up an error message This commit was SVN r21980.	2009-09-21 17:00:02 +00:00
Ralph Castain	2028017554	Modify the paffinity system to handle binding directives that are "soft" - i.e., when someone directs that we bind if the system supports it. This allows community members to distribute OMPI with default MCA param files that direct general binding policies, without having the distributed software fail if the system cannot support those policies. The new options work by adding an ":if-avail" qualifier to the "bind-to-socket" and "bind-to-core" MCA params. If the system does not support this capability, the job will launch anyway. Without the qualifier, the job will abort with an error message indicating that the required functionality is not supported on this system. This commit was SVN r21975.	2009-09-18 19:48:42 +00:00
Ralph Castain	8ae4b55d16	Enable a new command line option to --report-events that instructs mpirun to RML-report specific events during job life to the requestor. This commit was SVN r21954.	2009-09-09 05:28:45 +00:00
Ralph Castain	ca09e8f604	Minor modification required to allow opal_paffinity_alone to default to bind-to-core This commit was SVN r21948.	2009-09-05 15:24:26 +00:00
Ralph Castain	0421a49844	Update the xml support to allow -xml-file foo whereby we redirect all xml formatted output (and ONLY xml formatted output) to a specified file This commit was SVN r21930.	2009-09-02 18:03:10 +00:00
Ralph Castain	0394a4884d	Setup cpus-per-proc and cpus-per-rank as synonyms, both in mca params and on mpirun cmd line This commit was SVN r21914.	2009-08-30 14:30:36 +00:00
Ralph Castain	3ef028ca23	Trap mpirun error messages in xml format This commit was SVN r21910.	2009-08-28 02:46:15 +00:00
Shiqing Fan	4119497c5a	Use select() for Windows events by default. For some historical reasons, we didn't use select() for the Windows events. Now it could be merged back to have a better performance on Windows. This commit was SVN r21899.	2009-08-27 08:11:56 +00:00
Ralph Castain	5e710928a5	Revise the new binding system slightly: 1. finalize the logic for properly respecting externally assigned bindings. Thanks to Chris Samuel for his help with this. Still needs some acid testing, but appears to now work. 2. remove the double-logic of requiring opal_paffinity_alone AND bind-to-foo. If the user specifies bind-to-foo, trust her and just do it. This commit was SVN r21885.	2009-08-26 02:01:49 +00:00
Ralph Castain	509cc0553c	When directly launched by an RM, flag that a process is operating without daemons - i.e., standalone. Provide an error string for the new socket_not_available error. Use errmgr.abort to exit when we cannot get a socket, and ensure that the slurmd module returns the proper exit status for slurm 2.0 This commit was SVN r21868.	2009-08-22 02:58:20 +00:00
Ralph Castain	d3313f732a	Flush the mpirun xml tag output to maintain ordering This commit was SVN r21836.	2009-08-18 21:20:55 +00:00
Ralph Castain	3796cc568a	Add a --report-bindings cmd line option to mpirun This commit was SVN r21829.	2009-08-18 17:10:23 +00:00
Ralph Castain	a489f3fc6c	Surround the entire xml output from an mpirun with <mpirun> tags. This commit was SVN r21826.	2009-08-18 03:15:29 +00:00
Ralph Castain	0005e6e834	Correct a couple of bugs in the rank_file mapper that were incorrectly assigning vpids. Add a capability to parse the rankfile to extract node information in place of requiring both hostfile and rankfile for non-RM managed environments. The rankfile is -only- parsed for this IF the hostfile and -host options are not given. Otherwise, those are used to establish allocation info as we did before this commit. This commit was SVN r21815.	2009-08-13 16:08:43 +00:00
Ralph Castain	1dc12046f1	Modify the OMPI paffinity and mapping system to support socket-level mapping and binding. Mostly refactors existing code, with modifications to the odls_default module to support the new capabilities. Adds several new mpirun options: * -bysocket - assign ranks on a node by socket. Effectively load balances the procs assigned to a node across the available sockets. Note that ranks can still be bound to a specific core within the socket, or to the entire socket - the mapping is independent of the binding. * -bind-to-socket - bind each rank to all the cores on the socket to which they are assigned. * -bind-to-core - currently the default behavior (maintained from prior default) * -npersocket N - launch N procs for every socket on a node. Note that this implies we know how many sockets are on a node. Mpirun will determine its local values. These can be overridden by provided values, either via MCA param or in a hostfile Similar features/options are provided at the board level for multi-board nodes. Documentation to follow... This commit was SVN r21791.	2009-08-11 02:51:27 +00:00


276 commits