openmpi

Author	SHA1	Message	Date
George Bosilca	f2a6b9394f	Deal with the include spree. Protect "environ" on Windows. Some other minor modifications in order to make it compile [again] on Windows. This commit was SVN r14188.	2007-04-01 16:16:54 +00:00
Ralph Castain	0d98264097	Fix the nolocal option on the OMPI trunk. This commit was SVN r14138.	2007-03-24 16:16:16 +00:00
Jeff Squyres	bcdfbacaa4	Oops -- typo from previous commit. :-( This commit was SVN r14130.	2007-03-23 00:51:50 +00:00
Jeff Squyres	a3dd0f2e08	Connect --nolocal up to the MCA param rmaps_base_schedule_local, as it should be (it's a mistake that it got left out). This commit was SVN r14127.	2007-03-22 19:29:47 +00:00
Josh Hursey	dadca7da88	Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). This merge adds Checkpoint/Restart support to Open MPI. The initial frameworks and components support a LAM/MPI-like implementation. This commit follows the risk assessment presented to the Open MPI core development group on Feb. 22, 2007. This commit closes trac:158 More details to follow. This commit was SVN r14051. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r13912 The following Trac tickets were found above: Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158	2007-03-16 23:11:45 +00:00
Josh Hursey	0404444dbe	* Added 2 new MCA parameters - mca_base_param_file_prefix (Default: NULL) This is the fullname of the "-am" mpirun option. Used to specify a ':' separated list of AMCA parameter set files. - mca_base_param_file_path (Default: $SYSCONFDIR/amca-param-sets/:$CWD) The path to search for AMCA files with relative paths. A warning will be printed if the AMCA file cannot be found. * Added a new function "mca_base_param_recache_files" that re-reads the file configurations. This is used internally to help bootstrap the MCA system. * Added a new orterun/mpirun command line option '-am' that aliases for the mca_base_param_file_prefix MCA parameter * Exposed the opal_path_access function as it is generally useful in other places in the code. * New function "opal_cmd_line_make_opt_mca" which will allow you to append a new command line option with MCA parameter identifiers to set at the same time. Previously this could only be done at command line declaration time. * Added a new directory under the $pkgdatadir named "amca-param-sets" where all the 'shipped with' Open MPI AMCA parameter sets are placed. This is the first place to search for AMCA sets with relative paths. * An example.conf AMCA parameter set file is located in contrib/amca-param-sets/. * Jeff Squyres contributed an OpenIB AMCA set for benchmarking. Note: You will need to autogen with this commit as it adds a configure param. Sorry :( This commit was SVN r13867.	2007-03-01 13:39:20 +00:00
Pak Lui	2d6b3776bf	* fix the SEGV described in trac #892 that the exit_status in the 200 range causes a strsignal to show NULL as a result. Still trying to determine why exit_status is in that range. This commit was SVN r13583.	2007-02-09 16:39:30 +00:00
Pak Lui	ccff0a6e65	* minor fix to correct the pid that always shows up as 0 in the abort error message. e.g: mpirun noticed that job rank 2 with PID 0 on node burl-ct-v440-4 exited on signal 15 (Terminated). This commit was SVN r13537.	2007-02-07 17:46:19 +00:00
George Bosilca	9f73335bdb	Silence the compiler. This commit was SVN r13381.	2007-01-31 04:24:56 +00:00
Jeff Squyres	8d872b195a	Refs trac:726 Tested this functionality quite a bit more and made some fixes: * Print far fewer help messages * Fix one additional deadlock upon error * Change some ORTE_LOG messages to silent (because they're not errors) * Some code got re-indented, sorry... Discussed and reviewed with Ralph. This commit was SVN r13375. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-30 23:03:13 +00:00
Ralph Castain	ab5ea61100	Bring over the rest of the ctrl-c fixes. This commit includes: 1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned. 2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we printout an appropriate error message and just give up. 3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded 4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond This needs more testing before migrating to 1.2. This commit was SVN r13304.	2007-01-25 14:17:44 +00:00
Ralph Castain	455e4ada9a	Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes. This commit was SVN r13191.	2007-01-18 17:15:19 +00:00
Ralph Castain	cc905290e4	Fix the pernode and npernode options - the mca parameters weren't being set to correspond to the command line options. This commit was SVN r13151.	2007-01-17 14:56:22 +00:00
Ralph Castain	5d698dc55b	Turn "off" an unimplemented command line option - we do not currently support execution without mpirun waiting for job completion. This commit was SVN r13127.	2007-01-16 16:10:31 +00:00
Jeff Squyres	8a289cf1cb	Part 1 of the fix for ticket #726. This commit adds logic to orterun to effect the following: * The first time the user hits ctrl-c, we go into the process of killing the ORTE job (this is not new). * While waiting for the job to actually terminate, if the user hits ctrl-c a second time, we print a warning saying "Hey, I'm still trying to kill the job. If you really want me to die immediately, hit ctrl-c again within 1 second." * If the user hits ctrl-c again within 1 second, orterun quits with a warning about how the job may not have actually been killed. Note that none of this logic will really work until the second part of the fix for #726 is also committed (i.e., make pls.terminate_job() non-blocking). So I'm now throwing the ticket over to Ralph for the second part of the fix... Refs trac:726 This commit was SVN r13040. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-08 20:25:26 +00:00
Brian Barrett	bc6cec346f	Print out the description of the signal from mpirun when a proc was aborted by a signal if we have strsignal(). This commit was SVN r12888.	2006-12-17 20:01:11 +00:00
Ralph Castain	7b8f445e13	Modify the "--display-map-at-launch" option to just "--display-map". Now that we have a "--do-not-launch" option, the "-at-launch" part of the display-map option was confusing. "--display-map" displays the resulting process map before we launch anyway, so this is clearer. This commit was SVN r12840.	2006-12-13 13:49:15 +00:00
Ralph Castain	82946cb220	Add a new option to orterun: "--do-not-launch" directs the system to do the allocation, map, job setup, etc., but don't actually launch the job. This lets us test all the setup portions of the code. Also, take the first step in updating how we handle mca params in ORTE - bring it closer to how it is done in the other two layers. Much more work to be done here. This commit was SVN r12838.	2006-12-13 04:51:38 +00:00
Ralph Castain	28ce8e5e5e	Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior). In "--npernode" operation, the "-np" command line parameter is ignored. This commit was SVN r12826.	2006-12-12 00:54:05 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Rainer Keller	e61dd8722e	- Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx - No functional changes, just indentation and corrections to error output. This commit was SVN r12734.	2006-12-03 13:59:23 +00:00
Ralph Castain	f771cc4fbd	Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option. Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution. This commit was SVN r12619.	2006-11-17 19:06:10 +00:00
Ralph Castain	ca5b4358fa	Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too. Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone. This commit was SVN r12616.	2006-11-17 02:58:46 +00:00
Ralph Castain	044898f4bf	My eyes may be deceiving me....but I do believe these comparisons are backwards! I think we only really want to "free" these variables if they are NOT NULL - as opposed to "free"ing them if they ARE NULL. This commit was SVN r12612.	2006-11-15 22:59:01 +00:00
Ralph Castain	f7fc19a2ca	Create the ability to re-use existing daemons. Included in the commit: 1. new functionality in the pls base to check for reusable daemons and launch upon them 2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED 3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse". This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on. This commit was SVN r12608.	2006-11-15 21:12:27 +00:00
Ralph Castain	6d6cebb4a7	Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things). Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it. I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn). This commit was SVN r12597.	2006-11-14 19:34:59 +00:00
Ralph Castain	4e50cdae52	This commit accomplishes two things: 1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs. 2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system. This commit was SVN r12558.	2006-11-11 04:03:45 +00:00
Ralph Castain	30de73a712	Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include: 1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch" 2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command. 3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job 4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map. Enjoy! My personal favorite is the first one - helps with debugging. This commit was SVN r12379.	2006-10-31 22:16:51 +00:00
George Bosilca	ee559e9947	Do not completely reset the orterun_globals. Keep the condition and the mutex, but reset everything else. Once initialized, the condition (and the attached mutex) should be kept alive as long as possible if we want to be able to retrieve all the information. This commit was SVN r12253.	2006-10-23 03:34:08 +00:00
Ralph Castain	13227e36ab	This commit looks a lot bigger than it is, so relax :-) Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off. To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place. I used those capabilities in two places: 1. Added an attribute list to the rmgr.spawn interface. 2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h). So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms. This commit was SVN r12138.	2006-10-17 16:06:17 +00:00
Ralph Castain	3f55d6897a	Remove the memory debugging options. Fix what appears to be a typo in a help file. This commit was SVN r12107.	2006-10-12 00:44:48 +00:00
Ralph Castain	2da8245be0	Correctly propagate no-daemonize. This commit was SVN r12093.	2006-10-11 17:53:17 +00:00
Ralph Castain	27e305347c	Add a couple of options to orterun that support debugging of daemons for memory corruption. Ensure that the environment provided to local application processes isn't "polluted" by the orteds. This commit was SVN r12087.	2006-10-11 15:18:57 +00:00
Ralph Castain	e7f6fa22d6	Fix return code so that mpirun returns the right thing when an abort is encountered. This commit was SVN r12065.	2006-10-09 01:04:00 +00:00
Ralph Castain	2e09128337	Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destructor to fail!! Restore the OBJ_RELEASE calls to cleanup map objects. This commit was SVN r12064.	2006-10-07 22:44:00 +00:00
Ralph Castain	98dd57b70e	Add a new option to launch "pernode" - launches one process/node across all available nodes. The other options also work correctly: "-bynode" with no -np will launch on all slots, mapped on a per-node basis. This commit was SVN r12063.	2006-10-07 19:50:12 +00:00
Ralph Castain	889ddefe85	Remove release that caused totalview connection to bomb. This commit was SVN r12061.	2006-10-07 18:25:56 +00:00
Ralph Castain	ae79894bad	Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue. I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there). Gridengine compiles but I cannot test (believe it likely will run). Poe and xgrid compile to the extent they can without the proper include files. This commit was SVN r12059.	2006-10-07 15:45:24 +00:00
Jeff Squyres	72cf2fe813	Oops: --noprefix should not take an argument. This commit was SVN r12043.	2006-10-06 13:02:56 +00:00
George Bosilca	d628a18411	Right now there is no support for TotalView on Windows. Therefore, we don't really care how these functions and variables are declared. This commit was SVN r11996.	2006-10-05 05:19:03 +00:00
Ralph Castain	12328395ae	Missed a couple of debug statements. This commit was SVN r11935.	2006-10-02 15:46:41 +00:00
Tim Prins	53b116d309	This commit fixes trac:452. It turns out that we were improperly allocating an array if -np was not passed. Also, we were not really using this array for anything. So this gets rid of the array and performs some minor cleanup. This commit was SVN r11934. The following Trac tickets were found above: Ticket 452 --> https://svn.open-mpi.org/trac/ompi/ticket/452	2006-10-02 15:03:43 +00:00
Ralph Castain	559b9b0ae8	Continue beating on comm_spawn. Setup to debug bproc. This commit was SVN r11932.	2006-10-02 14:58:22 +00:00
Ralph Castain	121f834776	Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component. Still need some work on the proxy component, and on job termination for persistent daemon case. This commit was SVN r11928.	2006-10-02 00:46:31 +00:00
Tim Prins	e4f8ad303e	Fix for #397. On 64 bit platforms sizeof(size_t) != sizeof(orte_std_cntr_t), and we were incorrectly assuming this when dealing with num procs. It worked on little endian platforms, but not big endian. So change num_procs to type int, and cast where needed. This commit was SVN r11796.	2006-09-25 19:41:54 +00:00
Ralph Castain	0ad0d84afd	Add two new API functions to the RMGR, and modify the "spawn" API to support the enhanced MPI-2 functionality. No implementation backs these new APIs - just placeholders for now. This commit was SVN r11699.	2006-09-19 01:45:05 +00:00
George Bosilca	f8de894efe	This one wasn't supposed to get into the repository. This commit was SVN r11697.	2006-09-18 21:28:55 +00:00
George Bosilca	7ad23ff97b	Be 100% total view friendly. Let tv find out the real name of our executable and export all functions as they should be. This commit was SVN r11694.	2006-09-18 17:55:14 +00:00
Jeff Squyres	8226dab86c	Fixes trac:377 Add --enable-orterun-prefix-by-default (and a synonym: --enable-mpirun-prefix-by-default) to make orterun always behave as if "--prefix $prefix" was given on the command line (where $prefix is the value given to the --prefix option to configure). This prevents many rsh/ssh users from needing to modify their shell startup files to set the LD_LIBRARY_PATH for Open MPI (they will still need to set PATH or otherwise find the OMPI executables to mpicc/mpirun/etc. their MPI applications). Also added --noprefix option to orterun to disable this behavior. Finally, note that even if --enable-orterun-prefix-by-default is specified, if the user specifies --prefix or /path/to/mpirun, these options will override the default value of the prefix ($prefix). This commit was SVN r11669. The following Trac tickets were found above: Ticket 377 --> https://svn.open-mpi.org/trac/ompi/ticket/377	2006-09-15 02:52:08 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00

161 commits