openmpi

Author	SHA1	Message	Date
Jeff Squyres	8a289cf1cb	Part 1 of the fix for ticket #726. This commit adds logic to orterun to effect the following: * The first time the user hits ctrl-c, we go into the process of killing the ORTE job (this is not new). * While waiting for the job to actually terminate, if the user hits ctrl-c a second time, we print a warning saying "Hey, I'm still trying to kill the job. If you really want me to die immediately, hit ctrl-c again within 1 second." * If the user hits ctrl-c again within 1 second, orterun quits with a warning about how the job may not have actually been killed. Note that none of this logic will really work until the second part of the fix for #726 is also committed (i.e., make pls.terminate_job() non-blocking). So I'm now throwing the ticket over to Ralph for the second part of the fix... Refs trac:726 This commit was SVN r13040. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-08 20:25:26 +00:00
Rolf vandeVaart	fdf44cc4ab	Add the ability to not only report broken files and directories, but remove them also. This current set of changes will affect nothing as no one is making use of this ability. However, orte-clean will be changed soon to utilize this new feature. This commit was SVN r12996.	2007-01-04 21:48:34 +00:00
Brian Barrett	bc6cec346f	Print out the description of the signal from mpirun when a proc was aborted by a signal if we have strsignal() This commit was SVN r12888.	2006-12-17 20:01:11 +00:00
Ralph Castain	7b8f445e13	Modify the "--display-map-at-launch" option to just "--display-map". Now that we have a "--do-not-launch" option, the "-at-launch" part of the display-map option was confusing. "--display-map" displays the resulting process map before we launch anyway, so this is clearer. This commit was SVN r12840.	2006-12-13 13:49:15 +00:00
Ralph Castain	82946cb220	Add a new option to orterun: "--do-not-launch" directs the system to do the allocation, map, job setup, etc., but don't actually launch the job. This lets us test all the setup portions of the code. Also, take the first step in updating how we handle mca params in ORTE - bring it closer to how it is done in the other two layers. Much more work to be done here. This commit was SVN r12838.	2006-12-13 04:51:38 +00:00
Ralph Castain	28ce8e5e5e	Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior). In "--npernode" operation, the "-np" command line parameter is ignored. This commit was SVN r12826.	2006-12-12 00:54:05 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Rainer Keller	e61dd8722e	- Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx - No functional changes, just indentation and corrections to error output. This commit was SVN r12734.	2006-12-03 13:59:23 +00:00
George Bosilca	a0ed53d70b	Make the compilers happy. This commit was SVN r12729.	2006-12-03 00:19:11 +00:00
Ralph Castain	652b91ee26	Remove some compiler warnings This commit was SVN r12678.	2006-11-27 23:47:36 +00:00
Brian Barrett	32833deff0	since orteboot, ortehalt, and ortekill were all added today (including to configure.ac), we need to add them to SUBDIRS to make them end up in the tarball as well... This commit was SVN r12658.	2006-11-23 03:10:57 +00:00
Ralph Castain	7f95b27141	Correctly "hide" the new orte tools - they shouldn't get compiled or seen unless you specifically go into those subdirectories and manually do a "make". This commit was SVN r12650.	2006-11-22 14:35:16 +00:00
Ralph Castain	33affed09c	Bring ortehalt to a preliminary capability. It will correctly order a persistent daemon to exit cleanly. Need to now interface it to orterun, clean up a few things here and there. This commit was SVN r12626.	2006-11-18 04:47:51 +00:00
Ralph Castain	3c5a2cd17b	Cleanup a few warnings for unused variables This commit was SVN r12620.	2006-11-17 19:32:49 +00:00
Ralph Castain	f771cc4fbd	Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option. Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check for HNP/non-HNP operation is made in the map_jobs function prior to execution. This commit was SVN r12619.	2006-11-17 19:06:10 +00:00
Ralph Castain	ca5b4358fa	Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too. Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone. This commit was SVN r12616.	2006-11-17 02:58:46 +00:00
Ralph Castain	5ddcb8a652	Ensure the orted kills all local procs when exiting. Add a little clarity to some of the debugging output This commit was SVN r12615.	2006-11-16 21:15:25 +00:00
Ralph Castain	044898f4bf	My eyes may be deceiving me....but I do believe these comparisons are backwards! I think we only really want to "free" these variables if they are NOT NULL - as opposed to "free"ing them if they ARE NULL. This commit was SVN r12612.	2006-11-15 22:59:01 +00:00
Ralph Castain	f7fc19a2ca	Create the ability to re-use existing daemons. Included in the commit: 1. new functionality in the pls base to check for reusable daemons and launch upon them 2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED 3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse". This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on. This commit was SVN r12608.	2006-11-15 21:12:27 +00:00
Ralph Castain	437f2b044d	Modify the orted command communication system in two ways: 1. use non-blocking sends to transmit commands (this was actually done in a prior commit) 2. have an "ack" message sent back from the orted when it completes the command The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occasionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done. Best of both worlds. This commit was SVN r12605.	2006-11-15 15:09:28 +00:00
Ralph Castain	6d6cebb4a7	Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things). Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it. I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn). This commit was SVN r12597.	2006-11-14 19:34:59 +00:00
Ralph Castain	7b4261001a	Forgot to modify the orted end of the communication subsystem This commit was SVN r12586.	2006-11-13 22:08:47 +00:00
Ralph Castain	f95e20e2e1	Add another test program - an MPI app that just spins. This supports testing of system response to signal-terminated processes. Add some debugger output to the ODLS default component. Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time. This commit was SVN r12585.	2006-11-13 21:51:34 +00:00
Ralph Castain	4636125e2d	Modify the RMGR components to allow job setup with a given jobid, and add another attribute so that we can setup triggers without launching. Add some debugging output to the ODLS default module, and the orted. Remove the nodename data from the ODLS info report - that info is already stored in the registry by the RMAPS framework upon completing the mapping procedure. Add another test program that does an ORTE-only dynamic spawn (gasp!). Looks just like comm_spawn - just no MPI involved. Modify the ODLS to release the processor when we "kill" local procs in a more scalable fashion. It previously had a sleep in it that Jeff's prior commit removed. However, he introduced some Windows code into the non-Windows component (protected by "if"s, but unnecessary). This is a more general solution he proposed - included here so I could get things to compile properly. This commit was SVN r12579.	2006-11-13 18:51:18 +00:00
Ralph Castain	4e50cdae52	This commit accomplishes two things: 1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs. 2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system. This commit was SVN r12558.	2006-11-11 04:03:45 +00:00
Ralph Castain	a3be8261fb	Fix a bug that had us generate an error message and abort startup when there were stale universe directories around. Now, we just ignore them. This commit was SVN r12472.	2006-11-07 21:34:57 +00:00
Ralph Castain	30de73a712	Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include: 1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch" 2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command. 3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job 4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map. Enjoy! My personal favorite is the first one - helps with debugging. This commit was SVN r12379.	2006-10-31 22:16:51 +00:00
Ralph Castain	c5b59829aa	Fix a long-lingering annoyance. Calling mpirun with a non-existent application would cause the system to hang on all environments. Reason was that the orted would exit, which it should never do without explicit orders to that effect. This commit was SVN r12255.	2006-10-23 13:27:31 +00:00
George Bosilca	ee559e9947	Do not completely reset the orterun_globals. Keep the condition and the mutex, but reset everything else. Once initialized, the condition (and the attached mutex) should be kept alive as long as possible if we want to be able to retrieve all the information. This commit was SVN r12253.	2006-10-23 03:34:08 +00:00
Ralph Castain	153e38ffc9	Lesson to be learned: if you send an ack to a recv'd command, better not send it to the same tag it came from - at least, not if there is a persistent recv on that tag! Fix the persistent daemon problem where it was exiting when a job completed. Problem was that the persistent daemon would order the job daemons to exit. They would then send an 'ack' back to the persistent daemon - but the ack consisted of an echo of the "exit" command, which was recv'd by the wrong listener who treated it as a properly sent cmd....and exited. This commit was SVN r12243.	2006-10-21 02:53:19 +00:00
Ralph Castain	ec0bb9ffda	Fix the bookmark system - we now have children being correctly spawned where they should! Also, I am no longer seeing any issue with the child job spawning its own daemons - this appears to be fixed. We still don't reuse the existing daemons, however, but that will come. This commit was SVN r12229.	2006-10-20 18:05:16 +00:00
Ralph Castain	02efd07b60	Fix the MCA param passing issue, at least for rsh at the moment. I will clean this up and move it to the other environments once I shift back to a local computer. This commit was SVN r12224.	2006-10-20 15:27:29 +00:00
Brian Barrett	37fad860b7	Grrr... Forgot that EXTRA_DIST and man_MANS are not set to include all the possible things contained in the conditional like other rules are (for example, a SOURCES rule in a conditional automatically has its files added to the dist rules, even if that conditional isn't true when make dist occurs). So the man files weren't in the tarball. Put the EXTRA_DIST with the files explicitly listed outside any conditionals so the man pages always end up in the tarball. This commit was SVN r12220.	2006-10-20 14:15:38 +00:00
Ralph Castain	ab196c3121	Okay, this fixes the problem of MCA params spreading too far. Sorry for the multiple corrections. This commit was SVN r12201.	2006-10-19 22:51:02 +00:00
Ralph Castain	382f954fff	Fix a bug in the way we saved and passed environments to child processes on remote nodes. The problem was that MCA directives for component selection were being passed back to the children. However, now that we only allow certain components to operate on HNPs, this caused the children to bomb out of orte_init. This commit was SVN r12196.	2006-10-19 20:35:55 +00:00
Brian Barrett	204f5b8f52	- Clean up wrapper compiler man pages during maintainer-clean, since they might require special tools (not sure if sed with multiple -e arguments is totally portable) - ignore the opalcc.1 man page. Couldn't do this in the previous man page commit (r12192) because I was removing opalcc.1 in that commit. This commit was SVN r12194. The following SVN revision numbers were found above: r12192 --> open-mpi/ompi@581a4b0a4e	2006-10-19 20:14:40 +00:00
Brian Barrett	581a4b0a4e	A few cleanups to the wrapper compiler build system / man pages: - Only install opal{cc,c++} and orte{cc,c++} if configured with --with-devel-headers. Right now, they are always installed, but there are no header files installed for either project, so there's really not much way for a user to actually compile an OPAL / ORTE application. - Drop support for opalCC and orteCC. It's a pain to set up all the symlinks (indeed, they are currently done wrong for opalCC) and there's no history like there is for mpiCC. - Change what is currently opalcc.1 to opal_wrapper.1 and add some macros that get sed'ed so that the man pages appear to be customized for the given command. - Install the wrapper data files even if we compiled with --disable-binaries. This is for the use case of doing multi-lib builds, where one word size will only have the library built, but we need both sets of wrapper data files to piece together to activate the multi-lib support in the wrapper compilers. This commit was SVN r12192.	2006-10-19 18:34:17 +00:00
Ralph Castain	13227e36ab	This commit looks a lot bigger than it is, so relax :-) Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off. To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place. I used those capabilities in two places: 1. Added an attribute list to the rmgr.spawn interface. 2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h). So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms. This commit was SVN r12138.	2006-10-17 16:06:17 +00:00
Brian Barrett	9adde4f7b8	Allow multilib capability based on compiler flags. See: https://svn.open-mpi.org/trac/ompi/wiki/compilerwrapper3264 for more information. Refs trac:374 This commit was SVN r12120. The following Trac tickets were found above: Ticket 374 --> https://svn.open-mpi.org/trac/ompi/ticket/374	2006-10-15 21:21:08 +00:00
Ralph Castain	3f55d6897a	Remove the memory debugging options. Fix what appears to be a typo in a help file. This commit was SVN r12107.	2006-10-12 00:44:48 +00:00
Ralph Castain	2da8245be0	Correctly propagate no-daemonize This commit was SVN r12093.	2006-10-11 17:53:17 +00:00
Ralph Castain	27e305347c	Add a couple of options to orterun that support debugging of daemons for memory corruption. Ensure that the environment provided to local application processes isn't "polluted" by the orteds This commit was SVN r12087.	2006-10-11 15:18:57 +00:00
Ralph Castain	e7f6fa22d6	Fix return code so that mpirun returns the right thing when an abort is encountered. This commit was SVN r12065.	2006-10-09 01:04:00 +00:00
Ralph Castain	2e09128337	Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destructor to fail!! Restore the OBJ_RELEASE calls to cleanup map objects. This commit was SVN r12064.	2006-10-07 22:44:00 +00:00
Ralph Castain	98dd57b70e	Add a new option to launch "pernode" - launches one process/node across all available nodes. The other options also work correctly: "-bynode" with no -np will launch on all slots, mapped on a per-node basis. This commit was SVN r12063.	2006-10-07 19:50:12 +00:00
Ralph Castain	889ddefe85	Remove release that caused totalview connection to bomb This commit was SVN r12061.	2006-10-07 18:25:56 +00:00
Ralph Castain	ae79894bad	Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue. I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there). Gridengine compiles but I cannot test (believe it likely will run). Poe and xgrid compile to the extent they can without the proper include files. This commit was SVN r12059.	2006-10-07 15:45:24 +00:00
Jeff Squyres	72cf2fe813	Oops: --noprefix should not take an argument. This commit was SVN r12043.	2006-10-06 13:02:56 +00:00
George Bosilca	d628a18411	Right now there is no support for TotalView on Windows. Therefore, we don't really care how these functions and variables are declared. This commit was SVN r11996.	2006-10-05 05:19:03 +00:00
Ralph Castain	12328395ae	Missed a couple of debug statements This commit was SVN r11935.	2006-10-02 15:46:41 +00:00


239 commits