Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check for HNP vs. non-HNP operation is made in the map_jobs function prior to execution.
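A rough sketch of the shape of that check (all names here, such as we_are_hnp and map_send_suppressed, are illustrative stand-ins rather than the actual ORTE symbols):

    #include <stdbool.h>

    typedef int jobid_t;

    /* stand-ins for the real RMAPS/ORTE helpers */
    bool we_are_hnp(void);                  /* is this process the HNP? */
    bool map_send_suppressed(jobid_t job);  /* do the job's attributes say "don't send"? */
    int  compute_map(jobid_t job);
    int  send_map_to_backends(jobid_t job);

    int map_job(jobid_t job)
    {
        int rc = compute_map(job);
        if (0 != rc) {
            return rc;
        }
        /* non-HNP operation, or a suppressing attribute, skips
         * communicating the map out to the backend nodes */
        if (!we_are_hnp() || map_send_suppressed(job)) {
            return 0;
        }
        return send_map_to_backends(job);
    }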
This commit was SVN r12619.
Note that Bproc won't support this operation, so we just ignore the --reuse-daemons directive.
I'm afraid I don't understand the POE and XGrid environments well enough to attempt the necessary modifications.
Also, please note that XGrid support has been broken on the trunk. I don't understand the code syntax well enough to make the required changes to that PLS component, so it won't compile at the moment. I'm hoping Brian has a few minutes to fix it after SC.
This commit was SVN r12614.
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
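A rough sketch of what the descendant-wide attribute implies, assuming hypothetical stand-ins (get_children, terminate_one_job) for the real name service and PLS calls:

    #include <stdlib.h>

    typedef int jobid_t;

    /* hypothetical stand-ins for the NS lookup and the PLS action */
    int get_children(jobid_t job, jobid_t **kids, size_t *nkids);
    int terminate_one_job(jobid_t job);

    /* terminate the given jobid AND everything descended from it */
    int terminate_with_descendants(jobid_t job)
    {
        jobid_t *kids = NULL;
        size_t nkids = 0, i;

        if (0 != get_children(job, &kids, &nkids)) {
            return -1;
        }
        for (i = 0; i < nkids; ++i) {
            terminate_with_descendants(kids[i]);  /* depth-first */
        }
        free(kids);
        return terminate_one_job(job);
    }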
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
If you want to look at our launch and MPI process startup times, you can do so with two MCA params:
OMPI_MCA_orte_timing: set it to anything non-zero and you will get the launch time for different steps in the job launch procedure. The degree of detail depends on the launch environment. rsh will provide you with the average, min, and max launch time for the daemons. SLURM block-launches the daemons, so you only get the time to launch the daemons and the total time to launch the job. Ditto for bproc. TM looks more like rsh. Only those four environments are currently supported - anyone interested in extending this capability to other environments is welcome to do so. In all cases, you also get the time to set up the job for launch.
OMPI_MCA_ompi_timing: set it to anything non-zero and you will get the time for mpi_init to reach the compound registry command, the time to execute that command, the time to go from our stage1 barrier to the stage2 barrier, and the time to go from the stage2 barrier to the end of mpi_init. This will be output for each process, so you'll have to compile any statistics on your own. Note: if someone develops a nice parser to do so, it would be really appreciated if you could/would share!
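For example (the application name is illustrative; both the environment-variable and command-line forms of setting MCA params work):

    OMPI_MCA_orte_timing=1 OMPI_MCA_ompi_timing=1 mpirun -np 4 ./my_app
    mpirun --mca orte_timing 1 --mca ompi_timing 1 -np 4 ./my_app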
This commit was SVN r12302.
Only close off stdout/stderr from the daemons if we are not debugging the slurm pls and --debug-daemons was not passed.
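Expressed as a sketch (the flag names are illustrative, not the component's actual variables):

    #include <stdbool.h>

    /* illustrative stand-ins, not the real component flags */
    bool debugging_slurm_pls;
    bool debug_daemons_flag;
    void close_daemon_stdio(void);

    void maybe_close_daemon_stdio(void)
    {
        /* keep the daemons' stdout/stderr open if either debug knob is set */
        if (!debugging_slurm_pls && !debug_daemons_flag) {
            close_daemon_stdio();
        }
    }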
This commit was SVN r12276.
The following Trac tickets were found above:
Ticket 352 --> https://svn.open-mpi.org/trac/ompi/ticket/352
This patch will cause a problem for cnos, however, as there we want to specifically tell the backends to be "null". I'm working on that issue.
This commit was SVN r12225.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).
Gridengine compiles, but I cannot test it (I believe it will likely run).
POE and XGrid compile to the extent they can without the proper include files.
This commit was SVN r12059.
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unnamed structures
- no implicit casts.
Plus a full implementation for the orte_wait functions.
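As one illustration of the "C++ friendly" requirement, a sketch only (OPAL's headers use its BEGIN_C_DECLS/END_C_DECLS macros to the same effect):

    /* give declarations C linkage when a C++ compiler reads the header */
    #if defined(__cplusplus)
    extern "C" {
    #endif

    int orte_wait_init(void);   /* example declaration only */

    #if defined(__cplusplus)
    }
    #endif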
This commit was SVN r11347.
1. Changed the RMGR and PLS APIs to add "signal_job" and "signal_proc" entry points. Only the "signal_job" entries are implemented - none of the components have implementations for "signal_proc" at this time. Thus, you can signal all of the procs in a job, but cannot currently signal only one specific proc.
2. Implemented those new API functions in all components except xgrid (Brian will do so very soon). Only the rsh/ssh and fork modules have been tested, however, and only under OS-X.
3. Added signal traps and callback functions for SIGUSR1/2 to orterun/mpirun that catch those signals and call the appropriate commands to propagate them out to all processes in the job (a sketch follows this list).
4. Added a new test directory under the orte branch to (eventually) hold unit and system level tests for just the run-time. Since our test branch of the repository is under restricted access, people working on the RTE were continually developing their own system-level tests - thus making it hard to help diagnose problems. I have moved the more commonly-used functions here, and added one specifically for testing the SIGUSR1/2 functionality.
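A minimal sketch of item 3, with hypothetical names (signal_job stands in for the real entry point; a production handler would hand the work off to the event loop rather than calling into the RTE from signal context):

    #include <signal.h>

    typedef int jobid_t;

    static jobid_t our_jobid;               /* assumed to be set at launch */
    int signal_job(jobid_t job, int sig);   /* hypothetical stand-in */

    static void relay_signal(int signum)
    {
        /* propagate the caught signal out to every process in the job */
        signal_job(our_jobid, signum);
    }

    static void install_sigusr_traps(void)
    {
        struct sigaction sa;

        sa.sa_handler = relay_signal;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGUSR1, &sa, NULL);
        sigaction(SIGUSR2, &sa, NULL);
    }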
I will be contacting people directly to seek help with testing the changes on more environments. Other than compile issues, you should see absolutely no change in behavior on any of your systems - this additional functionality is transparent to anyone who does not issue a SIGUSR1/2 to mpirun.
Ralph
This commit was SVN r10258.
Implement the --prefix option as discussed on the devel-core mailing list. The Big
Difference is that instead of hard-coding the strings "/lib" and
"/bin" in to append to the prefix, we append the basename of the local
libdir and bindir. Hence, if your libdir is $prefix/lib64, we'll
append /lib64 to construct the remote node's LD_LIBRARY_PATH (etc.).
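A minimal sketch of that construction, assuming a hypothetical helper name (remote_libdir is not the actual orterun code):

    #include <libgen.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Build "<prefix>/<basename of local libdir>", e.g.
     * remote_libdir("/opt/ompi", "/usr/local/lib64") -> "/opt/ompi/lib64" */
    char *remote_libdir(const char *prefix, const char *local_libdir)
    {
        char *copy = strdup(local_libdir);  /* basename() may modify its argument */
        char *base = basename(copy);        /* e.g. "lib64" */
        size_t len = strlen(prefix) + strlen(base) + 2;
        char *result = malloc(len);
        snprintf(result, len, "%s/%s", prefix, base);
        free(copy);
        return result;
    }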
Also amended the orterun.1 man page to include a description of
--prefix, how it is constructed, what it handles / what it does not,
etc.
This commit was SVN r9930.
Perform the searches for argv[0] and the cwd on the target node (i.e., the node
where the executable will be running in all systems except BProc, where the
searches are run on the node where orterun is invoked).
- fork pls now does cwd and argv[0] search in orted
- bproc pls does cwd and argv[0] search in orterun
- cwd behavior slightly different (a sketch follows this list):
- if user specifies a -wdir to orterun, we chdir() to there; if we
can't for some reason, abort
- if user does not specify a -wdir, try to chdir() to the dir where
orterun was invoked. If we can't for some reason (e.g., it
doesn't exist on the target node), then try to chdir($HOME). If
we can't do that, then just live with whatever default directory
we were put in.
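A minimal sketch of that policy, assuming illustrative names (set_cwd is not the actual orted function):

    #include <stdlib.h>
    #include <unistd.h>

    /* Returns 0 on success; nonzero means a user-requested -wdir could
     * not be honored, which aborts the launch. */
    int set_cwd(const char *user_wdir, const char *orterun_cwd)
    {
        const char *home;

        if (NULL != user_wdir) {
            return chdir(user_wdir);   /* user said -wdir: failure is fatal */
        }
        if (0 == chdir(orterun_cwd)) {
            return 0;                  /* same dir orterun was invoked from */
        }
        home = getenv("HOME");
        if (NULL != home) {
            (void) chdir(home);        /* best effort; ignore failure */
        }
        return 0;                      /* live with whatever default dir we have */
    }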
This commit was SVN r9068.
- move files out of the toplevel include/ and etc/, moving them into the
sub-projects
- rather than including config headers with <project>/include,
have them as <project>
- require all headers to be included with a project prefix, with
the exception of the config headers ({opal,orte,ompi}_config.h
mpi.h, and mpif.h)
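For illustration (the path below is an example of the pattern, not a claim about any specific file):

    #include "opal_config.h"      /* config headers keep no prefix */
    #include "opal/util/path.h"   /* everything else is project-prefixed */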
This commit was SVN r8985.
Fix how we determine whether we are oversubscribed on a node, and thus whether to call sched_yield or not.
The value of node->node_slots_inuse does not currently represent the number of
slots actually in use. This is actually a bug in the RAS/RMAPS base components,
but the fix for that specific bug is bigger than we want to address at the
moment (though we will certainly do so in the near future).
Since we cannot trust this value, use the total number of mapped processes
(which was properly set by the RMAPS component upon mapping -- Just not
properly propagated back to the registry's node segment) from the process
mapping.
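In sketch form, with illustrative names:

    #include <stdbool.h>
    #include <stddef.h>

    /* decide oversubscription from the number of procs the map actually
     * placed on the node, not from the untrustworthy node_slots_inuse */
    bool node_oversubscribed(size_t procs_mapped_to_node, size_t node_slots)
    {
        /* more mapped procs than slots: procs should call sched_yield() */
        return procs_mapped_to_node > node_slots;
    }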
In addition to this change, I cleaned up a couple of the debug messages. It
seems that TM and RSH are the only two directly affected by this. SLURM
would be if that section of code weren't currently inactive, but I put the fix
in for posterity.
This commit was SVN r7743.
Merged the tmp/jjhursey-rmaps branch into the trunk with the following command:
svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps .
(where "." is a trunk checkout)
The logs from this branch are much more descriptive than I will put
here (including a *really* long description from last night). Here's
the short version:
- fixed some broken implementations in ras and rmaps
- "orterun --host ..." now works and has clearly defined semantics
(this was the impetus for the branch and all these fixes -- LANL had
a requirement for --host to work for 1.0)
- there is still a little bit of cleanup left to do post-1.0 (we got
correct functionality for 1.0 -- we did not fix bad implementations
that still "work")
- rds/hostfile and ras/hostfile handshaking
- singleton node segment assignments in stage1
- remove the default hostfile (no need for it anymore with the
localhost ras component)
- clean up pls components to avoid duplicate ras mapping queries
- [possible] -bynode/-byslot being specific to a single app context
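For example (hostnames are illustrative):

    orterun --host node1,node2 -np 2 ./my_app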
This commit was SVN r7664.
This allows the user to specify certain options to srun when an application
is launched with this PLS.
A useful example is the need to set the time to wait between when the first
process completes and when slurm kills the remaining processes:
pls_slurm_args=--wait=1200
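It can be set like any other MCA parameter, e.g. on the command line or in the environment:

    mpirun --mca pls_slurm_args "--wait=1200" -np 16 ./my_app
    export OMPI_MCA_pls_slurm_args=--wait=1200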
This commit was SVN r7206.
it to be an exit.
* Put the srun process (or what is about to become the srun process) in
its own process group so that group-wide signals (such as the
SIGINT sent by hitting ctrl-c in a shell) are not sent to the srun
process.
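A minimal sketch of the idea, under the assumption that the real pls code differs in detail:

    #include <sys/types.h>
    #include <unistd.h>

    /* Fork, place the child in its own process group, then exec srun.
     * A ctrl-c in the shell signals orterun's process group, so srun
     * (in its own group) no longer receives the SIGINT. */
    pid_t launch_srun(char *const argv[])
    {
        pid_t pid = fork();
        if (0 == pid) {
            setpgid(0, 0);          /* child: become its own group leader */
            execvp("srun", argv);
            _exit(127);             /* only reached if the exec failed */
        }
        if (pid > 0) {
            setpgid(pid, 0);        /* parent sets it too, to close the race */
        }
        return pid;
    }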
This commit was SVN r7068.