openmpi

Author	SHA1	Message	Date
Brian Barrett	414a87b14c	On OS X, the lowest 8 bits of the exit status are for signals, so pushing rc (which is -1 or 4 if we hit this case) resulted in an odd error that a signal killed the proc (instead of a startup error, as is reality). Instead, use the W_EXITCODE macro (if available) to build up an exit code that has an error code for exit status, but does not make it look like the process died from a signal. This commit was SVN r12890.	2006-12-18 02:30:05 +00:00
Ralph Castain	0a5d41857a	Complete next round of message size reduction: "strip" the descriptive info from the returned values. I have now added a flag to the gpr address mode (ORTE_GPR_STRIPPED) that instructs the gpr to not include segment names or tokens in the returned gpr_value_t objects. I found only two places that were looking at the tokens: 1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message. 2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was always the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "stripped" the value structures of tokens and segment strings, and (c) correctly obtained process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes). This commit was SVN r12813.	2006-12-09 23:10:25 +00:00
Ralph Castain	a1153fdc8f	Eliminate virtually all of the attribute_predefined data from the STG1 message. We now compute the total number of slots allocated to us and save that in the registry - the attributed_predefined then retrieves it via the STG1 message. The app_num is passed via the process_info structure, which gets the value from the ODLS in the environment. Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available. This commit was SVN r12788.	2006-12-07 03:11:20 +00:00
Ralph Castain	d4bd60c9fe	Restore the paffinity capability, along with all the required logic to ensure we "do the right thing" when the user gives us inaccurate information about the number of slots on a remote node. This commit was SVN r12780.	2006-12-06 15:59:34 +00:00
Ralph Castain	b1e16fffac	Add the C++ doo-hicky stuff around the odls framework definitions just in case somebody, somewhere, on some remote planet where only goats can feed needs it. This commit was SVN r12777.	2006-12-06 13:58:04 +00:00
Ralph Castain	8ca415a0c5	Remove duplicate orte_odls declaration. This commit was SVN r12776.	2006-12-06 13:44:41 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Gleb Natapov	f0132b2499	Provide parameters in a correct order (processor/oversubscribed was swapped). This commit was SVN r12737.	2006-12-04 12:55:45 +00:00
Ralph Castain	897744cdeb	Two major changes to the runtime, plus two smaller additions: 1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1". 2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial". 3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1". 4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer). This commit was SVN r12722.	2006-12-01 22:30:39 +00:00
Ralph Castain	0398c9e0c5	Correctly setup the sched_yield when launching processes via the orteds. This still doesn't adjust the yield schedule "on-the-fly" as more procs are dynamically added to a node - it just sets it when they are first launched. This commit was SVN r12683.	2006-11-28 08:27:20 +00:00
Ralph Castain	33affed09c	Bring ortehalt to a preliminary capability. It will correctly order a persistent daemon to exit cleanly. Need to now interface it to orterun, clean up a few things here and there. This commit was SVN r12626.	2006-11-18 04:47:51 +00:00
Ralph Castain	f771cc4fbd	Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option. Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution. This commit was SVN r12619.	2006-11-17 19:06:10 +00:00
Ralph Castain	ca5b4358fa	Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too. Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone. This commit was SVN r12616.	2006-11-17 02:58:46 +00:00
Ralph Castain	5ddcb8a652	Ensure the orted kills all local procs when exiting. Add a little clarity to some of the debugging output. This commit was SVN r12615.	2006-11-16 21:15:25 +00:00
Ralph Castain	f7fc19a2ca	Create the ability to re-use existing daemons. Included in the commit: 1. new functionality in the pls base to check for reusable daemons and launch upon them 2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED 3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse". This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on. This commit was SVN r12608.	2006-11-15 21:12:27 +00:00
Ralph Castain	6d6cebb4a7	Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things). Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it. I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn). This commit was SVN r12597.	2006-11-14 19:34:59 +00:00
Ralph Castain	f95e20e2e1	Add another test program - an MPI app that just spins. This supports testing of system response to signal-terminated processes. Add some debugger output to the ODLS default component. Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time. This commit was SVN r12585.	2006-11-13 21:51:34 +00:00
Ralph Castain	4636125e2d	Modify the RMGR components to allow job setup with a given jobid, and add another attribute so that we can setup triggers without launching. Add some debugging output to the ODLS default module, and the orted. Remove the nodename data from the ODLS info report - that info is already stored in the registry by the RMAPS framework upon completing the mapping procedure. Add another test program that does an ORTE-only dynamic spawn (gasp!). Looks just like comm_spawn - just no MPI involved. Modify the ODLS to release the processor when we "kill" local procs in a more scalable fashion. It previously had a sleep in it that Jeff's prior commit removed. However, he introduced some Windows code into the non-Windows component (protected by "if"s, but unnecessary). This is a more general solution he proposed - included here so I could get things to compile properly. This commit was SVN r12579.	2006-11-13 18:51:18 +00:00
Jeff Squyres	bfdf801487	Replace a "sleep(1)" with a yield so that the orted can reap processes much faster. This commit was SVN r12575.	2006-11-13 12:45:03 +00:00
Ralph Castain	4e50cdae52	This commit accomplishes two things: 1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs. 2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system. This commit was SVN r12558.	2006-11-11 04:03:45 +00:00
Ralph Castain	d182ae7472	Clean up a few compiler warnings courtesy of Jeff. This commit was SVN r12430.	2006-11-03 20:45:22 +00:00
Ralph Castain	c07d4e2510	Cleaner rendition now extended to other environments. Remove MCA params for backend procs that can cause trouble. Specifically, any directives on the selection of components for RDS, RAS, RMAPS, PLS, and RMGR can be bad mojo on the backend. This patch will cause a problem for cnos, however, as there we want to specifically tell the backends to be "null". I'm working on that issue. This commit was SVN r12225.	2006-10-20 16:50:13 +00:00
Ralph Castain	02efd07b60	Fix the MCA param passing issue, at least for rsh at the moment. I will clean this up and move it to the other environments once I shift back to a local computer. This commit was SVN r12224.	2006-10-20 15:27:29 +00:00
George Bosilca	2aa3e51223	Nothing relevant. Only a set of castings to have a clean compile on Windows. The cl.exe compiler is pretty good at complaining about any kind of non explicit cast. This commit was SVN r12207.	2006-10-20 02:25:50 +00:00
George Bosilca	7982a23bde	ORTE_DECLSPEC should be ... This commit was SVN r12206.	2006-10-20 02:23:54 +00:00
Ralph Castain	ab196c3121	Okay, this fixes the problem of MCA params spreading too far. Sorry for the multiple corrections. This commit was SVN r12201.	2006-10-19 22:51:02 +00:00
George Bosilca	c788f0dd51	Apparently, the MCA params are being loaded into the environment params of the individual app_contexts as well. Clear them of "bad" ones. This commit was SVN r12198.	2006-10-19 21:19:08 +00:00
Ralph Castain	d0d3d7fd41	Apparently, the MCA params are being loaded into the environment params of the individual app_contexts as well. Clear them of "bad" ones. This commit was SVN r12197.	2006-10-19 20:49:35 +00:00
Brian Barrett	29c91cf2f3	* Fix issue in odls_bproc where we were using vpid instead of the number of processes launched locally for the stdio file names. This was causing the expected files to not exist and bproc_vexecmove_io to fail. * Clean up a bunch of debugging output in the bproc pls. This commit was SVN r12102.	2006-10-11 20:34:12 +00:00
Ralph Castain	f91a95b3fe	Fix the bug that caused mpirun to hang when a remote executable wasn't found using the rsh launcher. Will now test on a remote node. This commit was SVN r12095.	2006-10-11 18:43:13 +00:00
Ralph Castain	27e305347c	Add a couple of options to orterun that support debugging of daemons for memory corruption. Ensure that the environment provided to local application processes isn't "polluted" by the orteds. This commit was SVN r12087.	2006-10-11 15:18:57 +00:00
Ralph Castain	ae79894bad	Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue. I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there). Gridengine compiles but I cannot test (believe it likely will run). Poe and xgrid compile to the extent they can without the proper include files. This commit was SVN r12059.	2006-10-07 15:45:24 +00:00
Ralph Castain	ee0df85ece	Add a stupid, useless const since someone put it in the odls.h file. This commit was SVN r12054.	2006-10-07 01:42:23 +00:00
George Bosilca	017af37291	Keep only the useful _DECLSPEC and _DECLSPEC some globals. This commit was SVN r12042.	2006-10-06 07:24:34 +00:00
George Bosilca	b7579b09c7	Correctly handle the ORTE_DECLSPEC attribute. This commit was SVN r12041.	2006-10-06 07:08:17 +00:00
George Bosilca	422ce1d3f8	If we are in the ODLS framework then the types should be called odls not pls. This commit was SVN r12040.	2006-10-06 07:05:58 +00:00
George Bosilca	7fed79434e	Windows is now able to create local processes. This commit was SVN r12039.	2006-10-06 07:04:43 +00:00
George Bosilca	fa01b9b9aa	Last step for the name reversion (from windows back to process). This commit was SVN r12014.	2006-10-05 06:36:11 +00:00
George Bosilca	b7a793a6db	Rename all the files in the process directory. This commit was SVN r12013.	2006-10-05 06:34:30 +00:00
George Bosilca	fee0909815	Rename the Windows component. This commit was SVN r12012.	2006-10-05 06:32:57 +00:00
George Bosilca	fd76e56279	One protection against C++ compilers is more than enough. This commit was SVN r11998.	2006-10-05 05:24:43 +00:00
Ralph Castain	559b9b0ae8	Continue beating on comm_spawn. Setup to debug bproc. This commit was SVN r11932.	2006-10-02 14:58:22 +00:00
Brian Barrett	3c814fdd23	fixes trac:391 Fix for double mutex free that would cause an abort condition in the orted whenever threads were enabled. This commit was SVN r11759. The following Trac tickets were found above: Ticket 391 --> https://svn.open-mpi.org/trac/ompi/ticket/391	2006-09-22 19:24:42 +00:00
Ralph Castain	977e3c5ca1	Let's see if Cyrador understands this version a little better... This commit was SVN r11709.	2006-09-19 13:05:40 +00:00
Ralph Castain	d7e61e40fc	Quiet a few warnings from Cyrador. This commit was SVN r11686.	2006-09-18 12:40:42 +00:00
George Bosilca	4fe39a4e7d	The old PLS is now called a ODLS. However, the real name is not windows but process. This change will follow shortly... This commit was SVN r11663.	2006-09-14 22:22:34 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
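
Note on commit 414a87b14c (SVN r12890): the exit-status problem it describes can be illustrated with a small sketch. This is not the Open MPI code. The helper `w_exitcode` is a hypothetical Python stand-in for the C `W_EXITCODE(ret, sig)` macro, assuming the common layout with the exit code in bits 8-15 and the signal number in the low bits; `describe` is also a name invented for this illustration. It shows why a raw value such as 4 decodes as "killed by a signal" while the packed value decodes as a normal exit with an error code.

```python
import os


def w_exitcode(ret: int, sig: int = 0) -> int:
    """Hypothetical stand-in for C's W_EXITCODE(ret, sig).

    Assumes the common wait-status layout: exit code in bits 8-15,
    terminating signal number in the low 7 bits.
    """
    return ((ret & 0xFF) << 8) | (sig & 0x7F)


def describe(status: int) -> str:
    """Decode a wait status with the standard os.W* helpers (Unix only)."""
    if os.WIFSIGNALED(status):
        return f"killed by signal {os.WTERMSIG(status)}"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return "other"


# Passing the raw return code as a wait status puts it in the signal bits.
print(describe(4))

# Packing it first keeps the signal bits clear and the error code intact.
print(describe(w_exitcode(4)))
```

On a typical Unix system the first call reports a signal and the second reports an exit code, which matches the behaviour the commit message describes.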


297 commits