openmpi

Author	SHA1	Message	Date
Jeff Squyres	017a4acceb	Missed these during r19141 This commit was SVN r19151. The following SVN revision numbers were found above: r19141 --> open-mpi/ompi@b83ee7d82a	2008-08-04 20:10:55 +00:00
Jeff Squyres	b83ee7d82a	* Fix a problem with VPATH builds if the destination directory didn't already exist * s/top_srcdir/top_builddir/ in a bunch of places; left over from the previous man page generation system This commit was SVN r19141.	2008-08-04 15:17:50 +00:00
Ralph Castain	381d10833a	Remove comments in documentation about the "-nw" option to mpirun as this option doesn't exist, and probably never will This commit was SVN r19137.	2008-08-04 14:38:37 +00:00
Dan Lacher	9175da1e02	Putback for all changes to automate man page updates to strings of versions, dates and build names. Fixes trac:1387 Big thanks to Jeff and Brian for help and oversight. This commit was SVN r19120. The following Trac tickets were found above: Ticket 1387 --> https://svn.open-mpi.org/trac/ompi/ticket/1387	2008-08-01 21:14:37 +00:00
Jeff Squyres	4bdc093746	Fixes trac:1361: mainly add new internal MCA parameter that orterun will set when it launches under debuggers using the --debug option. This commit was SVN r19116. The following Trac tickets were found above: Ticket 1361 --> https://svn.open-mpi.org/trac/ompi/ticket/1361	2008-07-31 22:11:46 +00:00
Ralph Castain	a62b2a0150	Per the July technical meeting: Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways. Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument. For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on your cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work. This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do. To solve this problem, we did the following: 1. removed all mca params from the individual plms for the launch agent 2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted". 3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values 4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry. 5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct. This commit was SVN r19097.	2008-07-30 18:26:24 +00:00
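The launch-agent handling described in r19097 can be illustrated with a small Python sketch. It is not the ORTE code: `build_agent_argv` and `join_range` are hypothetical names, the rule "prefix the entry that is exactly `orted`" is an assumption about what "the correct argv entry" means, and the half-open index range is also an assumption rather than the documented behavior of opal_argv_join_range.

```python
import shlex


def join_range(argv, start, end, delimiter=" "):
    """Join argv[start:end] into one string (end is exclusive in this sketch)."""
    return delimiter.join(argv[start:end])


def build_agent_argv(launch_agent="orted", prefix=None):
    """Split a multi-word launch agent and apply the prefix to the daemon only."""
    argv = shlex.split(launch_agent)
    if prefix is not None:
        for i, word in enumerate(argv):
            # Assumption: the daemon is the entry that is exactly "orted".
            if word == "orted":
                argv[i] = f"{prefix}/{word}"
                break
    return argv


# The case from the commit message: "valgrind orted" with "--prefix foo".
argv = ["ssh", "node1"] + build_agent_argv("valgrind orted", prefix="foo")
print(join_range(argv, 0, len(argv)))  # prints: ssh node1 valgrind foo/orted
```

The point of the sketch is that the prefix lands on the daemon entry, not on the first word, so the wrapper ("valgrind") is left untouched.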
Ralph Castain	d45d728e8e	Allow debuggers to attach to a running mpirun by -always- setting up the MPIR_Proctable. Only wait for MPIR_Breakpoint and hold MPI procs if we are launching under a debugger. This commit was SVN r19079.	2008-07-29 17:39:16 +00:00
Ralph Castain	3107545709	Ensure that ORTE processes such as mpirun and orted never inadvertently bind themselves to cores. Change the mca param name used by the rank_file mapper to get user directives on slot lists to be different from that used by MPI procs to discover their binding. Add a cmd line option to orterun to make it easier for a user to specify the slot list (basically, hide the mca param name). Discussed and reviewed with Lenny and Jeff. This commit was SVN r19062.	2008-07-28 14:18:36 +00:00
Rolf vandeVaart	ed4920ba5f	Fix a couple problems with orte-clean. Also add a new --debug flag to help developers figure out possible future issues. This fixes trac:1335. This commit was SVN r18979. The following Trac tickets were found above: Ticket 1335 --> https://svn.open-mpi.org/trac/ompi/ticket/1335	2008-07-22 17:41:06 +00:00
Rolf vandeVaart	84ede78e40	Clarify use of --path option to mpirun. This commit fixes trac:1231. This commit was SVN r18909. The following Trac tickets were found above: Ticket 1231 --> https://svn.open-mpi.org/trac/ompi/ticket/1231	2008-07-14 21:27:23 +00:00
Ralph Castain	1edcfa970e	Remove debug This commit was SVN r18845.	2008-07-09 00:12:11 +00:00
Ralph Castain	f1114b4144	Upgrade the ability of orterun to deal with cmd line MCA params that are passed to the orteds. Help reduce the size of the cmd line by eliminating duplicates where possible, and alert to duplicate entries that can cause problems. Add comments to both orterun and orted code explaining why we take a snapshot of the local environment and apply it to the local procs when they are spawned. This commit was SVN r18842.	2008-07-08 22:36:39 +00:00
Ralph Castain	51da9f2980	Properly ensure that cmd line MCA params override environmental MCA params for procs local to mpirun. Actually, the problem was that we were simply -adding- any enviro MCA params to whatever had been found on the cmd line. Thus, duplicate MCA param directives were winding up duplicated in the environment. Some shells took the first one in the environ array - others took the last! So we could get completely different behavior based on the whims of the shell. This commit fixes trac:1373 This commit was SVN r18836. The following Trac tickets were found above: Ticket 1373 --> https://svn.open-mpi.org/trac/ompi/ticket/1373	2008-07-08 13:48:47 +00:00
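The precedence rule from r18836 (cmd line MCA params override environmental ones, with no duplicate entries left for the shell to pick from) can be sketched as follows. This is a Python illustration, not the orterun code; `merge_mca_params` is a hypothetical helper, and the only detail taken from Open MPI itself is the `OMPI_MCA_` environment variable prefix.

```python
def merge_mca_params(env_params, cmdline_params):
    """Return one environment entry per MCA param, cmd line values winning.

    Both arguments are lists of (name, value) pairs.
    """
    merged = dict(env_params)
    # Later assignments replace earlier ones, so the cmd line wins.
    merged.update(cmdline_params)
    return [f"OMPI_MCA_{name}={value}" for name, value in merged.items()]


env = [("btl", "tcp,self"), ("orte_debug", "1")]
cmdline = [("btl", "sm,self")]
print(merge_mca_params(env, cmdline))
# prints: ['OMPI_MCA_btl=sm,self', 'OMPI_MCA_orte_debug=1']
```

Because each name appears exactly once in the result, the outcome no longer depends on whether a shell reads the first or the last matching entry.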
Lenny Verkhovsky	23da11fdcc	add some documentation for mpirun's new --loadbalance option closing #1277 This commit was SVN r18832.	2008-07-08 07:56:54 +00:00
Lenny Verkhovsky	489c22b6b1	Added -rf|--rankfile <arg0> option to mpirun -h + man info This commit was SVN r18812.	2008-07-07 13:46:22 +00:00
Lenny Verkhovsky	2fae770a86	information about carto framework in mpirun man page, ticket #1372 This commit was SVN r18809.	2008-07-07 12:02:10 +00:00
Ralph Castain	ba5498cdc6	Repair the MPI-2 dynamic operations. This includes: 1. repair of the linear and direct routed modules 2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI 3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature 4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes. Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB 5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept. Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules. 6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch 7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch 8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies 9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc. 10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon There is still more cleanup to be done for efficiency, but this at least works. Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn. Fixes ticket #1256 This commit was SVN r18804.	2008-07-03 17:53:37 +00:00
Ralph Castain	b118779c08	It is okay for us to init the ORTE mca params multiple times. Indeed, it is absolutely required by orterun as the first time has to be done prior to parsing the command line, which means that the mca values haven't been parsed yet! Add ability for sys admins to prohibit putting session directories under specified locations. Thus, they can now protect parallel file systems from foolish user mistakes. This commit was SVN r18721.	2008-06-24 17:50:56 +00:00
Jeff Squyres	24c3aa1d77	Really fix "make dist". Really. This commit was SVN r18704.	2008-06-21 18:04:38 +00:00
Jeff Squyres	930667ac73	Ensure that orte-checkpoint and orte-restart man pages are always included in the distribution tarball. This ''appears'' to be an Automake bug -- I have submitted a bug report to the bug-automake list: http://lists.gnu.org/archive/html/bug-automake/2008-06/msg00019.html This commit was SVN r18696.	2008-06-20 18:19:01 +00:00
Jeff Squyres	e4172a3c44	Shift the AM "if" logic down from orte/tools/Makefile.am to the individual orte/tools/*/Makefile.am files. This causes "make" to traverse into every directory, even if it's not going to build anything in that directory (which is a good thing). It also helps cleanup and dist issues. This also affects orte-checkpoint and orte-restart, but I couldn't get --with-ft to compile properly; I'll pass along a heads-up to Josh to ensure that I didn't break anything. This commit was SVN r18680.	2008-06-19 14:46:10 +00:00
Ralph Castain	571f483c39	Ensure that we don't breakpoint the debugger until -after- all procs have reported their contact info so we can successfully send the release message This commit was SVN r18678.	2008-06-19 14:37:46 +00:00
Jeff Squyres	7e45b24001	MPIR_being_debugged is an int, not a bool. This commit was SVN r18676.	2008-06-19 13:31:34 +00:00
Ralph Castain	b56f8ced4f	Ensure params are registered prior to parsing global cmd line options in orterun so that debugger options are properly captured and acted upon. Ensure that routes to remote procs are set on the HNP before completing launch so that the debugger message can be sent. Solves a race condition that can exist in those environments where the HNP does not have local procs. This commit was SVN r18674.	2008-06-19 02:58:14 +00:00
Ralph Castain	282a220e7e	Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation This commit was SVN r18670.	2008-06-18 15:28:46 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Brian Barrett	7712b07ac4	Add perl based wrapper compilers for cross-compile environments. The default is still to use the C based wrapper compilers (which have many more features and are more well tested). The Perl compilers are enabled with the option --enable-script-wrapper-compilers, which also ignores the option --disable-binaries (ie --enable-script-wrapper-compilers --disable-binaries will result in perl-based wrapper compilers being installed, but no other binaries being installed). This commit was SVN r18655.	2008-06-13 22:52:25 +00:00
Ralph Castain	1f41069ac9	Fix CID 752 - if we can't find the daemon job object, we have to ensure we exit without attempting to dereference it This commit was SVN r18647.	2008-06-11 14:49:58 +00:00
Ralph Castain	13ea4e4673	Be consistent - since we don't strdup the other values for param, don't strdup this one. This allows r18645 to fix the memory corruption issue, but also allows us to resolve the memory leaks cited by CID 1039 This commit was SVN r18646. The following SVN revision numbers were found above: r18645 --> open-mpi/ompi@53d83ba1c5	2008-06-11 14:42:47 +00:00
Pak Lui	53d83ba1c5	Take out a couple of free's. This commit fixes trac:1343 This commit was SVN r18645. The following Trac tickets were found above: Ticket 1343 --> https://svn.open-mpi.org/trac/ompi/ticket/1343	2008-06-11 14:02:49 +00:00
Ralph Castain	1a422995ae	Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak. This commit was SVN r18641.	2008-06-10 17:53:28 +00:00
Ralph Castain	c13cadc3c7	Refs trac:1255 This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something. So the ticket can be checked by Jeff when he returns from vacation... :-) This commit was SVN r18625. The following Trac tickets were found above: Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255	2008-06-09 20:34:14 +00:00
Ralph Castain	2cc8b2c51f	Add yet another test, this one for proper error behavior when someone calls an MPI function after calling MPI_Finalize. Add a minor debug that outputs the orterun exit status to stderr when orte_debug is set. This commit was SVN r18622.	2008-06-09 19:21:20 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Ralph Castain	83dd3d8c6f	Restore the ability to forcibly terminate by providing multiple ctrl-c's This commit was SVN r18618.	2008-06-09 13:08:54 +00:00
Ralph Castain	0da811ce79	Initial work on xml support - allocation and job map outputs completed. More to come. This commit was SVN r18587.	2008-06-04 20:53:12 +00:00
Josh Hursey	78f14b5255	Fix the none.checkpoint command. orte-checkpoint/orte-restart do not seem to totally like orte_output, so revert them to opal_output for now. Since we have no need for the additional complexity of orte_output we can drop it for now and revisit this if anyone needs it later. It seems that if you set the verbose level on an output handle then try to call a normal orte_output() on it then the message will not be printed. This is the same for opal_output, and seems incorrect to me because it stops some error messages from being printed out if you do not directly specify opal_output(0, ...). Maybe someone should take a look at this. orte-checkpoint would segv if passed an incorrect PID. Fixed the return code so it errors out properly. Thanks to Eric Roman for bringing this to my attention. This commit was SVN r18583.	2008-06-04 14:44:11 +00:00
Jeff Squyres	a4db97c213	More man pages fixes from our Debian Open MPI package maintainer friends. Woo hoo! This commit was SVN r18559.	2008-06-03 16:44:40 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	b456fb2d42	Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2*heartrate, declares orted failure if not seen in last 2*heartrate time. Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher. Neither detection is enabled by default, but are only active if heartrate is set and/or launch timeout is set. Exception for SLURM as orted failure is always detected and reported. More info to come on devel list. This commit was SVN r18555.	2008-06-02 21:46:34 +00:00
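The heartbeat rule in r18555 amounts to a simple timeout check, sketched here in Python. This is an illustration only: `failed_daemons` and its arguments are hypothetical names, and the sketch reads the unit lost from the commit message as a window of twice the heart rate.

```python
def failed_daemons(last_beat, now, heartrate):
    """Return the ids of daemons not heard from within 2 * heartrate.

    last_beat maps a daemon id to the time of its most recent heartbeat.
    A heartrate of zero or less means detection is disabled, which matches
    the commit's note that detection is off unless the heart rate is set.
    """
    if heartrate <= 0:
        return []
    window = 2 * heartrate
    return sorted(d for d, seen in last_beat.items() if now - seen > window)


beats = {1: 10.0, 2: 3.0}
print(failed_daemons(beats, now=12.0, heartrate=2.0))  # prints: [2]
```

The checker would run once per window; daemon 2 was last heard 9 seconds ago against a 4 second window, so it is declared failed while daemon 1 is still considered alive.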
Ralph Castain	2772221ae0	Add the -xml option to mpirun to indicate that xml output is desired This commit was SVN r18539.	2008-05-29 14:11:31 +00:00
Ralph Castain	72530f8fed	Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed: 1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done. 2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear. This represents the best-available fix for ticket #221 so I am closing that ticket at this time. This commit was SVN r18536.	2008-05-29 13:38:27 +00:00
Jeff Squyres	af1a9290cb	Ensure to check that opal_init_util() didn't fail. This commit was SVN r18453.	2008-05-19 11:58:48 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 2. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not the opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). * Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing to messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. = New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
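The help-message aggregation described for orte_show_help() in r18434 (show the first copy, count duplicates, report the count roughly every 5 seconds) can be sketched in Python. This is a model of the behavior as described, not the HNP implementation: the class, its method names, the wording of the summary line, and the file and topic names used in the usage lines are all hypothetical.

```python
class HelpAggregator:
    """Display each help message once and summarize later duplicates."""

    def __init__(self, interval=5.0):
        self.interval = interval  # seconds between duplicate summaries
        self.seen = set()         # (filename, topic) pairs already shown
        self.pending = {}         # (filename, topic) -> suppressed copies
        self.last_flush = None    # time of the last summary (or first message)

    def report(self, filename, topic, text, now):
        """Handle one incoming help message; return the lines to display."""
        key = (filename, topic)
        if self.last_flush is None:
            self.last_flush = now
        if key not in self.seen:
            self.seen.add(key)
            return [text]
        self.pending[key] = self.pending.get(key, 0) + 1
        return self.flush(now)

    def flush(self, now):
        """Return summary lines if the interval elapsed and copies are pending."""
        if not self.pending or now - self.last_flush < self.interval:
            return []
        lines = [
            f"{count} more copies of help message {name}:{topic}"
            for (name, topic), count in self.pending.items()
        ]
        self.pending.clear()
        self.last_flush = now
        return lines


agg = HelpAggregator()
print(agg.report("help-example.txt", "no-init", "MPI was not initialized", 0.0))
print(agg.report("help-example.txt", "no-init", "MPI was not initialized", 1.0))
print(agg.flush(6.0))
```

Time is passed in explicitly so the behavior is easy to test; a real aggregator would drive `flush` from a timer event instead.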
Ralph Castain	3e55fe6f6d	Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit. Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs. Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node. This commit was SVN r18338.	2008-04-30 19:49:53 +00:00
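The nodename encoding mentioned in r18338 is not spelled out in the commit message, so the following Python sketch only shows one plausible way to exploit contiguous numbering: a list such as node008, node009, node010 collapses to a prefix, a digit width, a first number, and a count. The function names and the tuple layout are assumptions, not the ORTE format.

```python
import re


def encode_nodenames(names):
    """Return (prefix, width, first, count), or None if not contiguous."""
    parsed = []
    for name in names:
        match = re.fullmatch(r"(.*?)(\d+)", name)
        if match is None:
            return None
        digits = match.group(2)
        parsed.append((match.group(1), len(digits), int(digits)))
    if not parsed:
        return None
    prefix, width, first = parsed[0]
    for offset, (p, w, n) in enumerate(parsed):
        # Every name must share the prefix and width and follow in sequence.
        if p != prefix or w != width or n != first + offset:
            return None
    return (prefix, width, first, len(parsed))


def decode_nodenames(encoded):
    """Expand the tuple produced by encode_nodenames back into names."""
    prefix, width, first, count = encoded
    return [f"{prefix}{n:0{width}d}" for n in range(first, first + count)]


names = ["node008", "node009", "node010"]
encoded = encode_nodenames(names)
print(encoded)  # prints: ('node', 3, 8, 3)
```

The encoded form has a fixed size regardless of how many nodes it covers, which is the saving the commit describes. This sketch gives up (returns None) on names without zero padding, such as node9 followed by node10.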
Ralph Castain	5311b13b60	Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system. This commit was SVN r18252.	2008-04-23 14:52:09 +00:00
Josh Hursey	cc83d41ad9	Merge in tmp/jjh-scratch {{{ svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch . }}} Contains: * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart. * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P. * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry * Some other sundry cleanup items all dealing with C/R functionality in the trunk. This commit was SVN r18241.	2008-04-23 00:17:12 +00:00
Ralph Castain	16c9100633	Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation. This commit was SVN r18216.	2008-04-20 02:25:45 +00:00
Josh Hursey	56a61bfacf	switch the name of orterun to mpirun to make things more clear. This commit was SVN r18208.	2008-04-18 12:59:23 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Ralph Castain	7b91f8baff	Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory. Fix the ompi-server -h cmd line option so it actually tells you something! Add two new testing codes to the orte/test/mpi area: accept and connect. This commit was SVN r18176.	2008-04-16 14:27:42 +00:00
Ralph Castain	5e6dc24e62	Fix ompi-server so it works with unity routed module - still not working with tree routing. Cleanup debug flag so it activates debugging on the data server code itself This commit was SVN r18080.	2008-04-04 19:17:28 +00:00
Ralph Castain	ce96cb4800	Quiet warning about uninitialized variable This commit was SVN r18033.	2008-03-31 13:52:27 +00:00
Galen Shipman	f1e3045296	need orted_LDFLAGS as a placeholder you will need to re autogen.sh This commit was SVN r17951.	2008-03-25 13:41:09 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Jeff Squyres	dee561d29e	Per recent off-list discussions about the build system, I have done some cleanups and standardizations in the various */tools/*/Makefile.am files. This commit: * Somewhat simplify the tool Makefile.am's * Makes the tool Makefile.am's consistent with each other (do similar actions in similar ways) * Update the tool Makefile.am's to remove old cruft that was required by older versions of AM (trunk requires AM >=1.10) This commit was SVN r17921.	2008-03-22 02:04:05 +00:00
Ralph Castain	6bb139e4f2	One more correction to mpirun exit codes - cleanup the application proc's exit codes in the orted so that non-zero exit codes generated by mpirun itself don't get "munged". Modify the multi_abort function so they all return different exit codes - allows us to tell which one was being reported. This commit was SVN r17895.	2008-03-20 13:54:11 +00:00
Ralph Castain	2ed0e60321	Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize. Jeff is preparing an (undoubtedly lengthy) explanation/matrix of how these codes are determined for the OMPI FAQ. This commit was SVN r17879.	2008-03-19 19:00:51 +00:00
Ralph Castain	629b95a2fe	Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway. So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked. Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion. This commit was SVN r17843.	2008-03-17 17:58:59 +00:00
Ralph Castain	57a72c412a	Utilize Tim M's suggestion and use atomics to do the locking. This commit was SVN r17767.	2008-03-06 21:36:32 +00:00
Ralph Castain	097cc83be2	Fix a race condition - ensure we don't call terminate in orterun more than once, even if the timeout fires while we are doing so This commit was SVN r17766.	2008-03-06 19:35:57 +00:00
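The "terminate only once" guard that r17766, r17767 and r17843 converge on can be modeled with a non-blocking try-lock. The Python sketch below uses `threading.Lock` as a stand-in for the atomic lock mentioned in the commits; `OnceGuard` and `terminate_job` are hypothetical names, and the real code is C using OPAL atomics.

```python
import threading


class OnceGuard:
    """Let exactly one caller through, no matter how many paths race here."""

    def __init__(self):
        self._lock = threading.Lock()

    def first_caller(self):
        # A non-blocking acquire succeeds only for the first caller; the
        # lock is never released, so every later caller gets False.
        return self._lock.acquire(blocking=False)


def terminate_job(guard, log):
    """Run the termination path once, even if a timeout fires it again."""
    if not guard.first_caller():
        return False
    log.append("terminating job")
    return True


guard = OnceGuard()
log = []
terminate_job(guard, log)  # normal termination path
terminate_job(guard, log)  # e.g. the timeout firing during termination
print(log)  # prints: ['terminating job']
```

Later callers return immediately instead of blocking, which matters when the second call comes from a timeout or signal path that must not wait on the first.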
Ralph Castain	3883bbee06	Fix bug - must not "free" tsd-allocated memory This commit was SVN r17754.	2008-03-06 03:10:14 +00:00
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124 The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Rolf vandeVaart	03fdd57d5a	Fix the use of --path and -x PATH so that things work properly. Note that --path specifies extra directories where the executable is searched for, but does not affect the PATH settings. This commit fixes trac:1221. This commit was SVN r17748. The following Trac tickets were found above: Ticket 1221 --> https://svn.open-mpi.org/trac/ompi/ticket/1221	2008-03-05 21:07:43 +00:00
Ralph Castain	06d3145fe4	First cut at direct launch for TM. Able to launch non-ORTE procs and detect their completion for a clean shutdown. This commit was SVN r17732.	2008-03-05 13:51:32 +00:00
Ralph Castain	edb8e32a7a	Add default hostfile parameter plus --default-hostfile command line option. Fix error message when job setup failed This commit was SVN r17724.	2008-03-05 04:54:57 +00:00
Ralph Castain	9413d6cf5d	Define a default exit code for when things fail prior to a job launch - still needs work, but a start. Fix a deadlock loop when things really, really go bad. If we timeout trying to kill the job, then it's time to bail as cleanly as possible, not go back and keep trying. This commit was SVN r17715.	2008-03-05 01:46:30 +00:00
Ralph Castain	a585923de1	Silence some minor compiler warnings This commit was SVN r17662.	2008-02-29 02:39:39 +00:00
George Bosilca	7879e0b9c2	Be nice with parallel debugger, export this required symbol. This commit was SVN r17637.	2008-02-28 05:59:07 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Sharon Melamed	025b68becf	Move the carto framework to the trunk. This commit was SVN r17177.	2008-01-23 09:20:34 +00:00
Jeff Squyres	b6e9c99f7d	Formatting fixes from Peter Breitenlohner. This commit was SVN r17163.	2008-01-18 23:21:31 +00:00
Josh Hursey	0bf61a1b84	Move in some accumulated small features and minor bug fixes for C/R support. {{{ svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs . }}} This commit was SVN r16478.	2007-10-17 13:47:36 +00:00
Ralph Castain	b6196e8a39	When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed. In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean. I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it. This commit was SVN r16451.	2007-10-15 18:00:30 +00:00
Josh Hursey	31e9369e8b	Fix orterun so it does not get influenced by an application's argv set. For example, if I have an application that, internal to the application, takes the argument '-mca foo bar' we do not want orterun to pick up this argument and pass it through the system. So the following {{{ shell$ mpirun -np 2 -mca btl tcp,self ./myapp -mca foo bar }}} orterun should pick up {{{-mca btl tcp,self}}} but not {{{-mca foo bar}}} which it was previous to this commit. I tested command line runs and runs with app files to confirm this patch works. This commit was SVN r16431.	2007-10-11 18:33:40 +00:00
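The argument handling described in r16431 can be illustrated with a minimal Python sketch. This is not orterun's actual parser: the `OPTION_ARITY` table and the `split_command_line` function are hypothetical names, and the sketch assumes only that launcher options are consumed up to the first token that is not a known option, which is then treated as the executable.

```python
# Hypothetical table: launcher option -> number of arguments it consumes.
OPTION_ARITY = {"-np": 1, "-mca": 2}


def split_command_line(argv):
    """Split argv into (launcher options, application argv).

    Parsing stops at the first token that is not a known launcher option,
    so anything after the executable is passed through untouched.
    """
    launcher_opts = []
    i = 0
    while i < len(argv) and argv[i] in OPTION_ARITY:
        width = 1 + OPTION_ARITY[argv[i]]
        launcher_opts.extend(argv[i:i + width])
        i += width
    return launcher_opts, argv[i:]


launcher, app = split_command_line(
    ["-np", "2", "-mca", "btl", "tcp,self", "./myapp", "-mca", "foo", "bar"]
)
```

Here `launcher` holds `-mca btl tcp,self` while the application's own `-mca foo bar` stays in `app`, which is the behavior the commit message describes.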
Josh Hursey	7437f37e96	This commit contains the following: * Fix some missing includes in a few places. * Add the cr_request() functionality to the BLCR CRS component. We are now dependent upon the 0.6.* series of BLCR. * Made the CR notification mechanism a registered function. This way we can have an OPAL-only version and it can be replaced at runtime with the ORTE version. * Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only CR functionality when the user wants it. Default: Disabled. * Fix the placement of a checkpoint request check in MPI_Init * Pull the OPAL notification mechanism into the SnapC framework. * We no longer fork/exec the 'opal-checkpoint' command for local checkpointing, the Local coordinator in the orted does this directly. * The Local and Application coordinator talk together bypassing the OPAL notification mechanism. * Optimized the Local <-> App Coordinator communication. * Improved the structure used to track vpid_snapshots in the local coord. * Fix a race condition in which an application under heavy communication load may produce an inconsistent global checkpoint. This commit was SVN r16389.	2007-10-08 20:53:02 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. 
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. 
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
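The routed-messaging savings estimate in r16364 can be reproduced as a short worked calculation. This sketch assumes "32k" means 32,000 and takes the ~50 bytes/process figure from the commit message at face value; the variable names are illustrative only.

```python
# Figures quoted in the commit message (k taken as 1000 for this estimate).
PROCS = 32_000
CONTACT_BYTES_PER_PROC = 50  # approximate OOB contact info per process

# Contact info for every proc, as formerly carried in one startup message.
per_recipient_bytes = PROCS * CONTACT_BYTES_PER_PROC

# That message was delivered to every proc in the job.
total_bytes = per_recipient_bytes * PROCS

print(f"{per_recipient_bytes / 1e6:.1f} MB per recipient")
print(f"{total_bytes / 1e9:.1f} GB across the job")
```

The two results correspond to the 1.6 MBytes and roughly 51 GBytes quoted in the message.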
George Bosilca	4e66376e66	Fix memory leak (Coverity 702). This commit was SVN r16122.	2007-09-13 20:11:38 +00:00
Rolf vandeVaart	a289ac114a	1. Remove some #ifdef 0 code. 2. Remove some unnecessary code that was causing a SEGV. There may be some more work to be done, but at least orte-clean is functional again. This commit was SVN r16111.	2007-09-12 19:50:58 +00:00
Shiqing Fan	dcee7e4229	- Should not use ORTE_DECLSPEC with initialization. This commit was SVN r16086.	2007-09-11 10:13:53 +00:00
Josh Hursey	5a029a47bd	forgot to separate the arguments This commit was SVN r15940.	2007-08-21 19:43:41 +00:00
Josh Hursey	db79f2392e	Make sure to enable C/R support for the HNP when restarting. This commit was SVN r15931.	2007-08-19 20:43:33 +00:00
Brian Barrett	fe0d1f30d5	need errno.h This commit was SVN r15862.	2007-08-15 02:15:33 +00:00
Brian Barrett	330003361b	* Free memory from asprintf * need to compare ERANGE to errno This commit was SVN r15860.	2007-08-14 21:12:00 +00:00
Brian Barrett	881dd0654e	* Provide a hook so that a PLS can tell the orted it's starting that it needs to override the default umask. By default, this is not used since most environments do what the user would expect without any help. * Have TM use the newly added umask hook, so that processes inherit the user's umask from mpirun rather than the pbs_mom's umask, which the user has no control over. This commit was SVN r15858.	2007-08-14 18:44:52 +00:00
George Bosilca	d658a477af	Update the help file to match the real name of the required argument. This commit was SVN r15762.	2007-08-04 00:35:55 +00:00
Shiqing Fan	0f468f3668	- Remove the solution and project files, will commit them later. This commit was SVN r15705.	2007-07-31 17:07:02 +00:00
Shiqing Fan	4d7b349cdb	- Add VC8 solution and project files. - If one wants to use this solution, remember to unload the project 'orte-restart' which is currently not working for Windows. This commit was SVN r15680.	2007-07-30 11:05:34 +00:00
Josh Hursey	a24e530f8e	Some C/R fixes (more to come) r15390 - Changed the paradigm in which the runtime worked by enabling the mpirun process to become an orted and spawn processes. This broke the C/R for this special case as it required that the orted start the process, and that the hierarchy remains. The fix was to allow the global coordinator to be a local coordinator as well for this case. r15528 - Changed the selection logic for the RML. This caused the application to segv if the 'ftrm' wrapper component was selected as it tried to modify a NULL pointer. The fix was to move the 'module swap' code into the init() function, and swap when passed a NULL pointer. It sounds bad, but actually cleans up the code a bit more. Still have to fix the 'routed' framework. This commit was SVN r15566. The following SVN revision numbers were found above: r15390 --> open-mpi/ompi@bd65f8ba88 r15528 --> open-mpi/ompi@39a6057fc6	2007-07-23 20:13:37 +00:00
Ralph Castain	d99c764e75	Resolve a problem where the orte daemon comm functions were being accessed by mpirun while still retaining occasional reference to the orted_globals. Remove all dependence on orted_globals from the comm functions. Move those functions back into their own file to make it easier to maintain the separation. Ensure that mpirun ignores any "exit" commands being sent to daemons as it will exit on its own. This commit was SVN r15562.	2007-07-23 18:36:33 +00:00
Brian Barrett	5b9fa7e998	reapply r15517 and r15520, which were removed in r15527 so that I could get the RML/OOB merge in slightly easier This commit was SVN r15530. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95 r15520 --> open-mpi/ompi@9cbc9df1b8 r15527 --> open-mpi/ompi@2d17dd9516	2007-07-20 02:34:29 +00:00
Brian Barrett	39a6057fc6	A number of improvements / changes to the RML/OOB layers: * General TCP cleanup for OPAL / ORTE * Simplifying the OOB by moving much of the logic into the RML * Allowing the OOB RML component to do routing of messages * Adding a component framework for handling routing tables * Moving the xcast functionality from the OOB base to its own framework Includes merge from tmp/bwb-oob-rml-merge revisions: r15506, r15507, r15508, r15510, r15511, r15512, r15513 This commit was SVN r15528. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r15506 r15507 r15508 r15510 r15511 r15512 r15513	2007-07-20 01:34:02 +00:00
Brian Barrett	2d17dd9516	temporarily back out r15517 and 15520 so that I can get the RML / OOB changes to cleanly apply This commit was SVN r15527. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95	2007-07-20 01:10:34 +00:00
Ralph Castain	41977fcc95	Remove the cellid field from the orte_process_name_t structure. This only affects a handful of files in itself, but... Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point. Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings. This commit was SVN r15517.	2007-07-19 20:56:46 +00:00
Ralph Castain	2110064a9a	Ensure that the LD_LIBRARY_PATH and PATH get properly set for procs locally spawned by mpirun. This commit was SVN r15516.	2007-07-19 19:00:06 +00:00
Josh Hursey	eeba2cb871	Add a comment to clarify the relationship between mca_base_cmd_line_process_args() and opal_init_util() so we do not forget their ordering needs, and subtle relationship. This commit was SVN r15412.	2007-07-13 19:08:05 +00:00
Ralph Castain	2bded34a1d	Fix a problem observed by Brian where processes launched local to mpirun lost their environment except for MCA params. The problem stemmed from no longer launching a local orted on the same node as mpirun. The orted would save and reuse the base environment. Mpirun didn't do that, and the odls was using the orted's globally saved environment (which wasn't being set). This fix establishes a globally accessible base launch environment that both the orted and mpirun can utilize. Since we now use that, we don't need to pass it to the odls_launch_proc function, so remove that param from the API (and modify all components to handle the change). This commit was SVN r15405.	2007-07-13 15:47:57 +00:00
Ralph Castain	bd65f8ba88	Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. Short description: major changes include - 1. singletons now fork/exec a local daemon to manage their operations. 2. the orte daemon code now resides in libopen-rte 3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided. I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine. This commit was SVN r15390.	2007-07-12 19:53:18 +00:00
Jeff Squyres	64083570f5	Add support for DDT parallel debugger, which required several things: * Making some symbols and types be global (vs. static) in orterun * Adding a "ddt" entry in the MCA parameter orte_base_user_debugger default value * Add support for @executable@, @executable_argv@, and @single_app@ tokens in the orte_base_user_debugger MCA parameter. * Added various error checks and corresponding help messages after finding a debugger in the PATH Fixes trac:1081 This commit was SVN r15323. The following Trac tickets were found above: Ticket 1081 --> https://svn.open-mpi.org/trac/ompi/ticket/1081	2007-07-10 12:53:48 +00:00
Jeff Squyres	892bc38ad0	Protect against a bad free: full_line points to the full buffer. But line may point to a few characters beyond the beginning of the buffer (if the buffer had some extra white space padding at the beginning). So if we want to free the buffer, free full_line, not line. This commit was SVN r15315.	2007-07-09 19:56:16 +00:00
Jeff Squyres	c796d84d61	* Integrate man pages contributed by Dirk Eddelbuettel * Make orted.1 man page be non-descriptive because it's really an internal command. * Re-work the opal_wrapper man page logic a bit so that we can have a real opal_wrapper.1 installed that says "don't look here -- look at mpicc (etc.)" This commit was SVN r15264.	2007-07-02 15:27:39 +00:00
Josh Hursey	f88aa6c273	This commit cleans up the AMCA parameter implementation a bit. * Remove the 'opal_mca_base_param_use_amca_sets' global variable * Harness the fact that you can (read should) call the cmd_line functions before initializing opal_init_util(). This pushes the MCA/GMCA/AMCA command line options into the environment before OPAL inits and starts to use these values. By putting the cmd_line parse before opal_init_util in orterun and orted we only parse the MCA parameter files once, and correctly (alleviating the need to 'recache' the files on init.) Small bits of cleanup. This commit was SVN r15219.	2007-06-27 01:03:31 +00:00
Rainer Keller	15c03e8acc	- Apply patch 31_manpages_lintian.dpatch Thanks to Dirk Eddelbuettel <edd@debian.org> This commit was SVN r15215.	2007-06-26 21:13:10 +00:00
Josh Hursey	84f102c343	Fix/Cleanup the Checkpoint Error propagation through the Snapc Full component. This commit was SVN r15175.	2007-06-22 16:14:25 +00:00
Josh Hursey	78df098aee	If we can not checkpoint, then make sure we return an error This commit was SVN r15151.	2007-06-20 21:05:19 +00:00
Josh Hursey	dd021e7121	Remove some leftover debugging that must have been accidentally left in r15142. This commit was SVN r15145. The following SVN revision numbers were found above: r15142 --> open-mpi/ompi@a3998a1676	2007-06-20 14:06:13 +00:00
Josh Hursey	db59235af5	Fix an AMCA parameter regression introduced (as a side effect of) in r14449 (and, due to lack of in code documentation, in r14661). The {{{opal_mca_base_param_use_amca_sets}}} flag tells the orted that it should not look at the parameter files just yet since it may have an AMCA parameter file to look at first. So we need to set this to {{{false}}} before initializing the MCA params, then quickly turn around and re-init them when we have the full information. This commit fixes trac:1058 This commit was SVN r15144. The following SVN revision numbers were found above: r14449 --> open-mpi/ompi@0ba47105ed r14661 --> open-mpi/ompi@df86202202 The following Trac tickets were found above: Ticket 1058 --> https://svn.open-mpi.org/trac/ompi/ticket/1058	2007-06-20 14:00:40 +00:00
George Bosilca	a3998a1676	Allow the symbols required by TotalView to be exported even when the visibility feature is on. This commit was SVN r15142.	2007-06-19 22:35:23 +00:00
Josh Hursey	5719182a4e	Fix a break introduced in r14706 when RANK_KEY changed types. This commit was SVN r15120. The following SVN revision numbers were found above: r14706 --> open-mpi/ompi@d9acc93efa	2007-06-18 14:57:53 +00:00
Ralph Castain	85df3bd92f	Bring in the generalized xcast communication system along with the correspondingly revised orted launch. I will send a message out to developers explaining the basic changes. In brief: 1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names. 2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (c) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used. 3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying. Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed. This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. 
The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems. This commit was SVN r15007.	2007-06-12 13:28:54 +00:00
Brian Barrett	508da4e959	OS X apparently really doesn't like shared libraries with unresolvable symbols in them and environ is defined only in the final application (probably in crt1.o). Apple provides a function for getting at the environment, so use that instead if it's available. This commit was SVN r14857.	2007-06-05 03:03:59 +00:00
Ralph Castain	a2964f429e	Fix a compiler warning - strncmp returns an int, so you have to compare to 0 instead of NULL. This commit was SVN r14790.	2007-05-29 18:02:10 +00:00
Anya Tatashina	de676d717b	Ref Trac #1032 ; added support for full path launching with TotalView This commit was SVN r14789.	2007-05-29 17:39:11 +00:00
Josh Hursey	1e678c3f55	per conversation with Ralph and Jeff take out the opal_init_only logic. This commit moves the initialization/finalization of opal_event and opal_progress to opal_init/finalize. These were previously init/final in ORTE which is an abstraction violation. After talking about it we concluded that there are no ordering issues that require these to be init/final in ORTE instead of OPAL. I ran the IBM test suite against this commit and it didn't turn up any new failures so I think it is good to go. Let us know if this causes problems. This commit was SVN r14773.	2007-05-24 21:54:58 +00:00
Josh Hursey	e8b85faf28	Fix for the invalid arguments case. we were not finalizing cleanly. This commit was SVN r14770.	2007-05-24 21:27:06 +00:00
Josh Hursey	a010ff6e6a	Some updates from the interface change to orte_init This commit was SVN r14729.	2007-05-23 14:44:23 +00:00
Ralph Castain	e6ff7757ab	Modify the new DSS xfer and copy functions so they only xfer/copy the unpacked portion of a buffer's payload. This allows for more rapid transfer of data during message relay without requiring any knowledge of what is in the buffer. Begin work on restoring binomial message distribution method. This commit was SVN r14728.	2007-05-23 14:06:32 +00:00
Ralph Castain	4fff584a68	Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that did start. The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system. Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed. Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief. With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn. Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. 
This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put". This commit was SVN r14711.	2007-05-21 18:31:28 +00:00
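The semantics of the "arith" operation described in r14711 (an atomic get, modify, put) can be sketched with a toy in-memory registry. This is not the GPR API: `ToyRegistry`, its method names, and the operation strings are all hypothetical, and integer (floor) division is an assumption made for this illustration.

```python
import operator
import threading

# Hypothetical operation names mapped to Python integer operators.
_OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.floordiv,  # assumption: integer division
}


class ToyRegistry:
    """In-memory stand-in for a registry with an atomic arith operation."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._values[key] = value

    def get(self, key):
        with self._lock:
            return self._values[key]

    def arith(self, key, op, operand):
        # The fetch, the modification, and the store all happen under one
        # lock, so no other caller can observe or change the value between.
        with self._lock:
            self._values[key] = _OPS[op](self._values[key], operand)
            return self._values[key]


registry = ToyRegistry()
registry.put("trigger", 8)
after_subtract = registry.arith("trigger", "subtract", 3)
after_multiply = registry.arith("trigger", "multiply", 4)
```

Holding the lock across the whole read-modify-write is what distinguishes `arith` from a caller issuing a separate `get` followed by a `put`.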
Ralph Castain	d9acc93efa	Compute and pass the local_rank and local number of procs (in that proc's job) on the node. To be precise, given this hypothetical launching pattern: host1: vpids 0, 2, 4, 6 host2: vpids 1, 3, 5, 7 The local_rank for these procs would be: host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3 host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3 and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid. I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future. This commit was SVN r14706.	2007-05-21 14:30:10 +00:00
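The local_rank example in r14706 can be reproduced with a short Python sketch. It assumes local ranks are assigned in ascending vpid order within each node, which matches the example in the commit message but is not confirmed as the actual implementation; `compute_local_ranks` and `layout` are hypothetical names.

```python
def compute_local_ranks(node_map):
    """Map each vpid to (local_rank, num_local_procs) for one job.

    node_map: dict of node name -> list of vpids placed on that node.
    """
    result = {}
    for vpids in node_map.values():
        ordered = sorted(vpids)
        for local_rank, vpid in enumerate(ordered):
            result[vpid] = (local_rank, len(ordered))
    return result


# Launch pattern from the commit message.
layout = {"host1": [0, 2, 4, 6], "host2": [1, 3, 5, 7]}
ranks = compute_local_ranks(layout)

# A comm_spawn'd child is a separate job, so it is computed on its own.
child_ranks = compute_local_ranks({"host1": [0]})
```

Computing the child job separately reflects the note that a spawned process gets local_rank 0 and num_local_procs 1 without changing the parent job's values.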
Ralph Castain	75d51812a3	Fix the app-failed-to-start capability that was broken by r14554 (holding the caller in rmgr.spawn until the application - as opposed to just the orteds - have started). Allow the rmgr.spawn function to return if the app terminates, correctly handling its return status code to show abnormal termination. Modify orterun to correctly handle the returned status code so it doesn't enter a conditioned wait if the app fails to start since it will never wakeup if it does. This commit was SVN r14693. The following SVN revision numbers were found above: r14554 --> open-mpi/ompi@4510b42638	2007-05-18 13:29:11 +00:00
Galen Shipman	df86202202	get bproc to compile, other issues still remain.. This commit was SVN r14661.	2007-05-15 23:11:33 +00:00
Jeff Squyres	51ff779a5d	Minor grammatical nit found by Karen/Sun. This commit was SVN r14622.	2007-05-08 21:24:44 +00:00
Jeff Squyres	395d05b6bc	Update the man page to describe both -wdir and -wd. -wdir is considered the "primary" option and -wd is the synonym. Regardless, either of them functions exactly like the other. This commit was SVN r14618.	2007-05-08 20:27:20 +00:00
Jeff Squyres	8a68b2dba7	Add -wdir option as a synonym for -wd (to make us match the man page). This commit was SVN r14614.	2007-05-08 19:09:32 +00:00
Sven Stork	a04c8eb39a	- Bring over the visibility feature, for a finer symbol export control via the visibility feature that is provided by some compilers. Per default this feature is disabled, to enable it you need to configure with --enable-visibility and obviously you need a compiler with visibility support. Please refer to the wiki for more information. https://svn.open-mpi.org/trac/ompi/wiki/Visibility This commit was SVN r14582.	2007-05-04 09:03:37 +00:00
Ralph Castain	7d6d0a1c00	Update reuse_daemons to find the daemons again - requires that orteds now report their nodenames (probably temporary patch pending upcoming minor revision of orted) This commit was SVN r14533.	2007-04-26 15:09:54 +00:00
Josh Hursey	d68ff8c2a3	minor typo This commit was SVN r14516.	2007-04-25 19:54:53 +00:00
Ralph Castain	18cb5c9762	Complete modifications for failed-to-start of applications. Modifications for failed-to-start of orteds coming next. This completes the minor changes required to the PLS components. Basically, there is a small change required to the parameter list of the orted cmd functions. I caught and did it for xcpu and poe, in addition to the components listed in my email - so I think that only leaves xgrid unconverted. The orted fail-to-start mods will also make changes in the PLS components, but those can be localized so they come in one at a time. This commit was SVN r14499.	2007-04-24 20:53:54 +00:00
Ralph Castain	c774f641fb	Modify orterun to provide more user-friendly reporting on jobs that fail to start This commit was SVN r14496.	2007-04-24 19:19:14 +00:00
Ralph Castain	18b2dca51c	Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly. There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd really have to try). This also involved a slight change to the oob.xcast API, so propagated that as required. Note: this has only been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-) Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately. This commit was SVN r14475.	2007-04-23 18:41:04 +00:00
Ralph Castain	009be1c1b5	Reorganize the orted code for easier maintenance. Add ability to deliver xcast messages to local procs (not used at this point). This commit was SVN r14474.	2007-04-23 18:28:20 +00:00
Brian Barrett	0a8af62c64	Fix broken build on OS X with static compiles. Everything that uses anything in OPAL MUST call either opal_init() or opal_init_util(). This commit was SVN r14468.	2007-04-23 15:45:39 +00:00
Josh Hursey	27a42f48d3	Make sure to call opal_init_util before mca_base_open(). This bug(?) became apparent due to the installdirs commit since these tools were not finding the proper libraries since the paths were wonky. It all looks good now. :) This commit was SVN r14461.	2007-04-21 22:38:15 +00:00
Jeff Squyres	5bebd24250	Bring over Brian's installdirs fixes from this afternoon (r14445). This commit was SVN r14450. The following SVN revision numbers were found above: r14445 --> open-mpi/ompi@13d366b827	2007-04-21 00:16:31 +00:00
Jeff Squyres	0ba47105ed	Merge the /tmp/jms-installdirs-trunk branch into the trunk. This finally brings in functionality that is already on the 1.2 branch, and was developed and tested in the v1.2ofed branch (and other places). Short version of new features: * Support for ibv_fork_init() * Automatically fill in the openib BTL bandwidth value by querying the HCA port * Installdirs functionality * Fixes to always use -I in the Fortran wrapper compilers (#924) * Gleb's mpool updates * Remove some kruft in btl/openib/configure.m4, therefore fixing the harmless warnings noted in #665 * Bunches of updates to the Linux RPM spec file I.e., effectively the same thing that r14411 brought to the v1.2 branch. Also effectively brought in r14432 and r14433 (some fixes on top of the original r14411 commit to v1.2). Still need to bring in the moral equivalent of r14445 after this commit (fixes to installdirs). This commit was SVN r14449. The following SVN revision numbers were found above: r14411 --> open-mpi/ompi@83b31314ae r14432 --> open-mpi/ompi@a48f160595 r14433 --> open-mpi/ompi@68f346d2bc r14445 --> open-mpi/ompi@13d366b827	2007-04-21 00:15:05 +00:00
Josh Hursey	6ee0c641fd	Cleanup the output from orte-checkpoint so it is a bit more clear and references the sequence number. Before: [...] Finished - Global Snapshot Reference: ompi_global_snapshot_1234.ckpt After: Snashot Ref.: 1 ompi_global_snapshot_1234.ckpt This commit was SVN r14381.	2007-04-15 14:28:56 +00:00
Josh Hursey	8fd6d4ba09	add a newline so output is cleaner/clearer This commit was SVN r14229.	2007-04-05 17:45:03 +00:00
George Bosilca	f2a6b9394f	Deal with the include spree. Protect "environ" on Windows. Some other minor modifications in order to make it compile [again] on Windows. This commit was SVN r14188.	2007-04-01 16:16:54 +00:00
Ralph Castain	0d98264097	Fix the nolocal option on the OMPI trunk This commit was SVN r14138.	2007-03-24 16:16:16 +00:00
Jeff Squyres	bcdfbacaa4	Oops -- typo from previous commit. :-( This commit was SVN r14130.	2007-03-23 00:51:50 +00:00
Jeff Squyres	a3dd0f2e08	Connect --nolocal up to the MCA param rmaps_base_schedule_local, as it should be (it's a mistake that it got left out). This commit was SVN r14127.	2007-03-22 19:29:47 +00:00
Josh Hursey	101a2abd09	- Be more careful with parens - Run the destructor before shutting things down. This commit was SVN r14064.	2007-03-19 17:33:20 +00:00
Josh Hursey	a181c987cc	Remove some old references to ft_enable parameter that no longer exists. This was replaced by the "-am ft-enable-cr" AMCA parameter. This commit was SVN r14055.	2007-03-17 20:02:42 +00:00
Josh Hursey	dadca7da88	Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). This merge adds Checkpoint/Restart support to Open MPI. The initial frameworks and components support a LAM/MPI-like implementation. This commit follows the risk assessment presented to the Open MPI core development group on Feb. 22, 2007. This commit closes trac:158 More details to follow. This commit was SVN r14051. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r13912 The following Trac tickets were found above: Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158	2007-03-16 23:11:45 +00:00
Josh Hursey	0404444dbe	* Added 2 new MCA parameters - mca_base_param_file_prefix (Default: NULL) This is the full name of the "-am" mpirun option. Used to specify a ':' separated list of AMCA parameter set files. - mca_base_param_file_path (Default: $SYSCONFDIR/amca-param-sets/:$CWD) The path to search for AMCA files with relative paths. A warning will be printed if the AMCA file cannot be found. * Added a new function "mca_base_param_recache_files" that re-reads the file configurations. This is used internally to help bootstrap the MCA system. * Added a new orterun/mpirun command line option '-am' that is an alias for the mca_base_param_file_prefix MCA parameter * Exposed the opal_path_access function as it is generally useful in other places in the code. * New function "opal_cmd_line_make_opt_mca" which will allow you to append a new command line option with MCA parameter identifiers to set at the same time. Previously this could only be done at command line declaration time. * Added a new directory under the $pkgdatadir named "amca-param-sets" where all the 'shipped with' Open MPI AMCA parameter sets are placed. This is the first place to search for AMCA sets with relative paths. * An example.conf AMCA parameter set file is located in contrib/amca-param-sets/. * Jeff Squyres contributed an OpenIB AMCA set for benchmarking. Note: You will need to autogen with this commit as it adds a configure param. Sorry :( This commit was SVN r13867.	2007-03-01 13:39:20 +00:00
Rainer Keller	0889ebd59f	- Eliminate warnings, that PGI-6.2.5 issues with -Minform=inform This commit was SVN r13840.	2007-02-28 08:36:34 +00:00
George Bosilca	d29423b1f7	orted_globals_t should be global. This commit was SVN r13684.	2007-02-16 18:16:06 +00:00
Pak Lui	2d6b3776bf	* fix the SEGV described in trac #892 that the exit_status in the 200 range causes a strsignal to show NULL as a result. Still trying to determine why exit_status is in that range. This commit was SVN r13583.	2007-02-09 16:39:30 +00:00
Pak Lui	ccff0a6e65	* minor fix to correct the pid that always shows up as 0 in the abort error message. e.g: mpirun noticed that job rank 2 with PID 0 on node burl-ct-v440-4 exited on signal 15 (Terminated). This commit was SVN r13537.	2007-02-07 17:46:19 +00:00
Rolf vandeVaart	dcce8c739c	Fix compiler warning. I am not sure how this got past us, but thanks to Jeff Squyres for pointing it out. This commit was SVN r13501.	2007-02-05 22:03:58 +00:00
Rolf vandeVaart	74e3b68ce8	Better document orte-clean's behavior. This commit was SVN r13498.	2007-02-05 20:01:15 +00:00
Jeff Squyres	4e506e69e5	Add missing <sys/param.h> This commit was SVN r13478.	2007-02-03 01:11:35 +00:00
Rolf vandeVaart	bf5113198d	Update to orte-clean so it will remove files on local and remote nodes. It will also kill off rogue orteds and orterun processes. The killing of processes is ifdef'ed out for Windows since I do not know how to do it there. Note that this change will require an autogen. This commit was SVN r13477.	2007-02-03 00:25:42 +00:00
Ralph Castain	3daf8b341b	Fix the sched_yield problem for generic environments. We now determine and set sched_yield during mpi_init based on the following logical sequence: 1. if the user has specified sched_yield, we simply do what we are told 2. if they didn't specify anything, try to get the number of processors on this node. Note that we already now get the number of local procs in our job that are sharing this node - that now comes in through the proc callback and is stored in the ompi_proc_t structures. 3. if we can get the number of processors, compare that to the number of local procs from my job that are sharing my node. If the number of local procs exceeds the number of processors, then set sched_yield to true. If not, then be a hog and set sched_yield to false 4. if we can't get the number of processors, default to conservative behavior and set sched_yield to true. Note that I have not yet dealt with the need to dynamically adjust this setting as more processes are added via comm_spawn. So far, we are only looking within our own job. Given that we have now moved this logic to mpi_init (and away from the orteds), it isn't yet clear to me how a process will be informed about the number of procs in other jobs that are also sharing this node. Something to continue to ponder. This commit was SVN r13430.	2007-02-01 19:31:44 +00:00
George Bosilca	9f73335bdb	Silence the compiler. This commit was SVN r13381.	2007-01-31 04:24:56 +00:00
Jeff Squyres	8d872b195a	Refs trac:726 Tested this functionality quite a bit more and made some fixes: * Print far fewer help messages * Fix one additional deadlock upon error * Change some ORTE_LOG messages to silent (because they're not errors) * Some code got re-indented, sorry... Discussed and reviewed with Ralph. This commit was SVN r13375. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-30 23:03:13 +00:00
Rainer Keller	3669e8921e	- Fix further compiler warnings regarding initialization and shadowing variables. This commit was SVN r13358.	2007-01-30 06:34:38 +00:00
Ralph Castain	ab5ea61100	Bring over the rest of the ctrl-c fixes. This commit includes: 1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned. 2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we printout an appropriate error message and just give up. 3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded 4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond This needs more testing before migrating to 1.2. This commit was SVN r13304.	2007-01-25 14:17:44 +00:00
Ralph Castain	455e4ada9a	Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes. This commit was SVN r13191.	2007-01-18 17:15:19 +00:00
Ralph Castain	cc905290e4	Fix the pernode and npernode options - the mca parameters weren't being set to correspond to the command line options This commit was SVN r13151.	2007-01-17 14:56:22 +00:00
Ralph Castain	5d698dc55b	Turn "off" an unimplemented command line option - we do not currently support execution without mpirun waiting for job completion. This commit was SVN r13127.	2007-01-16 16:10:31 +00:00
Jeff Squyres	e5205657cf	A much better fix for #739. No configure test -- just do a simple memcpy() instead of assigning the structs by value. Fixes trac:739. This commit was SVN r13081. The following Trac tickets were found above: Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739	2007-01-11 14:30:32 +00:00
Jeff Squyres	add3909096	Back out 13076 and 13077 in favor of a much simpler approach. Sorry for the configure change -- hopefully it's early enough in the morning that it won't affect people... (new approach won't have a configure change). Refs trac:739. This commit was SVN r13080. The following Trac tickets were found above: Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739	2007-01-11 14:07:15 +00:00
George Bosilca	24a91fad1d	OPAL_BOOL_STRUCT_COPY or OMPI_BOOL_STRUCT_COPY that's the question! Let's minimize the disturbances and say that the configure system is right. From now on it's OPAL_BOOL_STRUCT_COPY. This one is related to r13076 and has to follow when r13076 goes in the 1.2. This commit was SVN r13077. The following SVN revision numbers were found above: r13076 --> open-mpi/ompi@f0932a0701	2007-01-11 05:44:48 +00:00
Jeff Squyres	f0932a0701	A workaround for a bug in the PGI 6.2 compiler series. This bug has been fixed in the 7.0 PGI series, but is unlikely to be fixed in the 6.2 series: * Add a configure test looking for the bad behavior (the PGI compiler chokes on C code where structs containing bool's are copied by value) * Set OMPI_BOOL_STRUCT_COPY to 1 if it's ok, 0 if it's not (i.e., PGI 6.2 series will have this value set to 0) * In two places in the code base -- orte-clean and btl_openib_ini.h, we have a struct that contains a bool that is copied by value. In these two places, check OMPI_BOOL_STRUCT_COPY and if it's 1, use the "int" type instead of "bool". Fixes trac:739 This commit was SVN r13076. The following Trac tickets were found above: Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739	2007-01-11 02:21:26 +00:00
Jeff Squyres	8a289cf1cb	Part 1 of the fix for ticket #726. This commit adds logic to orterun to effect the following: * The first time the user hits ctrl-c, we go into the process of killing the ORTE job (this is not new). * While waiting for the job to actually terminate, if the user hits ctrl-c a second time, we print a warning saying "Hey, I'm still trying to kill the job. If you really want me to die immediately, hit ctrl-c again within 1 second." * If the user hits ctrl-c again within 1 second, orterun quits with a warning about how the job may not have actually been killed. Note that none of this logic will really work until the second part of the fix for #726 is also committed (i.e., make pls.terminate_job() non-blocking). So I'm now throwing the ticket over to Ralph for the second part of the fix... Refs trac:726 This commit was SVN r13040. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-08 20:25:26 +00:00
Rolf vandeVaart	fdf44cc4ab	Add the ability to not only report broken files and directories, but remove them also. This current set of changes will affect nothing as no one is making use of this ability. However, orte-clean will be changed soon to utilize this new feature. This commit was SVN r12996.	2007-01-04 21:48:34 +00:00
Brian Barrett	bc6cec346f	Print out the description of the signal from mpirun when a proc was aborted by a signal if we have strsignal() This commit was SVN r12888.	2006-12-17 20:01:11 +00:00
Ralph Castain	7b8f445e13	Modify the "--display-map-at-launch" option to just "--display-map". Now that we have a "--do-not-launch" option, the "-at-launch" part of the display-map option was confusing. "--display-map" displays the resulting process map before we launch anyway, so this is clearer. This commit was SVN r12840.	2006-12-13 13:49:15 +00:00
Ralph Castain	82946cb220	Add a new option to orterun: "--do-not-launch" directs the system to do the allocation, map, job setup, etc., but don't actually launch the job. This lets us test all the setup portions of the code. Also, take the first step in updating how we handle mca params in ORTE - bring it closer to how it is done in the other two layers. Much more work to be done here. This commit was SVN r12838.	2006-12-13 04:51:38 +00:00
Ralph Castain	28ce8e5e5e	Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior). In "--npernode" operation, the "-np" command line parameter is ignored. This commit was SVN r12826.	2006-12-12 00:54:05 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Rainer Keller	e61dd8722e	- Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx - No functional changes, just indentation and corrections to error output. This commit was SVN r12734.	2006-12-03 13:59:23 +00:00
George Bosilca	a0ed53d70b	Make the compilers happy. This commit was SVN r12729.	2006-12-03 00:19:11 +00:00
Ralph Castain	652b91ee26	Remove some compiler warnings This commit was SVN r12678.	2006-11-27 23:47:36 +00:00
Brian Barrett	32833deff0	since orteboot, ortehalt, and ortekill were all added today (including to configure.ac), we need to add them to SUBDIRS to make them end up in the tarball as well... This commit was SVN r12658.	2006-11-23 03:10:57 +00:00
Ralph Castain	7f95b27141	Correctly "hide" the new orte tools - they shouldn't get compiled or seen unless you specifically go into those subdirectories and manually do a "make". This commit was SVN r12650.	2006-11-22 14:35:16 +00:00
Ralph Castain	33affed09c	Bring ortehalt to a preliminary capability. It will correctly order a persistent daemon to exit cleanly. Need to now interface it to orterun, clean up a few things here and there This commit was SVN r12626.	2006-11-18 04:47:51 +00:00
Ralph Castain	3c5a2cd17b	Cleanup a few warnings for unused variables This commit was SVN r12620.	2006-11-17 19:32:49 +00:00
Ralph Castain	f771cc4fbd	Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option. Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution. This commit was SVN r12619.	2006-11-17 19:06:10 +00:00
Ralph Castain	ca5b4358fa	Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too. Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone. This commit was SVN r12616.	2006-11-17 02:58:46 +00:00
Ralph Castain	5ddcb8a652	Ensure the orted kills all local procs when exiting. Add a little clarity to some of the debugging output This commit was SVN r12615.	2006-11-16 21:15:25 +00:00
Ralph Castain	044898f4bf	My eyes may be deceiving me....but I do believe these comparisons are backwards! I think we only really want to "free" these variables if they are NOT NULL - as opposed to "free"ing them if they ARE NULL. This commit was SVN r12612.	2006-11-15 22:59:01 +00:00
Ralph Castain	f7fc19a2ca	Create the ability to re-use existing daemons. Included in the commit: 1. new functionality in the pls base to check for reusable daemons and launch upon them 2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED 3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse". This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on. This commit was SVN r12608.	2006-11-15 21:12:27 +00:00
Ralph Castain	437f2b044d	Modify the orted command communication system in two ways: 1. use non-blocking sends to transmit commands (this was actually done in a prior commit) 2. have an "ack" message sent back from the orted when it completes the command The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occasionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done. Best of both worlds. This commit was SVN r12605.	2006-11-15 15:09:28 +00:00
Ralph Castain	6d6cebb4a7	Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things). Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it. I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn). This commit was SVN r12597.	2006-11-14 19:34:59 +00:00
Ralph Castain	7b4261001a	Forgot to modify the orted end of the communication subsystem This commit was SVN r12586.	2006-11-13 22:08:47 +00:00
Ralph Castain	f95e20e2e1	Add another test program - an MPI app that just spins. This supports testing of system response to signal-terminated processes. Add some debugger output to the ODLS default component. Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time. This commit was SVN r12585.	2006-11-13 21:51:34 +00:00
Ralph Castain	4636125e2d	Modify the RMGR components to allow job setup with a given jobid, and add another attribute so that we can setup triggers without launching. Add some debugging output to the ODLS default module, and the orted. Remove the nodename data from the ODLS info report - that info is already stored in the registry by the RMAPS framework upon completing the mapping procedure. Add another test program that does an ORTE-only dynamic spawn (gasp!). Looks just like comm_spawn - just no MPI involved. Modify the ODLS to release the processor when we "kill" local procs in a more scalable fashion. It previously had a sleep in it that Jeff's prior commit removed. However, he introduced some Windows code into the non-Windows component (protected by "if"s, but unnecessary). This is a more general solution he proposed - included here so I could get things to compile properly. This commit was SVN r12579.	2006-11-13 18:51:18 +00:00
Ralph Castain	4e50cdae52	This commit accomplishes two things: 1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs. 2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system. This commit was SVN r12558.	2006-11-11 04:03:45 +00:00
Ralph Castain	a3be8261fb	Fix a bug that had us generate an error message and abort startup when there were stale universe directories around. Now, we just ignore them. This commit was SVN r12472.	2006-11-07 21:34:57 +00:00
Ralph Castain	30de73a712	Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include: 1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch" 2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command. 3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job 4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map. Enjoy! My personal favorite is the first one - helps with debugging. This commit was SVN r12379.	2006-10-31 22:16:51 +00:00
Ralph Castain	c5b59829aa	Fix a long-lingering annoyance. Calling mpirun with a non-existent application would cause the system to hang on all environments. Reason was that the orted would exit, which it should never do without explicit orders to that effect. This commit was SVN r12255.	2006-10-23 13:27:31 +00:00
George Bosilca	ee559e9947	Do not completely reset the orterun_globals. Keep the condition and the mutex, but reset everything else. Once initialized, the condition (and the attached mutex) should be kept alive as long as possible if we want to be able to retrieve all the information. This commit was SVN r12253.	2006-10-23 03:34:08 +00:00
Ralph Castain	153e38ffc9	Lesson to be learned: if you send an ack to a recv'd command, better not send it to the same tag it came from - at least, not if there is a persistent recv on that tag! Fix the persistent daemon problem where it was exiting when a job completed. Problem was that the persistent daemon would order the job daemons to exit. They would then send an 'ack' back to the persistent daemon - but the ack consisted of an echo of the "exit" command, which was recv'd by the wrong listener who treated it as a properly sent cmd....and exited. This commit was SVN r12243.	2006-10-21 02:53:19 +00:00
Ralph Castain	ec0bb9ffda	Fix the bookmark system - we now have children being correctly spawned where they should! Also, I am no longer seeing any issue with the child job spawning its own daemons - this appears to be fixed. We still don't reuse the existing daemons, however, but that will come. This commit was SVN r12229.	2006-10-20 18:05:16 +00:00
Ralph Castain	02efd07b60	Fix the MCA param passing issue, at least for rsh at the moment. I will clean this up and move it to the other environments once I shift back to a local computer. This commit was SVN r12224.	2006-10-20 15:27:29 +00:00
Brian Barrett	37fad860b7	Grrr... Forgot that EXTRA_DIST and man_MANS are not set to include all the possible things contained in the conditional like other rules are (for example, a SOURCES rule in a conditional automatically has its files added to the dist rules, even if that conditional isn't true when make dist occurs). So the man files weren't in the tarball. Put the EXTRA_DIST with the files explicitly listed outside any conditionals so the man pages always end up in the tarball. This commit was SVN r12220.	2006-10-20 14:15:38 +00:00
Ralph Castain	ab196c3121	Okay, this fixes the problem of MCA params spreading too far. Sorry for the multiple corrections. This commit was SVN r12201.	2006-10-19 22:51:02 +00:00
Ralph Castain	382f954fff	Fix a bug in the way we saved and passed environments to child processes on remote nodes. The problem was that MCA directives for component selection were being passed back to the children. However, now that we only allow certain components to operate on HNPs, this caused the children to bomb out of orte_init. This commit was SVN r12196.	2006-10-19 20:35:55 +00:00
Brian Barrett	204f5b8f52	- Clean up wrapper compiler man pages during maintainer-clean, since they might require special tools (not sure if sed with multiple -e arguments is totally portable) - ignore the opalcc.1 man page. Couldn't do this in the previous man page commit (r12192) because I was removing opalcc.1 in that commit. This commit was SVN r12194. The following SVN revision numbers were found above: r12192 --> open-mpi/ompi@581a4b0a4e	2006-10-19 20:14:40 +00:00
Brian Barrett	581a4b0a4e	A few cleanups to the wrapper compiler build system / man pages: - Only install opal{cc,c++} and orte{cc,c++} if configured with --with-devel-headers. Right now, they are always installed, but there are no header files installed for either project, so there's really not much way for a user to actually compile an OPAL / ORTE application. - Drop support for opalCC and orteCC. It's a pain to set up all the symlinks (indeed, they are currently done wrong for opalCC) and there's no history like there is for mpiCC. - Change what is currently opalcc.1 to opal_wrapper.1 and add some macros that get sed'ed so that the man pages appear to be customized for the given command. - Install the wrapper data files even if we compiled with --disable-binaries. This is for the use case of doing multi-lib builds, where one word size will only have the library built, but we need both sets of wrapper data files to piece together to activate the multi-lib support in the wrapper compilers. This commit was SVN r12192.	2006-10-19 18:34:17 +00:00
Ralph Castain	13227e36ab	This commit looks a lot bigger than it is, so relax :-) Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off. To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place. I used those capabilities in two places: 1. Added an attribute list to the rmgr.spawn interface. 2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h). So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms. This commit was SVN r12138.	2006-10-17 16:06:17 +00:00
Brian Barrett	9adde4f7b8	Allow multilib capability based on compiler flags. See: https://svn.open-mpi.org/trac/ompi/wiki/compilerwrapper3264 for more information. Refs trac:374 This commit was SVN r12120. The following Trac tickets were found above: Ticket 374 --> https://svn.open-mpi.org/trac/ompi/ticket/374	2006-10-15 21:21:08 +00:00
Ralph Castain	3f55d6897a	Remove the memory debugging options. Fix what appears to be a typo in a help file. This commit was SVN r12107.	2006-10-12 00:44:48 +00:00
Ralph Castain	2da8245be0	Correctly propagate no-daemonize This commit was SVN r12093.	2006-10-11 17:53:17 +00:00
Ralph Castain	27e305347c	Add a couple of options to orterun that support debugging of daemons for memory corruption. Ensure that the environment provided to local application processes isn't "polluted" by the orteds This commit was SVN r12087.	2006-10-11 15:18:57 +00:00
Ralph Castain	e7f6fa22d6	Fix return code so that mpirun returns the right thing when an abort is encountered. This commit was SVN r12065.	2006-10-09 01:04:00 +00:00
Ralph Castain	2e09128337	Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destructor to fail!! Restore the OBJ_RELEASE calls to cleanup map objects. This commit was SVN r12064.	2006-10-07 22:44:00 +00:00
Ralph Castain	98dd57b70e	Add a new option to launch "pernode" - launches one process/node across all available nodes. The other options also work correctly: "-bynode" with no -np will launch on all slots, mapped on a per-node basis. This commit was SVN r12063.	2006-10-07 19:50:12 +00:00
Ralph Castain	889ddefe85	Remove release that caused totalview connection to bomb This commit was SVN r12061.	2006-10-07 18:25:56 +00:00
Ralph Castain	ae79894bad	Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue. I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there). Gridengine compiles but I cannot test (believe it likely will run). Poe and xgrid compile to the extent they can without the proper include files. This commit was SVN r12059.	2006-10-07 15:45:24 +00:00
Jeff Squyres	72cf2fe813	Oops: --noprefix should not take an argument. This commit was SVN r12043.	2006-10-06 13:02:56 +00:00
George Bosilca	d628a18411	Right now there is no support for TotalView on Windows. Therefore, we don't really care how these functions and variables are declared. This commit was SVN r11996.	2006-10-05 05:19:03 +00:00
Ralph Castain	12328395ae	Missed a couple of debug statements This commit was SVN r11935.	2006-10-02 15:46:41 +00:00
Tim Prins	53b116d309	This commit fixes trac:452. It turns out that we were improperly allocating an array if -np was not passed. Also, we were not really using this array for anything. So this gets rid of the array and performs some minor cleanup. This commit was SVN r11934. The following Trac tickets were found above: Ticket 452 --> https://svn.open-mpi.org/trac/ompi/ticket/452	2006-10-02 15:03:43 +00:00
Ralph Castain	7494a7a83f	Clean out some debugging statements that were inadvertently left in the commit This commit was SVN r11933.	2006-10-02 15:03:18 +00:00
Ralph Castain	559b9b0ae8	Continue beating on comm_spawn. Setup to debug bproc. This commit was SVN r11932.	2006-10-02 14:58:22 +00:00
Ralph Castain	121f834776	Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component. Still need some work on the proxy component, and on job termination for persistent daemon case. This commit was SVN r11928.	2006-10-02 00:46:31 +00:00
Dan Lacher	ba0389723e	Ticket: #346 remove requirements on .la files on wrapper scripts Ticket: #374 extend compilers to support 32 bit and 64 bit in one version of the wrapper Submitted by: Dan Lacher Reviewed by: Rolf Vandevaart This commit was SVN r11908.	2006-09-29 23:58:58 +00:00
Jeff Squyres	785a2e1c90	Move the man page installs to install-data-hook. Putting them in install-exec-hook is not only wrong, it can cause ordering issues such as trying to put sym links to man pages in directories that do not yet exist. This commit was SVN r11893.	2006-09-29 14:34:39 +00:00
Tim Prins	e4f8ad303e	Fix for #397 on 64 bit platforms sizeof(size_t) != sizeof(orte_std_cntr_t), and we were incorrectly assuming this when dealing with num procs. It worked on little endian platforms, but not big endian. So change num_procs to type int, and cast where needed. This commit was SVN r11796.	2006-09-25 19:41:54 +00:00
Jeff Squyres	c5cc1f0c1a	Add man page for wrapper compilers. Fixes trac:358. This commit was SVN r11773. The following Trac tickets were found above: Ticket 358 --> https://svn.open-mpi.org/trac/ompi/ticket/358	2006-09-25 14:11:21 +00:00
Ralph Castain	977e3c5ca1	Let's see if Cyrador understands this version a little better... This commit was SVN r11709.	2006-09-19 13:05:40 +00:00
Ralph Castain	0ad0d84afd	Add two new API functions to the RMGR, and modify the "spawn" API to support the enhanced MPI-2 functionality. No implementation backs these new APIs - just placeholders for now. This commit was SVN r11699.	2006-09-19 01:45:05 +00:00
George Bosilca	f8de894efe	This one wasn't supposed to get into the repository. This commit was SVN r11697.	2006-09-18 21:28:55 +00:00
George Bosilca	7ad23ff97b	Be 100% total view friendly. Let tv find out the real name of our executable and export all functions as they should be. This commit was SVN r11694.	2006-09-18 17:55:14 +00:00
Ralph Castain	d7e61e40fc	Quiet a few warnings from Cyrador This commit was SVN r11686.	2006-09-18 12:40:42 +00:00
Jeff Squyres	8226dab86c	Fixes trac:377 Add --enable-orterun-prefix-by-default (and a synonym: --enable-mpirun-prefix-by-default) to make orterun always behave as if "--prefix $prefix" was given on the command line (where $prefix is the value given to the --prefix option to configure). This prevents many rsh/ssh users from needing to modify their shell startup files to set the LD_LIBRARY_PATH for Open MPI (they will still need to set PATH or otherwise find the OMPI executables to mpicc/mpirun/etc. their MPI applications). Also added --noprefix option to orterun to disable this behavior. Finally, note that even if --enable-orterun-prefix-by-default is specified, if the user specifies --prefix or /path/to/mpirun, these options will override the default value of the prefix ($prefix). This commit was SVN r11669. The following Trac tickets were found above: Ticket 377 --> https://svn.open-mpi.org/trac/ompi/ticket/377	2006-09-15 02:52:08 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
Galen Shipman	b02185374f	Push a generated "key" out to all the processes. This is necessary for some interconnect wireup in which all processes must agree on a "key" to initialize the interconnect with. This commit was SVN r11653.	2006-09-14 15:27:17 +00:00
George Bosilca	7e7bae335e	Protect the environ variable on windows. This commit was SVN r11435.	2006-08-27 04:44:17 +00:00
George Bosilca	e04032ca2f	Correct a comment and protect the usage of the environ variable against Windows. This commit was SVN r11397.	2006-08-24 16:18:42 +00:00
George Bosilca	fdfae70dbe	Use environ. This commit was SVN r11353.	2006-08-23 06:19:47 +00:00
George Bosilca	75fa0317da	Keep environ as the preferred storage for the environment variables. This commit was SVN r11351.	2006-08-23 06:14:24 +00:00
George Bosilca	b4732f557a	Now it's time to update ORTE. Cleanup most of the ORTE tools. Force them to use opal_basename and opal_dirname. Don't create the path manually. Use the specialized opal functions instead. This commit was SVN r11345.	2006-08-23 02:35:00 +00:00
George Bosilca	6ef0acf99f	The names of the defines should start with OPAL as they belong to the OPAL layer. We now support 64 bits Windows too. This commit was SVN r11312.	2006-08-21 21:55:41 +00:00
Ralph Castain	8c7f0ed9ae	Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system. Other changes: 1. Remove the old xcpu components as they are not functional. 2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one. This will require an autogen/configure, I'm afraid. This commit was SVN r11228.	2006-08-16 16:35:09 +00:00
Ralph Castain	5dfd54c778	With the branch to 1.2 made.... Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced). Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up). I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t). In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but... Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems. This commit was SVN r11204.	2006-08-15 19:54:10 +00:00
Ralph Castain	ec3eeb819d	Remove unused variable to make Cyrador happy. This commit was SVN r11144.	2006-08-10 12:57:55 +00:00
Ralph Castain	56c15963af	Finalize the session directory and runtime when the orted exits due to a failed launch. This commit was SVN r11141.	2006-08-09 17:00:53 +00:00
Ralph Castain	8bec270f90	Fix a bug noted by Jeff - we were no longer accurately recording in the registry that a process had been terminated when the user initiated the "kill" process (via ctrl-c). Added another system-level test function for ORTE that just spins until terminated by a ctrl-c signal. Modified orterun - added a couple of newlines to the output when abnormally terminating so the prompt always is on a new line. This commit was SVN r10866.	2006-07-18 14:42:27 +00:00
Jeff Squyres	416e9de22d	Fix some minor problems when handling the error cases This commit was SVN r10854.	2006-07-17 19:21:10 +00:00
Ralph Castain	c22b0d516e	Some edits to the man page for Jeff to review This commit was SVN r10803.	2006-07-14 14:47:06 +00:00
Jeff Squyres	e6c9c699fe	Minor changes: - change -no_oversubscribe to -nooversubscribe (to be similar to -nolocal) - Added text to orterun.1 describing slots and -nooversubscribe Still need to add text about "mpirun a.out" functionality, and RHC wants to make some minor edits, so committing for synchronization. This commit was SVN r10800.	2006-07-14 14:15:03 +00:00
Jeff Squyres	f7a71772a7	Remove long-defunct "openmpi" tool from orte. It was apparently an early generation of the orted, and is now long-dead. This commit was SVN r10754.	2006-07-12 03:52:17 +00:00
Josh Hursey	682a6a123e	- os_dirpath.c : reset the is_dir var each time through the loop. - orte-clean.c : check to see if the base session directory is empty and delete it if it is. - orte_universe_exists.c : Fix a downstream problem resulting from George's r10718 commit. Don't use the 'fulldirpath' since that is no longer guaranteed to be the absolute path to the session directory. Construct this value outside of that function from the prefix and frontend vars. This commit was SVN r10741. The following SVN revision numbers were found above: r10718 --> open-mpi/ompi@47eef2e002	2006-07-11 17:31:05 +00:00
Josh Hursey	5a812c8211	Fix orte-ps which George broke in r10718 by extending the orte_session_dir_get_name() so that it does not return an error when no universe is passed to it. Also put back in the 'Slots In Use' column as it is now working properly per Ralph's recent ras commits. Still not sure what 'Slots Alloc' is meant to represent, so left that as #if 0'd out for the moment. This commit was SVN r10739. The following SVN revision numbers were found above: r10718 --> open-mpi/ompi@47eef2e002	2006-07-11 16:54:07 +00:00
Josh Hursey	2e506591c3	more pedantic cleanup. Hopefully this will make happy. This commit was SVN r10730.	2006-07-11 13:48:28 +00:00


605 commits