openmpi

Author	SHA1	Message	Date
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 2. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not the opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). 
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 behaves much like opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. 
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done so that configure.m4 searches for an appropriate XML library, links it in, and uses it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
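The duplicate suppression described in the commit above (r18434) can be pictured with a small Python sketch. It is not Open MPI code: the class name, the (file, topic) key, the summary wording, and the explicit `now` argument are assumptions made for illustration. Only the idea of showing a help message once and reporting new copies roughly every 5 seconds comes from the commit message.

```python
from collections import Counter


class HelpAggregator:
    """Toy model of HNP-side duplicate suppression for help messages."""

    def __init__(self, interval=5.0):
        self.interval = interval   # seconds between summary lines (assumed ~5 s)
        self.seen = set()          # (file, topic) keys already displayed once
        self.pending = Counter()   # duplicates received since the last summary
        self.last_flush = 0.0      # time of the last summary line

    def receive(self, filename, topic, text, now):
        """Return the lines that would be displayed for this arrival."""
        key = (filename, topic)
        lines = []
        if key not in self.seen:
            # First copy of this message: display it immediately.
            self.seen.add(key)
            lines.append(text)
        else:
            # Duplicate: only count it for the next summary.
            self.pending[key] += 1
        lines.extend(self.flush(now))
        return lines

    def flush(self, now):
        """Emit one summary line per duplicated message, at most once per interval."""
        if now - self.last_flush < self.interval or not self.pending:
            return []
        lines = [
            f"{count} more process(es) sent help message {fname} / {topic}"
            for (fname, topic), count in sorted(self.pending.items())
        ]
        self.pending.clear()
        self.last_flush = now
        return lines
```

Setting the interval to zero in this sketch would report every duplicate as soon as it arrives; that is a property of the toy model, not a statement about how `orte_base_help_aggregate` is implemented.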
Ralph Castain	5311b13b60	Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system. This commit was SVN r18252.	2008-04-23 14:52:09 +00:00
Josh Hursey	cc83d41ad9	Merge in tmp/jjh-scratch {{{ svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch . }}} Contains: * Primarily a fix for a user-reported problem where a cached file descriptor is causing a SIGPIPE on restart. * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P. * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry * Some other sundry cleanup items all dealing with C/R functionality in the trunk. This commit was SVN r18241.	2008-04-23 00:17:12 +00:00
Ralph Castain	16c9100633	Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation. This commit was SVN r18216.	2008-04-20 02:25:45 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile. This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Ralph Castain	7b91f8baff	Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory. Fix the ompi-server -h cmd line option so it actually tells you something! Add two new testing codes to the orte/test/mpi area: accept and connect. This commit was SVN r18176.	2008-04-16 14:27:42 +00:00
Ralph Castain	5e6dc24e62	Fix ompi-server so it works with unity routed module - still not working with tree routing. Cleanup debug flag so it activates debugging on the data server code itself This commit was SVN r18080.	2008-04-04 19:17:28 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Jeff Squyres	dee561d29e	Per recent off-list discussions about the build system, I have done some cleanups and standardizations in the various /tools// Makefile.am files. This commit: * Somewhat simplifies the tool Makefile.am's * Makes the tool Makefile.am's consistent with each other (do similar actions in similar ways) * Updates the tool Makefile.am's to remove old cruft that was required by older versions of AM (trunk requires AM >=1.10) This commit was SVN r17921.	2008-03-22 02:04:05 +00:00
Ralph Castain	6bb139e4f2	One more correction to mpirun exit codes - cleanup the application proc's exit codes in the orted so that non-zero exit codes generated by mpirun itself don't get "munged". Modify the multi_abort function so they all return different exit codes - allows us to tell which one was being reported. This commit was SVN r17895.	2008-03-20 13:54:11 +00:00
Ralph Castain	2ed0e60321	Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize. Jeff is preparing an (undoubtedly lengthy) explanation/matrix of how these codes are determined for the OMPI FAQ. This commit was SVN r17879.	2008-03-19 19:00:51 +00:00
Ralph Castain	629b95a2fe	Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway. So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked. Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion. This commit was SVN r17843.	2008-03-17 17:58:59 +00:00
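The "call it only once" guard described in the commit above (r17843) can be illustrated with a short Python sketch. This is not the ORTE implementation: `OnceGuard`, `finalize`, and `run_demo` are hypothetical names, and a non-blocking `threading.Lock.acquire` stands in for the atomic lock, since it succeeds for exactly one caller.

```python
import threading


class OnceGuard:
    """First caller wins; every later caller is told to back off."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_enter(self):
        # A non-blocking acquire acts like an atomic test-and-set:
        # it returns True for exactly one caller and False for the rest.
        # The lock is never released, so the guard stays closed.
        return self._lock.acquire(blocking=False)


finalize_guard = OnceGuard()
finalize_calls = []  # records which path actually ran the shutdown


def finalize(reason):
    """Run the (pretend) shutdown at most once, whichever path gets here first."""
    if not finalize_guard.try_enter():
        return False  # someone else is already finalizing
    finalize_calls.append(reason)
    return True


def run_demo(n_threads=8):
    """Race several shutdown paths and report how many actually finalized."""
    threads = [
        threading.Thread(target=finalize, args=(f"path-{i}",))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(finalize_calls)
```

However many paths race toward shutdown in this sketch, only one of them gets past the guard, which is the property the commit message is after.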
Ralph Castain	57a72c412a	Utilize Tim M's suggestion and use atomics to do the locking. This commit was SVN r17767.	2008-03-06 21:36:32 +00:00
Ralph Castain	097cc83be2	Fix a race condition - ensure we don't call terminate in orterun more than once, even if the timeout fires while we are doing so This commit was SVN r17766.	2008-03-06 19:35:57 +00:00
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124 The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Rolf vandeVaart	03fdd57d5a	Fix the use of --path and -x PATH so that things work properly. Note that --path specifies extra directories where the executable is searched for, but does not affect the PATH settings. This commit fixes trac:1221. This commit was SVN r17748. The following Trac tickets were found above: Ticket 1221 --> https://svn.open-mpi.org/trac/ompi/ticket/1221	2008-03-05 21:07:43 +00:00
Ralph Castain	06d3145fe4	First cut at direct launch for TM. Able to launch non-ORTE procs and detect their completion for a clean shutdown. This commit was SVN r17732.	2008-03-05 13:51:32 +00:00
Ralph Castain	edb8e32a7a	Add default hostfile parameter plus --default-hostfile command line option. Fix error message when job setup failed This commit was SVN r17724.	2008-03-05 04:54:57 +00:00
Ralph Castain	9413d6cf5d	Define a default exit code for when things fail prior to a job launch - still needs work, but a start. Fix a deadlock loop when things really, really go bad. If we timeout trying to kill the job, then it's time to bail as cleanly as possible, not go back and keep trying. This commit was SVN r17715.	2008-03-05 01:46:30 +00:00
George Bosilca	7879e0b9c2	Be nice with parallel debugger, export this required symbol. This commit was SVN r17637.	2008-02-28 05:59:07 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Sharon Melamed	025b68becf	Move the carto framework to the trunk. This commit was SVN r17177.	2008-01-23 09:20:34 +00:00
Jeff Squyres	b6e9c99f7d	Formatting fixes from Peter Breitenlohner. This commit was SVN r17163.	2008-01-18 23:21:31 +00:00
Josh Hursey	0bf61a1b84	Move in some accumulated small features and minor bug fixes for C/R support. {{{ svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs . }}} This commit was SVN r16478.	2007-10-17 13:47:36 +00:00
Ralph Castain	b6196e8a39	When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed. In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean. I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it. This commit was SVN r16451.	2007-10-15 18:00:30 +00:00
Josh Hursey	31e9369e8b	Fix orterun so it does not get influenced by an application's argv set. For example, if I have an application that, internal to the application, takes the argument '-mca foo bar' we do not want orterun to pick up this argument and pass it through the system. So the following {{{ shell$ mpirun -np 2 -mca btl tcp,self ./myapp -mca foo bar }}} orterun should pick up {{{-mca btl tcp,self}}} but not {{{-mca foo bar}}} which it was previous to this commit. I tested command line runs and runs with app files to confirm this patch works. This commit was SVN r16431.	2007-10-11 18:33:40 +00:00
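The behavior described in the commit above (r16431) amounts to no longer interpreting launcher options once the application name has been reached. A minimal Python sketch of that idea follows; the option table and the function name are hypothetical and do not reflect how orterun's command line parser is actually written.

```python
# Hypothetical option table: option name -> number of arguments it consumes.
LAUNCHER_OPTIONS = {"-np": 1, "-mca": 2}


def split_command_line(argv):
    """Split argv into (launcher options, application command line).

    Parsing stops at the first token that is not a known launcher option;
    that token is treated as the executable and everything after it
    belongs to the application untouched.
    """
    launcher = []
    i = 0
    while i < len(argv) and argv[i] in LAUNCHER_OPTIONS:
        width = 1 + LAUNCHER_OPTIONS[argv[i]]
        launcher.extend(argv[i:i + width])
        i += width
    return launcher, argv[i:]
```

Applied to the example from the commit message, `-mca btl tcp,self` ends up in the launcher part, while `-mca foo bar` stays with `./myapp`.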
Josh Hursey	7437f37e96	This commit contains the following: * Fix some missing includes in a few places. * Add the cr_request() functionality to the BLCR CRS component. We are now dependent upon the 0.6.* series of BLCR. * Made the CR notification mechanism a registered function. This way we can have an OPAL-only version and it can be replaced at runtime with the ORTE version. * Add an 'opal_cr_allow_opal_only' parameter that will enable OPAL-only CR functionality when the user wants it. Default: Disabled. * Fix the placement of a checkpoint request check in MPI_Init * Pull the OPAL notification mechanism into the SnapC framework. * We no longer fork/exec the 'opal-checkpoint' command for local checkpointing, the Local coordinator in the orted does this directly. * The Local and Application coordinator talk together bypassing the OPAL notification mechanism. * Optimized the Local <-> App Coordinator communication. * Improved the structure used to track vpid_snapshots in the local coord. * Fix a race condition in which an application under heavy communication load may produce an inconsistent global checkpoint. This commit was SVN r16389.	2007-10-08 20:53:02 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. 
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. 
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
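The contact-info estimate in the commit above (r16364) can be reproduced with a few lines of Python. The function name is hypothetical, and reading "32k" as 32,000 is an assumption chosen because it matches the 1.6MBytes and 51GBytes figures quoted in the message.

```python
def contact_info_savings(num_procs, bytes_per_proc=50):
    """Estimate bytes no longer sent once OOB contact info leaves the startup message.

    Each startup message used to carry about `bytes_per_proc` bytes for every
    process, and one such message went to every process in the job.
    """
    per_message = num_procs * bytes_per_proc   # bytes dropped from one message
    total = per_message * num_procs            # summed over all recipients
    return per_message, total


per_message, total = contact_info_savings(32_000)
print(f"{per_message / 1e6:.1f} MB per startup message")
print(f"{total / 1e9:.1f} GB across all recipients")
```

The quadratic growth in the second figure is why a saving of only about 50 bytes per process matters at this scale.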
George Bosilca	4e66376e66	Fix memory leak (Coverity 702). This commit was SVN r16122.	2007-09-13 20:11:38 +00:00
Shiqing Fan	dcee7e4229	- Should not use ORTE_DECLSPEC with initialization. This commit was SVN r16086.	2007-09-11 10:13:53 +00:00
George Bosilca	d658a477af	Update the help file to match the real name of the required argument. This commit was SVN r15762.	2007-08-04 00:35:55 +00:00
Shiqing Fan	0f468f3668	- Remove the solution and project files, will commit them later. This commit was SVN r15705.	2007-07-31 17:07:02 +00:00
Shiqing Fan	4d7b349cdb	- Add VC8 solution and project files. - If one wants to use this solution, remember to unload the project 'orte-restart' which is currently not working for Windows. This commit was SVN r15680.	2007-07-30 11:05:34 +00:00
Ralph Castain	d99c764e75	Resolve a problem where the orte daemon comm functions were being accessed by mpirun while still retaining occasional reference to the orted_globals. Remove all dependence on orted_globals from the comm functions. Move those functions back into their own file to make it easier to maintain the separation. Ensure that mpirun ignores any "exit" commands being sent to daemons as it will exit on its own. This commit was SVN r15562.	2007-07-23 18:36:33 +00:00
Ralph Castain	2110064a9a	Ensure that the LD_LIBRARY_PATH and PATH get properly set for procs locally spawned by mpirun. This commit was SVN r15516.	2007-07-19 19:00:06 +00:00
Josh Hursey	eeba2cb871	Add a comment to clarify the relationship between mca_base_cmd_line_process_args() and opal_init_util() so we do not forget their ordering needs, and subtle relationship. This commit was SVN r15412.	2007-07-13 19:08:05 +00:00
Ralph Castain	2bded34a1d	Fix a problem observed by Brian where processes launched local to mpirun lost their environment except for MCA params. The problem stemmed from no longer launching a local orted on the same node as mpirun. The orted would save and reuse the base environment. Mpirun didn't do that, and the odls was using the orted's globally saved environment (which wasn't being set). This fix establishes a globally accessible base launch environment that both the orted and mpirun can utilize. Since we now use that, we don't need to pass it to the odls_launch_proc function, so remove that param from the API (and modify all components to handle the change). This commit was SVN r15405.	2007-07-13 15:47:57 +00:00
Ralph Castain	bd65f8ba88	Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. Short description: major changes include - 1. singletons now fork/exec a local daemon to manage their operations. 2. the orte daemon code now resides in libopen-rte 3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided. I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine. This commit was SVN r15390.	2007-07-12 19:53:18 +00:00
Jeff Squyres	64083570f5	Add support for DDT parallel debugger, which required several things: * Making some symbols and types be global (vs. static) in orterun * Adding a "ddt" entry in the MCA parameter orte_base_user_debugger default value * Add support for @executable@, @executable_argv@, and @single_app@ tokens in the orte_base_user_debugger MCA parameter. * Added various error checks and corresponding help messages after finding a debugger in the PATH Fixes trac:1081 This commit was SVN r15323. The following Trac tickets were found above: Ticket 1081 --> https://svn.open-mpi.org/trac/ompi/ticket/1081	2007-07-10 12:53:48 +00:00
Jeff Squyres	892bc38ad0	Protect against a bad free: full_line points to the full buffer. But line may point to a few characters beyond the beginning of the buffer (if the buffer had some extra white space padding at the beginning). So if we want to free the buffer, free full_line, not line. This commit was SVN r15315.	2007-07-09 19:56:16 +00:00
Josh Hursey	f88aa6c273	This commit cleans up the AMCA parameter implementation a bit. * Remove the 'opal_mca_base_param_use_amca_sets' global variable * Harness the fact that you can (read should) call the cmd_line functions before initializing opal_init_util(). This pushes the MCA/GMCA/AMCA command line options into the environment before OPAL inits and starts to use these values. By putting the cmd_line parse before opal_init_util in orterun and orted we only parse the MCA parameter files once, and correctly (alleviating the need to 'recache' the files on init.) Small bits of cleanup. This commit was SVN r15219.	2007-06-27 01:03:31 +00:00
Rainer Keller	15c03e8acc	- Apply patch 31_manpages_lintian.dpatch Thanks to Dirk Eddelbuettel <edd@debian.org> This commit was SVN r15215.	2007-06-26 21:13:10 +00:00
Josh Hursey	dd021e7121	Remove some leftover debugging that must have been accidentally left in r15142. This commit was SVN r15145. The following SVN revision numbers were found above: r15142 --> open-mpi/ompi@a3998a1676	2007-06-20 14:06:13 +00:00
George Bosilca	a3998a1676	Allow the symbols required by TotalView to be exported even when the visibility feature is on. This commit was SVN r15142.	2007-06-19 22:35:23 +00:00
Ralph Castain	85df3bd92f	Bring in the generalized xcast communication system along with the correspondingly revised orted launch. I will send a message out to developers explaining the basic changes. In brief: 1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names. 2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (c) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used. 3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying. Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed. This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. 
The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems. This commit was SVN r15007.	2007-06-12 13:28:54 +00:00
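The binomial relay mentioned in the commit above can be illustrated with a small sketch. This is not the ORTE implementation; it assumes daemons are numbered 0..size-1 with the HNP as rank 0, and the function name `binomial_children` is hypothetical. It shows one common way to pick which daemons each daemon relays to so that every daemon receives the message exactly once.

```python
def binomial_children(rank, size):
    """Return the daemon ranks that `rank` relays to in a binomial tree
    rooted at rank 0 (hypothetical numbering, not ORTE's actual code)."""
    children = []
    mask = 1
    # Skip every bit position at or below the highest set bit of `rank`;
    # those positions were used on the path from the root down to `rank`.
    while mask <= rank:
        mask <<= 1
    # Each remaining bit position yields one child, as long as it exists.
    while rank + mask < size:
        children.append(rank + mask)
        mask <<= 1
    return children


if __name__ == "__main__":
    for daemon in range(8):
        print(daemon, "->", binomial_children(daemon, 8))
```

With 8 daemons the root relays to three daemons and the tree has depth 3, whereas the linear mode would have the root send 7 messages itself.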
Brian Barrett	508da4e959	OS X apparently really doesn't like shared libraries with unresolvable symbols in them and environ is defined only in the final application (probably in crt1.o). Apple provides a function for getting at the environment, so use that instead if it's available. This commit was SVN r14857.	2007-06-05 03:03:59 +00:00
Ralph Castain	a2964f429e	Fix a compiler warning - strncmp returns an int, so you have to compare to 0 instead of NULL. This commit was SVN r14790.	2007-05-29 18:02:10 +00:00
Anya Tatashina	de676d717b	Ref Trac #1032; added support for full-path launching with TotalView. This commit was SVN r14789.	2007-05-29 17:39:11 +00:00
Ralph Castain	4fff584a68	Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that did start. The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system. Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed. Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief. With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn. Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. 
This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put". This commit was SVN r14711.	2007-05-21 18:31:28 +00:00
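The "arith" operation described above is an atomic read-modify-write. The sketch below is not the GPR API; the `Registry` class, its method names, and the use of integer division are assumptions made for illustration. It shows why a single locked operation differs from a separate "get" followed by a "put".

```python
import operator
import threading


class Registry:
    """Toy key/value registry (hypothetical, not the ORTE GPR)."""

    # Assumption: values are integers, so division is integer division.
    _OPS = {
        "add": operator.add,
        "sub": operator.sub,
        "mul": operator.mul,
        "div": operator.floordiv,
    }

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def put(self, key, value):
        with self._lock:
            self._values[key] = value

    def get(self, key):
        with self._lock:
            return self._values[key]

    def arith(self, key, op, operand):
        # The fetch, the modification, and the store all happen under one
        # lock, so no other caller can change the value in between.
        with self._lock:
            self._values[key] = self._OPS[op](self._values[key], operand)
            return self._values[key]


if __name__ == "__main__":
    reg = Registry()
    reg.put("trigger_level", 10)
    print(reg.arith("trigger_level", "sub", 1))
```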
Ralph Castain	d9acc93efa	Compute and pass the local_rank and local number of procs (in that proc's job) on the node. To be precise, given this hypothetical launching pattern: host1: vpids 0, 2, 4, 6 host2: vpids 1, 3, 5, 7 The local_rank for these procs would be: host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3 host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3 and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid. I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future. This commit was SVN r14706.	2007-05-21 14:30:10 +00:00
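The local_rank numbering in the commit above can be reproduced with a short sketch. It assumes local ranks are assigned in increasing vpid order among the procs of one job on one node, which matches the example in the message; the function name `local_ranks` and the dictionary layout are hypothetical.

```python
from collections import defaultdict


def local_ranks(placement):
    """Map each vpid to (local_rank, num_local_procs).

    `placement` maps vpid -> host name for the procs of a single job.
    """
    by_host = defaultdict(list)
    for vpid in sorted(placement):
        by_host[placement[vpid]].append(vpid)

    result = {}
    for vpids in by_host.values():
        for local_rank, vpid in enumerate(vpids):
            result[vpid] = (local_rank, len(vpids))
    return result


if __name__ == "__main__":
    # The launch pattern from the commit message: even vpids on host1,
    # odd vpids on host2.
    placement = {v: "host1" if v % 2 == 0 else "host2" for v in range(8)}
    for vpid, (lr, n) in sorted(local_ranks(placement).items()):
        print(f"vpid {vpid}: local_rank {lr} of {n}")
```

A comm_spawn'ed child would be passed through the same function separately, since it belongs to a different jobid.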
Ralph Castain	75d51812a3	Fix the app-failed-to-start capability that was broken by r14554 (holding the caller in rmgr.spawn until the application - as opposed to just the orteds - has started). Allow the rmgr.spawn function to return if the app terminates, correctly handling its return status code to show abnormal termination. Modify orterun to correctly handle the returned status code so it doesn't enter a conditioned wait if the app fails to start, since it would never wake up if it did. This commit was SVN r14693. The following SVN revision numbers were found above: r14554 --> open-mpi/ompi@4510b42638	2007-05-18 13:29:11 +00:00
Jeff Squyres	51ff779a5d	Minor grammatical nit found by Karen/Sun. This commit was SVN r14622.	2007-05-08 21:24:44 +00:00
Jeff Squyres	395d05b6bc	Update the man page to describe both -wdir and -wd. -wdir is considered the "primary" option and -wd is the synonym. Regardless, each functions exactly like the other. This commit was SVN r14618.	2007-05-08 20:27:20 +00:00
Jeff Squyres	8a68b2dba7	Add -wdir option as a synonym for -wd (to make us match the man page). This commit was SVN r14614.	2007-05-08 19:09:32 +00:00
Sven Stork	a04c8eb39a	- Bring over the visibility feature, for finer symbol export control via the visibility support that is provided by some compilers. This feature is disabled by default; to enable it you need to configure with --enable-visibility, and obviously you need a compiler with visibility support. Please refer to the wiki for more information. https://svn.open-mpi.org/trac/ompi/wiki/Visibility This commit was SVN r14582.	2007-05-04 09:03:37 +00:00
Ralph Castain	c774f641fb	Modify orterun to provide more user-friendly reporting on jobs that fail to start This commit was SVN r14496.	2007-04-24 19:19:14 +00:00
Ralph Castain	18b2dca51c	Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly. There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd really have to try). This also involved a slight change to the oob.xcast API, so propagated that as required. Note: this has only been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-) Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately. This commit was SVN r14475.	2007-04-23 18:41:04 +00:00
Jeff Squyres	0ba47105ed	Merge the /tmp/jms-installdirs-trunk branch into the trunk. This finally brings in functionality that is already on the 1.2 branch, and was developed and tested in the v1.2ofed branch (and other places). Short version of new features: * Support for ibv_fork_init() * Automatically fill in the openib BTL bandwidth value by querying the HCA port * Installdirs functionality * Fixes to always use -I in the Fortran wrapper compilers (#924) * Gleb's mpool updates * Remove some kruft in btl/openib/configure.m4, therefore fixing the harmless warnings noted in #665 * Bunches of updates to the Linux RPM spec file I.e., effectively the same thing that r14411 brought to the v1.2 branch. Also effectively brought in r14432 and r14433 (some fixes on top of the original r14411 commit to v1.2). Still need to bring in the moral equivalent of r14445 after this commit (fixes to installdirs). This commit was SVN r14449. The following SVN revision numbers were found above: r14411 --> open-mpi/ompi@83b31314ae r14432 --> open-mpi/ompi@a48f160595 r14433 --> open-mpi/ompi@68f346d2bc r14445 --> open-mpi/ompi@13d366b827	2007-04-21 00:15:05 +00:00
Josh Hursey	8fd6d4ba09	add a newline so output is cleaner/clearer This commit was SVN r14229.	2007-04-05 17:45:03 +00:00
George Bosilca	f2a6b9394f	Deal with the include spree. Protect "environ" on Windows. Some other minor modifications in order to make it compile [again] on Windows. This commit was SVN r14188.	2007-04-01 16:16:54 +00:00
Ralph Castain	0d98264097	Fix the nolocal option on the OMPI trunk This commit was SVN r14138.	2007-03-24 16:16:16 +00:00
Jeff Squyres	bcdfbacaa4	Oops -- typo from previous commit. :-( This commit was SVN r14130.	2007-03-23 00:51:50 +00:00
Jeff Squyres	a3dd0f2e08	Connect --nolocal up to the MCA param rmaps_base_schedule_local, as it should be (it's a mistake that it got left out). This commit was SVN r14127.	2007-03-22 19:29:47 +00:00
Josh Hursey	dadca7da88	Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). This merge adds Checkpoint/Restart support to Open MPI. The initial frameworks and components support a LAM/MPI-like implementation. This commit follows the risk assessment presented to the Open MPI core development group on Feb. 22, 2007. This commit closes trac:158 More details to follow. This commit was SVN r14051. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r13912 The following Trac tickets were found above: Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158	2007-03-16 23:11:45 +00:00
Josh Hursey	0404444dbe	* Added 2 new MCA parameters - mca_base_param_file_prefix (Default: NULL) This is the full name of the "-am" mpirun option. Used to specify a ':' separated list of AMCA parameter set files. - mca_base_param_file_path (Default: $SYSCONFDIR/amca-param-sets/:$CWD) The path to search for AMCA files with relative paths. A warning will be printed if the AMCA file cannot be found. * Added a new function "mca_base_param_recache_files" that re-reads the file configurations. This is used internally to help bootstrap the MCA system. * Added a new orterun/mpirun command line option '-am' that is an alias for the mca_base_param_file_prefix MCA parameter * Exposed the opal_path_access function as it is generally useful in other places in the code. * New function "opal_cmd_line_make_opt_mca" which will allow you to append a new command line option with MCA parameter identifiers to set at the same time. Previously this could only be done at command line declaration time. * Added a new directory under the $pkgdatadir named "amca-param-sets" where all the 'shipped with' Open MPI AMCA parameter sets are placed. This is the first place to search for AMCA sets with relative paths. * An example.conf AMCA parameter set file is located in contrib/amca-param-sets/. * Jeff Squyres contributed an OpenIB AMCA set for benchmarking. Note: You will need to autogen with this commit as it adds a configure param. Sorry :( This commit was SVN r13867.	2007-03-01 13:39:20 +00:00
Pak Lui	2d6b3776bf	* fix the SEGV described in trac #892 that the exit_status in the 200 range causes a strsignal to show NULL as a result. Still trying to determine why exit_status is in that range. This commit was SVN r13583.	2007-02-09 16:39:30 +00:00
Pak Lui	ccff0a6e65	* minor fix to correct the pid that always shows up as 0 in the abort error message. e.g: mpirun noticed that job rank 2 with PID 0 on node burl-ct-v440-4 exited on signal 15 (Terminated). This commit was SVN r13537.	2007-02-07 17:46:19 +00:00
George Bosilca	9f73335bdb	Silence the compiler. This commit was SVN r13381.	2007-01-31 04:24:56 +00:00
Jeff Squyres	8d872b195a	Refs trac:726 Tested this functionality quite a bit more and made some fixes: * Print far fewer help messages * Fix one additional deadlock upon error * Change some ORTE_LOG messages to silent (because they're not errors) * Some code got re-indented, sorry... Discussed and reviewed with Ralph. This commit was SVN r13375. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-30 23:03:13 +00:00
Ralph Castain	ab5ea61100	Bring over the rest of the ctrl-c fixes. This commit includes: 1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned. 2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we printout an appropriate error message and just give up. 3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded 4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond This needs more testing before migrating to 1.2. This commit was SVN r13304.	2007-01-25 14:17:44 +00:00
Ralph Castain	455e4ada9a	Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes. This commit was SVN r13191.	2007-01-18 17:15:19 +00:00
Ralph Castain	cc905290e4	Fix the pernode and npernode options - the mca parameters weren't being set to correspond to the command line options This commit was SVN r13151.	2007-01-17 14:56:22 +00:00
Ralph Castain	5d698dc55b	Turn "off" an unimplemented command line option - we do not currently support execution without mpirun waiting for job completion. This commit was SVN r13127.	2007-01-16 16:10:31 +00:00
Jeff Squyres	8a289cf1cb	Part 1 of the fix for ticket #726. This commit adds logic to orterun to effect the following: * The first time the user hits ctrl-c, we go into the process of killing the ORTE job (this is not new). * While waiting for the job to actually terminate, if the user hits ctrl-c a second time, we print a warning saying "Hey, I'm still trying to kill the job. If you really want me to die immediately, hit ctrl-c again within 1 second." * If the user hits ctrl-c again within 1 second, orterun quits with a warning about how the job may not have actually been killed. Note that none of this logic will really work until the second part of the fix for #726 is also committed (i.e., make pls.terminate_job() non-blocking). So I'm now throwing the ticket over to Ralph for the second part of the fix... Refs trac:726 This commit was SVN r13040. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-08 20:25:26 +00:00
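The ctrl-c escalation described in the commit above can be modelled as a small state machine. This is an illustrative sketch, not the orterun source: the function name, the action labels, and the rule that a late press simply repeats the warning are all assumptions.

```python
def handle_ctrl_c(press_times, window=1.0):
    """Return the action taken for each ctrl-c press.

    `press_times` are timestamps in seconds. The first press starts killing
    the job, a later press prints the warning, and a press within `window`
    seconds of a warning quits immediately.
    """
    actions = []
    warned_at = None
    for index, now in enumerate(press_times):
        if index == 0:
            actions.append("kill_job")
        elif warned_at is not None and now - warned_at <= window:
            actions.append("force_quit")
            break
        else:
            # Assumption: a press that arrives too late repeats the warning
            # and restarts the one-second window.
            actions.append("warn")
            warned_at = now
    return actions


if __name__ == "__main__":
    print(handle_ctrl_c([0.0, 5.0, 5.5]))
    print(handle_ctrl_c([0.0, 5.0, 7.0]))
```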
Brian Barrett	bc6cec346f	Print out the description of the signal from mpirun when a proc was aborted by a signal if we have strsignal() This commit was SVN r12888.	2006-12-17 20:01:11 +00:00
Ralph Castain	7b8f445e13	Modify the "--display-map-at-launch" option to just "--display-map". Now that we have a "--do-not-launch" option, the "-at-launch" part of the display-map option was confusing. "--display-map" displays the resulting process map before we launch anyway, so this is clearer. This commit was SVN r12840.	2006-12-13 13:49:15 +00:00
Ralph Castain	82946cb220	Add a new option to orterun: "--do-not-launch" directs the system to do the allocation, map, job setup, etc., but don't actually launch the job. This lets us test all the setup portions of the code. Also, take the first step in updating how we handle mca params in ORTE - bring it closer to how it is done in the other two layers. Much more work to be done here. This commit was SVN r12838.	2006-12-13 04:51:38 +00:00
Ralph Castain	28ce8e5e5e	Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior). In "--npernode" operation, the "-np" command line parameter is ignored. This commit was SVN r12826.	2006-12-12 00:54:05 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Rainer Keller	e61dd8722e	- Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx - No functional changes, just indentation and corrections to error output. This commit was SVN r12734.	2006-12-03 13:59:23 +00:00
Ralph Castain	f771cc4fbd	Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option. Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution. This commit was SVN r12619.	2006-11-17 19:06:10 +00:00
Ralph Castain	ca5b4358fa	Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too. Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone. This commit was SVN r12616.	2006-11-17 02:58:46 +00:00
Ralph Castain	044898f4bf	My eyes may be deceiving me....but I do believe these comparisons are backwards! I think we only really want to "free" these variables if they are NOT NULL - as opposed to "free"ing them if they ARE NULL. This commit was SVN r12612.	2006-11-15 22:59:01 +00:00
Ralph Castain	f7fc19a2ca	Create the ability to re-use existing daemons. Included in the commit: 1. new functionality in the pls base to check for reusable daemons and launch upon them 2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED 3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse". This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on. This commit was SVN r12608.	2006-11-15 21:12:27 +00:00
Ralph Castain	6d6cebb4a7	Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things). Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it. I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn). This commit was SVN r12597.	2006-11-14 19:34:59 +00:00
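The job "family" queries listed in the commit above (parent, root, immediate children, all descendants) amount to walking a tree of jobids. The sketch below is a toy model under that assumption; the `JobFamily` class and its method names are hypothetical and are not the ORTE name service API.

```python
class JobFamily:
    """Toy record of parent/child relationships between jobids."""

    def __init__(self):
        self._parent = {}

    def add_job(self, jobid, parent=None):
        # A job started directly by the user has no parent.
        self._parent[jobid] = parent

    def parent(self, jobid):
        return self._parent[jobid]

    def root(self, jobid):
        # Follow parent links until reaching the original job.
        while self._parent[jobid] is not None:
            jobid = self._parent[jobid]
        return jobid

    def children(self, jobid):
        return sorted(j for j, p in self._parent.items() if p == jobid)

    def descendants(self, jobid):
        # Breadth-first walk over children, grandchildren, and so on.
        found = []
        pending = self.children(jobid)
        while pending:
            job = pending.pop(0)
            found.append(job)
            pending.extend(self.children(job))
        return found


if __name__ == "__main__":
    family = JobFamily()
    family.add_job(1)
    family.add_job(2, parent=1)   # comm_spawn'ed by job 1
    family.add_job(3, parent=2)   # comm_spawn'ed by job 2
    # Terminating job 1 "with descendants" would target these jobs:
    print([1] + family.descendants(1))
```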
Ralph Castain	4e50cdae52	This commit accomplishes two things: 1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs. 2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system. This commit was SVN r12558.	2006-11-11 04:03:45 +00:00
Ralph Castain	30de73a712	Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include: 1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch" 2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command. 3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job 4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map. Enjoy! My personal favorite is the first one - helps with debugging. This commit was SVN r12379.	2006-10-31 22:16:51 +00:00
George Bosilca	ee559e9947	Do not completely reset the orterun_globals. Keep the condition and the mutex, but reset everything else. Once initialized, the condition (and the attached mutex) should be kept alive as long as possible if we want to be able to retrieve all the information. This commit was SVN r12253.	2006-10-23 03:34:08 +00:00
Ralph Castain	13227e36ab	This commit looks a lot bigger than it is, so relax :-) Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off. To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place. I used those capabilities in two places: 1. Added an attribute list to the rmgr.spawn interface. 2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h). So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms. This commit was SVN r12138.	2006-10-17 16:06:17 +00:00
Ralph Castain	3f55d6897a	Remove the memory debugging options. Fix what appears to be a typo in a help file. This commit was SVN r12107.	2006-10-12 00:44:48 +00:00
Ralph Castain	2da8245be0	Correctly propagate no-daemonize This commit was SVN r12093.	2006-10-11 17:53:17 +00:00
Ralph Castain	27e305347c	Add a couple of options to orterun that support debugging of daemons for memory corruption. Ensure that the environment provided to local application processes isn't "polluted" by the orteds This commit was SVN r12087.	2006-10-11 15:18:57 +00:00
Ralph Castain	e7f6fa22d6	Fix return code so that mpirun returns the right thing when an abort is encountered. This commit was SVN r12065.	2006-10-09 01:04:00 +00:00
Ralph Castain	2e09128337	Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destructor to fail!! Restore the OBJ_RELEASE calls to cleanup map objects. This commit was SVN r12064.	2006-10-07 22:44:00 +00:00
Ralph Castain	98dd57b70e	Add a new option to launch "pernode" - launches one process/node across all available nodes. The other options also work correctly: "-bynode" with no -np will launch on all slots, mapped on a per-node basis. This commit was SVN r12063.	2006-10-07 19:50:12 +00:00
Ralph Castain	889ddefe85	Remove release that caused totalview connection to bomb This commit was SVN r12061.	2006-10-07 18:25:56 +00:00
Ralph Castain	ae79894bad	Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue. I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there). Gridengine compiles but I cannot test (believe it likely will run). Poe and xgrid compile to the extent they can without the proper include files. This commit was SVN r12059.	2006-10-07 15:45:24 +00:00
Jeff Squyres	72cf2fe813	Oops: --noprefix should not take an argument. This commit was SVN r12043.	2006-10-06 13:02:56 +00:00
George Bosilca	d628a18411	Right now there is no support for TotalView on Windows. Therefore, we don't really care how these functions and variables are declared. This commit was SVN r11996.	2006-10-05 05:19:03 +00:00
Ralph Castain	12328395ae	Missed a couple of debug statements This commit was SVN r11935.	2006-10-02 15:46:41 +00:00
Tim Prins	53b116d309	This commit fixes trac:452. It turns out that we were improperly allocating an array if -np was not passed. Also, we were not really using this array for anything. So this gets rid of the array and performs some minor cleanup. This commit was SVN r11934. The following Trac tickets were found above: Ticket 452 --> https://svn.open-mpi.org/trac/ompi/ticket/452	2006-10-02 15:03:43 +00:00
Ralph Castain	559b9b0ae8	Continue beating on comm_spawn. Setup to debug bproc. This commit was SVN r11932.	2006-10-02 14:58:22 +00:00
Ralph Castain	121f834776	Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component. Still need some work on the proxy component, and on job termination for persistent daemon case. This commit was SVN r11928.	2006-10-02 00:46:31 +00:00
Tim Prins	e4f8ad303e	Fix for #397: on 64-bit platforms sizeof(size_t) != sizeof(orte_std_cntr_t), but we were incorrectly assuming they were equal when dealing with num procs. It worked on little-endian platforms, but not on big-endian ones. So change num_procs to type int, and cast where needed. This commit was SVN r11796.	2006-09-25 19:41:54 +00:00
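Why a size mismatch like the one above can go unnoticed on little-endian machines is easy to demonstrate. The sketch assumes an 8-byte size_t and a 4-byte counter type (the commit only states that the sizes differ), and the helper name `read_as_int32` is hypothetical. It stores a value in 8 bytes and then reads only the first 4, as code treating the storage as the smaller type would.

```python
import struct


def read_as_int32(value, byte_order):
    """Store `value` as an 8-byte unsigned integer, then read the first
    4 bytes as a signed 32-bit integer.

    `byte_order` is "<" for little-endian or ">" for big-endian.
    """
    raw = struct.pack(byte_order + "Q", value)
    return struct.unpack(byte_order + "i", raw[:4])[0]


if __name__ == "__main__":
    num_procs = 4
    print("little-endian:", read_as_int32(num_procs, "<"))
    print("big-endian:   ", read_as_int32(num_procs, ">"))
```

On little-endian layouts the low-order bytes come first, so small values survive the truncated read; on big-endian layouts the first four bytes are the high-order half, which is zero for small values.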
Ralph Castain	0ad0d84afd	Add two new API functions to the RMGR, and modify the "spawn" API to support the enhanced MPI-2 functionality. No implementation backs these new APIs - just placeholders for now. This commit was SVN r11699.	2006-09-19 01:45:05 +00:00
George Bosilca	f8de894efe	This one wasn't supposed to get into the repository. This commit was SVN r11697.	2006-09-18 21:28:55 +00:00
George Bosilca	7ad23ff97b	Be 100% TotalView friendly. Let TotalView find out the real name of our executable and export all functions as they should be. This commit was SVN r11694.	2006-09-18 17:55:14 +00:00
Jeff Squyres	8226dab86c	Fixes trac:377 Add --enable-orterun-prefix-by-default (and a synonym: --enable-mpirun-prefix-by-default) to make orterun always behave as if "--prefix $prefix" was given on the command line (where $prefix is the value given to the --prefix option to configure). This prevents many rsh/ssh users from needing to modify their shell startup files to set the LD_LIBRARY_PATH for Open MPI (they will still need to set PATH or otherwise find the OMPI executables to mpicc/mpirun/etc. their MPI applications). Also added --noprefix option to orterun to disable this behavior. Finally, note that even if --enable-orterun-prefix-by-default is specified, if the user specifies --prefix or /path/to/mpirun, these options will override the default value of the prefix ($prefix). This commit was SVN r11669. The following Trac tickets were found above: Ticket 377 --> https://svn.open-mpi.org/trac/ompi/ticket/377	2006-09-15 02:52:08 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
Galen Shipman	b02185374f	Push a generated "key" out to all the processes. This is necessary for some interconnect wireup in which all processes must agree on a "key" to initialize the interconnect with. This commit was SVN r11653.	2006-09-14 15:27:17 +00:00
George Bosilca	e04032ca2f	Correct a comment and protect the usage of the environ variable against Windows. This commit was SVN r11397.	2006-08-24 16:18:42 +00:00
George Bosilca	75fa0317da	Keep environ as the preferred storage for the environment variables. This commit was SVN r11351.	2006-08-23 06:14:24 +00:00
George Bosilca	b4732f557a	Now it's time to update ORTE. Cleanup most of the ORTE tools. Force them to use opal_basename and opal_dirname. Don't create the path manually. Use the specialized opal functions instead. This commit was SVN r11345.	2006-08-23 02:35:00 +00:00
George Bosilca	6ef0acf99f	The names of the defines should start with OPAL as they belong to the OPAL layer. We now support 64-bit Windows too. This commit was SVN r11312.	2006-08-21 21:55:41 +00:00
Ralph Castain	8c7f0ed9ae	Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system. Other changes: 1. Remove the old xcpu components as they are not functional. 2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one. This will require an autogen/configure, I'm afraid. This commit was SVN r11228.	2006-08-16 16:35:09 +00:00
Ralph Castain	5dfd54c778	With the branch to 1.2 made.... Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced). Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up). I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t). In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but... Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems. This commit was SVN r11204.	2006-08-15 19:54:10 +00:00
Ralph Castain	8bec270f90	Fix a bug noted by Jeff - we were no longer accurately recording in the registry that a process had been terminated when the user initiated the "kill" process (via ctrl-c). Added another system-level test function for ORTE that just spins until terminated by a ctrl-c signal. Modified orterun - added a couple of newlines to the output when abnormally terminating so the prompt always is on a new line. This commit was SVN r10866.	2006-07-18 14:42:27 +00:00
Ralph Castain	c22b0d516e	Some edits to the man page for Jeff to review This commit was SVN r10803.	2006-07-14 14:47:06 +00:00
Jeff Squyres	e6c9c699fe	Minor changes: - change -no_oversubscribe to -nooversubscribe (to be similar to -nolocal) - Added text to orterun.1 describing slots and -nooversubscribe Still need to add text about "mpirun a.out" functionality, and RHC wants to make some minor edits, so committing for synchronization. This commit was SVN r10800.	2006-07-14 14:15:03 +00:00
George Bosilca	94f6cb3765	There is no SIG_USR1 and SIG_USR2 on windows. This commit was SVN r10715.	2006-07-11 05:24:08 +00:00
Ralph Castain	febc143d8c	Per LANL's stated need, add functionality that runs a.out across ALL available process slots if no num_proc is specified on the command line. However, please note the following limitation: we ONLY allow ONE application to be specified on the command line when this feature is invoked. If multiple apps are specified, the user MUST also specify the number to be launched for each and every one of them. Update the help text to report errors when not following that rule. Also updated the RMAPS help text to reflect the reorganization of some of the round-robin code into the base. The new functionality has been tested under Mac OS-X and on Odin using an MPI program. Both byslot and bynode mapping have been checked and verified. Operational support for other systems needs to be verified - I respectfully request people's help in doing so. This commit was SVN r10708.	2006-07-10 21:25:33 +00:00
Jeff Squyres	538965aeb0	Final merge of stuff from /tmp/tm-stuff tree (merged through /tmp/tm-merge). Validated by RHC. Summary: - Add --nolocal (and -nolocal) options to orterun - Make some scalability improvements to the tm pls This commit was SVN r10651.	2006-07-04 20:12:35 +00:00
Josh Hursey	2edf1511fd	Closes ticket #173 : Split name linking up for orte/ompi shared tools. This moves the logic to create the symbolic links for: - mpirun - mpiexec - ompi-ps - ompi-clean and their respective man pages to the ompi level from the orte layer. This is a bit pedantic, but orte shouldn't be doing the work of ompi since that is a bit of an abstraction break. Note: need to autogen.sh to get this. Sorry :( This commit was SVN r10602.	2006-06-30 22:01:56 +00:00
Brian Barrett	b6663c64c7	* fix for bug #161 - add man page info for recently added features This commit was SVN r10514.	2006-06-26 22:16:39 +00:00
Brian Barrett	86861bc1c3	* add --quiet option, and suppress a couple of the status messages in orterun if it is actually enabled. For ticket #129. This commit was SVN r10497.	2006-06-26 18:21:45 +00:00
Brian Barrett	4e8abb943b	* fix up signal handling code so that one function handles SIGUSR1 and SIGUSR2. This can be extended later if needed to include other signals we should forward to the user processes (TSTP and CONT, perhaps?) * Since the signal handlers don't actually run in signal context, we can use malloc/fprintf/etc. So clean up some of the signal handler code so that we don't keep message buffers around for the life of the process This commit was SVN r10496.	2006-06-26 15:12:52 +00:00
Brian Barrett	9766c01e50	* Per discussion at quarterly meeting and bug #91 , print out the bug contact point when printing version and help strings This commit was SVN r10484.	2006-06-22 19:48:27 +00:00
Brian Barrett	5c89dc6946	Fix for ticket #91 mpirun/orterun now has an option to print the version number. If -V/--version is given, it will print the version number. If it's the only option, we exit cleanly. Otherwise, we continue on as if --version wasn't given (except we've printed the version number). This commit was SVN r10276.	2006-06-09 17:21:23 +00:00
Ralph Castain	ee5a626d25	Add ability to trap and propagate SIGUSR1/2 to remote processes. There are a number of small changes that hit a bunch of files: 1. Changed the RMGR and PLS APIs to add "signal_job" and "signal_proc" entry points. Only the "signal_job" entries are implemented - none of the components have implementations for "signal_proc" at this time. Thus, you can signal all of the procs in a job, but cannot currently signal only one specific proc. 2. Implemented those new API functions in all components except xgrid (Brian will do so very soon). Only the rsh/ssh and fork modules have been tested, however, and only under OS-X. 3. Added signal traps and callback functions for SIGUSR1/2 to orterun/mpirun that catch those signals and call the appropriate commands to propagate them out to all processes in the job. 4. Added a new test directory under the orte branch to (eventually) hold unit and system level tests for just the run-time. Since our test branch of the repository is under restricted access, people working on the RTE were continually developing their own system-level tests - thus making it hard to help diagnose problems. I have moved the more commonly-used functions here, and added one specifically for testing the SIGUSR1/2 functionality. I will be contacting people directly to seek help with testing the changes on more environments. Other than compile issues, you should see absolutely no change in behavior on any of your systems - this additional functionality is transparent to anyone who does not issue a SIGUSR1/2 to mpirun. Ralph This commit was SVN r10258.	2006-06-08 18:27:17 +00:00
Jeff Squyres	1d6902296c	Additions to the tm, slurm, and rsh pls modules to handle the --prefix option as discussed on the devel-core mailing list. The Big Difference is that instead of hard-coding the strings "/lib" and "/bin" in to append to the prefix, we append the basename of the local libdir and bindir. Hence, if your libdir is $prefix/lib64, we'll append /lib64 to construct the remote node's LD_LIBRARY_PATH (etc.). Also appended the orterun.1 man page to include a description of --prefix, how it is constructed, what it handles / what it does not, etc. This commit was SVN r9930.	2006-05-16 14:14:12 +00:00
Brian Barrett	52369307f8	Add a feature to the build system that Terry from Sun and I talked about in San Jose. Allow the configure option --disable-binaries to build OMPI, but not build or install the support binaries (so basically, just build the libraries). This commit was SVN r9777.	2006-04-29 02:16:41 +00:00
Brian Barrett	62afa63ded	Initialize length to 0 instead of -1 (size_t might be unsigned and therefore -1 is an issue). This should go to the v1.1 branch... This commit was SVN r9665.	2006-04-20 15:42:36 +00:00
Ralph Castain	c79c1714de	Okaaayyy....let's see if this restores the "prefix" command line option. No idea what the problem was with the other option, but it isn't critical right now, so I'll figure it out later. This commit was SVN r9542.	2006-04-06 07:53:38 +00:00
Ralph Castain	0ba8851a47	Fix the univ_exist option This commit was SVN r9535.	2006-04-05 17:18:06 +00:00
Ralph Castain	b9bdb2125e	Fix and upgrade the console to support better debugging. Activate "dump" commands to display registry content. Remove the blasted opal_output default prefix that made the dump output illegible. Properly connect to existing daemons and/or start new ones. This commit was SVN r9528.	2006-04-04 11:05:52 +00:00
Brian Barrett	99e4c89183	* some typo fixes for orterun manpage * Install orterun manpage as mpirun.1 and mpiexec.1 as well as orterun.1 This commit was SVN r9444.	2006-03-29 01:04:43 +00:00
Jeff Squyres	07b0e559f2	Fix copyright This commit was SVN r9443.	2006-03-29 00:53:11 +00:00
Josh Hursey	35eb1a2970	Added a section on "Specifying Hosts" to the man page. This commit was SVN r9432.	2006-03-27 23:46:38 +00:00
Jeff Squyres	bc96040e1c	- Add Cisco copyright - Add comment explaining why we used INT_MAX - Update NEWS This commit was SVN r9415.	2006-03-24 15:39:09 +00:00
Jeff Squyres	a843ce4c23	Clean up a minor memory leak This commit was SVN r9413.	2006-03-24 15:28:42 +00:00
Ralph Castain	08db67cdf8	Fix the app_context problem for app_files too.... Again, this should be checked by Jeff. This commit was SVN r9393.	2006-03-23 17:55:25 +00:00
Ralph Castain	2a18ebd9e1	Fix the app_context problem. NOTE: JEFF SHOULD CHECK THIS! I found that orterun was not tracking the index number of the app_contexts it was creating. Hence, the app_context->idx field was always sitting at zero. This index is used by the mapper to decide which app_context to use for each process - thus, with the value of each index being zero, the mapper only used the first app_context that was created. All others were ignored. Not sure when this might have gotten changed. Could be it was a problem that always existed, but didn't get exposed until something else was changed. Anyway, it seems to work now - could stand further testing. This commit was SVN r9389.	2006-03-23 16:53:11 +00:00
Josh Hursey	22bac7ae95	a test commit. one more try This commit was SVN r9350.	2006-03-21 00:39:29 +00:00
Josh Hursey	d64aab529f	a test commit. no real changes here. Removing added char. This commit was SVN r9349.	2006-03-21 00:37:13 +00:00
Josh Hursey	c8f9108c18	a test commit. no real changes here This commit was SVN r9348.	2006-03-21 00:33:20 +00:00
Josh Hursey	66edc64be0	Minor comment change This commit was SVN r9316.	2006-03-16 19:00:03 +00:00
Josh Hursey	7fcfd87cd5	Minor date change This commit was SVN r9315.	2006-03-16 18:59:13 +00:00
Jeff Squyres	80bc1850bf	Ensure that --prefix takes precedence over /path/to/orterun This commit was SVN r9183.	2006-02-28 14:44:40 +00:00
Jeff Squyres	88b3e6f8bd	- Fix bug in orterun where --prefix didn't show up in the help output (reported by Cisco) - While in orterun, add a feature that multiple users have asked for: if you specify an absolute pathname to orterun, such as "/path/to/bin/orterun ...", it's equivalent to "orterun --path /path/to ..." This commit was SVN r9181.	2006-02-28 11:52:12 +00:00
Josh Hursey	93e00415d5	A bunch of edits for clarity and precision. Still needs some work, but getting closer This commit was SVN r9098.	2006-02-21 04:17:56 +00:00
Josh Hursey	a3712f7a65	A cleanup checkpoint: - Explained <program> and made a consistency change in the Quick Start section. - Change references to 'app schema' to Open MPI 'app context' - Audit the command line arguments for --foo, -foo stuff. This commit was SVN r9097.	2006-02-21 00:48:31 +00:00
Jeff Squyres	186704a23b	A few updates This commit was SVN r9089.	2006-02-18 04:17:18 +00:00
Josh Hursey	02c999776b	Removed all of the LAM stuff. This needs to be gone over a few more times before it is allowed to see daylight, but has come a long way. Some sections may be off more than a little, but the general idea is there. Need to audit to make sure we don't call the ORTE VHNP's daemons :) This commit was SVN r9078.	2006-02-17 03:47:52 +00:00
Josh Hursey	2938545220	Checkpoint. Finished adding and pruning all the Options. Cleaned up a bunch of man syntax, so it should be 'more' readable (making the assumption that man source is ever readable :p). I am moving on to the "description" and "see also" sections next. This commit was SVN r9077.	2006-02-16 23:38:03 +00:00
Jeff Squyres	c2c2daa966	Change the behavior of orterun (mpirun, mpiexec) to search for argv[0] and the cwd on the target node (i.e., the node where the executable will be running in all systems except BProc, where the searches are run on the node where orterun is invoked). - fork pls now does cwd and argv[0] search in orted - bproc pls does cwd and argv[0] search in orterun - cwd behavior slightly different: - if user specifies a -wdir to orterun, we chdir() to there; if we can't for some reason, abort - if user does not specify a -wdir, try to chdir() to the dir where orterun was invoked. If we can't for some reason (e.g., it doesn't exist on the target node), then try to chdir($HOME). If we can't do that, then just live with whatever default directory we were put in. This commit was SVN r9068.	2006-02-16 20:40:23 +00:00
Jeff Squyres	d741b7f37f	We're adding some specific and complex functionality to orterun, so it really needs to be documented (in part so that users stop asking us how to do it!). This is a first cut at an orterun.1 man page. It is 95% copied from LAM's mpirun.1 man page -- I just edited the very top and am handing this off to Josh to finish the first cut. Then we'll add specific docs about the behavior of some of the finer details. This is not listed in the Makefile.am yet because it's so incomplete/incorrect (w.r.t. OMPI), so I don't want it included in the tarball or installed [yet]. This commit was SVN r9058.	2006-02-16 13:29:37 +00:00
David Daniel	e82c470b32	- Change the exit status set by mpirun when an application process is killed by a signal. The exit status is now set to signo + 128, which conforms with the behavior of (almost) all shells. This commit was SVN r9050.	2006-02-15 22:41:29 +00:00
Brian Barrett	566a050c23	Next step in the project split, mainly source code re-arranging - move files out of toplevel include/ and etc/, moving it into the sub-projects - rather than including config headers with <project>/include, have them as <project> - require all headers to be included with a project prefix, with the exception of the config headers ({opal,orte,ompi}_config.h mpi.h, and mpif.h) This commit was SVN r8985.	2006-02-12 01:33:29 +00:00
Ralph Castain	892b396d70	Ensure that standard triggers are defined for all job/process states so that users can subscribe to those they want to use. Modify the way that is done to avoid over-burdening the standard launch sequence since it doesn't need alerts from all those triggers. This commit was SVN r8938.	2006-02-08 17:40:11 +00:00
Ralph Castain	4b9f015c0b	Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list. This commit was SVN r8912.	2006-02-07 03:32:36 +00:00
Jeff Squyres	ed0fa9720d	Incorporate fix suggested by Chris Gottbratch. This commit was SVN r8750.	2006-01-19 15:21:53 +00:00
George Bosilca	d91650ea85	Do not use explicitly "ln -s" as on some systems it does not work properly ... (windows). Instead use the LN_S variable exported by the Makefile (set to "ln -s" on all Unixes and to "cp -p" on windows). When we remove an executable use the correct extension for its name (add $(EXEEXT) to the name). This commit was SVN r8616.	2005-12-31 12:33:44 +00:00
George Bosilca	f9b07f1912	Protect the includes. This commit was SVN r8532.	2005-12-17 22:05:10 +00:00
Jeff Squyres	e184fd6801	Make sure that what we find is executable This commit was SVN r8513.	2005-12-15 20:31:20 +00:00
Brian Barrett	fee6409708	fix compiler warning and compiler error in totalview code... This commit was SVN r8207.	2005-11-20 18:41:45 +00:00
Jeff Squyres	8d96c21311	Good weekend brainless activity -- implement the orterun command line debugger scheme described in http://www.open-mpi.org/community/lists/users/2005/10/0214.php. This makes our user-level debugger scheme much more vendor-independent (although the "-tv" option will still work for backwards compatibility -- it'll just be a synonym of "--debug"). This commit was SVN r8206.	2005-11-20 16:06:53 +00:00
Jeff Squyres	42ec26e640	Update the copyright notices for IU and UTK. This commit was SVN r7999.	2005-11-05 19:57:48 +00:00
Josh Hursey	e7d5ecf016	Comment out the C/N notation parsing. Interior comment has more details. This commit was SVN r7980.	2005-11-03 18:15:47 +00:00
Tim Woodall	60754acae8	- modified rmaps data structures to point directly to ras node - modified rsh to NOT query for each nodes mapping, as all data is already available in the rmaps structures This commit was SVN r7894.	2005-10-27 17:04:10 +00:00
Jeff Squyres	0629cdc2d7	Bring back the changes from /tmp/jjhursey-rmaps. Specific merge command: svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps . (where "." is a trunk checkout) The logs from this branch are much more descriptive than I will put here (including a really long description from last night). Here's the short version: - fixed some broken implementations in ras and rmaps - "orterun --host ..." now works and has clearly defined semantics (this was the impetus for the branch and all these fixes -- LANL had a requirement for --host to work for 1.0) - there is still a little bit of cleanup left to do post-1.0 (we got correct functionality for 1.0 -- we did not fix bad implementations that still "work") - rds/hostfile and ras/hostfile handshaking - singleton node segment assignments in stage1 - remove the default hostfile (no need for it anymore with the localhost ras component) - clean up pls components to avoid duplicate ras mapping queries - [possible] -bynode/-byslot being specific to a single app context This commit was SVN r7664.	2005-10-07 22:24:52 +00:00
Jeff Squyres	65f1adfedc	Add "-tv" option to orterun: orterun -tv -np 4 foo which will turn around and re-exec: totalview orterun -a -np 4 foo This commit was SVN r7636.	2005-10-05 10:24:34 +00:00
Josh Hursey	50e128ab83	Take out the --map command line argument, since it is not handled properly at the moment. Also remove all references to --map, and (C, N) command line options in the help file. These references will be put back in when these options are implemented. This commit was SVN r7574.	2005-10-01 15:51:20 +00:00
Jeff Squyres	fcef1774d5	Per advice from Ralf W., change the pkgdata declarations in Makefile.am's to be a slightly more correct (and, more importantly, less error-prone) construct. This commit was SVN r7554.	2005-09-30 13:32:39 +00:00
Brian Barrett	e0c3775551	* remove some duplicate dependencies that were making Solaris mad This commit was SVN r7549.	2005-09-30 04:13:26 +00:00
Josh Hursey	a23370c007	Converted some MCA parameters from the old version to the new. Have the ras_base_schedule_policy MCA parameter working once again. Before, it would only do slot-based allocation, even if the MCA parameter was set properly. Currently you can specify to orterun a node allocation by either: -mca ras_base_schedule_policy node -bynode and slot allocation (which is the default) by: -mca ras_base_schedule_policy slot -byslot This commit was SVN r7513.	2005-09-27 02:54:15 +00:00
Tim Woodall	4a813c1d38	support --host option (in addition to -host or -H) This commit was SVN r7483.	2005-09-22 16:08:40 +00:00
Ralph Castain	5686e8119e	Move the error name macro to the errmgr framework. Add a second level of tracing. Remove an obsolete file. This commit was SVN r7445.	2005-09-20 17:09:11 +00:00
Tim Woodall	c25ffb343a	restore host option This commit was SVN r7443.	2005-09-20 13:36:16 +00:00
Tim Woodall	f0cec8ac0c	Both -H and -host options are allowed to specify hostlist (now supported for bproc - will look at rsh) This commit was SVN r7440.	2005-09-20 13:31:13 +00:00
Jeff Squyres	41ba191e9a	Temporarily comment out the -arch and -host options since we do not yet have an rmapper that can handle that information. This commit was SVN r7438.	2005-09-20 08:56:02 +00:00
Ralph Castain	bfef5928a1	Add a second trace option to pass an argument This commit was SVN r7433.	2005-09-19 20:22:22 +00:00
Ralph Castain	86a43b1d29	Add trace to the daemons and orterun so we can tell when their callbacks are being exercised. This commit was SVN r7432.	2005-09-19 17:20:01 +00:00
Brian Barrett	1fcf18c211	non-persistent signal behavior isn't quite right, so use the proper SIGNAL macros and deregister at the appropriate time. This commit was SVN r7293.	2005-09-10 23:22:37 +00:00
Brian Barrett	ed56e743b7	* update configure.ac to use the modern version of AC_INIT and AM_INIT_AUTOMAKE, instead of the deprecated version. * Work around dumbness in modern AC_INIT that requires the version number to be set at autoconf time (instead of at configure time, as it was before). Set the version number, minus the subversion r number, at autoconf time. Override the internal variables to include the r number (if needed) at configure time. Basically, the right thing should always happen. The only place it might not is the version reported as part of configure --help will not have an r number. * Since AM_INIT_AUTOMAKE takes a list of options, no need to specify them in all the Makefile.am files. * Adds support for subdir-objects, meaning that object files are put in the directory containing source files, even if the Makefile.am is in another directory. This should start making it feasible to reduce the number of Makefile.am files we have in the tree, which will greatly reduce the time to run autogen and configure. This commit was SVN r7211.	2005-09-07 05:54:53 +00:00
Jeff Squyres	383d9f58e7	Be [slightly] more descriptive. :-) This commit was SVN r7198.	2005-09-06 16:57:11 +00:00
Rainer Keller	a36347d728	- Support -prefix specification on mpirun/orterun cmd-line per app_context: mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \ -np 2 -prefix /path/to/ompi/on/machineB ./exec2 - Allow with -mca pls_rsh_assume_same_shell 0, the checking for the SHELL-variable on the actual node (currently 1st node). Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and csh/tcsh. This commit was SVN r7195.	2005-09-06 16:10:05 +00:00
Rainer Keller	192625d2a1	- Once again: uninteresting cleanup to get diff smaller. This commit was SVN r7178.	2005-09-04 20:54:19 +00:00
Ralph Castain	12daecb826	More cleanup This commit was SVN r7167.	2005-09-03 01:22:11 +00:00
Jeff Squyres	4c59058053	- Add some logic to configure to make a version of CFLAGS that doesn't include any optimization flags - Use these flags to always compile ompi/debuggers/* and orterun so that parallel debuggers (such as Totalview) can always see the debugging symbols (see comments in ompi/debuggers/Makefile.am and orte/tools/orterun/Makefile.am) - Remove some obsolete LAM-named variables from configure.ac This commit was SVN r7125.	2005-09-01 10:37:20 +00:00
David Daniel	c6054662d5	Forgot to add new header to sources This commit was SVN r7109.	2005-08-31 16:21:58 +00:00
David Daniel	a5eff8fc78	A little more clean-up. TotalView now works with --enable-debug build. Tested with: pls = rsh totalview.6.6.0-2 Linux cadillac82.ccstar.lanl.gov 2.4.24 #1 SMP Thu Jul 1 15:28:04 MDT 2004 i686 i686 i386 GNU/Linux This commit was SVN r7108.	2005-08-31 16:15:59 +00:00
Jeff Squyres	284328afe3	Add missing .h file so that it is included in the tarball This commit was SVN r7107.	2005-08-31 11:01:28 +00:00
George Bosilca	d64a702a5b	There is a missing header. --enable-picky helps to track down this kind of error. This commit was SVN r7102.	2005-08-31 00:47:52 +00:00
David Daniel	995641c1e6	Don't initialize proctable more than once (since the stage gate 1 trigger seems to get fired at least twice). This commit was SVN r7101.	2005-08-31 00:21:55 +00:00
David Daniel	ced11250e4	Basic totalview support for orterun. Close to working, but need to check hostnames are obtained correctly. This commit was SVN r7096.	2005-08-30 17:29:43 +00:00
David Daniel	6cb97e6ade	Reverting totalview support to not use the as yet unimplemented orte_jobgrp_t. Now just need to work out where to call it... This commit was SVN r7092.	2005-08-30 12:59:04 +00:00
Jeff Squyres	774f879a41	Oops -- add second string in there because we added a second %s to the help message. This commit was SVN r7064.	2005-08-27 13:32:25 +00:00
Jeff Squyres	b3bd549331	- Change a few calls from exit() to orte_abort() so that we get session directory cleanup (among other things) - When we get an abnormal exit in orterun (i.e., timeout expires and we haven't gotten termination notices from all processes), print a better message and exit in a better way (which includes session directory cleanup) - Fix tm and poe pls's to not exit() but rather propagate the error up the stack (where relevant) This commit was SVN r7058.	2005-08-26 20:36:11 +00:00
Josh Hursey	4eefb33182	Some param changes: - Change orte_base_infrastructre to orte_infrastructre to conform with ompi_info's needs - Move MCA Param registration in ORTE to a centralized function that is called first in orte_init_stage1 - Set the infrastructure flag as an argument to orte_init - Adjust initialization functions to properly pass down the infrastructure flag. This commit was SVN r7053.	2005-08-26 20:13:35 +00:00
Jeff Squyres	32e71e5c6c	Fix a problem where orterun itself would not receive MCA parameters that were set on the command line. This was technically exactly the way the code was designed, but it certainly violated the Law of Least Astonishment (even to its designer ;-) ). So now if you execute something like this: mpirun -mca pls_rsh_debug 1 -np 4 hello You'll see debugging output from the rsh pls component, as you would expect (this was not previously the case -- the MCA pls_rsh_debug parameter would be set to 1 in the 4 spawned hello processes, but not in the orterun process). More specifically, MCA parameters will be set in the orterun process in the following cases: - The new command line switch "--gmca" (or "-gmca") is used, indicating that the MCA parameter is "global". --gmca also means that that MCA parameter will be applied to all context app's. For example: mpirun -gmca foo bar -np 1 hello : -np 2 goodbye The foo MCA param will be set in both the hello and goodbye processes. - If there is only one context app. For example: mpirun -mca pls_rsh_debug 1 -np 4 hello will set pls_rsh_debug to 1 in both the orterun process and the 4 spawned hello processes. Also added a few more comments inside orterun to document a somewhat confusing use of a state variable in a recursive case. This commit was SVN r6764.	2005-08-08 16:42:28 +00:00
Ralph Castain	1438009dbd	Properly set the MCA parameter to indicate these functions are infrastructure so that the singleton flag does not get set. Somehow, in changing over to the new MCA interfaces, the "set" part of that logic got lost, so the singleton flag was always being set. This should repair some of the anomalous behavior seen recently where the local host was always being used for an application process. This commit was SVN r6757.	2005-08-07 04:17:10 +00:00
Jeff Squyres	d0a0434172	Investigating an MCA param problem -- converted over orterun to new MCA param API in the process. This commit was SVN r6739.	2005-08-04 18:15:47 +00:00
Jeff Squyres	ef9e06451c	Ensure that --mca is listed in the --help message (thanks for pointing this out Gleb!) This commit was SVN r6712.	2005-08-02 18:52:12 +00:00
Josh Hursey	9acbd4e21f	forgot to take out initializer when I removed the verbose stuff This commit was SVN r6682.	2005-07-29 00:21:10 +00:00
Josh Hursey	018c4aa44e	remove unnecessary slashes This commit was SVN r6673.	2005-07-28 21:33:33 +00:00
Josh Hursey	8b56769307	removed the version command line option. Added some more user help messages This commit was SVN r6672.	2005-07-28 21:17:48 +00:00
Tim Prins	d4151fa9fd	properly fix the usage of the app pointer array by checking for NULLs instead of forcing it to be the same size as the number of entries This commit was SVN r6395.	2005-07-08 18:48:25 +00:00
Tim Prins	5cdf0803d4	make the app pointer array blocksize 1 so that the size of the pointer array is the same as the number of apps. This was causing a segfault when trying to launch multiple apps. This commit was SVN r6368.	2005-07-07 18:01:26 +00:00
Brian Barrett	170ef8af1f	* rename ompi_show_help to opal_show_help * rename ompi_stacktrace to opal_stacktrace * rename ompi_strncpy to opal_strncpy This commit was SVN r6336.	2005-07-04 02:38:44 +00:00
Brian Barrett	46245aaac1	* rename orte_os_create_dirpath to opal_os_create_dirpath * rename orte_os_path to opal_os_path * rename ompi_path_find to opal_path_find * rename ompi_pow2 to opal_pow2 This commit was SVN r6334.	2005-07-04 01:59:52 +00:00
Brian Barrett	e55f99d23a	* rename ompi_if to opal_if * rename ompi_malloc to opal_malloc * rename ompi_numtostr to opal_numtostr * start of rename of ompi_environ to opal_environ This commit was SVN r6332.	2005-07-04 01:36:20 +00:00
Brian Barrett	9f44b80291	* rename ompi_argv to opal_argv * rename ompi_basename to opal_basename * rename ompi bitop functions to opal * rename ompi_cmd_line to opal_cmd_line * rename ompi_sizet2int to opal_sizet2int * rename orte_daemon_init to opal_daemon_init * rename ompi_few to opal_few This commit was SVN r6330.	2005-07-04 00:13:44 +00:00
Brian Barrett	a13166b500	* rename ompi_output to opal_output This commit was SVN r6329.	2005-07-03 23:31:27 +00:00
Brian Barrett	23b687b0f4	* rename ompi_event to opal_event This commit was SVN r6328.	2005-07-03 23:09:55 +00:00
Brian Barrett	39dbeeedfb	* rename locking code from ompi to opal This commit was SVN r6327.	2005-07-03 22:45:48 +00:00
Brian Barrett	761402f95f	* rename ompi_list to opal_list This commit was SVN r6322.	2005-07-03 16:22:16 +00:00
Brian Barrett	f1c925475e	* use the orte_pointer_array properly This commit was SVN r6314.	2005-07-03 04:02:01 +00:00
Brian Barrett	8077da277b	* move ompi_rb_tree from opal to ompi since it's only used in ompi, and should have the ompi_free_list instead of the opal_free_list * Change orte to use opal_free_list instead of ompi_free_list This commit was SVN r6307.	2005-07-02 16:46:27 +00:00
Jeff Squyres	aa056f7bfd	First cut of OMPI Makefile.am's, plus a few more catchup updates in orte This commit was SVN r6286.	2005-07-02 15:06:47 +00:00
Jeff Squyres	1b18979f79	Initial population of orte tree This commit was SVN r6266.	2005-07-02 13:42:54 +00:00


421 commits