openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	d4dfb1b7a7	Plug a few bugs in the decoding of pidmaps: 1. after we get enough jobs in the pidmap, the address of the jobmap pointer array data can move due to realloc. Need to reset the jobs pointer each time through to ensure it is pointing to valid data 2. when we exit the loop, rc will be set to an error due to reading past end of buffer - need to reset so it is ignored 3. need to ensure we only try to read one jobid each time through loop This commit was SVN r20030.	2008-11-24 17:57:55 +00:00
Ralph Castain	9a57db4a81	To support comm_spawn in fully routed environments, daemons need to know the route to all procs in their job family. They already had this information, but were not retaining it. The infrastructure to do so has existed for some time - just never had the time to complete it. This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message. This commit was SVN r20022.	2008-11-18 15:35:50 +00:00
Jeff Squyres	d4dfd49cdd	Fix typo found in Makefile that caused problems with "make distclean"; thanks to Mehdi Bozzo-Rey for reporting the problem. This commit was SVN r19936.	2008-11-05 20:58:27 +00:00
Ralph Castain	6db5737779	Remove a couple of mutex vars that were defined and used - but never initialized. No clear way to initialize them, and that area of the code should never see threads anyway. This commit was SVN r19889.	2008-11-03 17:23:10 +00:00
Ralph Castain	f54fda489e	This is a first step towards supporting fully-routed OOB communications: 1. remove direct routed module (hooray!) 2. add radix tree routed module (binomial remains default) 3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess 4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support 5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability 6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon) Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm. This will require an autogen... This commit was SVN r19866.	2008-10-31 21:10:00 +00:00
Ralph Castain	6e5d844c36	Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code: 1. completely and cleanly separates responsibilities between the HNP, orted, and tool components. 2. removes all wireup messaging during launch and shutdown. 3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol. 4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0. 5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none". 6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout. 7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output" This is not intended for the 1.3 release as it is a major change requiring considerable soak time. This commit was SVN r19767.	2008-10-18 00:00:49 +00:00
Ralph Castain	037231fbcb	MOdify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required. This commit was SVN r19637.	2008-09-25 13:39:08 +00:00
Jeff Squyres	e0a991a8c2	Print out a message telling the user how to enable non-aggregated help / error messages. This commit was SVN r19604.	2008-09-22 17:42:56 +00:00
Jeff Squyres	8eccda391a	Fix comment to match the code. This commit was SVN r19598.	2008-09-20 12:35:48 +00:00
George Bosilca	579d70edad	We should use #ifdef and not #if This commit was SVN r19504.	2008-09-05 12:44:19 +00:00
Shiqing Fan	ce40b8a35e	- Fix typo ;-) This commit was SVN r19438.	2008-08-27 17:06:40 +00:00
Ralph Castain	a5efefe980	Ensure var is init before use This commit was SVN r19416.	2008-08-26 13:38:11 +00:00
Ralph Castain	28346b5bac	Get -host to not use empty nodes called out specifically later in the -host list This commit was SVN r19403.	2008-08-26 03:02:28 +00:00
Ralph Castain	6039e385cd	Per request from Terry, make -host and -hostfile respect order when used as filters. In other words, if you specify -host host1,host3,host2, then we should use the hosts in that order. Previously, we used them in whatever order they were found in the allocation - all the -host did was tell us which nodes to use, not what order to use them in. Relative node syntax remains supported. Also, if you specify empty nodes, but have a specific empty node called out later, we will not include that node in the empties we add. I'll provide examples in the manpage. This commit was SVN r19402.	2008-08-26 02:56:10 +00:00
Ralph Castain	6d82efba21	Add relative indexing capabilities for hostfile and -host - we can now reference hosts using a relative syntax. See the orte_hosts manpage for an explanation This commit was SVN r19364.	2008-08-19 15:16:27 +00:00
Dan Lacher	7ef29d4abe	More fixes for #1387 . Minor fixes for the orte_host.7 man page file that was missed in the inital pass. We are using $(am_dirstamp) instead of creating our own dirstamp since there is src code in util/hostfile directory is created. The automake process creates the $(am_dirstamp), we found the use of this in the generated Makefile in the util/Makefile This commit was SVN r19230.	2008-08-08 19:10:02 +00:00
Ralph Castain	01a7259a7d	This fixes ticket #1426 - mpirun is cleaning up ALL session dirs Mpirun - and the orteds - were doing their best to whack all session dirs on their nodes just in case there was something lingering due to an abnormal termination. Unfortunately, they were -too- good at it. They were whacking all session directories under the user's name, even those from other mpiruns! This adds another layer to the session dir tree so that we can denote which jobs come from our own job family, and restricts the cleanup operation to only session dirs from within our own job family. So we'll still cleanup anything due to our own mpirun, but won't whack any other mpirun from this user. Call it being polite... This commit was SVN r19083.	2008-07-29 18:58:35 +00:00
Ralph Castain	1a77b15523	Modify the handling of hostfiles to allow them to subdivide allocations. Utilize the "slots_alloc" field of the orte_node_t object - which had previously been unused - to track the #slots allocated to a given app_context. Let the hostfile filtering action utilize the #slots field to modify the allocated slots for each app_context. This commit was SVN r19066.	2008-07-28 15:10:40 +00:00
Ralph Castain	3137ed9255	Update the manpages for comm_spawn(_multiple) - add man page to explain host/hostfile behavior This commit was SVN r18961.	2008-07-21 17:58:12 +00:00
Ralph Castain	8e3658b320	Remove the nodename:pid prefix from show_help output so it doesn't disrupt the formatted output This commit was SVN r18843.	2008-07-08 22:57:50 +00:00
Ralph Castain	ba5498cdc6	Repair the MPI-2 dynamic operations. This includes: 1. repair of the linear and direct routed modules 2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI 3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature 4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes. Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB 5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept. Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules. 6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch 7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch 8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies 9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc. 10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon There is still more cleanup to be done for efficiency, but this at least works. Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn. Fixes ticket #1256 This commit was SVN r18804.	2008-07-03 17:53:37 +00:00
Ralph Castain	6f85e34d66	Detect homo/hetero scenarios in the nidmap, setup to take appropriate actions in the basic grpcomm module. NOT for inclusion in v1.3 This commit was SVN r18786.	2008-07-01 02:44:57 +00:00
Ralph Castain	b118779c08	It is okay for us to init the ORTE mca params multiple times. Indeed, it is absolutely required by orterun as the first time has to be done prior to parsing the command line, which means that the mca values haven't been parsed yet! Add ability for sys admins to prohibit putting session directories under specified locations. Thus, they can now protect parallel file systems from foolish user mistakes. This commit was SVN r18721.	2008-06-24 17:50:56 +00:00
Ralph Castain	5ebe10ebf1	Fix a bad typo - need to look at the node array as the arch array hasn't been built yet This commit was SVN r18689.	2008-06-19 21:34:39 +00:00
Ralph Castain	3b5e80fa61	Shift responsibility for preconnecting the oob to the orte routed framework, which is the only place that knows what needs to be done. Only the direct module will actually do anything - it uses the same algo as the original preconnect function. This commit was SVN r18677.	2008-06-19 13:48:26 +00:00
Ralph Castain	282a220e7e	Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation This commit was SVN r18670.	2008-06-18 15:28:46 +00:00
George Bosilca	0f9b9c0aff	Remove a warning and add arequired header (otherwise we cannot compile when --disable-debug is specified). This commit was SVN r18665.	2008-06-18 08:10:02 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Ralph Castain	d61fe87d04	Use the opal_show_help system if orte_show_help has not been initialized This fixes ticket #1342 This commit was SVN r18644.	2008-06-11 12:50:40 +00:00
Ralph Castain	8d9ff44134	Add visibility required for some environments and configs This commit was SVN r18629.	2008-06-09 21:28:19 +00:00
Ralph Castain	03ab4f5c64	Make the ifdef name mirror the change in filename This commit was SVN r18626.	2008-06-09 20:36:55 +00:00
Ralph Castain	c13cadc3c7	Refs trac:1255 This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something. So the ticket can be checked by Jeff when he returns from vacation... :-) This commit was SVN r18625. The following Trac tickets were found above: Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255	2008-06-09 20:34:14 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Ralph Castain	ca91ec525b	Add a suffix to the opal_output stream descriptor object - we can now output both a prefix and a suffix for a given stream. Default the suffix to NULL. Remove lingering references to a filtering system as this will no longer be implemented. This commit was SVN r18586.	2008-06-04 20:52:20 +00:00
Ralph Castain	8ce4b64b5a	Ensure we don't go past the end of the array This commit was SVN r18569.	2008-06-03 21:31:02 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	b2a566b610	Break an infinite loop in orte_output caused by debugging of ORTE comm subsystems This commit was SVN r18534.	2008-05-29 12:21:17 +00:00
Ralph Castain	f76240e7cc	Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algo's, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch. This commit was SVN r18528.	2008-05-28 18:38:47 +00:00
Ralph Castain	347752e40b	One more corner case (caught by Jeff) occurs when someone requests usage help - e.g., with mpirun --help. In this case, we do a show_help prior to orte_init, but it is okay to do so. To allow this, we let show_help just operate correctly without any warning about pre-orte_output_init. This commit was SVN r18525.	2008-05-28 13:58:03 +00:00
Ralph Castain	828ae26d90	ORTE-level MCA params are defined in several places. Ompi_info cannot call orte_init due to an issue with the memory allocator, thus making it impossible for ompi_info to display all of the ORTE-level MCA params. By consolidating them all into one function, ompi_info can call that function and register the desired variables. This also requires, however, that ompi_info call orte_output_init to avoid generating tons of error messages, so make that adjustment too. Fixes ticket #1314 In addition, orte_output has a race condition issue whereby calls to orte_output/verbose can occur prior to either the RML being defined/setup, or the HNP being defined. This latter occurs during the initialization of the orte_process_info structure. In both cases, there is no way orte_output can send the output to the HNP. Hence, the message must be simply output locally. Fixes ticket #1315 This commit was SVN r18524.	2008-05-28 13:29:58 +00:00
Ralph Castain	5a2992dea2	After some discussion with Jeff, we have determined that the only time orte_output functions can be accessed prior to calling orte_output_init is if someone calls them prior to calling orte_init. This is clearly an error, so we now report that fact to the caller so it can be fixed. It is still possible that someone can call an orte_output function during orte_finalize - this is not an error. Prior commits ensured that this is correctly handled. This commit only deals with improper calls prior to calling orte_init. This commit was SVN r18513.	2008-05-27 20:13:04 +00:00
Ralph Castain	be193ead83	Ensure that any use of orte_output prior to calling orte_output_init gets a properly initiated stream tracking object to avoid later segfaults This commit was SVN r18512.	2008-05-27 19:07:31 +00:00
Ralph Castain	e190a990ba	Do not re-init the orte_output system if we have already finalized it as part of orte_finalize. Instead, default to routing the output to the std opal_output system so that the message still gets out. Of course, such messages cannot be filtered, but they are only for debug purposes by ORTE developers, so this should be a minimial issue. This commit was SVN r18506.	2008-05-27 15:15:55 +00:00
Jeff Squyres	c546d7bda9	Ensure that the last few duplicates can be shown -- don't shut down everything unitl those duplicates are shown This commit was SVN r18460.	2008-05-20 01:34:55 +00:00
Jeff Squyres	e88ac13e53	Fixes trac:1290: ensure that we setup the orte_init subsystem before using it. This commit was SVN r18448. The following Trac tickets were found above: Ticket 1290 --> https://svn.open-mpi.org/trac/ompi/ticket/1290	2008-05-16 14:32:42 +00:00
Jeff Squyres	28d5f762ca	Fixes trac:1289: ensure that if we haven't initialized the orte_output system, we don't try to use it (e.g., if orte_output or orte_show_help is called before orte_init). This commit was SVN r18442. The following Trac tickets were found above: Ticket 1289 --> https://svn.open-mpi.org/trac/ompi/ticket/1289	2008-05-16 02:03:42 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already make the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not thei opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). * Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so you if explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent view the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing to messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. = New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Ralph Castain	313240f2b6	Pass the pointer to a string pointer to the packing function This commit was SVN r18429.	2008-05-13 01:31:49 +00:00
Ralph Castain	40904dd152	Add a binomial routed module - for now, still completely wires up the daemons, but that will be changed later. Modify grpcomm xcast so it now uses the selected routed module - eliminates cross-wiring of xcast and routing paths. Suboptimal at the moment, but better implementation is on its way. Cleanup ignore properties on the new routed components. This commit was SVN r18377.	2008-05-05 22:32:25 +00:00
Josh Hursey	dcd21d7d07	Some checkpoint/restart fixes in response to r18338 (changes in modex). Things should be working now. This commit was SVN r18348. The following SVN revision numbers were found above: r18338 --> open-mpi/ompi@3e55fe6f6d	2008-05-01 17:48:13 +00:00
Ralph Castain	3e55fe6f6d	Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit. Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs. Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node. This commit was SVN r18338.	2008-04-30 19:49:53 +00:00
Josh Hursey	cc83d41ad9	Merge in tmp/jjh-scratch {{{ svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch . }}} Contains: * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart. * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P. * Cleanup mpi interface with missplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry * Some other sundry cleanup items all dealing with C/R functionality in the trunk. This commit was SVN r18241.	2008-04-23 00:17:12 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Ralph Castain	e050f37578	Cleanup a few warnings about initializing variables. Remove an obsolete data value. This commit was SVN r18129.	2008-04-10 19:15:16 +00:00
Ralph Castain	86b4ae5970	Remove a generated file from the repository - shouldn't have been there This commit was SVN r18116.	2008-04-09 22:13:51 +00:00
Ralph Castain	537395b924	Make two important MCA params "visible" to ompi_info This commit was SVN r18074.	2008-04-02 14:54:57 +00:00
Ralph Castain	51533c9340	Add a new mapper component that sequentially maps ranks-to-hosts according to the ordering in the hostfile. Not functional yet - still under development. Just placeholding for now to clear a backlog This commit was SVN r18062.	2008-04-01 20:03:49 +00:00
George Bosilca	ee784b601e	For consistency reasons always use opal_home_directory and opal_tmp_directory. This commit was SVN r18043.	2008-03-31 18:13:41 +00:00
George Bosilca	493677426d	Use the OPAL function to retrieve the HOME and TMP environment values. This commit was SVN r18037.	2008-03-31 17:10:08 +00:00
George Bosilca	60111ce66d	Few less warnings. This commit was SVN r18025.	2008-03-30 19:06:49 +00:00
Jeff Squyres	8d79bfe860	Fix for CID 937. All we really care about is being able to chrdir; the extra checks were unnecessary. This commit was SVN r18015.	2008-03-29 13:15:22 +00:00
Ralph Castain	6166278e18	Improve the scalability of the modex operation and fix a bug reported by Tim P The bug was a race condition in the barrier operation that caused the barrier in MPI_Finalize to fail on very short programs. Scalaiblity was improved by using the daemons to aggregate modex and barrier messages before sending them to the rank=0 proc. Improvement is proportional to ppn, of course, but there really wasn't a scaling problem at low ppn anyway. This modification also paves the way for better allgather operations since now all the data for each node is sitting at the daemon level, and the daemons are now aware that a collective operation on the OOB is underway (so they -can- participate in a collective of their own to support it). Also added better diagnostics to map out the timing associated with MPI_Init - turned on by -mca orte_timing 1. This commit was SVN r17988.	2008-03-27 15:17:53 +00:00
Ralph Castain	60d931217f	Modify the routed framework to allow greater control/flexibility over response to lost routes and initial wireup of jobs as required by several soon-to-come new modules. Specifically, add two new APIs: 1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job. Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections. 2. get_wireup_info: returns (in a provided buffer) info required to wireup connections for the specified job. Some routed modules do not need to return any info as they can wireup via alternative means, while some need to xchg data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job. The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module. This commit was SVN r17969.	2008-03-26 01:00:24 +00:00
George Bosilca	fe2636cb4a	Coverty fix: use snprintf instead of sprintf. This commit was SVN r17963.	2008-03-25 22:41:25 +00:00
Ralph Castain	ec76fe4fe4	Fix singletons - only go through orte_proc_info_init once! Session dir is calling it be sure it is filled in, which was causing us to reset the fields. This commit was SVN r17944.	2008-03-25 02:10:42 +00:00
Ralph Castain	ebea4d04e4	Remove defunct error constant - we no longer have a GPR that can hold corrupt data! This commit was SVN r17942.	2008-03-24 21:05:14 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Ralph Castain	629b95a2fe	Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway. So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked. Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion. This commit was SVN r17843.	2008-03-17 17:58:59 +00:00
Ralph Castain	ff99aa054f	In order to prevent orphaned processes when using non-unity routing methods, the procs need to realize that their local daemon is a critical connection - if that connection unexpectedly closes, they need to terminate. This commit adds definition for a "lifeline" connection. For an HNP, there is no lifeline, so the lifeline proc is NULL. For a daemon, the lifeline is the HNP - the daemon should abort if it loses that connection. For a proc using unity routed, the lifeline is the HNP since it connects directly to the HNP. For a proc using tree routed, the lifeline is the local daemon. Adjusted OOB to call abort if the lifeline (as opposed to HNP) connection is lost. This commit was SVN r17761.	2008-03-06 15:30:44 +00:00
Ralph Castain	5795427f6a	Protect against situations where someone didn't fill-in all the app_context fields This commit was SVN r17752.	2008-03-05 23:55:03 +00:00
Tim Prins	5de3e1965e	Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte. Everything should work, however I am unable to compile and test the sctp BTL. This commit was SVN r17751.	2008-03-05 22:44:35 +00:00
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executeable. Fixes trac:1124 The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Rolf vandeVaart	03fdd57d5a	Fix the use of --path and -x PATH so that things work properly. Note that --path specifies extra directories where the executable is searched for, but does not affect the PATH settings. This commit fixes trac:1221. This commit was SVN r17748. The following Trac tickets were found above: Ticket 1221 --> https://svn.open-mpi.org/trac/ompi/ticket/1221	2008-03-05 21:07:43 +00:00
Ralph Castain	edb8e32a7a	Add default hostfile parameter plus --default-hostfile command line option. Fix error message when job setup failed This commit was SVN r17724.	2008-03-05 04:54:57 +00:00
Ralph Castain	bf5ba58ce0	Get the count correct when the user lists the same node multiple times for -host. This commit was SVN r17711.	2008-03-05 01:24:34 +00:00
George Bosilca	9d421bea2a	Replace all occurences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Ralph Castain	3dbd4d9be7	Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by: 1. taking advantage of the fact that we no longer create the launch message via a GPR trigger. In earlier times, we had the GPR create the launch message based on a subscription. In that mode of operation, we could not guarantee the order in which the data was stored in the message - hence, we had no choice but to parse the message in a loop that checked each value against a list of possible "keys" until the corresponding value was found. Now, however, we construct the message "by hand", so we know precisely what data is in each location in the message. Thus, we no longer need to send the character string "keys" for each data value any more. This represents a rather large savings in the message size - to give you an example, we typically would use a 30-char "key" for a 2-byte data value. As you can see, the overhead can become very large. 2. sending node-specific data only once. Again, because we used to construct the message via subscriptions that were done on a per-proc basis, the data for each node (e.g., the daemon's name, whether or not the node was oversubscribed) would be included in the data for each proc. Thus, the node-specific data was repeated for every proc. Now that we construct the message "by hand", there is no reason to do this any more. Instead, we can insert the data for a specific node only once, and then provide the per-proc data for that node. We therefore not only save all that extra data in the message, but we also only need to parse the per-node data once. The savings become significant at scale. Here is a comparison between the revised trunk and the trunk prior to this commit (all data was taken on odin, using openib, 64 nodes, unity message routing, tested with application consisting of mpi_init/mpi_barrier/mpi_finalize, all execution times given in seconds, all launch message sizes in bytes): Per-node scaling, taken at 1ppn: #nodes original trunk revised trunk time size time size 1 0.10 819 0.09 564 2 0.14 1070 0.14 677 3 0.15 1321 0.14 790 4 0.15 1572 0.15 903 8 0.17 2576 0.20 1355 16 0.25 4584 0.21 2259 32 0.28 8600 0.27 4067 64 0.50 16632 0.39 7683 Per-proc scaling, taken at 64 nodes ppn original trunk revised trunk time size time size 1 0.50 16669 0.40 7720 2 0.55 32733 0.54 11048 3 0.87 48797 0.81 14376 4 1.0 64861 0.85 17704 Condensing those numbers, it appears we gained: per-node message size: 251 bytes/node -> 113 bytes/node per-proc message size: 251 bytes/proc -> 52 bytes/proc per-job message size: 568 bytes/job -> 399 bytes/job (job-specific data such as jobid, override oversubscribe flag, total #procs in job, total slots allocated) The fact that the two pre-commit trunk numbers are the same confirms the fact that each proc was containing the node data as well. It isn't quite the 10x message reduction I had hoped to get, but it is significant and gives much better scaling. Note that the timing info was, as usual, pretty chaotic - the numbers cited here were typical across several runs taken after the initial one to avoid NFS file positioning influences. Also note that this commit removes the orte_process_info.vpid_start field and the handful of places that passed that useless value. By definition, all jobs start at vpid=0, so all we were doing is passing "0" around. In fact, many places simply hardwired it to "0" anyway rather than deal with it. This commit was SVN r16428.	2007-10-11 15:57:26 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
Ralph Castain	b6c60dfc07	Bring over the extra debugging output that helped a user find his NSF mount problems. This just adds ERROR_LOG messages when the session directory creation process fails so we can see where it is happening - really helps users (and us as well) figure out what specifically went wrong. This commit was SVN r15491.	2007-07-18 19:50:54 +00:00
Ralph Castain	bd65f8ba88	Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. Short description: major changes include - 1. singletons now fork/exec a local daemon to manage their operations. 2. the orte daemon code now resides in libopen-rte 3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided. I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine. This commit was SVN r15390.	2007-07-12 19:53:18 +00:00
Tim Prins	c46ed1d5d4	Make it so the universe size is passed through the ODLS instead of through a gpr trigger during MPI init. This matches what is currently being done with the app number. The default odls has been updated and works fine. The process odls has been updated, but I could not verify its operation. The bproc ODLS has not been updated yet. Ralph will look at it soon. This commit was SVN r15257.	2007-07-02 01:33:35 +00:00
George Bosilca	55cf6fc866	Be a little bit more verbose: tell which file we have trouble with... This commit was SVN r15115.	2007-06-17 04:59:15 +00:00
Ralph Castain	85df3bd92f	Bring in the generalized xcast communication system along with the correspondingly revised orted launch. I will send a message out to developers explaining the basic changes. In brief: 1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names. 2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (b) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used. 3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying. Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed. This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems. This commit was SVN r15007.	2007-06-12 13:28:54 +00:00
Ralph Castain	d9acc93efa	Compute and pass the local_rank and local number of procs (in that proc's job) on the node. To be precise, given this hypothetical launching pattern: host1: vpids 0, 2, 4, 6 host2: vpids 1, 3, 5, 7 The local_rank for these procs would be: host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3 host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3 and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid. I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future. This commit was SVN r14706.	2007-05-21 14:30:10 +00:00
Ralph Castain	75d51812a3	Fix the app-failed-to-start capability that was broken by r14554 (holding the caller in rmgr.spawn until the application - as opposed to just the orteds - have started). Allow the rmgr.spawn function to return if the app terminates, correctly handling its return status code to show abnormal termination. Modify orterun to correctly handle the returned status code so it doesn't enter a conditioned wait if the app fails to start since it will never wakeup if it does. This commit was SVN r14693. The following SVN revision numbers were found above: r14554 --> open-mpi/ompi@4510b42638	2007-05-18 13:29:11 +00:00
Ralph Castain	7a57b694bb	Allow caller to get session directory name without anything else This commit was SVN r14472.	2007-04-23 18:25:36 +00:00
Ralph Castain	9cd85ef55a	Add a few more error constants that will help provide more definitive output to the user This commit was SVN r14471.	2007-04-23 18:25:03 +00:00
George Bosilca	f2a6b9394f	Deal with the include spree. Protect "environ" on Windows. Some others minors modifications in order to make it compile [again] on Windows. This commit was SVN r14188.	2007-04-01 16:16:54 +00:00
Tim Prins	9cb455272b	Fix a pile of memory leaks in ORTE. Fix a major memory leak in the SLURM RAS, and cleanup a bit of code there. This commit was SVN r14164.	2007-03-29 00:50:56 +00:00
Tim Prins	fe3ea0085f	Fix minor memory leaks This commit was SVN r13946.	2007-03-07 01:09:38 +00:00
Ralph Castain	8314e8dbb9	Modify the pernode option so it can accept a request for the number of processes to be launched. We now check three use-cases for pernode: 1. no -np provided - put one proc/node across all allocated nodes 2. -np N provided, N > #nodes - we print a pretty error message and exit 3. -np N provided, N <= #nodes - put one proc/node across N nodes I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error. This commit was SVN r12821.	2006-12-11 18:07:07 +00:00
Ralph Castain	a1153fdc8f	Eliminate virtually all of the attribute_predefined data from the STG1 message. We now compute the total number of slots allocated to us and save that in the registry - the attributed_predefined then retrieves it via the STG1 message. The app_num is passed via the process_info structure, which gets the value from the ODLS in the environment. Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available. This commit was SVN r12788.	2006-12-07 03:11:20 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
George Bosilca	6f28bcdc21	Remove the last set of compiler warnings from the precondition file. This commit was SVN r12753.	2006-12-04 21:45:57 +00:00
Brian Barrett	d64fa194f1	Instead of continually screwing around with different format strings to make this warning-proof, loop over the uint64_ts as an array of integers and use %x. The final string is just as random and formatted exactly the same, so we're all good in that department. Refs trac:655 This commit was SVN r12742. The following Trac tickets were found above: Ticket 655 --> https://svn.open-mpi.org/trac/ompi/ticket/655	2006-12-04 18:07:24 +00:00
Rainer Keller	d078bb3e8a	- Revert changes and include pointers to discussion. This commit was SVN r12736.	2006-12-03 17:05:15 +00:00
Rainer Keller	e61dd8722e	- Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx - No functional changes, just indentation and corrections to error output. This commit was SVN r12734.	2006-12-03 13:59:23 +00:00
Ralph Castain	194bdd413b	Cleanup the problem of connecting to default universes. This commit was SVN r12438.	2006-11-06 15:28:38 +00:00
George Bosilca	50649dd6a9	What we write it's a long long so we should be using the long long format. This commit was SVN r12157.	2006-10-18 00:02:20 +00:00

1 2 3 4

195 Коммитов