openmpi

Author	SHA1	Message	Date
Ralph Castain	955d117f5e	Add a new grpcomm module that mimics the old 1.2 behavior - it -always- does a modex because it always includes the architecture. Hence, we called it "blind-and-dumb" since it doesn't look to see if this is required - moniker of "bad". :-) Update the ESS API so we can update the stored arch's should the modex include that info. Update ompi/proc to check/set the arch for remote procs, and add that function call to mpi_init right after the modex is done. Setup to allow other grpcomm modules to decide whether or not to add the arch to the modex, and to detect if other entries have been made. If not, then the modex can just fall through. Begin setting up some logic in the "basic" module to handle different arch situations. For now, default to the "bad" module so we will work in all situations, even though we may be sending around more info than we really require. This fixes ticket #1340 This commit was SVN r18673.	2008-06-18 22:17:53 +00:00
Ralph Castain	282a220e7e	Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation This commit was SVN r18670.	2008-06-18 15:28:46 +00:00
George Bosilca	8e7c35e76c	These symbols are only available via the module/component structure, so they don't have to be globally visible. This commit was SVN r18666.	2008-06-18 08:20:02 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Ralph Castain	a87aa442e3	Remove last remaining reference to iof_flush - it was #if'd out anyway. The existing flush code appears to have several critical problems. Given the impending rework of the IOF subsystem, there is no point in trying to fix it here. This commit was SVN r18649.	2008-06-11 16:25:46 +00:00
Ralph Castain	f9d809748c	Glad someone found that last error - caused me to review the code and find a couple of other cleanups! Nothing major, but just ensure that things flow smoothly since we had a "shadowed" variable. This commit was SVN r18643.	2008-06-10 19:15:59 +00:00
Camille Coti	67cd1849f7	*map was still NULL in the else statement, causing a segmentation fault when a field of the structure was accessed. This commit was SVN r18642.	2008-06-10 19:00:57 +00:00
Ralph Castain	1a422995ae	Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak. This commit was SVN r18641.	2008-06-10 17:53:28 +00:00
Brian Barrett	4127bd0dcc	fix two other mistakes in the cnos ess This commit was SVN r18632.	2008-06-09 22:28:26 +00:00
George Bosilca	f72ab90b16	Allow xgrid to compile again. This commit was SVN r18631.	2008-06-09 21:51:41 +00:00
Brian Barrett	11cd3a7cba	Fix problem where local rank always had different architecture than remote ranks on Red Storm This commit was SVN r18630.	2008-06-09 21:46:03 +00:00
Ralph Castain	c13cadc3c7	Refs trac:1255 This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something. So the ticket can be checked by Jeff when he returns from vacation... :-) This commit was SVN r18625. The following Trac tickets were found above: Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255	2008-06-09 20:34:14 +00:00
Ralph Castain	bf5c34d10a	The rsh launcher is one place where multi-word MCA params would have to be passed via the orted cmd line. In such a case, we have to explicitly include quote marks about the param value. Add that capability here. This commit fixes trac:1200 This commit was SVN r18621. The following Trac tickets were found above: Ticket 1200 --> https://svn.open-mpi.org/trac/ompi/ticket/1200	2008-06-09 19:07:19 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Pak Lui	caac0e0182	Add in a couple missing ones from r18611 for all tm users out there... This commit was SVN r18615. The following SVN revision numbers were found above: r18611 --> open-mpi/ompi@7bee71aa59	2008-06-06 22:53:43 +00:00
Ralph Castain	b65eb54ea2	Cut out a new iof pull - that capability isn't ready yet for the trunk, but will be coming shortly Thanks to Pak for letting me know... This commit was SVN r18614.	2008-06-06 21:24:15 +00:00
Pak Lui	7f7777a538	Check for NULL in prefix_dir. This commit fixes trac:1337. This commit was SVN r18612. The following Trac tickets were found above: Ticket 1337 --> https://svn.open-mpi.org/trac/ompi/ticket/1337	2008-06-06 19:55:01 +00:00
Ralph Castain	7bee71aa59	Fix a potential, albeit perhaps esoteric, race condition that can occur for fast HNP's, slow orteds, and fast apps. Under those conditions, it is possible for the orted to be caught in its original send of contact info back to the HNP, and thus for the progress stack never to recover back to a high level. In those circumstances, the orted can "hang" when trying to exit. Add a new function to opal_progress that tells us our recursion depth to support that solution. Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death... Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang. After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them. Thank you, Jeff! :-/ This commit was SVN r18611.	2008-06-06 19:36:27 +00:00
Josh Hursey	1de50b523c	Fix some Coverity 'Event set_but_not_used' highlights. Thanks to Jeff for bringing them to my attention. This commit was SVN r18606.	2008-06-06 14:38:41 +00:00
Jeff Squyres	d3795d7a34	Fix CID 987: remove unused variable. This commit was SVN r18598.	2008-06-05 20:17:02 +00:00
Ralph Castain	332e6c89ab	Modify the slurm launcher so that the kill-on-bad-exit behavior is not "on" by default. Instead, only turn it "on" if the plm_slurm_detect_failure mca param is set to something non-zero This commit was SVN r18588.	2008-06-04 23:59:53 +00:00
Ralph Castain	0da811ce79	Initial work on xml support - allocation and job map outputs completed. More to come. This commit was SVN r18587.	2008-06-04 20:53:12 +00:00
George Bosilca	25ae9c12e6	Silence few warnings. This commit was SVN r18568.	2008-06-03 19:58:40 +00:00
George Bosilca	fa89d299bf	Silence the Obj-C compiler. This commit was SVN r18567.	2008-06-03 19:24:17 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	95578b0528	Fix single-node operations so that the HNP correctly exits when the job completes This commit was SVN r18556.	2008-06-03 14:23:04 +00:00
Ralph Castain	b456fb2d42	Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2*heartrate, declares orted failure if not seen in last 2*heartrate time. Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher. Neither detection is enabled by default, but are only active if heartrate is set and/or launch timeout is set. Exception for SLURM as orted failure is always detected and reported. More info to come on devel list. This commit was SVN r18555.	2008-06-02 21:46:34 +00:00
Shiqing Fan	af656b2b3d	Fix some typing mistakes, make the sources compile again for Windows Visual Studio. This commit was SVN r18542.	2008-05-29 15:27:43 +00:00
Ralph Castain	2b28bef15a	Provide a "nicer" indication that we don't know the pid of the failed orted This commit was SVN r18538.	2008-05-29 14:10:58 +00:00
Ralph Castain	72530f8fed	Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed: 1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done. 2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear. This represents the best-available fix for ticket #221 so I am closing that ticket at this time. This commit was SVN r18536.	2008-05-29 13:38:27 +00:00
Ralph Castain	52fb773c6c	Tell slurm to kill the job if an orted abnormally exits This commit was SVN r18535.	2008-05-29 12:26:58 +00:00
Ralph Castain	e5e542ddcf	Clarify an error message This commit was SVN r18533.	2008-05-29 12:20:24 +00:00
Josh Hursey	4ac7016200	Make sure to check "opal_list_get_last" instead of "opal_list_get_end". The former will return a valid item in the list, the latter will return an invalid item that marks the end of the list. It was happening that when oversubscribing by way of an appfile we would cause a segv because we tried to interpret the invalid item returned by "opal_list_get_end" instead of a valid item. We would then try to write to unallocated memory. This commit fixes trac:1279 This commit was SVN r18529. The following Trac tickets were found above: Ticket 1279 --> https://svn.open-mpi.org/trac/ompi/ticket/1279	2008-05-28 19:37:20 +00:00
Ralph Castain	f76240e7cc	Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algo's, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch. This commit was SVN r18528.	2008-05-28 18:38:47 +00:00
George Bosilca	1eb1742225	Remove this left over dependency. This commit was SVN r18508.	2008-05-27 16:57:40 +00:00
Ralph Castain	93d932aa0c	Ensure that the display-map and display-allocation outputs get processed through the new OPAL filter framework by passing them through orte_output instead of using the opal_dss.dump function. This commit was SVN r18507.	2008-05-27 15:46:21 +00:00
Ralph Castain	0b2b655de5	Initialize a variable so it can correctly be dealt with at shutdown - fixes trac:1312 This commit was SVN r18505. The following Trac tickets were found above: Ticket 1312 --> https://svn.open-mpi.org/trac/ompi/ticket/1312	2008-05-27 14:53:24 +00:00
Pak Lui	695c158192	silence some intel and pgcc compiler warnings. This commit was SVN r18501.	2008-05-26 20:35:13 +00:00
Pak Lui	7b3d7dcac4	This commit closes trac:1300. This commit was SVN r18473. The following Trac tickets were found above: Ticket 1300 --> https://svn.open-mpi.org/trac/ompi/ticket/1300	2008-05-21 22:35:04 +00:00
Josh Hursey	7e8cd20a0a	a fix for C/R support This commit was SVN r18438.	2008-05-14 16:57:37 +00:00
Jeff Squyres	671f0c379d	Remove a whole pile of orte/util/show_help.h's that I missed. :-( This commit was SVN r18437.	2008-05-14 11:32:33 +00:00
Pak Lui	4c8d79d907	Silence the compiler warnings/errors. There is no orte/util/show_help.h This commit was SVN r18436.	2008-05-13 22:07:38 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not their opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active.
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Shiqing Fan	7ff440f628	Add quotation marks for windows path. This commit was SVN r18420.	2008-05-09 14:12:09 +00:00
Josh Hursey	da2f1c58e2	Some checkpoint/restart cleanup. * Remove the opal_only option. This was suffering from bit rot, and no one uses it. It can be added back fairly easily if wanted. * Cleanup metadata interactions at the local level. * Touch up some of the INC functionality (fix typos and a minor ordering issue) This commit was SVN r18416.	2008-05-08 18:47:47 +00:00
Ralph Castain	64ef4102c4	Add the topo mapper module - requires some work in carto for completion. Little cleanup in round-robin mapper. This commit was SVN r18412.	2008-05-08 05:09:13 +00:00
Ralph Castain	ac5263613c	Fix stupid singletons yet again This commit was SVN r18408.	2008-05-07 20:26:31 +00:00
George Bosilca	dbea3e070e	Correct some copy/paste errors. This commit was SVN r18396.	2008-05-07 04:04:42 +00:00
Ralph Castain	ff70636024	Allgather_list needs its own tag to avoid conflicting with the allgather modex operation. All spawned procs must decode the port of the spawning process so they can communicate in direct routed mode. This fixes comm_spawn for all routing modes. This commit was SVN r18395.	2008-05-07 03:03:56 +00:00
Josh Hursey	bc67f40936	whoops typo This commit was SVN r18390.	2008-05-06 22:00:24 +00:00
Josh Hursey	50c909a23d	Fix a bit of selection logic. Filem should not fail select if the user decided not to build with any filem components. This matches the logic before the mca_base_select() change. This commit was SVN r18389.	2008-05-06 21:57:45 +00:00
Pak Lui	108921c020	typo This commit was SVN r18387.	2008-05-06 21:37:35 +00:00
Pak Lui	0302c098be	minor typo This commit was SVN r18386.	2008-05-06 21:26:17 +00:00
Ralph Castain	d97a4f880d	Shift the daemon collective operation to the ODLS framework. Ensure we track the collectives per job to avoid race conditions. Take advantage of the new capabilities of the routed framework to define aggregating trees for the daemon collective, and to track which daemons are participating to handle the case of sparse participation. Make it all work with comm_spawn in the case of all procs on previously occupied nodes, some new procs on new nodes, and mixtures of the two. Note: comm_spawn now works with both binomial and linear routed modules. There remains a problem of spawned procs not properly getting updated contact info for the parent proc when run in the direct routed mode...but that's for another day. This commit was SVN r18385.	2008-05-06 20:16:17 +00:00
Josh Hursey	c47406810e	Fix AMCA orted command line. If no AMCA parameters are passed then do not send across the path information. Only place it on the command line if the AMCA parameter is set. This commit was SVN r18382.	2008-05-06 18:27:31 +00:00
Josh Hursey	9971bc9d95	Merge in the mca_base_select changes per RFC: http://www.open-mpi.org/community/lists/devel/2008/04/3779.php {{{ svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play . }}} Any components not in the trunk, but in one of the effected frameworks must be updated. Contact the list, look at the RFC, or look at the diff for how to do this. Sorry for the early commit of this, but I wanted to get it in today (per RFC) and didn't know if I would have a chance later today. This commit was SVN r18381.	2008-05-06 18:08:45 +00:00
Ralph Castain	40904dd152	Add a binomial routed module - for now, still completely wires up the daemons, but that will be changed later. Modify grpcomm xcast so it now uses the selected routed module - eliminates cross-wiring of xcast and routing paths. Suboptimal at the moment, but better implementation is on its way. Cleanup ignore properties on the new routed components. This commit was SVN r18377.	2008-05-05 22:32:25 +00:00
Aurelien Bouteiller	5ba62469a0	Add a route_is_defined implementation for the linear oob routing. This commit was SVN r18375.	2008-05-05 19:12:41 +00:00
Aurelien Bouteiller	2ae30fe126	Implementation of the route_is_defined stub for direct oob routing. This commit was SVN r18373.	2008-05-05 18:23:26 +00:00
Ralph Castain	b8bb990acf	Rename the routed modules to more accurately reflect what they do and the role they will play in soon-to-come updates. Add two new API's to the routed framework - stub them out so that collaborators can work on them in various components without conflicts. Remove a "finalize" from the select function that could cause problems as the component had not had its initialize called yet. This commit was SVN r18369.	2008-05-05 02:59:09 +00:00
Ralph Castain	519c15f8af	Fix direct and linear xcast modes This commit was SVN r18359.	2008-05-02 14:30:07 +00:00
Ralph Castain	8e846bf7f2	Separate the gathering of collective data by jobid This commit was SVN r18357.	2008-05-02 12:00:08 +00:00
Ralph Castain	432d441b3e	Cleanup a bug found by Josh that caused multiple app_contexts to keep mapping onto the first node in an allocation Continue work on loadbalancing Cleanup code organization in rmaps_base This commit was SVN r18353.	2008-05-01 21:07:49 +00:00
Ralph Castain	b2c73f6e11	Fix tree-spawn to work within the new modex system This commit was SVN r18349.	2008-05-01 19:19:34 +00:00
Josh Hursey	dcd21d7d07	Some checkpoint/restart fixes in response to r18338 (changes in modex). Things should be working now. This commit was SVN r18348. The following SVN revision numbers were found above: r18338 --> open-mpi/ompi@3e55fe6f6d	2008-05-01 17:48:13 +00:00
Ralph Castain	ad894b050b	Set the bookmark so the first process of a comm_spawn'd job will be mapped to the same node as the spawning proc, assuming it has space. If not, then the mapper will automatically move to the next node. This commit was SVN r18346.	2008-05-01 15:24:03 +00:00
Ralph Castain	1766442591	Fix a double-free when tree-spawning Fix the round-robin mapper so it doesn't move to the next node just because it completed mapping an app_context This commit was SVN r18344.	2008-05-01 14:49:56 +00:00
Ralph Castain	3e55fe6f6d	Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit. Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs. Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node. This commit was SVN r18338.	2008-04-30 19:49:53 +00:00
Ralph Castain	4c2c6c9bd8	Ensure the pack/unpacks match for tree-spawn This commit was SVN r18282.	2008-04-24 18:53:08 +00:00
Ralph Castain	09b6758f8c	Pass the prefix dir to the remote orted when doing tree-based spawns This commit was SVN r18280.	2008-04-24 18:38:24 +00:00
Josh Hursey	2c736873bb	Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors. The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge. The fix is to have the OMPI level shutdown tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit. Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it. * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level. * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components. * Update ft_event functions in PML and BML to handle the new restart state. * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging. This commit was SVN r18276.	2008-04-24 17:54:22 +00:00
Ralph Castain	eece9f88f0	Fix a bug in the way we computed local_rank. This needs to be the local_rank -among my job peers- on a node. We were mistakenly computing the local_rank across -all- jobs with procs on that node. While the two definitions are equivalent for an initial launch, comm_spawn'd procs would get the wrong local_rank. In particular, there would not be a local_rank=0 proc in the comm_spawn'd job on any node that was shared with the initial job. This commit was SVN r18263.	2008-04-23 17:42:59 +00:00
Ralph Castain	f56f06a7ff	Do not trust the RM's names - apparently, RR has trained it to lie! Default to using the name we got from gethostname as it is the only one we can trust. This commit was SVN r18259.	2008-04-23 17:00:35 +00:00
Ralph Castain	8001e4e99c	See if this will fix a race condition showing up in comm_spawn MTT testing This commit was SVN r18257.	2008-04-23 15:43:44 +00:00
Ralph Castain	5311b13b60	Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system. This commit was SVN r18252.	2008-04-23 14:52:09 +00:00
Lenny Verkhovsky	456ce6c4da	Few cleanups in Rank_File component + fixed opal_paffinity_slot_list without rankfile This commit was SVN r18249.	2008-04-23 13:34:05 +00:00
Shiqing Fan	eb5f5d77cc	If it's not the HNP, release the cluster object first and return. This commit was SVN r18247.	2008-04-23 13:21:32 +00:00
Josh Hursey	750ce0152c	After a bit of testing this morning it seems that the tree component is able to work correctly with the checkpoint/restart functionality. So enable this component when C/R is enabled. This commit was SVN r18246.	2008-04-23 13:01:23 +00:00
Josh Hursey	cc83d41ad9	Merge in tmp/jjh-scratch {{{ svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch . }}} Contains: * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart. * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P. * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry * Some other sundry cleanup items all dealing with C/R functionality in the trunk. This commit was SVN r18241.	2008-04-23 00:17:12 +00:00
Ralph Castain	c3ddf66445	Move the display-allocation code to where it is always seen This commit was SVN r18227.	2008-04-21 20:28:59 +00:00
Ralph Castain	16c9100633	Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation. This commit was SVN r18216.	2008-04-20 02:25:45 +00:00
Ralph Castain	07f0a71faa	Cleanup the show_help entries on the seq mapper This commit was SVN r18191.	2008-04-17 14:43:15 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Ralph Castain	66e532669a	Remove some dead code This commit was SVN r18182.	2008-04-16 20:33:53 +00:00
Ralph Castain	3413191e52	Fix singleton and singleton comm_spawn This commit was SVN r18177.	2008-04-16 14:38:10 +00:00
Ralph Castain	7b91f8baff	Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory. Fix the ompi-server -h cmd line option so it actually tells you something! Add two new testing codes to the orte/test/mpi area: accept and connect. This commit was SVN r18176.	2008-04-16 14:27:42 +00:00
Adrian Knoth	84e4013530	Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set. This commit was SVN r18164.	2008-04-16 09:31:15 +00:00
Adrian Knoth	0ddfff4ffe	Added new oob-tcp parameter oob_tcp_disable_family. Like btl_tcp_disable_family, this parameter more or less disables a whole address family. Though the sockets are still created, the corresponding information isn't added to the connection strings. Likewise, we don't try to connect to addresses matching the disabled address family. This is particularly important for multidomain clusters, where IPv4 is often filtered (firewalled), sometimes by simply dropping the packets instead of rejecting them (thus causing a connection timeout instead of a quick "no route to host"). This commit was SVN r18163.	2008-04-16 09:22:00 +00:00
Ralph Castain	a4ea756a76	Ensure the node loop cntr gets incremented if the daemon already exists This commit was SVN r18150.	2008-04-15 14:20:03 +00:00
Ralph Castain	35c260a14f	Fix the plm modules to accommodate the new remote_spawn entry - set that entry to NULL for all but rsh as only that module supports it at this time This commit was SVN r18145.	2008-04-14 19:36:13 +00:00
Ralph Castain	84156c422f	Egad! Typo snuck in there...nasty vi! This commit was SVN r18144.	2008-04-14 18:29:11 +00:00
Ralph Castain	7c7304466c	Add a binomial tree-based launch to ssh, turned "on" only when the plm_rsh_tree_spawned mca param is set to a non-zero value. This probably isn't a very optimized capability, but it does execute a tree-based launch that may scale better than linear at high node counts. Add the daemon map capability to the ODLS to create and save a map of daemon vpid vs nodename from the launch message. Cleanup a few places in the base plm launch support where we didn't adequately protect rml recv's from potentially executing sends. This commit was SVN r18143.	2008-04-14 18:26:08 +00:00
Ralph Castain	e050f37578	Cleanup a few warnings about initializing variables. Remove an obsolete data value. This commit was SVN r18129.	2008-04-10 19:15:16 +00:00
Ralph Castain	851279fc9f	Consolidate the daemon wireup message into the launch message. The daemons don't need their contact info prior to the launch message anyway. This not only eliminates a job-wide communication from the startup procedure, but it also resolves a race condition reported when operating across highly distributed (i.e., cross-country) networks. In such scenarios, it proved possible for a daemon to receive its launch message -before- it had received the contact info message, even though the latter had been sent first! This eliminates that problem... This commit was SVN r18126.	2008-04-10 15:35:11 +00:00
Ralph Castain	57e3e86cda	Use the proper exit code for mpirun to indicate an error when something goes wrong during launch (in scenarios where the procs don't report the problem directly themselves) This commit was SVN r18121.	2008-04-10 09:15:08 +00:00
Ralph Castain	e7d0dae89d	Ensure we update the daemon collective trees if num_procs changes, but only if it changes This commit was SVN r18120.	2008-04-10 03:44:18 +00:00
Ralph Castain	22343e6e0b	Given total lack of interest/support from the folks behind these environments, and the fact that we can now scale so well with our own daemons, it seems unlikely that we will be able to pursue direct and/or standalone launch in these environments. If that situation ever changes, it is easy enough to revive the effort since little had really been done to-date. Meantime, no reason to continue dragging these around. This commit was SVN r18119.	2008-04-10 02:54:13 +00:00
Ralph Castain	dc2f88b9f0	Now that we have the daemon collectives, the unity routed module no longer needs the "hack" we inserted a week ago to tell the daemons how to talk directly to all the application procs. The modex and barrier messages flow cleanly across the daemons and are "dropped" into the procs where required. Add some insurance to make certain that the daemons' number of procs only gets updated when it absolutely is intended. This commit was SVN r18118.	2008-04-10 02:45:42 +00:00
Ralph Castain	0b3122ee2f	Update the cnos module - should (hopefully) compile and work... This commit was SVN r18117.	2008-04-09 22:33:00 +00:00
Ralph Castain	3a0d09300b	Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations. Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study. This commit was SVN r18115.	2008-04-09 22:10:53 +00:00
Ralph Castain	11c6773c83	Commit a patch from Brian that fixes potential segfaults in systems where IPv6 include files are found, but the kernel doesn't actually support IPv6. This commit was SVN r18106.	2008-04-09 12:53:24 +00:00
Lenny Verkhovsky	2be4e32c79	1. Fixing Possible strdup of NULL 2. Fixing num_alloc when combined mapping policies ( rankfile & byslot or bynode ) This commit was SVN r18073.	2008-04-02 14:12:38 +00:00
Ralph Castain	f115b4aed2	Checkpoint the revised gather algorithm This commit was SVN r18072.	2008-04-02 13:35:06 +00:00
Adrian Knoth	a56b9b1df1	Fix broken build with --disable-ipv6. This commit was SVN r18071.	2008-04-02 10:53:48 +00:00
Ralph Castain	50433bf833	Turn off the new fqdn behavior pending resolution of hostfile issue This commit was SVN r18064.	2008-04-01 20:52:22 +00:00
Ralph Castain	51533c9340	Add a new mapper component that sequentially maps ranks-to-hosts according to the ordering in the hostfile. Not functional yet - still under development. Just placeholding for now to clear a backlog This commit was SVN r18062.	2008-04-01 20:03:49 +00:00
Ralph Castain	ee5b96269e	The RML is comfortable with zero-byte payloads, so don't pack something we don't need This commit was SVN r18061.	2008-04-01 19:24:46 +00:00
Ralph Castain	3a4c10efd6	Delete obsolete file, cleanup obsolete cruft in another file This commit was SVN r18060.	2008-04-01 18:36:23 +00:00
Ralph Castain	39c2680e9a	Silence warning This commit was SVN r18057.	2008-04-01 13:42:16 +00:00
Ralph Castain	524ed5d515	Don't have singletons wireup the iof. Instead, we let the fork'd orted handle io forwarding. This prevents an issue with the event library and pty's on singletons This commit was SVN r18056.	2008-04-01 12:40:00 +00:00
Ralph Castain	3e8846d685	Some code cleanups from Brian to clarify port selection and opening logic This commit was SVN r18055.	2008-04-01 12:39:02 +00:00
Ralph Castain	fe88956080	Fix singleton modex - ensure singletons know that a daemon is now in the system This commit was SVN r18047.	2008-03-31 20:36:27 +00:00
Ralph Castain	f3936ff9bc	Record the daemon's state so that we don't attempt to send "die" messages to a daemon that is known to have failed to start. This commit was SVN r18044.	2008-03-31 18:15:24 +00:00
George Bosilca	ee784b601e	For consistency reasons always use opal_home_directory and opal_tmp_directory. This commit was SVN r18043.	2008-03-31 18:13:41 +00:00
Ralph Castain	d8eb0eeec3	Correct the debug output This commit was SVN r18042.	2008-03-31 18:09:37 +00:00
Ralph Castain	2b399a3563	Suppress a warning message - relegate it to only show up when verbosity is set as it is okay for this condition to be true This commit was SVN r18041.	2008-03-31 17:48:07 +00:00
Ralph Castain	f327ebce31	Get the jobid correct - doh! This commit was SVN r18040.	2008-03-31 17:42:50 +00:00
Ralph Castain	e396b9ee9a	Fix unity routed component by adding xcast of proc data to the daemons. This enables daemons to complete the revised modex procedure by forwarding their collected modex info to the rank=0 proc. This commit was SVN r18039.	2008-03-31 17:35:29 +00:00
George Bosilca	493677426d	Use the OPAL function to retrieve the HOME and TMP environment values. This commit was SVN r18037.	2008-03-31 17:10:08 +00:00
Ralph Castain	379b8a3e2f	Fix singleton operations that have no data in the modex. Note: this also allows -any- modex operation to have zero data in it, not just singletons. This commit was SVN r18034.	2008-03-31 13:53:23 +00:00
Ralph Castain	1889bbd119	Quiet some warnings about uninitialized variables This commit was SVN r18032.	2008-03-31 13:52:10 +00:00
Ralph Castain	8506be755d	Clean-up the mess. Repair static builds. Remove unused and empty C-decl braces. Add missing prototype for function. This commit was SVN r18031.	2008-03-31 13:02:33 +00:00
Ralph Castain	81a83dabc6	Setup sandbox for testing new orte collectives This commit was SVN r18026.	2008-03-31 04:21:37 +00:00
George Bosilca	594884b613	The return is an int not a pointer. This commit was SVN r18024.	2008-03-30 19:06:25 +00:00
George Bosilca	a6d5c15249	There is no need to force opal_progress down there. It will get called a few steps further up. This commit was SVN r18022.	2008-03-30 19:05:09 +00:00
Lenny Verkhovsky	7e45d7e134	Few updates due to RMAPS rank_file component changes 1. applied prefix rule to functions and variables of RMAPS rank_file component 2. cleaned ompi_mpi_init.c from paffinity code 3. paffinity code moved to new opal/mca/paffinity/base/paffinity_base_service.c file 4. added opal_paffinity_slot_list mca parameter This commit was SVN r18019.	2008-03-30 11:52:11 +00:00
Lenny Verkhovsky	cb83a1287d	Really deleted old files now This commit was SVN r18018.	2008-03-30 11:50:19 +00:00
Lenny Verkhovsky	f734ba51a4	Added files with names according to prefix rule This commit was SVN r18017.	2008-03-30 11:42:09 +00:00
Lenny Verkhovsky	b43f4a2dc9	Deleted and added files after prefix rule changes This commit was SVN r18016.	2008-03-30 11:41:01 +00:00
Ralph Castain	9f1001a6f8	Ensure that the procs know how many daemons will be participating in collective operations. This commit was SVN r17992.	2008-03-27 17:31:54 +00:00
Ralph Castain	6166278e18	Improve the scalability of the modex operation and fix a bug reported by Tim P. The bug was a race condition in the barrier operation that caused the barrier in MPI_Finalize to fail on very short programs. Scalability was improved by using the daemons to aggregate modex and barrier messages before sending them to the rank=0 proc. Improvement is proportional to ppn, of course, but there really wasn't a scaling problem at low ppn anyway. This modification also paves the way for better allgather operations since now all the data for each node is sitting at the daemon level, and the daemons are now aware that a collective operation on the OOB is underway (so they -can- participate in a collective of their own to support it). Also added better diagnostics to map out the timing associated with MPI_Init - turned on by -mca orte_timing 1. This commit was SVN r17988.	2008-03-27 15:17:53 +00:00
Ralph Castain	8e6da2ee76	Maintain the mapping bookmark across multiple comm_spawns This commit was SVN r17984.	2008-03-27 00:19:13 +00:00
Ralph Castain	abfb3577c1	Ensure that the bookmark of the parent job is applied to the child in a comm_spawn so we start mapping from the right place This commit was SVN r17982.	2008-03-26 21:18:16 +00:00
Ralph Castain	7ad6db207c	Cover some timing-related output This commit was SVN r17977.	2008-03-26 12:54:50 +00:00
Rainer Keller	ce8154eb3e	- Coverity issues CID 945: Event uninit_use: Using uninitialized value "rc" Instead of initializing rc in the beginning, rather use return value of opal_hash_table_set_value_uint32. This commit was SVN r17976.	2008-03-26 11:39:25 +00:00
Brad Benton	0b84dfd2a6	POE is not currently working or supported, so removing from the trunk. This commit was SVN r17970.	2008-03-26 02:06:40 +00:00
Ralph Castain	60d931217f	Modify the routed framework to allow greater control/flexibility over response to lost routes and initial wireup of jobs as required by several soon-to-come new modules. Specifically, add two new APIs: 1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job. Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections. 2. get_wireup_info: returns (in a provided buffer) info required to wireup connections for the specified job. Some routed modules do not need to return any info as they can wireup via alternative means, while some need to xchg data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job. The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module. This commit was SVN r17969.	2008-03-26 01:00:24 +00:00
George Bosilca	2ed6ed37bd	Don't forget to cleanup once we're done. This commit was SVN r17965.	2008-03-25 22:42:24 +00:00
George Bosilca	ac6121bd1c	Remove unused variable. This commit was SVN r17964.	2008-03-25 22:41:50 +00:00
Jeff Squyres	183fcdf51b	Remove duplicate free(), fixing CID 973. This commit was SVN r17959.	2008-03-25 20:30:56 +00:00
Ralph Castain	90107f3c14	Fix an issue with comm_spawn over who sent/recv first in the modex. The modex assumes that the first name on the list is the "root" that will serve as the allgather collector/distributor. The dpm was putting that entity last, which forced us to pre-inform the parent procs of the child proc's contact info since the parent was trying to send to the child. Clarify the setting of send_first in the mpi bindings (trivial, i know, but helpful) Remove the extra xcast of child contact info to the parent job. This commit was SVN r17952.	2008-03-25 14:57:34 +00:00
Ralph Castain	cca449e379	Move an OMPI RML tag to the OMPI layer This commit was SVN r17950.	2008-03-25 13:30:48 +00:00
Ralph Castain	4efddc7b0a	Fix the allgather and allgather_list functions to avoid deadlocks at large node/proc counts. Violated the RML rules here - we received the allgather buffer and then did an xcast, which causes a send to go out, and is then subsequently received by the sender. This fix breaks that pattern by forcing the recv to complete outside of the function itself - thus, the allgather and allgather_list always complete their recvs before returning or sending. Reorganize the grpcomm code a little to provide support for soon-to-come new grpcomm components. The revised organization puts what will be common code elements in the base to avoid duplication, while allowing components that don't need those functions to ignore them. This commit was SVN r17941.	2008-03-24 20:50:31 +00:00
Ralph Castain	58d51f2689	Revert that! Need to complete the rest of the change so the orted knows the correct nodeid... Sorry This commit was SVN r17939.	2008-03-24 18:17:26 +00:00
Ralph Castain	dae4518878	Use the correct nodeid! This commit was SVN r17938.	2008-03-24 18:15:08 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Ralph Castain	f8642e9390	Add debug to tell us when we opened a socket and to whom This commit was SVN r17911.	2008-03-21 15:47:47 +00:00
Ralph Castain	19ffdfef42	Add some debugging output to tell us what interfaces were considered and used by OOB This commit was SVN r17909.	2008-03-21 15:35:40 +00:00
Ralph Castain	c2fd5dd416	Clarify method used to translate application proc termination codes to exit status codes This commit was SVN r17899.	2008-03-20 18:50:05 +00:00
Brian Barrett	2bf4784893	Set a meaningful orte_system_info.nodeid on Catamount This commit was SVN r17898.	2008-03-20 16:55:57 +00:00
Ralph Castain	f8a10dfb93	Complete the fix of the orted vs mpirun race condition for finalizing. The darned mpirun is just too fast! Rather than try to slow it down, we set the orte_finalizing flag -prior- to telling mpirun the orted is leaving. This ensures we don't mistakenly declare the lifeline lost when mpirun leaves in a hurry. This commit was SVN r17897.	2008-03-20 16:55:24 +00:00
Ralph Castain	6bb139e4f2	One more correction to mpirun exit codes - cleanup the application proc's exit codes in the orted so that non-zero exit codes generated by mpirun itself don't get "munged". Modify the multi_abort function so they all return different exit codes - allows us to tell which one was being reported. This commit was SVN r17895.	2008-03-20 13:54:11 +00:00
Ralph Castain	27a73ad9ee	Fix a race condition between the orteds and HNP that can cause the orteds to output the "lost lifeline" message. This has been a long-time problem. I tried to reduce the problem by having the orteds tell the HNP they were finalizing, and having the HNP wait until all orteds had reported or we timed out. What was observed was that all the orteds were correctly reporting that they are leaving, but the HNP is able to exit before the orteds, thus closing the orteds lifeline socket and generating the error output. This is caused by the fact that the orteds have to whack all remaining session directories, which includes that blasted monster shared memory file! Cleaning up the SM file can take quite a while. The HNP doesn't have that problem as there is no SM file there! So it gets out first. What we had done in the past to resolve that problem was put a little test in the OOB that checks to see if we are finalizing. If we are, then we ignore the lifeline connection being lost. That check was still in the code - however, we had lost the line in orte_finalize that set the flag!! This commit was SVN r17893.	2008-03-20 13:30:51 +00:00
Ralph Castain	8ee26a55ca	Just turn these off for now - will revisit later This commit was SVN r17891.	2008-03-20 13:25:35 +00:00
Ralph Castain	67a2cc8a8e	Fix a bug noted by Tim P where we would report the incorrect app_context as "not found". If you gave us the command line: mpirun -n 1 hostname : -n 1 bogus we would erroneously report that hostname had not been found instead of bogus. This commit was SVN r17886.	2008-03-19 21:13:13 +00:00
Ralph Castain	ec64bf3da8	Clarify the error output so we can understand if it was a daemon or process that lost its lifeline This commit was SVN r17880.	2008-03-19 19:06:52 +00:00
Ralph Castain	2ed0e60321	Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize. Jeff is preparing an (undoubtedly lengthy) explanation/matrix of how these codes are determined for the OMPI FAQ. This commit was SVN r17879.	2008-03-19 19:00:51 +00:00
Galen Shipman	80ac7c87cd	don't forget command file.. This commit was SVN r17878.	2008-03-19 16:24:29 +00:00
Galen Shipman	77c8532cc9	do things in a less hacky way.. This commit was SVN r17877.	2008-03-19 16:23:56 +00:00
Jeff Squyres	ac2e329353	Oops! That should not have been removed... This commit was SVN r17865.	2008-03-18 14:42:30 +00:00
Jeff Squyres	bd92720d41	More fixes to make it compile and play nice on OS X. Still more fixes are required; sending mail to devel shortly... This commit was SVN r17864.	2008-03-18 14:38:52 +00:00
Ralph Castain	8f31a62600	Fix compilation errors so this will compile, remove unused variables This commit was SVN r17862.	2008-03-18 13:01:26 +00:00
Lenny Verkhovsky	647bce6d3e	Support for new RMAPS rank mapping component This commit was SVN r17860.	2008-03-18 09:39:07 +00:00
Lenny Verkhovsky	14c32f87d5	Added new RMAPS component for rank mapping This commit was SVN r17859.	2008-03-18 09:33:49 +00:00
Ralph Castain	8cd6142e6d	Add some debugging to the grpcomm module. Setting grpcomm_base_verbose = 1 will now give you a trace through the functions as they are called. Setting it to 2 or more will give you details on what each function is doing as it works through its procedure. This commit was SVN r17848.	2008-03-17 19:34:36 +00:00
Ralph Castain	629b95a2fe	Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway. So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked. Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion. This commit was SVN r17843.	2008-03-17 17:58:59 +00:00
Josh Hursey	aaff245271	A couple verbose additions. Poll the event engine while waiting for the named pipe. This commit was SVN r17787.	2008-03-07 21:10:14 +00:00
Galen Shipman	0fb6cf0916	make output use verbose macro.. This commit was SVN r17778.	2008-03-07 03:06:17 +00:00
Shiqing Fan	eb1dfaf4d5	Select the windows CCP component at runtime by testing if we are on Windows cluster. This commit was SVN r17776.	2008-03-07 01:31:53 +00:00
Ralph Castain	b110a247be	Fix comm_spawn (maybe). Comm_spawn was sticking during spawn_multiple because of a problem in the dpm - the modex there is asking processes to talk to each other in an allgather_list operation, but the procs don't have the required contact info to do so. The solution here was to ensure that all parent procs have full contact info for procs in the child job. Admittedly, this isn't the long-term answer. We would like to have the contact info given to only the parent procs that were involved in the comm_spawn. There is a way to do that, but this will suffice to keep things working until that can be implemented and tested. This commit was SVN r17772.	2008-03-06 21:56:00 +00:00
Ralph Castain	64d43cc44b	Fix the unity routed component and direct xcast mode. Ensure that direct xcast handles all its use-cases correctly. Unity routed component needs to use the base recv function to properly operate. This commit was SVN r17764.	2008-03-06 18:13:05 +00:00
Ralph Castain	ff99aa054f	In order to prevent orphaned processes when using non-unity routing methods, the procs need to realize that their local daemon is a critical connection - if that connection unexpectedly closes, they need to terminate. This commit adds definition for a "lifeline" connection. For an HNP, there is no lifeline, so the lifeline proc is NULL. For a daemon, the lifeline is the HNP - the daemon should abort if it loses that connection. For a proc using unity routed, the lifeline is the HNP since it connects directly to the HNP. For a proc using tree routed, the lifeline is the local daemon. Adjusted OOB to call abort if the lifeline (as opposed to HNP) connection is lost. This commit was SVN r17761.	2008-03-06 15:30:44 +00:00
Josh Hursey	0b4d9a12ce	a bit more verbosity for the fun of it This commit was SVN r17758.	2008-03-06 14:04:25 +00:00
Tim Prins	f61c2333c0	Remove unneeded field, and the two uses of it. This commit was SVN r17757.	2008-03-06 12:46:36 +00:00
Tim Prins	d56f19c77d	Fix logic error, and remove unneeded checks for invalid results. This commit was SVN r17756.	2008-03-06 04:38:13 +00:00
Ralph Castain	6d94e7b232	Fix the debug output so it correctly reports launch state This commit was SVN r17755.	2008-03-06 03:11:01 +00:00
Tim Prins	5de3e1965e	Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte. Everything should work, however I am unable to compile and test the sctp BTL. This commit was SVN r17751.	2008-03-05 22:44:35 +00:00
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124 The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Rolf vandeVaart	03fdd57d5a	Fix the use of --path and -x PATH so that things work properly. Note that --path specifies extra directories where the executable is searched for, but does not affect the PATH settings. This commit fixes trac:1221. This commit was SVN r17748. The following Trac tickets were found above: Ticket 1221 --> https://svn.open-mpi.org/trac/ompi/ticket/1221	2008-03-05 21:07:43 +00:00
Ralph Castain	4dbc352828	Per request, change name of new enviro var to OMPI_COMM_WORLD_LOCAL_SIZE This commit was SVN r17736.	2008-03-05 14:45:26 +00:00
Ralph Castain	06d3145fe4	First cut at direct launch for TM. Able to launch non-ORTE procs and detect their completion for a clean shutdown. This commit was SVN r17732.	2008-03-05 13:51:32 +00:00
George Bosilca	c71f225a28	These functions should only be compiled when OPAL_ENABLE_FT == 1. This commit was SVN r17727.	2008-03-05 05:57:13 +00:00
Josh Hursey	3b4073e32c	This commit fixes the checkpoint/restart functionality on the trunk. Included in this commit are: * Extension to the ESS framework to support C/R * Fixed support for {{{snapc_base_establish_global_snapshot_dir}}} * Fixed FileM support * Misc. minor code modifications There are some outstanding visibility issues that I want to fix next. This commit was SVN r17725.	2008-03-05 04:57:23 +00:00
Ralph Castain	edb8e32a7a	Add default hostfile parameter plus --default-hostfile command line option. Fix error message when job setup failed This commit was SVN r17724.	2008-03-05 04:54:57 +00:00
Ralph Castain	022fc1f382	Add another MPI-related enviro variable OMPI_COMM_WORLD_NUM_LOCAL_PROCS This commit was SVN r17723.	2008-03-05 04:53:32 +00:00
Ralph Castain	e745c16ff1	Modify the enviro variable names to be OMPI_... Add two new ones: OMPI_COMM_WORLD_LOCAL_RANK and OMPI_UNIVERSE_SIZE This commit was SVN r17694.	2008-03-04 20:16:05 +00:00
Shiqing Fan	ebf9c0441d	Set the windows components invisible. This commit was SVN r17687.	2008-03-04 17:37:17 +00:00
Shiqing Fan	ae41b5418b	Update the RAS and PLM components for Windows. These changes affect only Windows, not other platforms. This commit was SVN r17686.	2008-03-04 17:13:01 +00:00
Ralph Castain	ffa232687a	Fix xcast so it works in multi-node situations where the user specifies a particular mode to use (e.g., direct). This commit was SVN r17682.	2008-03-03 20:07:02 +00:00
Ralph Castain	841d0e5208	Cleanup an attribute warning - not sure which one to set or where it should go, so I'll leave that to someone more familiar with "attributes". Ensure some debugging is only enabled when have_debug is set. This commit was SVN r17681.	2008-03-03 16:06:47 +00:00
Rich Graham	d37db14901	get the shared memory collectives working again with the new version of orte. This commit was SVN r17672.	2008-02-29 22:28:57 +00:00
Ralph Castain	6450962d59	Add some debugging to the message event object. Cleanup some no-longer-used values This commit was SVN r17671.	2008-02-29 20:10:31 +00:00
Ralph Castain	a585923de1	Silence some minor compiler warnings This commit was SVN r17662.	2008-02-29 02:39:39 +00:00
Tim Prins	84b2099fe8	Remove the now-unused orte_value_array. As this is the last 'class' split between orte and ompi, remove the big comment about the split in ompi_bitmap. Also, update some properties (source files should not be executable...), and remove a couple unneeded inclusions of orte_proc_table.h This commit was SVN r17655.	2008-02-28 21:39:42 +00:00
Ralph Castain	5e6928d710	Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message. Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occur outside of a recv. Created two new macros to assist: ORTE_MESSAGE_EVENT: creates the zero-time event, passing info in a new orte_message_event_t object ORTE_PROGRESSED_WAIT: while waiting for specified conditions, just calls progress so messages can be recv'd. Also fixed the failed_launch function as we no longer block in the orted callback function. Updated the error messages to reflect revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output. This has been tested on Mac, TM, and SLURM. This commit was SVN r17647.	2008-02-28 19:58:32 +00:00
Ralph Castain	5dc64cea6a	Correct logic - only issue recv and cancel it if we are an HNP This commit was SVN r17641.	2008-02-28 15:27:16 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Gleb Natapov	da3e69101d	Add missing include. This commit was SVN r17493.	2008-02-18 14:55:02 +00:00
Galen Shipman	18d1d3b408	Add ORTE ALPS support (Cray XT CNL) This commit was SVN r17482.	2008-02-17 19:29:06 +00:00
George Bosilca	fcab6cc0bb	Fix typo. This commit was SVN r17255.	2008-01-26 21:36:04 +00:00
Rainer Keller	9d4852cdc1	- Get rid of Wshadow warnings. This commit was SVN r17231.	2008-01-25 14:07:38 +00:00
Pak Lui	413bcca4c0	Support the qrsh or qsub "-notify" option by catching the SIGUSR1/2 signals and not letting user processes exit on those signals. This commit was SVN r17174.	2008-01-22 17:32:29 +00:00
Josh Hursey	158dda5458	Fix some overlapping code. This commit was SVN r17067.	2008-01-08 15:40:21 +00:00
George Bosilca	eb71a634c6	Don't forget to initialize the msg_origin field. This commit was SVN r17055.	2008-01-04 23:24:49 +00:00
George Bosilca	48f5a26e8c	Cast to keep VC happy (quiet). This commit was SVN r17054.	2008-01-04 23:13:32 +00:00
Adrian Knoth	42d5fe62f9	Fixed misplaced #endif This commit was SVN r17028.	2008-01-01 11:02:38 +00:00
Jeff Squyres	213b5d5c6e	Per long threads on the mailing list and much confusion discussion about linkers, have all OPAL, ORTE, and OMPI components '''not''' link against the OPAL, ORTE, or OMPI libraries. See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a better-formatted version of the same info). This commit was SVN r16968.	2007-12-15 13:32:02 +00:00
Josh Hursey	f7812baf5b	forgot a bit of error checking in the last commit This commit was SVN r16953.	2007-12-13 14:41:18 +00:00
Josh Hursey	a287c9cb65	This commit distinguishes the file transfer stage from the finish stage. This commit also cleans up the checkpoint and terminate case making it more precise than before. Previously the application could make a small amount of progress between checkpoint completion and application termination. Now the application will make no progress at all in this time span. Additional minor change: - Start using OPAL_INT_TO_BOOL instead of if/else logic This commit was SVN r16952.	2007-12-13 14:37:17 +00:00
Rolf vandeVaart	3ea89b69ae	Remove a few tabs. Allow the output stream to be passed to the close command for verbose output. This matches all the other frameworks. This commit was SVN r16938.	2007-12-11 20:44:56 +00:00
Josh Hursey	27c9016b93	sleep -> usleep so we can be a bit more eager when waiting for events to finish. Still working on solutions that do not involve sleeping, but this will do for now. This commit was SVN r16824.	2007-12-03 19:27:32 +00:00
Jeff Squyres	c20350b943	Patch submitted by Brian Barrett, inspired by this thread: http://www.open-mpi.org/community/lists/users/2007/11/4547.php. - Better handling of ECONNABORTED from connect on Linux. - Reduce extraneous output from OOB when TCP connections must be retried. This commit was SVN r16808.	2007-11-30 21:42:15 +00:00
Ron Brightwell	edb9d8e354	Added Catamount to the conditional compilation since Catamount doesn't support fork() or pipe() either. This removes a linker warning message when building for Cray XT with Catamount. This commit was SVN r16772.	2007-11-21 21:37:58 +00:00
George Bosilca	d67c0eefb4	Remove a compilation warning about using uninitialized variables. This commit was SVN r16589.	2007-10-26 20:15:28 +00:00
George Bosilca	b1b5cb6453	Looks like SO_REUSEPORT is not defined on some platforms. Switch to the conventional SO_REUSEADDR instead. This commit was SVN r16588.	2007-10-26 19:56:21 +00:00
George Bosilca	337f78a4a8	Restrict the port range for the OOB and the BTL. Each protocol (v4 and v6) has its own range, which is defined by a min value and a range. By default there is no limitation on the port range, which is exactly the same behavior as before. This commit was SVN r16584.	2007-10-26 16:36:51 +00:00
Jeff Squyres	9e4387d021	* Use new BEGIN_C_DECLS / END_C_DECLS convention * Add newline at end of file to avoid compiler warning This commit was SVN r16579.	2007-10-26 13:40:38 +00:00
Shiqing Fan	3c38c9c020	- Add extern "C" to resolve linkage specification problems. This commit was SVN r16577.	2007-10-26 09:54:42 +00:00
Ralph Castain	a791ce2299	The processor affinity must be set on a per-process basis, not per-app-context. This commit was SVN r16559.	2007-10-23 20:46:16 +00:00
George Bosilca	7a63f9b730	I somehow messed up my last commit. Sorry. This commit was SVN r16543.	2007-10-22 15:08:17 +00:00
George Bosilca	b93f72bdfd	Remove 2 warnings about uninitialized i and quit_flags. This commit was SVN r16542.	2007-10-22 15:01:15 +00:00
Jeff Squyres	5637c7a5a0	In addition to r16513, this commit fixes trac:1170. If we cannot resolve the route to the peer that we're trying to send to, don't queue up the message in the TCP OOB -- instead, return it to the upper layer (e.g., the RML) and let it decide what to do. In the case of the routed RML, the tree component will queue it up for later transmission. Hence, we don't want the message queued up both here in the TCP OOB and the tree routed. Also see some more discussion / explanation in #1171. This commit was SVN r16540. The following SVN revision numbers were found above: r16513 --> open-mpi/ompi@7ae9589d70 The following Trac tickets were found above: Ticket 1170 --> https://svn.open-mpi.org/trac/ompi/ticket/1170	2007-10-22 13:46:57 +00:00
Jeff Squyres	7ae9589d70	The header is at the address of the buffer pointed to by the iov, not the address of the iov. This commit was SVN r16513.	2007-10-19 12:40:14 +00:00
Jeff Squyres	abf1b728b9	Minor code maintenance fix -- put the THREAD_UNLOCK outside the if statement so that you only have to have it once. This commit was SVN r16512.	2007-10-19 12:36:26 +00:00
Ralph Castain	73eeb7f0d2	Fix a bug in the way we handled buffer releases and the conditioned wait that held us in the xcast until completed. This commit was SVN r16504.	2007-10-19 01:17:01 +00:00
Josh Hursey	0bf61a1b84	Move in some accumulated small features and minor bug fixes for C/R support. {{{ svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs . }}} This commit was SVN r16478.	2007-10-17 13:47:36 +00:00
Ralph Castain	ec5fe78876	When in the unity message routing mode, we have to update the RML contact info in the parent procs so that they know how to talk to the children. Ideally, this would be done in the MPI layer since that layer knows which procs are actively involved in the comm_spawn. However, it isn't being done there, which causes comm_spawn to fail, so do it explicitly in the RTE. Note that this means ALL procs in the parent job are updated, even though they may not be participating in the comm_spawn. This doesn't really hurt anything - just unnecessary. Comm_spawn still has a problem when a child process shares a node with a parent, so this doesn't fix everything. It only fixes the bug of ensuring all procs know how to talk to each other. This commit was SVN r16460.	2007-10-16 16:09:41 +00:00
Ralph Castain	713b6e13a5	Improve diagnostic output messages when errors are hit This commit was SVN r16457.	2007-10-16 14:51:52 +00:00
Josh Hursey	ea0652d20f	If we are going to pretend to do filem, then we should always pretend. No one should be using this feature except for me. :) This commit was SVN r16454.	2007-10-15 20:04:35 +00:00
Ralph Castain	b6196e8a39	When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed. In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean. I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it. This commit was SVN r16451.	2007-10-15 18:00:30 +00:00
Jeff Squyres	423f23eb6a	Fixes trac:1160. There is still some other problem in the OOB, but we wanted to commit this to get wider testing. This commit was SVN r16445. The following Trac tickets were found above: Ticket 1160 --> https://svn.open-mpi.org/trac/ompi/ticket/1160	2007-10-15 15:41:36 +00:00
Josh Hursey	f16a42947a	Change some default MCA parameters: - Global snapshot directory = $HOME - FileM 'rsh' = 'ssh' - FileM 'rcp' = 'scp' This commit was SVN r16444.	2007-10-15 15:21:17 +00:00
Josh Hursey	520c27ac94	If the HNP is acting as the orted for local launch then the gpr_replica variable is not defined. Make sure to set it to something reasonable so that file preloading still works (instead of seg faulting :) Thanks to Hiep Bui Hoang for reporting this bug. This commit was SVN r16433.	2007-10-11 19:47:04 +00:00
Josh Hursey	e483c36cea	Remove a bit of debug in filem/rsh that should have never been committed. A gesture towards overlapping file removal with metadata update. This commit was SVN r16432.	2007-10-11 19:37:33 +00:00
Ralph Castain	3dbd4d9be7	Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by: 1. taking advantage of the fact that we no longer create the launch message via a GPR trigger. In earlier times, we had the GPR create the launch message based on a subscription. In that mode of operation, we could not guarantee the order in which the data was stored in the message - hence, we had no choice but to parse the message in a loop that checked each value against a list of possible "keys" until the corresponding value was found. Now, however, we construct the message "by hand", so we know precisely what data is in each location in the message. Thus, we no longer need to send the character string "keys" for each data value any more. This represents a rather large savings in the message size - to give you an example, we typically would use a 30-char "key" for a 2-byte data value. As you can see, the overhead can become very large. 2. sending node-specific data only once. Again, because we used to construct the message via subscriptions that were done on a per-proc basis, the data for each node (e.g., the daemon's name, whether or not the node was oversubscribed) would be included in the data for each proc. Thus, the node-specific data was repeated for every proc. Now that we construct the message "by hand", there is no reason to do this any more. Instead, we can insert the data for a specific node only once, and then provide the per-proc data for that node. We therefore not only save all that extra data in the message, but we also only need to parse the per-node data once. The savings become significant at scale. 
Here is a comparison between the revised trunk and the trunk prior to this commit (all data was taken on odin, using openib, 64 nodes, unity message routing, tested with application consisting of mpi_init/mpi_barrier/mpi_finalize, all execution times given in seconds, all launch message sizes in bytes):

Per-node scaling, taken at 1ppn:

#nodes   original trunk     revised trunk
         time     size      time     size
1        0.10      819      0.09      564
2        0.14     1070      0.14      677
3        0.15     1321      0.14      790
4        0.15     1572      0.15      903
8        0.17     2576      0.20     1355
16       0.25     4584      0.21     2259
32       0.28     8600      0.27     4067
64       0.50    16632      0.39     7683

Per-proc scaling, taken at 64 nodes:

ppn      original trunk     revised trunk
         time     size      time     size
1        0.50    16669      0.40     7720
2        0.55    32733      0.54    11048
3        0.87    48797      0.81    14376
4        1.0     64861      0.85    17704

Condensing those numbers, it appears we gained:

per-node message size: 251 bytes/node -> 113 bytes/node
per-proc message size: 251 bytes/proc -> 52 bytes/proc
per-job message size: 568 bytes/job -> 399 bytes/job (job-specific data such as jobid, override oversubscribe flag, total #procs in job, total slots allocated)

The fact that the two pre-commit trunk numbers are the same confirms the fact that each proc was containing the node data as well. It isn't quite the 10x message reduction I had hoped to get, but it is significant and gives much better scaling. Note that the timing info was, as usual, pretty chaotic - the numbers cited here were typical across several runs taken after the initial one to avoid NFS file positioning influences. Also note that this commit removes the orte_process_info.vpid_start field and the handful of places that passed that useless value. By definition, all jobs start at vpid=0, so all we were doing is passing "0" around. In fact, many places simply hardwired it to "0" anyway rather than deal with it. This commit was SVN r16428.	2007-10-11 15:57:26 +00:00
Rolf vandeVaart	25c95c9ee9	Fix build on solaris. Need to include sys/wait.h. This commit was SVN r16426.	2007-10-11 15:04:30 +00:00
Jeff Squyres	e2df42eea3	Move the <sys/wait.h> below "orte_config.h" This commit was SVN r16424.	2007-10-11 11:31:09 +00:00
George Bosilca	7cc9f588a8	Decorate the base functions with ORTE_DECLSPEC. This commit was SVN r16423.	2007-10-11 00:02:49 +00:00
Ralph Castain	53af94fd87	Modify the configure system so that gridengine support is only built in specific conditions: 1. --with-sge, always builds 2. --without-sge, never builds 3. if neither is specified, build if and only if either SGE_ROOT is set or "qrsh" is found in the path This commit was SVN r16422.	2007-10-10 21:39:16 +00:00
Josh Hursey	6e5341c659	Forgot to move a header in the code movement. This commit was SVN r16420.	2007-10-10 15:39:40 +00:00
Ralph Castain	82a8e2d10d	Reorganize the odls framework to place common functionality in the base, thus making maintenance easier. We still need this to be a framework as some environments (e.g., bproc) require significantly different functionality. However, there is quite a bit of commonality across the components, so this ensures that fixes in one get propagated across the others. This patch also fixes a minor bug discovered along the way: we had "lost" the passing of the oversubscribed condition flag from the mapper to the orteds. Thus, we were not setting sched_yield correctly when in oversubscribed conditions (except when a hostfile was specified - different logic there because we treat the number of slots allocated on the node as "uncertain") I did not modify the process component in this patch - I will send a proposed patch to the maintainers of that component so they can review it first. This commit was SVN r16418.	2007-10-10 15:02:10 +00:00
Josh Hursey	7f833a9cb2	silence a warning that is triggered on restart This commit was SVN r16417.	2007-10-10 14:25:49 +00:00
Ethan Mallove	d0b61db65c	Add in a missing #include for Solaris builds. This commit was SVN r16416.	2007-10-10 12:49:15 +00:00
Josh Hursey	aa8391f888	Local and global coordinators should be the only ones involved in the movement of checkpoint files. This reduces the overhead on the application. This commit was SVN r16412.	2007-10-09 19:52:47 +00:00
Galen Shipman	fda1306807	revert my stupidity.. This commit was SVN r16410.	2007-10-09 19:01:20 +00:00
Josh Hursey	8fe2ef5647	a missing include This commit was SVN r16402.	2007-10-09 14:32:36 +00:00
Josh Hursey	7437f37e96	This commit contains the following: * Fix some missing includes in a few places. * Add the cr_request() functionality to the BLCR CRS component. We are now dependent upon the 0.6.* series of BLCR. * Made the CR notification mechanism a registered function. This way we can have an OPAL-only version and it can be replaced at runtime with the ORTE version. * Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only CR functionality when the user wants it. Default: Disabled. * Fix the placement of a checkpoint request check in MPI_Init * Pull the OPAL notification mechanism into the SnapC framework. * We no longer fork/exec the 'opal-checkpoint' command for local checkpointing, the Local coordinator in the orted does this directly. * The Local and Application coordinator talk together bypassing the OPAL notification mechanism. * Optimized the Local <-> App Coordinator communication. * Improved the structure used to track vpid_snapshots in the local coord. * Fix a race condition in which an application under heavy communication load may produce an inconsistent global checkpoint. This commit was SVN r16389.	2007-10-08 20:53:02 +00:00
Galen Shipman	1c1b9d5480	make cray happy This commit was SVN r16377.	2007-10-08 14:31:59 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. 
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. 
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
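
The message-size figures quoted in the r16428 and r16364 commit messages above can be re-derived from the numbers given there. The following is a minimal sketch in Python, not code from the repository. All names are illustrative. It assumes that the per-node and per-proc costs are the average growth between the first and last rows of each table, that the per-job cost is what remains of the 1-node, 1-proc message after subtracting those costs, and that "32k" in r16364 means 32,000 with decimal MBytes/GBytes. Those readings happen to reproduce the quoted values, but they are assumptions about how the author computed them.

```python
# Launch message sizes in bytes, copied from the r16428 commit message.
# Each entry maps a count to (original trunk size, revised trunk size).
PER_NODE = {1: (819, 564), 64: (16632, 7683)}      # keyed by #nodes, 1 ppn
PER_PROC = {1: (16669, 7720), 4: (64861, 17704)}   # keyed by ppn, 64 nodes
NODES = 64


def bytes_per_unit(table, low, high, column, units_per_step=1):
    """Average growth in message size per added node or proc."""
    growth = table[high][column] - table[low][column]
    return growth / ((high - low) * units_per_step)


node_before = bytes_per_unit(PER_NODE, 1, 64, 0)
node_after = bytes_per_unit(PER_NODE, 1, 64, 1)
# Each ppn step at 64 nodes adds 64 procs to the job.
proc_before = bytes_per_unit(PER_PROC, 1, 4, 0, NODES)
proc_after = bytes_per_unit(PER_PROC, 1, 4, 1, NODES)

# Assumption: before the commit, node and proc data travelled together,
# so only one 251-byte unit is subtracted from the 1-node message.
job_before = PER_NODE[1][0] - node_before
# Assumption: after the commit, node and proc data are separate entries.
job_after = PER_NODE[1][1] - node_after - proc_after

# r16364: ~50 bytes of OOB contact info per process, 32k procs (assumed 32,000).
PROCS = 32_000
BYTES_PER_PROC = 50
startup_message = PROCS * BYTES_PER_PROC   # bytes in one startup message
total_traffic = startup_message * PROCS    # one message sent to every proc

print(f"per-node: {node_before:.0f} -> {node_after:.0f} bytes")
print(f"per-proc: {proc_before:.0f} -> {proc_after:.0f} bytes")
print(f"per-job:  {job_before:.0f} -> {job_after:.0f} bytes")
print(f"startup message: {startup_message / 1e6:.1f} MBytes")
print(f"total traffic:   {total_traffic / 1e9:.1f} GBytes")
```

Under these assumptions the script reports 251 -> 113 bytes per node, 251 -> 52 bytes per proc, 568 -> 399 bytes per job, a 1.6 MByte startup message, and about 51 GBytes of total traffic, matching the commit messages.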


1521 commits