1
1
Граф коммитов

1521 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
955d117f5e Add a new grpcomm module that mimics the old 1.2 behavior - it -always- does a modex because it always includes the architecture. Hence, we called it "blind-and-dumb" since it doesn't look to see if this is required - moniker of "bad". :-)
Update the ESS API so we can update the stored arch's should the modex include that info. Update ompi/proc to check/set the arch for remote procs, and add that function call to mpi_init right after the modex is done.

Setup to allow other grpcomm modules to decide whether or not to add the arch to the modex, and to detect if other entries have been made. If not, then the modex can just fall through. Begin setting up some logic in the "basic" module to handle different arch situations.

For now, default to the "bad" module so we will work in all situations, even though we may be sending around more info than we really require.

This fixes ticket #1340

This commit was SVN r18673.
2008-06-18 22:17:53 +00:00
Ralph Castain
282a220e7e Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation
This commit was SVN r18670.
2008-06-18 15:28:46 +00:00
George Bosilca
8e7c35e76c These symbols are only available via the module/component structure, so they
don't have to be globally visible.

This commit was SVN r18666.
2008-06-18 08:20:02 +00:00
Ralph Castain
0532d799d6 Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm.
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.

This commit was SVN r18664.
2008-06-18 03:15:56 +00:00
Ralph Castain
a87aa442e3 Remove last remaining reference to iof_flush - it was #if'd out anyway. The existing flush code appears to have several critical problems. Given the impending rework of the IOF subsystem, there is no point in trying to fix it here.
This commit was SVN r18649.
2008-06-11 16:25:46 +00:00
Ralph Castain
f9d809748c Glad someone found that last error - caused me to review the code and find a couple of other cleanups! Nothing major, but just ensure that things flow smoothly since we had a "shadowed" variable.
This commit was SVN r18643.
2008-06-10 19:15:59 +00:00
Camille Coti
67cd1849f7 *map was still NULL in the else statement, inducing a segmentation fault when a field of the structure was accessed to.
This commit was SVN r18642.
2008-06-10 19:00:57 +00:00
Ralph Castain
1a422995ae Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak.
This commit was SVN r18641.
2008-06-10 17:53:28 +00:00
Brian Barrett
4127bd0dcc fix two other mistakes in the cnos ess
This commit was SVN r18632.
2008-06-09 22:28:26 +00:00
George Bosilca
f72ab90b16 Allow xgrid to compile again.
This commit was SVN r18631.
2008-06-09 21:51:41 +00:00
Brian Barrett
11cd3a7cba Fix problem where local rank always had different architecture than remote
ranks on Red Storm

This commit was SVN r18630.
2008-06-09 21:46:03 +00:00
Ralph Castain
c13cadc3c7 Refs trac:1255
This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something.

So the ticket can be checked by Jeff when he returns from vacation... :-)

This commit was SVN r18625.

The following Trac tickets were found above:
  Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255
2008-06-09 20:34:14 +00:00
Ralph Castain
bf5c34d10a The rsh launcher is one place where multi-word MCA params would have to be passed via the orted cmd line. In such a case, we have to explicitly include quote marks about the param value. Add that capability here.
This commit fixes trac:1200

This commit was SVN r18621.

The following Trac tickets were found above:
  Ticket 1200 --> https://svn.open-mpi.org/trac/ompi/ticket/1200
2008-06-09 19:07:19 +00:00
Ralph Castain
9613b3176c Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.

I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.

This commit was SVN r18619.
2008-06-09 14:53:58 +00:00
Pak Lui
caac0e0182 Add in a couple missing ones from r18611 for all tm users out there...
This commit was SVN r18615.

The following SVN revision numbers were found above:
  r18611 --> open-mpi/ompi@7bee71aa59
2008-06-06 22:53:43 +00:00
Ralph Castain
b65eb54ea2 Cut out a new iof pull - that capability isn't ready yet for the trunk, but will be coming shortly
Thanks to Pak for letting me know...

This commit was SVN r18614.
2008-06-06 21:24:15 +00:00
Pak Lui
7f7777a538 Check for NULL in prefix_dir.
This commit fixes trac:1337.

This commit was SVN r18612.

The following Trac tickets were found above:
  Ticket 1337 --> https://svn.open-mpi.org/trac/ompi/ticket/1337
2008-06-06 19:55:01 +00:00
Ralph Castain
7bee71aa59 Fix a potential, albeit perhaps esoteric, race condition that can occur for fast HNP's, slow orteds, and fast apps. Under those conditions, it is possible for the orted to be caught in its original send of contact info back to the HNP, and thus for the progress stack never to recover back to a high level. In those circumstances, the orted can "hang" when trying to exit.
Add a new function to opal_progress that tells us our recursion depth to support that solution.

Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death...

Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang.

After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them.

Thanks you Jeff! :-/

This commit was SVN r18611.
2008-06-06 19:36:27 +00:00
Josh Hursey
1de50b523c Fix some Coverity 'Event set_but_not_used' highlights.
Thanks to Jeff for bringing them to my attention.

This commit was SVN r18606.
2008-06-06 14:38:41 +00:00
Jeff Squyres
d3795d7a34 Fix CID 987: remove unused variable.
This commit was SVN r18598.
2008-06-05 20:17:02 +00:00
Ralph Castain
332e6c89ab Modify the slurm launcher so that the kill-on-bad-exit behavior is not "on" by default. Instead, only turn it "on" if the plm_slurm_detect_failure mca param is set to something non-zero
This commit was SVN r18588.
2008-06-04 23:59:53 +00:00
Ralph Castain
0da811ce79 Initial work on xml support - allocation and job map outputs completed. More to come.
This commit was SVN r18587.
2008-06-04 20:53:12 +00:00
George Bosilca
25ae9c12e6 Silence few warnings.
This commit was SVN r18568.
2008-06-03 19:58:40 +00:00
George Bosilca
fa89d299bf Silence the Obj-C compiler.
This commit was SVN r18567.
2008-06-03 19:24:17 +00:00
Ralph Castain
c992e99035 Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface
This commit was SVN r18557.
2008-06-03 14:24:01 +00:00
Ralph Castain
95578b0528 Fix single-node operations so that the HNP correctly exits when the job completes
This commit was SVN r18556.
2008-06-03 14:23:04 +00:00
Ralph Castain
b456fb2d42 Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2*heartrate, declares orted failure if not seen in last 2*heartrate time.
Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher.

Neither detection is enabled by default, but are only active if heartrate is set and/or launch timeout is set. Exception for SLURM as orted failure is always detected and reported.

More info to come on devel list.

This commit was SVN r18555.
2008-06-02 21:46:34 +00:00
Shiqing Fan
af656b2b3d Fix some typing mistakes, make the sources compile again for Windows Visual Studio.
This commit was SVN r18542.
2008-05-29 15:27:43 +00:00
Ralph Castain
2b28bef15a Provide a "nicer" indication that we don't know the pid of the failed orted
This commit was SVN r18538.
2008-05-29 14:10:58 +00:00
Ralph Castain
72530f8fed Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed:
1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done.

2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear.

This represents the best-available fix for ticket #221 so I am closing that ticket at this time.

This commit was SVN r18536.
2008-05-29 13:38:27 +00:00
Ralph Castain
52fb773c6c Tell slurm to kill the job if an orted abnormally exits
This commit was SVN r18535.
2008-05-29 12:26:58 +00:00
Ralph Castain
e5e542ddcf Clarify an error message
This commit was SVN r18533.
2008-05-29 12:20:24 +00:00
Josh Hursey
4ac7016200 Make sure to check "opal_list_get_last" instead of "opal_list_get_end".
The former will return a valid item in the list, the latter will return an invalid item that marks the end of the list.

It was happending that when oversubscribing by way of an appfile we would cause a segv because we tried to interpret the invalid item returned by "opal_list_get_end" instead of a valid item. We would then try to write to unallocated memory.

This commit fixes trac:1279

This commit was SVN r18529.

The following Trac tickets were found above:
  Ticket 1279 --> https://svn.open-mpi.org/trac/ompi/ticket/1279
2008-05-28 19:37:20 +00:00
Ralph Castain
f76240e7cc Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algo's, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch.
This commit was SVN r18528.
2008-05-28 18:38:47 +00:00
George Bosilca
1eb1742225 Remove this left over dependency.
This commit was SVN r18508.
2008-05-27 16:57:40 +00:00
Ralph Castain
93d932aa0c Ensure that the display-map and display-allocation outputs get processed through the new OPAL filter framework by passing them through orte_output instead of using the opal_dss.dump function.
This commit was SVN r18507.
2008-05-27 15:46:21 +00:00
Ralph Castain
0b2b655de5 Initialize a variable so it can correctly be dealt with at shutdown - fixes trac:1312
This commit was SVN r18505.

The following Trac tickets were found above:
  Ticket 1312 --> https://svn.open-mpi.org/trac/ompi/ticket/1312
2008-05-27 14:53:24 +00:00
Pak Lui
695c158192 silence some intel and pgcc compiler warnings.
This commit was SVN r18501.
2008-05-26 20:35:13 +00:00
Pak Lui
7b3d7dcac4 This commit closes trac:1300.
This commit was SVN r18473.

The following Trac tickets were found above:
  Ticket 1300 --> https://svn.open-mpi.org/trac/ompi/ticket/1300
2008-05-21 22:35:04 +00:00
Josh Hursey
7e8cd20a0a a fix for C/R support
This commit was SVN r18438.
2008-05-14 16:57:37 +00:00
Jeff Squyres
671f0c379d Remove a whole pile of orte/util/show_help.h's that I missed. :-(
This commit was SVN r18437.
2008-05-14 11:32:33 +00:00
Pak Lui
4c8d79d907 Silence the compiler warnings/errors. There is no orte/util/show_help.h
This commit was SVN r18436.
2008-05-13 22:07:38 +00:00
Jeff Squyres
e7ecd56bd2 This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.

= ORTE Job-Level Output Messages =

Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):

 * orte_output(): (and corresponding friends ORTE_OUTPUT,
   orte_output_verbose, etc.)  This function sends the output directly
   to the HNP for processing as part of a job-specific output
   channel.  It supports all the same outputs as opal_output()
   (syslog, file, stdout, stderr), but for stdout/stderr, the output
   is sent to the HNP for processing and output.  More on this below.
 * orte_show_help(): This function is a drop-in-replacement for
   opal_show_help(), with two differences in functionality:
   1. the rendered text help message output is sent to the HNP for
      display (rather than outputting directly into the process' stderr
      stream)
   1. the HNP detects duplicate help messages and does not display them
      (so that you don't see the same error message N times, once from
      each of your N MPI processes); instead, it counts "new" instances
      of the help message and displays a message every ~5 seconds when
      there are new ones ("I got X new copies of the help message...")

opal_show_help and opal_output still exist, but they only output in
the current process.  The intent for the new orte_* functions is that
they can apply job-level intelligence to the output.  As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.

=== New code ===

For ORTE and OMPI programmers, here's what you need to do differently
in new code:

 * Do not include opal/util/show_help.h or opal/util/output.h.
   Instead, include orte/util/output.h (this one header file has
   declarations for both the orte_output() series of functions and
   orte_show_help()).
 * Effectively s/opal_output/orte_output/gi throughout your code.
   Note that orte_output_open() takes a slightly different argument
   list (as a way to pass data to the filtering stream -- see below),
   so you if explicitly call opal_output_open(), you'll need to
   slightly adapt to the new signature of orte_output_open().
 * Literally s/opal_show_help/orte_show_help/.  The function signature
   is identical.

=== Notes ===

 * orte_output'ing to stream 0 will do similar to what
   opal_output'ing did, so leaving a hard-coded "0" as the first
   argument is safe.
 * For systems that do not use ORTE's RML or the HNP, the effect of
   orte_output_* and orte_show_help will be identical to their opal
   counterparts (the additional information passed to
   orte_output_open() will be lost!).  Indeed, the orte_* functions
   simply become trivial wrappers to their opal_* counterparts.  Note
   that we have not tested this; the code is simple but it is quite
   possible that we mucked something up.

= Filter Framework =

Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr.  The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations.  The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc.  This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).

Filtering is not active by default.  Filter components must be
specifically requested, such as:

{{{
$ mpirun --mca filter xml ...
}}}

There can only be one filter component active.

= New MCA Parameters =

The new functionality described above introduces two new MCA
parameters:

 * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
   help messages will be aggregated, as described above.  If set to 0,
   all help messages will be displayed, even if they are duplicates
   (i.e., the original behavior).
 * '''orte_base_show_output_recursions''': An MCA parameter to help
   debug one of the known issues, described below.  It is likely that
   this MCA parameter will disappear before v1.3 final.

= Known Issues =

 * The XML filter component is not complete.  The current output from
   this component is preliminary and not real XML.  A bit more work
   needs to be done to configure.m4 search for an appropriate XML
   library/link it in/use it at run time.
 * There are possible recursion loops in the orte_output() and
   orte_show_help() functions -- e.g., if RML send calls orte_output()
   or orte_show_help().  We have some ideas how to fix these, but
   figured that it was ok to commit before feature freeze with known
   issues.  The code currently contains sub-optimal workarounds so
   that this will not be a problem, but it would be good to actually
   solve the problem rather than have hackish workarounds before v1.3 final.

This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
Shiqing Fan
7ff440f628 Add quotation marks for windows path.
This commit was SVN r18420.
2008-05-09 14:12:09 +00:00
Josh Hursey
da2f1c58e2 Some checkpoint/restart cleanup.
* Remove the opal_only option. This was suffering from bit rot, and no one uses it. It can be added back fairly easily if wanted.
 * Cleanup metadata interactions at the local level.
 * Touch up some of the INC funcitonality (fix typos and a minor ordering issue)

This commit was SVN r18416.
2008-05-08 18:47:47 +00:00
Ralph Castain
64ef4102c4 Add the topo mapper module - requires some work in carto for completion.
Little cleanup in round-robin mapper.

This commit was SVN r18412.
2008-05-08 05:09:13 +00:00
Ralph Castain
ac5263613c Fix stupid singletons yet again
This commit was SVN r18408.
2008-05-07 20:26:31 +00:00
George Bosilca
dbea3e070e Correct some copy/paste errors.
This commit was SVN r18396.
2008-05-07 04:04:42 +00:00
Ralph Castain
ff70636024 Allgather_list needs its own tag to avoid conflicting with the allgather modex operation.
All spawned procs must decode the port of the spawning process so they can communicate in direct routed mode.

This fixes comm_spawn for all routing modes.

This commit was SVN r18395.
2008-05-07 03:03:56 +00:00
Josh Hursey
bc67f40936 whoops typo
This commit was SVN r18390.
2008-05-06 22:00:24 +00:00
Josh Hursey
50c909a23d Fix a bit of selection logic. Filem should not fail select if the user decided not to build with any filem components. This matches the logic before the mca_base_select() change.
This commit was SVN r18389.
2008-05-06 21:57:45 +00:00
Pak Lui
108921c020 typo
This commit was SVN r18387.
2008-05-06 21:37:35 +00:00
Pak Lui
0302c098be minor typo
This commit was SVN r18386.
2008-05-06 21:26:17 +00:00
Ralph Castain
d97a4f880d Shift the daemon collective operation to the ODLS framework. Ensure we track the collectives per job to avoid race conditions. Take advantage of the new capabilities of the routed framework to define aggregating trees for the daemon collective, and to track which daemons are participating to handle the case of sparse participation.
Make it all work with comm_spawn in the case of all procs on previously occupied nodes, some new procs on new nodes, and mixtures of the two.

Note: comm_spawn now works with both binomial and linear routed modules. There remains a problem of spawned procs not properly getting updated contact info for the parent proc when run in the direct routed mode...but that's for another day.

This commit was SVN r18385.
2008-05-06 20:16:17 +00:00
Josh Hursey
c47406810e Fix AMCA orted command line.
If no AMCA parameters are passed then do not send across the path information. Only place it on the command line if the AMCA parameter is set.

This commit was SVN r18382.
2008-05-06 18:27:31 +00:00
Josh Hursey
9971bc9d95 Merge in the mca_base_select changes per RFC:
http://www.open-mpi.org/community/lists/devel/2008/04/3779.php

{{{
svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play .
}}}

Any components not in the trunk, but in one of the effected frameworks *must* be
updated. Contact the list, look at the RFC, or look at the diff for how to do this.

Sorry for the early commit of this, but I wanted to get it in today (per RFC) and
didn't know if I would have a chance later today.

This commit was SVN r18381.
2008-05-06 18:08:45 +00:00
Ralph Castain
40904dd152 Add a binomial routed module - for now, still completely wires up the daemons, but that will be changed later.
Modify grpcomm xcast so it now uses the selected routed module - eliminates cross-wiring of xcast and routing paths. Suboptimal at the moment, but better implementation is on its way.

Cleanup ignore properties on the new routed components.

This commit was SVN r18377.
2008-05-05 22:32:25 +00:00
Aurelien Bouteiller
5ba62469a0 Add a route_is_defined implementation for the linear oob routing.
This commit was SVN r18375.
2008-05-05 19:12:41 +00:00
Aurelien Bouteiller
2ae30fe126 Implementation of the route_is_defined stub for direct oob routing.
This commit was SVN r18373.
2008-05-05 18:23:26 +00:00
Ralph Castain
b8bb990acf Rename the routed modules to more accurately reflect what they do and the role they will play in soon-to-come updates.
Add two new API's to the routed framework - stub them out so that collaborators can work on them in various components without conflicts.

Remove a "finalize" from the select function that could cause problems as the component had not had its initialize called yet.

This commit was SVN r18369.
2008-05-05 02:59:09 +00:00
Ralph Castain
519c15f8af Fix direct and linear xcast modes
This commit was SVN r18359.
2008-05-02 14:30:07 +00:00
Ralph Castain
8e846bf7f2 Separate the gathering of collective data by jobid
This commit was SVN r18357.
2008-05-02 12:00:08 +00:00
Ralph Castain
432d441b3e Cleanup a bug found by Josh that caused multiple app_contexts to keep mapping onto the first node in an allocation
Continue work on loadbalancing

Cleanup code organization in rmaps_base

This commit was SVN r18353.
2008-05-01 21:07:49 +00:00
Ralph Castain
b2c73f6e11 Fix tree-spawn to work within the new modex system
This commit was SVN r18349.
2008-05-01 19:19:34 +00:00
Josh Hursey
dcd21d7d07 Some checkpoint/restart fixes in response to r18338 (changes in modex).
Things should be working now.

This commit was SVN r18348.

The following SVN revision numbers were found above:
  r18338 --> open-mpi/ompi@3e55fe6f6d
2008-05-01 17:48:13 +00:00
Ralph Castain
ad894b050b Set the bookmark so the first process of a comm_spawn'd job will be mapped to the same node as the spawning proc, assuming it has space. If not, then the mapper will automatically move to the next node.
This commit was SVN r18346.
2008-05-01 15:24:03 +00:00
Ralph Castain
1766442591 Fix a double-free when tree-spawning
Fix the round-robin mapper so it doesn't move to the next node just because it completed mapping an app_context

This commit was SVN r18344.
2008-05-01 14:49:56 +00:00
Ralph Castain
3e55fe6f6d Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit.
Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs.

Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node.

This commit was SVN r18338.
2008-04-30 19:49:53 +00:00
Ralph Castain
4c2c6c9bd8 Ensure the pack/unpacks match for tree-spawn
This commit was SVN r18282.
2008-04-24 18:53:08 +00:00
Ralph Castain
09b6758f8c Pass the prefix dir to the remote orted when doing tree-based spawns
This commit was SVN r18280.
2008-04-24 18:38:24 +00:00
Josh Hursey
2c736873bb Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors.
The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge.

The fix is to have the OMPI level shutdown tcp connections, allow the ORTE level to restart, and then allow the OMPi level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.

Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.

 * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level.
 * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
 * Update ft_event functions in PML and BML to handle the new restart state.
 * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.

This commit was SVN r18276.
2008-04-24 17:54:22 +00:00
Ralph Castain
eece9f88f0 Fix a bug in the way we computed local_rank. This needs to be the local_rank -among my job peers- on a node.
We were mistakenly computing the local_rank across -all- jobs with procs on that node. While the two definitions are equivalent for an initial launch, comm_spawn'd procs would get the wrong local_rank. In particular, there would not be a local_rank=0 proc in the comm_spawn'd job on any node that was shared with the initial job.

This commit was SVN r18263.
2008-04-23 17:42:59 +00:00
Ralph Castain
f56f06a7ff Do not trust the RM's names - apparently, RR has trained it to lie! Default to using the name we got from gethostname as it is the only one we can trust.
This commit was SVN r18259.
2008-04-23 17:00:35 +00:00
Ralph Castain
8001e4e99c See if this will fix a race condition showing up in comm_spawn MTT testing
This commit was SVN r18257.
2008-04-23 15:43:44 +00:00
Ralph Castain
5311b13b60 Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list
Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system.

This commit was SVN r18252.
2008-04-23 14:52:09 +00:00
Lenny Verkhovsky
456ce6c4da Few cleanups in Rank_File component + fixed opal_paffinity_slot_list without rankfile
This commit was SVN r18249.
2008-04-23 13:34:05 +00:00
Shiqing Fan
eb5f5d77cc If it's not the HNP, release the cluster object first and return.
This commit was SVN r18247.
2008-04-23 13:21:32 +00:00
Josh Hursey
750ce0152c After a bit of testing this morning it seems that the tree component is able to work correctly with the checkpoint/restart functionality. So enable this component when C/R is enabled.
This commit was SVN r18246.
2008-04-23 13:01:23 +00:00
Josh Hursey
cc83d41ad9 Merge in tmp/jjh-scratch
{{{
 svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch .
}}}

Contains:
 * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart.
 * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff
 * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P.
 * Cleanup mpi interface with missplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry
 * Some other sundry cleanup items all dealing with C/R functionality in the trunk.

This commit was SVN r18241.
2008-04-23 00:17:12 +00:00
Ralph Castain
c3ddf66445 Move the dislay-allocation code to where it is always seen
This commit was SVN r18227.
2008-04-21 20:28:59 +00:00
Ralph Castain
16c9100633 Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation.
This commit was SVN r18216.
2008-04-20 02:25:45 +00:00
Ralph Castain
07f0a71faa Cleanup the show_help entries on the seq mapper
This commit was SVN r18191.
2008-04-17 14:43:15 +00:00
Ralph Castain
e7487ad533 Implement the seq rmaps module that sequentially maps process ranks to a list hosts in a hostfile.
Restore the "do-not-launch" functionality so users can test a mapping without launching it.

Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests.

Add a function to hostfile to generate an ordered list of host names from a hostfile

This commit was SVN r18190.
2008-04-17 13:50:59 +00:00
Ralph Castain
66e532669a Remove some dead code
This commit was SVN r18182.
2008-04-16 20:33:53 +00:00
Ralph Castain
3413191e52 Fix singleton and singleton comm_spawn
This commit was SVN r18177.
2008-04-16 14:38:10 +00:00
Ralph Castain
7b91f8baff Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implemenation that could cause segfaults due to uninitialized memory.
Fix the ompi-server -h cmd line option so it actually tells you something!

Add two new testing codes to the orte/test/mpi area: accept and connect.

This commit was SVN r18176.
2008-04-16 14:27:42 +00:00
Adrian Knoth
84e4013530 Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set.
This commit was SVN r18164.
2008-04-16 09:31:15 +00:00
Adrian Knoth
0ddfff4ffe Added new oob-tcp parameter oob_tcp_disable_family.
Like btl_tcp_disable_family, this parameter more or less disables
a whole address family. Though the sockets are still created, the
corresponding information isn't added to the connection strings.

Likewise, we don't try to connect to addresses matching the disabled
address family.

This is particularly important for multidomain clusters, where IPv4 is
oftenly filtered (firewalled), sometimes by simply dropping the packets
instead of rejecting them (thus causing a connection timeout instead of
a quick "no route to host").

This commit was SVN r18163.
2008-04-16 09:22:00 +00:00
Ralph Castain
a4ea756a76 Ensure the node loop cntr gets incremented if the daemon already exists
This commit was SVN r18150.
2008-04-15 14:20:03 +00:00
Ralph Castain
35c260a14f Fix the plm modules to accommodate the new remote_spawn entry - set that entry to NULL for all but rsh as only that module supports it at this time
This commit was SVN r18145.
2008-04-14 19:36:13 +00:00
Ralph Castain
84156c422f Egad! Typo snuck in there...nasty vi!
This commit was SVN r18144.
2008-04-14 18:29:11 +00:00
Ralph Castain
7c7304466c Add a binomial tree-based launch to ssh, turned "on" only when the plm_rsh_tree_spawned mca param is set to a non-zero value. This probably isn't a very optimized capability, but it does execute a tree-based launch that may scale better than linear at high node counts.
Add the daemon map capability to the ODLS to create and save a map of daemon vpid vs nodename from the launch message.

Cleanup a few places in the base plm launch support where we didn't adequately protect rml recv's from potentially executing sends.

This commit was SVN r18143.
2008-04-14 18:26:08 +00:00
Ralph Castain
e050f37578 Cleanup a few warnings about initializing variables.
Remove an obsolete data value.

This commit was SVN r18129.
2008-04-10 19:15:16 +00:00
Ralph Castain
851279fc9f Consolidate the daemon wireup message into the launch message. The daemons don't need their contact info prior to the launch message anyway. This not only eliminates a job-wide communication from the startup procedure, but it also resolves a race condition reported when operating across highly distributed (i.e., cross-country) networks. In such scenarios, it proved possible for a daemon to receive its launch message -before- it had received the contact info message, even though the latter had been sent first!
This eliminates that problem...

This commit was SVN r18126.
2008-04-10 15:35:11 +00:00
Ralph Castain
57e3e86cda Use the proper exit code for mpirun to indicate an error when something goes wrong during launch (in scenarios where the procs don't report the problem directly themselves)
This commit was SVN r18121.
2008-04-10 09:15:08 +00:00
Ralph Castain
e7d0dae89d Ensure we update the daemon collective trees if num_procs changes, but only if it changes
This commit was SVN r18120.
2008-04-10 03:44:18 +00:00
Ralph Castain
22343e6e0b Given total lack of interest/support from the folks behind these environments, and the fact that we can now scale so well with our own daemons, it seems unlikely that we will be able to pursue direct and/or standalone launch in these environments. If that situation ever changes, it is easy enough to revive the effort since little had really been done to-date.
Meantime, no reason to continue dragging these around.

This commit was SVN r18119.
2008-04-10 02:54:13 +00:00
Ralph Castain
dc2f88b9f0 Now that we have the daemon collectives, the unity routed module no longer needs the "hack" we inserted a week ago to tell the daemons how to talk directly to all the application procs. The modex and barrier messages flow cleanly across the daemons and are "dropped" into the procs where required.
Add some insurance to make certain that the daemons' number of procs only gets updated when it absolutely is intended.

This commit was SVN r18118.
2008-04-10 02:45:42 +00:00
Ralph Castain
0b3122ee2f Update the cnos module - should (hopefully) compile and work...
This commit was SVN r18117.
2008-04-09 22:33:00 +00:00
Ralph Castain
3a0d09300b Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations.
Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study.

This commit was SVN r18115.
2008-04-09 22:10:53 +00:00
Ralph Castain
11c6773c83 Commit a patch from Brian that fixes potential segfaults in systems where IPv6 include files are found, but the kernel doesn't actually support IPv6.
This commit was SVN r18106.
2008-04-09 12:53:24 +00:00
Lenny Verkhovsky
2be4e32c79 1. Fixing Possible strdup of NULL
2.  Fixing num_alloc when combined mapping policies ( rankfile & byslot or bynode )

This commit was SVN r18073.
2008-04-02 14:12:38 +00:00
Ralph Castain
f115b4aed2 Checkpoint the revised gather algorithm
This commit was SVN r18072.
2008-04-02 13:35:06 +00:00
Adrian Knoth
a56b9b1df1 Fix broken build with --disable-ipv6.
This commit was SVN r18071.
2008-04-02 10:53:48 +00:00
Ralph Castain
50433bf833 Turn off the new fqdn behavior pending resolution of hostfile issue
This commit was SVN r18064.
2008-04-01 20:52:22 +00:00
Ralph Castain
51533c9340 Add a new mapper component that sequentially maps ranks-to-hosts according to the ordering in the hostfile.
Not functional yet - still under development. Just placeholding for now to clear a backlog

This commit was SVN r18062.
2008-04-01 20:03:49 +00:00
Ralph Castain
ee5b96269e The RML is comfortable with zero-byte payloads, so don't pack something we don't need
This commit was SVN r18061.
2008-04-01 19:24:46 +00:00
Ralph Castain
3a4c10efd6 Delete obsolete file, cleanup obsolete cruft in another file
This commit was SVN r18060.
2008-04-01 18:36:23 +00:00
Ralph Castain
39c2680e9a Silence warning
This commit was SVN r18057.
2008-04-01 13:42:16 +00:00
Ralph Castain
524ed5d515 Don't have singletons wireup the iof. Instead, we let the fork'd orted handle io forwarding. This prevents an issue with the event library and pty's on singletons
This commit was SVN r18056.
2008-04-01 12:40:00 +00:00
Ralph Castain
3e8846d685 Some code cleanups from Brian to clarify port selection and opening logic
This commit was SVN r18055.
2008-04-01 12:39:02 +00:00
Ralph Castain
fe88956080 Fix singleton modex - ensure singletons know that a daemon is now in the system
This commit was SVN r18047.
2008-03-31 20:36:27 +00:00
Ralph Castain
f3936ff9bc Record the daemon's state so that we don't attempt to send "die" messages to a daemon that is known to have failed to start.
This commit was SVN r18044.
2008-03-31 18:15:24 +00:00
George Bosilca
ee784b601e For consistency reasons always use opal_home_directory and
opal_tmp_directory.

This commit was SVN r18043.
2008-03-31 18:13:41 +00:00
Ralph Castain
d8eb0eeec3 Correct the debug output
This commit was SVN r18042.
2008-03-31 18:09:37 +00:00
Ralph Castain
2b399a3563 Suppress a warning message - relegate it to only show up when verbosity is set as it is okay for this condition to be true
This commit was SVN r18041.
2008-03-31 17:48:07 +00:00
Ralph Castain
f327ebce31 Get the jobid correct - doh!
This commit was SVN r18040.
2008-03-31 17:42:50 +00:00
Ralph Castain
e396b9ee9a Fix unity routed component by adding xcast of proc data to the daemons. This enables daemons to complete the revised modex procedure by forwarding their collected modex info to the rank=0 proc.
This commit was SVN r18039.
2008-03-31 17:35:29 +00:00
George Bosilca
493677426d Use the OPAL function to retrieve the HOME and TMP environment values.
This commit was SVN r18037.
2008-03-31 17:10:08 +00:00
Ralph Castain
379b8a3e2f Fix singleton operations that have no data in the modex.
Note: this also allows -any- modex operation to have zero data in it, not just singletons.

This commit was SVN r18034.
2008-03-31 13:53:23 +00:00
Ralph Castain
1889bbd119 Quiet some warnings about uninitialized variables
This commit was SVN r18032.
2008-03-31 13:52:10 +00:00
Ralph Castain
8506be755d Clean-up the mess. Repair static builds. Remove unused and empty C-decl braces. Add missing prototype for function.
This commit was SVN r18031.
2008-03-31 13:02:33 +00:00
Ralph Castain
81a83dabc6 Setup sandbox for testing new orte collectives
This commit was SVN r18026.
2008-03-31 04:21:37 +00:00
George Bosilca
594884b613 The return is an int not a pointer.
This commit was SVN r18024.
2008-03-30 19:06:25 +00:00
George Bosilca
a6d5c15249 There is no need to force opal_progress down there. It will get called few
steps upper.

This commit was SVN r18022.
2008-03-30 19:05:09 +00:00
Lenny Verkhovsky
7e45d7e134 Few updates due to RMAPS rank_file component changes
1. applied prefix rule to functions and variables of RMAPS rank_file component
2. cleaned ompi_mpi_init.c from paffinity code
3. paffinity code moved to new opal/mca/paffinity/base/paffinity_base_service.c file
4. added opal_paffinity_slot_list mca parameter

This commit was SVN r18019.
2008-03-30 11:52:11 +00:00
Lenny Verkhovsky
cb83a1287d Realy deleted old files now
This commit was SVN r18018.
2008-03-30 11:50:19 +00:00
Lenny Verkhovsky
f734ba51a4 Added files with names according to prefix rule
This commit was SVN r18017.
2008-03-30 11:42:09 +00:00
Lenny Verkhovsky
b43f4a2dc9 Deleted and added files after prefix rule changes
This commit was SVN r18016.
2008-03-30 11:41:01 +00:00
Ralph Castain
9f1001a6f8 Ensure that the procs know how many daemons will be participating in collective operations.
This commit was SVN r17992.
2008-03-27 17:31:54 +00:00
Ralph Castain
6166278e18 Improve the scalability of the modex operation and fix a bug reported by Tim P
The bug was a race condition in the barrier operation that caused the barrier in MPI_Finalize to fail on very short programs.

Scalaiblity was improved by using the daemons to aggregate modex and barrier messages before sending them to the rank=0 proc. Improvement is proportional to ppn, of course, but there really wasn't a scaling problem at low ppn anyway. This modification also paves the way for better allgather operations since now all the data for each node is sitting at the daemon level, and the daemons are now aware that a collective operation on the OOB is underway (so they -can- participate in a collective of their own to support it).

Also added better diagnostics to map out the timing associated with MPI_Init - turned on by -mca orte_timing 1.

This commit was SVN r17988.
2008-03-27 15:17:53 +00:00
Ralph Castain
8e6da2ee76 Maintain the mapping bookmark across multiple comm_spawns
This commit was SVN r17984.
2008-03-27 00:19:13 +00:00
Ralph Castain
abfb3577c1 Ensure that the bookmark of the parent job is applied to the child in a comm_spawn so we start mapping from the right place
This commit was SVN r17982.
2008-03-26 21:18:16 +00:00
Ralph Castain
7ad6db207c Cover some timing-related output
This commit was SVN r17977.
2008-03-26 12:54:50 +00:00
Rainer Keller
ce8154eb3e - Coverity issues CID 945:
Event uninit_use: Using uninitialized value "rc"
   Instead of initializing rc in the beginning, rather use return value
   of opal_hash_table_set_value_uint32.

This commit was SVN r17976.
2008-03-26 11:39:25 +00:00
Brad Benton
0b84dfd2a6 POE is not currently working or supported, so removing from the trunk.
This commit was SVN r17970.
2008-03-26 02:06:40 +00:00
Ralph Castain
60d931217f Modify the routed framework to allow greater control/flexibility over response to lost routes and initial wireup of jobs as required by several soon-to-come new modules.
Specifically, add two new APIs:

1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job.

Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections.

2. get_wireup_info: returns (in a provided buffer) info required to wireup connections for the specified job. Some routed modules do not need to return any info as they can wireup via alternative means, while some need to xchg data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job.

The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module.

This commit was SVN r17969.
2008-03-26 01:00:24 +00:00
George Bosilca
2ed6ed37bd Don't forget to cleanup once we're done.
This commit was SVN r17965.
2008-03-25 22:42:24 +00:00
George Bosilca
ac6121bd1c Remove unused variable.
This commit was SVN r17964.
2008-03-25 22:41:50 +00:00
Jeff Squyres
183fcdf51b Remove duplicate free(), fixing CID 973.
This commit was SVN r17959.
2008-03-25 20:30:56 +00:00
Ralph Castain
90107f3c14 Fix an issue with comm_spawn over who sent/recv first in the modex. The modex assumes that the first name on the list is the "root" that will serve as the allgather collector/distributor. The dpm was putting that entity last, which forced us to pre-inform the parent procs of the child proc's contact info since the parent was trying to send to the child.
Clarify the setting of send_first in the mpi bindings (trivial, i know, but helpful)

Remove the extra xcast of child contact info to the parent job.

This commit was SVN r17952.
2008-03-25 14:57:34 +00:00
Ralph Castain
cca449e379 Move an OMPI RML tag to the OMPI layer
This commit was SVN r17950.
2008-03-25 13:30:48 +00:00
Ralph Castain
4efddc7b0a Fix the allgather and allgather_list functions to avoid deadlocks at large node/proc counts. Violated the RML rules here - we received the allgather buffer and then did an xcast, which causes a send to go out, and is then subsequently received by the sender. This fix breaks that pattern by forcing the recv to complete outside of the function itself - thus, the allgather and allgather_list always complete their recvs before returning or sending.
Reogranize the grpcomm code a little to provide support for soon-to-come new grpcomm components. The revised organization puts what will be common code elements in the base to avoid duplication, while allowing components that don't need those functions to ignore them.

This commit was SVN r17941.
2008-03-24 20:50:31 +00:00
Ralph Castain
58d51f2689 Revert that! Need to complete the rest of the change so the orted knows the correct nodeid...
Sorry

This commit was SVN r17939.
2008-03-24 18:17:26 +00:00
Ralph Castain
dae4518878 Use the correct nodeid!
This commit was SVN r17938.
2008-03-24 18:15:08 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Ralph Castain
f8642e9390 Add debug to tell us when we opened a socket and to whom
This commit was SVN r17911.
2008-03-21 15:47:47 +00:00
Ralph Castain
19ffdfef42 Add some debugging output to tell us what interfaces were considered and used by OOB
This commit was SVN r17909.
2008-03-21 15:35:40 +00:00
Ralph Castain
c2fd5dd416 Clarify method used to translate application proc termination codes to exit status codes
This commit was SVN r17899.
2008-03-20 18:50:05 +00:00
Brian Barrett
2bf4784893 Set a meaningful orte_system_info.nodeid on Catamount
This commit was SVN r17898.
2008-03-20 16:55:57 +00:00
Ralph Castain
f8a10dfb93 Complete the fix of the orted vs mpirun race condition for finalizing. The darned mpirun is just too fast! Rather than try to slow it down, we set the orte_finalizing flag -prior- to telling mpirun the orted is leaving. This ensures we don't mistakenly declare the lifeline lost when mpirun leaves in a hurry.
This commit was SVN r17897.
2008-03-20 16:55:24 +00:00
Ralph Castain
6bb139e4f2 One more correction to mpirun exit codes - cleanup the application proc's exit codes in the orted so that non-zero exit codes generated by mpirun itself don't get "munged".
Modify the multi_abort function so they all return different exit codes - allows us to tell which one was being reported.

This commit was SVN r17895.
2008-03-20 13:54:11 +00:00
Ralph Castain
27a73ad9ee Fix a race condition between the orteds and HNP that can cause the orteds to output the "lost lifeline" message.
This has been a long-time problem. I tried to reduce the problem by having the orteds tell the HNP they were finalizing, and having the HNP wait until all orteds had reported or we timed out.

What was observed was that all the orteds were correctly reporting that they are leaving, but the HNP is able to exit before the orteds, thus closing the orteds lifeline socket and generating the error output. This is caused by the fact that the orteds have to whack all remaining session directories, which includes that blasted monster shared memory file! Cleaning up the SM file can take quite a while.

The HNP doesn't have that problem as there is no SM file there! So it gets out first.

What we had done in the past to resolve that problem was put a little test in the OOB that checks to see if we are finalizing. If we are, then we ignore the lifeline connection being lost. That check was still in the code - however, we had lost the line in orte_finalize that set the flag!!

This commit was SVN r17893.
2008-03-20 13:30:51 +00:00
Ralph Castain
8ee26a55ca Just turn these off for now - will revisit later
This commit was SVN r17891.
2008-03-20 13:25:35 +00:00
Ralph Castain
67a2cc8a8e Fix a bug noted by Tim P where we would report the incorrect app_context as "not found". If you gave us the command line:
mpirun -n 1 hostname : -n 1 bogus

we would erroneously report that hostname had not been found instead of bogus.

This commit was SVN r17886.
2008-03-19 21:13:13 +00:00
Ralph Castain
ec64bf3da8 Clarify the error output so we can understand if it was a daemon or process that lost its lifeline
This commit was SVN r17880.
2008-03-19 19:06:52 +00:00
Ralph Castain
2ed0e60321 Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize.
Jeff is preparing an (undoubtedly lengthy) explanation/matrix of how these codes are determined for the OMPI FAQ.

This commit was SVN r17879.
2008-03-19 19:00:51 +00:00
Galen Shipman
80ac7c87cd don't forget command file..
This commit was SVN r17878.
2008-03-19 16:24:29 +00:00
Galen Shipman
77c8532cc9 do things in a less hacky way..
This commit was SVN r17877.
2008-03-19 16:23:56 +00:00
Jeff Squyres
ac2e329353 Oops! That should not have been removed...
This commit was SVN r17865.
2008-03-18 14:42:30 +00:00
Jeff Squyres
bd92720d41 More fixes to make it compile and play nice on OS X. Still more fixes
are required; sending mail to devel shortly...

This commit was SVN r17864.
2008-03-18 14:38:52 +00:00
Ralph Castain
8f31a62600 Fix compilation errors so this will compile, remove unused variables
This commit was SVN r17862.
2008-03-18 13:01:26 +00:00
Lenny Verkhovsky
647bce6d3e Support for new RMAPS rank mapping component
This commit was SVN r17860.
2008-03-18 09:39:07 +00:00
Lenny Verkhovsky
14c32f87d5 Added new RMAPS component for rank mapping
This commit was SVN r17859.
2008-03-18 09:33:49 +00:00
Ralph Castain
8cd6142e6d Add some debugging to the grpcomm module. Setting grpcomm_base_verbose = 1 will now give you a trace through the functions as they are called. Setting it to 2 or more will give you details on what each function is doing as it works through its procedure.
This commit was SVN r17848.
2008-03-17 19:34:36 +00:00
Ralph Castain
629b95a2fe Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation.
Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway.

So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked.

Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion.

This commit was SVN r17843.
2008-03-17 17:58:59 +00:00
Josh Hursey
aaff245271 A couple verbose additions. Poll the event engine while waiting for the
named pipe.

This commit was SVN r17787.
2008-03-07 21:10:14 +00:00
Galen Shipman
0fb6cf0916 make output use verbose macro..
This commit was SVN r17778.
2008-03-07 03:06:17 +00:00
Shiqing Fan
eb1dfaf4d5 Select the windows CCP component at runtime by testing if we are on Windows cluster.
This commit was SVN r17776.
2008-03-07 01:31:53 +00:00
Ralph Castain
b110a247be Fix comm_spawn (maybe).
Comm_spawn was sticking during spawn_multiple because of a problem in the dpm - the modex there is asking processes to talk to each other in an allgather_list operation, but the procs don't have the required contact info to do so. The solution here was to ensure that all parent procs have full contact info for procs in the child job.

Admittedly, this isn't the long-term answer. We would like to have the contact info given to only the parent procs that were involved in the comm_spawn. There is a way to do that, but this will suffice to keep things working until that can be implemented and tested.

This commit was SVN r17772.
2008-03-06 21:56:00 +00:00
Ralph Castain
64d43cc44b Fix the unity routed component and direct xcast mode.
Ensure that direct xcast handles all its use-cases correctly.

Unity routed component needs to use the base recv function to properly operate.

This commit was SVN r17764.
2008-03-06 18:13:05 +00:00
Ralph Castain
ff99aa054f In order to prevent orphaned processes when using non-unity routing methods, the procs need to realize that their local daemon is a critical connection - if that connection unexpectedly closes, they need to terminate.
This commit adds definition for a "lifeline" connection. For an HNP, there is no lifeline, so the lifeline proc is NULL. For a daemon, the lifeline is the HNP - the daemon should abort if it loses that connection.

For a proc using unity routed, the lifeline is the HNP since it connects directly to the HNP.

For a proc using tree routed, the lifeline is the local daemon.

Adjusted OOB to call abort if the lifeline (as opposed to HNP) connection is lost.

This commit was SVN r17761.
2008-03-06 15:30:44 +00:00
Josh Hursey
0b4d9a12ce a bit more verbosity for the fun of it
This commit was SVN r17758.
2008-03-06 14:04:25 +00:00
Tim Prins
f61c2333c0 Remove unneeded field, and the two uses of it.
This commit was SVN r17757.
2008-03-06 12:46:36 +00:00
Tim Prins
d56f19c77d Fix logic error, and remove uneeded checks for invalid results.
This commit was SVN r17756.
2008-03-06 04:38:13 +00:00
Ralph Castain
6d94e7b232 Fix the debug output so it correctly reports launch state
This commit was SVN r17755.
2008-03-06 03:11:01 +00:00
Tim Prins
5de3e1965e Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte.
Everything should work, however I am unable to compile and test the sctp BTL.

This commit was SVN r17751.
2008-03-05 22:44:35 +00:00
Tim Prins
f9916811ae Make it so we do not mangle the options the user passes to their executeable. Fixes trac:1124
The change also:
 - cleans up and simplifies the command line processing code
 - adds an error output if more than one hostfile passed for a single app context
 - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options

This commit was SVN r17750.

The following Trac tickets were found above:
  Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124
2008-03-05 22:12:27 +00:00
Rolf vandeVaart
03fdd57d5a Fix the use of --path and -x PATH so that things work properly.
Note that --path specifies extra directories where the executable
is searched for, but does not affect the PATH settings.

This commit fixes trac:1221.

This commit was SVN r17748.

The following Trac tickets were found above:
  Ticket 1221 --> https://svn.open-mpi.org/trac/ompi/ticket/1221
2008-03-05 21:07:43 +00:00
Ralph Castain
4dbc352828 Per request, change name of new enviro var to OMPI_COMM_WORLD_LOCAL_SIZE
This commit was SVN r17736.
2008-03-05 14:45:26 +00:00
Ralph Castain
06d3145fe4 First cut at direct launch for TM. Able to launch non-ORTE procs and detect their completion for a clean shutdown.
This commit was SVN r17732.
2008-03-05 13:51:32 +00:00
George Bosilca
c71f225a28 These functions should only be compiled when OPAL_ENABLE_FT == 1.
This commit was SVN r17727.
2008-03-05 05:57:13 +00:00
Josh Hursey
3b4073e32c This commit fixes the checkpoint/restart functionality on the trunk. Included in this commit are:
* Extension to the ESS framework to support C/R
 * Fixed support for {{{snapc_base_establish_global_snapshot_dir}}}
 * Fixed FileM support
 * Misc. minor code modifications

There are some outstanding visability issues that I want to fix next.

This commit was SVN r17725.
2008-03-05 04:57:23 +00:00
Ralph Castain
edb8e32a7a Add default hostfile parameter plus --default-hostfile command line option.
Fix error message when job setup failed

This commit was SVN r17724.
2008-03-05 04:54:57 +00:00
Ralph Castain
022fc1f382 Add another MPI-related enviro variable OMPI_COMM_WORLD_NUM_LOCAL_PROCS
This commit was SVN r17723.
2008-03-05 04:53:32 +00:00
Ralph Castain
e745c16ff1 Modify the enviro variable names to be OMPI_...
Add two new ones: OMPI_COMM_WORLD_LOCAL_RANK and OMPI_UNIVERSE_SIZE

This commit was SVN r17694.
2008-03-04 20:16:05 +00:00
Shiqing Fan
ebf9c0441d Set the windows components invisible.
This commit was SVN r17687.
2008-03-04 17:37:17 +00:00
Shiqing Fan
ae41b5418b Update the RAS and PLM components for Windows.
These won't suffer another platforms but only windows. 

This commit was SVN r17686.
2008-03-04 17:13:01 +00:00
Ralph Castain
ffa232687a Fix xcast so it works in multi-node situations where the user specifies a particular mode to use (e.g., direct).
This commit was SVN r17682.
2008-03-03 20:07:02 +00:00
Ralph Castain
841d0e5208 Cleanup an attribute warning - not sure which one to set or where it should go, so I'll leave that to someone more familiar with "attributes".
Ensure some debugging is only enabled when have_debug is set.

This commit was SVN r17681.
2008-03-03 16:06:47 +00:00
Rich Graham
d37db14901 get the shared memory collectives working again with the new
version of orte.

This commit was SVN r17672.
2008-02-29 22:28:57 +00:00
Ralph Castain
6450962d59 Add some debugging to the message event object.
Cleanup some no-longer-used values

This commit was SVN r17671.
2008-02-29 20:10:31 +00:00
Ralph Castain
a585923de1 Silence some minor compiler warnings
This commit was SVN r17662.
2008-02-29 02:39:39 +00:00
Tim Prins
84b2099fe8 Remove the now-unused orte_value_array. As this is the last 'class' split between orte and ompi, remove the big comment about the split in ompi_bitmap.
Also, update some properties (source files should not be executeable...), and remove a couple unneeded inclusions of orte_proc_table.h

This commit was SVN r17655.
2008-02-28 21:39:42 +00:00
Ralph Castain
5e6928d710 Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message.
Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occur outside of a recv.

Created two new macros to assist:

ORTE_MESSAGE_EVENT: creates the zero-time event, passing info in a new orte_message_event_t object

ORTE_PROGRESSED_WAIT: while waiting for specified conditions, just calls progress so messages can be recv'd.

Also fixed the failed_launch function as we no longer block in the orted callback function. Updated the error messages to reflect revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output.

This has been tested on Mac, TM, and SLURM.

This commit was SVN r17647.
2008-02-28 19:58:32 +00:00
Ralph Castain
5dc64cea6a Correct logic - only issue recv and cancel it if we are an HNP
This commit was SVN r17641.
2008-02-28 15:27:16 +00:00
George Bosilca
9d421bea2a Replace all occurences of orte_pointer_array by opal_pointer_array. Remove the
implementation of orte_pointer_array.

This commit was SVN r17636.
2008-02-28 05:32:23 +00:00
Ralph Castain
d70e2e8c2b Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately.
Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer

This commit was SVN r17632.
2008-02-28 01:57:57 +00:00
Gleb Natapov
da3e69101d Add missing include.
This commit was SVN r17493.
2008-02-18 14:55:02 +00:00
Galen Shipman
18d1d3b408 Add ORTE ALPS support (Cray XT CNL)
This commit was SVN r17482.
2008-02-17 19:29:06 +00:00
George Bosilca
fcab6cc0bb Fix typo.
This commit was SVN r17255.
2008-01-26 21:36:04 +00:00
Rainer Keller
9d4852cdc1 - Get rid of Wshadow warnings.
This commit was SVN r17231.
2008-01-25 14:07:38 +00:00
Pak Lui
413bcca4c0 Support the qrsh or qsub "-notify" option by catching the SIGUSR1/2
signals and not letting user processes to exit on those signals.

This commit was SVN r17174.
2008-01-22 17:32:29 +00:00
Josh Hursey
158dda5458 Fix some overlapping code.
This commit was SVN r17067.
2008-01-08 15:40:21 +00:00
George Bosilca
eb71a634c6 Don't forget to initialize the msg_origin field.
This commit was SVN r17055.
2008-01-04 23:24:49 +00:00
George Bosilca
48f5a26e8c Cast to keep VC happy (quiet).
This commit was SVN r17054.
2008-01-04 23:13:32 +00:00
Adrian Knoth
42d5fe62f9 Fixed misplaced #endif
This commit was SVN r17028.
2008-01-01 11:02:38 +00:00
Jeff Squyres
213b5d5c6e Per long threads on the mailing list and much confusion discussion
about linkers, have all OPAL, ORTE, and OMPI components '''not'' link
against the OPAL, ORTE, or OMPI libraries.

See ttp://www.open-mpi.org/community/lists/users/2007/10/4220.php for
details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a
better-formatted version of the same info).

This commit was SVN r16968.
2007-12-15 13:32:02 +00:00
Josh Hursey
f7812baf5b forgot a bit of error checking in the last commit
This commit was SVN r16953.
2007-12-13 14:41:18 +00:00
Josh Hursey
a287c9cb65 This commit distinguishes the file transfer stage from the finish stage.
This commit also cleans up the checkpoint and terminate case making it more
precise than before. Previously the application could make a small amount of
progress between checkpoint completion and application termination. Now the
application will make no progress at all in this time span.

Additional minor change:
 - Start using OPAL_INT_TO_BOOL instead of if/else logic

This commit was SVN r16952.
2007-12-13 14:37:17 +00:00
Rolf vandeVaart
3ea89b69ae Remove a few tabs. Allow the output stream to be
passed to the close command for verbose output.  This
matches all the other frameworks.

This commit was SVN r16938.
2007-12-11 20:44:56 +00:00
Josh Hursey
27c9016b93 sleep -> usleep so we can be a bit more eager when waiting for events to finish.
Still working on solutions that do not involve sleeping, but this will do for
now.

This commit was SVN r16824.
2007-12-03 19:27:32 +00:00
Jeff Squyres
c20350b943 Patch submitted by Brian Barrett, inspired by this thread:
http://www.open-mpi.org/community/lists/users/2007/11/4547.php.

- Better handling of ECONNABORTED from connect on Linux.
- Reduce extraneous output from OOB when TCP connections must
  be retried.

This commit was SVN r16808.
2007-11-30 21:42:15 +00:00
Ron Brightwell
edb9d8e354 Added Catamount to the conditional compilation since Catamount
doesn't support fork() or pipe() either.  This removes a
linker warning message when building for Cray XT with Catamount.

This commit was SVN r16772.
2007-11-21 21:37:58 +00:00
George Bosilca
d67c0eefb4 Remove a compilation warning about using uninitialized variables.
This commit was SVN r16589.
2007-10-26 20:15:28 +00:00
George Bosilca
b1b5cb6453 Looks like SO_REUSEPORT it's not defined on some platforms. Switch
to the conventional SO_REUSEADDR instead.

This commit was SVN r16588.
2007-10-26 19:56:21 +00:00
George Bosilca
337f78a4a8 Restrict the port range for the OOB and the BTL. Each protocols (v4 and v6)
has his own range which is defined by a min value and a range. By default
there is no limitation on the port range, which is exactly the same
behavior as before.

This commit was SVN r16584.
2007-10-26 16:36:51 +00:00
Jeff Squyres
9e4387d021 * Use new BEGIN_C_DECLS / END_C_DECLS convention
* Add newline at end of file to avoid compiler warning

This commit was SVN r16579.
2007-10-26 13:40:38 +00:00
Shiqing Fan
3c38c9c020 - Add extern "C" to resolve linkage specification problems.
This commit was SVN r16577.
2007-10-26 09:54:42 +00:00
Ralph Castain
a791ce2299 The processor affinity must be set on a per-process basis, not per-app-context.
This commit was SVN r16559.
2007-10-23 20:46:16 +00:00
George Bosilca
7a63f9b730 I somehow mess up my last commit. Sorry.
This commit was SVN r16543.
2007-10-22 15:08:17 +00:00
George Bosilca
b93f72bdfd Remove 2 warnings about uninitialized i and quit_flags.
This commit was SVN r16542.
2007-10-22 15:01:15 +00:00
Jeff Squyres
5637c7a5a0 In addition to r16513, this commit fixes trac:1170.
If we cannot resolve the route to the peer that we're trying to send
to, don't queue up the message in the TCP OOB -- instead, return it to
the upper layer (e.g., the RML) and let it decide what to do.

In the case of the routed RML, the tree component will queue it up for
later transmission.  Hence, we don't want the message queued up both
here in the TCP OOB and the tree routed.  Also see some more
discussion / explanation in #1171.

This commit was SVN r16540.

The following SVN revision numbers were found above:
  r16513 --> open-mpi/ompi@7ae9589d70

The following Trac tickets were found above:
  Ticket 1170 --> https://svn.open-mpi.org/trac/ompi/ticket/1170
2007-10-22 13:46:57 +00:00
Jeff Squyres
7ae9589d70 The header is at the address of the buffer pointed to by the iov, not
the address of the iov.

This commit was SVN r16513.
2007-10-19 12:40:14 +00:00
Jeff Squyres
abf1b728b9 Minor code maintenance fix -- put the THREAD_UNLOCK outside the if
statement so that you only have to have it once.

This commit was SVN r16512.
2007-10-19 12:36:26 +00:00
Ralph Castain
73eeb7f0d2 Fix a bug in the way we handled buffer releases and the conditioned wait that held us in the xcast until completed.
This commit was SVN r16504.
2007-10-19 01:17:01 +00:00
Josh Hursey
0bf61a1b84 Move in some accumulated small features and minor bug fixes for C/R support.
{{{
svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs .
}}}

This commit was SVN r16478.
2007-10-17 13:47:36 +00:00
Ralph Castain
ec5fe78876 When in the unity message routing mode, we have to update the RML contact info in the parent procs so that they know how to talk to the children. Ideally, this would be done in the MPI layer since that layer knows which procs are actively involved in the comm_spawn. However, it isn't being done there, which causes comm_spawn to fail, so do it explicitly in the RTE.
Note that this means ALL procs in the parent job are updated, even though they may not be participating in the comm_spawn. This doesn't really hurt anything - just unnecessary.

Comm_spawn still has a problem when a child process shares a node with a parent, so this doesn't fix everything. It only fixes the bug of ensuring all procs know how to talk to each other.

This commit was SVN r16460.
2007-10-16 16:09:41 +00:00
Ralph Castain
713b6e13a5 Improve diagnostic output messages when errors are hit
This commit was SVN r16457.
2007-10-16 14:51:52 +00:00
Josh Hursey
ea0652d20f If we are going to pretend to do filem, then we should always pretend.
No one should be using this feature except for me. :)

This commit was SVN r16454.
2007-10-15 20:04:35 +00:00
Ralph Castain
b6196e8a39 When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive.
This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed.

In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean.

I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it.

This commit was SVN r16451.
2007-10-15 18:00:30 +00:00
Jeff Squyres
423f23eb6a Fixes trac:1160. There is still some other problem in the OOB, but we
wanted to commit this to get wider testing.

This commit was SVN r16445.

The following Trac tickets were found above:
  Ticket 1160 --> https://svn.open-mpi.org/trac/ompi/ticket/1160
2007-10-15 15:41:36 +00:00
Josh Hursey
f16a42947a Change some default MCA parameters:
- Global snapshot directory = $HOME
 - FileM 'rsh' = 'ssh'
 - FileM 'rcp' = 'scp'

This commit was SVN r16444.
2007-10-15 15:21:17 +00:00
Josh Hursey
520c27ac94 If the HNP is acting as the orted for local launch then the gpr_replica
variable is not defined. Make sure to set it to something reasonable 
so that file preloading still works (instead of seg faulting :)

Thanks to Hiep Bui Hoang for reporting this bug.

This commit was SVN r16433.
2007-10-11 19:47:04 +00:00
Josh Hursey
e483c36cea Remove a big of debug in filem/rsh that should have never been committed.
A guesture towards overlapping file removal with metadata update.

This commit was SVN r16432.
2007-10-11 19:37:33 +00:00
Ralph Castain
3dbd4d9be7 Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by:
1. taking advantage of the fact that we no longer create the launch  message via a GPR trigger. In earlier times, we had the GPR create the launch message based on a subscription. In that mode of operation, we could not guarantee the order in which the data was stored in the message - hence, we had no choice but to parse the message in a loop that checked each value against a list of possible "keys" until the corresponding value was found.

Now, however, we construct the message "by hand", so we know precisely what data is in each location in the message. Thus, we no longer need to send the character string "keys" for each data value any more. This represents a rather large savings in the message size - to give you an example, we typically would use a 30-char "key" for a 2-byte data value. As you can see, the overhead can become very large.

2. sending node-specific data only once. Again, because we used to construct the message via subscriptions that were done on a per-proc basis, the data for each node (e.g., the daemon's name, whether or not the node was oversubscribed) would be included in the data for each proc. Thus, the node-specific data was repeated for every proc.

Now that we construct the message "by hand", there is no reason to do this any more. Instead, we can insert the data for a specific node only once, and then provide the per-proc data for that node. We therefore not only save all that extra data in the message, but we also only need to parse the per-node data once.

The savings become significant at scale. Here is a comparison between the revised trunk and the trunk prior to this commit (all data was taken on odin, using openib, 64 nodes, unity message routing, tested with application consisting of mpi_init/mpi_barrier/mpi_finalize, all execution times given in seconds, all launch message sizes in bytes):

Per-node scaling, taken at 1ppn:

#nodes           original trunk                         revised trunk
             time               size                time               size
      1      0.10                819                0.09                564
      2      0.14               1070                0.14                677
      3      0.15               1321                0.14                790
      4      0.15               1572                0.15                903
      8      0.17               2576                0.20               1355
     16      0.25               4584                0.21               2259
     32      0.28               8600                0.27               4067
     64      0.50              16632                0.39               7683

Per-proc scaling, taken at 64 nodes

   ppn             original trunk                         revised trunk
              time               size                time               size
      1       0.50              16669                0.40               7720
      2       0.55              32733                0.54              11048
      3       0.87              48797                0.81              14376
      4       1.0               64861                0.85              17704


Condensing those numbers, it appears we gained:

per-node message size: 251 bytes/node -> 113 bytes/node

per-proc message size: 251 bytes/proc  -> 52 bytes/proc

per-job message size:  568 bytes/job -> 399 bytes/job 
(job-specific data such as jobid, override oversubscribe flag, total #procs in job, total slots allocated)

The fact that the two pre-commit trunk numbers are the same confirms the fact that each proc was containing the node data as well. It isn't quite the 10x message reduction I had hoped to get, but it is significant and gives much better scaling.

Note that the timing info was, as usual, pretty chaotic - the numbers cited here were typical across several runs taken after the initial one to avoid NFS file positioning influences.

Also note that this commit removes the orte_process_info.vpid_start field and the handful of places that passed that useless value. By definition, all jobs start at vpid=0, so all we were doing is passing "0" around. In fact, many places simply hardwired it to "0" anyway rather than deal with it.

This commit was SVN r16428.
2007-10-11 15:57:26 +00:00
Rolf vandeVaart
25c95c9ee9 Fix build on solaris. Need to include sys/wait.h.
This commit was SVN r16426.
2007-10-11 15:04:30 +00:00
Jeff Squyres
e2df42eea3 Move the <sys/wait.h> below "orte_config.h"
This commit was SVN r16424.
2007-10-11 11:31:09 +00:00
George Bosilca
7cc9f588a8 Decorate the base functions with ORTE_DECLSPEC.
This commit was SVN r16423.
2007-10-11 00:02:49 +00:00
Ralph Castain
53af94fd87 Modify the configure system so that gridengine support is only built in specific conditions:
1. --with-sge, always builds
2. --without-sge, never builds
3. if neither is specified, build if and only if either SGE_ROOT is set or "qrsh" is found in the path

This commit was SVN r16422.
2007-10-10 21:39:16 +00:00
Josh Hursey
6e5341c659 Forgot to move a header in the code movement.
This commit was SVN r16420.
2007-10-10 15:39:40 +00:00
Ralph Castain
82a8e2d10d Reorganize the odls framework to place common functionality in the base, thus making maintenance easier. We still need this to be a framework as some environments (e.g., bproc) require significantly different functionality. However, there is quite a bit of commonality across the components, so this ensures that fixes in one get propagated across the others.
This patch also fixes a minor bug discovered along the way: we had "lost" the passing of the oversubscribed condition flag from the mapper to the orteds. Thus, we were not setting sched_yield correctly when in oversubscribed conditions (except when a hostfile was specified - different logic there because we treat the number of slots allocated on the node as "uncertain")

I did not modify the process component in this patch - I will send a proposed patch to the maintainers of that component so they can review it first.

This commit was SVN r16418.
2007-10-10 15:02:10 +00:00
Josh Hursey
7f833a9cb2 silence a warning that is triggered on restart
This commit was SVN r16417.
2007-10-10 14:25:49 +00:00
Ethan Mallove
d0b61db65c Add in a missing #include for Solaris builds.
This commit was SVN r16416.
2007-10-10 12:49:15 +00:00
Josh Hursey
aa8391f888 Local and global coordinators should be the only ones involved in the
movement of checkpoint files. This reduces the overhead on the applicaiton.

This commit was SVN r16412.
2007-10-09 19:52:47 +00:00
Galen Shipman
fda1306807 revert my stupidity..
This commit was SVN r16410.
2007-10-09 19:01:20 +00:00
Josh Hursey
8fe2ef5647 a missing include
This commit was SVN r16402.
2007-10-09 14:32:36 +00:00
Josh Hursey
7437f37e96 This commit contains the following:
* Fix some missing includes in a few places.
 * Add the cr_request() functionality to the BLCR CRS component.
   We are now dependent upon the 0.6.* series of BLCR.
 * Made the CR notification mechanism a registered function.
   This way we can have an OPAL-only version and it can be replaced at
   runtime with the ORTE version.
 * Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only
   CR functionality when the user wants it. Default: Disabled.
 * Fix the placement of a checkpoint request check in MPI_Init
 * Pull the OPAL notification mechanism into the SnapC framework.
   * We no longer fork/exec the 'opal-checkpoint' command for local
   checkpointing, the Local coordinator in the orted does this directly.
   * The Local and Application coordinator talk together bypassing the OPAL
   notifiation mechanism.
   * Optimized the Local <-> App Coordinator communication.
   * Improved the structure used to track vpid_snapshots in the local coord.
 * Fix a race condition in which an application under heavy communication load
   may produce an inconsistent global checkpoint.

This commit was SVN r16389.
2007-10-08 20:53:02 +00:00
Galen Shipman
1c1b9d5480 make cray happy
This commit was SVN r16377.
2007-10-08 14:31:59 +00:00
Ralph Castain
54b2cf747e These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component.

This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:

As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.

In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.

The incoming changes revamp these procedures in three ways:

1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.

The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.

Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.


2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.

The size of this data has been reduced in three ways:

(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.

To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.

(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.

(c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.

While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.


3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.

It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.

Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.


There are a few minor additional changes in the commit that I'll just note in passing:

* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.

* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.

* cleanup of some stale header files

This commit was SVN r16364.
2007-10-05 19:48:23 +00:00