1
1
Граф коммитов

1529 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
6c800d452d Bring orte tests up to date with revised rml system.
Make first cut at fixing non-direct xcast modes

This commit was SVN r15553.
2007-07-23 13:05:34 +00:00
Jeff Squyres
78d214fec8 Oops -- didn't mean to commit the test program...
This commit was SVN r15538.
2007-07-20 20:15:51 +00:00
Jeff Squyres
2baa866026 Compiles to the new API, but doesn't quite work yet...
This commit was SVN r15537.
2007-07-20 19:49:27 +00:00
George Bosilca
1751b289ed Avoid a compiler warning about uninitialized variables.
This commit was SVN r15534.
2007-07-20 04:07:19 +00:00
George Bosilca
d1424689ce Always release the buffer (this imply the buffer has to be created
outside the special case).

This commit was SVN r15533.
2007-07-20 04:06:39 +00:00
George Bosilca
172a4fa543 Includ a missing header file.
This commit was SVN r15532.
2007-07-20 03:24:52 +00:00
Brian Barrett
9a476c7fdd The print function doesn't change the process name, so take a const...
This commit was SVN r15531.
2007-07-20 02:54:43 +00:00
Brian Barrett
5b9fa7e998 reapply r15517 and r15520, which were removed in r15527 so that I could get
the RML/OOB merge in slightly easier

This commit was SVN r15530.

The following SVN revision numbers were found above:
  r15517 --> open-mpi/ompi@41977fcc95
  r15520 --> open-mpi/ompi@9cbc9df1b8
  r15527 --> open-mpi/ompi@2d17dd9516
2007-07-20 02:34:29 +00:00
Tim Prins
9602230ad6 remove defunct file
This commit was SVN r15529.
2007-07-20 02:10:38 +00:00
Brian Barrett
39a6057fc6 A number of improvements / changes to the RML/OOB layers:
* General TCP cleanup for OPAL / ORTE
  * Simplifying the OOB by moving much of the logic into the RML
  * Allowing the OOB RML component to do routing of messages
  * Adding a component framework for handling routing tables
  * Moving the xcast functionality from the OOB base to its own framework

Includes merge from tmp/bwb-oob-rml-merge revisions:

    r15506, r15507, r15508, r15510, r15511, r15512, r15513

This commit was SVN r15528.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r15506
  r15507
  r15508
  r15510
  r15511
  r15512
  r15513
2007-07-20 01:34:02 +00:00
Brian Barrett
2d17dd9516 temporarily back our r15517 and 15520 so that I can get the RML / OOB changes
to cleanly apply

This commit was SVN r15527.

The following SVN revision numbers were found above:
  r15517 --> open-mpi/ompi@41977fcc95
2007-07-20 01:10:34 +00:00
George Bosilca
9cbc9df1b8 Export the orte_ns_base_print_name_args function.
This commit was SVN r15520.
2007-07-19 21:45:27 +00:00
Ralph Castain
41977fcc95 Remove the cellid field from the orte_process_name_t structure. This only affects a handful of files in itself, but...
Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point.

Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings.

This commit was SVN r15517.
2007-07-19 20:56:46 +00:00
Ralph Castain
2110064a9a Ensure that the LD_LIBRARY_PATH and PATH get properly set for procs locally spawned by mpirun.
This commit was SVN r15516.
2007-07-19 19:00:06 +00:00
Ralph Castain
b6c60dfc07 Bring over the extra debugging output that helped a user find his NSF mount problems. This just adds ERROR_LOG messages when the session directory creation process fails so we can see where it is happening - really helps users (and us as well) figure out what specifically went wrong.
This commit was SVN r15491.
2007-07-18 19:50:54 +00:00
Tim Prins
e41f86dfe6 add a small amount of debugging output
This commit was SVN r15483.
2007-07-18 15:20:55 +00:00
Sven Stork
c43335d671 - release the lock before we forward a message. In the threaded
build it's possible that we have to process an ack before this function
  returns. If we don't release the lock here we cause a deadlock later
  in ack processing function.

This commit was SVN r15441.
2007-07-16 14:43:24 +00:00
Ralph Castain
5121dfe7e7 With the changes to the failed-to-start logic, we need to revise the odls so it doesn't overwrite the exit status on procs that are not found. Otherwise, we lose the appropriate error message to the user.
This commit was SVN r15440.
2007-07-16 13:50:26 +00:00
Ralph Castain
cd9213a9f0 Son of a gun - how did that fix not get into r15390?
Fix the blasted iof null component so it only is selected if/when directed.

This commit was SVN r15437.

The following SVN revision numbers were found above:
  r15390 --> open-mpi/ompi@bd65f8ba88
2007-07-16 12:38:11 +00:00
Ralph Castain
d109e9a6f4 Roll in the Voltaire core/socket/etc process mapping implementation. Only change I made was to cleanup some of the diagnostic output in the odls_default component so it uses the -mca odls_base_verbose parameter.
You will not see any impact from this change unless you use the syntax described in ticket #1023. I've tried as many of the RAS components as possible and saw no problem - there may be issues with other RAS components that would not compile on any of my systems. Anything that appears should be trivial to fix.

This commit was SVN r15427.
2007-07-14 15:14:07 +00:00
Josh Hursey
eeba2cb871 Add a comment to clarify the relationship between
mca_base_cmd_line_process_args() and opal_init_util() so
we do not forget their ordering needs, and subtle relationship.

This commit was SVN r15412.
2007-07-13 19:08:05 +00:00
Sven Stork
3bb6ca6a3d - export orted symbols
This commit was SVN r15411.
2007-07-13 18:49:43 +00:00
Josh Hursey
20bbe9ee2b Bring back the AMCA fix for the orted from r15219.
This fixes C/R support for the trunk which regressed in r15390 due to the RTE
changes.

This commit was SVN r15409.

The following SVN revision numbers were found above:
  r15219 --> open-mpi/ompi@f88aa6c273
  r15390 --> open-mpi/ompi@bd65f8ba88
2007-07-13 18:15:36 +00:00
Tim Prins
d2f0806420 Fix a problem introduced by r15390 which was causing strange failures in numerous areas.
We no longer store whether we are a singleton in a MCA parameter, we now use a global constant. So all references to the MCA parameter must be removed.

This commit was SVN r15408.

The following SVN revision numbers were found above:
  r15390 --> open-mpi/ompi@bd65f8ba88
2007-07-13 17:52:16 +00:00
Ralph Castain
2bded34a1d Fix a problem observed by Brian where processes launched local to mpirun lost their environment except for MCA params.
The problem stemmed from no longer launching a local orted on the same node as mpirun. The orted would save and reuse the base environment. Mpirun didn't do that, and the odls was using the orted's globally saved environment (which wasn't being set).

This fix establishes a globally accessible base launch environment that both the orted and mpirun can utilize. Since we now use that, we don't need to pass it to the odls_launch_proc function, so remove that param from the API (and modify all components to handle the change).

This commit was SVN r15405.
2007-07-13 15:47:57 +00:00
Jeff Squyres
24a28494a6 Fix what looks like a cut-n-paste error.
This will not cause everyone to run autogen because this component is
.ompi_ignore'd for everyone except jsquyres and rhc.

This commit was SVN r15401.
2007-07-13 14:47:03 +00:00
Ralph Castain
ec79f264a6 The iof is still having problems with non-orte/ompi programs, apparently. Turn off the iof_flush when a program terminates as this will (in the case of non-orte/ompi programs) cause mpirun to hang.
This commit was SVN r15399.
2007-07-13 13:05:46 +00:00
Jeff Squyres
b20248709a Next round of LSF commits. Getting farther, but it still doesn't
fully work yet (everything is still .ompi_ignore'ed for everyone).

This commit was SVN r15398.
2007-07-13 11:57:17 +00:00
George Bosilca
52eebd706f Update the xgrid PLS to fit the current interface of the PLS.
This commit was SVN r15396.
2007-07-13 06:18:16 +00:00
Ralph Castain
bd65f8ba88 Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments.
Short description: major changes include -

1. singletons now fork/exec a local daemon to manage their operations.

2. the orte daemon code now resides in libopen-rte

3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided.

I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine.

This commit was SVN r15390.
2007-07-12 19:53:18 +00:00
Jeff Squyres
4439734a8a It compiles (too)! Getting a little further...
This commit was SVN r15383.
2007-07-12 14:46:41 +00:00
Jeff Squyres
aa2c64d66d It compiles! That's a start... :-)
This commit was SVN r15382.
2007-07-12 14:41:09 +00:00
Jeff Squyres
e51bb19fab Fix some include files
This commit was SVN r15381.
2007-07-12 14:22:47 +00:00
Ralph Castain
39013e2a18 Clean up a couple of minor typos. Bring the new bproc-related RAS components online.
This commit was SVN r15328.
2007-07-10 14:11:26 +00:00
Jeff Squyres
64083570f5 Add support for DDT parallel debugger, which required several things:
* Making some symbols and types be global (vs. static) in orterun
 * Adding a "ddt" entry in the MCA parameter orte_base_user_debugger
   default value
 * Add support for @executable@, @executable_argv@, and @single_app@
   tokens in the orte_base_user_debugger MCA parameter.
 * Added various error checks and corresponding help messages after
   finding a debugger in the PATH

Fixes trac:1081

This commit was SVN r15323.

The following Trac tickets were found above:
  Ticket 1081 --> https://svn.open-mpi.org/trac/ompi/ticket/1081
2007-07-10 12:53:48 +00:00
Ralph Castain
a1bf04f39e First cut at revamping bproc support to separate it out from LANL's configuration.
First cut at adding support for LSF

Lots of ompi_ignores so only Jeff and I will see this stuff

This commit was SVN r15321.
2007-07-10 12:43:05 +00:00
Brian Barrett
1d02b9e7b5 Fix a bunch of issues exposed by Ken Cain in getting Open MPI to work with
VxWorks.  Still some issues remaining, I'm sure.

Refs trac:1010

This commit was SVN r15320.

The following Trac tickets were found above:
  Ticket 1010 --> https://svn.open-mpi.org/trac/ompi/ticket/1010
2007-07-10 03:46:57 +00:00
Jeff Squyres
892bc38ad0 Protect against a bad free: full_line points to the full buffer. But
line may point to a few characters beyond the beginning of the buffer
(if the buffer had some extra white space padding at the beginning).
So if we want to free the buffer, free full_line, not line.

This commit was SVN r15315.
2007-07-09 19:56:16 +00:00
Brian Barrett
b27b9b5380 * Clean up the ompi_mca macro's support for different configuration
types and add STOP_AT_FIRST_PRIORITY type for framework configuration,
    which allows all components at the highest priority that succeeds to
    succeed
  * Use STOP_AT_FIRST_PRIORITY type for gpr framework, so that the null
    component isn't built when the replica and proxy components are
    available.

This commit was SVN r15286.
2007-07-04 22:00:15 +00:00
Ralph Castain
684aa1bc9f Since universe size now is an orte thing, we may as well give it some direct support. Create rmgr set/get functions so it becomes more obvious where this value is being defined and how to retrieve it. Modify the bproc pls to pass it to the app procs when launched. Modify one of the test programs to verify it has been correctly set.
This commit was SVN r15266.
2007-07-02 16:45:40 +00:00
Jeff Squyres
c796d84d61 * Integrate man pages contributed by Dirk Eddelbuettel
* Make orted.1 man page be non-descriptive because it's really an
   internal command.
 * Re-work the opal_wrapper man page logic a bit so that we can have a
   real opal_wrapper.1 installed that says "don't look here -- look at
   mpicc (etc.)"

This commit was SVN r15264.
2007-07-02 15:27:39 +00:00
Tim Prins
c46ed1d5d4 Make it so the universe size is passed through the ODLS instead of through a gpr trigger during MPI init. This matches what is currently being done with the app number.
The default odls has been updated and works fine. The process odls has been updated, but I could not verify its operation. The bproc ODLS has not been updated yet. Ralph will look at it soon.

This commit was SVN r15257.
2007-07-02 01:33:35 +00:00
Sven Stork
086624a4fe - guess that we should retain the ep instead of releasing it
This commit was SVN r15244.
2007-06-29 11:18:37 +00:00
Brian Barrett
f8fb1e9720 Fix some compile failures on Solaris 9 because it doesn't have V6ONLY.
This commit was SVN r15237.
2007-06-28 18:52:15 +00:00
Ralph Castain
e299f7039f Allow bproc operations on the head node if it was allocated for our use
This commit was SVN r15232.
2007-06-28 14:53:17 +00:00
Josh Hursey
f88aa6c273 This commit cleans up the AMCA parameter implementation a bit.
* Remove the 'opal_mca_base_param_use_amca_sets' global variable
* Harness the fact that you can (read should) call the cmd_line functions
  before initializing opal_init_util(). This pushes the MCA/GMCA/AMCA
  command line options into the environment before OPAL inits and starts
  to use these values. By putting the cmd_line parse before opal_init_util
  in orterun and orted we only parse the *MCA parameter files once, and 
  correctly (alleviating the need to 'recache' the files on init.)
* Small bits of cleanup.

This commit was SVN r15219.
2007-06-27 01:03:31 +00:00
Rainer Keller
15c03e8acc - Apply patch 31_manpages_lintian.dpatch
Thanks to Dirk Eddelbuettel <edd@debian.org>

This commit was SVN r15215.
2007-06-26 21:13:10 +00:00
Sven Stork
0edcf1d47e - export required symbol
This commit was SVN r15190.
2007-06-25 14:27:04 +00:00
Josh Hursey
84f102c343 Fix/Cleanup the Checkpoint Error propagation through the Snapc Full component.
This commit was SVN r15175.
2007-06-22 16:14:25 +00:00
Jeff Squyres
bd56dc7e5d Fixes trac:1060
Per suggestion, if we don't find a valid shell via getpwuid(), also
check the $SHELL environment variable.  Also perform a few minor
cleanups along the way.

This commit was SVN r15156.

The following Trac tickets were found above:
  Ticket 1060 --> https://svn.open-mpi.org/trac/ompi/ticket/1060
2007-06-21 11:40:42 +00:00
Josh Hursey
78df098aee If we can not checkpoint, then make sure we return an error
This commit was SVN r15151.
2007-06-20 21:05:19 +00:00
Josh Hursey
dd021e7121 Remove some leftover debugging that must have been accidentally left in
r15142.

This commit was SVN r15145.

The following SVN revision numbers were found above:
  r15142 --> open-mpi/ompi@a3998a1676
2007-06-20 14:06:13 +00:00
Josh Hursey
db59235af5 Fix an AMCA parameter regression introduced (as a side effect of) in r14449
(and, due to lack of in code documentation, in r14661).

The {{{opal_mca_base_param_use_amca_sets}}} flag tells the orted that it should
not look at the parameter files just yet since it may have an AMCA parameter
file to look at first. So we need to set this to {{{false}}} before initializing
the MCA paras, then quickly turn around and re-init them when we have the full
information.

This commit fixes trac:1058

This commit was SVN r15144.

The following SVN revision numbers were found above:
  r14449 --> open-mpi/ompi@0ba47105ed
  r14661 --> open-mpi/ompi@df86202202

The following Trac tickets were found above:
  Ticket 1058 --> https://svn.open-mpi.org/trac/ompi/ticket/1058
2007-06-20 14:00:40 +00:00
George Bosilca
a3998a1676 Allow the symbols required by TotalView to be exported even when
the visibility feature is on.

This commit was SVN r15142.
2007-06-19 22:35:23 +00:00
Josh Hursey
edb2cbd150 In r15007 the --bootproxy orted argument was removed to support daemon reuse.
The SnapC Full local Coordinator used this argument to attach to the job the
daemon would be launching. So once this option was removed C/R support broke.

This commit has the local coordinator attach to the job just before it is
launched by the ODLS module. This is a much cleaner solution, and will
eventually allow the SnapC modules to attach to multiple jobs launched 
on a single machine.

This commit fixes the C/R regression introduced in r15007.

This commit was SVN r15121.

The following SVN revision numbers were found above:
  r15007 --> open-mpi/ompi@85df3bd92f
2007-06-18 15:39:04 +00:00
Josh Hursey
5719182a4e Fix a break introduced in r14706 when RANK_KEY changed types.
This commit was SVN r15120.

The following SVN revision numbers were found above:
  r14706 --> open-mpi/ompi@d9acc93efa
2007-06-18 14:57:53 +00:00
Shiqing Fan
2a77d46117 Fix a small bug.
This commit was SVN r15119.
2007-06-18 12:50:29 +00:00
George Bosilca
55cf6fc866 Be a little bit more verbose: tell which file we have trouble with...
This commit was SVN r15115.
2007-06-17 04:59:15 +00:00
George Bosilca
99e701062a The Windows job scheduler PLS. Initial commit as I have to move to
another Windows cluster. Right now it's not in a usable state.

This commit was SVN r15113.
2007-06-17 04:54:07 +00:00
Ralph Castain
e653da1d11 Where or where did that patch go??? Ah - there it went! ;-)
Fix singleton operations - allow multiple xcasts to be queued.

This commit was SVN r15097.
2007-06-15 13:45:29 +00:00
George Bosilca
35e824377e There seems to be a subtle race condition when we fail to spawn a
child. Marking the child as failed solve the issue.

This commit was SVN r15087.
2007-06-14 22:36:47 +00:00
George Bosilca
a4d99ddef6 More synchronizations for the Windows version. The problem came from
the multiple threads accessing the OOB/registry asynchronously via the
callbacks. The quickest solution (but definitively not the cleanest) is
to serialize these callbacks in such a way that at any given time
only one thread can execute a callbacks.

This commit was SVN r15086.
2007-06-14 22:35:38 +00:00
George Bosilca
fb9ff5cc75 Don't remove the tcp events from the list, they will remove themselves
in the destructor.

This commit was SVN r15085.
2007-06-14 22:33:09 +00:00
Josh Hursey
6cdfefad87 Fix portals BTL and cnos RML.
Both were failing due to interface changes that were never 
applied to them properly.

This commit was SVN r15082.
2007-06-14 18:49:41 +00:00
Ralph Castain
fde15ac97d Bring the TM launcher online
This commit was SVN r15076.
2007-06-14 12:33:34 +00:00
George Bosilca
95a607b945 A more Windows friendly version. As the socket event will be generated
through the win dll using multiple threads, we have to insure that
the oob callbacks happens only in a synchronous way or really bad
things happens with the current design (blocking messages from a receive
callback).

This commit was SVN r15069.
2007-06-14 04:38:06 +00:00
George Bosilca
de324502bc Update the Windows wait functions. The most important change is for
the event registration, which in the case of a process dead detection
should be marked as fire once and taking long time.

This commit was SVN r15068.
2007-06-14 04:35:46 +00:00
George Bosilca
8dfa06a617 Only output when the user request it.
This commit was SVN r15067.
2007-06-14 04:33:18 +00:00
George Bosilca
13a693faa0 Update the Windows process ODLS.
This commit was SVN r15066.
2007-06-14 04:32:19 +00:00
Pak Lui
de0f1eef89 No major changes here. Just updates to remove unused code and comments.
This commit was SVN r15051.
2007-06-13 17:23:03 +00:00
Pak Lui
03a93a38c5 Added an option for daemonizing orted. The existing behavior to --no-daemonize
for gridengine is not changed.

This commit was SVN r15050.
2007-06-13 17:11:37 +00:00
Ralph Castain
5adef03179 Clean up a diagnostic so it only outputs when requested
This commit was SVN r15048.
2007-06-13 15:53:10 +00:00
George Bosilca
18c2bb0ed6 Don't forget to set the name argument before spawning the daemon.
This commit was SVN r15047.
2007-06-13 15:45:34 +00:00
Pak Lui
8e7daea11f bring inline more changes with r15007.
This commit was SVN r15044.

The following SVN revision numbers were found above:
  r15007 --> open-mpi/ompi@85df3bd92f
2007-06-13 15:30:18 +00:00
Ralph Castain
425fed95ff Bring the SGE component online
This commit was SVN r15043.
2007-06-13 15:02:47 +00:00
Rainer Keller
7e0b400f3f - Small Fix.
This commit was SVN r15037.
2007-06-13 10:43:03 +00:00
George Bosilca
278ec7fd4f I wonder how this one compiled before ... or how do I manage to
miss it ...

This commit was SVN r15032.
2007-06-12 23:24:39 +00:00
George Bosilca
9d342ccb61 Shorter warning message.
This commit was SVN r15031.
2007-06-12 23:22:09 +00:00
George Bosilca
715f6012cf The DSS pack function can use the const attribute for the src field
as it is never modified by the pack functions directly. Enforce it
all over the code base.

This commit was SVN r15026.
2007-06-12 22:47:14 +00:00
George Bosilca
649ab84654 Don't do SIGPIPE handling on Windows.
This commit was SVN r15025.
2007-06-12 22:44:39 +00:00
George Bosilca
9e89abbd57 HAVE_SYS_TYPES_H require an ifdef.
This commit was SVN r15024.
2007-06-12 22:43:18 +00:00
George Bosilca
432185d617 Forget to remove the MCA parameter corresponding to the 2 unused
fields in the RSH PLS component.

This commit was SVN r15023.
2007-06-12 22:41:38 +00:00
George Bosilca
49e7bf3069 Be a little bit more clear when we fail to identify the shell.
This commit was SVN r15022.
2007-06-12 22:40:44 +00:00
George Bosilca
5b7796dfcd Remove 2 unused fields.
This commit was SVN r15021.
2007-06-12 22:39:57 +00:00
George Bosilca
16c38cabe1 Update the Windows ODLS component.
This commit was SVN r15020.
2007-06-12 22:37:04 +00:00
George Bosilca
bf6f30a42c Make the Windows PLS component match the current requirements for
a PLS module.

This commit was SVN r15019.
2007-06-12 22:34:56 +00:00
Ralph Castain
af64009368 Bring the CNOS component of the PLS back online
This commit was SVN r15018.
2007-06-12 22:17:05 +00:00
Jeff Squyres
54064f6fa1 Fix a warning that Tim P. found this morning.
The warning was indicative of overly-complex code anyway.  So I
removed the "first" bool and simply use a sentinel value in seq_min to
indicate that nothing has changed.  Note that this is "correct enough"
for the moment -- more fixes will come in this area with tickets #1049
and/or #1051.

This commit was SVN r15013.
2007-06-12 17:30:54 +00:00
Brian Barrett
84d1512fba Add the potential for doing some basic error checking on mutexes during
single threaded builds.  In its default configuration, all this does
is ensure that there's at least a good chance of threads building
based on non-threaded development (since the variable names will be
checked).  There is also code to make sure that a "mutex" is never
"double locked" when using the conditional macro mutex operations.
This is off by default because there are a number of places in both
ORTE and OMPI where this alarm spews mega bytes of errors on a
simple test.  So we have some work to do on our path towards
thread support.

Also removed the macro versions of the non-conditional thread locks,
as the only places they were used, the author of the code intended
to use the conditional thread locks.  So now you have upper-case
macros for conditional thread locks and lowercase functions for
non-conditional locks.  Simple, right? :).

This commit was SVN r15011.
2007-06-12 16:25:26 +00:00
Ralph Castain
4e8081ed1e Cleanup a now unnecessary variable
This commit was SVN r15010.
2007-06-12 14:23:33 +00:00
Tim Prins
1467558157 Cleanup a couple warnings.
Update svn:ignore

This commit was SVN r15009.
2007-06-12 14:11:06 +00:00
Ralph Castain
85df3bd92f Bring in the generalized xcast communication system along with the correspondingly revised orted launch. I will send a message out to developers explaining the basic changes. In brief:
1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names.

2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (b) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used.

3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying.

Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed.

This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems.

This commit was SVN r15007.
2007-06-12 13:28:54 +00:00
Brian Barrett
27ad954265 Fix a couple of problems with the way we were using orte_process_name_t
structures in the system.  Instead of using memcmp, use the ns function.
This won't cause a problem as long as all three elements of the name are
ints, but if they have different sizes, alignment and padding rules
can cause memcmp() to compare padding space, which rarely holds a sane
value.

This commit was SVN r14998.
2007-06-11 19:12:11 +00:00
Brian Barrett
1d11cc4b2d Fix mis-declared variable type
This commit was SVN r14994.
2007-06-11 16:48:35 +00:00
Shiqing Fan
d9fa58dc33 Add two more arguments to call. The definition of the function has been modified with 2 additional arguments.
This commit was SVN r14990.
2007-06-11 14:27:36 +00:00
Jeff Squyres
b704ff9f4e Fix typo that accidentally always resulted in a "true" result.
This commit was SVN r14984.
2007-06-11 12:52:56 +00:00
Jeff Squyres
1ed906a78b * Forgot to update the iof null component when we revamped the IOF
framework.  Updated pointers to match current definitions.
 * Trimmed some dead wood while I was at it:
   * No need for component close function that does nothing
   * Use BEGIN/END_C_DECLS
   * Use recent MCA param register function
   * Ditch MCA param orte_iof_debug (it wasn't used anywhere)
   * Use MCA param orte_iof_override properly in the code (i.e., look
     up the value once and use the cached value later)

This commit was SVN r14981.
2007-06-11 00:58:21 +00:00
Jeff Squyres
9fb2e807a9 Remove some unnecessary code, probably dating back to before we had
generalized component include/exclude infrastructure.  This commit
removes the oob_base_include and oob_base_exclude MCA params because
they have long-since been handled by the "oob" MCA parameter in the
MCA base.

This commit was SVN r14979.
2007-06-10 14:16:05 +00:00
Jeff Squyres
4f3a11b4db Fixes trac:967.
A bunch of fixes from the /tmp/iof-fixes branch that fix up ''some''
(but not ''all'') of the problems that we have seen with iof:

 * Reading very large files via stdin redirected to orteun (Sun saw
   this)
 * Reading a little bit of a large file redirected to orterun's stdin
   and then either closing stdin or exiting the process

The Big Change was to make the proxy iof (the one running in non-HNP
orteds) send back a "I'm closing the stream" ACK back to the service
iof.  This tells the HNP that there will be nothing more coming from
that peer, and therefore the iof forward should be removed.

Many other minor cleanups/fixes, terminology changes, and
documentation additions are included in this commit as well.  However,
there are still some pretty big outstanding issues with IOF that are
not addressed either by #967 or this commit.  A few examples:

 * IOF was designed to allow multiple subscribers to a single stream.
   We're not entirely sure that this works (for one thing, there is
   nothing in the ORTE/OMPI code base that uses this functionality).
 * There are also resources leaked when processes/jobs exit (per
   Ralph's first comment on this ticket).  
 * There is no feedback to close orterun's stdin when all subscribers
   to the corresponding stream have closed stdin.

This commit was SVN r14967.

The following Trac tickets were found above:
  Ticket 967 --> https://svn.open-mpi.org/trac/ompi/ticket/967
2007-06-08 22:59:31 +00:00
Tim Prins
31a3430c85 Fix threaded builds broken by r14914
This commit was SVN r14933.

The following SVN revision numbers were found above:
  r14914 --> open-mpi/ompi@983fd3432a
2007-06-06 22:30:34 +00:00
Ralph Castain
e0e4163f53 Remove pithy comment
This commit was SVN r14930.
2007-06-06 20:26:52 +00:00
George Bosilca
3b7f3e5565 Keep the unknown shell string.
This commit was SVN r14929.
2007-06-06 20:24:42 +00:00
George Bosilca
29dd535c01 Remove all references to the orte_bitmap as well as the files.
This commit was SVN r14928.
2007-06-06 20:24:07 +00:00
George Bosilca
fbb46f0ee7 A faster search without the bitmap. Remove all references to the orte_bitmap.
This commit was SVN r14926.
2007-06-06 20:23:14 +00:00
George Bosilca
24eae5c1ec We have a goto label for cleanup so make sure we always use it. This way we insure
that the lock are correctly released in all cases.

This commit was SVN r14925.
2007-06-06 20:20:52 +00:00
George Bosilca
28c9d0758b These functions are supposed to be static so there is no reason to have them declared in the header file.
This commit was SVN r14924.
2007-06-06 20:18:37 +00:00
George Bosilca
b047ed75d7 Don't forget to free the temporary buffer.
This commit was SVN r14923.
2007-06-06 20:17:27 +00:00
Ralph Castain
ea0c03fd7a Revert out r14910. Turns out that the GPR *has* to be able to deal with NULL data values. We fixed this a long time ago on the "put" side, but never dealt with it for "get" - hence, we could "put" ORTE_UNDEF'd attributes in a mapping policy, but couldn't retrieve them. This is why you only encountered the error on comm_spawn and not during the original launch of a job.
This correctly repairs the problem by enabling the GPR's "get" function to correctly handle NULL data values.

This commit was SVN r14916.

The following SVN revision numbers were found above:
  r14910 --> open-mpi/ompi@0757467d77
2007-06-06 18:34:54 +00:00
Ralph Castain
983fd3432a Fix singleton comm_spawn. Ensure that singleton's start the RML receive function so they can receive RML updates during xconnect procedures once any comm_spawn'd children start. Since singleton's only use the RMGR/URM component, update that component to also hold us until xconnect is completed (if it is invoked) before returning to the caller.
This commit was SVN r14914.
2007-06-06 17:39:23 +00:00
Ralph Castain
bbb60e37c0 Remove stale define
This commit was SVN r14913.
2007-06-06 17:19:43 +00:00
Ralph Castain
0757467d77 Fix comm_spawn. The problem stems from our use of the existence of an attribute as equivalent to a boolean "true" - in other words, we only confirm the existence of an attribute on a list to indicate something as opposed to looking at its specific value. Hence, we create the attribute with a type of ORTE_UNDEF - which is fine...until we then attempt to store/retrieve that attribute from the registry. In that case, the DSS barks because it treats ORTE_UNDEF as an error.
The only place where we attempt to store/retrieve attributes is in the RMAPS framework in support of comm_spawn. So this is where things broke down. The fix was simply to say "if the attribute data type is ORTE_UNDEF, then treat it like a boolean with value true". Trivial fix - solves problem.

This commit was SVN r14910.
2007-06-06 15:16:22 +00:00
Brian Barrett
508da4e959 OS X apparently really doesn't like shared libraries with unresolvable
symbols in them and environ is defined only in the final application
(probably in crt1.o).  Apple provides a function for getting at the
environment, so use that instead if it's available.

This commit was SVN r14857.
2007-06-05 03:03:59 +00:00
Brian Barrett
34fea87819 * Only need to to the opal_progress_event_users_increment() once between
OPAL and ORTE.  Since we now do opal_progress_init(), we do it
    there.  Fixes a performance issue introduced in r14773.
  * While trying to find the above, notived that we did the reference
    counting for the init in init_util and for finalize in fini.  That
    isn't right, so make them both in the non-util versions.

This commit was SVN r14830.

The following SVN revision numbers were found above:
  r14773 --> open-mpi/ompi@1e678c3f55
2007-06-01 02:43:46 +00:00
Brian Barrett
11ae30333d strmode is a standard BSD function defined in string.h
This commit was SVN r14828.
2007-05-31 22:47:40 +00:00
Rolf vandeVaart
ec963d02c7 Fix debug timing output so that proper algorithm is printed for
each type of xcast (direct, linear, binomial).

This commit was SVN r14822.
2007-05-31 18:27:54 +00:00
Brian Barrett
e4b369c93e Properly handle case where user instructs the oob to not use all non-localhost
interfaces

This commit was SVN r14815.
2007-05-31 02:29:44 +00:00
George Bosilca
f8f71b9ba0 Correct a threaded problem and make sure we only free what was allocated.
This commit was SVN r14803.
2007-05-30 18:50:29 +00:00
George Bosilca
905570a6d2 Call opal_show_help with the expected number of arguments.
This commit was SVN r14802.
2007-05-30 18:49:43 +00:00
George Bosilca
6afbc02052 The idea behind this patch is to decrease the number of strcmp used in the replica
by using a small hash function before doing the strcmp. The hask key for each
registry entry is computed when it is added to the registry. When we're doing a
query, instead of comparing the 2 strings we first check if the hash key match,
and if they do match then we compare the 2 strings in order to make sure we
eliminate collisions from our answers.

There is some benefit in terms of performance. It's hardly visible for few
processes, but it start showing up when the number of processes increase. In fact
the number of strcmp in the trace file drastically decrease. The main reason it
works well, is because most of the keys start with basically the same chars
(such as orte-blahblah) which transform the strcmp on a loop over few chars.

This commit was SVN r14791.
2007-05-29 18:40:07 +00:00
Ralph Castain
a2964f429e Fix a compiler warning - strncmp returns an int, so you have to compare to 0 instead of NULL.
This commit was SVN r14790.
2007-05-29 18:02:10 +00:00
Anya Tatashina
de676d717b Ref Trac #1032; added suport for full path launching with TotalView
This commit was SVN r14789.
2007-05-29 17:39:11 +00:00
Tim Prins
f95442dec9 better error output
This commit was SVN r14788.
2007-05-29 16:53:40 +00:00
Tim Prins
b4e3ad8da0 Fixup tests for recent api changes
cleanup a ton of warnings, include proper files

fix orte_ring, it had a deadlock in it...

fix the abort test so it can be used with less than 4 processes

This commit was SVN r14787.
2007-05-29 16:22:50 +00:00
Josh Hursey
1e678c3f55 per conversation with Ralph and Jeff take out the opal_init_only logic.
This commit moves the initalization/finalization of opal_event and opal_progress
to opal_init/finalize. These were previously init/final in ORTE which is an
abstraction violation. After talking about it we concluded that there are no
ordering issues that require these to be init/final in ORTE instead of OPAL.

I ran the IBM test suite against this commit and it didn't turn up any new
failures so I think it is good to go.

Let us know if this causes problems.

This commit was SVN r14773.
2007-05-24 21:54:58 +00:00
Josh Hursey
e8b85faf28 Fix for the invalid arguments case. we were not finalizing cleanly.
This commit was SVN r14770.
2007-05-24 21:27:06 +00:00
Josh Hursey
a296ef5487 Checkpoint/restart fix:
Still recovering from interface changes.

This commit was SVN r14769.
2007-05-24 21:12:34 +00:00
Rainer Keller
a665b7a20d - Getting rid of "missing initializer" warnings
This commit was SVN r14766.
2007-05-24 19:19:52 +00:00
Rainer Keller
7d84de8510 - now the formatting (just getting rid of spaces at the end)....
This commit was SVN r14764.
2007-05-24 19:10:32 +00:00
Rainer Keller
ff3cfc0011 - Get rid of "set but never used" warning
This commit was SVN r14763.
2007-05-24 19:07:45 +00:00
Jeff Squyres
379a4ec5e2 While we're editing MCA params in the oob tcp component, ditch the use
of the deprecated MCA param API for registering MCA parameters and
update to the current API.

This commit was SVN r14747.
2007-05-24 13:01:55 +00:00
Jeff Squyres
839c1db95c Fix something that has been bugging me for a while:
Rename the oob_tcp_include and oob_tcp_exclude MCA parameters to be
oob_tcp_if_include and oob_tcp_if_exclude (to match the convention
with btl_tcp_if_[in|ex]clude).  Keep "hidden" synonyms oob_tcp_include
and oob_tcp_exclude in case anyone is actually using them (and some
users undoubtedly are), but do not have them show up in ompi_info
--param output.  Instead, the new "oob_tcp_if_*" names will show up in
ompi_info output.

This commit was SVN r14746.
2007-05-24 12:52:26 +00:00
George Bosilca
7485c920c4 Remove the useless comment as it does not reflect the reality anymore.
This commit was SVN r14737.
2007-05-23 18:55:02 +00:00
Ralph Castain
02f6e6ab3e Slight touchup to make it pretty
This commit was SVN r14734.
2007-05-23 16:39:18 +00:00
Ralph Castain
3fc227286f Be sure to NULL terminate the list of keys...
This commit was SVN r14733.
2007-05-23 16:35:03 +00:00
Ralph Castain
cffea274b3 Fix the hostfile system. Enable the hostfile RDS to properly setup the name service. Modify the name service so that a caller can provide a valid cellid and update its site info for later retrieval.
This commit was SVN r14732.
2007-05-23 16:31:44 +00:00
Sven Stork
ed72cbcaec - Fix a deadlock problem for threaded builds. We have to release the lock
before we wait for the callback, because the callback will try to lock
  the lock again. (show up in debug+threaded build)

This commit was SVN r14731.
2007-05-23 16:11:50 +00:00
Josh Hursey
a010ff6e6a Some updates from the interface change to orte_init
This commit was SVN r14729.
2007-05-23 14:44:23 +00:00
Ralph Castain
e6ff7757ab Modify the new DSS xfer and copy functions so they only xfer/copy the unpacked portion of a buffer's payload. This allows for more rapid transfer of data during message relay without requiring any knowledge of what is in the buffer.
Begin work on restoring binomial message distribution method.

This commit was SVN r14728.
2007-05-23 14:06:32 +00:00
Ralph Castain
b582d98d4a Fix singleton comm_spawn so it can see available resources
This commit was SVN r14719.
2007-05-22 13:29:07 +00:00
Ralph Castain
5b0abf520b Don't update our own contact info
This commit was SVN r14718.
2007-05-22 13:28:23 +00:00
Ralph Castain
677eb5e4bc Ensure the singleton wakes up when comm_spawn fails
This commit was SVN r14714.
2007-05-21 20:13:31 +00:00
Ralph Castain
b771cfcce3 Fix compile problem
This commit was SVN r14713.
2007-05-21 20:11:03 +00:00
Ralph Castain
4fff584a68 Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that *did* start.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.

Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.

Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.

With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.

Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".

This commit was SVN r14711.
2007-05-21 18:31:28 +00:00
Ralph Castain
c9c2922706 Drat - commit missing file.
This commit was SVN r14710.
2007-05-21 17:20:56 +00:00
Ralph Castain
3288ce0462 Cleanup the SDS components and move some common code to the base.
Modify the seed and singleton SDS code so it returns the right local rank and num_local_procs.

This commit was SVN r14707.
2007-05-21 14:57:58 +00:00
Ralph Castain
d9acc93efa Compute and pass the local_rank and local number of procs (in that proc's job) on the node.
To be precise, given this hypothetical launching pattern:

host1: vpids 0, 2, 4, 6
host2: vpids 1, 3, 5, 7

The local_rank for these procs would be:

host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3
host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3

and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid.

I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future.

This commit was SVN r14706.
2007-05-21 14:30:10 +00:00
Jeff Squyres
97248d6bc6 Add another test to check multiple, concurrent COMM_SPAWN's.
This commit was SVN r14701.
2007-05-19 19:02:24 +00:00
Jeff Squyres
47ba3db3b8 Add a simple MPI_COMM_SPAWN_MULTIPLE test.
This commit was SVN r14700.
2007-05-19 02:30:53 +00:00
Ralph Castain
180c96bb8f Clear an erroneous error message pending a more complete fix
This commit was SVN r14698.
2007-05-18 14:44:27 +00:00
Ralph Castain
6ef39ef525 Extend fix in r14693 to the comm_spawn case
This commit was SVN r14694.

The following SVN revision numbers were found above:
  r14693 --> open-mpi/ompi@75d51812a3
2007-05-18 13:34:26 +00:00
Ralph Castain
75d51812a3 Fix the app-failed-to-start capability that was broken by r14554 (holding the caller in rmgr.spawn until the application - as opposed to just the orteds - have started). Allow the rmgr.spawn function to return if the app terminates, correctly handling its return status code to show abnormal termination. Modify orterun to correctly handle the returned status code so it doesn't enter a conditioned wait if the app fails to start since it will never wakeup if it does.
This commit was SVN r14693.

The following SVN revision numbers were found above:
  r14554 --> open-mpi/ompi@4510b42638
2007-05-18 13:29:11 +00:00
Brian Barrett
33a5758521 Some IPv6 improvements:
* Move ipv6comat.h code into opal_config_bottom.h and change into some
    more intelligent testing of structures
  * Change opal's if interface to use sockaddr instead of sockaddr_storage,
    as the RFCs suggest we do
  * Move the networking code in opal that isn't directly related to if
    detection into net.h
  * Add quicky function to get the port out of either a sockaddr_in
    or sockaddr_in6, saving a bunch of code in the oob.
  * Update TCP oob and btl with new interface

This commit was SVN r14679.
2007-05-17 01:17:59 +00:00
Galen Shipman
542937ee2f ompi running on bproc again.
This commit was SVN r14675.
2007-05-16 19:55:43 +00:00
Sven Stork
22af6d38e6 - UNexport symbols that shouldn't be needed outside the libraries
- replace #if/#endif with BEGIN/END_C_DECLS
- reformating

This commit was SVN r14669.
2007-05-16 15:46:52 +00:00
Galen Shipman
df86202202 get bproc to compile, other issues still remain..
This commit was SVN r14661.
2007-05-15 23:11:33 +00:00
Brian Barrett
21e00f6f0c Clean up a couple of configure things:
* Require Autoconf 2.60 or higher and remove some cruft
    required for AC 2.59 or the AC 2.59 / AC 2.60 mix
  * Remove a bunch of now unnecessary AC_SUBST calls
  * Use the libtool-provided variables for the -I and
    library to use when compiling against ltdl

Fixes trac:1000

This commit was SVN r14652.

The following Trac tickets were found above:
  Ticket 1000 --> https://svn.open-mpi.org/trac/ompi/ticket/1000
2007-05-15 04:23:48 +00:00
Jeff Squyres
c5782642d9 Fix some param names so that they show up when you "ompi_info --param
oob all".

This commit was SVN r14646.
2007-05-11 20:58:11 +00:00
Rich Graham
5359cee937 declare undeclared function, so that the code will compile.
This commit was SVN r14625.
2007-05-09 04:47:40 +00:00
Jeff Squyres
51ff779a5d Minor gramatical nit found by Karen/Sun.
This commit was SVN r14622.
2007-05-08 21:24:44 +00:00
Jeff Squyres
395d05b6bc Update the man page to describe both -wdir and -wd. -wdir is consider
the "primary" option and -wd is the synonym.  Regardless, either of
them function exactly like the other.

This commit was SVN r14618.
2007-05-08 20:27:20 +00:00
Jeff Squyres
8a68b2dba7 Add -wdir option as a synonym for -wd (to make us match the man page).
This commit was SVN r14614.
2007-05-08 19:09:32 +00:00
Ralph Castain
ad541e163e Fix compiler warning
This commit was SVN r14605.
2007-05-08 13:21:18 +00:00
Sven Stork
3707207cca - we don't need to export this symbol
This commit was SVN r14593.
2007-05-07 13:05:52 +00:00
Sven Stork
a04c8eb39a - Bring over the visibility feature, for a finer symbol export control
via the visibility feature that is provided by some compilers.

  Per default this feature is disabled, to enable it you need to
  configure with --enable-visibility and obviously you need a compiler
  with visibility support. Please refer to the wiki for more information.
  https://svn.open-mpi.org/trac/ompi/wiki/Visibility

This commit was SVN r14582.
2007-05-04 09:03:37 +00:00
Ralph Castain
2683c85085 Update the TM launcher so it provides an appropriate error message when encountering an invalid launch_id. This is a first step towards fixing ticket #1016, but needs to be followed by a more complete solution.
This commit was SVN r14578.
2007-05-03 20:14:24 +00:00
Gleb Natapov
25190b85f8 Init locks in open() function. Init() function is not called from ompi_info and
we crash in close().

This commit was SVN r14568.
2007-05-02 09:03:14 +00:00
Ralph Castain
4510b42638 Hold the RMGR in the spawn command until the application process actually launches. Previously, we returned from spawn immediately after launching the daemons - this meant that the caller had to define their own "wait until app launches". This only tells the caller that the app procs were launched, of course - it doesn't mean that they have started execution or reached any particular stage. However, for non-MPI procs, this is as far as we can go - there is no further stage gate we can provide.
Still, better than what we provided before...

This commit was SVN r14554.
2007-05-01 11:27:36 +00:00
Shiqing Fan
c166e3d02c Too few arguments for call, fixed according to the corresponding definition.
This commit was SVN r14538.
2007-04-27 13:14:43 +00:00
Ralph Castain
7d6d0a1c00 Update reuse_daemons to find the daemons again - requires that orteds now report their nodenames (probably temporary patch pending upcoming minor revision of orted)
This commit was SVN r14533.
2007-04-26 15:09:54 +00:00
Ralph Castain
c733a7916b Update the gridengine pls to handle failed-to-start. Fix a few places where the fork'd child incorrectly called "return" instead of "exit" (undoubtedly copied from the same error in the old rsh pls).
This commit was SVN r14532.
2007-04-26 15:08:37 +00:00
Ralph Castain
bca2de3a57 Complete the update of the rsh pls to handle failed-to-start
This commit was SVN r14531.
2007-04-26 15:07:40 +00:00
Josh Hursey
d68ff8c2a3 minor typo
This commit was SVN r14516.
2007-04-25 19:54:53 +00:00
Josh Hursey
596062d34b Seems that the recent changes in the sds and oob exposed some invalid
assumptions in the FT restart code for the ORTE layer.

This fixes those problems by having the RML completely shutdown and 
restart the OOB framework (instead of just the module as before).
This makes it much easier to manage, and maintainable as the OOB
changes in the future.

The SDS now does communication as part of its startup procedure, so
we need to make sure we restart the RML before the SDS so that it can
communicate properly.

OOB base [close|open] used a static bool to determine if they have
been called previously or not. I needed to expose this boolean so 
that I can close() then open() the oob base in the restart procedure.
The functionality has not changed, we just now have the ability to 
open/close the framework as many times as we need to as long as we
always call them in that order. (So calling open twice in a row is not allowed
as before, it is only allowed if you open(), close(), then open() again).

Things seem to be working now.

This commit was SVN r14515.
2007-04-25 19:51:52 +00:00
Brian Barrett
4b8bb70afb A couple cleanups for the IPv6 support:
- make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6
    so that it works for IPv4 and IPv6 addresses, and remove a whole bunch
    of #ifs in the OOOB code.
  - Fix a compiler warning in the TCP BTL due to run-time determined
    array size by making it a dynamicly allocated array.
  - Fix the unpacking code of IPv4 addresses when using IPv6 support, so
    that the address is in the correct location (instead of in an IPv6
    structure, use an IPv4 structure).  Refs trac:1005.

This commit was SVN r14514.

The following Trac tickets were found above:
  Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005
2007-04-25 19:08:07 +00:00
Adrian Knoth
d1ce39de4f Move mca_btl_tcp_addr_isipv4public to opal_addr_isipv4public
This commit was SVN r14512.
2007-04-25 18:06:06 +00:00
Ralph Castain
7d0f51e6b9 Begin setting up for a change to the OOB information passing functionality - this is totally transparent at the moment (need to change computers).
This commit was SVN r14510.
2007-04-25 17:36:26 +00:00
Adrian Knoth
35fce38f43 Don't know why this line was here.
This commit was SVN r14509.
2007-04-25 12:31:13 +00:00
Ralph Castain
8517a5a3a6 cleanup a few compiler warnings
This commit was SVN r14507.
2007-04-25 11:51:18 +00:00
Adrian Knoth
868d8febfa Enable rds/hostfile to accept IPv6 addresses.
This commit was SVN r14505.
2007-04-25 06:55:58 +00:00
Jeff Squyres
c4c68e666a Merge in the ipv6 work from /tmp/ipv6-merge.
This commit was SVN r14503.
2007-04-25 01:55:40 +00:00
Jeff Squyres
321e08c605 Add some missing header files
This commit was SVN r14500.
2007-04-24 21:39:12 +00:00
Ralph Castain
18cb5c9762 Complete modifications for failed-to-start of applications. Modifications for failed-to-start of orteds coming next.
This completes the minor changes required to the PLS components. Basically, there is a small change required to the parameter list of the orted cmd functions. I caught and did it for xcpu and poe, in addition to the components listed in my email - so I think that only leaves xgrid unconverted.

The orted fail-to-start mods will also make changes in the PLS components, but those can be localized so they come in one at a time.

This commit was SVN r14499.
2007-04-24 20:53:54 +00:00
Ralph Castain
a764aa6395 Modify iof to report back more descriptive errors
This commit was SVN r14497.
2007-04-24 19:28:37 +00:00
Ralph Castain
c774f641fb Modify orterun to provide more user-friendly reporting on jobs that fail to start
This commit was SVN r14496.
2007-04-24 19:19:14 +00:00
Ralph Castain
19767802de Let the errmgr know how to deal with incomplete starts
This commit was SVN r14495.
2007-04-24 19:04:29 +00:00
Ralph Castain
ef71055cf8 Teach the odls to properly test for and report failed-to-start for application processes.
Test for system limits (where known) prior to doing things like fork and pipe since some systems aren't very nice about it when we try to exceed such limits.

This commit was SVN r14494.
2007-04-24 18:54:45 +00:00
Ralph Castain
f5ef3d795e Tell the smr how to handle failed-to-start
This commit was SVN r14488.
2007-04-24 16:23:26 +00:00
Jeff Squyres
0674bbd001 Fix segv when the shell is not recognized. Thanks to Mostyn Lewis for
noticing the problem.

This commit was SVN r14483.
2007-04-24 12:00:54 +00:00
Ralph Castain
2d04298002 Update the orted cmd xmit functions to match orted recv's. This fixes trac:1004.
This commit was SVN r14482.

The following Trac tickets were found above:
  Ticket 1004 --> https://svn.open-mpi.org/trac/ompi/ticket/1004
2007-04-24 01:58:40 +00:00
Josh Hursey
260e7612ad Fix a few interface changes introduced by r14475
This commit was SVN r14479.

The following SVN revision numbers were found above:
  r14475 --> open-mpi/ompi@18b2dca51c
2007-04-23 20:18:27 +00:00
Ralph Castain
5f94d6d791 Fix the cnos rml to match revised xcast API
This commit was SVN r14478.
2007-04-23 19:07:44 +00:00
Ralph Castain
18b2dca51c Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly.
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).

This also involved a slight change to the oob.xcast API, so propagated that as required.

Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)

Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.

This commit was SVN r14475.
2007-04-23 18:41:04 +00:00
Ralph Castain
009be1c1b5 Reorganize the orted code for easier maintenance. Add ability to deliver xcast messages to local procs (not used at this point).
This commit was SVN r14474.
2007-04-23 18:28:20 +00:00
Ralph Castain
b260f8ee36 Enable the job_family API
This commit was SVN r14473.
2007-04-23 18:26:33 +00:00
Ralph Castain
7a57b694bb Allow caller to get session directory name without anything else
This commit was SVN r14472.
2007-04-23 18:25:36 +00:00
Ralph Castain
9cd85ef55a Add a few more error constants that will help provide more definitive output to the user
This commit was SVN r14471.
2007-04-23 18:25:03 +00:00
Brian Barrett
0a8af62c64 Fix broken build on OS X with static compiles. Everything that uses
anything in OPAL *MUST* call either opal_init() or opal_init_util().

This commit was SVN r14468.
2007-04-23 15:45:39 +00:00
Ralph Castain
477828159e Add a few test functions transferred from ORTE trunk
This commit was SVN r14467.
2007-04-23 14:43:55 +00:00
Ralph Castain
f47e7382e3 Add a new function to wake orterun up - used in failed-to-start scenarios, but can be used anytime a lower level needs to ensure orterun wakes up
This commit was SVN r14466.
2007-04-23 12:49:25 +00:00
Ralph Castain
3d4f1b86d2 Modify the name service to provide necessary support for failed-to-start scenarios. Add a new API to get_vpid_range - this should be used in place of the rmgr API of that name to avoid race conditions (will remove that API in later commit).
This commit was SVN r14465.
2007-04-23 12:48:19 +00:00
Josh Hursey
27a42f48d3 Make sure to call opal_init_util before mca_base_open().
This bug(?) become apparent due to the installdirs commit since these tools
were not finding the proper libraries since the paths were wonkey.

It all looks good now. :)

This commit was SVN r14461.
2007-04-21 22:38:15 +00:00
Jeff Squyres
5bebd24250 Bring over Brian's installdirs fixes from this afternoon (r14445).
This commit was SVN r14450.

The following SVN revision numbers were found above:
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:16:31 +00:00
Jeff Squyres
0ba47105ed Merge the /tmp/jms-installdirs-trunk branch into the trunk. This
finally brings in functionality that is already on the 1.2 branch, and
was developed and tested in the v1.2ofed branch (and other places).

Short version of new features:

 * Support for ibv_fork_init() 
 * Automatically fill in the openib BTL bandwidth value by 
   querying the HCA port 
 * Installdirs functionality 
 * Fixes to always use -I in the Fortran wrapper compilers (#924) 
 * Gleb's mpool updates 
 * Remove some kruft in btl/openib/configure.m4, therefore 
   fixing the harmless warnings noted in #665 
 * Bunches of updates to the Linux RPM spec file 

I.e., effectively the same thing that r14411 brought to the v1.2
branch.

Also effectively brought in r14432 and r14433 (some fixes on top of
the original r14411 commit to v1.2).  Still need to bring in the moral
equivalent of r14445 after this commit (fixes to installdirs).

This commit was SVN r14449.

The following SVN revision numbers were found above:
  r14411 --> open-mpi/ompi@83b31314ae
  r14432 --> open-mpi/ompi@a48f160595
  r14433 --> open-mpi/ompi@68f346d2bc
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:15:05 +00:00
Josh Hursey
b9da59ebc3 Fix the way we determine which sequence number to restart with.
Create a sentinel value in the metadata file to clearly indicate
that the sequence number is complete (versus in progress). This
way we do not try to restart from an invalid sequence number
which can lead to badness.

This commit was SVN r14423.
2007-04-19 15:04:27 +00:00
Sven Stork
037b01ce9e - more symbols that need to be exported
This commit was SVN r14415.
2007-04-18 14:53:56 +00:00
Jeff Squyres
1e364218a2 Remove unused variable
This commit was SVN r14413.
2007-04-18 13:10:10 +00:00
Josh Hursey
6ee0c641fd Cleanup the output from orte-checkpoint so it is a bit more clear and references
the sequence number.

Before:
[...] Finished - Global Snapshot Reference: ompi_global_snapshot_1234.ckpt

After:
Snashot Ref.:   1 ompi_global_snapshot_1234.ckpt

This commit was SVN r14381.
2007-04-15 14:28:56 +00:00
George Bosilca
9e840fbe14 The missing orted bug is now fixed. orterun will not deadlock when
the program it try to spawn is missing.

Description of the problem: When the rsh pls try to spawn a local
process which is missing (such as a removed orted) the orterun
deadlock.

Description of the fix: The forked child deal with finding the
program to be executed. If it fails to find it, then instead of
calling exit (as a normal forked program is expected to do) it 
continue the execution using a execution path it was never
expected to use (back in orterun and then main). Bad things 
happens as expected. Forcing the child to use exit when it fails
to find the orted (and forcing the child to use exit everywhere
instead of return) correct the logic of the rsh pls and make it
behave as expected.

This commit was SVN r14377.
2007-04-14 17:36:27 +00:00
Ralph Castain
adb44c44b1 Revert prior commits from last night that involved significant change to the GPR, along with cosmetic changes to the odls_default module pending review and test.
Reverts r14328, r14329, r14331, r14333, r14335, r14338, and r14336.

This commit was SVN r14351.

The following SVN revision numbers were found above:
  r14328 --> open-mpi/ompi@d1ce4a44ca
  r14329 --> open-mpi/ompi@604e79f2d2
  r14331 --> open-mpi/ompi@b2b3417475
  r14333 --> open-mpi/ompi@8882f355b4
  r14335 --> open-mpi/ompi@10dfd534f6
  r14336 --> open-mpi/ompi@5c65c55e59
  r14338 --> open-mpi/ompi@579184cd72
2007-04-12 13:13:28 +00:00
Jeff Squyres
51f286d737 Just like r14289 on the ORTE trunk:
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search patch is checking
$pkglibdir).

This commit was SVN r14345.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r14289
2007-04-12 11:19:42 +00:00
George Bosilca
1c037df7e7 Only print information if the condition is met.
This commit was SVN r14340.
2007-04-12 07:28:18 +00:00
George Bosilca
579184cd72 Rollback commit r14335 it get into the trunk too early.
This commit was SVN r14338.

The following SVN revision numbers were found above:
  r14335 --> open-mpi/ompi@10dfd534f6
2007-04-12 06:21:59 +00:00
George Bosilca
5c65c55e59 Few cleanups. The most important is getting rid of the orte_bitmap_t class
which is not used anymore in the orte code.

This commit was SVN r14336.
2007-04-12 05:50:33 +00:00
George Bosilca
10dfd534f6 Correctly remove the itag if we fail the condition. And be pedantic with the code.
This commit was SVN r14335.
2007-04-12 05:33:31 +00:00
George Bosilca
b882b7e1b3 Update the Windows ODLS.
This commit was SVN r14334.
2007-04-12 05:19:25 +00:00
George Bosilca
8882f355b4 Move these functions at their right place.
This commit was SVN r14333.
2007-04-12 05:18:23 +00:00
George Bosilca
9de6ae0753 ORTE_MODULE_DECLSPEC is not required here.
This commit was SVN r14332.
2007-04-12 05:17:03 +00:00
George Bosilca
b2b3417475 A more optimized version of the orte_gpr_replica_check_itag_list function. Strictly
follow the same behavior as before, the changes just make sure the check is done
in linear time and the memory usage is kept to a minimum.

This commit was SVN r14331.
2007-04-12 05:13:10 +00:00
George Bosilca
604e79f2d2 There is a cleanup label, so I expect to use it in all cases.
This commit was SVN r14329.
2007-04-12 05:05:36 +00:00
George Bosilca
d1ce4a44ca Fix small memory leak (only happens in debug mode).
This commit was SVN r14328.
2007-04-12 05:02:57 +00:00
George Bosilca
cad93a7693 Add more output. Fix some typos, and some small cleanups.
This commit was SVN r14327.
2007-04-12 05:01:29 +00:00
George Bosilca
0d82473b9d Enable the null IOF.
This commit was SVN r14326.
2007-04-12 05:00:05 +00:00
George Bosilca
f5478d95df Dont do anything if the array is already empty.
This commit was SVN r14325.
2007-04-12 04:58:47 +00:00
George Bosilca
e7c4f1ca64 Remove some unused code and correct the finalize function (cancel the pending
receive request).

This commit was SVN r14324.
2007-04-12 04:58:12 +00:00
George Bosilca
4a87c782c3 Release all unselected components. This is a little bit more tricky than usual,
as the IOF components lack the required finalize function. Instead rely on the
module finalize. Read the comment or more informations.

This commit was SVN r14323.
2007-04-12 04:57:08 +00:00
George Bosilca
c15cd5e4ab Unload all non necessary PLS. Once the selection process is done, we should release all
unselected PLS. This decrease the footprint of all Open MPI based processes.

This commit was SVN r14322.
2007-04-12 04:55:23 +00:00
George Bosilca
af6891f471 Fix a small typo.
This commit was SVN r14321.
2007-04-12 04:53:30 +00:00
Tim Prins
6872f21af0 remove unused variable
This commit was SVN r14306.
2007-04-11 17:15:14 +00:00
Pak Lui
e9e8dc2765 * comment out unused code
This commit was SVN r14297.
2007-04-10 22:38:34 +00:00
Josh Hursey
cd5047a9bf Refs trac:976
Collect the base 'orted' command line into a base function since most of the
PLS components were duplicating this code. Add AMCA parameter command line
component to the base set.

Add Aggregate MCA parameter support to the following PLS components:
 - gridengine
 - process
 - slurm
 - poe
 - tm

Improve support for 'rsh' component.

Did/could not support the following components:
 - bproc
 - proxy
 - xcpu
 - cnos
 - xgrid

The above components had peculiar needs that made it non-trivial to add an 
option. The authors of these components need to help in supporting this
new option.

I was only able to test the SLURM and RSH components due to system availability.
The others should work without problem.

This commit was SVN r14284.

The following Trac tickets were found above:
  Ticket 976 --> https://svn.open-mpi.org/trac/ompi/ticket/976
2007-04-10 14:23:32 +00:00
Tim Prins
1e7ff7f0fe Fix another buglet.
This commit was SVN r14270.
2007-04-09 17:54:11 +00:00
Tim Prins
2ffc02870d Reduce the memory usage of the GPR:
- Make it so that all the GPR pointer arrays are allocated initially at 16 elements instead of 512. This saves (on a 64 bit machine) approximately 4*(# procs + # nodes) KB.
- Fix up the segment prealloc function so that preallocating an existant segment is not an error, and make the areas where we do large inserts use it.

Fix the orte_pointer_array to efficiently implement setting its size. Before we just realloced the array one block at a time until the desired size was reached. Now we resize it all in one realloc.

This commit was SVN r14264.
2007-04-09 00:40:15 +00:00
Brian Barrett
13a4bba13f Yet another dumb thing that shouldn't have been in r14261.
This commit was SVN r14263.

The following SVN revision numbers were found above:
  r14261 --> open-mpi/ompi@8a55c84d0b
2007-04-07 23:23:23 +00:00
Brian Barrett
32f0090f81 fix dumb variable scope mistake
This commit was SVN r14262.
2007-04-07 23:00:57 +00:00
Brian Barrett
8a55c84d0b Fix a number of OOB issues:
* Remove the connect() timeout code, as it had some nasty race conditions
    when connections were established as the trigger was firing.  A better
    solution has been found for the cluster where this was needed, so just
    removing it was easiest.
  * When a fatal error (too many connection failures) occurs, set an error
    on messages in the queue even if there isn't an active message.  The
    first message to any peer will be queued without being active (and
    so will all subsequent messages until the connection is established),
    and the orteds will hang until that first message completes.  So if
    an orted can never contact it's peer, it will never exit and just sit
    waiting for that message to complete.
  * Cover an interesting RST condition in the connect code.  A connection
    can complete the three-way handshake, the connector can even send
    some data, but the server side will drop the connection because it
    can't move it from the half-connected to fully-connected state because
    of space shortage in the listen backlog queue.  This causes a RST to
    be received first time that recv() is called, which will be when waiting
    for the remote side of the OOB ack.  In this case, transition the
    connection back into a CLOSED state and try to connect again.
  * Add levels of debugging, rather than all or nothing, each building on
    the previous level.  0 (default) is hard errors.  1 is connection 
    error debugging info.  2 is all connection info.  3 is more state
    info.  4 includes all message info.
  * Add some hopefully useful comments

This commit was SVN r14261.
2007-04-07 22:33:30 +00:00
Tim Prins
df4c468bb4 fix some more minor memory leaks
This commit was SVN r14260.
2007-04-07 18:41:16 +00:00
Tim Prins
8e7765e456 Fix a gigantic memory leak. We were copying a message to send into a buffer, then never freeing the copy we made. But we were mistakenly allocating the buffer on the stack, so the memory checking tools never caught the leak. On 96 nodes, 384 processes, mpirun memory usage went from about 12M to 3M for me after this minor change...
This commit was SVN r14257.
2007-04-07 02:25:48 +00:00
Tim Prins
e058266c96 Change the ORTE datatype service in 2 ways:
1. Remove a unneeded field, bytes_avail, from orte_buffer_t. It is a calcualed value, and updating it everywhere is worse then just calculating it in the one place it is acutally used.
2. Change it so the default size of a orte_buffer is 128 bytes instead of 1024 bytes. We then double the size of the buffer up to 1024 bytes, then we additively increase the size by 1024 bytes at a time as was done before.

This commit was SVN r14252.
2007-04-06 19:40:29 +00:00
George Bosilca
33bf6c6e54 Move the comment at the right place.
This commit was SVN r14237.
2007-04-05 20:36:33 +00:00
George Bosilca
5c355d0bea Always return an initialized variable. More output if we fail to read
from the shell detection child. Don't spawn orted, instead spawn what's
inside the mca_pls_rsh_component.orted.

This commit was SVN r14236.
2007-04-05 20:17:10 +00:00
George Bosilca
ef4baeb6ab Don't reset the pid, as at this point it is already set.
This commit was SVN r14235.
2007-04-05 20:13:50 +00:00
George Bosilca
8fb8363868 Correctly detect the remote shell, and the local one. Big clean-up on how we
deal with the PLS RSH. Remove support for unknown user (i.e. if the user is
not known by the system, then it shouldn't be allowed to spawn anything).

This commit was SVN r14232.
2007-04-05 19:22:26 +00:00
Josh Hursey
8fd6d4ba09 add a newline so output is cleaner/clearer
This commit was SVN r14229.
2007-04-05 17:45:03 +00:00
Ralph Castain
e95539a16a Add two new test codes - orte_loop_spawn/child - to help debug issues surrounding multiple calls to comm_spawn
This commit was SVN r14217.
2007-04-04 21:02:18 +00:00
Jeff Squyres
2cbcb4abf1 Remove the French and strip the tests down to essentials (no need for
buffer attaching/detaching, for example).

This commit was SVN r14216.
2007-04-04 15:38:23 +00:00
Ralph Castain
d5b5cd2d3c Add test code for multiple comm_spawn calls.
Add ERROR_LOG calls to more clearly document failures in the rsh launcher.

This commit was SVN r14214.
2007-04-04 13:24:39 +00:00
Jeff Squyres
fe58753a23 Add a little documentation to iof.h.
This commit was SVN r14208.
2007-04-03 18:17:35 +00:00
George Bosilca
f2a6b9394f Deal with the include spree. Protect "environ" on Windows.
Some others minors modifications in order to make it
compile [again] on Windows.

This commit was SVN r14188.
2007-04-01 16:16:54 +00:00
George Bosilca
01a4f56369 Mostly DECLSPEC cleanups and some include corrections.
This commit was SVN r14186.
2007-04-01 16:08:27 +00:00
Tim Prins
2f74160a37 Fix some more memory leaks
This commit was SVN r14175.
2007-03-30 13:43:50 +00:00