1
1
Граф коммитов

72 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
2e09128337 Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destuctor to fail!!
Restore the OBJ_RELEASE calls to cleanup map objects.

This commit was SVN r12064.
2006-10-07 22:44:00 +00:00
Ralph Castain
ae79894bad Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).

Gridengine compiles but I cannot test (believe it likely will run).

Poe and xgrid compile to the extent they can without the proper include files.

This commit was SVN r12059.
2006-10-07 15:45:24 +00:00
George Bosilca
c79c436c8d Cleanups. Remove all __WINDOWS__ checks as this module will never
get compiled on Windows.

This commit was SVN r12011.
2006-10-05 06:17:30 +00:00
George Bosilca
dbe7f8ac32 Always return bool.
This commit was SVN r12009.
2006-10-05 05:45:18 +00:00
Ralph Castain
cd7d87aa7b Define the map data types for dss compatibility. Setup to debug bproc
This commit was SVN r11955.
2006-10-03 17:40:00 +00:00
Ralph Castain
3fd67a038f Bring comm_spawn and persistent operations online. Still some minor problems, though - so don't use them yet, please.
This won't affect anything except those two scenarios.

This commit was SVN r11936.
2006-10-02 18:29:15 +00:00
Ralph Castain
7494a7a83f Clean out some debugging statements that were inadvertently left in the commit
This commit was SVN r11933.
2006-10-02 15:03:18 +00:00
Ralph Castain
559b9b0ae8 Continue beating on comm_spawn. Setup to debug bproc.
This commit was SVN r11932.
2006-10-02 14:58:22 +00:00
Ralph Castain
121f834776 Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component.
Still need some work on the proxy component, and on job termination for persistent daemon case.

This commit was SVN r11928.
2006-10-02 00:46:31 +00:00
Ralph Castain
37dfdb76eb Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done.
This commit was SVN r11661.
2006-09-14 21:29:51 +00:00
Josh Hursey
160120b4c5 Fix a cut-n-paste error that causes the 'num_concurrent' to be
set to 1 or 0 instead of the user defined number or default (128).

This caused the PLS to deadlock when using '--debug-daemons' with
more than 2 processes. :(

svn blame says that it was broken in r11347

It is *not* a problem on v1.1 or v1.2 branches.

Bug spotted by Tim Mattox and myself.

This commit was SVN r11575.

The following SVN revision numbers were found above:
  r11347 --> open-mpi/ompi@f52c10d18e
2006-09-08 15:17:17 +00:00
Pak Lui
9dda057f05 - Do the changes as in r11347 for gridengine to use opal_os_path().
- Remove extra NULL argument from rsh module.

This commit was SVN r11377.

The following SVN revision numbers were found above:
  r11347 --> open-mpi/ompi@f52c10d18e
2006-08-23 20:40:01 +00:00
Jeff Squyres
715bae369c Remove extra argument - now obsoleted by the use of opal_os_path().
This commit was SVN r11366.
2006-08-23 14:32:06 +00:00
George Bosilca
f52c10d18e And ORTE is ready for prime-time. All Windows tricks are in:
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unamed structures
- no implicit cast.

Plus a full implementation for the orte_wait functions.

This commit was SVN r11347.
2006-08-23 03:32:36 +00:00
Ralph Castain
8c7f0ed9ae Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system.
Other changes:

1. Remove the old xcpu components as they are not functional.

2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one.

This will require an autogen/configure, I'm afraid.

This commit was SVN r11228.
2006-08-16 16:35:09 +00:00
Ralph Castain
5dfd54c778 With the branch to 1.2 made....
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).

Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).

I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).

In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...

Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.

This commit was SVN r11204.
2006-08-15 19:54:10 +00:00
Ralph Castain
ee5a626d25 Add ability to trap and propagate SIGUSR1/2 to remote processes. There are a number of small changes that hit a bunch of files:
1. Changed the RMGR and PLS APIs to add "signal_job" and "signal_proc" entry points. Only the "signal_job" entries are implemented - none of the components have implementations for "signal_proc" at this time. Thus, you can signal all of the procs in a job, but cannot currently signal only one specific proc.

2. Implemented those new API functions in all components except xgrid (Brian will do so very soon). Only the rsh/ssh and fork modules have been tested, however, and only under OS-X.

3. Added signal traps and callback functions for SIGUSR1/2 to orterun/mpirun that catch those signals and call the appropriate commands to propagate them out to all processes in the job.

4. Added a new test directory under the orte branch to (eventually) hold unit and system level tests for just the run-time. Since our test branch of the repository is under restricted access, people working on the RTE were continually developing their own system-level tests - thus making it hard to help diagnose problems. I have moved the more commonly-used functions here, and added one specifically for testing the SIGUSR1/2 functionality.

I will be contacting people directly to seek help with testing the changes on more environments. Other than compile issues, you should see absolutely no change in behavior on any of your systems - this additional functionality is transparent to anyone who does not issue a SIGUSR1/2 to mpirun.

Ralph

This commit was SVN r10258.
2006-06-08 18:27:17 +00:00
Jeff Squyres
4882dc0e2c Addendum to r9930: missed a chunk of the rsh pls to use the basename
of $libdir and $bindir (i.e., was correctly doing local launches, but
was still using $prefix/lib and $prefix/bin for remote launches).

[Re-]Fixes OFED bug 59.

This commit was SVN r10207.

The following SVN revision numbers were found above:
  r9930 --> open-mpi/ompi@1d6902296c
2006-06-05 21:12:36 +00:00
Jeff Squyres
1d6902296c Additions to the tm, slurm, and rsh pls modules to handle the --prefix
option as discussed on the devel-core mailing list.  The Big
Difference is that instead of hard-coding the strings "/lib" and
"/bin" in to append to the prefix, we append the basename of the local
libdir and bindir.  Hence, if your libdir is $prefix/lib64, we'll
append /lib64 to construct the remote node's LD_LIBRARY_PATH (etc.).

Also appended the orterun.1 man page to include a description of
--prefix, how it is constructed, what it handles / what it does not,
etc.

This commit was SVN r9930.
2006-05-16 14:14:12 +00:00
Brian Barrett
c42da09796 * Fix a small bug George noticed - if you change the prefix (or any of the
installation directories) in configure, the files that depend on this
  information are not properly rebuilt.  If you need this information,
  don't setup a -D in the Makefile.am - instead, include 
  opal/install_dirs.h.
* Use the : option in AC_CONFIG_FILES to avoid needing to expose that
  we are playing around with temporary files with our headers to avoid
  rebuilding
* Clean up the version file information a bit, and like the install 
  directory stuff, make sure that there is a dependency so that 
  ompi_info gets rebuilt properly when a version number changes.

This commit was SVN r9256.
2006-03-12 04:35:01 +00:00
Jeff Squyres
81e0bd444b - Remove extraneous chdir("/tmp")
- For a local orted launch, chdir($HOME) to be consistent with what
  [we assume] will happen on the remote nodes

This commit was SVN r9073.
2006-02-16 22:14:05 +00:00
Brian Barrett
566a050c23 Next step in the project split, mainly source code re-arranging
- move files out of toplevel include/ and etc/, moving it into the
    sub-projects
  - rather than including config headers with <project>/include, 
    have them as <project>
  - require all headers to be included with a project prefix, with
    the exception of the config headers ({opal,orte,ompi}_config.h
    mpi.h, and mpif.h)

This commit was SVN r8985.
2006-02-12 01:33:29 +00:00
Jeff Squyres
abc67a257f This approach is cleaner than the previous one -- use a temporary
shell variable to avoid setting the OMPI $libpath twice in
$LD_LIBRARY_PATH.  Many thanks to Glenn Morris.

This commit was SVN r8883.
2006-02-02 11:58:40 +00:00
Jeff Squyres
cc1ee11eeb Fix issues with tcsh and LD_LIBRARY_PATH when using --prefix. See
lengthy comment inside for details.  Thanks to Glen Morris for finding
this issue and suggesting the fix.

This commit was SVN r8880.
2006-02-02 06:26:55 +00:00
Jeff Squyres
f7097d34c8 Remove some \n typos. Thanks to Glenn Morris for finding these.
This commit was SVN r8878.
2006-02-02 05:50:15 +00:00
Jeff Squyres
e6bd80b424 Per the commit message of r8514, change the search order to be "ssh :
rsh", *not* "rsh : ssh".

This commit was SVN r8736.

The following SVN revision numbers were found above:
  r8514 --> open-mpi/ompi@9c25bdc5ac
2006-01-18 22:00:34 +00:00
George Bosilca
bf266c6109 Rollback the 8682 commit until we figure out the correct way to do it. It break several things
inside (like MPI_Wait* functions).

This commit was SVN r8686.
2006-01-13 22:02:40 +00:00
Rainer Keller
95f886b6ab - Protect callers of opal/ompi_condition_wait from spurious wakeups,
possible when with building with pthreads.
   Compiled on Linux ia32 with and without
   --enable-progress-threads

This commit was SVN r8682.
2006-01-12 17:13:08 +00:00
Jeff Squyres
1e93c78e2e - Rename rsh component members: argv->agent_argv, argc->agent_argc,
and path->agent_path so that it's totally clear what these are for
- make a new rsh component param for agent_param (the value from the
  MCA param)
- delay the path check for the agent until the component init -- don't
  make it fail during open, because the MCA base will print a warning
  if a component fails open() (e.g., on clusters without rsh/ssh (!),
  this component was failing noisly even though it was
  normal/expected)

This commit was SVN r8596.
2005-12-22 14:37:19 +00:00
Rainer Keller
b06d79d4fe - Seems with change r7664, the mapping has slightly changed.
In case of checking for Shell with --mca pls_rsh_assume_same_shell 0
   have the node point to sensible values.

This commit was SVN r8563.

The following SVN revision numbers were found above:
  r7664 --> open-mpi/ompi@0629cdc2d7
2005-12-20 15:59:17 +00:00
Brian Barrett
456ba1c11f * need to declare environ on OS X
this should go to the 1.0 branch

This commit was SVN r8527.
2005-12-16 19:20:33 +00:00
Jeff Squyres
9c25bdc5ac Change to the rsh pls component to have the pls_rsh_agent MCA param
now take a colon-delimited list of agents (and associated argv).  Also
change the default value to "ssh : rsh".  Hence, if we run on a
cluster that does not have ssh, we'll fall back to rsh.  If we can't
find rsh, then the rsh component will disqualify itself from
selection.

This commit was SVN r8514.
2005-12-15 20:54:24 +00:00
George Bosilca
7d8d516a4a A bunch of fixed for Windows support.
- protection with __WINDOWS__ and not WIN32 or _WIN32
 - protect all the headers

This commit was SVN r8463.
2005-12-12 20:04:00 +00:00
Jeff Squyres
42ec26e640 Update the copyright notices for IU and UTK.
This commit was SVN r7999.
2005-11-05 19:57:48 +00:00
Tim Woodall
aa5b61e4f1 corrections for multiple app contexts
This commit was SVN r7939.
2005-10-31 20:37:44 +00:00
Tim Woodall
60754acae8 - modified rmaps data structures to point directly to ras node
- modified rsh to NOT query for each nodes mapping, as all data is
  already available in the rmaps structures

This commit was SVN r7894.
2005-10-27 17:04:10 +00:00
Josh Hursey
92429dc90f Fix for a problem Edgar and Jeff identified WRT PLS determining if we are
oversubscribed on a node. And thus whether to call sched_yield or not.

The value of node->node_slots_inuse does not currently represent the number of
slots actually in use, at the moment. This is actually a bug in the RAS/RMAPS
base components, but the fix for that specific bug is bigger than we want to 
address at the moment (but will certianly do so in the near future).

Since we cannot trust this value, use the total number of mapped processes
(which was properly set by the RMAPS component upon mapping -- Just not 
properly propagated back to the registry's node segment) from the process 
mapping.

In addition to this change I cleaned up a couple of the debug messages. It
seems that TM and RSH are the only two directly effected by this. SLURM
would be if that section of code wasn't currently inactive, but put the fix
in for prosparity.

This commit was SVN r7743.
2005-10-13 03:26:48 +00:00
Jeff Squyres
0629cdc2d7 Bring back the changes from /tmp/jjhursey-rmaps. Specific merge
command:

svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps .

(where "." is a trunk checkout)

The logs from this branch are much more descriptive than I will put
here (including a *really* long description from last night).  Here's
the short version:

- fixed some broken implementations in ras and rmaps
- "orterun --host ..." now works and has clearly defined semantics
  (this was the impetus for the branch and all these fixes -- LANL had
  a requirement for --host to work for 1.0)
- there is still a little bit of cleanup left to do post-1.0 (we got
  correct functionality for 1.0 -- we did not fix bad implementations
  that still "work")
  - rds/hostfile and ras/hostfile handshaking
  - singleton node segment assignments in stage1
  - remove the default hostfile (no need for it anymore with the
    localhost ras component)
  - clean up pls components to avoid duplicate ras mapping queries
  - [possible] -bynode/-byslot being specific to a single app context 

This commit was SVN r7664.
2005-10-07 22:24:52 +00:00
Jeff Squyres
e9ec846c68 Minor change to only display the prefix debug message at most once
This commit was SVN r7568.
2005-09-30 21:43:32 +00:00
Jeff Squyres
fcef1774d5 Per advice from Ralf W., change the pkgdata declarations in
Makefile.am's to be a *slightly* more correct (and, more importantly,
less error-prone) construct.

This commit was SVN r7554.
2005-09-30 13:32:39 +00:00
Jeff Squyres
de1c8fb125 - Make debug output a bit more accurate and readable
- Fix bug identified by users: --prefix may also apply on the local
  node; we need to prefix the PATH and LD_LIBRARY_PATH environment
  variables before invoking execve()

This commit was SVN r7541.
2005-09-29 12:35:43 +00:00
Andrew Friedley
555ae37255 Add lib{opal,orte,mpi}.la to appropriate LIBADD's, some whitespace cleanup as well.
This commit was SVN r7477.
2005-09-22 12:28:54 +00:00
George Bosilca
703e874468 Remove a race condition. If this functions is called by the progress thread then it does not have to
add an event, it can call the spawn function directly. This will avoid it standing on the condition who 
will never get released.

This commit was SVN r7428.
2005-09-19 15:54:53 +00:00
Rainer Keller
5fed46e072 - Allow usernames to be specified in the hostfile.
The following formats are parsed:
    user@IPv4
    user@fqdn
    IPv4 or fqdn [username|user-name|user_name]=user
- Try a better error-detection when parsing (recognize wrong
  IPs, fqdns...)

This commit was SVN r7288.
2005-09-10 07:57:50 +00:00
Brian Barrett
ed56e743b7 * update configure.ac to use the modern version of AC_INIT and
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
  number to be set at autoconf time (instead of at configure time, as
  it was before).  Set the version number, minus the subversion r number,
  at autoconf time.  Override the internal variables to include the r
  number (if needed) at configure time.  Basically, the right thing
  should always happen.  The only place it might not is the version
  reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE taks a list of options, no need to specify
  them in all the Makefile.am files.
* Addes support for subdir-objects, meaning that object files are put
  in the directory containing source files, even if the Makefile.am is
  in another directory.  This should start making it feasible to
  reduce the number of Makefile.am files we have in the tree, which
  will greatly reduce the time to run autogen and configure.

This commit was SVN r7211.
2005-09-07 05:54:53 +00:00
Rainer Keller
a36347d728 - Support -prefix specification on mpirun/orterun cmd-line per
app_context:
  mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \
         -np 2 -prefix /path/to/ompi/on/machineB ./exec2

- Allow with -mca pls_rsh_assume_same_shell 0, the checking for the
  SHELL-variable on the actual node (currently 1st node).
  Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and 
  csh/tcsh.

This commit was SVN r7195.
2005-09-06 16:10:05 +00:00
Rainer Keller
588a62cb90 - Missed file in last commit
This commit was SVN r7179.
2005-09-04 20:55:27 +00:00
Ralph Castain
4b5b3b4164 Properly handle the argv array and clean it up when done.
This commit was SVN r7166.
2005-09-03 00:15:21 +00:00
Ralph Castain
4bd25e0292 Few minor memory leak cleanups
This commit was SVN r7156.
2005-09-02 18:50:01 +00:00
Rainer Keller
d7901c97a5 - Del whitespaces, to make coming patch smaller.
This commit was SVN r7089.
2005-08-30 06:58:37 +00:00