1
1
Граф коммитов

68 Коммитов

Автор SHA1 Сообщение Дата
Tim Woodall
41b6fc166e setup callback before actually launching - otherwise this is
a definate race condition

This commit was SVN r7413.
2005-09-16 20:45:25 +00:00
Josh Hursey
8bf587475b Added a flag to orte_rmgr_base_proc_stage_gate_subscribe() allowing the
caller to specify a subset of the state variables that it can can subscribe to.
This is specified with one of three special flags defined in rmgr/rmgr_types.h

This is useful when we only care about a subset of the state changes, such as
in orted which only needs to know when a job has terminated or aborted.

This commit was SVN r7356.
2005-09-13 21:14:34 +00:00
Josh Hursey
8dfcc41efd Bootproxy daemons should persist after their local children have exited,
waiting instead for the SOH to indicate that the jobid has terminated.

In a scheduled environment, if your program has a section of MPI code
followed by a section of computation that some processes execute while
other proceses terminate normally. This patch keeps the scheduler from 
terminating all of the processes and the allocation if all of the processes 
on an allocated node exit well before other processes on other nodes.

This commit was SVN r7333.
2005-09-13 02:37:34 +00:00
Josh Hursey
7d403cf1b4 whoops forgot a bit of debug. ;(
This commit was SVN r7313.
2005-09-12 16:58:10 +00:00
Josh Hursey
e1abd9a6e4 Added a couple of missing initalizations that caused orteprobe to
segv when in opal_finalize

This commit was SVN r7311.
2005-09-12 16:52:58 +00:00
Brian Barrett
1fcf18c211 non-persistent signal behavior isn't quite right, so use the proper SIGNAL
macros and deregister at the appropriate time.

This commit was SVN r7293.
2005-09-10 23:22:37 +00:00
Brian Barrett
ed56e743b7 * update configure.ac to use the modern version of AC_INIT and
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
  number to be set at autoconf time (instead of at configure time, as
  it was before).  Set the version number, minus the subversion r number,
  at autoconf time.  Override the internal variables to include the r
  number (if needed) at configure time.  Basically, the right thing
  should always happen.  The only place it might not is the version
  reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE taks a list of options, no need to specify
  them in all the Makefile.am files.
* Addes support for subdir-objects, meaning that object files are put
  in the directory containing source files, even if the Makefile.am is
  in another directory.  This should start making it feasible to
  reduce the number of Makefile.am files we have in the tree, which
  will greatly reduce the time to run autogen and configure.

This commit was SVN r7211.
2005-09-07 05:54:53 +00:00
Jeff Squyres
383d9f58e7 Be [slightly] more descriptive. :-)
This commit was SVN r7198.
2005-09-06 16:57:11 +00:00
Rainer Keller
a36347d728 - Support -prefix specification on mpirun/orterun cmd-line per
app_context:
  mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \
         -np 2 -prefix /path/to/ompi/on/machineB ./exec2

- Allow with -mca pls_rsh_assume_same_shell 0, the checking for the
  SHELL-variable on the actual node (currently 1st node).
  Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and 
  csh/tcsh.

This commit was SVN r7195.
2005-09-06 16:10:05 +00:00
Rainer Keller
192625d2a1 - Once again: uninteresting cleanup to get diff smaller.
This commit was SVN r7178.
2005-09-04 20:54:19 +00:00
Ralph Castain
12daecb826 More cleanup
This commit was SVN r7167.
2005-09-03 01:22:11 +00:00
Jeff Squyres
4c59058053 - Add some logic to configure to make a version of CFLAGS that doesn't
include any optimization flags
- Use these flags to always compile ompi/debuggers/* and orterun so
  that parallel debuggers (such as Totalview) can always see the
  debugging symbols (see comments in ompi/debuggers/Makefile.am and
  orte/tools/orterun/Makefile.am)
- Remove some obsolete LAM-named variables from configure.ac

This commit was SVN r7125.
2005-09-01 10:37:20 +00:00
David Daniel
c6054662d5 Forgot to add new header to sources
This commit was SVN r7109.
2005-08-31 16:21:58 +00:00
David Daniel
a5eff8fc78 A little more clean-up. TotalView now works with --enable-debug build.
Tested with:
pls = rsh
totalview.6.6.0-2
Linux cadillac82.ccstar.lanl.gov 2.4.24 #1 SMP Thu Jul 1 15:28:04 MDT
2004 i686 i686 i386 GNU/Linux

This commit was SVN r7108.
2005-08-31 16:15:59 +00:00
Jeff Squyres
284328afe3 Add missing .h file so that it is included in the tarball
This commit was SVN r7107.
2005-08-31 11:01:28 +00:00
George Bosilca
d64a702a5b There is a missing header. --enable-picky help to track down such kind of errors.
This commit was SVN r7102.
2005-08-31 00:47:52 +00:00
David Daniel
995641c1e6 Don't initialize proctable more than once (since the stage gate 1 trigger
seems to get fired at least twice).

This commit was SVN r7101.
2005-08-31 00:21:55 +00:00
David Daniel
ced11250e4 Basic totalview support for orterun. Close to working, but need to
check hostnames are obtained correctly.

This commit was SVN r7096.
2005-08-30 17:29:43 +00:00
David Daniel
6cb97e6ade Reverting totalview support to *not* use the as yet unimplemented
orte_jobgrp_t.  Now just need to work out where to call it...

This commit was SVN r7092.
2005-08-30 12:59:04 +00:00
Jeff Squyres
774f879a41 Oops -- add second string in there because we added a second %s to the
help message.

This commit was SVN r7064.
2005-08-27 13:32:25 +00:00
Jeff Squyres
b3bd549331 - Change a few calls from exit() to orte_abort() so that we get
session directory cleanup (among other things)
- When we get an abnormal exit in orterun (i.e., timeout expires and
  we haven't gotten termination notices from all processes), print a
  better message an exit in a better way (which includes session
  directory cleanup)
- Fix tm and poe pls's to not exit() but rather propagate the error up
  the stack (where relevant)

This commit was SVN r7058.
2005-08-26 20:36:11 +00:00
Josh Hursey
4eefb33182 Some param changes:
- Change orte_base_infrastructre to orte_infrastructre to conform with 
  ompi_info's needs
- Move MCA Param registration in ORTE to a centralized function that is 
  called first in orte_init_stage1
- Set the infrastructre flag as an argument to orte_init
- Adjust initalization functions to properly pass down the infrastructre
  flag.

This commit was SVN r7053.
2005-08-26 20:13:35 +00:00
Josh Hursey
3e18fa4555 Insert some signal handlers so orted properly cleans up after it self
when given a kill signal.

This commit was SVN r7050.
2005-08-26 18:56:08 +00:00
Brian Barrett
3e8740e740 * mostly working SLURM component. Had to add a sds for the daemons so that
we could vector launch the daemons and still have the nodenames fixed 
  up in the end

This commit was SVN r7041.
2005-08-25 22:29:23 +00:00
Brian Barrett
acd652a7ac * have rsh setup opal_progress so that call_yield is only called if the nodes
are oversubscribed (based on information from ras and current data in gpr)

This commit was SVN r6941.
2005-08-19 18:56:44 +00:00
Josh Hursey
9bcb2989a5 Some general cleanup.
Added the FT-MPI interface functions. Many are hidden and not implemented,
but serve as place holders for future work.

This commit was SVN r6847.
2005-08-12 22:09:58 +00:00
Josh Hursey
e370141ae3 add memory manager init
This commit was SVN r6842.
2005-08-12 20:45:46 +00:00
Ralph Castain
67bd42d167 Fix a few minor compiler warnings
This commit was SVN r6816.
2005-08-12 04:07:47 +00:00
Josh Hursey
22c7f2b3e0 Quite a range of small changes.
ns_replica.c
 - Removed the error logging since I use this function in orte_init_stage1 to 
   check if we have created a cellid yet or not.

ras_types.h & rase_base_node.h
 - This was an empty file. moved the orte_ras_node_t from base/ras_base_node.h
   to this file.
 - Changed the name of orte_ras_base_node_t to orte_ras_node_t to match the 
   naming mechanisms in place.

ras.h
 - Exposed 2 functions:
   - node_insert:
     This takes a list of orte_ras_base_node_t's and places them in the Node 
     Segment of the GPR. This is to be used in orte_init_stage1 for singleton 
     processes, and the hostfile parsing (see rds_hostfile.c). This just puts 
     in the appropriate API interface to keep from calling the 
     orte_ras_base_node_insert function directly.
   - node_query:
     This is used in hostfile parsing. This just puts in the appropriate API 
     interface to keep from calling the orte_ras_base_node_query function 
     directly.
 - Touched all of the implemented components to add reference to these new 
   function pointers

ras_base_select.c & ras_base_open.c
 - Add and set the global module reference

rds.h
 - Exposed 1 function:
   - store_resource:
     This stores a list of rds_cell_desc_t's to the Resource Segment. 
     This is used in conjunction with the orte_ras.node_insert function in 
     both the orte_init_stage1 for singleton processes and rds_hostfile.c

rds_base_select.c & rds_base_open.c
 - Add and set the global module reference

rds_hostfile.c
 - Added functionality to create a new cellid for each hostfile, placing 
   each entry in the hostfile into the same cellid. Currently this is 
   commented out with the cellid hard coded to 0, with the intention of 
   taking this out once ORTE is able to handle multiple cellid's
 - Instead of just adding hosts to the Node Segment via a direct call to 
   the ras_base_node_insert() function. First add the hosts to the Resource 
   Segment of the GPR using the orte_rds.store_resource() function then use 
   the API version of orte_ras.node_insert() to store the hosts on the Node 
   Segment.
 - Add 1 new function pointer to module as required by the API.

rds_hostfile_component.c
 - Converted this to use the new MCA parameter registration

orte_init_stage1.c
 - It is possible that a cellid was not created yet for the current environment. 
   So I put in some logic to test if the cellid 0 existed. If it does then 
   continue, otherwise create the cellid so we can properly interact with the 
   GPR via the RDS.
 - For the singleton case we insert some 'dummy' data into the GPR. The RAS 
   matches this logic, so I took out the duplicate GPR put logic, and 
   replaced it with a call to the orte_ras.node_insert() function.
 - Further before calling orte_ras.node_insert() in the singleton case, 
   we also call orte_rds.store_resource() to add the singleton node to the 
   Resource Segment.

Console:
 - Added a bunch of new functions. Still experimenting with many aspects of the
   implementation. This is a checkpoint, and has very limited functionality.
 - Should not be considered stable at the moment.

This commit was SVN r6813.
2005-08-11 19:51:50 +00:00
Josh Hursey
bf2789fc66 Added some error logging for unimplemented daemon commands.
This commit was SVN r6809.
2005-08-11 16:57:06 +00:00
Josh Hursey
afe7e687cb A bit of cleanup and a couple of bug fixes for remote orted launching
using orteprobe.

Created a header file for orte_setup_hnp. [HNP = Head Node Process]

General cleanup and added a bit of documentation in orte_setup_hnp.c
Also fixed a cellid tokens issue (circa line 285)
Changed the launched scope from private to public

In orteprobe:
- added reference to orted.h to avoid duplicate header contents in orteprobe.h
- removed the version tag, and put in a verbose argument
- Fixed a buffer packing problem that was causing the parent from receiving the
  proper contact information for the new daemon.

This commit was SVN r6802.
2005-08-10 20:01:25 +00:00
Jeff Squyres
32e71e5c6c Fix a problem where orterun itself would not receive MCA parameters
that were set on the command line.  This was techinically exactly the
way the code was designed, but it certainly violated the Law of Least
Astonishment (even to its designer ;-) ).  So now if you execute
something like this:

   mpirun -mca pls_rsh_debug 1 -np 4 hello

You'll see debugging output from the rsh pls component, as you would
expect (this was not previously the case -- the MCA pls_rsh_debug
parame would be set to 1 in the 4 spawned hello processes, but *not*
in the orterun process).

More specifically, MCA parameters will be set in the orterun process
in the following cases:

- The new command line switch "--gmca" (or "-gmca") is used,
  indicating that the MCA parameter is "global".  --gmca also means
  that that MCA parameter will be applied to all context app's.  For
  example:

      mpirun -gmca foo bar -np 1 hello : -np 2 goodbye

  The foo MCA param will be set in both the hello and goodbye
  processes.

- If there is only one context app.  For example:

      mpirun -mca pls_rsh_debug 1 -np 4 hello

  will set pls_rsh_debug to 1 in both the orterun process and the 4
  spawned hello processes.

Also added a few more comments inside orterun to document a somewhat
confusing use of a state variable in a recursive case.

This commit was SVN r6764.
2005-08-08 16:42:28 +00:00
Ralph Castain
1438009dbd Properly set the MCA parameter to indicate these functions are infrastructure so that the singleton flag does not get set.
Somehow, in changing over to the new MCA interfaces, the "set" part of that logic got lost, so the singleton flag was always being set. This should repair some of the anomalous behavior seen recently where the local host was always being used for an application process.

This commit was SVN r6757.
2005-08-07 04:17:10 +00:00
Jeff Squyres
d0a0434172 Investigating an MCA param problem -- converted over orterun to new
MCA param API in the process.

This commit was SVN r6739.
2005-08-04 18:15:47 +00:00
Josh Hursey
12031db535 Added missing help file.
This commit was SVN r6737.
2005-08-04 17:40:22 +00:00
Jeff Squyres
ef9e06451c Ensure that --mca is listed in the --help message (thanks for pointing
this out Gleb!)

This commit was SVN r6712.
2005-08-02 18:52:12 +00:00
Josh Hursey
c6a3e67f07 switch from if's to ifdef's to prevent compiler warnings
This commit was SVN r6705.
2005-08-02 17:07:21 +00:00
Josh Hursey
943a74e266 Checkpoint.
- Add functionality to parse multiple arguments provided in the console
- Cleaned up help function
- Added an option to hide commands from the help menu

Working on launching and reaping of daemons from within the console.

This commit was SVN r6699.
2005-08-01 22:52:48 +00:00
Josh Hursey
835dad20d5 forgot to add orteconsole.h to the distro in the makefile
This commit was SVN r6686.
2005-07-29 17:57:35 +00:00
Josh Hursey
9acbd4e21f forgot to take out initalizer when I removed the verbose stuff
This commit was SVN r6682.
2005-07-29 00:21:10 +00:00
Josh Hursey
e849f7ba07 Significant clean up of the orteconsole.
- Added user help messages.
 - Abstracted the internal commands, and the mechanism for
   parsing and executing them.
 - Cleaned up the command line parsing
 - Some other misc. cleanup items.

Still much more work to do here, but should provide a more
intuitive interface for extending functionality in the 
system.

This commit was SVN r6676.
2005-07-28 23:48:46 +00:00
Josh Hursey
018c4aa44e remove unnecessary slashes
This commit was SVN r6673.
2005-07-28 21:33:33 +00:00
Josh Hursey
8b56769307 removed the version command line option. Added some more user help messages
This commit was SVN r6672.
2005-07-28 21:17:48 +00:00
Josh Hursey
5ad860fc47 forgot to take out a line in the help message.
This commit was SVN r6671.
2005-07-28 20:51:56 +00:00
Josh Hursey
8deed21e00 Replaced some stderr fprintfs with opal_show_help functions, with
more user friendly error messages.

Removed the "--version" command line option, since they should 
get this from ompi_info [later to be orte_info].

If we find an invalid command line option print out the help
screen before exiting.

This commit was SVN r6670.
2005-07-28 20:49:17 +00:00
Josh Hursey
033b0be417 clean up help msg for orted
This commit was SVN r6657.
2005-07-28 18:38:37 +00:00
Josh Hursey
707fbb35ce added help message file to orted
This commit was SVN r6655.
2005-07-28 17:18:33 +00:00
Ralph Castain
526217b9fc Two things here:
1. Fix the reigstry's overwrite logic. It was only overwriting the first keyval specified in a value - the rest were just added on regardless of whether or not the keyval already existed. This was the source of the multiple keyvals some people were seeing - should be fixed now.

2. Change the orted command parsing options so it reports options that aren't recognized - should help reduce confusion

This commit was SVN r6536.
2005-07-16 23:08:15 +00:00
Brian Barrett
14b89e0e50 Bunch more updates from operation Red Storm:
* Add ability to completely disable libltdl (the dlopen code to load
  dynamic shared objects) to configure: --disable-dlopen
* Added MCA param (component_disable_dlopen) to disable DSO loading
  at runtime
* Made the event library behave in some not-completely-erroneous way
  on platforms where it has absolutely no eventops support (ie, no
  select, poll, or epoll)
* Disabled orte_wait, opal_few, and opal_daemon_init code on
  platforms without fork, waitpid support.  All non-init functions
  will return OPMI_ERR_NOT_SUPPORTED
* Disable orteprobe tool when fork or pipe aren't supported

This commit was SVN r6490.
2005-07-14 18:05:30 +00:00
Tim Prins
d4151fa9fd properly fix the usage of the app pointer array by checking for NULLs instead of forcing it to be the same size as the number of entries
This commit was SVN r6395.
2005-07-08 18:48:25 +00:00