1
1
Граф коммитов

33 Коммитов

Автор SHA1 Сообщение Дата
Brian Barrett
2d17dd9516 temporarily back our r15517 and 15520 so that I can get the RML / OOB changes
to cleanly apply

This commit was SVN r15527.

The following SVN revision numbers were found above:
  r15517 --> open-mpi/ompi@41977fcc95
2007-07-20 01:10:34 +00:00
Ralph Castain
41977fcc95 Remove the cellid field from the orte_process_name_t structure. This only affects a handful of files in itself, but...
Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point.

Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings.

This commit was SVN r15517.
2007-07-19 20:56:46 +00:00
Ralph Castain
bd65f8ba88 Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments.
Short description: major changes include -

1. singletons now fork/exec a local daemon to manage their operations.

2. the orte daemon code now resides in libopen-rte

3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided.

I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine.

This commit was SVN r15390.
2007-07-12 19:53:18 +00:00
Ralph Castain
4fff584a68 Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that *did* start.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.

Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.

Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.

With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.

Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".

This commit was SVN r14711.
2007-05-21 18:31:28 +00:00
Rainer Keller
0889ebd59f - Eliminate warnings, that PGI-6.2.5 issues with -Minform=inform
This commit was SVN r13840.
2007-02-28 08:36:34 +00:00
Rainer Keller
3669e8921e - Fix further compiler warnings regarding initialization
and shadowing variables.

This commit was SVN r13358.
2007-01-30 06:34:38 +00:00
Brian Barrett
6f8b366acb Rename liborte to libopen-rte and libopal to libopen-pal per telecon today
and bug #632.

Refs trac:632

This commit was SVN r12762.

The following Trac tickets were found above:
  Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632
2006-12-05 18:27:24 +00:00
Ralph Castain
37dfdb76eb Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done.
This commit was SVN r11661.
2006-09-14 21:29:51 +00:00
Ralph Castain
5dfd54c778 With the branch to 1.2 made....
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).

Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).

I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).

In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...

Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.

This commit was SVN r11204.
2006-08-15 19:54:10 +00:00
Ralph Castain
a90f8feb35 Need to initialize the buffer in the contact_info command.
This commit was SVN r10563.
2006-06-29 14:57:10 +00:00
Brian Barrett
52369307f8 Add a feature to the build system that Terry from Sun and I talked about
in San Jose.  Allow the configure option --disable-binaries to build OMPI,
but not build or install the support binaries (so basically, just build
the libraries).

This commit was SVN r9777.
2006-04-29 02:16:41 +00:00
Ralph Castain
b9bdb2125e Fix and upgrade the console to support better debugging. Activate "dump" commands to display registry content. Remove the blasted opal_output default prefix that made the dump output illegible. Properly connect to existing daemons and/or start new ones.
This commit was SVN r9528.
2006-04-04 11:05:52 +00:00
Brian Barrett
566a050c23 Next step in the project split, mainly source code re-arranging
- move files out of toplevel include/ and etc/, moving it into the
    sub-projects
  - rather than including config headers with <project>/include, 
    have them as <project>
  - require all headers to be included with a project prefix, with
    the exception of the config headers ({opal,orte,ompi}_config.h
    mpi.h, and mpif.h)

This commit was SVN r8985.
2006-02-12 01:33:29 +00:00
Ralph Castain
4b9f015c0b Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list.
This commit was SVN r8912.
2006-02-07 03:32:36 +00:00
Jeff Squyres
42ec26e640 Update the copyright notices for IU and UTK.
This commit was SVN r7999.
2005-11-05 19:57:48 +00:00
Jeff Squyres
fcef1774d5 Per advice from Ralf W., change the pkgdata declarations in
Makefile.am's to be a *slightly* more correct (and, more importantly,
less error-prone) construct.

This commit was SVN r7554.
2005-09-30 13:32:39 +00:00
Brian Barrett
e0c3775551 * remove some duplicate dependencies that were making Solaris mad
This commit was SVN r7549.
2005-09-30 04:13:26 +00:00
Brian Barrett
ed56e743b7 * update configure.ac to use the modern version of AC_INIT and
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
  number to be set at autoconf time (instead of at configure time, as
  it was before).  Set the version number, minus the subversion r number,
  at autoconf time.  Override the internal variables to include the r
  number (if needed) at configure time.  Basically, the right thing
  should always happen.  The only place it might not is the version
  reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE taks a list of options, no need to specify
  them in all the Makefile.am files.
* Addes support for subdir-objects, meaning that object files are put
  in the directory containing source files, even if the Makefile.am is
  in another directory.  This should start making it feasible to
  reduce the number of Makefile.am files we have in the tree, which
  will greatly reduce the time to run autogen and configure.

This commit was SVN r7211.
2005-09-07 05:54:53 +00:00
Josh Hursey
4eefb33182 Some param changes:
- Change orte_base_infrastructre to orte_infrastructre to conform with 
  ompi_info's needs
- Move MCA Param registration in ORTE to a centralized function that is 
  called first in orte_init_stage1
- Set the infrastructre flag as an argument to orte_init
- Adjust initalization functions to properly pass down the infrastructre
  flag.

This commit was SVN r7053.
2005-08-26 20:13:35 +00:00
Josh Hursey
9bcb2989a5 Some general cleanup.
Added the FT-MPI interface functions. Many are hidden and not implemented,
but serve as place holders for future work.

This commit was SVN r6847.
2005-08-12 22:09:58 +00:00
Ralph Castain
67bd42d167 Fix a few minor compiler warnings
This commit was SVN r6816.
2005-08-12 04:07:47 +00:00
Josh Hursey
22c7f2b3e0 Quite a range of small changes.
ns_replica.c
 - Removed the error logging since I use this function in orte_init_stage1 to 
   check if we have created a cellid yet or not.

ras_types.h & rase_base_node.h
 - This was an empty file. moved the orte_ras_node_t from base/ras_base_node.h
   to this file.
 - Changed the name of orte_ras_base_node_t to orte_ras_node_t to match the 
   naming mechanisms in place.

ras.h
 - Exposed 2 functions:
   - node_insert:
     This takes a list of orte_ras_base_node_t's and places them in the Node 
     Segment of the GPR. This is to be used in orte_init_stage1 for singleton 
     processes, and the hostfile parsing (see rds_hostfile.c). This just puts 
     in the appropriate API interface to keep from calling the 
     orte_ras_base_node_insert function directly.
   - node_query:
     This is used in hostfile parsing. This just puts in the appropriate API 
     interface to keep from calling the orte_ras_base_node_query function 
     directly.
 - Touched all of the implemented components to add reference to these new 
   function pointers

ras_base_select.c & ras_base_open.c
 - Add and set the global module reference

rds.h
 - Exposed 1 function:
   - store_resource:
     This stores a list of rds_cell_desc_t's to the Resource Segment. 
     This is used in conjunction with the orte_ras.node_insert function in 
     both the orte_init_stage1 for singleton processes and rds_hostfile.c

rds_base_select.c & rds_base_open.c
 - Add and set the global module reference

rds_hostfile.c
 - Added functionality to create a new cellid for each hostfile, placing 
   each entry in the hostfile into the same cellid. Currently this is 
   commented out with the cellid hard coded to 0, with the intention of 
   taking this out once ORTE is able to handle multiple cellid's
 - Instead of just adding hosts to the Node Segment via a direct call to 
   the ras_base_node_insert() function. First add the hosts to the Resource 
   Segment of the GPR using the orte_rds.store_resource() function then use 
   the API version of orte_ras.node_insert() to store the hosts on the Node 
   Segment.
 - Add 1 new function pointer to module as required by the API.

rds_hostfile_component.c
 - Converted this to use the new MCA parameter registration

orte_init_stage1.c
 - It is possible that a cellid was not created yet for the current environment. 
   So I put in some logic to test if the cellid 0 existed. If it does then 
   continue, otherwise create the cellid so we can properly interact with the 
   GPR via the RDS.
 - For the singleton case we insert some 'dummy' data into the GPR. The RAS 
   matches this logic, so I took out the duplicate GPR put logic, and 
   replaced it with a call to the orte_ras.node_insert() function.
 - Further before calling orte_ras.node_insert() in the singleton case, 
   we also call orte_rds.store_resource() to add the singleton node to the 
   Resource Segment.

Console:
 - Added a bunch of new functions. Still experimenting with many aspects of the
   implementation. This is a checkpoint, and has very limited functionality.
 - Should not be considered stable at the moment.

This commit was SVN r6813.
2005-08-11 19:51:50 +00:00
Ralph Castain
1438009dbd Properly set the MCA parameter to indicate these functions are infrastructure so that the singleton flag does not get set.
Somehow, in changing over to the new MCA interfaces, the "set" part of that logic got lost, so the singleton flag was always being set. This should repair some of the anomalous behavior seen recently where the local host was always being used for an application process.

This commit was SVN r6757.
2005-08-07 04:17:10 +00:00
Josh Hursey
c6a3e67f07 switch from if's to ifdef's to prevent compiler warnings
This commit was SVN r6705.
2005-08-02 17:07:21 +00:00
Josh Hursey
943a74e266 Checkpoint.
- Add functionality to parse multiple arguments provided in the console
- Cleaned up help function
- Added an option to hide commands from the help menu

Working on launching and reaping of daemons from within the console.

This commit was SVN r6699.
2005-08-01 22:52:48 +00:00
Josh Hursey
835dad20d5 forgot to add orteconsole.h to the distro in the makefile
This commit was SVN r6686.
2005-07-29 17:57:35 +00:00
Josh Hursey
e849f7ba07 Significant clean up of the orteconsole.
- Added user help messages.
 - Abstracted the internal commands, and the mechanism for
   parsing and executing them.
 - Cleaned up the command line parsing
 - Some other misc. cleanup items.

Still much more work to do here, but should provide a more
intuitive interface for extending functionality in the 
system.

This commit was SVN r6676.
2005-07-28 23:48:46 +00:00
Brian Barrett
170ef8af1f * rename ompi_show_help to opal_show_help
* rename ompi_stacktrace to opal_stacktrace
* rename ompi_strncpy to opal_strncpy

This commit was SVN r6336.
2005-07-04 02:38:44 +00:00
Brian Barrett
46245aaac1 * rename orte_os_create_dirpath to opal_os_create_dirpath
* rename orte_os_path to opal_os_path
* rename ompi_path_find to opal_path_find
* rename ompi_pow2 to opal_pow2

This commit was SVN r6334.
2005-07-04 01:59:52 +00:00
Brian Barrett
9f44b80291 * rename ompi_argv to opal_argv
* rename ompi_basename to opal_basename
* rename ompi bitop functions to opal
* rename ompi_cmd_line to opal_cmd_line
* rename ompi_sizet2int to opal_sizet2int
* rename orte_daemon_init to opal_daemon_init
* rename ompi_few to opal_few

This commit was SVN r6330.
2005-07-04 00:13:44 +00:00
Brian Barrett
a13166b500 * rename ompi_output to opal_output
This commit was SVN r6329.
2005-07-03 23:31:27 +00:00
Jeff Squyres
aa056f7bfd First cut of OMPI Makefile.am's, plus a few more catchup updates in orte
This commit was SVN r6286.
2005-07-02 15:06:47 +00:00
Jeff Squyres
1b18979f79 Initial population of orte tree
This commit was SVN r6266.
2005-07-02 13:42:54 +00:00