Collect the base 'orted' command line into a base function since most of the
PLS components were duplicating this code. Add AMCA parameter command line
component to the base set.
Add Aggregate MCA parameter support to the following PLS components:
- gridengine
- process
- slurm
- poe
- tm
Improve support for 'rsh' component.
Did/could not support the following components:
- bproc
- proxy
- xcpu
- cnos
- xgrid
The above components had peculiar needs that made it non-trivial to add an
option. The authors of these components need to help in supporting this
new option.
I was only able to test the SLURM and RSH components due to system availability.
The others should work without problem.
This commit was SVN r14284.
The following Trac tickets were found above:
Ticket 976 --> https://svn.open-mpi.org/trac/ompi/ticket/976
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.
This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.
This commit closes trac:158
More details to follow.
This commit was SVN r14051.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r13912
The following Trac tickets were found above:
Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned.
2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see an orted response within that time, then we print out an appropriate error message and just give up.
3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded
4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond
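The non-blocking wait described in items 1-3 can be sketched roughly as follows. This is only an illustration and not the actual pls code: every name here (wait_for_acks, poll_fn, countdown_poll, the cmd_result_t values) is hypothetical, and a poll count stands in for the orte_abort_timeout clock so the example stays deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Possible outcomes of waiting on a daemon command (hypothetical names). */
typedef enum { CMD_ACKED, CMD_TIMED_OUT, CMD_CANCELLED } cmd_result_t;

/* Hypothetical callback: returns true once every daemon has responded. */
typedef bool (*poll_fn)(void *ctx);

/* Demo poll function: ctx points to an int holding the number of polls
 * still needed before all daemons are considered to have answered. */
bool countdown_poll(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0;
}

/* Poll for acknowledgements without blocking indefinitely.
 * max_polls is a stand-in for a real timeout; cancel_requested models a
 * "cancel_operation" request (or a repeated ctrl-c) from the caller. */
cmd_result_t wait_for_acks(poll_fn all_acked, void *ctx,
                           int max_polls, const int *cancel_requested)
{
    assert(all_acked != NULL && cancel_requested != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (*cancel_requested) {
            return CMD_CANCELLED;   /* abandon the operation immediately */
        }
        if (all_acked(ctx)) {
            return CMD_ACKED;       /* every daemon responded in time */
        }
    }
    return CMD_TIMED_OUT;           /* give up; caller reports the error */
}
```

A caller would print an error message and exit when it sees CMD_TIMED_OUT, and skip any further waiting on CMD_CANCELLED.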
This needs more testing before migrating to 1.2.
This commit was SVN r13304.
components that use configure.m4 for configuration or are always built.
The macro has not been needed since moving to configure types other than
configure.stub
Fixes trac:590
This commit was SVN r13031.
The following Trac tickets were found above:
Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
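As a rough sketch of the descendant semantics (not the real name service or PLS code), the effect of an attribute like ORTE_NS_INCLUDE_DESCENDANTS can be modelled with a parent table. The job_table_t type, the INCLUDE_DESCENDANTS flag, and this terminate_job signature are all assumptions made for illustration.

```c
#include <assert.h>

#define MAX_JOBS 16
#define NO_PARENT (-1)

/* Hypothetical flag standing in for an attribute such as
 * ORTE_NS_INCLUDE_DESCENDANTS. */
enum { INCLUDE_DESCENDANTS = 1 };

/* Hypothetical job table: parent[j] is the job that spawned job j. */
typedef struct {
    int parent[MAX_JOBS];
    int alive[MAX_JOBS];
    int njobs;
} job_table_t;

/* Walk up the parent chain to see whether 'job' descends from 'ancestor'. */
static int is_descendant(const job_table_t *t, int job, int ancestor)
{
    int p = t->parent[job];
    while (p != NO_PARENT) {
        if (p == ancestor) {
            return 1;
        }
        p = t->parent[p];
    }
    return 0;
}

/* Mark the given job as terminated, and all of its descendants too when the
 * flag is set. Returns the number of jobs that were terminated. */
int terminate_job(job_table_t *t, int job, int flags)
{
    int count = 0;
    assert(job >= 0 && job < t->njobs);
    for (int j = 0; j < t->njobs; j++) {
        int hit = (j == job) ||
                  ((flags & INCLUDE_DESCENDANTS) && is_descendant(t, j, job));
        if (hit && t->alive[j]) {
            t->alive[j] = 0;
            count++;
        }
    }
    return count;
}
```

Without the flag, only the named job is affected and any jobs it spawned keep running.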
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).
Gridengine compiles but I cannot test (believe it likely will run).
Poe and xgrid compile to the extent they can without the proper include files.
This commit was SVN r12059.
Allow the POE RAS to be compiled for Linux as well as AIX.
The POE RAS is really a Loadleveler RAS, and IU now has
a cluster that uses Loadleveler in a Linux environment (BigRed).
This seems to be the only thing we need to do so far to run
Open MPI on BigRed. Yay :)
This commit was SVN r11600.
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unnamed structures
- no implicit casts.
Plus a full implementation for the orte_wait functions.
This commit was SVN r11347.
Other changes:
1. Remove the old xcpu components as they are not functional.
2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one.
This will require an autogen/configure, I'm afraid.
This commit was SVN r11228.
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).
Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).
I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).
In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...
Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.
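To illustrate why the "comparison between signed and unsigned" warnings matter, here is a small sketch. The cntr_t typedef is a hypothetical stand-in for orte_std_cntr_t, assumed here to be a signed 32-bit integer; the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for orte_std_cntr_t (assumed signed, 32-bit). */
typedef int32_t cntr_t;

/* Mixed comparison: the signed value is converted to size_t first, which is
 * what the compiler does implicitly (and warns about). A negative index
 * becomes a huge unsigned number. */
int mixed_less(int i, size_t n)
{
    return (size_t)i < n;
}

/* Same-type comparison: both operands are signed counters, so negative
 * values compare as expected. */
int same_less(cntr_t i, cntr_t n)
{
    return i < n;
}

/* Self-check documenting the behaviour described above. */
void check_examples(void)
{
    assert(mixed_less(-1, 10) == 0);   /* -1 wraps around, so "not less" */
    assert(same_less(-1, 10) == 1);    /* -1 < 10 as intended */
}
```

Using one counter type on both sides of a comparison removes both the warning and the surprising result.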
This commit was SVN r11204.
1. Changed the RMGR and PLS APIs to add "signal_job" and "signal_proc" entry points. Only the "signal_job" entries are implemented - none of the components have implementations for "signal_proc" at this time. Thus, you can signal all of the procs in a job, but cannot currently signal only one specific proc.
2. Implemented those new API functions in all components except xgrid (Brian will do so very soon). Only the rsh/ssh and fork modules have been tested, however, and only under OS-X.
3. Added signal traps and callback functions for SIGUSR1/2 to orterun/mpirun that catch those signals and call the appropriate commands to propagate them out to all processes in the job.
4. Added a new test directory under the orte branch to (eventually) hold unit and system level tests for just the run-time. Since our test branch of the repository is under restricted access, people working on the RTE were continually developing their own system-level tests - thus making it hard to help diagnose problems. I have moved the more commonly-used functions here, and added one specifically for testing the SIGUSR1/2 functionality.
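The trap-and-forward idea from item 3 might look roughly like the sketch below. It is not the orterun implementation: install_traps, forward_pending, and record_signal are hypothetical names, the callback stands in for a "signal_job" call, and plain signal() is used instead of the event library for brevity.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Set by the handler; only async-signal-safe work happens in the handler. */
static volatile sig_atomic_t pending_signal = 0;

static void trap(int signo)
{
    pending_signal = signo;
}

/* Install traps for SIGUSR1 and SIGUSR2. Returns 0 on success, -1 on error. */
int install_traps(void)
{
    if (signal(SIGUSR1, trap) == SIG_ERR) return -1;
    if (signal(SIGUSR2, trap) == SIG_ERR) return -1;
    return 0;
}

/* Hypothetical callback standing in for a "signal_job" entry point. */
typedef void (*signal_job_fn)(int signo);

/* Demo callback: just remembers the last signal it was asked to forward. */
int last_forwarded = 0;
void record_signal(int signo)
{
    last_forwarded = signo;
}

/* Called from the main loop: if a signal was trapped, pass it on to the job.
 * Returns the forwarded signal number, or 0 if nothing was pending. */
int forward_pending(signal_job_fn signal_job)
{
    int signo = pending_signal;
    assert(signal_job != NULL);
    if (signo == 0) {
        return 0;
    }
    pending_signal = 0;
    signal_job(signo);
    return signo;
}
```

A process that never receives SIGUSR1/2 never enters the forwarding path, which matches the note that the feature is transparent unless those signals are sent.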
I will be contacting people directly to seek help with testing the changes on more environments. Other than compile issues, you should see absolutely no change in behavior on any of your systems - this additional functionality is transparent to anyone who does not issue a SIGUSR1/2 to mpirun.
Ralph
This commit was SVN r10258.
- move files out of toplevel include/ and etc/, moving it into the
sub-projects
- rather than including config headers with <project>/include,
have them as <project>
- require all headers to be included with a project prefix, with
the exception of the config headers ({opal,orte,ompi}_config.h
mpi.h, and mpif.h)
This commit was SVN r8985.
command:
svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps .
(where "." is a trunk checkout)
The logs from this branch are much more descriptive than I will put
here (including a *really* long description from last night). Here's
the short version:
- fixed some broken implementations in ras and rmaps
- "orterun --host ..." now works and has clearly defined semantics
(this was the impetus for the branch and all these fixes -- LANL had
a requirement for --host to work for 1.0)
- there is still a little bit of cleanup left to do post-1.0 (we got
correct functionality for 1.0 -- we did not fix bad implementations
that still "work")
- rds/hostfile and ras/hostfile handshaking
- singleton node segment assignments in stage1
- remove the default hostfile (no need for it anymore with the
localhost ras component)
- clean up pls components to avoid duplicate ras mapping queries
- [possible] -bynode/-byslot being specific to a single app context
This commit was SVN r7664.
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
number to be set at autoconf time (instead of at configure time, as
it was before). Set the version number, minus the subversion r number,
at autoconf time. Override the internal variables to include the r
number (if needed) at configure time. Basically, the right thing
should always happen. The only place it might not is the version
reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE takes a list of options, no need to specify
them in all the Makefile.am files.
* Adds support for subdir-objects, meaning that object files are put
in the directory containing source files, even if the Makefile.am is
in another directory. This should start making it feasible to
reduce the number of Makefile.am files we have in the tree, which
will greatly reduce the time to run autogen and configure.
This commit was SVN r7211.
session directory cleanup (among other things)
- When we get an abnormal exit in orterun (i.e., timeout expires and
we haven't gotten termination notices from all processes), print a
better message and exit in a better way (which includes session
directory cleanup)
- Fix tm and poe pls's to not exit() but rather propagate the error up
the stack (where relevant)
This commit was SVN r7058.
ns_replica.c
- Removed the error logging since I use this function in orte_init_stage1 to
check if we have created a cellid yet or not.
ras_types.h & ras_base_node.h
- This was an empty file. Moved the orte_ras_node_t from base/ras_base_node.h
to this file.
- Changed the name of orte_ras_base_node_t to orte_ras_node_t to match the
naming mechanisms in place.
ras.h
- Exposed 2 functions:
- node_insert:
This takes a list of orte_ras_base_node_t's and places them in the Node
Segment of the GPR. This is to be used in orte_init_stage1 for singleton
processes, and the hostfile parsing (see rds_hostfile.c). This just puts
in the appropriate API interface to keep from calling the
orte_ras_base_node_insert function directly.
- node_query:
This is used in hostfile parsing. This just puts in the appropriate API
interface to keep from calling the orte_ras_base_node_query function
directly.
- Touched all of the implemented components to add reference to these new
function pointers
ras_base_select.c & ras_base_open.c
- Add and set the global module reference
rds.h
- Exposed 1 function:
- store_resource:
This stores a list of rds_cell_desc_t's to the Resource Segment.
This is used in conjunction with the orte_ras.node_insert function in
both the orte_init_stage1 for singleton processes and rds_hostfile.c
rds_base_select.c & rds_base_open.c
- Add and set the global module reference
rds_hostfile.c
- Added functionality to create a new cellid for each hostfile, placing
each entry in the hostfile into the same cellid. Currently this is
commented out with the cellid hard coded to 0, with the intention of
taking this out once ORTE is able to handle multiple cellid's
- Instead of just adding hosts to the Node Segment via a direct call to
the ras_base_node_insert() function, first add the hosts to the Resource
Segment of the GPR using the orte_rds.store_resource() function, then use
the API version of orte_ras.node_insert() to store the hosts on the Node
Segment.
- Add 1 new function pointer to module as required by the API.
rds_hostfile_component.c
- Converted this to use the new MCA parameter registration
orte_init_stage1.c
- It is possible that a cellid was not created yet for the current environment.
So I put in some logic to test whether cellid 0 exists. If it does, then
continue, otherwise create the cellid so we can properly interact with the
GPR via the RDS.
- For the singleton case we insert some 'dummy' data into the GPR. The RAS
matches this logic, so I took out the duplicate GPR put logic, and
replaced it with a call to the orte_ras.node_insert() function.
- Further, before calling orte_ras.node_insert() in the singleton case,
we also call orte_rds.store_resource() to add the singleton node to the
Resource Segment.
Console:
- Added a bunch of new functions. Still experimenting with many aspects of the
implementation. This is a checkpoint, and has very limited functionality.
- Should not be considered stable at the moment.
This commit was SVN r6813.
- After long discussions and ruminations on how we run components in
LAM/MPI, made the decision that, by default, all components included
in Open MPI will use the version number of their parent project
(i.e., OMPI or ORTE). They are certainly free to use a different
number, but this simplification makes the common cases easy:
- components are only released when the parent project is released
- it is easy (trivial?) to distinguish which version of a component goes
with which version of the parent project
- removed all autogen/configure code for templating the version .h
file in components
- made all ORTE components use ORTE_*_VERSION for version numbers
- made all OMPI components use OMPI_*_VERSION for version numbers
- removed all VERSION files from components
- configure now displays OPAL, ORTE, and OMPI version numbers
- ditto for ompi_info
- right now, faking it -- OPAL and ORTE and OMPI will always have the
same version number (i.e., they all come from the same top-level
VERSION file). But this paves the way for the Great Configure
Reorganization, where, among other things, each project will have
its own version number.
So all in all, we went from a boatload of version numbers to
[effectively] three. That's pretty good. :-)
This commit was SVN r6344.
* rename ompi_malloc to opal_malloc
* rename ompi_numtostr to opal_numtostr
* start of rename of ompi_environ to opal_environ
This commit was SVN r6332.
* rename ompi_basename to opal_basename
* rename ompi bitop functions to opal
* rename ompi_cmd_line to opal_cmd_line
* rename ompi_sizet2int to opal_sizet2int
* rename orte_daemon_init to opal_daemon_init
* rename ompi_few to opal_few
This commit was SVN r6330.