1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names.
2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (b) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used.
3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying.
Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed.
This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems.
This commit was SVN r15007.
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
This also involved a slight change to the oob.xcast API, so propagated that as required.
Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)
Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.
This commit was SVN r14475.
remote nodes. It will also kill off rogue orteds and orterun
processes. The killing of processes is ifdef'ed out for Windows
since I do not know how to do it there. Note that this change
will requite an autogen.
This commit was SVN r13477.
but remove them also. This current set of changes will affect
nothing as no one is making use of this ability. However, orte-clean
will be changed soon to utilize this new feature.
This commit was SVN r12996.
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unamed structures
- no implicit cast.
Plus a full implementation for the orte_wait functions.
This commit was SVN r11347.
- orte-clean.c : check to see if the base session directory is empty
and delete it if it is.
- orte_universe_exists.c : Fix a down stread problem resulting from
George's r10718 commit. Don't use the 'fulldirpath' since
that is no longer guarenteed to be the absolute path
to the session directory. Construct this value outside of that
function from the prefix and frontend vars.
This commit was SVN r10741.
The following SVN revision numbers were found above:
r10718 --> open-mpi/ompi@47eef2e002
After seeing the uglyness that is removing directories in the
codebase I decided to push down this to the OPAL by extending the
opal/os_create_dirpath.(c|h) to contain some more functionality.
In this process I renamed 'os_create_dirpath' to 'os_dirpath' since it
is a bit more general now.
Added a few functions to:
- check if an directory is empty
- check to see if the access permissions are set correctly
- destroy the directory at the end of the dirpath
- By using a caller callback function (a la Perl, I believe)
for every file, the caller can have fine grained control over
whether a specific file is deleted or not.
This simplifies things a bit for orte_session_dir_(finalize|cleanup)
as it should no longer contain any of this functionality, but uses
these functions to do the work.
From the external perspective nothing has changed, from the
developer point of view we have some cleaner, more generic code.
This commit was SVN r10640.
from the tmp/jjhursey-ft-cr branch.
In this commit we change the way universe names are created.
Before we by default first created "default-universe" then
if there was a conflict we created "default-universe-PID"
where PID is the PID of the HNP.
Now we create "default-universe-PID" all the time (when
a default universe name is used). This makes it much
easier when trying to find a HNP from an outside app
(e.g. orte-ps, orteconsole, ...)
This also adds a "search" function to find all of the
universes on the machine. This is useful in many contexts
when trying to find a persistent daemon or when trying to
connect to a HNP.
This commit also makes orte_universe_t an opal_object_t,
which is something that needed to happen, and only effected
the SDS in one of it's base functions.
I was asked to bring this over to aid in fixing orteconsole
and orteprobe. Due to the change of orte_universe_t to
an object orteprobe may need to be updated to reflect this
change. Since orteprobe needs to be looked at anyway I'll
leave this to Ralph to take care of.
*Note*:
These changes do not depend upon any of the FT work (but
the FT work does depend upon them). These were brought over
to help in fixing some of the ORTE tool set that require
the functionality layed out in this patch.
Testing:
Ran the 'ibm' tests before and after this change, and all was
as well as before the change. If anyone notices additional
irregularities in the system let me know. But none are expected.
This commit was SVN r10550.
- move files out of toplevel include/ and etc/, moving it into the
sub-projects
- rather than including config headers with <project>/include,
have them as <project>
- require all headers to be included with a project prefix, with
the exception of the config headers ({opal,orte,ompi}_config.h
mpi.h, and mpif.h)
This commit was SVN r8985.