ns_replica.c
- Removed the error logging since I use this function in orte_init_stage1 to
check if we have created a cellid yet or not.
ras_types.h & rase_base_node.h
- This was an empty file. moved the orte_ras_node_t from base/ras_base_node.h
to this file.
- Changed the name of orte_ras_base_node_t to orte_ras_node_t to match the
naming mechanisms in place.
ras.h
- Exposed 2 functions:
- node_insert:
This takes a list of orte_ras_base_node_t's and places them in the Node
Segment of the GPR. This is to be used in orte_init_stage1 for singleton
processes, and the hostfile parsing (see rds_hostfile.c). This just puts
in the appropriate API interface to keep from calling the
orte_ras_base_node_insert function directly.
- node_query:
This is used in hostfile parsing. This just puts in the appropriate API
interface to keep from calling the orte_ras_base_node_query function
directly.
- Touched all of the implemented components to add reference to these new
function pointers
ras_base_select.c & ras_base_open.c
- Add and set the global module reference
rds.h
- Exposed 1 function:
- store_resource:
This stores a list of rds_cell_desc_t's to the Resource Segment.
This is used in conjunction with the orte_ras.node_insert function in
both the orte_init_stage1 for singleton processes and rds_hostfile.c
rds_base_select.c & rds_base_open.c
- Add and set the global module reference
rds_hostfile.c
- Added functionality to create a new cellid for each hostfile, placing
each entry in the hostfile into the same cellid. Currently this is
commented out with the cellid hard coded to 0, with the intention of
taking this out once ORTE is able to handle multiple cellid's
- Instead of just adding hosts to the Node Segment via a direct call to
the ras_base_node_insert() function. First add the hosts to the Resource
Segment of the GPR using the orte_rds.store_resource() function then use
the API version of orte_ras.node_insert() to store the hosts on the Node
Segment.
- Add 1 new function pointer to module as required by the API.
rds_hostfile_component.c
- Converted this to use the new MCA parameter registration
orte_init_stage1.c
- It is possible that a cellid was not created yet for the current environment.
So I put in some logic to test if the cellid 0 existed. If it does then
continue, otherwise create the cellid so we can properly interact with the
GPR via the RDS.
- For the singleton case we insert some 'dummy' data into the GPR. The RAS
matches this logic, so I took out the duplicate GPR put logic, and
replaced it with a call to the orte_ras.node_insert() function.
- Further before calling orte_ras.node_insert() in the singleton case,
we also call orte_rds.store_resource() to add the singleton node to the
Resource Segment.
Console:
- Added a bunch of new functions. Still experimenting with many aspects of the
implementation. This is a checkpoint, and has very limited functionality.
- Should not be considered stable at the moment.
This commit was SVN r6813.
using orteprobe.
Created a header file for orte_setup_hnp. [HNP = Head Node Process]
General cleanup and added a bit of documentation in orte_setup_hnp.c
Also fixed a cellid tokens issue (circa line 285)
Changed the launched scope from private to public
In orteprobe:
- added reference to orted.h to avoid duplicate header contents in orteprobe.h
- removed the version tag, and put in a verbose argument
- Fixed a buffer packing problem that was causing the parent from receiving the
proper contact information for the new daemon.
This commit was SVN r6802.
If you register a parameter a second time, it overwrites the default
value (this was causing a problem with mpirun not being marked as orte
infrastructure, and therefore thinking that it was a singleton, and
therefore always adding the localhost into the node list).
This commit was SVN r6789.
- converted some things to new MCA param API
- renamed the pls_bproc_seed component struct so its name isn't the same as
the pls_bproc component's struct
- minor bugfixes
This commit was SVN r6774.
his e-mail:
I ran into a small bug in rmaps_rr.c: map_app_by_slot which was
triggered by using multiple app contexts. Basically, if not all the
slots we allocated on a node were used by an app, we would
automatically move onto the next node. This caused a problem with
multiple app contexts when the first app takes a partial allocation of
a node, the second app would not be able to access these slots because
we had already move past the node, and the byslot routine does not
wrap back around the list.
This commit was SVN r6766.
that were set on the command line. This was techinically exactly the
way the code was designed, but it certainly violated the Law of Least
Astonishment (even to its designer ;-) ). So now if you execute
something like this:
mpirun -mca pls_rsh_debug 1 -np 4 hello
You'll see debugging output from the rsh pls component, as you would
expect (this was not previously the case -- the MCA pls_rsh_debug
parame would be set to 1 in the 4 spawned hello processes, but *not*
in the orterun process).
More specifically, MCA parameters will be set in the orterun process
in the following cases:
- The new command line switch "--gmca" (or "-gmca") is used,
indicating that the MCA parameter is "global". --gmca also means
that that MCA parameter will be applied to all context app's. For
example:
mpirun -gmca foo bar -np 1 hello : -np 2 goodbye
The foo MCA param will be set in both the hello and goodbye
processes.
- If there is only one context app. For example:
mpirun -mca pls_rsh_debug 1 -np 4 hello
will set pls_rsh_debug to 1 in both the orterun process and the 4
spawned hello processes.
Also added a few more comments inside orterun to document a somewhat
confusing use of a state variable in a recursive case.
This commit was SVN r6764.
1. dump_xxx - analogous to the registry's dump commands, allows you to examine the contents of the name services' structures
2. get_job_peers - get an array of process names for all processes in the specified job
This commit was SVN r6759.
Somehow, in changing over to the new MCA interfaces, the "set" part of that logic got lost, so the singleton flag was always being set. This should repair some of the anomalous behavior seen recently where the local host was always being used for an application process.
This commit was SVN r6757.
containers when one is requested.
Fix a bug in gpr_replica_del_index_api which doesn't preset num_tokens and
num_keys, but assumes they are 0.
Fix orte_ras_base_node_delete() function to operate properly to delete the
appropriate container in the 'orte-node' segment when requested.
This commit was SVN r6756.
- convert MCA params to the new API
- some style and indenting fixes
- look at local shell, and if [new] MCA param
pls_rsh_assume_same_shell is 1, then assume that the remote shell is
the same as the local shell. If pls_rsh_assume_same_shell is 0, do
a probe to figure out what the remote shell is (NOT CURRENTLY
IMPLEMENTED! you'll get a run-time warning if you set this MCA param
to 0).
- if the remote shell is not csh and not bash, then prefix the remote
command with "( ! [ -e ./.profile ] || . ./.profile;" (and suffix it
with ")") so that we run the .profile on the remote side in order to
set PATHs and the like. See the LAM FAQ for details (will someday
be on the Open MPI FAQ:
http://www.lam-mpi.org/faq/category4.php3#question8)
- add a bunch of debugging output if the MCA param pls_rsh_debug is
enabled (or the top-level debug MCA param is enabled)
- add more help messages (and corresponding calls to opal_show_help())
in help-pls-rsh.txt
This commit was SVN r6731.
- we now properly support multiple application contexts
- much improved error messages, using opal_show_help
- fix some small bugs in the way the processes were discovering their names
- better searching for orted
- use the new mca parameter interface
These changes still need some testing, but they seem stable.
This commit was SVN r6719.
- Add functionality to parse multiple arguments provided in the console
- Cleaned up help function
- Added an option to hide commands from the help menu
Working on launching and reaping of daemons from within the console.
This commit was SVN r6699.
This required a little fiddling with a number of areas. Biggest problem was that it uncovered a potential for an infinite loop to be created in the registry. If a callback function modified the registry, the registry checked the triggers to see if anything had fired. Well, if the original callback was due to a trigger firing, that condition hadn't changed - so the trigger fired again....which caused the callback to be called, which modified the registry, which checked the triggers, etc. etc.
Triggers are now checked and then "flagged" as being "in process" so that the registry will NOT recheck that trigger until all callbacks have been processed. Tried doing this with subscriptions as well, but that caused a problem - when we release processes from a stagegate, they (at the moment) immediately place data on the registry that should cause a subscription to fire. Unfortunately, the system will just hang if that subscription doesn't get processed. So, I have left the subscription system alone - any callback function that modifies the registry in a fashion that will fire a subscription will indeed fire that subscription. We'll have to see if this causes problems - it shouldn't, but a careless user could lock things up if the callback generates a callback to itself.
Also fixed the code that placed a process' RML contact info on the registry to eliminate the leading '/' from the string.
This commit was SVN r6684.
- Added user help messages.
- Abstracted the internal commands, and the mechanism for
parsing and executing them.
- Cleaned up the command line parsing
- Some other misc. cleanup items.
Still much more work to do here, but should provide a more
intuitive interface for extending functionality in the
system.
This commit was SVN r6676.
more user friendly error messages.
Removed the "--version" command line option, since they should
get this from ompi_info [later to be orte_info].
If we find an invalid command line option print out the help
screen before exiting.
This commit was SVN r6670.
support in OMPI. Currently only enables/disables the architecture
sharing modex in ob1 pml.
* Add sds framework to ompi_info
* Figure out table ids to use for Portals BTL at configure time, since
we should use 30 & 31 on Red Storm, but the reference implementation
only supports 0-8.
* Some bug fixes in Portals UTCP sds
This commit was SVN r6650.
* Add Portals UTCP reference sds for when we are using the portals
reference implementation without the ORTE starters (when we want to
pretend like we're on Red Storm, only with a debugger and valgrind and
possibly even a printf that actually works...)
* Add super-secret --with flag to cnos rml to enable the cnos rml but
disable cnos_barrier (for use with portals utcp reference implementation)
This commit was SVN r6642.
test from orte_init_stage1 into a new framework, Startup Discovery Service
(sds). This allows us to have more flexibility with platforms like
Red Storm, which do not have a universe in the usual meaning and don't have
a seed daemon they can contact
This commit was SVN r6630.
- only call sched_yield if it exists
- don't fail out if modex doens't work in ob1
- bunch of fixes for Portals BTL
- add cnos rml component
- add NULL gpr component (should only be used if replica AND proxy
fail to load)
This commit was SVN r6629.
Haven't fully tested these yet (nobody is using them at the moment that I know of - good thing, since they haven't been working for a long time - though I know the MPI-2 stuff needs the functionality), but will do so shortly. For now, they compile.
This commit was SVN r6567.
1. Modify the registry to eliminate redundant data copying for startup messages.
2. Revise the subscription/trigger system to avoid redundant storage of triggers and subscriptions. This dramatically reduces the search time when a registry action occurs - to illustrate the point, there are now only a handful of triggers on the system for each job. Before, there were a handful of triggers for each PROCESS in the job, all of which had to be checked every time something happened on the registry. This is much, much faster now.
3. Update all subscriptions to the new format. There are now "named" subscriptions - this allows you to "name" a subscription that all the processes will be using. The first one to hit the registry actually defines the subscription. From then on, any subsequent "subscribes" to the same name just cause that process to "attach" to the existing subscription. This keeps the number of subscriptions being tracked by the registry to a minimum, while ensuring that each process still gets notified.
4. Do the same for triggers.
Also fixed a duplicate subscription problem that was causing people to receive data equal to the number of processes times the data they should have received from a trigger/subscription. Sorry about that... :-( ...but it's all better now!
Uncovered a situation where the modex data seems to be getting entered on the registry a second time - the latter time coming after the compound command has been "fired", thereby causing all the subscriptions to fire. Asked Tim and Jeff to look into this.
Second phase of the changes will involve modifying the xcast system so that the same message gets sent to all processes. This will further reduce the message traffic, and - once we have a true "broadcast" version of xcast - really speed things up and improve scalability.
This commit was SVN r6542.
1. Fix the reigstry's overwrite logic. It was only overwriting the first keyval specified in a value - the rest were just added on regardless of whether or not the keyval already existed. This was the source of the multiple keyvals some people were seeing - should be fixed now.
2. Change the orted command parsing options so it reports options that aren't recognized - should help reduce confusion
This commit was SVN r6536.
* Add ability to completely disable libltdl (the dlopen code to load
dynamic shared objects) to configure: --disable-dlopen
* Added MCA param (component_disable_dlopen) to disable DSO loading
at runtime
* Made the event library behave in some not-completely-erroneous way
on platforms where it has absolutely no eventops support (ie, no
select, poll, or epoll)
* Disabled orte_wait, opal_few, and opal_daemon_init code on
platforms without fork, waitpid support. All non-init functions
will return OPMI_ERR_NOT_SUPPORTED
* Disable orteprobe tool when fork or pipe aren't supported
This commit was SVN r6490.