59 Commits

Author SHA1 Message Date
Ralph Castain
70779fa2ab Cleanup some old logic - nothing major.
This commit was SVN r7712.
2005-10-12 01:12:27 +00:00
Josh Hursey
8ba2900341 fixed a typo, added comments for future work
This commit was SVN r7700.
2005-10-11 20:59:31 +00:00
Ralph Castain
e1244fc160 Fix a few thread-lock things discovered by Josh. The thread locks in the registry's local notify delivery system had not been updated to reflect the design change whereby the xcast uses the notify delivery system. This has now been fixed.
Also revised the callbacks to store and utilize local variables to avoid problems where threads modify the global structures. Not sure this totally fixes the problem, but it's a shot - suggested by Josh (and Jeff, I believe).

This commit was SVN r7694.
2005-10-11 19:35:04 +00:00
Ralph Castain
a47655b3fd Add unlock/lock around the delivery of a local callback to remove thread-lock condition if the callback function attempts to re-enter the registry.
This commit was SVN r7678.
2005-10-10 02:45:50 +00:00
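
The unlock/lock pattern described in this commit can be sketched as follows. This is a minimal Python illustration, not the registry's C code: the `Registry` class, its methods, and the callback signature are all hypothetical, and `threading.Lock` stands in for a non-recursive mutex.

```python
import threading


class Registry:
    """Toy registry guarded by one non-reentrant lock (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like a plain mutex
        self._data = {}
        self._callbacks = []

    def subscribe(self, callback):
        with self._lock:
            self._callbacks.append(callback)

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value):
        self._lock.acquire()
        try:
            self._data[key] = value
            pending = list(self._callbacks)  # snapshot taken while locked
            for cb in pending:
                # Drop the lock so the callback may re-enter the registry,
                # then take it back before touching shared state again.
                self._lock.release()
                try:
                    cb(self, key, value)
                finally:
                    self._lock.acquire()
        finally:
            self._lock.release()
```

Without the release/acquire pair, a callback that calls `get()` would block forever on the lock its own caller already holds.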
Ralph Castain
6c839048cf Fix a typo that caused valgrind to bark on 64-bit machines. Actually was a potential source of error, so the barking was legit.
This commit was SVN r7677.
2005-10-10 02:34:26 +00:00
Josh Hursey
d39841174d Must release the lock before entering the non-blocking recv, since
if the message has already arrived the callback will be called before
recv_buffer_nb() returns. This causes deadlock, as the callback tries to
acquire the lock that we already hold.

This was causing orterun and orteds to stall in certain situations.
It became evident when stress testing dynamics with remote nodes.

This commit was SVN r7543.
2005-09-29 14:24:11 +00:00
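
A small sketch of the ordering issue in this commit, assuming a receive call that runs its callback immediately when a message is already queued. The names `recv_nb`, `on_message`, and `post_receive` are hypothetical stand-ins written in Python; they are not the ORTE functions.

```python
import threading

lock = threading.Lock()  # non-reentrant
delivered = []


def on_message(msg):
    # The callback needs the same lock as the code that posted the receive.
    with lock:
        delivered.append(msg)


def recv_nb(queue, callback):
    """Hypothetical non-blocking receive: if a message has already arrived,
    the callback runs before this function returns."""
    if queue:
        callback(queue.pop(0))
        return True
    return False


def post_receive(queue):
    with lock:
        pass  # ... update shared state while holding the lock ...
    # The lock is released here, so an immediate callback cannot
    # deadlock against its own caller.
    return recv_nb(queue, on_message)
```

Calling `recv_nb` inside the `with lock:` block instead would hang as soon as a message was already waiting.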
Ralph Castain
b589a93e29 Continue to lace the trace functionality into orte...
This commit was SVN r7427.
2005-09-19 15:29:14 +00:00
Josh Hursey
575afef072 Use non blocking sends in orte_gpr_replica_remote_notify.
This fixes one of the race conditions that occurs when orterun is sent a kill signal.
Previously it would sometimes spin in the OOB waiting for a message to complete
to a peer that was no longer around. Stalling at this level prevented orterun
from noticing that it had received a kill signal.

This commit was SVN r7408.
2005-09-16 15:34:44 +00:00
Jeff Squyres
f4e8fe4817 Arrgh -- stupid mistake on last commit -- accidentally replaced a
LIBADD instead of appending to the existing one.

Also removed some more Makefile.options whitespace, and I think emacs
removed some tabs (i.e., replaced them with whitespace).

This commit was SVN r7399.
2005-09-15 21:37:24 +00:00
Jeff Squyres
15d0a95202 - Remove extra whitespace from Makefile.am's from when we removed
Makefile.options
- Sample in each of the three projects of how to link against the
  relevant libraries so that when components are loaded into a parent
  process' space, we don't rely on the libopal/liborte/libmpi symbols
  being in the parent's public symbol namespace -- instead,
  dynamically link to the relevant libraries, allowing the dynamic
  linker to pull those libraries in at run-time, if needed

This commit was SVN r7397.
2005-09-15 20:56:18 +00:00
Tim Woodall
d4ef08c074 Ralph - please review and revise if necessary.
Add support for PRE_EXISTING values on new subscription

This commit was SVN r7334.
2005-09-13 03:51:58 +00:00
Ralph Castain
76ccec0cee Upgrade the new opal trace system to utilize verbosity. Begin building the trace command into the ORTE system.
This commit was SVN r7267.
2005-09-09 18:27:17 +00:00
Jeff Squyres
d19e5b4af8 Remove unused variable
This commit was SVN r7250.
2005-09-09 10:11:46 +00:00
Ralph Castain
7fbe575edd Make sure rc is initialized.
This commit was SVN r7233.
2005-09-08 13:20:38 +00:00
Brian Barrett
ed56e743b7 * update configure.ac to use the modern version of AC_INIT and
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
  number to be set at autoconf time (instead of at configure time, as
  it was before).  Set the version number, minus the subversion r number,
  at autoconf time.  Override the internal variables to include the r
  number (if needed) at configure time.  Basically, the right thing
  should always happen.  The only place it might not is the version
  reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE takes a list of options, no need to specify
  them in all the Makefile.am files.
* Adds support for subdir-objects, meaning that object files are put
  in the directory containing source files, even if the Makefile.am is
  in another directory.  This should start making it feasible to
  reduce the number of Makefile.am files we have in the tree, which
  will greatly reduce the time to run autogen and configure.

This commit was SVN r7211.
2005-09-07 05:54:53 +00:00
Ralph Castain
47bf2574e1 Ensure that subscriptions for a specific requestor/subscription return id only get registered once. It appears that sometimes the system registers a subscription for the same return location multiple times. This prevents getting multiple callbacks when that happens. Still need to track down why it is happening at all.
This commit was SVN r7197.
2005-09-06 16:33:41 +00:00
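
One way to register a subscription only once per requestor and return id is to key registrations on that pair. The sketch below is a Python illustration under that assumption; `SubscriptionTable` and its fields are hypothetical and do not reflect the registry's actual data structures.

```python
class SubscriptionTable:
    """Tracks which (requestor, return id) pairs are already registered."""

    def __init__(self):
        self._seen = set()
        self.subscriptions = []

    def register(self, requestor, return_id, callback):
        key = (requestor, return_id)
        if key in self._seen:
            return False  # duplicate: keep the existing registration
        self._seen.add(key)
        self.subscriptions.append((requestor, return_id, callback))
        return True
```

A repeated `register()` call for the same pair is ignored, so the requestor gets a single callback even if the caller registers twice.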
Rainer Keller
a36347d728 - Support -prefix specification on mpirun/orterun cmd-line per
app_context:
  mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \
         -np 2 -prefix /path/to/ompi/on/machineB ./exec2

- With -mca pls_rsh_assume_same_shell 0, check the SHELL variable on
  the actual node (currently the 1st node) instead of assuming it
  matches the local shell. Sets the prefix, PATH and LD_LIBRARY_PATH
  for bash/ksh and csh/tcsh.

This commit was SVN r7195.
2005-09-06 16:10:05 +00:00
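
The two shell families need different syntax to set PATH and LD_LIBRARY_PATH, which is why the remote SHELL matters. The Python sketch below only illustrates that difference; `prefix_env_commands` is a hypothetical helper, the `bin` and `lib` subdirectories are assumptions, and the strings it emits are not the commands the rsh launcher actually generates.

```python
def prefix_env_commands(prefix, shell):
    """Return shell commands that put <prefix>/bin and <prefix>/lib first.

    Caveat: csh/tcsh abort on an unset variable, so real code would have to
    test $?LD_LIBRARY_PATH before expanding it; that check is omitted here.
    """
    name = shell.rsplit("/", 1)[-1]  # "/bin/tcsh" -> "tcsh"
    if name in ("bash", "ksh"):
        return [
            f'PATH="{prefix}/bin:$PATH"; export PATH',
            f'LD_LIBRARY_PATH="{prefix}/lib:$LD_LIBRARY_PATH"; '
            "export LD_LIBRARY_PATH",
        ]
    if name in ("csh", "tcsh"):
        return [
            f'setenv PATH "{prefix}/bin:$PATH"',
            f'setenv LD_LIBRARY_PATH "{prefix}/lib:$LD_LIBRARY_PATH"',
        ]
    raise ValueError(f"unsupported shell: {shell}")
```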
Ralph Castain
7fc67f57a5 Little logic cleanup and handle thread locking correctly.
This commit was SVN r7192.
2005-09-06 14:04:43 +00:00
George Bosilca
648ef2ae5c One of the latest gcc versions barks about a variable being used uninitialized. It was kind of right, because the
variable was protected by another one... But with a few modifications I got rid of this warning.

This commit was SVN r7189.
2005-09-06 03:13:03 +00:00
Ralph Castain
f352890732 Cleaning up memory leaks for proxy operations.
This commit was SVN r7157.
2005-09-02 19:26:21 +00:00
Ralph Castain
4bd25e0292 Few minor memory leak cleanups
This commit was SVN r7156.
2005-09-02 18:50:01 +00:00
Ralph Castain
66a215eae1 More memory cleanup...
1. Valgrind is good for something - chasing down memory leaks in registry led me to re-visit the dictionary functions and discover that I wasn't keeping track of the number of dictionary entries on each segment! Resulted in wasted time searching blank entries as well as leaked memory. This has now been fixed.

2. Fixed the orte_bitmap test. The init function for that class has been eliminated and the constructor adjusted to provide that functionality.

This commit was SVN r7136.
2005-09-02 00:26:58 +00:00
Ralph Castain
76e622a552 Clean up a few memory leaks - more to go...
This commit was SVN r7134.
2005-09-01 17:38:04 +00:00
Ralph Castain
03e45e6723 Two quick additions:
1. Added OMPI_PROC_ARCH as a defined registry key and added the code so that the architecture info gets properly transmitted across all processes using the startup message.

2. Added an OMPI_MODEX_KEY definition and removed the hard-coded "modex" key from pml_modex_exchange

This commit was SVN r7129.
2005-09-01 15:05:03 +00:00
Jeff Squyres
3962c53e2e - Add to AM_CPPFLAGS $(OPAL_LTDL_CPPFLAGS) where necessary in order to
add a -I to find the included ltdl.h (vs. a system-installed ltdl.h)
- Clean up cruft in a bunch of Makefile.am's to remove now-unnecessary
  AM_CPPFLAGS settings to get static-components.h for each framework
- Move the component_repository API functions out of opal/mca/base/base.h
  and into opal/mca/base/mca_base_component_repository.h in order to
  decrease unnecessary dependencies (e.g., before this, almost
  everything in the tree depended on ltdl.h, which is unnecessary --
  only a small number of files really need ltdl.h)

This commit was SVN r7127.
2005-09-01 12:16:36 +00:00
Ralph Castain
96f4bb7a63 Hey, sports fans!! Guess what??
Here's the huge registry check-in you've all been waiting for with bated breath. The revised version sends a single message to all processes at the various stage gates, thus making the startup much more scalable. I could provide you with all the tawdry details, but won't for now - you are welcome to ask, though, and I'll merrily bore your ears to tears.

In addition, the commit contains the following:

1. set the ignore properties on ompi/debuggers and orte/mca/pls/poe

2. Added simplified subscribe and put functions to the registry's API. I have also converted all of the ompi functions that registered subscriptions to the new API, and caught their associated put's as well.

In a follow-on commit, I'll be adding support for George's hetero arch registry subscription (wanted to get this one in first).

This commit was SVN r7118.
2005-09-01 01:07:30 +00:00
Rainer Keller
f52784bad3 - Just changes to comments, deletion of spaces to make diff smaller
This commit was SVN r7030.
2005-08-25 15:42:41 +00:00
Ralph Castain
5d7e5b17e0 Add these two functions so I don't have to keep adding them when I transfer diffs around.
NOTE: These have NOT been added to the Makefile.am in the repository. Please do NOT add them at this time - I will do so later.

This commit was SVN r6979.
2005-08-23 03:23:53 +00:00
Rainer Keller
f0c2f78dd4 - Another one, just missed.
This commit was SVN r6976.
2005-08-22 18:12:05 +00:00
Rainer Keller
1ac8c75965 - Nothing of interest: Fixed comments, indentation...
To get a clear view on the next patch.

This commit was SVN r6975.
2005-08-22 18:02:10 +00:00
Jeff Squyres
cce0950df7 - change a bunch of OMPI_* constants or ORTE_* equivalents
- change the framework opens to [mostly] use the new MCA param API
- properly pass in framework debug output streams to the
  mca_base_component_open() function

This commit was SVN r6888.
2005-08-15 18:25:35 +00:00
Josh Hursey
3b187c4db3 Fix the 'delete container' logic in gpr to prevent recursive delete of all
containers when one is requested.

Fix a bug in gpr_replica_del_index_api which doesn't preset num_tokens and
num_keys, but assumes they are 0.

Fix orte_ras_base_node_delete() function to operate properly to delete the
appropriate container in the 'orte-node' segment when requested.

This commit was SVN r6756.
2005-08-05 23:37:39 +00:00
Ralph Castain
4e1837687b Finish simplified interfaces for put and subscribe - more details to come.
This commit was SVN r6713.
2005-08-02 19:43:29 +00:00
Ralph Castain
8c6c78c47a Add a few new functions that were requested last week - not tested yet, so please don't use them! I will test them this afternoon on a different computer. For now, they won't cause any problems since they aren't being called.
This commit was SVN r6689.
2005-08-01 16:38:15 +00:00
Ralph Castain
4e79a51395 Add a job_info segment to the system that holds a container for each job. Within each container is a keyval indicating the job state (i.e., all procs at stage1, finalized, etc.). This provides a rough state-of-health for the job.
This required a little fiddling with a number of areas. Biggest problem was that it uncovered a potential for an infinite loop to be created in the registry. If a callback function modified the registry, the registry checked the triggers to see if anything had fired. Well, if the original callback was due to a trigger firing, that condition hadn't changed - so the trigger fired again....which caused the callback to be called, which modified the registry, which checked the triggers, etc. etc.

Triggers are now checked and then "flagged" as being "in process" so that the registry will NOT recheck that trigger until all callbacks have been processed. Tried doing this with subscriptions as well, but that caused a problem - when we release processes from a stagegate, they (at the moment) immediately place data on the registry that should cause a subscription to fire. Unfortunately, the system will just hang if that subscription doesn't get processed. So, I have left the subscription system alone - any callback function that modifies the registry in a fashion that will fire a subscription will indeed fire that subscription. We'll have to see if this causes problems - it shouldn't, but a careless user could lock things up if the callback generates a callback to itself.

Also fixed the code that placed a process' RML contact info on the registry to eliminate the leading '/' from the string.

This commit was SVN r6684.
2005-07-29 14:11:19 +00:00
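
The "in process" flag described in this commit can be illustrated with a short sketch. It is written in Python with hypothetical names (`Trigger`, `TriggerRegistry`, `in_process`) and simplifies the real behavior: the flag here is cleared as soon as the trigger's own callback returns.

```python
class Trigger:
    def __init__(self, condition, callback):
        self.condition = condition  # predicate over the registry data
        self.callback = callback
        self.in_process = False     # set while the callback is running


class TriggerRegistry:
    def __init__(self):
        self.data = {}
        self.triggers = []

    def put(self, key, value):
        self.data[key] = value
        self._check_triggers()

    def _check_triggers(self):
        for trig in self.triggers:
            if trig.in_process:
                continue  # do not recheck until its callback has finished
            if trig.condition(self.data):
                trig.in_process = True
                try:
                    trig.callback(self)  # may call put() again
                finally:
                    trig.in_process = False
```

Without the flag, a callback that calls `put()` would re-fire the still-true trigger and recurse without end.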
George Bosilca
9fdfbd9934 correct the printf for 64 bits architectures.
This commit was SVN r6667.
2005-07-28 19:54:06 +00:00
Brian Barrett
747f23099e * fix some warnings
This commit was SVN r6661.
2005-07-28 19:25:47 +00:00
Brian Barrett
6aa464b67e More changes from Red Storm port
  - only call sched_yield if it exists
  - don't fail out if modex doesn't work in ob1
  - bunch of fixes for Portals BTL
  - add cnos rml component
  - add NULL gpr component (should only be used if replica AND proxy
    fail to load)

This commit was SVN r6629.
2005-07-27 23:07:14 +00:00
Ralph Castain
13fdcff66b Fix a bug Greg was seeing on subscription returns - problem in pointer arithmetic
This commit was SVN r6594.
2005-07-22 20:46:07 +00:00
Ralph Castain
f604fb72db Turn "on" the delete functionality for the registry. Should now be able to delete entries and segments, and get an index of the dictionary entries on the registry.
Haven't fully tested these yet (nobody is using them at the moment that I know of - good thing, since they haven't been working for a long time - though I know the MPI-2 stuff needs the functionality), but will do so shortly. For now, they compile.

This commit was SVN r6567.
2005-07-20 18:07:46 +00:00
Ralph Castain
5e437f9a09 Fix a potential "free" that shouldn't happen
This commit was SVN r6552.
2005-07-19 16:21:06 +00:00
Ralph Castain
9af1739d33 Correct an opal_hash_table_get/set_proc name to orte_hash_table_get/set_proc.
Remove a couple of unused variable complaints from registry dump.

This commit was SVN r6550.
2005-07-19 13:33:04 +00:00
Jeff Squyres
7e413d6c26 Remove mistaken return with a value in a void function.
This commit was SVN r6548.
2005-07-19 12:23:41 +00:00
Ralph Castain
485e549f38 missing file
This commit was SVN r6545.
2005-07-18 21:18:26 +00:00
Ralph Castain
19d58ee17e First phase of the scalable RTE changes:
1. Modify the registry to eliminate redundant data copying for startup messages.

2. Revise the subscription/trigger system to avoid redundant storage of triggers and subscriptions. This dramatically reduces the search time when a registry action occurs - to illustrate the point, there are now only a handful of triggers on the system for each job. Before, there were a handful of triggers for each PROCESS in the job, all of which had to be checked every time something happened on the registry. This is much, much faster now.

3. Update all subscriptions to the new format. There are now "named" subscriptions - this allows you to "name" a subscription that all the processes will be using. The first one to hit the registry actually defines the subscription. From then on, any subsequent "subscribes" to the same name just cause that process to "attach" to the existing subscription. This keeps the number of subscriptions being tracked by the registry to a minimum, while ensuring that each process still gets notified.

4. Do the same for triggers.

Also fixed a duplicate subscription problem that was causing people to receive data equal to the number of processes times the data they should have received from a trigger/subscription. Sorry about that... :-( ...but it's all better now!

Uncovered a situation where the modex data seems to be getting entered on the registry a second time - the latter time coming after the compound command has been "fired", thereby causing all the subscriptions to fire. Asked Tim and Jeff to look into this.

Second phase of the changes will involve modifying the xcast system so that the same message gets sent to all processes. This will further reduce the message traffic, and - once we have a true "broadcast" version of xcast - really speed things up and improve scalability.

This commit was SVN r6542.
2005-07-18 18:49:00 +00:00
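
The "named" subscription idea in item 3 can be sketched as a table keyed by name, where the first caller defines the entry and later callers attach to it. This is a Python illustration only; `NamedSubscriptions` and its methods are hypothetical and say nothing about the registry's real API.

```python
class NamedSubscriptions:
    def __init__(self):
        self._by_name = {}  # name -> {"keys": (...), "attached": [...]}

    def subscribe(self, name, keys, process):
        sub = self._by_name.get(name)
        if sub is None:
            # The first process to arrive defines the subscription.
            sub = {"keys": tuple(keys), "attached": []}
            self._by_name[name] = sub
        # Later callers only attach; their `keys` argument is ignored.
        if process not in sub["attached"]:
            sub["attached"].append(process)
        return sub

    def notify(self, name, data):
        """Return one (process, data) delivery per attached process."""
        return [(p, data) for p in self._by_name[name]["attached"]]

    def __len__(self):
        return len(self._by_name)
```

With N processes subscribing under one name, the table holds a single entry to search while each process still receives its notification.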
Ralph Castain
526217b9fc Two things here:
1. Fix the registry's overwrite logic. It was only overwriting the first keyval specified in a value - the rest were just added on regardless of whether or not the keyval already existed. This was the source of the multiple keyvals some people were seeing - should be fixed now.

2. Change the orted command parsing options so it reports options that aren't recognized - should help reduce confusion

This commit was SVN r6536.
2005-07-16 23:08:15 +00:00
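
The overwrite fix in item 1 amounts to applying the "replace if present" check to every keyval rather than only the first. A minimal Python sketch, assuming a container is a list of `[key, value]` pairs; `put_keyvals` is a hypothetical name, not the registry function.

```python
def put_keyvals(container, keyvals, overwrite=True):
    """Store (key, value) pairs in `container`, a list of [key, value] pairs."""
    for key, value in keyvals:
        existing = next((kv for kv in container if kv[0] == key), None)
        if overwrite and existing is not None:
            existing[1] = value  # replace in place, for every key
        else:
            container.append([key, value])
    return container
```

The bug described above corresponds to running the lookup only for the first pair and appending the rest unconditionally, which leaves duplicate keys behind.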
Ralph Castain
44ace2f64e Well, I think this will fix the bug Greg encountered when sending no triggers on a subscription. However, I can't test it since the trunk no longer runs on my Mac notebook - I get an error message "No ptl components available. This shouldn't happen." and the processes exit.
This commit was SVN r6476.
2005-07-14 01:32:36 +00:00
Ralph Castain
81af57707f Don't release the message buffer - the messaging function takes care of it.
This commit was SVN r6437.
2005-07-12 15:41:45 +00:00
Brian Barrett
a991d883c1 * Rewrite ompi_mca.m4 to use m4_defined lists of projects (ompi, orte, etc.),
frameworks, and components without configure scripts instead of
  hard-coded shell variables (for projects and frameworks) and 
  shell variable building (for components).
* Add 3rd category of component configuration (in addition to configure
  scripts and no-configured components): configure.m4 components.  These
  components can only be built as part of OMPI (like no-configure), but
  can provide an m4 file that is run as part of the main configure
  script.  These macros can set whether the component should be built, 
  along with just about any other configuration wanted.  More care must
  be taken compared to configure components, as doing things like setting
  variables or calling AC_MSG_ERROR now affects the top-level configure
  script (so calling AC_MSG_ERROR if your component can't configure
  probably isn't what you want)
* Added support to autogen.sh for the configure.m4-style components,
  as well as building up the m4_define lists ompi_mca.m4 now expects
* Updated a number of macros to be more config.cache friendly (both
  so that config.cache can be used and so the test can be quickly
  run multiple times in the same configure script):
    - ompi_config_asm
    - c_weak_symbols
    - c_get_alignment
* Added new macros to be shared when configuring components:
    - ompi_objc.m4 (this actually provides AC_PROG_OBJC - don't ask...)
    - ompi_check_xgrid
    - ompi_check_tm
    - ompi_check_bproc
* Updated a number of components to use configure.m4 instead of
  configure.stub
    - btl portals
    - io romio
    - tm ras and pls
    - bjs, lsf_bproc ras and bproc_seed pls
    - xgrid ras and pls
    - null iof (used by tm) 

This commit was SVN r6412.
2005-07-09 18:52:53 +00:00
Brian Barrett
0ae16f2ab7 * add local hook to remove static-components.h in distclean target. The
files are generated by configure, and not part of the tarball, so
  distclean would be the right place to remove them.

This commit was SVN r6390.
2005-07-08 13:54:12 +00:00