1
1

28 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
e1e224b81a Silence a couple of minor compiler warnings
This commit was SVN r18617.
2008-06-09 12:57:41 +00:00
Ralph Castain
7bee71aa59 Fix a potential, albeit perhaps esoteric, race condition that can occur for fast HNP's, slow orteds, and fast apps. Under those conditions, it is possible for the orted to be caught in its original send of contact info back to the HNP, and thus for the progress stack never to recover back to a high level. In those circumstances, the orted can "hang" when trying to exit.
Add a new function to opal_progress that tells us our recursion depth to support that solution.

Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death...

Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang.

After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them.

Thanks you Jeff! :-/

This commit was SVN r18611.
2008-06-06 19:36:27 +00:00
George Bosilca
b2aa751c28 Remove a race condition in the threaded mode. As a callback is allowed
to modify the callback array (add or remove), make sure we don't call
the same callback twice if it get remove in another thread.

This commit was SVN r18608.
2008-06-06 15:54:40 +00:00
Brian Barrett
de2c4deeda Fix deadlock in thread case exposed by ORTE message model -- if we are
in a callback from the event library and post an RML receive, we'll
deadlock because the event library wouldn't be entered until the
event library was not already entered.  Now just protect data structures
(which we were basically already doing) instead of code, like good
threading people ;).

This commit was SVN r15585.
2007-07-24 19:10:19 +00:00
Ralph Castain
c7be9a7121 Complete backout of prior sched_yield and paffinity changes
This commit was SVN r13530.
2007-02-07 14:22:37 +00:00
Brian Barrett
cf8bc2ad0b print debugging information about switing the event flag and print the
initialization information

This commit was SVN r13497.
2007-02-05 19:38:54 +00:00
Brian Barrett
e130f18cc2 Fix some compiler warnings that have slipped in lately...
This commit was SVN r13037.
2007-01-08 17:20:09 +00:00
Rainer Keller
20d5c35f43 - Add header needed for OPAL_OUTPUT.
This commit was SVN r12647.
2006-11-22 12:20:08 +00:00
Brian Barrett
33320b7165 Rework the opal_progress interface to better support dynamic processes and at
the same time, remove some of the MPI-related options from OPAL:

  - provide mechanism to change at runtime whether sched_yield() should 
    be called when the progress engine is idle
  - provide mechanism for changing the rate at which the event engine
    is called when there are "no" users of the event engine (ie, when
    using MPI but not TCP)
  - fix some function names in the progress engine to better match
    their intended use (and remove MPI naming scheme)
  - remove progress_mpi_enable / progress_mpi_disable because 
    we can now use the functions to set the sched_yield and
    tick rate interfaces
  - rename opal_progress_events() to opal_progress_set_event_flag()
    because the first really isn't descriptive of what the function
    does and I always got confused by it

This commit was SVN r12645.
2006-11-22 02:06:52 +00:00
Brian Barrett
f3a4026b39 * fix the comment, too
This commit was SVN r11392.
2006-08-24 14:06:01 +00:00
Brian Barrett
9e5f5fe0af * make the name of the define be correct...
This commit was SVN r11391.
2006-08-24 14:02:26 +00:00
George Bosilca
9d26565e27 Add the sched_yield for Windows.
This commit was SVN r11304.
2006-08-21 20:08:51 +00:00
Brian Barrett
a84e557815 Add new loop mode OPAL_EVLOOP_ONELOOP that behaved like OPAL_EVLOOP_ONCE
did pre-libevent update.  The problem is that the behavior of 
OPAL_EVLOOP_ONCE was changed by the OMPI team, which them broke things
during the update, so it had to be reverted to the old meaning of
loop until one event occurs.  OPAL_EVLOOP_ONELOOP will go through the
event loop once (like EVLOOP_NONBLOCK) but will pause in the event
library for a bit (like EVLOOP_ONCE).

fixes trac:234

This commit was SVN r11081.

The following Trac tickets were found above:
  Ticket 234 --> https://svn.open-mpi.org/trac/ompi/ticket/234
2006-08-01 22:23:57 +00:00
Tim Woodall
8bf6ed7a36 - corrected locking in gm btl - gm api is not thread safe
- initial support for gm progress thread
- corrected threading issue in pml
- added polling progress for a configurable number of cycles to wait for threaded case

This commit was SVN r9188.
2006-03-02 00:39:07 +00:00
George Bosilca
670cefa1d0 Reorder the if's to avoid doing useless functions calls on the fast path.
This commit was SVN r9061.
2006-02-16 16:08:12 +00:00
Brian Barrett
566a050c23 Next step in the project split, mainly source code re-arranging
- move files out of toplevel include/ and etc/, moving it into the
    sub-projects
  - rather than including config headers with <project>/include, 
    have them as <project>
  - require all headers to be included with a project prefix, with
    the exception of the config headers ({opal,orte,ompi}_config.h
    mpi.h, and mpif.h)

This commit was SVN r8985.
2006-02-12 01:33:29 +00:00
George Bosilca
bd0ee62e62 Protect headers and use __WINDOWS__ for Windows code.
This commit was SVN r8468.
2005-12-12 22:01:51 +00:00
Jeff Squyres
42ec26e640 Update the copyright notices for IU and UTK.
This commit was SVN r7999.
2005-11-05 19:57:48 +00:00
Brian Barrett
dfdb5dc12a * high resolution, low latency timers for a number of platforms, plus mods
to opal_progress() to use the timers instead of a tick count for deciding
  whether to call the event loop or not.  Currently supported platforms are:

     - solaris (x86 / sparc)
     - Linux (x86 / x86_64 / IA64)
     - Mac OS X (x86 / Power PC)

This commit was SVN r6922.
2005-08-18 05:34:22 +00:00
Jeff Squyres
cf2c8b45a8 - Use proper prefixes for all #include statements (opal/, orte/, and
ompi/).
- There's still a handful of places that have orte/ #include files;
  still need to clean those up
- A lot of places still use ompi/include/constants.h -- those need to
  be converted over to use OPAL_ return codes and then switch to the
  opal constants.h.  This commit is the first few steps towards
  that...

This commit was SVN r6843.
2005-08-12 20:46:25 +00:00
Brian Barrett
ad383f5fcd * If the event library is going to be a noop (really, this only happens on
Red Storm), don't bother doing all the book keeping work to do calls
  into the event library

This commit was SVN r6829.
2005-08-12 16:21:17 +00:00
Brian Barrett
57efade1e2 * bump the event_tick_rate count from 100 to 10k, per conversation with Tim.
Might want to bump this higher?
* remove assumption that progress functions registered with opal_progress
  are not reentrant.  There is still the assumption that the event loop
  is not reentrant (because it's not), so that part is still protected by
  a spinlock. 

This commit was SVN r6827.
2005-08-12 16:08:44 +00:00
Brian Barrett
a926a9b4fb * I'm sure any decent optimizing compiler would have figured this one out,
but there's no point in having the call_yield check performed if we
  don't have sched_yield to call in the first place.

This commit was SVN r6823.
2005-08-12 15:21:59 +00:00
Brian Barrett
aab684f159 * fix off by one error that was causing the event library to only be
triggered every other call to opal_progress when the TCP BTL/PTL
  were being used.

This commit was SVN r6822.
2005-08-12 15:20:37 +00:00
Brian Barrett
6aa464b67e More changes from Red Storm port
- only call sched_yield if it exists
  - don't fail out if modex doens't work in ob1
  - bunch of fixes for Portals BTL
  - add cnos rml component
  - add NULL gpr component (should only be used if replica AND proxy
    fail to load)  

This commit was SVN r6629.
2005-07-27 23:07:14 +00:00
Brian Barrett
a13166b500 * rename ompi_output to opal_output
This commit was SVN r6329.
2005-07-03 23:31:27 +00:00
Brian Barrett
23b687b0f4 * rename ompi_event to opal_event
This commit was SVN r6328.
2005-07-03 23:09:55 +00:00
Brian Barrett
ccd2624e3f * rename ompi_progress to opal_progress
This commit was SVN r6326.
2005-07-03 21:57:43 +00:00