1
1
Граф коммитов

13870 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
c749fefbd0 Instead of an odls-base mca param, make report_bindings a global param so that we can (a) detect it was set in the plm, and then (b) ensure it gets passed along to remote orteds so they will comply with the request.
This commit was SVN r22021.
2009-09-28 03:17:15 +00:00
Ralph Castain
47c9a5409e Ensure that tools init the multicast channel correctly
This commit was SVN r22020.
2009-09-28 03:15:51 +00:00
Ralph Castain
ef0fd8b8d1 Return an error code if the job failed to start
This commit was SVN r22019.
2009-09-26 03:34:58 +00:00
Ralph Castain
e337fa686e Correct handling of pointer array indexing
This commit was SVN r22018.
2009-09-26 03:33:55 +00:00
Ralph Castain
6fa2e81491 Correct handling of pointer array indexing
This commit was SVN r22017.
2009-09-26 03:33:26 +00:00
Jeff Squyres
bc3060d668 Fixes trac:2028. George and I found this via some collaborative debugging
(yay cisco webex!).  Make sure we only go up to OPAL max datatype, not
OMPI max datatype.

This commit was SVN r22016.

The following Trac tickets were found above:
  Ticket 2028 --> https://svn.open-mpi.org/trac/ompi/ticket/2028
2009-09-25 21:52:42 +00:00
Ethan Mallove
e9014ff20e Adjust comment according to r22014
This commit was SVN r22015.

The following SVN revision numbers were found above:
  r22014 --> open-mpi/ompi@9e951ce664
2009-09-25 19:53:16 +00:00
Ethan Mallove
9e951ce664 Remove static from MPIR_Breakpoint so Intel compilers will not inline it
This commit was SVN r22014.
2009-09-25 19:14:19 +00:00
Jeff Squyres
1886d5a004 Remove the libopenmpi_malloc library; it is only necessary for
backwards compatibility in the v1.3 series.

This commit was SVN r22013.
2009-09-25 17:09:54 +00:00
Ralph Castain
709b36efb4 Cleanup auto-wireup and enable tools to "discover" the HNP via multicast
This commit was SVN r22012.
2009-09-25 01:00:09 +00:00
Abhishek Kulkarni
2af7657db1 A few changes to the FTB notifier interface:
- add an orte ftb notifier help file for more verbose error messages
- check if we can connect to the FTB during component->query and close
  the component, if we cannot.
- make the ftb component interface methods static.
- add mca parameters to set override the default subscription style and
  priority.

This commit was SVN r22011.
2009-09-24 23:56:41 +00:00
Jeff Squyres
3340f62e5f This directory has not been used in years; no one has contributed
anything useful to it.  So let's ditch it.  It can always be brought
back if someone wants to put something useful in here.

This commit was SVN r22010.
2009-09-24 14:09:59 +00:00
George Bosilca
56c653ebcd Add some comments.
This commit was SVN r22008.
2009-09-24 00:08:28 +00:00
George Bosilca
74d56d51ac Fix a typo.
This commit was SVN r22007.
2009-09-23 23:36:44 +00:00
Ralph Castain
3167f0a0a0 Complete the next round of the multicast framework development. Needs further polish, upgrade to handle message fragmentation - but good enough for auto-bootstrap of orteds.
Teach the ess cm module to bootstrap orted launch

This commit was SVN r22006.
2009-09-23 20:57:49 +00:00
Jeff Squyres
152bc14079 Rename the help file to be consistent with others; add it to the Makefile.am.
This commit was SVN r22005.
2009-09-23 20:28:49 +00:00
Josh Hursey
c9bd045cff move {{{ess_env_ft_event_update_process_info}}} into SnapC {{{snapc_full_app_ft_event_update_process_info}}} where it should have been all along.
This commit was SVN r22004.
2009-09-23 18:29:13 +00:00
Josh Hursey
a6ee73156c Add a verbose debug options. And add some error prints in the ESS' ft_event code.
This commit was SVN r22003.
2009-09-23 17:05:49 +00:00
Josh Hursey
2769091261 Fix for the stalled scenario in which 'options' might be reset to NULL inadvertently.
Thanks to MTT for picking this up.

This commit was SVN r22002.
2009-09-23 13:26:48 +00:00
Ralph Castain
26bb6e8f79 Add a couple of non-orte multicast tests
This commit was SVN r22001.
2009-09-23 05:24:22 +00:00
Jeff Squyres
ef338602ef Arrgh -- effectively revert r21997. We ''do'' need that header file...
This commit was SVN r21998.

The following SVN revision numbers were found above:
  r21997 --> open-mpi/ompi@bf5f14ab32
2009-09-22 21:19:38 +00:00
Jeff Squyres
bf5f14ab32 Remove some debugging stuff.
This commit was SVN r21997.
2009-09-22 19:39:01 +00:00
Ralph Castain
dff0d01673 Yet another paffinity cleanup...sigh.
1. ensure that orte_rmaps_base_schedule_policy does not override cmd line settings

2. when you try to bind to more cores than we have, generate a not-enough-processors error message

3. allow npersocket -bind-to-core combination - because, yes, somebody actually wants to do it.

This commit was SVN r21996.
2009-09-22 18:44:53 +00:00
Josh Hursey
5406fdfb80 Add support for sending SIGSTOP the MPI job after the checkpoint is taken (uses a BLCR feature for the option).
This commit looks larger than it really is since it includes a fair amount of code cleanup.

The SIGSTOP/SIGCONT+checkpointing work uses some of the functionality in r20391. Basic use case below (note that the checkpoint generated is useable as usual if the stopped application is terminated).
{{{
shell 1) mpirun -np 2 -am ft-enable-cr my-app
... running ...

shell 2) ompi-checkpoint --stop -v MPIRUN_PID
[localhost:001300] [  0.00 /   0.20]                 Requested - ...
[localhost:001300] [  0.00 /   0.20]                   Pending - ...
[localhost:001300] [  0.01 /   0.21]                   Running - ...
[localhost:001300] [  1.01 /   1.22]                   Stopped - ompi_global_snapshot_1234.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt

shell 2) killall -CONT mpirun

... Application Continues execution in shell 1 ...
}}}

Other items in this commit are mostly cleanup that has been sitting off-trunk for too long:
 * Add a new {{{opal_crs_base_ckpt_options_t}}} type that encapsulates the various options that could be passed to the CRS. Currently only TERM and STOP, but this makes adding others ''much'' easier.
 * Eliminate ORTE_SNAPC_CKPT_STATE_PENDING_TERM, since it served a redundant purpose with the new options type.
 * Lay some basic ground work for some future features.

This commit was SVN r21995.

The following SVN revision numbers were found above:
  r20391 --> open-mpi/ompi@0704b98668
2009-09-22 18:26:12 +00:00
Jeff Squyres
bb69bf22c0 Fix dumb logic in common sm setup that determines which nodes are
local and who has the lowest name.  

This commit was SVN r21994.
2009-09-22 17:54:43 +00:00
Eugene Loh
67bac2fe31 Fix paffinity_linux_module.c. The set and get functions transferred cpu
masks between the mask argument and a local PLPA mask.  There were three
problems:
1) The "get" function computed the number of bits as sizeof(mask),
   which is the size of the pointer to the mask rather than the mask
   itself.  So, only 4 bits were copied with m32 and 8 bits with m64.
   There are actually 1024 bits.
2) The "get" and "set" functions both copied a number of bits computed
   from the sizeof() mask, but sizeof() reports the number of bytes.
   We have to multiply by 8 to get the number of bits.
3) These two functions check to make sure tha the mask argument is not
   bigger than the PLPA mask.  But, the set function copies a number
   of bits in the PLPA mask, which is conceivably greater than the
   number of bits in the mask argument.  So, accesses to the mask
   argument may overrun that argument.
Problems 1 and 2 meant that one would encounter errors when the number of
cores exceeded 4 (with -m32) or 8 (with -m64).  Problem 3 probably caused
no errors.

This commit was SVN r21993.
2009-09-22 16:00:37 +00:00
Ralph Castain
8da3aa8d5c Some (hopefully final!) adjustments and corrections to the paffinity support:
1. default -npersocket to force -bind-to-socket

2. if we cannot get a value for cores/socket, try using #logical cpus. otherwise, default to 1 core

3. add missing error message for not-enough-processors

4. since we no longer loop through orte_register_params twice, put the auto-detect of
   topology info in the rte_init for hnp and std_orted

5. fix bind-to-core, bysocket combination

This commit was SVN r21992.
2009-09-22 15:41:03 +00:00
Jeff Squyres
b91e7ba91f This is no longer necessary.
This commit was SVN r21991.
2009-09-22 15:01:00 +00:00
Ralph Castain
12613352eb Add missing header file
This commit was SVN r21990.
2009-09-22 13:07:57 +00:00
Ralph Castain
2210989e2d Update the cm ess module to support orted bootstrap. Continue work towards bootstrap capability.
This commit was SVN r21989.
2009-09-22 02:16:40 +00:00
Ralph Castain
c3f9096fd9 Add a reliable multicast framework, with an initial basic module. This is configured out unless specifically requested via --enable-multicast.
This commit was SVN r21988.
2009-09-22 00:58:29 +00:00
Ralph Castain
82af6ee940 Update test
This commit was SVN r21987.
2009-09-22 00:55:02 +00:00
Ralph Castain
247977fe70 Update cisco platform files
This commit was SVN r21986.
2009-09-22 00:54:22 +00:00
Ralph Castain
7765c71428 Add a macro for formatting IP addresses for printing
This commit was SVN r21985.
2009-09-22 00:53:54 +00:00
Jeff Squyres
1ef988c3d9 A slight optimization: no longer call sched_yield() when polling for
shmem progress (or the Windows equiv).  Instead, poll hard on the
condition, but periocially call opal_progress().  This allows
badly-formed apps (e.g., the ibm test communicator/bsend_free) to
actually complete.

To be clear, there are far too many apps out there that assume that
MPI collectives will actually progress the rest of MPI.  I don't like
putting in a feature to enable broken apps, but I have a dim
recollection of this issue coming up before (apps "hanging" when
testing the sm coll because they assumed that calling collectives
would trigger other MPI progress).  Rather than have people claim that
OMPI is broken, I prefer to put in this "workaround".  :-(

Indeed, the bsend_free test ''may'' be coded that way for exactly that
reason...?  I don't remember offhand...

This commit was SVN r21984.
2009-09-21 22:20:44 +00:00
Jeff Squyres
64e3689a52 Grr -- test ''before'' committing! Sorry for all the noise folks;
this one really fixes the problem.  One more optimization coming later
(separately).

This commit was SVN r21983.
2009-09-21 21:32:26 +00:00
Jeff Squyres
bc43b6a085 Arrgh -- there was an extra assignment in there. Additionally, clean
it up a little to drive the point home that the lowest named proc goes
into array position [0].

This commit was SVN r21982.
2009-09-21 21:15:32 +00:00
Jeff Squyres
f9dfa03fde Fix a potential ordering issue with the names and RML exchange during
sm coll setup.

This commit was SVN r21981.
2009-09-21 21:10:45 +00:00
Terry Dontje
0ccf2d87b6 rename do-not-bind to bind-to-none and clean up an error message
This commit was SVN r21980.
2009-09-21 17:00:02 +00:00
Terry Dontje
13be2d2a00 correct mistype in odle should be odls call to orte_show_help
This commit was SVN r21979.
2009-09-21 13:22:37 +00:00
Ralph Castain
7138fd131f Final cleanup on new paffinity "if-avail" messages, plus fix one bug reported by Terry
This commit was SVN r21978.
2009-09-19 17:43:21 +00:00
George Bosilca
b18ca686ae Correct the pointer math when we copy the opal_datatype_t object. In addition
don't set the ref count to 1, it has been already set by the call to OBJ_NEW
when the type was allocated. This fixes ticket #2014.

This commit was SVN r21976.
2009-09-18 20:05:22 +00:00
Ralph Castain
2028017554 Modify the paffinity system to handle binding directives that are "soft" - i.e., when someone directs that we bind if the system supports it. This allows community members to distribute OMPI with default MCA param files that direct general binding policies, without having the distributed software fail if the system cannot support those policies.
The new options work by adding an ":if-avail" qualifier to the "bind-to-socket" and "bind-to-core" MCA params. If the system does not support this capability, the job will launch anyway. Without the qualifier, the job will abort with an error message indicating that the required functionality is not supported on this system.

This commit was SVN r21975.
2009-09-18 19:48:42 +00:00
Josh Hursey
7ac8d89f12 Since r21967 converted the mpool sm module into a real module, it broke some of the C/R logic in the ft_event funciton (actually it wouldn't build after that patch).
This commit fixes the ft_event logic so that it uses the normal destroy funcitonality instead of the workaround with the component that was previously there. All and all it made for cleaner code, which is always good.

If r21967 moves to v1.3, this patch will need to be moved as well.

This commit was SVN r21972.

The following SVN revision numbers were found above:
  r21967 --> open-mpi/ompi@533633b8cb
2009-09-17 14:45:17 +00:00
Josh Hursey
59143be39d Fix a minor C/R bug related to cleaning up session directories when sm is present.
Before this, we would restore the topmost old session directory. This commit makes sure that we remove it when we are done with it.

This commit was SVN r21971.
2009-09-17 14:43:06 +00:00
Edgar Gabriel
9abeaad6e2 so here is what happens:
in the v1.2 series the cid's could never go above the max. allowed for a
particular pml. Because of that, pml_add_comm never checked for the cid, and
in fact pml_add_comm was called in comm_set, which is *before* we knew the
cid.

in the v1.3 series (and trunk) we check now the cid to detect overflow, and
because of that pml_add_comm has been moved *after* the cid allocation
routine, namely into the comm_activate routine.

in the v1.2 series, the comm_activate contained a synchronization step of the
old communicator in order to prevent incoming fragments on the new
communicator, with the main problem being that the allreduce in the
communicator allocation finished at different times on different processes,
and thus, this scenario could and did really occur.

in the v1.3 series, the comm_activate does not contain the synchronization
step anymore, since we introduced the new queue for fragments with unknown
cid. The problem is however, that whether a fragment is known or not is
decided by using ompi_comm_lookup(), which will return something useful as
soon as the cid allocation finished, even before pml_add_comm has been
called. So there is a small time gap where we will not post a message into
queue for unknown cid's, but we can also not look up the process structure
belonging to the rank in that comm ( that is in pml_ob1_match_recv_frag or
something like that). 


The current fix reintroduces the synchronization step in comm_activate, and
ensures that no fragment can be received for a new communicator before the
synchronization occurs , and thus comm_nextcid() and pml_add_comm has been
called. It seems to be the safest and easiest way for now. Welcome back, v1.2.

This commit was SVN r21970.
2009-09-17 14:37:02 +00:00
Ralph Castain
98a4450df6 Fix the seq mapper by initializing the proc object to NULL before claiming a slot for it
This commit was SVN r21969.
2009-09-17 05:18:37 +00:00
Jeff Squyres
4a40be650e Improve the MCA param help messages for btl_tcp_if_in|exclude.
This commit was SVN r21968.
2009-09-15 17:19:57 +00:00
Jeff Squyres
533633b8cb Fixes trac:1988. The little bug that turned out to be huge. Yoinks.
* Various cosmetic/style updates in the btl sm
 * Clean up concept of mpool module (I think that code was written way
   back when the concept of "modules" was fuzzy)
 * Bring over some old fixes from the /tmp/timattox-sm-coll/ tree to
   fix potential segv's when mmap'ed regions were at different
   addresses in different processes (thanks Tim!).
 * Change sm coll to no longer use mpool as its main source of shmem;
   rather, just mmap its own segment (because it's fixed size --
   there was nothing to be gained by using mpool; shedding the use of
   mpool saved a lot of complexity in the sm coll setup).  This
   effectively made Tim's fixes moot (because now everything is an
   offset into the mmap that is computed locally; there are no global
   pointers).  :-)
 * Slightly updated common/sm to allow making mmap's for a specific
   set of procs (vs. ''all'' procs in the process).  This potentially
   allows for same-host-inter-proc mmaps -- yay!
 * Fixed many, many things in the coll sm (particularly in reduce):
   * Fixed handling of MPI_IN_PLACE in reduce and allreduce
   * Fixed handling of non-contiguous datatypes in reduce
   * Changed the order of reductions to go from process (n-1)'s data
     to process 0's data, because that's how all other OMPI coll
     components work
   * Fixed lots of usage of ddt functions
   * When using a non-contiguous datatype, if the root process is not
     (n-1), now we used a 2nd convertor to copy from shmem to the rbuf
     (saves a memory copy vs. what was done before)
   * Lots and lots of little cleanups, clarifications, and minor
     optimizations (although still more could be done -- e.g., I think
     the use of write memory barriers is fairly sub-optimal; they
     could be ganged together at the root, for example)

I'm marking this as "fixes trac:1988" and closing the ticket; if something
is still broken, we can re-open the ticket.

This commit was SVN r21967.

The following Trac tickets were found above:
  Ticket 1988 --> https://svn.open-mpi.org/trac/ompi/ticket/1988
2009-09-15 00:25:21 +00:00
Jeff Squyres
cf6b71e8a2 Helper -- ensure to revert VERSION when we're done.
This commit was SVN r21966.
2009-09-11 15:49:15 +00:00