1
1
Граф коммитов

6136 Коммитов

Автор SHA1 Сообщение Дата
Josh Hursey
d39841174d Must release the lock before entering the non blocking recv, since
it is possible that if the receive has been arrived the callback will
be called before recv_buffer_nb() returns. This causes deadlock
as we try to acquire the lock, but already hold it.

This was causing orterun and orteds to stall in certian situations.
Became evident when stress testing dynamics with remote nodes.

This commit was SVN r7543.
2005-09-29 14:24:11 +00:00
Brian Barrett
997644af31 * There are now two forms of ibv_create_cq, one with 3 params and one with 5.
Try to detect which form this version of Open IB uses, defaulting to the 5
  version if we can't figure it out (the new version has 5 params)
* Only add -lcm if it exists on the system - some versions of Open IB
  apparently don't need it.

This commit was SVN r7542.
2005-09-29 13:35:57 +00:00
Jeff Squyres
de1c8fb125 - Make debug output a bit more accurate and readable
- Fix bug identified by users: --prefix may also apply on the local
  node; we need to prefix the PATH and LD_LIBRARY_PATH environment
  variables before invoking execve()

This commit was SVN r7541.
2005-09-29 12:35:43 +00:00
Jeff Squyres
fa4f7a6261 Move assignment out of inner loop -- only needs to be done once, and
fixes a compiler warning (and potential bug)

This commit was SVN r7540.
2005-09-29 10:09:20 +00:00
Josh Hursey
c11ba09655 Remove the progress engine stuff from abort. This was causing
some orted's to stall on locks in the MPI Dynamics cases. Since it
is not essentual that we call these functions, they can so away.

Unlock the peer lock when aborting. This causes a potential deadlock
in do_waitall [see comment in code]. This was causing orteds to
deadlock at times when the seed had terminated. With proper interleaving
and timing the orted was deadlocking. This seems to have fixed this in 
my stress testing with MPI 2 Dynamics.

This commit was SVN r7539.
2005-09-29 05:04:43 +00:00
Josh Hursey
e825b4522f Upon further investigation the fix in r7537 was an anomoly of zero'ing out the
bits to expose the low bits being set. We were casting from a size_t to a void*
which is not good when working with big endian machines.

This fix makes MPI 2 dynamics work on PPC 64 (tested with a Linux OS).

This commit was SVN r7538.

The following SVN revision numbers were found above:
  r7537 --> open-mpi/ompi@fd45714c03
2005-09-28 23:50:42 +00:00
Josh Hursey
fd45714c03 For some reason we have to initialize this variable or bad things happen in the
comm->c_coll.coll_bcast of the rnamebuflen.

This fixes the threaded MPI 2 Dynamics stuff. Should be working great now! Yay!

This commit was SVN r7537.
2005-09-28 22:30:41 +00:00
Galen Shipman
3ded88a3c0 use addr +size -1 instead of base->addr as base->addr is down_aligned.
This commit was SVN r7536.
2005-09-28 20:19:33 +00:00
Galen Shipman
26a74d42fa release, not retain on gm_free
This commit was SVN r7535.
2005-09-28 20:18:52 +00:00
Brian Barrett
f36642b73b * fix compile error. i should know better
This commit was SVN r7534.
2005-09-28 18:18:12 +00:00
Brian Barrett
ad62411a15 * fix deadlock in memory code that occurs if OBJ_NEW calls realloc to
expand it's class list array by pre-allocating the callback item before
  we obtain the lock (we free it if we ended up not needing it)

This commit was SVN r7533.
2005-09-28 16:40:19 +00:00
Edgar Gabriel
67dd52efb1 making the allreduce and reduce_scatter tests pass as well
This commit was SVN r7532.
2005-09-28 15:12:05 +00:00
Josh Hursey
75419313f7 check the return code and do something reasonable, instead of progressing and hanging on error
This commit was SVN r7531.
2005-09-28 06:13:51 +00:00
Galen Shipman
c1f5543f62 need to call mpool_release on all registrations obtained in the pml.
sanity checks 

This commit was SVN r7530.
2005-09-28 04:49:40 +00:00
Josh Hursey
4cf4b4ea86 Fix for MPI 2 dynamics.
The NS replica should give out tags that are over ORTE_RML_TAG_DYNAMIC
or it will overlap with other outstanding tags. This overlap was killing
MPI_Comm_spawn when a program tried to use it multiple times (> 3).

With this fix MPI_Comm_spawn is behaving properly.
A program can call it many times in a row with out problem.

NOTE: Not tested for multi-threaded build yet

(A long time debugging for a one liner... :/)

This commit was SVN r7529.
2005-09-28 03:20:43 +00:00
Galen Shipman
b9b78f8f5d modify rcache_rb to find registrations in the middle of a base and bound
This commit was SVN r7528.
2005-09-28 02:11:35 +00:00
Brian Barrett
676e34c2d4 * play the linker games that need to be played to shut Darwin up
This commit was SVN r7527.
2005-09-28 01:24:05 +00:00
Edgar Gabriel
dbbbd416df fixing MPI_IN_PLACE for the log-reduce algorithm.
This commit was SVN r7526.
2005-09-27 21:51:55 +00:00
Galen Shipman
0fc17cedee change order of ops on register
This commit was SVN r7525.
2005-09-27 21:43:41 +00:00
Jeff Squyres
285ded5655 - Ensure to have !initialized || finalized test *first*
- If we have an NS error, don't return an error -- this function's
  purpose is to abort :-)
- s/abort()/exit(1)/ so that we don't drop massive corefiles

This commit was SVN r7524.
2005-09-27 20:26:38 +00:00
Jeff Squyres
88ab3dcdf0 Fix a bunch of stupid typos
This commit was SVN r7523.
2005-09-27 19:31:20 +00:00
Jeff Squyres
26efa2c4cd Add a default prefix to stream 0 ("[hostname:pid]") to help identify
where output is coming from.  Threw in a few minor style fixes while I
was editing these files.

This commit was SVN r7522.
2005-09-27 19:27:26 +00:00
Galen Shipman
09e67ce4fd fix off by one on up_align_addr, use base and bound instead of base_align and
bound_align.. 

This commit was SVN r7521.
2005-09-27 18:10:44 +00:00
Jeff Squyres
621fb2b99e - Add more missing fortran values
- Add warnings to not tamper with constant values in mpi.h.in without
  also adjusting mpif.h.in

This commit was SVN r7520.
2005-09-27 17:47:21 +00:00
Brian Barrett
9d2c2d4ab8 * add declarations for MPI_SEEK_CUR and MPI_SEEK_END
This commit was SVN r7519.
2005-09-27 17:20:41 +00:00
Jeff Squyres
31b2ec198b Fix problem identified by Ferris McCormick -- we were missing some
out-of-bounds protection for output ID's in some functions.  Also,
move some logic for closing the syslog inside a conditional block --
it's only necessary to close the syslog if we actually closed a
stream.

This commit was SVN r7518.
2005-09-27 16:43:48 +00:00
Brian Barrett
7c924dc221 * don't try to fire up orte - nothing good comes from trying to open all those
components...

This commit was SVN r7517.
2005-09-27 14:48:41 +00:00
Galen Shipman
51f1c7a8e4 bring ompi_fifo up to new mpool interface,, looks like this has been stale for
some time. Comment out an incorrect test in ompi_rb_tree.c 

This commit was SVN r7516.
2005-09-27 14:36:53 +00:00
Galen Shipman
af04b3e1ab fix warnings..
This commit was SVN r7515.
2005-09-27 14:23:51 +00:00
Brian Barrett
70e59774e9 * add some missing ignores
This commit was SVN r7514.
2005-09-27 13:43:38 +00:00
Josh Hursey
a23370c007 Converted some MCA parameters from the old version to the new.
Have the ras_base_schedule_policy MCA parameter working once again. before it 
would only do slot based allocation, even if the MCA parameter was set properly.

Currently you can specify to orterun a node allocation by either:
-mca ras_base_schedule_policy node
-bynode

and slot allocation (which is the default) by:
-mca ras_base_schedule_policy slot
-byslot

This commit was SVN r7513.
2005-09-27 02:54:15 +00:00
Jeff Squyres
844cf6ae39 Add missing declarations for MPI_SEEK_SET, MPI_MODE_CREATE,
MPI_MODE_WRONLY

This commit was SVN r7512.
2005-09-27 02:37:11 +00:00
Brian Barrett
80ac5c2efd * there are now two upcoming points where we want to release a version with
a random string of characters as part of the version number (the really
  soon to happen 1.0lanl release and the 1.1sc2005 release that we've
  talked about).  So rather than having alpha and beta fields that must
  be numeric values, have a general field that can be any alphanumeric
  value.

This commit was SVN r7511.
2005-09-27 02:06:05 +00:00
Galen Shipman
3c97b3f722 Modified the registration to include a base_align and bound_align for
searching the tree. Modified the memory callback to search the tree at each
page boundary for registrations. This is necessary as an application may
malloc memory and send out of any portion of that memory, even discontiguous
regions. 

This commit was SVN r7510.
2005-09-27 02:01:21 +00:00
Jeff Squyres
5ffd95e971 Commit of the unified component selection system. This was mostly
in-place already (turns out that I was wrong in thinking that it
didn't work for static components), but the logic for excluding
components was not there.  This commit does a few things:

- Adds "exclude" logic, so that you can do:

  mpirun --mca btl ^mvapi,openib ...

  (note the "^" character -- I tried "!" but then you have to escape
  it in the shell, and that was icky) which will exclude both the
  mvapi and openib btl components (excluding one component means that
  you are excluding all components in the list; it doesn't make sense
  to include some and exclude others -- you're entire entirely
  including or entirely excluding)

- Simplifies the "include" logic, so the same old stuff like this
  still works:

  mpirun --mca btl tcp,self ...

  will only use the tcp and self btl components.

- Added more verbosity statements to make this selection process
  clear.

This commit was SVN r7509.
2005-09-26 21:55:32 +00:00
Brian Barrett
1d9b663b62 * test for condition where we think we can intercept malloc/free/munmap but
really can't.  Test for munmap, since it's the most likely to cause problems,
  since it's always an interposed symbol.

  The condition that usually causes problems is if libmpi was brought in as
  the result of a library dependency, rather than as a -l on the link line.
  The linker in this case will find malloc/free/munmap/etc. in libc, rather
  than in libmpi.

This commit was SVN r7508.
2005-09-26 20:20:20 +00:00
Brian Barrett
d9e80d8f2a * increase size of event queue for receives - it was too small to be useful
on a reasonably sized machine
* if no mpool exists, don't try to malloc out an array of 0 bytes

This commit was SVN r7507.
2005-09-25 17:04:03 +00:00
Galen Shipman
384c472c94 reset ompi_pointer_array in mca_rcache_rb_find otherwise you might use an old
registration by accident.. 

This commit was SVN r7506.
2005-09-24 20:48:14 +00:00
Galen Shipman
c728c0acc5 more changes to test..
seems to have errors inserting some items, these are apparently duplicates.. 

This commit was SVN r7505.
2005-09-24 17:07:08 +00:00
Galen Shipman
02ce7a176e had that backwards..
This commit was SVN r7504.
2005-09-24 16:58:07 +00:00
Galen Shipman
d1246be47e should be strictly > or <
This commit was SVN r7503.
2005-09-24 16:45:34 +00:00
Galen Shipman
ac935f3a51 a very basic test for using the tree with base and bound segments..
- FAILS..

This commit was SVN r7502.
2005-09-24 16:19:44 +00:00
Galen Shipman
c53d51778a fix for warnings..
This commit was SVN r7501.
2005-09-24 15:08:28 +00:00
Brian Barrett
4efd0405c3 * fixes to make tests compile again - dunno when these broke :/
This commit was SVN r7500.
2005-09-24 01:17:32 +00:00
Galen Shipman
9fe5844071 decrement ref count on removal of registration from mru and tree.
add misc asserts to check for proper reference counting. 

ugly hack 1 -- use mallopt to never release memory ala sbrk - this is
commented out in mca_btl_mvapi_component_init

ugly hack 2 -- test registrations comming out of the tree via rcache_find, for
an unknown reason the tree is returning registrations where the address is not
within the base or bound of the registration. If this happens, we return
NULL. 

comment out code to enable mem hooks if leave_pinned is set, note we can do
this via an mca param and will default it to leave_pinned with mem_hooks when
we iron out these issues. 

I am adding a unit test for the rcache. Note that we have a unit test for the
rb tree but the compare function is significantly different than that used for
registrations. After we have tracked down the issues with rcache_rb we will
remove the above hacks. 

This commit was SVN r7499.
2005-09-24 00:24:49 +00:00
Brian Barrett
50dc5499b4 * fix some remaining --with-btl-portals configure issues
This commit was SVN r7498.
2005-09-24 00:11:40 +00:00
Brian Barrett
0d68728b94 * add some more debugging output for send fragment issue to figure out why
Red Storm is complaining about invalid memory pointer (need to go back
  to Linux and look at this with valgrind)
* Turn off send in place for now, so I can run the tests on RS and see if
  everything else is ok

This commit was SVN r7497.
2005-09-23 19:30:54 +00:00
Brian Barrett
07b0b8c943 * add some useful debugging output
* fix dumb bug in btl_portals_get where I using the dest descriptor key instead
  of the source descriptor key for the match bits, resulting in a PtlGet() with
  the wrong match bits

This commit was SVN r7496.
2005-09-23 15:30:18 +00:00
Jeff Squyres
fbd6db142b Remove some old references to --with-btl-mvapi (replaced with
--with-mvapi).

This commit was SVN r7495.
2005-09-23 13:18:41 +00:00
Jeff Squyres
f5ded6dd04 Don't link in libopal here because this component is *always* built
statically, and therefore is part of libopal.

This commit was SVN r7494.
2005-09-23 12:39:47 +00:00