Commit Graph

699 Commits

Author SHA1 Message Date
Jeff Squyres
6af4f1896c test: these files are not used any more
The functions in components.* are not used by any tests.  Removing
this old kruft.
2015-01-30 14:30:14 -08:00
Gilles Gouaillardet
661c35ca67 cleanup dead code caused by the removal of the --with-threads configure option 2015-01-16 19:13:59 +09:00
George Bosilca
82c02b471e Take into account the lower bound of the data. 2014-12-20 21:28:58 -05:00
George Bosilca
4f9a3bdbab Correctly compute the size of the needed memory for the datatype tests.
Fixes open-mpi/ompi#294.
2014-12-20 01:30:37 -05:00
George Bosilca
cf3ff3fe58 This was not supposed to be part of the 1895f29 commit. 2014-12-19 09:55:18 -05:00
George Bosilca
1895f29537 Remove all warnings from the datatype tests. 2014-12-18 02:58:21 -05:00
Nathan Hjelm
bc33b7a71d Add check for timersub to opal_lifo and opal_fifo tests
Some platforms do not provide a timersub macro. This commit adds a definition
to both tests when running on one of these platforms.
2014-12-17 22:16:15 -07:00
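The fallback described in the commit above could look roughly like the following sketch. It is not the code from the commit: the macro body follows the conventional BSD semantics of timersub, and `elapsed_usec` is a hypothetical helper added only for illustration.

```c
#include <sys/time.h>

/* Fallback for platforms whose <sys/time.h> does not provide timersub.
 * Assumption: conventional BSD semantics, result = a - b, normalized so
 * that 0 <= tv_usec < 1000000. */
#ifndef timersub
#define timersub(a, b, result)                             \
    do {                                                   \
        (result)->tv_sec  = (a)->tv_sec  - (b)->tv_sec;    \
        (result)->tv_usec = (a)->tv_usec - (b)->tv_usec;   \
        if ((result)->tv_usec < 0) {                       \
            --(result)->tv_sec;                            \
            (result)->tv_usec += 1000000;                  \
        }                                                  \
    } while (0)
#endif

/* Hypothetical helper: elapsed time between two samples in microseconds. */
static long elapsed_usec(const struct timeval *start,
                         const struct timeval *stop)
{
    struct timeval diff;
    timersub(stop, start, &diff);
    return (long) diff.tv_sec * 1000000L + (long) diff.tv_usec;
}
```

Because the definition is guarded by `#ifndef`, platforms that already provide the macro keep their own version.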
Ralph Castain
91bec7e9dd Fix some type declarations so make check works for SPARC. Thanks to Paul Hargrove for the report and correction 2014-12-15 06:44:51 -08:00
Nathan Hjelm
da5e3ce936 test/class: update class tests to also use opal_finalize_util 2014-12-10 17:50:26 -07:00
Jeff Squyres
b1e9e7f56f Whitespace cleanup only; no code changes 2014-12-10 13:32:04 -08:00
Jeff Squyres
8b2410f554 class tests: re-enable a bunch of tests
Many of these tests were failing due to opal_init() failing in some
cases (because the opal shmem framework needs installed components, so
"make distcheck" would fail these tests because the opal shmem
components were not installed).  However, all of these tests seem to
be fine with opal_init_util() -- so let's re-enable these tests.
2014-12-10 13:30:14 -08:00
Jeff Squyres
ff2a75b29b class tests: change from opal_init() to opal_init_util() 2014-12-10 13:29:38 -08:00
Nathan Hjelm
1231bb7479 Update lifo and fifo tests to use opal_init/finalize_util so they work during make distcheck 2014-12-09 17:41:18 -07:00
Ralph Castain
04c6d1d01d Silence warnings 2014-12-09 16:10:58 -08:00
Nathan Hjelm
b2b7ecc7c4 Merge pull request #300 from hjelmn/topic/atomic_lifo_fifo
Add opal_fifo_t class and rename opal_atomic_lifo_t to opal_lifo_t
2014-12-09 10:54:50 -06:00
Ralph Castain
595740a8e3 Sigh - readd missing headers 2014-12-05 21:54:41 -08:00
Ralph Castain
4a0b4ad5ef You can't have a variable of the same name as the function... 2014-12-05 21:50:40 -08:00
Ralph Castain
aff1f0ee49 Add missing header files 2014-12-05 19:03:21 -08:00
Nathan Hjelm
23d59b0f5d Fix one typo in opal_path_nfs.c 2014-12-05 13:13:35 -07:00
Nathan Hjelm
0fc8777aa8 opal_path_nfs test: do not try to test filesystems that can not be stat'd 2014-12-05 13:11:45 -07:00
Jeff Squyres
9b18b4b2d2 opal_path_nfs: enable debugging output
Now that "make check" siphons off stdout/stderr to logfiles, it's ok
to have output by default from tests.  This test fails often enough
that it's useful to see the diagnostic output.
2014-12-05 03:19:51 -08:00
Nathan Hjelm
3aefd78842 Add lifo and fifo checks to make check 2014-12-04 16:03:47 -07:00
Nathan Hjelm
d1114ec17a Add opal_fifo_t class
This commit adds a new class: opal_fifo.h. The new class has atomic, non-atomic,
and opal_using_threads() conditioned routines. It should be used when first-in
first-out is required and should perform much better than using locks and an
opal_list_t. Like with opal_lifo_t there are two versions of the atomic
implementation: 128-bit compare-and-swap, and spin-locked. More implementations
can be added later (LL/SC comes to mind).

This commit also adds a unit test for the opal_fifo_t class. This test verifies
the fifo implementation when using multiple threads.
2014-12-04 15:30:02 -07:00
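As a rough illustration of the spin-locked variant mentioned in the commit above, here is a minimal sketch built on C11 atomics. It is not the opal_fifo_t implementation: every `sketch_*` name is hypothetical, and the real class is built on OPAL's own object and atomic layers.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical item and queue types, for illustration only. */
typedef struct sketch_item {
    struct sketch_item *next;
    int value;
} sketch_item_t;

typedef struct {
    atomic_flag lock;       /* spin lock protecting head and tail */
    sketch_item_t *head;    /* oldest item, popped first */
    sketch_item_t *tail;    /* newest item, pushed last */
} sketch_fifo_t;

static void sketch_fifo_init(sketch_fifo_t *fifo)
{
    atomic_flag_clear(&fifo->lock);
    fifo->head = fifo->tail = NULL;
}

static void sketch_fifo_lock(sketch_fifo_t *fifo)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&fifo->lock,
                                             memory_order_acquire)) {
    }
}

static void sketch_fifo_unlock(sketch_fifo_t *fifo)
{
    atomic_flag_clear_explicit(&fifo->lock, memory_order_release);
}

/* Append an item at the tail. */
static void sketch_fifo_push(sketch_fifo_t *fifo, sketch_item_t *item)
{
    item->next = NULL;
    sketch_fifo_lock(fifo);
    if (fifo->tail != NULL) {
        fifo->tail->next = item;
    } else {
        fifo->head = item;
    }
    fifo->tail = item;
    sketch_fifo_unlock(fifo);
}

/* Remove and return the oldest item, or NULL if the queue is empty. */
static sketch_item_t *sketch_fifo_pop(sketch_fifo_t *fifo)
{
    sketch_fifo_lock(fifo);
    sketch_item_t *item = fifo->head;
    if (item != NULL) {
        fifo->head = item->next;
        if (fifo->head == NULL) {
            fifo->tail = NULL;
        }
    }
    sketch_fifo_unlock(fifo);
    return item;
}
```

A 128-bit compare-and-swap variant would avoid the lock entirely, at the cost of needing a counter alongside the pointer to guard against ABA problems.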
Nathan Hjelm
20c6eb5237 Rename opal_atomic_lifo_t to opal_lifo_t and improve interface
- Rename opal_atomic_lifo_t to opal_lifo_t to reflect both atomic and
   non-atomic usage. Added new routines (opal_lifo_*_st) for non-atomic
   usage as well as routines conditioned off opal_using_threads(). The
   atomic versions are always thread safe and the non-atomic are always
   not thread safe.

 - Add a new atomic lifo implementation that makes use of 128-bit
   compare-and-swap. The new implementation should scale better with
   larger numbers of threads.

 - Add threading unit test for opal_lifo_t.
2014-12-04 15:30:02 -07:00
George Bosilca
8ee501350b Implement strict validation for the packing/unpacking of the data. 2014-12-02 16:22:18 +09:00
Ralph Castain
53af0f1594 Temporarily turn off a specific datatype test that is preventing the nightly tarball from running 2014-12-01 19:52:37 -08:00
George Bosilca
59b739ee90 Add a resized datatype to the test. Implement generic
data correctness for most of the conversion functions.
2014-11-29 19:47:25 -05:00
George Bosilca
ee3d1ed5fd Add tests for vector type. 2014-11-24 01:52:49 -05:00
Ralph Castain
780c93ee57 Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
2014-11-11 17:00:42 -08:00
Jeff Squyres
01fd96bfa5 Revert "Provide a mechanism by which an upstream project can rename
the OPAL and ORTE libraries. This is required by projects such as ORCM
that have their own ORTE and OPAL libraries in order to avoid library
confusion. By renaming their version of the libraries, the OMPI
applications can correctly dynamically load the correct one for their
build."

This reverts commit 63f619f871.
2014-10-22 10:32:11 -07:00
George Bosilca
7541c03b4c Mark all instances where atomic operations are used but their return value is unnecessary 2014-10-15 21:47:32 -04:00
Ralph Castain
63f619f871 Provide a mechanism by which an upstream project can rename the OPAL and ORTE libraries. This is required by projects such as ORCM that have their own ORTE and OPAL libraries in order to avoid library confusion. By renaming their version of the libraries, the OMPI applications can correctly dynamically load the correct one for their build. 2014-10-10 11:39:08 -07:00
Joshua Ladd
1cabd73522 Adding a new OPAL hash table routine. Please read the algorithm description in opal/class/opal_hash_table.c for more precise details on the design and implementation. This algorithm was contributed by David Linden of H.P. in partnership with Mellanox Technologies. This contribution achieves two objectives:
1. It's actually hashing now, whereas the old OPAL hash table was not. Thus, it is a bug fix for and, as such, should be included in the 1.8 series.

2. It is dynamic and can grow and shrink the number of buckets in accordance with job size, whereas the old OPAL hash table had a fixed number of buckets which resulted in poor retrieval performance at large scale.

This scheme has been deployed in the field on very large H.P./Mellanox systems and has been demonstrated to significantly decrease job start-up time (~ 20% improvement) when launching applications directly with srun in SLURM environments. However, neither SLURM nor direct launch are prerequisites to take advantage of this change as any entity that utilizes OPAL hash table objects can benefit (at least partially) from this contribution.
2014-10-09 17:24:23 +02:00
Joshua Ladd
97abb7c727 Backing out the new Opal Hash table until the legal issues are addressed by H.P.
Refs trac:4872

This commit was SVN r32583.

The following Trac tickets were found above:
  Ticket 4872 --> https://svn.open-mpi.org/trac/ompi/ticket/4872
2014-08-22 19:10:09 +00:00
Joshua Ladd
84d0cc27a2 Adding a new OPAL hash table routine. Contributed by David Linden of H.P. in partnership
with Mellanox Technologies. This should be added to 


cmr=v1.8.2:subject=New OPAL hash table:reviewer=rhc

This commit was SVN r32564.
2014-08-20 21:40:28 +00:00
George Bosilca
f217661ee0 Use opal_process_info whenever possible. Some other minor cleanups.
This commit was SVN r32325.
2014-07-26 21:48:23 +00:00
Ralph Castain
6f96027aa1 Turn off the ompi_rb_tree test for now
This commit was SVN r32319.
2014-07-26 01:50:56 +00:00
Jeff Squyres
df82810d03 opal_path_nfs.c test: skip fuse filesystems
Linux statfs(2) lies about the type of fuse filesystems (it reports
fuse.encfs as an NFS filesystem).  So just skip fuse filesystems in
this test until/if we ever care to add some kind of workaround.

Refs trac:4767

cmr=v1.8.2:reviewer=rhc

This commit was SVN r32152.

The following Trac tickets were found above:
  Ticket 4767 --> https://svn.open-mpi.org/trac/ompi/ticket/4767
2014-07-08 13:30:49 +00:00
George Bosilca
fbe69808f2 A faster implementation of the OPAL_BITMAP. The corresponding
test has also been updated.

This commit was SVN r32001.
2014-06-13 21:15:35 +00:00
Gilles Gouaillardet
90c2f4a10a Fix unpack_ooo test
The test fails on a 32 bits system.
The root cause is a rounding error when testing double numbers.

This commit was SVN r31958.
2014-06-06 07:53:28 +00:00
George Bosilca
40d2c75046 Add a slightly modified version of Gilles test for the
irregular packing/unpacking of datatypes.

This commit was SVN r31952.
2014-06-04 18:33:30 +00:00
George Bosilca
ba211d97ef Remove a double const warning.
This commit was SVN r31879.
2014-05-22 06:09:45 +00:00
Jeff Squyres
09f98cb165 Fix a bunch of compiler warnings in the tests, including:
* Resolve set-but-not-used issues
 * Resolve incorrect const notation (I checked with George first to see
   what const notation he actually wanted)
 * Comment out unused code (didn't delete it because it's useful
   debugging code)
 * Resolve int<-->void* casting
 * Resolved signed / unsigned comparisons

This commit was SVN r30225.
2014-01-10 13:36:33 +00:00
Jeff Squyres
c44a1027d0 Make the non-Linux platforms support an interactive opal_path_nfs() test.
On Linux, if this test is run with no command line params, it will run
"mount" and analyze the output (same as it always has).

On all platforms, if you provide one or more command line options,
each command line option is given to opal_path_nfs() and the result is
sent to stdout.

This commit was SVN r30208.
2014-01-10 00:13:10 +00:00
Jeff Squyres
f026bdb68b Remove unused variable
Refs trac:4004

This commit was SVN r30021.

The following Trac tickets were found above:
  Ticket 4004 --> https://svn.open-mpi.org/trac/ompi/ticket/4004
2013-12-20 16:16:24 +00:00
George Bosilca
a85194ae96 Cleanup all the datatype test to avoid any memory leaks or RUI from valgrind.
This commit was SVN r30018.
2013-12-20 15:55:09 +00:00
Jeff Squyres
435eaf4671 This is an opal test; it should include opal_config.h, not ompi_config.h.
This matters if you autogen.pl --no-ompi.

This commit was SVN r29855.
2013-12-11 03:31:25 +00:00
Dave Goodell
002ba95deb regression test for r29285 (convertor_set_position)
This commit was SVN r29296.

The following SVN revision numbers were found above:
  r29285 --> open-mpi/ompi@43b4d76913
2013-09-30 16:21:19 +00:00
Ralph Castain
9366fda374 Fix names in test - still generating warnings
This commit was SVN r28740.
2013-07-09 02:58:58 +00:00
George Bosilca
c9e5ab9ed1 Our macros for the OMPI-level free list had one extra argument, a possible return
value to signal that the operation of retrieving the element from the free list
failed. However in this case the returned pointer was set to NULL as well, so the
error code was redundant. Moreover, this was a continuous source of warnings when
the picky mode is on.

The attached patch removes the rc argument from the OMPI_FREE_LIST_GET and
OMPI_FREE_LIST_WAIT macros, and changes callers to check if the item is NULL
instead of using the return code.

This commit was SVN r28722.
2013-07-04 08:34:37 +00:00
Ralph Castain
a4b6fb241f Remove all remaining vestiges of the Windows integration
This commit was SVN r28137.
2013-02-28 17:31:47 +00:00
George Bosilca
ceb75eae75 Welcome in the wonderful world of MPI 3.0.
This commit was SVN r28106.
2013-02-26 10:22:12 +00:00
Ralph Castain
ebe45b4b9c Cleanup warnings that may be messing up older compilers, remove unused variables
cmr:v1.7

This commit was SVN r27542.
2012-10-31 14:42:44 +00:00
Ralph Castain
a6329ba1b6 Fix makefile
This commit was SVN r27333.
2012-09-13 03:20:05 +00:00
Jeff Squyres
3a4b92dbb7 If we get a filesystem type of "none", skip it.
This commit was SVN r27322.
2012-09-12 14:38:37 +00:00
Ralph Castain
a08c23dfdc Actually, do the right thing - leave the test alone, but just turn it "off" for now until someone, someday fixes it to work with bind mounts.
This commit was SVN r27301.
2012-09-11 19:56:58 +00:00
Ralph Castain
3c016d79db Soft mounts are okay
This commit was SVN r27300.
2012-09-11 19:48:24 +00:00
Jeff Squyres
36dc0d40a6 * Fix a few warnings in ompi_rb_tree
* Add the get_key function to the opal_tree test

This commit was SVN r27207.
2012-08-31 20:43:58 +00:00
Shiqing Fan
42dfbc7d2f Another CMake scripts update for:
correctly generate hwloc library
automatically define OMPI/OPAL/ORTE_OMPORTS for user applications
update the f77 bindings

This commit was SVN r26893.
2012-07-27 11:49:09 +00:00
Shiqing Fan
e788691fdb Include an example to show how to use Visual Studio together with Open MPI.
When building Open MPI with CMake, a VS solution will be generated automatically, this solution can be directly used.
For the installer, it's a bit tricky: more work is needed in the NSIS config code to make the solution file aware of the user's installation directory.

This commit was SVN r26616.
2012-06-18 08:58:27 +00:00
Ralph Castain
36aab6db63 Fix test
This commit was SVN r26249.
2012-04-07 01:46:09 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
3284c6ec71 Per Paul Hargrove: add another file system name
This commit was SVN r25939.
2012-02-16 03:00:07 +00:00
Rainer Keller
4e6a6fc146 - Check, whether the compiler supports __builtin_clz (count leading
zeroes);
   if so, use it for bit-operations like opal_cube_dim and opal_hibit.
   Implement two versions of power-of-two.
   In case of opal_next_poweroftwo, this reduces the average execution
   time from 83 cycles to 4 cycles (Intel Nehalem, icc, -O2, inlining,
   measured rdtsc, with loop over 2^27 values).
   Numbers for other functions are similar (but of course heavily depend
   on the usage, e.g. opal_hibit() with a start of 4 does not save
   much).  The bsr instruction on AMD Opteron is also not as fast.

 - Replace various places where the next power-of-two is computed.
   
   Tested on Intel Nehalem Cluster with openib, compilers GNU-4.6.1 and
   Intel-12.0.4 using mpi_testsuite -t "Collective" with 128 processes.

This commit was SVN r25270.
2011-10-11 22:49:01 +00:00
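The count-leading-zeros technique described in the commit above can be sketched as follows. This is not the actual opal_next_poweroftwo(): `sketch_next_pow2` is a hypothetical helper, and it is assumed here to return the smallest power of two greater than or equal to its argument, for inputs between 1 and 2^30.

```c
#include <limits.h>

/* Hypothetical helper: smallest power of two >= value.
 * Assumes 1 <= value <= 2^30 so the result fits in an int. */
static int sketch_next_pow2(int value)
{
    if (value <= 1) {
        return 1;
    }
#if defined(__GNUC__)
    /* __builtin_clz counts leading zero bits of a nonzero unsigned int.
     * The width minus that count is the number of bits needed to
     * represent (value - 1), which is the exponent we want. */
    return 1 << (sizeof(unsigned int) * CHAR_BIT
                 - __builtin_clz((unsigned int) (value - 1)));
#else
    /* Portable fallback: shift until the power reaches value. */
    int power = 1;
    while (power < value) {
        power <<= 1;
    }
    return power;
#endif
}
```

The builtin path replaces a data-dependent loop with a single instruction on most targets, which is consistent with the cycle-count reduction the commit reports.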
Wesley Bland
4e7ff0bd5e By popular demand the epoch code is now disabled by default.
To enable the epochs and the resilient orte code, use the configure flag:

--enable-resilient-orte

This will define both:

ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE

This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Wesley Bland
e1ba09ad51 Add resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Terry Dontje
266e663091 Add opal_tree class. This will be used in the future by sysinfo to store hw maps to be used by rmaps for the new affinity code.
This commit was SVN r24594.
2011-03-30 08:05:28 +00:00
Eugene Loh
2770a12beb Continue clean up of thread options started in r22841, 22842, and 22849.
No need for any CMRs to 1.5... that was already done in CMR 2728.

This commit was SVN r24545.

The following SVN revision numbers were found above:
  r22841 --> open-mpi/ompi@b400b84162
2011-03-18 21:36:35 +00:00
Ralph Castain
d5dfe05521 Remove stale code associated with OPAL_THREADS_HAVE_DIFFERENT_PIDS. In the past, we have supported the case of really, really old Linux kernels where threads have different pids. However, when we updated the event library, we didn't also update that support code. In addition, when we dropped progress thread support, we didn't remove areas of the code that could no longer be compiled (i.e., were protected by "if progress thread && if have different pids").
There was no compelling reason to support such old kernels. Accordingly, convert the test to print a nice error message indicating we no longer support old kernels (but indicate that earlier OMPI versions do) and error out. Remove all code that was protected by "if have different pids" since it can no longer be compiled.

This commit was SVN r24531.
2011-03-15 21:05:03 +00:00
Jeff Squyres
ddcbfa6af0 Fix some fairly-important typos (!)
This commit was SVN r24328.
2011-02-01 13:18:01 +00:00
Jeff Squyres
f015f885f6 Fix datatype variable names so that PGI builds stop failing in MTT.
This commit was SVN r24327.
2011-01-31 19:12:33 +00:00
George Bosilca
fc9133cc7f Correctly initialize the convertor to be used.
Don't forget to initialize the OPAL datatype module.

This commit was SVN r24279.
2011-01-20 20:05:21 +00:00
George Bosilca
29c7f2fba5 Update the tests to match the new datatype engine.
This commit was SVN r24252.
2011-01-14 07:58:50 +00:00
Shiqing Fan
f43862420c Convert the bad dos line endings to unix style for all windows related files.
This commit was SVN r24137.
2010-12-02 12:08:08 +00:00
Ralph Castain
86c7365e8e Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem.
Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution.

This commit was SVN r23943.
2010-10-26 02:41:42 +00:00
Ralph Castain
fceabb2498 Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.

Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.

Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.

I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:

1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)

2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.

There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.

This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
Brad Benton
09c4f4d95c Added copyright notices for the files modified in r23669.
This commit was SVN r23687.

The following SVN revision numbers were found above:
  r23669 --> open-mpi/ompi@271cfa8c9a
2010-08-30 17:46:47 +00:00
Nysal Jan
271cfa8c9a Fix the opal_path_nfs test for GPFS. Reported by Paul H. Hargrove
This commit was SVN r23669.
2010-08-26 10:10:16 +00:00
Jeff Squyres
c59743d7e3 Move the predefined gap test to ompi/debuggers (we already have the
dlopen_test there, so why not put the other debugger test there with
it?).

This commit was SVN r23527.
2010-07-28 16:22:10 +00:00
Jeff Squyres
49b8008986 Remove the peruse test from any possibility of being run during "make
check" (it's been deactivated for 2+ years now, anyway).  It needs to
be launched via "mpirun" and needs >= 2 processes, so it wasn't a good
candidate for "make check", anyway.

The test itself has moved to OMPI's internal testing suites.

This commit was SVN r23526.
2010-07-28 16:04:18 +00:00
Jeff Squyres
f5c3c2c0ac s/ompi/opal/gi in all of these files because they're really OPAL
tests, not OMPI tests.

This allows us to "make distcheck" with "./autogen.sh -no-ompi"
trees (i.e., these tests will now still work even if the OMPI layer is
not present -- because they're OPAL tests and we should treat them
that way).

This commit was SVN r23524.
2010-07-28 14:20:58 +00:00
Jeff Squyres
953e2ace35 s/ompi/opal/g throughout the file, because this is really an OPAL
test, not an OMPI test.

Also fix a case where if you haven't run "make install", then
opal_show_help_string() will (rightfully) return NULL.  So be sure to
handle that case and not segv.

This commit was SVN r23522.
2010-07-28 14:18:16 +00:00
Jeff Squyres
ce186723a7 * Only link in the top-most library that is necessary; it is no
longer necessary to link in libopen-rte if you link in libmpi (for
   example) because of the fact that libmpi now completely slurps in
   libopen-rte (ditto with libopen-rte and libopen-pal).
 * Only build ompi_rb_tree if we have the OMPI layer.

This commit was SVN r23521.
2010-07-28 14:17:08 +00:00
Jeff Squyres
a6915364e9 Only build this test if we've enabled the OMPI layer.
This commit was SVN r23520.
2010-07-28 14:14:22 +00:00
Jeff Squyres
a25d5ffbfc Er... make sure to close the comment.
This commit was SVN r23486.
2010-07-23 13:24:15 +00:00
Jeff Squyres
3241a6f414 This test currently only works on linux. Simply returning 77 from
everywhere and compiling the rest of the test out helps reduce some
MTT stderr chatter.

This commit was SVN r23485.
2010-07-23 13:15:24 +00:00
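The skip convention used in the commit above relies on the Automake test harness treating exit status 77 as "skipped" rather than "failed". A minimal sketch of the pattern follows; `sketch_run_test` and `SKETCH_TEST_SKIPPED` are hypothetical names, and the Linux-only body is left as a placeholder.

```c
#include <stdio.h>

/* Automake's test harness reports exit status 77 as a skipped test. */
#define SKETCH_TEST_SKIPPED 77

/* Hypothetical test entry point: runs the real checks on Linux and
 * reports "skipped" everywhere else. */
static int sketch_run_test(void)
{
#if defined(__linux__)
    /* The platform-specific test body would go here. */
    return 0;
#else
    fprintf(stderr, "test skipped: only supported on Linux\n");
    return SKETCH_TEST_SKIPPED;
#endif
}
```

A test program would simply return `sketch_run_test()` from `main`, so unsupported platforms show up as skips in the "make check" summary instead of failures.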
Jeff Squyres
7fa92d0f22 Fix a missed _count -> ucount update.
This commit was SVN r23479.
2010-07-23 01:06:16 +00:00
Jeff Squyres
c8bb7537e7 Remove include/opal/sys/cache.h -- its only purpose in life was to
#define CACHE_LINE_SIZE to 128.  This name has a conflict on NetBSD,
and it seems kinda odd to have a header file that ''only'' defines a
single value.  Also, we'll soon be raising hwloc to be a first-class
item, so having this file around seemed kinda weird.

Therefore, I replaced CACHE_LINE_SIZE with opal_cache_line_size, an
int (in opal/runtime/opal_init.c and opal/runtime/opal.h) on the
rationale that we can fill this in at runtime with hwloc info (trunk
and v1.5/beyond, only).  The only place we ''needed'' a compile-time
CACHE_LINE_SIZE was in the BTL SM (for struct padding), so I made a
new BTL_SM_ preprocessor macro with the old CACHE_LINE_SIZE value
(128).  That use isn't suitable for run-time hwloc information,
anyway.

This commit was SVN r23349.
2010-07-06 14:33:36 +00:00
Josh Hursey
77532c9f44 minor test fix, found by MTT
This commit was SVN r23176.
2010-05-19 17:02:13 +00:00
Abhishek Kulkarni
afbe3e99c6 * Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with
(OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
 SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
 back the native error code.

* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
  (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
  decode 'ret' to get the native error code.

This commit was SVN r23162.
2010-05-17 23:08:56 +00:00
Abhishek Kulkarni
4e33e6aeaa Merge OPAL SOS into the trunk.
The OPAL SOS framework tries to meet the following objectives:

 * reduce the cascading error messages and the amount of code needed to print an error message.
 * build and aggregate stacks of encountered errors and associate related individual errors with each other.
 * allow registration of custom callbacks to intercept error events.

For more information, refer to
https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

This commit was SVN r23158.
2010-05-17 22:51:52 +00:00
Rainer Keller
8dd87def77 - Keep only the _LAST_ entry when reading in output from mount:
On Jaguar / is NFS-mounted over the initially mounted ROOTFS...

This commit was SVN r22662.
2010-02-18 18:05:55 +00:00
Rainer Keller
ecbd530a77 - Well well, that's what one gets when turning on all kinds of old
tests ;-)) Turn them off again, didn't have time to look into them
   Also, the test-program on eddie.osl.iu.edu, detects the rpc_pipefs
   mounted on /var/lib/nfs/rpc_pipefs, required for NFS.

This commit was SVN r22607.
2010-02-11 22:07:07 +00:00
Rainer Keller
ea4de16561 - Check whether file is opened on network file-system.
If file does not exist, check the directory it lives in...
   May be used by a caller trying to open mmap() on NFS, Lustre or
   Panasas (thanks Sam).
   For now, this is used to warn about the usage of mmap on such FS.

   Please note, that Ralph mentioned the orte_no_session_dir parameter.
   The help message includes a reference to this.

   Tested on NFS and Lustre on Linux on
     smoky: mpirun --mca orte_tmpdir_base $HOME/tmp -np 2 ./mpi_stub
     jaguar: mpirun ... --mca orte_tmpdir_base /tmp/work/$USER ...

   Fixes trac:1354

   This should   cmr:v1.5   once it has soaked and is shown to work on
   Solaris

This commit was SVN r22604.

The following Trac tickets were found above:
  Ticket 1354 --> https://svn.open-mpi.org/trac/ompi/ticket/1354
2010-02-10 23:18:29 +00:00
Rainer Keller
583bb42739 - Adapt for changed opal_init() arguments -- takes argc&argv
It's orte/constants.h not orte/orte_constants.h

This commit was SVN r22594.
2010-02-10 18:29:01 +00:00
Rainer Keller
c161cf5fa4 - These orte tests refer to include files not available anymore, call
functions not in the orte-tree, so disable for now.

This commit was SVN r22593.
2010-02-10 18:21:04 +00:00
Ralph Castain
30056c77cf Grrr...remove debug
This commit was SVN r22546.
2010-02-03 21:02:30 +00:00
Ralph Castain
1b5e4b4ac9 Update the opal_bitmap test
This commit was SVN r22545.
2010-02-03 20:56:48 +00:00
Shiqing Fan
872a4047ba Fix the bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions.
In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically. In CMake 2.8 this behavior changed: it only adds the dependencies and does not link, which causes linking errors at compilation time.

This commit was SVN r22405.
2010-01-14 18:10:20 +00:00
Josh Hursey
0ba58cfcce One more argv/argc fix in tests
This commit was SVN r22270.
2009-12-07 14:40:38 +00:00
Jeff Squyres
a7ca4050b5 Doh! Missed these when adding &argc,&argv.
This commit was SVN r22261.
2009-12-04 02:30:34 +00:00
Brian Barrett
fd39f466ce Remove elements previously removed from the real structures...
This commit was SVN r22241.
2009-11-30 00:36:26 +00:00
Rainer Keller
63e540366b - Include the datatype tests again
make distcheck works
   contrib/dist/make_tarball succeeds too
   make checks shows all 5 tests passing.

This commit was SVN r22163.
2009-10-28 23:19:04 +00:00
Ralph Castain
214e26b539 Per Jeff (this work was done on a branch of mine, so I will do the commit):
Re-enable "./autogen.sh -no-ompi" again. If you -no-ompi, the entire OMPI
configury is skipped and the entire ompi/ subtree is not built. There's
some simple m4-isms that prune out the relevant parts.

I added ompi/config/, orte/config/, and opal/config/ directories. I moved a
bunch of m4 files from the top-level config/ dir into ompi/config/, and a few
into orte/config/.

Note that all 3 <project>/config directories have a config_files.m4 file. This
file contains the AC_CONFIG_FILES list for that project. The AC_CONFIG_FILES
call cannot be in an AC_DEFUN macro and conditionally called -- if it is
included at all, Autoconf will process it. Hence, these config_files.m4 files
don't AC_DEFUN -- they just have AC_CONFIG_FILES. m4_ifdef() is used to
conditionally include the files or not.

I moved a bunch of obvious OMPI-only m4 files from config/ to ompi/config/,
but I'm sure that there's more that could go. A ticket will be filed with
thoughts on future work in this area.

This commit was SVN r22113.
2009-10-20 23:44:20 +00:00
Ralph Castain
9b47a46ed9 Eliminate the datatype test until someone can fix it so that make_tarball can work
This commit was SVN r21933.
2009-09-03 16:40:05 +00:00
Rainer Keller
8e1b23779f - Replace combinations of
#if defined (c_plusplus)
          defined (__cplusplus)
   followed by
      extern "C" {
   and the closing counterpart by BEGIN_C_DECLS and END_C_DECLS.

   Notable exceptions are:
    - opal/include/opal_config_bottom.h:
      This is our generated code, that itself defines BEGIN_C_DECL and
      END_C_DECL
    - ompi/mpi/cxx/mpicxx.h:
      Here we do not include opal_config_bottom.h:
    - Belongs to external code:
      opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.c
      opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.h
    - opal/include/opal/prefetch.h:
      Has C++ specific macros that are protected:

    - Had #if ... } #endif  _and_ END_C_DECLS (aka end up with 2x
      END_C_DECLS)
      ompi/mca/btl/openib/btl_openib.h
    - opal/event/event.h has #ifdef __cplusplus as BEGIN_C_DECLS...
    - opal/win32/ompi_process.h: had extern "C"\n {...
      opal/win32/ompi_process.h: ditto
    - ompi/mca/btl/pcie/btl_pcie_lex.l: needed to add *_C_DECLS
      ompi/mpi/f90/test/align_c.c: ditto
    - ompi/debuggers/msgq_interface.h: used #ifdef __cplusplus
    - ompi/mpi/f90/xml/common-C.xsl: Amend

   Tested on linux using --with-openib and --with-mx

   The following do not contain either opal_config.h, orte_config.h or
   ompi_config.h
   (but possibly other header files, that include one of the above):
      ompi/mca/bml/r2/bml_r2_ft.h
      ompi/mca/btl/gm/btl_gm_endpoint.h
      ompi/mca/btl/gm/btl_gm_proc.h
      ompi/mca/btl/mx/btl_mx_endpoint.h
      ompi/mca/btl/ofud/btl_ofud_endpoint.h
      ompi/mca/btl/ofud/btl_ofud_frag.h
      ompi/mca/btl/ofud/btl_ofud_proc.h
      ompi/mca/btl/openib/btl_openib_mca.h
      ompi/mca/btl/portals/btl_portals_endpoint.h
      ompi/mca/btl/portals/btl_portals_frag.h
      ompi/mca/btl/sctp/btl_sctp_endpoint.h
      ompi/mca/btl/sctp/btl_sctp_proc.h
      ompi/mca/btl/tcp/btl_tcp_endpoint.h
      ompi/mca/btl/tcp/btl_tcp_ft.h
      ompi/mca/btl/tcp/btl_tcp_proc.h
      ompi/mca/btl/template/btl_template_endpoint.h
      ompi/mca/btl/template/btl_template_proc.h
      ompi/mca/btl/udapl/btl_udapl_eager_rdma.h
      ompi/mca/btl/udapl/btl_udapl_endpoint.h
      ompi/mca/btl/udapl/btl_udapl_mca.h
      ompi/mca/btl/udapl/btl_udapl_proc.h
      ompi/mca/mtl/mx/mtl_mx_endpoint.h
      ompi/mca/mtl/mx/mtl_mx.h
      ompi/mca/mtl/psm/mtl_psm_endpoint.h
      ompi/mca/mtl/psm/mtl_psm.h
      ompi/mca/pml/cm/pml_cm_component.h
      ompi/mca/pml/csum/pml_csum_comm.h
      ompi/mca/pml/dr/pml_dr_comm.h
      ompi/mca/pml/dr/pml_dr_component.h
      ompi/mca/pml/dr/pml_dr_endpoint.h
      ompi/mca/pml/dr/pml_dr_recvfrag.h
      ompi/mca/pml/example/pml_example.h
      ompi/mca/pml/ob1/pml_ob1_comm.h
      ompi/mca/pml/ob1/pml_ob1_component.h
      ompi/mca/pml/ob1/pml_ob1_endpoint.h
      ompi/mca/pml/ob1/pml_ob1_rdmafrag.h
      ompi/mca/pml/ob1/pml_ob1_recvfrag.h
      ompi/mca/pml/v/pml_v_output.h
      opal/include/opal/prefetch.h
      opal/mca/timer/aix/timer_aix.h
      opal/util/qsort.h
      test/support/components.h
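
   For readers unfamiliar with the idiom, the replacement described in this
   commit can be sketched as below. The macro definitions follow the common
   libtool-style pattern and are an assumption; the definitions generated
   into opal_config_bottom.h may differ. `sketch_add` is a hypothetical
   function used only for illustration.

```c
/* Assumed definitions: expand to extern "C" { ... } under C++, to nothing under C. */
#undef BEGIN_C_DECLS
#undef END_C_DECLS
#if defined(c_plusplus) || defined(__cplusplus)
#  define BEGIN_C_DECLS extern "C" {
#  define END_C_DECLS   }
#else
#  define BEGIN_C_DECLS
#  define END_C_DECLS
#endif

BEGIN_C_DECLS

/* Hypothetical declaration that both C and C++ callers can link against. */
int sketch_add(int a, int b);

END_C_DECLS

/* Definition of the hypothetical function declared above. */
int sketch_add(int a, int b)
{
    return a + b;
}
```

   A header written this way needs only the two macro lines around its
   declarations instead of repeating the #if/extern "C" block in every file.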

This commit was SVN r21855.

The following SVN revision numbers were found above:
  r2 --> open-mpi/ompi@58fdc18855
2009-08-20 11:42:18 +00:00
Rainer Keller
ddaee48680 - Hmm, make check ran, but make distcheck did not know about opal_ddt_lib.c.
This commit was SVN r21665.
2009-07-14 14:45:55 +00:00
Rainer Keller
6c5532072a - Split the datatype engine into two parts: an MPI-specific part in OMPI
   and a language-agnostic part in OPAL. The convertor is completely
   moved into OPAL.  This offers several benefits as described in RFC
   http://www.open-mpi.org/community/lists/devel/2009/07/6387.php
   namely:
    - Fewer basic types (int* and float* types, boolean and wchar)
    - Fixing naming scheme to ompi-nomenclature.
    - Usability outside of the ompi-layer.
 - Due to the fixed nature of simple opal types, their information is
   completely known at compile time and therefore constified
 - With fewer datatypes (22), the actual sizes of bit-field types may be
   reduced from 64 to 32 bits, allowing reorganizing the opal_datatype
   structure, eliminating holes and keeping data required in convertor
   (upon send/recv) in one cacheline...
   This has implications for the convertor data structure and other parts
   of the code.
 - Several performance tests have been run; the netpipe latency does not
   change with this patch on Linux/x86-64 on the smoky cluster.
 - Extensive tests have been done to verify correctness (no new
   regressions) using:
   1. mpi_test_suite on linux/x86-64 using clean ompi-trunk and
      ompi-ddt:
      a. running both trunk and ompi-ddt resulted in no differences
         (except that MPI_SHORT_INT and MPI_TYPE_MIX_LB_UB now run
         correctly).
      b. with --enable-memchecker and running under valgrind (one buglet
         when run with static found in test-suite, committed)
   2. ibm testsuite on linux/x86-64 using clean ompi-trunk and ompi-ddt:
      all passed (except the dynamic/ tests, which failed as on trunk/MTT)
   3. compilation and usage of HDF5 tests on Jaguar using PGI and
      PathScale compilers.
   4. compilation and usage on Scicortex.
 - Please note that for the heterogeneous case (-m32 compiled
   binaries/ompi), neither the ompi-trunk nor the ompi-ddt branch would
   successfully launch.

This commit was SVN r21641.
2009-07-13 04:56:31 +00:00
Ralph Castain
f966d9f972 Fix visibility issues with opal_graph functions.
Fix the carto test so it can compile - need to update input file so it can run

This commit was SVN r21403.
2009-06-09 15:02:57 +00:00
Rainer Keller
fc65875542 - As in r21238, do not use printf %z for size_t...
This commit was SVN r21239.

The following SVN revision numbers were found above:
  r21238 --> open-mpi/ompi@b2f8095ba7
2009-05-14 14:11:31 +00:00
Greg Koenig
60485ff95f This is a very large change to rename several #define values from
OMPI_* to OPAL_*.  This allows the opal layer to be used more independently
of the rest of ompi.

NOTE: 9 "svn mv" operations immediately follow this commit.

This commit was SVN r21180.
2009-05-06 20:11:28 +00:00
Rainer Keller
7663fb47f0 - In the included headers, string.h is missing.
- For size_t, C99/POSIX offers the %z length modifier; use it to get rid
   of a warning (or the need to cast...)
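
   A minimal sketch of the two ways of printing a size_t that this commit
   and the later r21238/r21239 go back and forth between: the C99 `z` length
   modifier, and an explicit cast for compilers that lack it. The helper
   names are hypothetical and are not taken from the Open MPI tests.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a size_t using the C99 'z' length modifier. */
int sketch_format_size_c99(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, "%zu", value);
}

/* Fallback for older compilers: cast explicitly and use %lu. */
int sketch_format_size_cast(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, "%lu", (unsigned long) value);
}
```

   The cast variant avoids the format warning everywhere, but it can
   truncate on platforms where unsigned long is narrower than size_t.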

This commit was SVN r21165.
2009-05-05 15:42:31 +00:00
Ralph Castain
e1673778be Replace missing headers
This commit was SVN r21136.
2009-05-01 15:09:10 +00:00
Jeff Squyres
80a1ae45ba Add missing header
This commit was SVN r21122.
2009-04-30 11:36:35 +00:00
Rainer Keller
221fb9dbca ... Delayed due to notifier commits earlier this day ...
- Delete unnecessary header files using
   contrib/check_unnecessary_headers.sh after applying
   patches, that include headers, being "lost" due to
   inclusion in one of the now deleted headers...

   In total 817 files are touched.
   In ompi/mpi/c/ header files are moved up into the actual c-file,
   where necessary (these are the only additional #include),
   otherwise it is only deletions of #include (apart from the above
   additions required due to notifier...)

 - To get different MCAs (OpenIB, TM, ALPS), an earlier version was
   successfully compiled (yesterday) on:
   Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled
   Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled
   Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled

This commit was SVN r21096.
2009-04-29 01:32:14 +00:00
Shiqing Fan
3d4e0472d6 Add windows support files into the tarball, including .windows, CMakeLists.txt files, and CMake modules. Thanks to Jeff for testing it on Linux.
This commit was SVN r21069.
2009-04-24 16:39:33 +00:00
Rainer Keller
ec0ed48718 - Revert r20739
This commit was SVN r20742.

The following SVN revision numbers were found above:
  r20739 --> open-mpi/ompi@781caee0b6
2009-03-05 21:56:03 +00:00
Rainer Keller
781caee0b6 - First of two or three patches, in orte/util/proc_info.h:
Adapt orte_process_info to orte_proc_info, and
   change orte_proc_info() to orte_proc_info_init().
 - Compiled on linux-x86-64
 - Discussed with Ralph

This commit was SVN r20739.
2009-03-05 20:36:44 +00:00
Jeff Squyres
8fe40fb4a1 r20701 was a lie; we ''do'' need the libraries when compiling in debug
mode, because some functions are not inlined.

This commit was SVN r20736.

The following SVN revision numbers were found above:
  r20701 --> open-mpi/ompi@b440c92455
2009-03-05 15:30:50 +00:00
Ralph Castain
1d4bbee096 Fix bitmap test so make tarball can succeed
This commit was SVN r20713.
2009-03-04 12:26:45 +00:00
Rainer Keller
811f2bd9b4 - As discussed on RFC, move the ompi_bitmap to the
opal layer.
   Add a check against a maximum (actually get rid of ifs internally to
   opal_bitmap.c) -- the functionality to set the current maximum size
   opal_bitmap_set_max_size() is currently only used in attribute.c
   to set the maximum OMPI_FORTRAN_HANDLE_MAX...

   Tested on linux/x86-64 with intel-tests with all_tests_no_perf_f
   run with 6 procs.
   Let's look into MTT as well...

This commit was SVN r20708.
2009-03-03 22:25:13 +00:00
Jeff Squyres
b440c92455 We don't need to link against any of the OMPI libraries; this test
just slurps in .h files.

This commit was SVN r20701.
2009-03-03 17:06:46 +00:00
Shiqing Fan
2326f14be5 Remove the unnecessary PROJECT command, I somehow misunderstood how it should be used on Windows....
This commit was SVN r20634.
2009-02-25 16:07:43 +00:00
Terry Dontje
0178b6c45f Added padding to predefined handle structures to maintain library
version-to-version compatibility.
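
The padding idea can be sketched as follows, assuming a wrapper struct that
reserves a fixed footprint around an internal object. All names and the
128-byte figure are hypothetical; the actual Open MPI structures and pad
sizes are not shown in this log.

```c
#include <stddef.h>

/* Hypothetical internal object whose size may change between releases. */
struct sketch_comm {
    int id;
    void *internals;
};

/* Assumed fixed footprint reserved for the predefined handle. */
#define SKETCH_PREDEFINED_PAD 128

/* Padded wrapper: its size stays fixed as long as the inner struct fits. */
struct sketch_predefined_comm {
    struct sketch_comm comm;
    char padding[SKETCH_PREDEFINED_PAD - sizeof(struct sketch_comm)];
};

/* Compile-time guard: the array size becomes negative (a compile error)
 * if the inner object ever outgrows the reserved footprint. */
typedef char sketch_pad_check[
    (sizeof(struct sketch_comm) < SKETCH_PREDEFINED_PAD) ? 1 : -1];
```

Because applications link against the padded symbol, fields can be added to
the inner struct in a later release without changing the size the
application was compiled against.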

This commit was SVN r20627.
2009-02-24 17:17:33 +00:00
Eugene Loh
463f11f993 Improve shared-memory allocation:
* compute mmap-file size more wisely and pass requested size to allocator
* change MCA parameters:
  - get rid of mpool_sm_per_peer_size
  - get rid of mpool_sm_max_size
  - set default mpool_sm_min_size to 0
* no longer pad sm allocations to page boundaries
* have sm_btl_first_time_init check return codes on free-list creations

Have mca_btl_sm_prepare_src() check to see if it can allocate an EAGER fragment
rather than a MAX fragment if the smaller size works.

Remove ompi/class/ompi_[circular_buffer_]fifo.h and references thereto.

Remove opal/util/pow2.[c|h] and references thereto.

This commit was SVN r20614.
2009-02-20 19:51:57 +00:00
George Bosilca
e0638c84c8 Update the test to check that all data is exposed via the
convertor_raw interface.

This commit was SVN r20383.
2009-01-28 23:07:02 +00:00
George Bosilca
ecdcda9268 Move the datatype creation functions outside the test itself.
Add a test for the newly added raw functionality.

This commit was SVN r20374.
2009-01-28 15:42:30 +00:00
Shiqing Fan
a5281f0434 - 1/4 commit for Windows Visual Studio and CCP support:
CMakeLists and .windows files.
  In contribs preconfigured and precompiled parts.

This commit was SVN r20108.
2008-12-10 20:59:20 +00:00
Kenneth Matney
94f8189532 Under gcc 4.2.4, make check was failing without the <stdio.h>.
Moreover, I could not figure out why <time.h> would need to be
included twice.  So, I substituted the former for the latter,
in the superfluous instantiation.

This commit was SVN r19859.
2008-10-31 12:18:57 +00:00
Kenneth Matney
68248a32ef Add #include for stdio.h to allow make check to run with gcc 4.2.4 (on
Cray XT platform).

This commit was SVN r19605.
2008-09-22 18:00:30 +00:00
George Bosilca
2bd9ddfc28 The datatype dump function is always visible so we don't need a
fake one.

This commit was SVN r19158.
2008-08-05 14:45:42 +00:00
George Bosilca
e6f700bf04 Reenable the ddt_test as #1242 is now closed.
This commit was SVN r19145.
2008-08-04 15:57:02 +00:00
Brian Barrett
8cff3131d6 Remove memory tests, as they're out of date
This commit was SVN r18656.
2008-06-14 14:01:05 +00:00
Jeff Squyres
1f226b5898 Adjust the comment to be correct, per
http://www.open-mpi.org/community/lists/devel/2008/06/4095.php.

This commit was SVN r18604.
2008-06-06 01:23:58 +00:00
Ralph Castain
7c7b9b0486 Do a little cleanup on the opal graph class and opal carto framework to conform to OMPI naming conventions and avoid potential conflict with user applications - no change in functionality, passes carto test program
This commit was SVN r18407.
2008-05-07 19:33:49 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Jeff Squyres
0fbb399f13 Remove ddt_test from "make check" per #1242.
This commit was SVN r17818.
2008-03-14 14:21:47 +00:00
Jeff Squyres
4133b46ec5 Re-enable "make dist", at least until #1232 is fixed.
This commit was SVN r17796.
2008-03-09 21:36:10 +00:00
Jeff Squyres
498190e326 Add checks to ensure that opal_init() completes successfully so that
we fail gracefully (and don't segv) if opal_init() fails.
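
A minimal sketch of the kind of check described, assuming an initialization
call that returns a nonzero code on failure. `sketch_init` is a hypothetical
stand-in for opal_init(), which cannot be called outside the Open MPI tree;
the real return codes and signature are not reproduced here.

```c
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_SUCCESS 0

/* Hypothetical stand-in for the real initialization call. */
static int sketch_init(int should_fail)
{
    return should_fail ? -1 : SKETCH_SUCCESS;
}

/* Run a test body only if initialization succeeded. */
int sketch_run_test(int should_fail)
{
    int rc = sketch_init(should_fail);
    if (SKETCH_SUCCESS != rc) {
        /* Report and bail out instead of touching uninitialized state. */
        fprintf(stderr, "init failed (rc=%d); aborting test\n", rc);
        return EXIT_FAILURE;
    }
    /* ... the actual test body would go here ... */
    return EXIT_SUCCESS;
}
```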

This commit was SVN r17760.
2008-03-06 14:55:32 +00:00
Tim Prins
2e1bda6d23 Remove the now-unused arithmetic interface to the dss
This commit was SVN r17654.
2008-02-28 21:36:51 +00:00
Ralph Castain
d70e2e8c2b Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately.
Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer

This commit was SVN r17632.
2008-02-28 01:57:57 +00:00
Jeff Squyres
04e026fa98 Fix "make check"; manually include <string.h> since the datatype
header files were re-orged to have fewer dependencies

This commit was SVN r17427.
2008-02-12 13:02:53 +00:00
Shiqing Fan
f5792bbda5 merging the memchecker into trunk.
This commit was SVN r17424.
2008-02-12 08:46:27 +00:00
Sharon Melamed
98e8de264d Wrapped the carto API in carto_base_wrapers.c
This commit was SVN r17380.
2008-02-05 19:29:16 +00:00
Sharon Melamed
025b68becf Move the carto framework to the trunk.
This commit was SVN r17177.
2008-01-23 09:20:34 +00:00
George Bosilca
7eca186568 Fix a typo related to the conversion from ompi_pointer_array
to opal_pointer_array.

This commit was SVN r17023.
2007-12-22 05:32:40 +00:00
George Bosilca
906e8bf1d1 Replace the ompi_pointer_array with opal_pointer_array. In the next step
(sometime after the merge with the ORTE branch), the opal_pointer_array
will become the only pointer_array implementation (the orte_pointer_array
will be removed).

This commit was SVN r17007.
2007-12-21 06:02:00 +00:00
Rich Graham
27a748e7eb change all instances of ompi_free_list_init to ompi_free_list_init_new. Header
and payload data are specified separately at this stage.

This commit was SVN r16633.
2007-11-01 23:38:50 +00:00
Ralph Castain
54b2cf747e These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.

This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:

As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.

In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.

The incoming changes revamp these procedures in three ways:

1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.

The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.

Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.


2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.

The size of this data has been reduced in three ways:

(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.

To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.

(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.

(c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request the FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
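
The field-size change in (a) can be pictured as two struct layouts. This is a sketch only: the type names are hypothetical, the signed 16-bit fields are an assumption chosen to match the "32k" limit quoted above, and the actual ORTE definitions are not shown in this message.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed default layout after the change: 16-bit jobid and vpid. */
struct sketch_name16 {
    int16_t jobid;
    int16_t vpid;
};

/* Assumed layout with 32-bit fields, as selected by --enable-jumbo-apps. */
struct sketch_name32 {
    int32_t jobid;
    int32_t vpid;
};

/* Bytes saved for each process name that is packed into a message. */
size_t sketch_bytes_saved_per_name(void)
{
    return sizeof(struct sketch_name32) - sizeof(struct sketch_name16);
}
```

Under these assumptions each name shrinks from 8 bytes to 4, and that saving is repeated for every name carried in the modex.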

While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.


3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.

It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
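
The arithmetic in the previous paragraph can be reproduced with a short sketch. The function names are hypothetical, and "32k" is taken as 32,000 here because that is the reading that reproduces the quoted figures.

```c
#include <stdint.h>

/* Bytes of contact info in one startup message: one entry per process. */
uint64_t sketch_message_bytes(uint64_t nprocs, uint64_t bytes_per_proc)
{
    return nprocs * bytes_per_proc;
}

/* Total traffic when that message is delivered to every process. */
uint64_t sketch_total_bytes(uint64_t nprocs, uint64_t bytes_per_proc)
{
    return nprocs * sketch_message_bytes(nprocs, bytes_per_proc);
}
```

With nprocs = 32000 and 50 bytes per process, the message is 1,600,000 bytes (1.6 MBytes) and the total is 51,200,000,000 bytes (about 51 GBytes), matching the estimate above.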

Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.


There are a few minor additional changes in the commit that I'll just note in passing:

* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.

* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.

* cleanup of some stale header files

This commit was SVN r16364.
2007-10-05 19:48:23 +00:00
Shiqing Fan
0f468f3668 - Remove the solution and project files, will commit them later.
This commit was SVN r15705.
2007-07-31 17:07:02 +00:00