Commit Graph

15477 Commits

Author SHA1 Message Date
Jeff Squyres
94356e98d4 Fix from Nikolay Piskun at Rogue Wave (TotalView) -- fixes the case
where MPI jobs are launched directly via srun (i.e., there's no HNP).

This commit was SVN r24376.
2011-02-14 19:03:53 +00:00
Abhishek Kulkarni
93d28a5792 Change opal_err2str_fn_t to return the error string as an argument.
This means that the converters (opal_err2str, orte_err2str) can now
return NULL as a "silent error". The return value of opal_err2str_fn_t
is the status of the operation (OPAL_SUCCESS or OPAL_ERROR).

This fixes the "Unknown error" message issues on the trunk.

This commit was SVN r24371.
2011-02-13 16:09:17 +00:00
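A minimal sketch of the calling convention described in 93d28a5792, where the status is the return value and the string comes back through an argument. All names here (demo_err2str_fn_t, DEMO_SUCCESS, DEMO_ERR_SILENT, and so on) are hypothetical stand-ins; the actual opal_err2str_fn_t signature and error codes may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes and error values, for illustration only. */
enum { DEMO_SUCCESS = 0, DEMO_ERROR = -1 };
enum { DEMO_ERR_OUT_OF_RESOURCE = -2, DEMO_ERR_SILENT = -100 };

/* Converter type: the return value is the status of the lookup;
 * the error string is handed back through *errmsg. */
typedef int (*demo_err2str_fn_t)(int errnum, const char **errmsg);

int demo_err2str(int errnum, const char **errmsg)
{
    assert(NULL != errmsg);
    switch (errnum) {
    case DEMO_ERR_OUT_OF_RESOURCE:
        *errmsg = "Out of resource";
        return DEMO_SUCCESS;
    case DEMO_ERR_SILENT:
        *errmsg = NULL;       /* known code, deliberately no message */
        return DEMO_SUCCESS;
    default:
        *errmsg = NULL;       /* this converter does not know the code */
        return DEMO_ERROR;
    }
}

/* Caller: a NULL string with a success status is a "silent error";
 * only a failed lookup falls back to "Unknown error". */
const char *demo_strerror(demo_err2str_fn_t fn, int errnum)
{
    const char *msg = NULL;

    if (DEMO_SUCCESS != fn(errnum, &msg)) {
        return "Unknown error";
    }
    return msg;
}
```

Separating the status from the string lets a caller tell "no message wanted" apart from "code not recognized", which a single returned pointer cannot express.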
Ralph Castain
33b68132cc Update the rmcast framework
This commit was SVN r24370.
2011-02-12 16:52:03 +00:00
Mike Dubman
81222e1fe7 * fix PGI compiler support which does not have __BASE_FILE__ macro
This commit was SVN r24369.
2011-02-10 06:42:37 +00:00
Ethan Mallove
c6fd141923 missing include
This commit was SVN r24368.
2011-02-09 17:59:55 +00:00
Josh Hursey
a9335ea423 Make sure to initialize the 'update_state' function for the default module.
This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB.

This commit was SVN r24365.
2011-02-08 20:42:32 +00:00
Nysal Jan
92e06b0a1f Missed this change suggested by Terry
This commit was SVN r24364.
2011-02-08 04:06:52 +00:00
Nysal Jan
a31025bb48 Fix pty setup code on AIX
This commit was SVN r24363.
2011-02-08 02:54:47 +00:00
Nysal Jan
f0f1d4e311 Older versions of config.guess detect the canonical system name of an AIX 7.1 system to be rs6000-ibm-aix. Add this workaround until AIX 7.1 support is available in the autotools releases
This commit was SVN r24362.
2011-02-08 02:52:10 +00:00
Jeff Squyres
b0ce9bae8e Oops. Also need to remove myriexpress.h from the Makefile.am.
This commit was SVN r24357.
2011-02-04 03:29:49 +00:00
Eugene Loh
cd5c2e794f Some minor changes to help the openib BTL build and run on Solaris:
- poll() can return POLLRDNORM even if not requested (Solaris bug)
- MIN macro not defined in btl_openib.c
  and while we're at it, we clean up the MIN definition in ad_bgl_pset.h
- btl_openib_connect_rdmacm.c was calling rdma_destroy_id() twice
  leading to undefined behavior (a hang on Solaris)

This commit was SVN r24356.
2011-02-03 23:53:21 +00:00
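One way to tolerate a poll() that reports events which were never requested, as in the first item of cd5c2e794f, is to mask revents before acting on it. This is a sketch under that assumption, not the change actually made in the openib BTL; demo_filter_revents is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <poll.h>

/* Hypothetical helper: keep only the events that were requested, plus
 * the conditions poll() always reports regardless of 'events'
 * (POLLERR, POLLHUP, POLLNVAL). */
short demo_filter_revents(const struct pollfd *pfd)
{
    const short always = POLLERR | POLLHUP | POLLNVAL;

    assert(NULL != pfd);
    return (short)(pfd->revents & (pfd->events | always));
}
```

Code that compares revents for equality with a single flag is the kind that breaks when an extra bit such as POLLRDNORM shows up; masking first, or testing individual bits, avoids that.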
Abhishek Kulkarni
d711c5a4b1 SOS fix for the Studio compilers (Thanks to Terry for spotting this).
This commit was SVN r24355.
2011-02-03 22:36:28 +00:00
Jeff Squyres
6421abecc7 Fixes trac:2690.
Temporarily remove hwloc's internal version of myriexpress.h.  It is
causing a problem when compiling Open MPI with MX support because
hwloc uses AC_CONFIG_HEADER in hwloc's hwloc.m4 to generate
opal/mca/paffinity/hwloc/hwloc/include/hwloc/config.h.
AC_CONFIG_HEADER apparently has the (undocumented) side effect of
adding -I$(top_builddir)/opal/mca/paffinity/hwloc/hwloc/include/hwloc
to OMPI's compilation flags.  Hence, when the OMPI MX components are
compiled and #include "myriexpress.h" (or <myriexpress.h>) they see
hwloc's myriexpress.h before the system one.  Badness ensues.

This removal is temporary because we need to figure out a better
solution.  But for now, OMPI is not using hwloc's myriexpress.h file --
so it's safe to remove.  I'll push this issue upstream to hwloc to
figure out a better solution...

This commit was SVN r24354.

The following Trac tickets were found above:
  Ticket 2690 --> https://svn.open-mpi.org/trac/ompi/ticket/2690
2011-02-03 14:24:32 +00:00
Jeff Squyres
c0acc75ce0 Update copyrights and clarify README.txt.
This commit was SVN r24348.
2011-02-02 17:25:56 +00:00
Jeff Squyres
0de4f4c35a Oops -- forgot to make the README.txt be version-neutral.
This commit was SVN r24347.
2011-02-02 17:14:41 +00:00
Jeff Squyres
2755a9f261 As discussed here:
http://www.open-mpi.org/community/lists/devel/2011/01/8894.php
    http://blogs.cisco.com/performance/building-3rd-party-open-mpi-components/

Contribute a sample of how to build MCA components outside of the Open
MPI source tree.

This commit was SVN r24346.
2011-02-02 17:11:33 +00:00
Jeff Squyres
0d08f636b0 Add 1.5.2 items
This commit was SVN r24344.
2011-02-02 15:19:10 +00:00
Jeff Squyres
e388450e98 Add 1.5.2 items.
This commit was SVN r24343.
2011-02-02 15:17:30 +00:00
Jeff Squyres
9306bf259b Make it a little more friendly towards svn+hg trees
This commit was SVN r24341.
2011-02-02 14:41:36 +00:00
Nysal Jan
ab2f738b0b Recent versions of IBM XL compilers on AIX support GCC inline assembly format
This commit was SVN r24340.
2011-02-02 11:31:30 +00:00
Nysal Jan
3a8d251daa vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog
This commit was SVN r24339.
2011-02-02 10:05:57 +00:00
Jeff Squyres
4674e62929 These files are superfluous.
This commit was SVN r24331.
2011-02-01 21:31:35 +00:00
Jeff Squyres
c8badb79df Don't instantiate variables in for loops; we don't assume C99
compilers. 

This commit was SVN r24330.
2011-02-01 19:23:14 +00:00
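The rule in c8badb79df can be illustrated as follows. The function is a hypothetical example, not code from the tree; it only shows the loop counter declared at the top of the block, which pre-C99 compilers require.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum the first n entries of values. */
int demo_sum(const int *values, size_t n)
{
    size_t i;        /* declared at the top of the block */
    int total = 0;

    assert(NULL != values || 0 == n);
    for (i = 0; i < n; ++i) {   /* not: for (size_t i = 0; ...) */
        total += values[i];
    }
    return total;
}
```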
Jeff Squyres
ddcbfa6af0 Fix some fairly-important typos (!)
This commit was SVN r24328.
2011-02-01 13:18:01 +00:00
Jeff Squyres
f015f885f6 Fix datatype variable names so that PGI builds stop failing in MTT.
This commit was SVN r24327.
2011-01-31 19:12:33 +00:00
Josh Hursey
fa3f6485d8 Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around.
This commit was SVN r24326.
2011-01-31 19:09:47 +00:00
Josh Hursey
5b58ff0663 Fix a C/R checkpoint->restart->checkpoint->restart case.
The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint.

This seems to be fixed now. Thanks to Alex Brick for the bug report.

This commit was SVN r24325.
2011-01-28 21:25:14 +00:00
Eugene Loh
45b222ecec Correct some subtle PTRHEAD_ typos (should be PTHREAD_) in
config/ompi_config_pthreads.m4.  Terry pointed them out.
Mostly just aix/freebsd.

This commit was SVN r24324.
2011-01-28 21:05:40 +00:00
Jeff Squyres
b3a22bbe82 Add note about mpirun's --debug switch multi-token fix.
This commit was SVN r24323.
2011-01-28 13:38:23 +00:00
Jeff Squyres
ec3d18dc9f As noted on the mailing list by Gabriele Fatigati
(http://www.open-mpi.org/community/lists/users/2011/01/15427.php), the
--tv (and friends) switches to mpirun would effectively munge the
orterun command line together and then split it apart again before
exec'ing the underlying debugger.  We would therefore lose multi-token
argv[x] values and split them into multiple tokens.  For example:

   mpirun --tv -np 2 a.out "foo bar"

would get launched with "foo" and "bar" as separate arguments; not one
argument.  This was due to the underlying code joining the argv into a
single string and then re-splitting it.  This commit removed the argv
join; it now does the parsing and re-jiggering of the argv by only
looking at each individual argv item; multi-word tokens like "foo bar"
will never be split into separate tokens.

This commit was SVN r24322.
2011-01-28 13:01:06 +00:00
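The failure mode described in ec3d18dc9f can be reproduced in a few lines. This is a sketch with hypothetical helpers (demo_join, demo_count_tokens, demo_substitute), not the orterun code: joining an argv into one string and splitting it on spaces loses token boundaries, while rewriting entries one at a time cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Join argv[0..argc-1] with single spaces into buf.
 * Returns 0 on success, -1 if buf is too small. */
int demo_join(int argc, const char *const argv[], char *buf, size_t buflen)
{
    size_t used = 0;
    int i;

    assert(NULL != buf && buflen > 0);
    buf[0] = '\0';
    for (i = 0; i < argc; ++i) {
        size_t len = strlen(argv[i]);
        size_t need = len + (i > 0 ? 1 : 0);

        if (used + need + 1 > buflen) {
            return -1;
        }
        if (i > 0) {
            buf[used++] = ' ';
        }
        strcpy(buf + used, argv[i]);
        used += len;
    }
    return 0;
}

/* Count space-separated tokens: what a naive re-split would produce. */
int demo_count_tokens(const char *s)
{
    int count = 0;
    int in_token = 0;

    for (; '\0' != *s; ++s) {
        if (' ' == *s) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;
            ++count;
        }
    }
    return count;
}

/* Per-item approach: replace whole entries that match 'from'.
 * argc never changes, so "foo bar" stays one argument. */
int demo_substitute(int argc, const char *argv[],
                    const char *from, const char *to)
{
    int i;
    int replaced = 0;

    for (i = 0; i < argc; ++i) {
        if (0 == strcmp(argv[i], from)) {
            argv[i] = to;
            ++replaced;
        }
    }
    return replaced;
}
```

With the argv {"a.out", "foo bar"}, the joined string is "a.out foo bar" and a re-split sees three tokens, whereas the per-item pass leaves the array at two entries.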
Nysal Jan
42015cf30a Fix build failure on AIX
This commit was SVN r24321.
2011-01-28 08:09:45 +00:00
Nysal Jan
857c32784e Fix detection of fd_mask
This commit was SVN r24320.
2011-01-28 06:20:32 +00:00
George Bosilca
d457338f66 Force mips2 asm acceptance before sc and ll.
This commit was SVN r24319.
2011-01-27 22:42:26 +00:00
Nathan Hjelm
2605fc6a54 actually need pml = csum for these
This commit was SVN r24318.
2011-01-27 20:44:13 +00:00
Josh Hursey
8ec85c6b8f Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally.
I want to thank Hugo Meyer for reporting this/these bugs.

Notes:
 * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation).
 * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function.
 * When the HNP is hosting processes, make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery).
 * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup.
 * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not both the error and success messages. Printing both can be misleading.

This commit was SVN r24317.
2011-01-27 20:40:23 +00:00
Jeff Squyres
5bc2ad2b44 Fix some deprecated notices to refer to the correct new function names
This commit was SVN r24313.
2011-01-27 19:55:42 +00:00
Jeff Squyres
6c8de8fb76 Bump up to hwloc 1.1.1
This commit was SVN r24312.
2011-01-26 23:20:26 +00:00
Jeff Squyres
511f87665b Fixes trac:2680: Add ARM support.
This commit was SVN r24308.

The following Trac tickets were found above:
  Ticket 2680 --> https://svn.open-mpi.org/trac/ompi/ticket/2680
2011-01-26 17:22:44 +00:00
Josh Hursey
81fd41f811 Return an informative error message if the user requests a migration of a job that is not capable of it.
C/R Functionality cleanup

This commit was SVN r24307.
2011-01-26 15:36:34 +00:00
Josh Hursey
8f45fcb429 More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now.
 * Fix the checkpoint-restart-checkpoint case, which would previously reject the checkpoint of the newly restarted process. Re-enabling checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation).
 * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer.
 * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process.
 * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario.

This commit was SVN r24306.
2011-01-26 14:56:35 +00:00
Nathan Hjelm
8a3179cdcb removed c99 test code
This commit was SVN r24297.
2011-01-25 23:02:35 +00:00
Josh Hursey
66af515061 Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized.
Short Version:
--------------
Event engine needs to be flushed so it does not use old/stale file descriptors.

Long Version:
-------------
The problem was that the restarted process was waiting for the socket to the local daemon to finish establishing during the 'sync' operation. The core problem was that the daemon was sending a header of 36 bytes, but the restarted process only received 35 bytes of the message. So the restarted process became stuck waiting for the last byte to arrive.

After many hours of digging, I figured out that the event engine was using the same file descriptor for its evsig_cb functionality (to signal itself when a signal arrives). So when the daemon wrote into the new fd, the event engine was stealing the first byte (*shakes fist at event engine*) before the recv() could be posted.

The solution is to use the event_reinit() function on restart to re-establish the now-stale file descriptors in the event engine. This seems to have fixed the problem.


A few other minor things:
-------------------------
 * Add a check to make sure the event engine is balanced in its init/finalize
 * Add the opal_event_base_close() to the BLCR restart exec function (still not 100% sure it is needed, but there it is).

This commit was SVN r24296.
2011-01-25 22:43:47 +00:00
Josh Hursey
e4d13d338f Fix a couple of compiler warnings
This commit was SVN r24295.
2011-01-25 22:22:32 +00:00
Nysal Jan
72ba038309 Add workaround for a Libtool (<2.2.8) bug concerning IBM xlf compilers
This commit was SVN r24294.
2011-01-25 09:53:34 +00:00
George Bosilca
09f645f9a9 There is no need for the byte variable.
This commit was SVN r24293.
2011-01-24 22:41:04 +00:00
Jeff Squyres
30e164e246 Fix all the problems with "make distcheck" caused by the new ROMIO import so that we can finally get a trunk nightly tarball!
This commit was SVN r24292.
2011-01-24 21:10:14 +00:00
Nathan Hjelm
2ca55d54f7 use AC_PROG_CC_C99 to find flags to turn on c99 support. remove if mtt fails because of this.
This commit was SVN r24291.
2011-01-24 15:54:52 +00:00
Jeff Squyres
afa654746c Somehow this has been sitting, uncommitted, in a local checkout since
last December.  :-(

Add new MCA param: maffinity_libnuma_policy.  Thanks to David
Singleton for the suggestion.  Here's the help text about it:

{{{
   MCA maffinity: parameter "maffinity_libnuma_policy" (current value:
                  <loose>, data source: default value)
                  Binding policy that determines what happens if memory
                  is unavailable on the local NUMA node.  A value of
                  "strict" means that the memory allocation will fail;
                  a value of "loose" means that the memory allocation
                  will spill over to another NUMA node.
}}}

This commit was SVN r24290.
2011-01-24 14:39:16 +00:00
Jeff Squyres
272fe89252 Update svn:ignore
This commit was SVN r24289.
2011-01-24 14:15:24 +00:00
Jeff Squyres
1ea62f3bf6 Add svn:ignore
This commit was SVN r24288.
2011-01-24 14:15:07 +00:00