openmpi

Author	SHA1	Message	Date
Jeff Squyres	94356e98d4	Fix from Nikolay Piskun at Rogue Wave (TotalView) -- fixes the case where MPI jobs are launched directly via srun (i.e., there's no HNP). This commit was SVN r24376.	2011-02-14 19:03:53 +00:00
Abhishek Kulkarni	93d28a5792	Change opal_err2str_fn_t to return the error string as an argument. This means that the converters (opal_err2str, orte_err2str) can now return NULL as a "silent error". The return value of opal_err2str_fn_t is the status of the operation (OPAL_SUCCESS or OPAL_ERROR). This fixes the "Unknown error" message issues on the trunk. This commit was SVN r24371.	2011-02-13 16:09:17 +00:00
Ralph Castain	33b68132cc	Update the rmcast framework This commit was SVN r24370.	2011-02-12 16:52:03 +00:00
Mike Dubman	81222e1fe7	* fix PGI compiler support which does not have __BASE_FILE__ macro This commit was SVN r24369.	2011-02-10 06:42:37 +00:00
Ethan Mallove	c6fd141923	missing include This commit was SVN r24368.	2011-02-09 17:59:55 +00:00
Josh Hursey	a9335ea423	Make sure to initialize the 'update_state' function for the default module. This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB. This commit was SVN r24365.	2011-02-08 20:42:32 +00:00
Nysal Jan	92e06b0a1f	Missed this change suggested by Terry This commit was SVN r24364.	2011-02-08 04:06:52 +00:00
Nysal Jan	a31025bb48	Fix pty setup code on AIX This commit was SVN r24363.	2011-02-08 02:54:47 +00:00
Nysal Jan	f0f1d4e311	Older versions of config.guess detect the canonical system name of an AIX 7.1 system to be rs6000-ibm-aix. Add this workaround until AIX 7.1 support is available in the autotools releases This commit was SVN r24362.	2011-02-08 02:52:10 +00:00
Jeff Squyres	b0ce9bae8e	Oops. Also need to remove myriexpress.h from the Makefile.am. This commit was SVN r24357.	2011-02-04 03:29:49 +00:00
Eugene Loh	cd5c2e794f	Some minor changes to help the openib BTL build and run on Solaris: - poll() can return POLLRDNORM even if not requested (Solaris bug) - MIN macro not defined in btl_openib.c and while we're at it, we clean up the MIN definition in ad_bgl_pset.h - btl_openib_connect_rdmacm.c was calling rdma_destroy_id() twice leading to undefined behavior (a hang on Solaris) This commit was SVN r24356.	2011-02-03 23:53:21 +00:00
Abhishek Kulkarni	d711c5a4b1	SOS fix for the Studio compilers (Thanks to Terry for spotting this). This commit was SVN r24355.	2011-02-03 22:36:28 +00:00
Jeff Squyres	6421abecc7	Fixes trac:2690. Temporarily remove hwloc's internal version of myriexpress.h. It is causing a problem when compiling Open MPI with MX support because hwloc uses AC_CONFIG_HEADER in hwloc's hwloc.m4 to generate opal/mca/paffinity/hwloc/hwloc/include/hwloc/config.h. AC_CONFIG_HEADER apparently has the (undocumented) side effect of adding -I$(top_builddir)/opal/mca/paffinity/hwloc/hwloc/include/hwloc to OMPI's compilation flags. Hence, when the OMPI MX components are compiled and #include "myriexpress.h" (or <myriexpress.h>) they see hwloc's myriexpress.h before the system one. Badness ensues. This removal is temporary because we need to figure out a better solution. But for now, OMPI is not using hwloc's myriexpress.h file -- so it's safe to remove. I'll push this issue upstream to hwloc to figure out a better solution... This commit was SVN r24354. The following Trac tickets were found above: Ticket 2690 --> https://svn.open-mpi.org/trac/ompi/ticket/2690	2011-02-03 14:24:32 +00:00
Jeff Squyres	c0acc75ce0	Update copyrights and clarify README.txt. This commit was SVN r24348.	2011-02-02 17:25:56 +00:00
Jeff Squyres	0de4f4c35a	Oops -- forgot to make the README.txt be version-neutral. This commit was SVN r24347.	2011-02-02 17:14:41 +00:00
Jeff Squyres	2755a9f261	As discussed here: http://www.open-mpi.org/community/lists/devel/2011/01/8894.php http://blogs.cisco.com/performance/building-3rd-party-open-mpi-components/ Contribute a sample of how to build MCA components outside of the Open MPI source tree. This commit was SVN r24346.	2011-02-02 17:11:33 +00:00
Jeff Squyres	0d08f636b0	Add 1.5.2 items This commit was SVN r24344.	2011-02-02 15:19:10 +00:00
Jeff Squyres	e388450e98	Add 1.5.2 items. This commit was SVN r24343.	2011-02-02 15:17:30 +00:00
Jeff Squyres	9306bf259b	Make it a little more friendly towards svn+hg trees This commit was SVN r24341.	2011-02-02 14:41:36 +00:00
Nysal Jan	ab2f738b0b	Recent versions of IBM XL compilers on AIX support GCC inline assembly format This commit was SVN r24340.	2011-02-02 11:31:30 +00:00
Nysal Jan	3a8d251daa	vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog This commit was SVN r24339.	2011-02-02 10:05:57 +00:00
Jeff Squyres	4674e62929	These files are superfluous. This commit was SVN r24331.	2011-02-01 21:31:35 +00:00
Jeff Squyres	c8badb79df	Don't instantiate variables in for loops; we don't assume C99 compilers. This commit was SVN r24330.	2011-02-01 19:23:14 +00:00
Jeff Squyres	ddcbfa6af0	Fix some fairly-important typos (!) This commit was SVN r24328.	2011-02-01 13:18:01 +00:00
Jeff Squyres	f015f885f6	Fix datatype variable names so that PGI builds stop failing in MTT. This commit was SVN r24327.	2011-01-31 19:12:33 +00:00
Josh Hursey	fa3f6485d8	Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around. This commit was SVN r24326.	2011-01-31 19:09:47 +00:00
Josh Hursey	5b58ff0663	Fix a C/R checkpoint->restart->checkpoint->restart case. The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint. This seems to be fixed now. Thanks to Alex Brick for the bug report. This commit was SVN r24325.	2011-01-28 21:25:14 +00:00
Eugene Loh	45b222ecec	Correct some subtle PTRHEAD_ typos (should be PTHREAD_) in config/ompi_config_pthreads.m4. Terry pointed them out. Mostly just aix/freebsd. This commit was SVN r24324.	2011-01-28 21:05:40 +00:00
Jeff Squyres	b3a22bbe82	Add note about mpirun's --debug switch multi-token fix. This commit was SVN r24323.	2011-01-28 13:38:23 +00:00
Jeff Squyres	ec3d18dc9f	As noted on the mailing list by Gabriele Fatigati (http://www.open-mpi.org/community/lists/users/2011/01/15427.php), the --tv (and friends) switches to mpirun would effectively munge the orterun command line together and then split it apart again before exec'ing the underlying debugger. We would therefore lose multi-token argv[x] values and split them into multiple tokens. For example: mpirun --tv -np 2 a.out "foo bar" would get launched with "foo" and "bar" as separate arguments; not one argument. This was due to the underlying code joining the argv into a single string and then re-splitting it. This commit removed the argv join; it now does the parsing and re-jiggering of the argv by only looking at each individual argv item; multi-word tokens like "foo bar" will never be split into separate tokens. This commit was SVN r24322.	2011-01-28 13:01:06 +00:00
Nysal Jan	42015cf30a	Fix build failure on AIX This commit was SVN r24321.	2011-01-28 08:09:45 +00:00
Nysal Jan	857c32784e	Fix detection of fd_mask This commit was SVN r24320.	2011-01-28 06:20:32 +00:00
George Bosilca	d457338f66	Force mips2 asm acceptance before sc and ll. This commit was SVN r24319.	2011-01-27 22:42:26 +00:00
Nathan Hjelm	2605fc6a54	actually need pml = csum for these This commit was SVN r24318.	2011-01-27 20:44:13 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting this/these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Jeff Squyres	5bc2ad2b44	Fix some deprecated notices to refer to the correct new function names This commit was SVN r24313.	2011-01-27 19:55:42 +00:00
Jeff Squyres	6c8de8fb76	Bump up to hwloc 1.1.1 This commit was SVN r24312.	2011-01-26 23:20:26 +00:00
Jeff Squyres	511f87665b	Fixes trac:2680: Add ARM support. This commit was SVN r24308. The following Trac tickets were found above: Ticket 2680 --> https://svn.open-mpi.org/trac/ompi/ticket/2680	2011-01-26 17:22:44 +00:00
Josh Hursey	81fd41f811	Return an informative error message if the user requests a migration of a job that is not capable of it. C/R Functionality cleanup This commit was SVN r24307.	2011-01-26 15:36:34 +00:00
Josh Hursey	8f45fcb429	More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. * Fix the checkpoint-restart-checkpoint case which would previously reject the checkpoint of the newly restarted process. Making sure to re-enable checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation). * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer. * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process. * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario. This commit was SVN r24306.	2011-01-26 14:56:35 +00:00
Nathan Hjelm	8a3179cdcb	removed c99 test code This commit was SVN r24297.	2011-01-25 23:02:35 +00:00
Josh Hursey	66af515061	Fix C/R functionality with the new libtool. This fixes the case where the restarted process cannot be checkpointed or finalized. Short Version: -------------- Event engine needs to be flushed so it does not use old/stale file descriptors. Long Version: ------------- The problem was that the restarted process was waiting for the socket to the local daemon to finish establishing during the 'sync' operation. The core problem was that the daemon was sending a header of 36 bytes, but the restarted process only received 35 bytes of the message. So the restarted process became stuck waiting for the last byte to arrive. After many hours of digging, I figured out that the event engine was using the same file descriptor for its evsig_cb functionality (to signal itself when a signal arrives). So when the daemon wrote in to the new fd the event engine was stealing the first byte (shakes fist at event engine) before the recv() could be posted. The solution is to use the event_reinit() function on restart to re-establish the now-stale file descriptors in the event engine. This seems to have fixed the problem. A few other minor things: ------------------------- * Add a check to make sure the event engine is balanced in its init/finalize * Add the opal_event_base_close() to the BLCR restart exec function (still not 100% sure it is needed, but there it is). This commit was SVN r24296.	2011-01-25 22:43:47 +00:00
Josh Hursey	e4d13d338f	Fix a couple of compiler warnings This commit was SVN r24295.	2011-01-25 22:22:32 +00:00
Nysal Jan	72ba038309	Add workaround for a Libtool (<2.2.8) bug concerning IBM xlf compilers This commit was SVN r24294.	2011-01-25 09:53:34 +00:00
George Bosilca	09f645f9a9	There is no need for the byte variable. This commit was SVN r24293.	2011-01-24 22:41:04 +00:00
Jeff Squyres	30e164e246	Fix all the problems with "make distcheck" caused by the new ROMIO import so that we can finally get a trunk nightly tarball! This commit was SVN r24292.	2011-01-24 21:10:14 +00:00
Nathan Hjelm	2ca55d54f7	use AC_PROG_CC_C99 to find flags to turn on c99 support. remove if mtt fails because of this. This commit was SVN r24291.	2011-01-24 15:54:52 +00:00
Jeff Squyres	afa654746c	Somehow this has been sitting, uncommitted, in a local checkout since last December. :-( Add new MCA param: maffinity_libnuma_policy. Thanks to David Singleton for the suggestion. Here's the help text about it: {{{ MCA maffinity: parameter "maffinity_libnuma_policy" (current value: <loose>, data source: default value) Binding policy that determines what happens if memory is unavailable on the local NUMA node. A value of "strict" means that the memory allocation will fail; a value of "loose" means that the memory allocation will spill over to another NUMA node. }}} This commit was SVN r24290.	2011-01-24 14:39:16 +00:00
Jeff Squyres	272fe89252	Update svn:ignore This commit was SVN r24289.	2011-01-24 14:15:24 +00:00
Jeff Squyres	1ea62f3bf6	Add svn:ignore This commit was SVN r24288.	2011-01-24 14:15:07 +00:00

15477 commits
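
Note on r24322 (ec3d18dc9f): the commit message describes a command line being joined into one string and then re-split, which breaks arguments that contain spaces. The sketch below reproduces that effect in Python for illustration only. The function names are hypothetical and this is not the Open MPI code, which is written in C; it only assumes the behaviour stated in the commit message.

```python
# Illustration of the join-then-split problem described in r24322.
# Hypothetical helper names; not taken from the Open MPI source.

def rejoin_and_split(argv):
    """Old behaviour per the commit message: join argv into a single
    string, then split it again on whitespace."""
    return " ".join(argv).split()


def per_item(argv):
    """Fixed behaviour per the commit message: look at each argv item
    individually, so an item is never split into several tokens."""
    return [arg for arg in argv]


# The example command line from the commit message.
argv = ["mpirun", "--tv", "-np", "2", "a.out", "foo bar"]

broken = rejoin_and_split(argv)  # "foo bar" becomes two arguments
fixed = per_item(argv)           # "foo bar" stays one argument

print(broken)
print(fixed)
```

The first list ends with "foo" and "bar" as separate items, while the second keeps "foo bar" as a single argument, matching the symptom and the fix described in the message.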