openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	6213d23f0b	Merge pull request #5944 from rhc54/topic/psrvr Remove the stale orte-dvm code	2018-10-17 16:12:14 -07:00
Ralph Castain	1bd772e8eb	Remove the stale orte-dvm code Users should migrate to https://github.com/pmix/prrte Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-17 15:11:38 -07:00
Howard Pritchard	7730db9982	Merge pull request #5937 from hppritcha/topic/remove_crs_components remove some dead crs components	2018-10-17 16:06:36 -06:00
Jeff Squyres	c2608fb597	Merge pull request #5939 from jsquyres/pr/coverity-cid-1440332 opal/os_path: fix minor string overrun	2018-10-17 14:24:42 -07:00
Howard Pritchard	6564d3d217	remove some dead crs components Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-17 10:29:00 -06:00
Jeff Squyres	290d1c0534	opal/os_path: fix minor string overrun This is a follow on to 54ca3310ea35b7dc857afc59e00ffbb8f57e15ec to fix a minor bug identified by Coverity. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-17 08:57:30 -07:00
Howard Pritchard	a435bfe1cf	Merge pull request #5933 from hppritcha/topic/remove_bfo_pml remove the bfo pml	2018-10-17 09:39:58 -06:00
Nathan Hjelm	43547ade4c	Merge pull request #5663 from mkurnosov/coll-ireduce-rabenseifner coll/libnbc: add Rabenseifner's algorithm for MPI_Ireduce	2018-10-17 09:02:06 -06:00
Nathan Hjelm	979a199b4f	Merge pull request #5896 from mkurnosov/coll-iallgather-recursivedoubling coll/libnbc: add recursive doubling algorithm for MPI_Iallgather	2018-10-17 09:01:14 -06:00
Howard Pritchard	7d6774acf8	remove the bfo pml Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-17 06:50:11 -06:00
Brian Barrett	2acc4b7e7f	btl tcp: Add workaround for "dropped connection" issue Work around a race condition in the TCP BTL's proc setup code. The Cisco MTT results have been failing on TCP tests due to a "dropped connection" message some percentage of the time. Some digging shows that the issue happens in a combination of multiple NICs and multiple threads. The race is detailed in https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032. This patch doesn't fix the race, but avoids it by forcing the MPI layer to complete all calls to add_procs across the entire job before any process leaves MPI_INIT. It also reduces the scalability of the TCP BTL by increasing start-up time, but better than hanging. The long term fix is to do all endpoint setup in the first call to add_procs for a given remote proc, removing the race. THis patch is a work around until that patch can be developed. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-16 18:33:30 -07:00
Nathan Hjelm	37a3f32e47	Merge pull request #5913 from hjelmn/uct_btl_fixes btl/uct: fix deadlock in connection code	2018-10-16 19:11:54 -06:00
Nathan Hjelm	707d35deeb	btl/uct: fix deadlock in connection code This commit fixes a deadlock that can occur when using a TL that supports the connect to endpoint model. The deadlock was occurring while processing an incoming connection requests. This was done from an active-message callback. For some unknown reason (at this time) this callback was sometimes hanging. To avoid the issue the connection active-message is saved for later processing. At the same time I cleaned up the connection code to eliminate duplicate messages when possible. This commit also fixes some bugs in the active-message send path: - Correctly set all fragment fields in prepare_src. - Fix bug when using buffered-send. We were not reading the return code correctly (which is in bytes). This resulted in a message getting sent multiple times. - Don't try to progress sends from the btl_send function when in an active-message callback. It could lead to deep recursion and an eventual crash if we get a trace like send->progress->am_complete->ob1_callback->send->am_complete... Closes #5820 Closes #5821 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 18:28:47 -06:00
Nathan Hjelm	1c001319d4	Merge pull request #5922 from hjelmn/opal_free_list_really_old_race_we_should_really_fix_now opal/free_list: fix race condition	2018-10-16 15:27:05 -06:00
Nathan Hjelm	1ff3cfedb6	Merge pull request #5921 from devreal/ompi-rdma-preinit RDMA OSC: initialize segment memory before registering the segment	2018-10-16 15:10:02 -06:00
Brian Barrett	da1189d771	Merge pull request #5916 from bwbarrett/revert/6acebc4 Revert "Handle error cases in TCP BTL"	2018-10-16 13:54:18 -07:00
Joseph Schuchart	d9dcdfdfba	RDMA OSC: initialize segment memory before registering the segment Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2018-10-16 16:12:14 -04:00
Edgar Gabriel	069084e6ad	Merge pull request #5907 from edgargabriel/topic/testmpio-fixes Topic/testmpio fixes	2018-10-16 13:03:22 -07:00
Nathan Hjelm	5c770a7bec	opal/free_list: fix race condition There was a race condition in opal_free_list_get. Code throughout the Open MPI codebase was assuming that a NULL return from this function was due to an out-of-memory condition. In some cases this can lead to a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for example). Before this commit opal_free_list_get_mt looked like this: ```c static inline opal_free_list_item_t opal_free_list_get_mt (opal_free_list_t flist) { opal_free_list_item_t item = (opal_free_list_item_t) opal_lifo_pop_atomic (&flist->super); if (OPAL_UNLIKELY(NULL == item)) { opal_mutex_lock (&flist->fl_lock); opal_free_list_grow_st (flist, flist->fl_num_per_alloc); opal_mutex_unlock (&flist->fl_lock); item = (opal_free_list_item_t ) opal_lifo_pop_atomic (&flist->super); } return item; } ``` The problem is in a multithreaded environment is is* possible for the free list to be grown successfully but the thread calling opal_free_list_get_mt to be left without an item. The happens if between the calls to opal_lifo_push_atomic in opal_free_list_grow_st and the call to opal_lifo_pop_atomic other threads pop all the items added to the free list. This commit fixes the issue by ensuring the thread that successfully grew the free list always gets a free list item. Fixes #2921 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 13:17:09 -06:00
Edgar Gabriel	ba95588332	io/ompio: add verification for data representations. check for providing a data representation that is actually supported by ompio. Add also one check for a non-NULL pointer in mpi/c/file_set_view for the data representation. Also fixes parts of issue #5643 Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-10-16 12:45:33 -05:00
Ralph Castain	d1881519f9	Merge pull request #5919 from rhc54/topic/od2 Ensure SIGCHLD is unblocked	2018-10-16 06:28:00 -07:00
Ralph Castain	647a760b7e	Ensure SIGCHLD is unblocked Thanks to @hjelmn for debugging it and providing the patch Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)	2018-10-15 21:03:17 -07:00
Nathan Hjelm	9ea5dfa799	class/opal_fifo: fix warning Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-15 19:18:31 -06:00
Jeff Squyres	296c91a10b	Merge pull request #5918 from jsquyres/pr/strcpy-no-more Cleanup some string operations	2018-10-15 09:17:21 -05:00
Jeff Squyres	54ca3310ea	ompi: cleanup various string operations Several fixes to string handling: 1. strncpy() -> opal_string_copy() (because opal_string_copy() guarantees to NULL-terminate, and strncpy() does not) 2. Simplify a few places, such as: * Since opal_string_copy() guarantees to NULL terminate, eliminate some memsets(), etc. * Use opal_asprintf() to eliminate multi-step string creation There's more work that could be done; e.g., this commit doesn't attempt to clean up any strcpy() usage. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:10:20 -07:00
Jeff Squyres	a85bad37df	orte: strncpy() -> opal_string_copy() Fairly straightforward conversion of strncpy() calls to opal_string_copy(). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:10:20 -07:00
Jeff Squyres	cef6cf0ac5	opal: update some string handling Make various string handling locations a little more robust. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:04:28 -07:00
Jeff Squyres	e6de42c379	opal/util/info.c: fix compiler warning Remove unused variable. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:03:28 -07:00
Yossi Itigin	a5b1c9a91d	Merge pull request #5898 from yosefe/topic/pml-ucx-init-err-code pml_ucx: fix return code from mca_pml_ucx_init() error flow	2018-10-14 11:34:00 +03:00
Brian Barrett	5162011428	Revert "Handle error cases in TCP BTL" This reverts commit 6acebc40a194c92ab38a28553c2c8b04eb391820. This patch is causing numerous "Socket closed" messages which are causing most of the failures on Cisco's MTT run. See https://github.com/open-mpi/ompi/issues/5849 for more information. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-12 15:01:54 -07:00
Ralph Castain	ee6cb4b999	Merge pull request #5912 from bwbarrett/master-build pmix: Fix filename typo	2018-10-12 11:30:14 -07:00
Brian Barrett	8feff49f6c	pmix: Fix filename typo Looks like a filename was missed when pmix sucked in the installdirs framework. Fixing the typo fixes "make ctags" and "make cscope". Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-12 17:05:01 +00:00
Gilles Gouaillardet	0a09b0419e	Merge pull request #5812 from ggouaillardet/topic/mpi_sizeof_misc_additions fortran: add CHARACTER and LOGICAL support to MPI_Sizeof()	2018-10-12 14:08:27 +09:00
Edgar Gabriel	849d0452a0	io/ompio: execute barrier before sync this ensures that all processes are done modifying a file before syncing. Fixes an error in the testmpio testsuite. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-10-11 17:39:05 -05:00
Edgar Gabriel	bf058ca6b0	common/ompio: check datatypes when setting file view return MPI_ERR_ARG if the size of the fileview is not a multiple of the size of the etype provided. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-10-11 14:43:32 -05:00
Edgar Gabriel	05d25383c2	common/ompio: return correct error code for improper access return MPI_ERR_ACCESS if the user tries to read from a file that was opened using MPI_MODE_WRONLY return MPI_ERR_READ_ONLY if the user tries to write a file that was opened using MPI_MODE_RDONLY Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-10-11 14:41:58 -05:00
Edgar Gabriel	c0d7b578be	io/ompio: fix seek position calculation for SEEK_CUR This commit fixes the calculation of the position where to seek to, in case SEEK_CUR is used. Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-10-11 14:09:03 -05:00
Nathan Hjelm	6ed68da870	btl/uct: use the correct tl interface attributes It is apparently possible for different instances of the same UCT transport to have different limits (max short put for example). To account for this we need to store the attributes per TL context not per TL. This commit fixes the issue. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-11 11:33:17 -06:00
Ralph Castain	4c5588f84f	Merge pull request #5900 from rhc54/topic/dh Fix -H operations for multi-app case	2018-10-11 10:21:38 -07:00
Ralph Castain	f5a6b7f1e9	Fix -H operations for multi-app case Correctly aggregate slots across -H entries from each app. Take into account any -H entry when computing nprocs when no value was given. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-11 09:30:01 -07:00
Jeff Squyres	7686b8696a	Merge pull request #5897 from jsquyres/pr/fix-c99-comments-in-mpih mpi.h.in: remove C99-style comments	2018-10-11 11:53:39 -04:00
Yossi Itigin	b71e85b8d5	pml_ucx: fix return code from mca_pml_ucx_init() error flow Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-10-11 18:48:54 +03:00
Jeff Squyres	f4b3ccabf7	mpi.h.in: remove C99-style comments While we require C99 to build Open MPI, we do not require C99 to build user MPI applications. As such, we shouldn't have C99-style comments (i.e., "//"-style) in mpi.h.in. Thanks to @AdamSimpson for reporting the issue. This commit simply converts a //-style comment to a /**/-style comment. No code or logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-11 10:58:06 -04:00
Mikhail Kurnosov	a7386c1e09	coll/libnbc: add recursive doubling algorithm for MPI_Iallgather Implements recursive doubling algorithm for MPI_Iallgather. The algorithm can be used only for power-of-two number of processes. Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>	2018-10-11 21:43:13 +07:00
Jeff Squyres	80348b9f80	Merge pull request #5891 from ftab/pr/readme-update README: Remove MXM from MTL list	2018-10-11 10:21:54 -04:00
Dennis Field	6e4cd3f229	README: Remove MXM from MTL list The MXM MTL has been deleted from Open MPI; no need to include it in the README anymore. Signed-off-by: Dennis Field <fury@xibase.com>	2018-10-10 20:15:07 -04:00
Yossi Itigin	5f767e1a54	Merge pull request #5890 from yosefe/topic/osc-ucx-flush-worker-instead-of-world-barrier osc_ucx: add worker flush before osc module free	2018-10-10 23:00:57 +03:00
Yossi Itigin	b8e1af6fcb	osc_ucx: add worker flush before osc module free Make sure all pending communications are done on all ranks before closing the window. This way it will be safe to close the endpoints when closing the component. Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-10-10 20:47:16 +03:00
Yossi Itigin	bcc48515e4	Revert "osc_ucx: fix hang/timeout in component finalize" This reverts commit 438d13b4ca1e7333b789ca3fb536fda17b0feb38. Signed-off-by: Yossi Itigin <yosefe@mellanox.com>	2018-10-10 20:47:13 +03:00
Yossi Itigin	27d8c8e83c	Merge pull request #5878 from yosefe/topic/pml-ucx-fix-datatype-leak pml_ucx: add ompi datatype attribute to release ucp_datatype	2018-10-10 20:18:39 +03:00

1 2 3 4 5 ...

29279 Коммитов