openmpi

Author	SHA1	Message	Date
Nathan Hjelm	049bfb6c3c	Merge pull request #5971 from devreal/ompi-rdma-preinit-fixbarrier Fix regression in RDMA shared memory registration	2018-10-25 09:45:36 -06:00
Joseph Schuchart	a193ae26bf	Fix regression introduced earlier by re-adding a barrier after shared memory has been registered Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2018-10-24 15:54:19 -04:00
Ralph Castain	c186004e5e	Merge pull request #5968 from ICLDisco/export/overspawn Correctly propagate the oversubscribe flag to the spawnees	2018-10-24 09:21:49 -07:00
Aurélien Bouteiller	2820aef551	Correctly propagate the oversubscribe flag to the spawnees Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>	2018-10-23 23:02:36 -04:00
Yossi Itigin	4fdf57a6ee	Merge pull request #5955 from amaslenn/mlnx-no-uct platform/mellanox: disable btl-uct by default	2018-10-23 08:21:02 +03:00
Nathan Hjelm	7b34c9b3fe	Merge pull request #5958 from hjelmn/btl_uct_sync btl/uct: update for UCT_CB_FLAG_SYNC removal	2018-10-22 23:17:27 -06:00
Andrey Maslennikov	074e9cc92c	platform/mellanox: disable btl-uct by default Signed-off-by: Andrey Maslennikov <andreyma@mellanox.com>	2018-10-22 12:23:40 +03:00
Yossi Itigin	4c442f2601	Merge pull request #5934 from hoopoepg/topic/suppressed-cov-warn-added-log-msg COMMON/UCX: suppressed coverity warnings	2018-10-22 11:00:47 +03:00
Nathan Hjelm	1b37328ba8	btl/uct: update for UCT_CB_FLAG_SYNC removal Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-21 18:57:42 -06:00
bosilca	690024917d	Merge pull request #5938 from bwbarrett/bugfix/tcp-multi-address Always use same IP address for module and modex	2018-10-21 09:19:43 -07:00
Sergey Oblomov	1099d5f023	COMMON/UCX: added error code to log output Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-10-21 11:37:25 +03:00
Nathan Hjelm	a66373454e	Merge pull request #5943 from bosilca/fix/libnbc_warnings Remove few warnings in libnbc identified by clang-1000.11.45.2	2018-10-20 21:24:30 -06:00
bosilca	c3abedbd2c	Merge pull request #5759 from bosilca/fix/monitoring Fix/monitoring	2018-10-19 07:18:41 -07:00
Nathan Hjelm	8db5aaa33a	Merge pull request #5952 from hjelmn/romio_warnings romio/romio321: silence some compiler warnings	2018-10-18 18:37:23 -06:00
Nathan Hjelm	dbae9c0958	romio/romio321: silence some compiler warnings Some compilers complain when comparing signed and unsigned. romio321 was doing just this. The check is meant to check whether a size (which is an ADIO_Offset-- a signed number) will work with memcpy which takes a size_t. To silence the warning I added a new type (ADIO_Size) which is an unsigned type and cast the ADIO_Offset to this new type. Fixes #5951 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-18 13:36:51 -06:00
George Bosilca	dc972f0b92	Fix the PML monitoring. The monitoring PML hides its existence from the OMPI infrastructure by removing itself from the list of PML loaded components, remaining hidden until MPI_Finalize. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-18 00:29:23 -04:00
George Bosilca	668aa15dda	Early selection of the best PML. With this patch the best PML is selected earlier, before finalizing the other PMLs. This provides a simpler mechanism to intercept and hijack the PML (as done in the monitoring PML). Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-18 00:29:23 -04:00
Brian Barrett	457f058e73	btl tcp: Simplify modex address selection Simplify selection of the address to publish for a given BTL TCP module in the module exchange code. Rather than looping through all IP addresses associated with a node, looking for one that matches the kindex of a module, loop over the modules and use the address stored in the module structure. This also happens to be the address that the source will use to bind() in a connect() call, so this should eliminate any confusion (read: bugs) when an interface has multiple IPs associated with it. Refs #5818 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-18 02:23:21 +00:00
Ralph Castain	6213d23f0b	Merge pull request #5944 from rhc54/topic/psrvr Remove the stale orte-dvm code	2018-10-17 16:12:14 -07:00
Ralph Castain	1bd772e8eb	Remove the stale orte-dvm code Users should migrate to https://github.com/pmix/prrte Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-17 15:11:38 -07:00
Howard Pritchard	7730db9982	Merge pull request #5937 from hppritcha/topic/remove_crs_components remove some dead crs components	2018-10-17 16:06:36 -06:00
George Bosilca	66182a294d	Remove few warnings in libnbc identified by clang-1000.11.45.2 Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-17 18:04:39 -04:00
Jeff Squyres	c2608fb597	Merge pull request #5939 from jsquyres/pr/coverity-cid-1440332 opal/os_path: fix minor string overrun	2018-10-17 14:24:42 -07:00
Howard Pritchard	6564d3d217	remove some dead crs components Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-17 10:29:00 -06:00
Brian Barrett	4f19221af2	btl tcp: Simplify module address storage Today, a btl tcp module is associated with exactly one IP address (IPv4 or IPv6). There's no need to reserve space for both an IPv4 and IPv6 address in the module structure, since the module will only be associated with one or the other. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-17 16:21:17 +00:00
Jeff Squyres	290d1c0534	opal/os_path: fix minor string overrun This is a follow on to `54ca3310ea` to fix a minor bug identified by Coverity. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-17 08:57:30 -07:00
Howard Pritchard	a435bfe1cf	Merge pull request #5933 from hppritcha/topic/remove_bfo_pml remove the bfo pml	2018-10-17 09:39:58 -06:00
Nathan Hjelm	43547ade4c	Merge pull request #5663 from mkurnosov/coll-ireduce-rabenseifner coll/libnbc: add Rabenseifner's algorithm for MPI_Ireduce	2018-10-17 09:02:06 -06:00
Nathan Hjelm	979a199b4f	Merge pull request #5896 from mkurnosov/coll-iallgather-recursivedoubling coll/libnbc: add recursive doubling algorithm for MPI_Iallgather	2018-10-17 09:01:14 -06:00
Sergey Oblomov	df765595e3	COMMON/UCX: suppressed coverity warnings - suppressed coverity warnings - added log messages on failed calls Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2018-10-17 16:11:03 +03:00
Howard Pritchard	7d6774acf8	remove the bfo pml Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-17 06:50:11 -06:00
Brian Barrett	2acc4b7e7f	btl tcp: Add workaround for "dropped connection" issue Work around a race condition in the TCP BTL's proc setup code. The Cisco MTT results have been failing on TCP tests due to a "dropped connection" message some percentage of the time. Some digging shows that the issue happens in a combination of multiple NICs and multiple threads. The race is detailed in https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032. This patch doesn't fix the race, but avoids it by forcing the MPI layer to complete all calls to add_procs across the entire job before any process leaves MPI_INIT. It also reduces the scalability of the TCP BTL by increasing start-up time, but that is better than hanging. The long term fix is to do all endpoint setup in the first call to add_procs for a given remote proc, removing the race. This patch is a workaround until that patch can be developed. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-16 18:33:30 -07:00
Nathan Hjelm	37a3f32e47	Merge pull request #5913 from hjelmn/uct_btl_fixes btl/uct: fix deadlock in connection code	2018-10-16 19:11:54 -06:00
Nathan Hjelm	707d35deeb	btl/uct: fix deadlock in connection code This commit fixes a deadlock that can occur when using a TL that supports the connect to endpoint model. The deadlock was occurring while processing incoming connection requests. This was done from an active-message callback. For some unknown reason (at this time) this callback was sometimes hanging. To avoid the issue the connection active-message is saved for later processing. At the same time I cleaned up the connection code to eliminate duplicate messages when possible. This commit also fixes some bugs in the active-message send path: - Correctly set all fragment fields in prepare_src. - Fix bug when using buffered-send. We were not reading the return code correctly (which is in bytes). This resulted in a message getting sent multiple times. - Don't try to progress sends from the btl_send function when in an active-message callback. It could lead to deep recursion and an eventual crash if we get a trace like send->progress->am_complete->ob1_callback->send->am_complete... Closes #5820 Closes #5821 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 18:28:47 -06:00
Nathan Hjelm	1c001319d4	Merge pull request #5922 from hjelmn/opal_free_list_really_old_race_we_should_really_fix_now opal/free_list: fix race condition	2018-10-16 15:27:05 -06:00
Nathan Hjelm	1ff3cfedb6	Merge pull request #5921 from devreal/ompi-rdma-preinit RDMA OSC: initialize segment memory before registering the segment	2018-10-16 15:10:02 -06:00
Brian Barrett	da1189d771	Merge pull request #5916 from bwbarrett/revert/6acebc4 Revert "Handle error cases in TCP BTL"	2018-10-16 13:54:18 -07:00
Joseph Schuchart	d9dcdfdfba	RDMA OSC: initialize segment memory before registering the segment Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2018-10-16 16:12:14 -04:00
Edgar Gabriel	069084e6ad	Merge pull request #5907 from edgargabriel/topic/testmpio-fixes Topic/testmpio fixes	2018-10-16 13:03:22 -07:00
Nathan Hjelm	5c770a7bec	opal/free_list: fix race condition There was a race condition in opal_free_list_get. Code throughout the Open MPI codebase was assuming that a NULL return from this function was due to an out-of-memory condition. In some cases this can lead to a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for example). Before this commit opal_free_list_get_mt looked like this: ```c static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist) { opal_free_list_item_t *item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); if (OPAL_UNLIKELY(NULL == item)) { opal_mutex_lock (&flist->fl_lock); opal_free_list_grow_st (flist, flist->fl_num_per_alloc); opal_mutex_unlock (&flist->fl_lock); item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); } return item; } ``` The problem is that in a multithreaded environment it is possible for the free list to be grown successfully but the thread calling opal_free_list_get_mt to be left without an item. This happens if between the calls to opal_lifo_push_atomic in opal_free_list_grow_st and the call to opal_lifo_pop_atomic other threads pop all the items added to the free list. This commit fixes the issue by ensuring the thread that successfully grew the free list always gets a free list item. Fixes #2921 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 13:17:09 -06:00
Edgar Gabriel	ba95588332	io/ompio: add verification for data representations. check for providing a data representation that is actually supported by ompio. Add also one check for a non-NULL pointer in mpi/c/file_set_view for the data representation. Also fixes parts of issue #5643 Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2018-10-16 12:45:33 -05:00
Ralph Castain	d1881519f9	Merge pull request #5919 from rhc54/topic/od2 Ensure SIGCHLD is unblocked	2018-10-16 06:28:00 -07:00
Ralph Castain	647a760b7e	Ensure SIGCHLD is unblocked Thanks to @hjelmn for debugging it and providing the patch Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)	2018-10-15 21:03:17 -07:00
Nathan Hjelm	9ea5dfa799	class/opal_fifo: fix warning Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-15 19:18:31 -06:00
Jeff Squyres	296c91a10b	Merge pull request #5918 from jsquyres/pr/strcpy-no-more Cleanup some string operations	2018-10-15 09:17:21 -05:00
Jeff Squyres	54ca3310ea	ompi: cleanup various string operations Several fixes to string handling: 1. strncpy() -> opal_string_copy() (because opal_string_copy() guarantees to NULL-terminate, and strncpy() does not) 2. Simplify a few places, such as: * Since opal_string_copy() guarantees to NULL terminate, eliminate some memsets(), etc. * Use opal_asprintf() to eliminate multi-step string creation There's more work that could be done; e.g., this commit doesn't attempt to clean up any strcpy() usage. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:10:20 -07:00
Jeff Squyres	a85bad37df	orte: strncpy() -> opal_string_copy() Fairly straightforward conversion of strncpy() calls to opal_string_copy(). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:10:20 -07:00
Jeff Squyres	cef6cf0ac5	opal: update some string handling Make various string handling locations a little more robust. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:04:28 -07:00
Jeff Squyres	e6de42c379	opal/util/info.c: fix compiler warning Remove unused variable. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:03:28 -07:00
Yossi Itigin	a5b1c9a91d	Merge pull request #5898 from yosefe/topic/pml-ucx-init-err-code pml_ucx: fix return code from mca_pml_ucx_init() error flow	2018-10-14 11:34:00 +03:00
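
The race described in commit 5c770a7bec (opal/free_list) can be illustrated with a minimal sketch. This is not the Open MPI implementation: every name prefixed `sketch_` is hypothetical, and a mutex-protected stack stands in for the atomic LIFO. The only point it shows is the idea stated in the commit message, that the thread which grows the list keeps one of the new items for itself instead of popping again after releasing the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for illustration only; not Open MPI's structures. */
typedef struct sketch_item {
    struct sketch_item *next;
} sketch_item_t;

typedef struct {
    sketch_item_t  *head;
    pthread_mutex_t stack_lock;   /* stands in for the atomic LIFO */
    pthread_mutex_t grow_lock;    /* serializes growth of the list */
    size_t          num_per_alloc;
} sketch_free_list_t;

static void sketch_push(sketch_free_list_t *fl, sketch_item_t *item)
{
    pthread_mutex_lock(&fl->stack_lock);
    item->next = fl->head;
    fl->head = item;
    pthread_mutex_unlock(&fl->stack_lock);
}

static sketch_item_t *sketch_pop(sketch_free_list_t *fl)
{
    sketch_item_t *item;

    pthread_mutex_lock(&fl->stack_lock);
    item = fl->head;
    if (NULL != item) {
        fl->head = item->next;
    }
    pthread_mutex_unlock(&fl->stack_lock);
    return item;
}

/* Returns NULL only if the list is empty AND allocation failed. */
static sketch_item_t *sketch_get(sketch_free_list_t *fl)
{
    sketch_item_t *item = sketch_pop(fl);

    assert(fl->num_per_alloc > 0);
    if (NULL == item) {
        pthread_mutex_lock(&fl->grow_lock);
        for (size_t i = 0; i < fl->num_per_alloc; ++i) {
            sketch_item_t *fresh = malloc(sizeof(*fresh));
            if (NULL == fresh) {
                break;
            }
            if (NULL == item) {
                /* Reserve the first new item for the growing thread so
                 * other threads cannot drain the list before it pops. */
                fresh->next = NULL;
                item = fresh;
            } else {
                sketch_push(fl, fresh);
            }
        }
        pthread_mutex_unlock(&fl->grow_lock);
    }
    return item;
}
```

With this shape, a NULL return from `sketch_get` means allocation actually failed, which is the assumption the commit message says callers were already making.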


29300 Commits
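
Commit 54ca3310ea notes that `strncpy()` does not guarantee a terminated result while `opal_string_copy()` does. The sketch below shows what such a guarantee can look like. `copy_terminated` is a hypothetical helper written for this illustration; it is not the source of `opal_string_copy()` and its signature is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not Open MPI's opal_string_copy()): copies at most
 * dest_len - 1 bytes from src and always writes a terminating '\0'. */
static void copy_terminated(char *dest, const char *src, size_t dest_len)
{
    size_t i;

    assert(NULL != dest && NULL != src);
    if (0 == dest_len) {
        return;                 /* no room even for the terminator */
    }
    for (i = 0; i + 1 < dest_len && '\0' != src[i]; ++i) {
        dest[i] = src[i];
    }
    dest[i] = '\0';             /* strncpy() omits this when src is too long */
}
```

Because the destination is always terminated, callers do not need to `memset()` the buffer beforehand, which is the kind of simplification the commit message describes.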