openmpi

Author	SHA1	Message	Date
Gilles Gouaillardet	f446472f06	btl/uct: fix a typo in configure.m4 remove whitespace around '=' when setting btl_uct_LIBS Thanks Ake Sandgren for reporting this Refs. open-mpi/ompi#6173 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@b89deeb1bb)	2018-12-11 15:15:12 +09:00
Joshua Hursey	e1f75d5ff1	Add OPAL_VPID to unpacking * Needed to properly read PMIx job data like the following - `OPAL_PMIX_LOCALLDR` - `OPAL_PMIX_RANK` - `OPAL_PMIX_GLOBAL_RANK` - `OPAL_PMIX_APPLDR` - `OPAL_PMIX_APP_RANK` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com> (cherry picked from commit a557c4130c42a5a41aba5c08e606e7129d0bcb6d)	2018-11-21 11:48:58 -06:00
Howard Pritchard	9275b692b7	Merge pull request #6043 from hjelmn/v4.0.x_fix_this_damn_memory_barrier_bug_that_is_referenced_in_github_bug_6014_now_lets_get_this_release_out_the_door v4.0.x: opal/asm: work around possible gcc compiler bug	2018-11-08 08:01:33 -07:00
Nathan Hjelm	5efc76ef44	pmix3x: fix potential memory barrier bug with __atomic builtin atomics See open-mpi/ompi#6014 for more information. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-11-06 10:37:14 -07:00
Nathan Hjelm	e57c3fb3c9	opal/asm: work around possible gcc compiler bug It seems in some cases (gcc older than v6.0.0) the __atomic_thread_fence is a no-op with __ATOMIC_ACQUIRE. This appears to be the case with X86_64 so go ahead and use __ATOMIC_SEQ_CST for the x86_64 read memory barrier. This should not cause any performance issues as it is equivalent to the memory barrier in the hand-written atomics. References #6014 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 30119ee339eea086f43e3392352899187a4a73c7) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-11-06 10:28:13 -07:00
Howard Pritchard	bbfde1533b	btl/openib: fix a problem with ib query Under certain circumstances, ibv_exp_query_device was returning an error due to uninitialized fields in the extended attributes struct. Fixes: #5810 Fixes: #5914 Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit 8126779a354b3e0c720d3e1790f7b936dd5b93b2)	2018-11-02 10:42:17 -06:00
Sergey Oblomov	0846c9d112	COMMON/UCX: added error code to log output Also fixes a PGI compilation error with --enable-debug. Signed-off-by: Geoff Paulsen <gpaulsen@users.noreply.github.com> Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit 1099d5f02327329e0c58d9403e3e0a7f1e1d1920)	2018-10-30 09:55:25 -05:00
Gilles Gouaillardet	7f36dce348	event/external: fix version requirement Only default to the external component if its version is greater than or equal to the internal libevent (2.0.22) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit b2050392051ed3d9d842f105326f7ea2223aafc4)	2018-10-29 15:20:48 -06:00
Gilles Gouaillardet	40a1f0c214	event/external: misc configury fixes - Always use the external component when configure'd with --with-libevent=external - Fix the external libevent library version detection by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros - Use the event2/event.h header (event.h is deprecated since libevent 2.0) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit 35e77a286c369c1be66ee7ef9ad5ec2faef47edb)	2018-10-29 15:20:32 -06:00
Nathan Hjelm	270ae73611	btl/uct: update for UCT_CB_FLAG_SYNC removal Signed-off-by: Nathan Hjelm <hjelmn@me.com> (cherry picked from commit 1b37328ba8e8cbd30a5dcdae8332dc879aea48d7) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-23 10:18:43 -06:00
Howard Pritchard	e8f28a5506	Merge pull request #5882 from hjelmn/btl_uct_updates_for_the_v4_branch btl/uct: bug fixes and general improvements	2018-10-22 13:12:34 -06:00
Howard Pritchard	210b4c60aa	remove some dead crs components Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit 6564d3d217c3ebff24d0e1fd72929756dc498dfe)	2018-10-17 16:16:36 -06:00
Nathan Hjelm	e6f84e79de	btl/uct: fix deadlock in connection code This commit fixes a deadlock that can occur when using a TL that supports the connect to endpoint model. The deadlock was occurring while processing incoming connection requests. This was done from an active-message callback. For some unknown reason (at this time) this callback was sometimes hanging. To avoid the issue the connection active-message is saved for later processing. At the same time I cleaned up the connection code to eliminate duplicate messages when possible. This commit also fixes some bugs in the active-message send path: - Correctly set all fragment fields in prepare_src. - Fix bug when using buffered-send. We were not reading the return code correctly (which is in bytes). This resulted in a message getting sent multiple times. - Don't try to progress sends from the btl_send function when in an active-message callback. It could lead to deep recursion and an eventual crash if we get a trace like send->progress->am_complete->ob1_callback->send->am_complete... Closes #5820 Closes #5821 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 707d35deeb62a93ea8a3806d07e07e3a96c51d19) Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-16 19:16:11 -06:00
Nathan Hjelm	eaa98af52c	opal/free_list: fix race condition There was a race condition in opal_free_list_get. Code throughout the Open MPI codebase was assuming that a NULL return from this function was due to an out-of-memory condition. In some cases this can lead to a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for example). Before this commit opal_free_list_get_mt looked like this: ```c static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist) { opal_free_list_item_t *item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); if (OPAL_UNLIKELY(NULL == item)) { opal_mutex_lock (&flist->fl_lock); opal_free_list_grow_st (flist, flist->fl_num_per_alloc); opal_mutex_unlock (&flist->fl_lock); item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super); } return item; } ``` The problem is that in a multithreaded environment it is possible for the free list to be grown successfully but for the thread calling opal_free_list_get_mt to be left without an item. This happens if, between the calls to opal_lifo_push_atomic in opal_free_list_grow_st and the call to opal_lifo_pop_atomic, other threads pop all the items added to the free list. This commit fixes the issue by ensuring the thread that successfully grew the free list always gets a free list item. Fixes #2921 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 5c770a7becc496f63b9f9a59151206236416f4f4) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 15:28:20 -06:00
Nathan Hjelm	0c4ba45af2	btl/uct: use the correct tl interface attributes It is apparently possible for different instances of the same UCT transport to have different limits (max short put for example). To account for this we need to store the attributes per TL context not per TL. This commit fixes the issue. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 6ed68da870c391d88575dc027a3de4826a77f57e) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-11 11:34:33 -06:00
Geoff Paulsen	c8ff7e3ef2	Merge pull request #5874 from rhc54/cmr40/config v4.0.0: Fix configury for internal PMIx	2018-10-10 10:47:26 -05:00
Nathan Hjelm	1153082a0f	btl/uct: bug fixes and general improvements This commit updates the uct btl to change the transports parameter into a priority list. It adds the dc_mlx5, rc_mlx5, and ud transports to the priority list. This will give better out of the box performance for multi-threaded codes because the *_mlx5 transports can avoid the mlx5 lock inside libmlx5_rdmav2. This commit also fixes a number of leaks and a possible deadlock when using RDMA. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 39be6ec15c202d31423476f09e70199453d25adc) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-09 16:16:50 -06:00
Ralph Castain	376e2e4d98	Add missing file to tarball Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-09 06:46:09 -07:00
Ralph Castain	40270bd24b	Minor cleanups to the pmix/ext2x component Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-10-09 06:42:08 -07:00
Ralph Castain	12790e8ec6	Protect PMIx from bad configure entry Ignore with-hwloc=internal or external as those are meaningless to pmix (will upstream) Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit c498a7e77a377ddc3a7bcc26ea072627a33cb470)	2018-10-09 03:45:58 -07:00
Ralph Castain	3e2cc6f46a	Fail configure if pmix won't build If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution Signed-off-by: Ralph Castain <rhc@open-mpi.org> (cherry picked from commit f379ba9c8e5ce17641937c351ab46e4b4a82446c)	2018-10-09 03:45:37 -07:00
Nathan Hjelm	fba5eda436	btl/vader: fix race condition in writing header Signed-off-by: Nathan Hjelm <hjelmn@me.com> (cherry picked from commit 8291f6722d890efd15333bf7b26f0d07952fa41e) Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-08 08:49:01 -06:00
Nathan Hjelm	fa768d748f	btl/vader: work around Oracle compiler bug This commit works around an Oracle C compiler bug in 5.15 (not sure when it was introduced). The bug is triggered when we chain assignments of atomic variables. Ex: _Atomic intptr_t x, y; intptr_t z = 0; x = y = z; Will produce a compiler error of the form: operand cannot have void type: op "=" assignment type mismatch: long "=" void To work around the issue we are removing the chain assignment and setting the head and tail on different lines. Fixes #5814 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit dfa8d3a81ac64e32d2bfb13a9afb20b83d747e03)	2018-10-03 11:36:17 -04:00
Nathan Hjelm	df6dd69db8	btl/vader: ensure the fast box tag is always read first On some platforms reading a 64-bit value is non-atomic and it is possible that the two 32-bit values are read in the wrong order. To ensure the tag is always read first this commit reads the tag before reading the full 64-bit value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 66a7dc4c72cb25df67e7f872bee7a20b5fa9c763)	2018-10-03 11:36:17 -04:00
Geoff Paulsen	593d652077	Merge pull request #5823 from jsquyres/pr/v4.0.x/fix-tcp-btl-show-help-ip-address v4.0.x: btl/tcp: output the IP address correctly	2018-10-03 08:29:37 -05:00
George Bosilca	b63bee5da4	Small pedantic fixes. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit a3a492b42cd7b114b435fafa7cd46222dc565dd1)	2018-10-02 14:37:31 -04:00
George Bosilca	d450e460d6	Provide the correct socklen to bind. Get Brian's patch from #5825 and his log message: Fix a failure in binding the initiating side of a connection on MacOS. MacOS doesn't like passing the size of the storage structure (sockaddr_storage) instead of the expected size of the structure (sockaddr_in or sockaddr_in6), which was causing bind() failures. This patch simply changes the structure size to the expected size. Add a more clear error message in debug mode. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit 9164e26e2f323c43c9a671cb510bb4df03e45628)	2018-10-02 14:37:31 -04:00
Jeff Squyres	81f2f19398	btl/tcp: output the IP address correctly Per https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673, it looks like the IP address for a given interface is being stashed in two places: on the endpoint and on the module. 1. On the endpoint, it is storing the moral equivalent of a (struct sockaddr_in.sin_addr). 2. On the module, it is storing a full (struct sockaddr_storage). The call to opal_net_get_hostname() expects a full (struct sockaddr*) -- not just the stripped-down (struct sockaddr_in.sin_addr). Hence, when the original code was passing in the endpoint's (struct sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it like a (struct sockaddr), hilarity ensued (i.e., we got the wrong output). This commit eliminates the call to opal_net_get_hostname() and just calls inet_ntop() directly to convert the (struct sockaddr_in.sin_addr) to a string. NOTE: Per the github comment cited above, there can be a disparity between the IP address cached on the endpoint vs. the IP address cached on the module. This only happens with interfaces that have more than one IP address. This commit does not fix that issue. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 5dae086f7e4aee28fbb5a7282a2661286a5f68fe)	2018-10-02 10:49:51 -04:00
Sergey Oblomov	68d3baffd5	OPAL/COMMON/UCX: used __func__ macro instead of __FUNCTION__ - used __func__ macro instead of __FUNCTION__ to unify macro usage with other components Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit 9a51e257d162e724845024e3505880256194ebe2)	2018-09-27 12:04:07 +03:00
Jeff Squyres	8bdf4553d9	mca_base_var: fix output bug about settable vars Fix the test that determined whether we output "writeable" or "read-only" for MCA vars (it was checking the wrong flag). Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit 176da51aec0955a51f21157b33b21f60b6f28092)	2018-09-24 14:14:51 -07:00
Geoff Paulsen	3d4164e1e1	Merge pull request #5752 from gpaulsen/misc-warnings-fixes Miscellaneous compiler warning stomps.	2018-09-22 15:01:53 -05:00
Geoff Paulsen	556367af31	Merge pull request #5754 from gpaulsen/event_threading opal/progress: protect against multiple threads in event base	2018-09-22 15:00:32 -05:00
Mark Allen	5ac3fac6c2	snprintf() length fix for info The important part of this fix is a couple places 5 was hard-coded that needed to be strlen(OPAL_INFO_SAVE_PREFIX). But also this contains a fix for a gcc 7.3.0 compiler warning about snprintf(). There was an "if" statement making sure all the arguments had appropriate strlen(), but gcc still complained about the following snprintf() because the size of the struct element is iterator->ie_key[OPAL_MAX_INFO_KEY + 1]. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2018-09-21 14:47:11 -05:00
Nathan Hjelm	cd88e307fd	opal/progress: protect against multiple threads in event base libevent does not support multiple threads calling the event loop on the same event base. This causes external libevents to print out re-entrant warning messages. This commit fixes the issue by protecting the call to the event loop with an atomic swap check. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-21 14:40:08 -05:00
Jeff Squyres	2e37f97a38	Miscellaneous compiler warning stomps. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> (cherry picked from commit fe0852bcb4d14a6aaf5a3e1021f60b5be32dd42d)	2018-09-21 14:35:51 -05:00
Ralph Castain	192f0f6fff	Remove the OFI/BTL component Remove this component pending re-architecture of the overall OFI components. We have had similar issues before when multiple components use the same library - typical issues are race conditions, initialize and finalize errors, etc. We are seeing similar problems here as we get broader exposure to different library version and environment combinations. The correct fix in the past has been to centralize the library interactions in a "common" component. We will pursue that here by moving some additional functions (e.g., endpoint creation) into the existing opal/mca/common/ofi component. We can't do that and thoroughly test it in time for the v4.0.0 release, so we'll simply remove this component from the release. Once we have things correctly fixed, we'll submit a PR to restore the component plus the related fixes to some future v4.x release. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-20 18:09:15 -07:00
Geoff Paulsen	4688da0631	Merge pull request #5736 from hoopoepg/topic/topic/common-del-procs-v4.0 MCA/COMMON/UCX: del_procs calls are unified to common module - v4.0	2018-09-20 18:12:25 -05:00
Geoff Paulsen	8dbfb9b032	Merge pull request #5745 from bwbarrett/v4.0.x-cuda-async openib: Disable CUDA async by default	2018-09-20 18:09:32 -05:00
Geoff Paulsen	4d9deb33ec	Merge pull request #5688 from rhc54/cmr40/pmix302 v4.0.x: Update to PMIx v3.0.2	2018-09-20 16:26:06 -05:00
Brian Barrett	a4fabb7e26	openib: Disable CUDA async by default Disable async receive for CUDA under OpenIB. While a performance optimization, it also causes incorrect results for transfers larger than the GPUDirect RDMA limit. This change has been validated and approved by Akshay. References #3972 Signed-off-by: Brian Barrett <bbarrett@amazon.com> (cherry picked from commit 9344afd485d23c41a18ab57d6df8550826c24b9e) Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-09-20 14:20:54 -07:00
Howard Pritchard	27732bdf33	Merge pull request #5738 from hppritcha/topic/remove_scif_support_4.0.x SCIF: remove it	2018-09-19 12:43:26 -06:00
Howard Pritchard	730f98d8b0	Merge pull request #5665 from hjelmn/v4.0.x_cache_flush v4.0.x: patcher/base: improve instruction cache flush for aarch64	2018-09-19 12:12:57 -06:00
Howard Pritchard	01d4d52588	SCIF: remove it KNC is effectively dead. Remove corresponding SCIF support in Open MPI. cherry pick of PR #5737 + news update Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit b9ac3d8931c38661132544d59be085a50df01420)	2018-09-19 11:48:17 -06:00
Sergey Oblomov	3cace87749	MCA/COMMON/UCX: del_procs calls are unified to common module Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com> (cherry picked from commit 920cc2e0d9994dfd49062822c89cb502274eb464)	2018-09-19 10:47:27 +03:00
Ralph Castain	131ea01320	Update to PMIx v3.0.2 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2018-09-18 13:03:18 -07:00
Howard Pritchard	06c62c6b30	Merge pull request #5685 from selvintxavier/brcm_hcas_v4.0.x v4.0.x: Add support for different Broadcom HCAs	2018-09-17 16:29:45 -06:00
Selvin Xavier	9114a9ac95	v4.0.x: Add support for different Broadcom HCAs Adds device ids of different Broadcom adapters from BCM57XXX and BCM58XXX family of HCAs. Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> (cherry-picked from a53a6f7650035d1644018fd09647c5e4db6b0694)	2018-09-16 22:54:07 -07:00
Nathan Hjelm	83668f4b47	btl/vader: ensure that the send tag is always written last To ensure fast box entries are complete when processed by the receiving process the tag must be written last. This includes a zero header for the next fast box entry (in some cases). This commit fixes two instances where the tag was written too early. In one case, on 32-bit systems it is possible for the tag part of the header to be written before the size. The second instance is an ordering issue. The zero header was being written after the fastbox header. Fixes #5375, #5638 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 850fbff441756b2f9cde1007ead3e37ce22599c2) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-14 12:34:26 -06:00
Nathan Hjelm	1cc295e4c4	patcher/base: improve instruction cache flush for aarch64 This commit updates the patcher component to either use the __clear_cache intrinsic or the correct assembly to flush the instruction cache. Fixes #5631 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 1cdbceb0951c30a04b730b78f31d07543d8c3d2a) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-10 15:44:50 -06:00
Geoff Paulsen	17aab5ea5b	Merge pull request #5659 from ggouaillardet/topic/v4.0.x/misc_finalize_leaks Plug misc leaks on MPI_Finalize()	2018-09-10 14:06:31 -05:00
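
Commit eaa98af52c in the log above describes a race in which a thread grows the free list and still ends up without an item. The following is a minimal sketch of the idea behind the fix, not the actual Open MPI code: every type and function name here (`free_list_t`, `item_t`, `free_list_grow_and_take`, and so on) is hypothetical, and a plain mutex stands in for the atomic LIFO operations. The thread that grows the list keeps the first new item for itself instead of pushing it and popping again.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the free list and its items. */
typedef struct item { struct item *next; } item_t;

typedef struct {
    item_t *head;              /* LIFO of free items */
    pthread_mutex_t lifo_lock; /* stands in for the atomic LIFO push/pop */
    pthread_mutex_t grow_lock; /* serializes growth, like fl_lock */
    size_t num_per_alloc;      /* items created per growth step */
} free_list_t;

static void free_list_init(free_list_t *fl, size_t num_per_alloc)
{
    assert(num_per_alloc > 0);
    fl->head = NULL;
    fl->num_per_alloc = num_per_alloc;
    pthread_mutex_init(&fl->lifo_lock, NULL);
    pthread_mutex_init(&fl->grow_lock, NULL);
}

static void lifo_push(free_list_t *fl, item_t *it)
{
    pthread_mutex_lock(&fl->lifo_lock);
    it->next = fl->head;
    fl->head = it;
    pthread_mutex_unlock(&fl->lifo_lock);
}

static item_t *lifo_pop(free_list_t *fl)
{
    pthread_mutex_lock(&fl->lifo_lock);
    item_t *it = fl->head;
    if (NULL != it) {
        fl->head = it->next;
    }
    pthread_mutex_unlock(&fl->lifo_lock);
    return it;
}

/* Grow the list. The first new item is never pushed: it is reserved for
 * the caller, so other threads cannot pop it in the meantime. */
static item_t *free_list_grow_and_take(free_list_t *fl)
{
    item_t *reserved = NULL;
    for (size_t i = 0; i < fl->num_per_alloc; ++i) {
        item_t *it = malloc(sizeof(*it));
        if (NULL == it) {
            break; /* genuine out-of-memory condition */
        }
        it->next = NULL;
        if (NULL == reserved) {
            reserved = it;
        } else {
            lifo_push(fl, it);
        }
    }
    return reserved;
}

static item_t *free_list_get(free_list_t *fl)
{
    item_t *it = lifo_pop(fl);
    if (NULL == it) {
        pthread_mutex_lock(&fl->grow_lock);
        it = free_list_grow_and_take(fl);
        pthread_mutex_unlock(&fl->grow_lock);
    }
    return it;
}
```

With this shape, a NULL return from `free_list_get` can only mean that allocation failed, which matches the assumption the commit message says callers were already making.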


5309 Commits