openmpi

Author	SHA1	Message	Date
Gilles Gouaillardet	174e967dbc	Remove ORTE project Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build without reference to ORTE. Setup opal/pmix framework to be static. Remove support for all PMI-1 and PMI-2 libraries. Add support for "external" pmix component as well as internal v4 one. remove orte: misc fixes - UCX fixes - VPATH issue - oshmem fixes - remove useless definition - Add PRRTE submodule - Get autogen.pl to traverse PRRTE submodule - Remove stale orcm reference - Configure embedded PRRTE - Correctly pass the prefix to PRRTE - Correctly set the OMPI_WANT_PRRTE am_conditional - Move prrte configuration to the end of OMPI's configure.ac - Make mpirun a symlink to prun, when available - Fix makedist with --no-orte/--no-prrte option - Add a `--no-prrte` option which is the same as the legacy `--no-orte` option. - Remove embedded PMIx tarball. Replace it with new submodule pointing to OpenPMIx master repo's master branch - Some cleanup in PRRTE integration and add config summary entry - Correctly set the hostname - Fix locality - Fix singleton operations - Fix support for "tune" and "am" options Signed-off-by: Ralph Castain <rhc@pmix.org> Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2020-02-07 18:20:06 -08:00
Brice Goglin	329d4451a6	opal/hwloc: remove some unused variables when building with hwloc < 1.7 Refs #7362 Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2020-02-04 22:56:46 +01:00
Brice Goglin	907ad854b4	hwloc/base: fix opal proc locality wrt to NUMA nodes on hwloc 1.11 Build was broken by mistake in commit d40662edc41a5a4d09ae690b640cfdeeb24e15a1 Fixes #7362 Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2020-02-04 22:56:46 +01:00
Austen Lauria	824dbcbcf3	Protect use of _Static_assert(). Signed-off-by: Austen Lauria <awlauria@us.ibm.com>	2020-02-04 13:46:58 -05:00
Howard Pritchard	d2b68e6ecd	Merge pull request #7201 from bgoglin/master hwloc/base: fix opal proc locality wrt to NUMA nodes on hwloc 2.0	2020-02-03 11:23:42 -07:00
George Bosilca	72501f8f9c	Consistent return from all progress functions. This fix ensures that all progress functions return the number of completed events. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2020-01-28 20:16:53 +01:00
Joseph Schuchart	2c97187ee0	Harmonize return values of progress callbacks Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2020-01-28 20:15:03 +01:00
Austen Lauria	1275766037	Merge pull request #7207 from devreal/remove_shmem_seg_hdr Remove unused opal_shmem_seg_hdr_t to retain alignment	2020-01-28 13:57:55 -05:00
Aurelien Bouteiller	9f4365fef6	Merge pull request #6783 from abouteiller/export/macos-epipe Prevent EPIPE on OSX.	2020-01-28 11:18:46 -05:00
Brian Barrett	d768d82231	Merge pull request #7167 from wckzhang/reachable_netlinks Reachable documentation change	2020-01-27 15:39:14 -08:00
Brian Barrett	fc8c7a5869	Merge pull request #7134 from wckzhang/btl_tcp_interface_match btl tcp: Use reachability and graph solving for global interface matching	2020-01-27 15:38:49 -08:00
Austen Lauria	10f6a77640	Merge pull request #7315 from abouteiller/export/tcp_errors_v2 Handle error cases in TCP BTL (v2)	2020-01-27 17:03:07 -05:00
Aurelien Bouteiller	76021e35ee	Adding a description of the FIN message for future reference. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	93846fd0ee	Remove the pending event when socket is TCP_FAILED Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	6b3be224d4	Adding a FIN message to differentiate normal TCP closing from failures Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:32:34 -05:00
Aurelien Bouteiller	b7be64482a	Revert "Revert "Handle error cases in TCP BTL"" This reverts commit `5162011428`. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2020-01-27 13:31:06 -05:00
William Zhang	e958f3cf22	btl tcp: Use reachability and graph solving for global interface matching Previously we used a fairly simple algorithm in mca_btl_tcp_proc_insert() to pair local and remote modules. This was a point in time solution rather than a global optimization problem (where global means all modules between two peers). The selection logic would often fail due to pairing interfaces that are not routable for traffic. The complexity of the selection logic was Θ(n^n), which was expensive. Due to poor scalability, this logic was only used when the number of interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More details can be found in this ticket: https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates in the ticket do not match what I calculated from the function) As a fallback, when interfaces surpassed this threshold, a brute force O(n^2) double for loop was used to match interfaces. This commit solves two problems. First, the point-in-time solution is turned into a global optimization solution. Second, the reachability framework was used to create a more realistic reachability map. We switched from using IP/netmask to using the reachability framework, which supports route lookup. This will help many corner cases as well as utilize any future development of the reachability framework. The solution implemented in this commit has a complexity mainly derived from the bipartite assignment solver. If the local and remote peer both have the same number of interfaces (n), the complexity of matching will be O(n^5). With the decrease in complexity to O(n^5), I calculated and tested that initialization costs would be 5000 microseconds with 30 interfaces per node (Likely close to the maximum realistic number of interfaces we will encounter). For additional datapoints, data up to 300 (a very unrealistic number) of interfaces was simulated. Up until 150 interfaces, the matching costs will be less than 1 second, climbing to 10 seconds with 300 interfaces. Reflecting on these results, I removed the suboptimal O(n^2) fallback logic, as it no longer seems necessary. Data was gathered comparing the scaling of initialization costs with ranks. For low number of interfaces, the impact of initialization is negligible. At an interface count of 7-8, the new code has slightly faster initialization costs. At an interface count of 15, the new code has slower initialization costs. However, all initialization costs scale linearly with the number of ranks. In order to use the reachable function, we populate local and remote lists of interfaces. We then convert the interface matching problem into a graph problem. We create a bipartite graph with the local and remote interfaces as vertices and use negative reachability weights as costs. Using the bipartite assignment solver, we generate the matches for the graph. To ensure that both the local and remote process have the same output, we ensure we mirror their respective inputs for the graphs. Finally, we store the endpoint matches that we created earlier in a hash table. This is stored with the btl_index as the key and a struct mca_btl_tcp_addr_t* as the value. This is then retrieved during insertion time to set the endpoint address. Signed-off-by: William Zhang <wilzhang@amazon.com>	2020-01-21 18:24:08 +00:00
Jeff Squyres	ed753afbc0	hwloc2: advance hwloc git submodule Advance to hwloc-2.1.0rc2-33-g38433c0f, which includes a .gitignore update that we want here in Open MPI. Be warned; this is actually 33 commits beyond the hwloc v2.1.0 tag. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-01-21 09:42:39 -08:00
Nathan Hjelm	037b0bd9ee	Merge pull request #7304 from hjelmn/btl_vader_fix_max_address_on_aarch64 btl/vader: modify how the max attachment address is determined	2020-01-15 17:02:33 -07:00
Nathan Hjelm	728d51f9f3	btl/vader: modify how the max attachment address is determined This PR removes the constant defining the max attachment address and replaces it with the largest address that shows up in /proc/self/maps. This should address issues found on AARCH64 where the max address may differ based on the configuration. Since the calculated max address may differ between processes the max address is sent as part of the modex and stored in the endpoint data. Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2020-01-14 15:15:36 -08:00
Nathan Hjelm	61f96b3d6d	Merge pull request #7283 from hjelmn/fix_issues_in_both_vader_and_opal_interval_tree_t_that_were_causing_issue_6524 Fix issues in both vader and opal interval tree t that were causing issue 6524	2020-01-14 14:54:51 -07:00
Jeff Squyres	25931ea8bf	Merge pull request #7200 from cpshereda/master-opal_gethostname-change Fix unsafe use of gethostname()	2020-01-13 16:07:43 -05:00
Charles Shereda	cbc6feaab2	Created opal_gethostname() as safer gethostname substitute. The opal_gethostname() function provides a more robust mechanism to retrieve the hostname than gethostname(), which can return results that are not null-terminated, and which can vary in its behavior from system to system. opal_gethostname() just returns the value in opal_process_info.nodename; this is populated in opal_init_gethostname() inside opal_init.c. -Changed all gethostname calls in opal subtree to opal_gethostname -Changed all gethostname calls in orte subtree to opal_gethostname -Changed all gethostname calls in ompi subdir to opal_gethostname -Changed all gethostname calls in oshmem subdir to opal_gethostname -Changed opal_if.c in test subdir to use opal_gethostname -Changed opal_init.c to include opal_init_gethostname. This function returns an int and directly sets opal_process_info.nodename per jsquyres' modifications. Relates to open-mpi#6801 Signed-off-by: Charles Shereda <cpshereda@lanl.gov>	2020-01-13 08:52:17 -08:00
Jeff Squyres	21bc9042e1	mtl/ofi: check for FI_LOCAL_COMM+FI_REMOTE_COMM Make sure to get an RDM provider that can provide both local and remote communication. We need this check because some providers could be selected via RXD or RXM, but can't provide local communication, for example. Add OPAL_CHECK_OFI_VERSION_GE() m4 macro to check that the Libfabric we're building against is >= a target version. Use this check in two places: 1. MTL/OFI: Make sure it is >= v1.5, because the FI_LOCAL_COMM / FI_REMOTE_COMM constants were introduced in Libfabric API v1.5. 2. BTL/usnic: It already had similar configury to check for Libfabric >= v1.1, but the usnic component was checking for >= v1.3. So update the btl/usnic configury to use the new macro and check for >= v1.3. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2020-01-13 08:19:53 -08:00
Brian Barrett	f853971bc1	Merge pull request #6821 from jsquyres/pr/make-hwloc201-tarball-a-submodule hwloc v2.1.0: use a git submodule	2020-01-08 07:41:37 -08:00
Nathan Hjelm	f86f805be1	btl/vader: fix issues with xpmem registration invalidation This commit fixes an issue discovered in the XPMEM registration cache. It was possible for a registration to be invalidated by multiple threads leading to a double-free situation or re-use of an invalidated registration. This commit fixes the issue by setting the INVALID flag on a registration when it will be deleted. The flag is set while iterating over the tree to take advantage of the fact that a registration can not be removed from the VMA tree by a thread while another thread is traversing the VMA tree. References #6524 References #7030 Closes #6534 Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2020-01-07 22:50:52 -07:00
Eisuke Kawashima	d26d4e1d63	Fix typo and update URLs (https, redirection) [skip ci] Signed-off-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>	2020-01-07 03:52:25 +09:00
Jeff Squyres	a2a9a9516b	hwloc2: bump up to hwloc v2.1.0 Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-12-24 16:01:03 -08:00
Jeff Squyres	c292e759da	hwloc2: bump up to hwloc 2.0.4 Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-12-24 16:01:03 -08:00
Jeff Squyres	e5722acc37	hwloc201: replace with "hwloc2" component+git submodule Rename the component to be "hwloc2" (since it can now be any v2.x.y version of hwloc), and make the embedded copy of hwloc be a git submodule. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-12-24 16:01:03 -08:00
Jeff Squyres	18c3e1af5e	hwloc: clarify --with-hwloc behavior Clarify in README what --with-hwloc does in its different use cases. Also, ensure that the behavior when specifying `--with-hwloc` is the same as if that option is not specified at all. This is what we did in Open MPI <= v3.x; looks like we inadvertently caused `--with-hwloc` to be synonymous with `--with-hwloc=external` in v4.0.0. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-12-19 08:38:57 -08:00
Todd Kordenbrock	1af6dbe277	Merge pull request #7066 from tkordenbrock/topic/master/portals4.fix.flowcontrol.bugs portals4: fix flow control bugs	2019-12-11 06:31:26 -06:00
Maxwell Coil	52a9cce6f3	memory/patcher: fix compiler warning syscall() returns a long, but we are invoking shmat(), which returns a void*. Signed-off-by: Maxwell Coil <mcoil@nd.edu>	2019-12-08 13:56:00 -05:00
Jeff Squyres	53ebea12aa	mpool/base: fix basic mpool_base() function The prior implementation was simply wrong. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-12-05 18:57:37 -05:00
William Zhang	a471f8749f	reachable: Update documentation on reachable function We have decided to show interfaces that are identical to itself as reachable. This is consistent with the previous netmask logic when determining reachability. Signed-off-by: William Zhang <wilzhang@amazon.com>	2019-12-02 18:04:40 +00:00
William Zhang	ce40436895	reachable/netlink: Show an interface as reachable to itself Due to the way netlinks detects reachability, it will not show an interface as reachable to itself, even if it can pass through a loopback interface. To maintain similar behavior with netmasks, we display an interface as reachable to itself. Signed-off-by: William Zhang <wilzhang@amazon.com>	2019-12-02 18:04:40 +00:00
Joseph Schuchart	ee80babe5c	Remove unused opal_shmem_seg_hdr_t to retain alignment Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2019-11-29 08:40:14 +01:00
Brice Goglin	ea80a20e10	hwloc/base: fix opal proc locality wrt to NUMA nodes on hwloc 2.0 Both opal_hwloc_base_get_relative_locality() and _get_locality_string() iterate over hwloc levels to build the proc locality information. Unfortunately, NUMA nodes are not in those normal levels anymore since 2.0. We have to explicitly look at the special NUMA level to get that locality info. I am factorizing the core of the iterations inside dedicated "_by_depth" functions and calling them again for the NUMA level at the end of the loops. Thanks to Hatem Elshazly for reporting the NUMA communicator split failure at https://www.mail-archive.com/users@lists.open-mpi.org/msg33589.html It looks like only the opal_hwloc_base_get_locality_string() part is needed to fix that split, but there's no reason not to fix get_relative_locality() as well. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2019-11-27 12:41:33 +01:00
Geoffroy Vallee	de6f130b4a	Add the missing code to check a return code Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>	2019-11-24 11:04:10 -05:00
Austen Lauria	edcd6d8aeb	Merge pull request #7146 from bgoglin/master fix typos hlwoc->hwloc	2019-11-13 15:38:00 -05:00
Nathan Hjelm	09dd383f8b	Merge pull request #7108 from devreal/btl-ugni-deadlock uGNI: Fix potential deadlock when processing outstanding transfers	2019-11-11 10:56:56 -08:00
Howard Pritchard	9d345d9aa0	btl/uct: add UCT API version check to configury related to #7128 The UCX crew is no longer guaranteeing that the UCT API is going to be frozen, so this is kind of a whack-a-mole problem trying to keep the BTL UCT working with various changing UCT APIs. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2019-11-06 14:27:58 -07:00
Brice Goglin	5c6bd7ea4e	fix typos hlwoc->hwloc Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>	2019-11-06 10:42:36 +01:00
Nathan Hjelm	a3026c016a	btl/uct: fix compilation for UCX 1.7.0 Ref #7128 Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2019-11-05 12:53:26 -08:00
George Bosilca	476562752f	Correctly report TCP connect errors. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2019-10-31 18:33:15 -04:00
Joseph Schuchart	c09ca039b4	uGNI: Fix potential deadlock when processing outstanding transfers Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2019-10-26 12:21:17 +02:00
Austen Lauria	aa8be9c12d	Merge pull request #6284 from devreal/ompi-rdma-memalign Ensure proper alignment of memory provided by MPI	2019-10-25 12:27:58 -04:00
Nathan Hjelm	b1ef5a40fa	Merge pull request #7016 from hjelmn/fix_btl_uct_from_yet_another_unannounced_api_break_in_the_openucx_uct_layer btl/uct: add support for OpenUCX v1.8 API changes	2019-10-17 06:27:18 -07:00
Jeff Squyres	b6c4d5c118	Merge pull request #7060 from jsquyres/pr/usnic-mca-updates BTL usnic MCA updates	2019-10-15 10:48:10 -04:00
Stanislav Kirillov	0e0763e006	fix ipv6 btl connection bug Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>	2019-10-10 11:20:37 +00:00
Todd Kordenbrock	f7e74b6a3d	btl-portals4: fix a flow control configure bug This commit fixes a configure bug that caused flow control to be disabled regardless of the configure options used. Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>	2019-10-09 17:12:56 -05:00
Geoff Paulsen	4e1e6f8972	Merge pull request #6993 from awlauria/fix_warnings_master Fix miscellaneous compiler warnings.	2019-10-09 09:17:02 -05:00
Jeff Squyres	3080033a8c	btl/usnic: set retrans_timeout back down to 5ms Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-08 11:17:54 -07:00
Jeff Squyres	132e4cab3b	btl/usnic: set ack_iteration_delay default to 4 It was previously accidentally set to 0. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-08 11:17:30 -07:00
Jeff Squyres	fe7f772f21	btl/usnic: properly size freelist items Move the prefix area from the head to the body in relevant size computations. This fixes a problem in high traffic situations where usNIC may have sent from unregistered memory. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 14:40:56 -07:00
Jeff Squyres	27e3040dfe	btl/usnic: cap the number of resends per progress iteration New MCA param: btl_usnic_max_resends_per_iteration. This is the max number of resends we'll do in a single pass through usNIC component progress. This prevents progress from getting stuck in an endless loop of retransmissions (i.e., if more retransmissions are triggered during the sending of retransmissions). Specifically: we need to leave the resend loop to allow receives to happen (which may ACK messages we have sent previously, and therefore cause pending resends to be moot). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Jeff Squyres	3cc95d86b2	btl/usnic: increase default retrans_timeout Significantly increase the default retrans timeout. If the retrans timeout is too soon, we can end up in a retransmission storm where the logic will continually re-transmit the same frames during a single run through the usNIC progress function (because the timer for a single frame expires before we have run through re-transmitting all the frames pending re-transmission). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Jeff Squyres	968b1a51b5	btl/usnic: clarifications and fixes regarding ACKs New MCA parameter: btl_usnic_ack_iteration_delay. Set this to the number of times through the usNIC component progress function before sending a standalone ACK (vs. piggy-backing the ACK on any other send going to the target peer). Use "ticks" language to clarify that we're really counting the number of times through the usNIC component DATA_CHANNEL completion check (to check for incoming messages) -- it has no relation to wall clock time whatsoever. Also slightly change the channel-checking scheme in usNIC component progress: only check the PRIORITY channel once (vs. checking it once, not finding anything, and then falling through the progress_2() where we check PRIORITY again and then check the DATA channel). As before, if our "progress" libevent fires, increment the tick counter enough to guarantee that all endpoints that need an ACK will get triggered to send standalone ACKs the next time through progress, if necessary. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Jeff Squyres	ce2910a28a	btl/usnic: s/get_nsec/get_nticks/g Rename "get_nsec()" to "get_ticks()" to more accurately reflect that this function has no correlation to wall clock time at all. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Jeff Squyres	f3429d7a44	btl/usnic: pack a wire data struct Might as well save a few bytes when sending this struct across the network via the __opal_attribute_packed__ attribute. That being said, also re-order the elements in this struct so that there's no holes to begin with. Do this so that the compiler/runtime won't effect (slow) unaligned reads/writes because of the __opal_attribute_packed__ attribute. The "packed" attribute is really more about defensive programming (e.g., if we make a mistake and have a hole, "packed" will remove it for us). *** Do not bring this commit back to existing/already-released release branches: it will cause incompatibility, since it effectively changes the usNIC BTL wire protocol. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Austen Lauria	0d4004cc3c	Fix miscellaneous compiler warnings. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>	2019-10-01 16:27:25 -04:00
Joseph Schuchart	c385c927fb	Ensure proper alignment of memory provided by MPI Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>	2019-10-01 11:54:29 +02:00
Gilles Gouaillardet	1c4a3598d0	pmix/pmix4x: refresh to the latest open PMIx master refresh to openpmix/openpmix@ea3b29b1a4 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-10-01 14:27:22 +09:00
Nathan Hjelm	8473a66466	btl/uct: fix bug when using a transport without zero-copy This commit fixes a crash that can occur if a transport is usable but doesn't have zero-copy support. In this case do not attempt to use zero-copy and set the max send size off the bcopy limit. Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2019-09-27 17:26:37 -07:00
Nathan Hjelm	526775dfd7	btl/uct: add support for OpenUCX v1.8 API changes OpenUCX broke the UCT API again in v1.8. This commit updates btl/uct to fix compilation with current OpenUCX master (future v1.8). Further changes will likely be needed for the final release. Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2019-09-27 12:34:48 -07:00
Jeff Squyres	8038fac8f9	Merge pull request #6844 from adrianreber/check_for_user_ns Do not use CMA in user namespaces	2019-09-20 22:10:42 -04:00
Nathan Hjelm	ae91b11de2	btl/vader: when using single-copy emulation fragment large rdma This commit changes how the single-copy emulation in the vader btl operates. Before this change the BTL set its put and get limits based on the max send size. After this change the limits are unset and the put or get operation is fragmented internally. References #6568 Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2019-09-05 23:08:53 -07:00
Adrian Reber	fc68d8a90f	Do not use CMA in user namespaces Trying to run processes via mpirun in Podman containers has shown that the CMA btl_vader_single_copy_mechanism does not work when user namespaces are involved. Creating containers with Podman requires at least user namespaces to be able to do unprivileged mounts in a container. Even if running the container with user namespace user ID mappings which result in the same user ID on the inside and outside of all involved containers, the check in the kernel to allow ptrace (and thus process_vm_{read,write}v()), fails if the same IDs are not in the same user namespace. One workaround is to specify '--mca btl_vader_single_copy_mechanism none' and this commit adds code to automatically skip CMA if user namespaces are detected and fall back to MCA_BTL_VADER_EMUL. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-09-05 20:15:19 +02:00
Ralph Castain	9a2902c047	Force use of pmix/preg/native component Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-09-03 17:54:44 -07:00
Ralph Castain	2ebc1fa1b7	Update to current PMIx 4.0.0 (master) Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-09-03 14:46:55 -07:00
Howard Pritchard	71e1fad4a9	Merge pull request #6855 from hkuno/hkuno/mmap_loop Fix mmap infinite recurse in memory patcher	2019-08-26 14:28:53 -06:00
Howard Pritchard	5a3646fd1d	Merge pull request #6884 from guserav/reintroduce-common-ofi Reintroduce common ofi	2019-08-23 13:12:25 -06:00
Sergey Oblomov	182023febb	SPML/UCX: fixed hang in SHMEM_FINALIZE - used MPI_Barrier to synchronize processes Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-08-21 12:04:46 +03:00
guserav	56c3d9a238	common/ofi: Set HPE as owner of component Signed-off-by: guserav <erik.zeiske@web.de>	2019-08-20 10:13:02 -07:00
guserav	8a67a95c99	common/ofi: Fix open-mpi/ompi#2519 As discussed in open-mpi/ompi#2519 the common component does not depend on libfabric yet. This commit introduces this dependency by just calling fi_version(). Signed-off-by: guserav <erik.zeiske@hpe.com>	2019-08-12 16:17:59 -07:00
guserav	0e25c95eae	common/ofi: Fix check for OFI in build files The changes made in `f5e1a672cc` have been done after the common/ofi component was removed and thus the component doesn't reflect the changes made there. Namely `f5e1a672cc` changed: - How to call OPAL_CHECK_OFI (It sets opal_ofi_happy to yes now) - Dropped the common part in the build flags for ofi Signed-off-by: guserav <erik.zeiske@web.de>	2019-08-12 16:15:23 -07:00
guserav	4ad78aaa15	Revert "Remove opal/mca/common/ofi." This reverts commit `dd20174532`. Signed-off-by: guserav <erik.zeiske@web.de>	2019-08-09 10:33:21 -07:00
Yossi Itigin	ec9def1406	Merge pull request #6864 from hoopoepg/topic/ucx-ppn-hint UCX: added PPN hint for UCX context	2019-08-07 13:45:38 +03:00
Brian Barrett	827a2bcc3d	Merge pull request #6852 from wckzhang/opalifnamesize opal/util: Change opal/util/if.h macro IF_NAMESIZE to OPAL_IF_NAMESIZE	2019-08-05 16:16:05 -07:00
Sergey Oblomov	43186e494b	UCX: added PPN hint for UCX context - added PPN hint for UCX context init Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-08-05 18:07:06 +03:00
Harumi Kuno	fca8436064	Fix mmap infinite recurse in memory patcher This commit fixes issue #6853 by removing MacOS/Darwin-specific logic from intercept_mmap. It also opportunistically converts tabs to spaces. Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>	2019-08-01 10:18:25 -07:00
William Zhang	4ebb37a26c	opal/util: Change opal/util/if.h macro IF_NAMESIZE to OPAL_IF_NAMESIZE Due to IF_NAMESIZE being a reused and conditionally defined macro, issues could arise from macro mismatches. In particular, in cases where opal/util/if.h is included, but net/if.h is not, IF_NAMESIZE will be 32. If net/if.h is included on Linux systems, IF_NAMESIZE will be 16. This can cause a mismatch when using the same macro on a system. Thus different parts of the code can have differing ideas on the size of a structure containing a char name[IF_NAMESIZE]. To avoid this error case, we avoid reusing the IF_NAMESIZE macro and instead define our own as OPAL_IF_NAMESIZE. Signed-off-by: William Zhang <wilzhang@amazon.com>	2019-07-29 21:24:39 +00:00
Ralph Castain	c5c93e3391	Update to PMIx master Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-07-29 12:20:20 -07:00
Ralph Castain	d202e10c14	Provide locality for all procs on node Update PMIx to latest master to get supporting updates. For connect/accept (part of comm_spawn as well), lookup locality for all participating procs on the node and compute the relative locality so it can be used for MPI operations. Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-07-22 09:23:38 -07:00
Brian Barrett	41c2007af5	Merge pull request #6820 from wckzhang/cleanup btl tcp: Fix error path memory leak	2019-07-16 15:50:32 -07:00
Gilles Gouaillardet	06c6325bc8	Merge pull request #6822 from ggouaillardet/topic/pmix_refresh pmix/pmix4x: refresh to the latest PMIx master	2019-07-16 12:56:31 +09:00
Gilles Gouaillardet	4510711e95	pmix/pmix4x: refresh to the latest PMIx master refresh to pmix/pmix@03a8b5daab Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-07-16 11:31:20 +09:00
William Zhang	8c3b8a87c5	btl tcp: Fix error path memory leak After the OPAL_MODEX_RECV call, remote_addrs was not freed in the error path. Moved the free call into cleanup to ensure we always free this memory before leaving the function. Signed-off-by: William Zhang <wilzhang@amazon.com>	2019-07-15 22:35:04 +00:00
William Zhang	c0c3e1b540	reachable: Update documentation on reachable function Added information on the type of objects provided in the list as well as the required fields for them. Signed-off-by: William Zhang <wilzhang@amazon.com>	2019-07-11 21:36:12 +00:00
William Zhang	c9214cc53e	reachable: Change list name from _if to _ifs The parameter names were misleading due to implying a single interface instead of a list. This will provide more clarity in distinguishing the list of interfaces from each individual interface. Signed-off-by: William Zhang <wilzhang@amazon.com>	2019-07-11 21:33:24 +00:00
Jeff Squyres	650dd3e4cf	opal/if/linux_ipv6: unify gethostname() call behavior Unfortunately, https://github.com/open-mpi/ompi/pull/6797 was merged before all feedback was received (`39b799d936`). This PR is a minor addendum to that commit. This PR simply removes a meaningless `= {0}` operation. The use of gethostname() here -- and many other places in the code base -- is technically unsafe. See https://github.com/open-mpi/ompi/issues/6801 for a further description of the issue and a suggested fix. But the risk is quite low; real-world hostnames are usually much shorter than OPAL_MAXHOSTNAMELEN. Hence, this PR just removes the meaningless operation and leaves a real fix for gethostname() usage to a potential future PR. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-07-09 10:38:24 -07:00
Orivej Desh	39b799d936	Fix if_linux_ipv6 verbose output of interface addresses Previously the verbose output of if_linux_ipv6_open looked like this: found interface ab c: 0ab: a b: abc: 0 0: a 0:abcd: 0 0 scope 0 This changes the output to: found interface eth0 inet6 ab0c:ab:a0b:abc:0:a00:abcd:0 scope 0 Signed-off-by: Orivej Desh <orivej@gmx.fr> Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-07-09 05:45:15 -07:00
George Bosilca	4620c351ea	Prevent EPIPE on OSX. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2019-06-28 15:32:59 -04:00
Gilles Gouaillardet	63aa156bb0	pmix/pmix4x: refresh to the latest PMIx master refresh to pmix/pmix@99971222ce Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-06-27 09:35:49 +09:00
Gilles Gouaillardet	5679a88867	pmix/pmix4x: refresh to the latest PMIx master refresh to pmix/pmix@f67efc835c Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-06-24 10:17:23 +09:00
Ralph Castain	d4070d5f58	Fix finalize of flux component Per patches from @SteVwonder and @garlick Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-06-18 21:14:04 -07:00
Gilles Gouaillardet	d9326ff2ca	pmix/pmix4x: refresh to the latest PMIx master refresh to pmix/pmix@186dca196c Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-06-10 15:17:43 +09:00
markalle	008ab98946	Merge pull request #6531 from markalle/patcher_additions shmat/shmdt additions for patcher	2019-05-30 12:16:05 -05:00
Nathan Hjelm	b78066720c	btl/uct: add support for UCX 1.6.x This commit updates the uct btl to support the v1.6.x release of UCX. This release breaks API. Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>	2019-05-21 04:31:57 -06:00
Yossi Itigin	84ae05c7bc	Merge pull request #6675 from hoopoepg/topic/ucx-common-init-patcher-on-hooks-used-only COMMON/UCX: init memhooks infra on external hooks only	2019-05-16 22:35:32 +03:00
Sergey Oblomov	ebc457baf5	COMMON/UCX: removed ucs stuff Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-05-16 20:56:30 +03:00
Sergey Oblomov	a0a9306066	COMMON/UCX: init memhooks infra on external hooks only - initialize memory hooks infrastructure only if external memory hooks are requested Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-05-16 20:13:16 +03:00
Nathan Hjelm	3e1dd36241	btl/uct: check for support before disabling UCX memory hooks Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2019-05-15 13:49:10 -06:00
Jeff Squyres	df5f7afb14	usnic: fix Coverity false positives Add some Coverity inline notation to tell Coverity that these functions never return. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-05-14 13:53:25 -07:00
Jeff Squyres	566e6f1ca3	btl/usnic: remove legacy code Remove compatibility code for multiple versions of BTL_IN_OPAL, BTL_VERSION, and RCACHE_VERSION. This stuff was really only necessary when we were actively swapping code between multiple release branches that had large variations in core OMPI infrastructure. These large variations have now been around for quite a while, so the need for this "compat" layer is significantly reduced. It hasn't been removed simply because a few of the "compat" names are slightly more friendly than the real names (e.g., the SEND/RECV/PUT names). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-05-11 05:19:36 -07:00
Jeff Squyres	8a2441603f	btl/usnic: remove all calls to abort() Inspired by https://github.com/open-mpi/ompi/pull/5205, finally remove all calls to abort() from the usnic BTL. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-05-11 05:17:29 -07:00
Nathan Hjelm	b82a08254f	btl/ugni: fix 32-bit compare-and-swap atomics This commit fixes an error in the 32-bit compare-and-swap atomic support for Aries networks. The code was incorrectly using the non-fetching version of cswap which was causing the routine to return OPAL_ERR_BAD_ARG. Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>	2019-05-10 09:59:54 -06:00
Yossi Itigin	5d2200a7d6	Merge pull request #6605 from brminich/topic/shmem_all2all_put SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h	2019-05-01 12:00:21 +03:00
Mikhail Brinskii	2ef5bd8b36	SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h The new routine transfers the data asynchronously from the source PE to all PEs in the OpenSHMEM job. The routine returns immediately. The source and target buffers are reusable only after the completion of the routine. After the data is transferred to the target buffers, the counter object is updated atomically. The counter object can be read either using atomic operations such as shmem_atomic_fetch or can use point-to-point synchronization routines such as shmem_wait_until and shmem_test. Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>	2019-04-26 14:47:58 +03:00
Gilles Gouaillardet	562809fca1	pmix/pmix4x: refresh to the latest PMIx master refresh pmix4x to pmix/pmix@bde4a8a54f Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-04-23 09:31:43 +09:00
Ralph Castain	f4aa783848	Remove all the linkages back to libpmix in pmix components This link-back seems to be breaking OMPI for some reason. I'm not sure we need it in PMIx anyway, but we'll investigate over there. Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-04-16 13:30:20 -07:00
Mark Allen	bdd92a7a64	-cpu-set as a constraint rather than as a binding The first category of issue I'm addressing is that recent code changes seem to only consider -cpu-set as a binding option. Eg a command like this % mpirun -np 2 --report-bindings --use-hwthread-cpus \ --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname which just round robins over the --cpu-set list. Example output which seems fine to me: > MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] It should also be possible though to pass a --cpu-set to most other map/bind options and have it be a constraint on that binding. Eg % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname The first command above errors that > Conflicting directives for mapping policy are causing the policy > to be redefined: > New policy: RANK_FILE > Prior policy: BYHWTHREAD The error check in orte_rmaps_rank_file_open() is likely too aggressive. The intent seems to be that any option like "--map-by whatever" will check to see if a rankfile is in use, and report that mapping via rmaps and using an explicit rankfile is a conflict. But the check has been expanded to not just check NULL != orte_rankfile but also errors out if (NULL != opal_hwloc_base_cpu_list && !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy)) which seems to be only recognizing -cpu-set as a binding option and ignoring -cpu-set as a constraint on other binding policies. 
For now I've changed the NULL != opal_hwloc_base_cpu_list to OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy) so it hopefully only errors out if -cpu-set is being used as a binding policy. Whether I did that right or not it's enough to get to the next stage of testing the example commands I have above. Another place similar logic is used is hwloc_base_frame.c where it has /* did the user provide a slot list? */ if (NULL != opal_hwloc_base_cpu_list) { OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET); } where it used to (long ago) only do that if !OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy) I think the new code is making it impossible to use --cpu-set as anything other than a binding policy. That brings us past the error detection and into the real functionality, some of which has been stripped out, probably in moving to hwloc-2: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname > MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] The rank_by() function in rmaps_base_ranking.c makes an array out of objects returned from opal_hwloc_base_get_obj_by_type(,,,i,) which uses df_search(). That function changed quite a bit from hwloc-1 to 2 but it used to include a check for available = opal_hwloc_base_get_available_cpus(topo, start) which is where the bitmask from --cpu-set goes. And it used to skip objs that had hwloc_bitmap_iszero(available). So I restored that behavior in df_search() by adding a "constrained_cpuset" to replace start->cpuset that it was otherwise processing. 
With that change in place the first command works: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname > MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] The other command uses a different path though that still ignored the available mask: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname > MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] In bind_generic() the code used to call opal_hwloc_base_find_min_bound_target_under_obj() which used opal_hwloc_base_get_ncpus(), and that's where it would intersect objects with the available cpuset and skip over ones that weren't available. To match the old behavior I added a few lines in bind_generic() to skip over objects that don't intersect the available mask. After that we get % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname > MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] I think the above changes are improvements, but I don't feel like they're comprehensive. I only traced through enough code to fix the two specific bugs I was dealing with. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2019-04-12 15:33:56 -04:00
Gilles Gouaillardet	9ce8d7b568	pmix/pmix4x: refresh to the latest PMIx refresh pmix4x to pmix/pmix@2531c0c3d1 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-04-09 14:03:00 +09:00
KAWASHIMA Takahiro	77286a41aa	opal/sys: Introduce OPAL_HAVE_SYS_TIMER_GET_FREQ macro ... to avoid using an architecture name macro in `opal/mca/timer/linux/timer_linux_component.c`. The function name `opal_sys_timer_freq` is also changed for consistency with `opal_sys_timer_get_cycles`. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2019-04-04 11:48:02 +09:00
Mark Allen	eb888118e8	shmat/shmdt additions for patcher This is mostly based off recent UCX additions to their patcher: https://github.com/openucx/ucx/pull/2703 They added triggers for * mmap when (flags & MAP_FIXED) && (addr != NULL) * shmat when (shmflg & SHM_REMAP) && (shmaddr != NULL) Beyond that I noticed they already had a trigger for * madvise when (advice == MADV_FREE) that we didn't so I added that. And the other main thing is we didn't really have shmat/shmdt active for some systems because we only had a path for syscall(SYS_shmdt, ) but we needed to also have a path for syscall(SYS_ipc, IPCOP_shmdt, ) and same for shmat. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2019-03-29 14:38:46 -04:00
bosilca	b54fdf5dd9	Merge pull request #6541 from bwbarrett/bugfix/enotconn btl/tcp: Skip printing error message in racy cleanup path	2019-03-28 22:42:52 -04:00
Brian Barrett	d5360711fa	btl/tcp: Skip printing error message in racy cleanup path Avoid printing an error message about ENOTCONN return codes from getpeername() when handling an incoming connection request. At this point in the receive state machine, the remote process has been verified to be a valid OMPI instance. In all-to-all startup at 4k rank scale, we're seeing this error message when the remote side drops the connection because it realizes it's the "loser" in the connection race. We were already doing all the right things, other than printing a scary error message. So skip the error message and call it good. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2019-03-28 23:12:35 +00:00
Gilles Gouaillardet	77060cad07	btl/vader: fix finalize sequence free the component mpool in mca_btl_vader_component_close() and after freeing some objects that depend on it such as mca_btl_vader_component.vader_frags_user Thanks Christoph Niethammer for reporting this. Refs. open-mpi/ompi#6524 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-03-27 11:57:40 +09:00
Gilles Gouaillardet	e844f76725	pmix/pmix4x: refresh to the latest PMIx refresh pmix4x to pmix/pmix@20cc9c041e Fixes open-mpi/ompi#6513 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-03-25 13:33:18 +09:00
Joshua Ladd	9ab6ecba65	Merge pull request #6492 from janjust/oshmem-multiple-contexts-master Oshmem multiple contexts	2019-03-22 17:34:46 -04:00
Xin Zhao	9c3d00b144	ompi/oshmem/spml/ucx: use lockfree array to optimize spml_ucx_progress/delete oshmem_barrier in shmem_ctx_destroy ompi/oshmem/spml/ucx: optimize spml ucx progress Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>	2019-03-21 23:01:45 +02:00
Xin Zhao	e1c1ab0202	ompi/oshmem/spml/ucx: defer clean up shmem_ctx to shmem_finalize Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>	2019-03-21 23:01:37 +02:00
Josh Hursey	53cd31ed7e	Merge pull request #6504 from jjhursey/rm-hash-pmix4 Do not force 'hash' gds on direct modex in pmix4x	2019-03-19 20:35:12 -05:00
Ralph Castain	0f26d8c76b	Silence warnings Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-03-19 10:27:39 -07:00
Ralph Castain	c4be211741	Sync to latest PMIx master Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-03-19 10:27:12 -07:00
Joshua Hursey	1314cf2640	Do not force 'hash' gds on direct modex in pmix4x * Forcing the 'hash' gds component should not be necessary any more. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2019-03-19 11:53:26 -05:00
Joshua Hursey	c2581d0e33	Do not force 'hash' gds on direct modex * Forcing the 'hash' gds component should not be necessary any more. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2019-03-18 21:52:32 -05:00
Josh Hursey	ad8c842e7d	Merge pull request #6477 from markalle/report_bindings_strlen opal_hwloc_base_cset2str() off-by-1 in its strncat()	2019-03-14 12:42:50 -05:00
Sergey Oblomov	c319cf9ade	COMMON/UCX: rewording of hooks suggestion - also updated output macro Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-03-14 11:00:57 +02:00
Sergey Oblomov	d8e3562bae	PML/SPML/UCX: added evaluation of mmap events - a set of UCX-related issues was reported that were caused by mmap API hook conflicts. We added diagnostics for such problems to simplify the bug-resolving pipeline Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>	2019-03-12 21:14:27 +02:00
Mark Allen	30d60994d2	opal_hwloc_base_cset2str() off-by-1 in its strncat() I think the strncat() calls here need to be of the form strncat(str, new_str_to_add, len - strlen(new_str_to_addstr) - 1); since in the OMPI calls len is being used as total number of bytes in str. strncat(dest,src,n) on the other hand is documented as writing up to n chars from the incoming string plus 1 for the null, for n+1 total bytes it can write. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2019-03-11 14:35:53 -04:00
Jeff Squyres	14563770a1	btl/usnic: amend Makefile.am fix from `b4097626ab` Use $(AM_CPPFLAGS) in $(usnic_btl_run_tests_CPPFLAGS) so that we don't have to replicate hard-coded values. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-03-05 09:30:21 -08:00
Gilles Gouaillardet	b4097626ab	btl/usnic: fix usnic_btl_run_tests CPPFLAGS do define the OMPI_LIBMPI_NAME macro via the CPPFLAGS. The issue occurs when Open MPI is configured with --enable-opal-btl-usnic-unit-tests Thanks George Marselis for reporting this issue Refs. open-mpi/ompi#6441 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-03-05 09:57:55 +09:00
Ralph Castain	8fd6107987	Merge pull request #6418 from rhc54/topic/slurm Update slurm pmi configury to account for pmix	2019-02-27 14:30:45 -08:00
Ben Menadue	17dcc7041a	Hold off running hwloc:external feature tests until after we decide if we're using the internal or external component. This fixes #6430 . Signed-off-by: Ben Menadue <ben.menadue@nci.org.au>	2019-02-25 16:58:11 +11:00
Ralph Castain	cd1b5641be	Update slurm pmi configury to account for pmix When Slurm is built against PMIx, some installations place a copy of the PMIx library that Slurm is linking against in the Slurm PMI location. Current configury ignores that location. The desired behavior is to look for a PMIx lib in that location when --with-pmi is given. If the user also specifies --with-pmix and gives a different location, then override anything previously found and look for it where the user directed. Signed-off-by: Ralph Castain <rhc@pmix.org>	2019-02-21 11:33:35 -08:00
Artem Polyakov	13a8e42108	Merge pull request #6163 from artpol84/osc/mt_submission Refactoring of osc/ucx component for MT	2019-02-20 09:41:27 -08:00
Jeff Squyres	170d5d119e	Merge pull request #6409 from dmitrygladkov/topic/btl/tcp btl/tcp: Fix copy-paste misprint	2019-02-20 12:12:18 -05:00
Dmitry Gladkov	9920da4992	btl/tcp: Fix copy-paste misprint Signed-off-by: Dmitry Gladkov <dmitrygla@mellanox.com>	2019-02-20 11:18:02 +02:00
Artem Polyakov	91d6115d99	opal/common/ucx: Adjust the thresholds for periodical flushes Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Artem Polyakov	3aadc2b5e1	opal/common/ucx: Fix periodical flush in the worker pool Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Artem Polyakov	84dfe1277c	opal/common/ucx: Rename wpool recv_worker to dflt_worker Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Artem Polyakov	8a990c2b64	opal/common/ucx: Add comments clarifying data structures Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Artem Polyakov	19e2ae2efb	opal/common/ucx: Switch to opal/tsd Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Artem Polyakov	7984d7d997	opal/common/ucx: Remove unused debugging macro Will be reintroduced later if needed and after adaptation to the OMPI infrastructure. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Artem Polyakov	43f16d8796	opal/common/ucx: Remove common_ucx_int.h Place the content of common_ucx_int.h back to the common_ucx.h and include common_ucx_wpool.h explicitly. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Xin Zhao	bb7d360621	opal/common/ucx: add refcnt in tlocal_ctx_tbl entry to keep track of usage Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Xin Zhao	101036651b	opal/common/ucx: Fix the bug in wpool's periodical flush Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Xin Zhao	bcb52ecade	opal/common/ucx: add winfo ptr into req Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2019-02-19 14:22:07 -08:00
Xin Zhao	33517428a1	opal/common/ucx: add periodical flush and counter to opal directory. Signed-off-by: Xin Zhao <xinz@mellanox.com>	2019-02-19 14:22:07 -08:00
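
Commit 30d60994d2 above describes an off-by-one in how strncat() was bounded in opal_hwloc_base_cset2str(). The point it makes can be sketched as follows: when the length tracks the total size of the destination buffer, the count passed to strncat() has to be the space still free in the destination, minus one byte for the terminator. The helper name bounded_append and its parameters are hypothetical, chosen for illustration only; this is not the Open MPI code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): append src to dst, where
 * dst_size is the TOTAL size in bytes of the buffer behind dst and dst
 * already holds a NUL-terminated string. */
void bounded_append(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);

    size_t used = strlen(dst);
    if (used + 1 >= dst_size) {
        return; /* only the terminator fits; nothing can be appended */
    }

    /* strncat(dst, src, n) copies at most n characters from src and then
     * writes a '\0', so it can write n + 1 bytes. Passing the free space
     * minus one keeps the terminator inside the buffer. */
    strncat(dst, src, dst_size - used - 1);
}
```

With an 8-byte buffer that already holds "abc", calling bounded_append(buf, sizeof(buf), "defghij") leaves "abcdefg" in the buffer: four characters are appended and the terminator lands in the last byte.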

3916 commits