openmpi

Автор	SHA1	Сообщение	Дата
Jeff Squyres	3cc95d86b2	btl/usnic: increase default retrans_timeout Significantly increase the default retrans timeout. If the retrans timeout is too soon, we can end up in a retransmission storm where the logic will continually re-transmit the same frames during a single run through the usNIC progress function (because the timer for a single frame expires before we have run through re-transmitting all the frames pending re-transmission). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Jeff Squyres	968b1a51b5	btl/usnic: clarifications and fixes regarding ACKs New MCA parameter: btl_usnic_ack_iteration_delay. Set this to the number of times through the usNIC component progress function before sending a standalone ACK (vs. piggy-backing the ACK on any other send going to the target peer). Use "ticks" language to clarify that we're really counting the number of times through the usNIC component DATA_CHANNEL completion check (to check for incoming messages) -- it has no relation to wall clock time whatsoever. Also slightly change the channel-checking scheme in usNIC component progress: only check the PRIORITY channel once (vs. checking it once, not finding anything, and then falling through the progress_2() where we check PRIORITY again and then check the DATA channel). As before, if our "progress" libevent fires, increment the tick counter enough to guarantee that all endpoints that need an ACK will get triggered to send standalone ACKs the next time through progress, if necessary. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Jeff Squyres	ce2910a28a	btl/usnic: s/get_nsec/get_nticks/g Rename "get_nsec()" to "get_ticks()" to more accurately reflect that this function has no correlation to wall clock time at all. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Jeff Squyres	f3429d7a44	btl/usnic: pack a wire data struct Might as well save a few bytes when sending this struct across the network via the __opal_attribute_packed__ attribute. That being said, also re-order the elements in this struct so that there's no holes to begin with. Do this so that the compiler/runtime won't effect (slow) unaligned reads/writes because of the __opal_attribute_packed__ attribute. The "packed" attribute is really more about defensive programming (e.g., if we make a mistake and have a hole, "packed" will remove it for us). *** Do not bring this commit back to existing/already-released release branches: it will cause incompatibility, since it effectively changes the usNIC BTL wire protocol. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-10-04 13:05:51 -07:00
Austen Lauria	0d4004cc3c	Fix miscellaneous compiler warnings. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>	2019-10-01 16:27:25 -04:00
Nathan Hjelm	8473a66466	btl/uct: fix bug when using a transport without zero-copy This commit fixes a crash that can occur if a transport is usable but doesn't have zero-copy support. In this case do not attempt to use zero-copy and set the max send size off the bcopy limit. Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2019-09-27 17:26:37 -07:00
Nathan Hjelm	526775dfd7	btl/uct: add support for OpenUCX v1.8 API changes OpenUCX broke the UCT API again in v1.8. This commit updates btl/uct to fix compilation with current OpenUCX master (future v1.8). Further changes will likely be needed for the final release. Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2019-09-27 12:34:48 -07:00
Jeff Squyres	8038fac8f9	Merge pull request #6844 from adrianreber/check_for_user_ns Do not use CMA in user namespaces	2019-09-20 22:10:42 -04:00
Nathan Hjelm	ae91b11de2	btl/vader: when using single-copy emulation fragment large rdma This commit changes how the single-copy emulation in the vader btl operates. Before this change the BTL set its put and get limits based on the max send size. After this change the limits are unset and the put or get operation is fragmented internally. References #6568 Signed-off-by: Nathan Hjelm <hjelmn@google.com>	2019-09-05 23:08:53 -07:00
Adrian Reber	fc68d8a90f	Do not use CMA in user namespaces Trying out to run processes via mpirun in Podman containers has shown that the CMA btl_vader_single_copy_mechanism does not work when user namespaces are involved. Creating containers with Podman requires at least user namespaces to be able to do unprivileged mounts in a container Even if running the container with user namespace user ID mappings which result in the same user ID on the inside and outside of all involved containers, the check in the kernel to allow ptrace (and thus process_vm_{read,write}v()), fails if the same IDs are not in the same user namespace. One workaround is to specify '--mca btl_vader_single_copy_mechanism none' and this commit adds code to automatically skip CMA if user namespaces are detected and fall back to MCA_BTL_VADER_EMUL. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-09-05 20:15:19 +02:00
William Zhang	4ebb37a26c	opal/util: Change opal/util/if.h macro IF_NAMESIZE to OPAL_IF_NAMESIZE Due to IF_NAMESIZE being a reused and conditionally defined macro, issues could arise from macro mismatches. In particular, in cases where opal/util/if.h is included, but net/if.h is not, IF_NAMESIZE will be 32. If net/if.h is included on Linux systems, IF_NAMESIZE will be 16. This can cause a mismatch when using the same macro on a system. Thus different parts of the code can have differring ideas on the size of a structure containing a char name[IF_NAMESIZE]. To avoid this error case, we avoid reusing the IF_NAMESIZE macro and instead define our own as OPAL_IF_NAMESIZE. Signed-off-by: William Zhang <wilzhang@amazon.com>	2019-07-29 21:24:39 +00:00
William Zhang	8c3b8a87c5	btl tcp: Fix error path memory leak After the OPAL_MODEX_RECV call, remote_addrs was not freed in the error path. Moved the free call into cleanup to ensure we always free this memory before leaving the function. Signed-off-by: William Zhang <wilzhang@amazon.com>	2019-07-15 22:35:04 +00:00
George Bosilca	4620c351ea	Prevent EPIPE on OSX. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2019-06-28 15:32:59 -04:00
Nathan Hjelm	b78066720c	btl/uct: add support for UCX 1.6.x This commit updates the uct btl to support the v1.6.x release of UCX. This release breaks API. Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>	2019-05-21 04:31:57 -06:00
Nathan Hjelm	3e1dd36241	btl/uct: check for support before disabling UCX memory hooks Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2019-05-15 13:49:10 -06:00
Jeff Squyres	df5f7afb14	usnic: fix Coverity false positives Add some Coverity inline notation to tell Coverity that these functions never return. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-05-14 13:53:25 -07:00
Jeff Squyres	566e6f1ca3	btl/usnic: remove legacy code Remove compatibility code for multiple versions of BTL_IN_OPAL, BTL_VERSION, and RCACHE_VERSION. This stuff was really only necessary when we were actively swapping code between multiple release branches that had large variations in core OMPI infrastructure. These large variations have now been around for quite a while, so the need for this "compat" layer is significantly reduced. It hasn't been removed simply because a few of the "compat" names a slightly more friendly than the real names (e.g., the SEND/RECV/PUT names). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-05-11 05:19:36 -07:00
Jeff Squyres	8a2441603f	btl/usnic: remove all calls to abort() Inspired by https://github.com/open-mpi/ompi/pull/5205, finally remove all calls to abort() from the usnic BTL. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-05-11 05:17:29 -07:00
Nathan Hjelm	b82a08254f	btl/ugni: fix 32-bit compare-and-swap atomics This commit fixes an error in the 32-bit compare-and-swap atomic support for Aries networks. The code was incorrectly using the non-fetching version of cswap which was causing the routing to return OPAL_ERR_BAD_ARG. Signed-off-by: Nathan Hjelm <hjelmn@cs.unm.edu>	2019-05-10 09:59:54 -06:00
bosilca	b54fdf5dd9	Merge pull request #6541 from bwbarrett/bugfix/enotconn btl/tcp: Skip printing error message in racy cleanup path	2019-03-28 22:42:52 -04:00
Brian Barrett	d5360711fa	btl/tcp: Skip printing error message in racy cleanup path Avoid printing an error message about ENOTCONN return codes from getpeername() when handling an incoming connection request. At this point in the receive state machine, the remote process has been verified to be a valid OMPI instance. In all-to-all startup at 4k rank scale, we're seeing this error message when the remote side drops the connection because it realizes it's the "loser" in the connection race. We were already doing all the right things, other than printing a scary error message. So skip the error message and call it good. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2019-03-28 23:12:35 +00:00
Gilles Gouaillardet	77060cad07	btl/vader: fix finalize sequence free the component mpool in mca_btl_vader_component_close() and after freeing soem objects that depend on it such as mca_btl_vader_component.vader_frags_user Thanks Christoph Niethammer for reporting this. Refs. open-mpi/ompi#6524 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-03-27 11:57:40 +09:00
Jeff Squyres	14563770a1	btl/usnic: amend Makefile.am fix from `b4097626ab` Use $(AM_CPPFLAGS) in $(usnic_btl_run_tests_CPPFLAGS) so that we don't have to replicate hard-coded values. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-03-05 09:30:21 -08:00
Gilles Gouaillardet	b4097626ab	btl/usnic: fix usnic_btl_run_tests CPPFLAGS do define the OMPI_LIBMPI_NAME macro via the CPPFLAGS. The issue occurs when Open MPI is configured with --enable-opal-btl-usnic-unit-tests Thanks George Marselis for reporting this issue Refs. open-mpi/ompi#6441 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2019-03-05 09:57:55 +09:00
Jeff Squyres	170d5d119e	Merge pull request #6409 from dmitrygladkov/topic/btl/tcp btl/tcp: Fix copy-paste misprint	2019-02-20 12:12:18 -05:00
Dmitry Gladkov	9920da4992	btl/tcp: Fix copy-paste misprint Signed-off-by: Dmitry Gladkov <dmitrygla@mellanox.com>	2019-02-20 11:18:02 +02:00
Jeff Squyres	f5e1a672cc	ofi: revamp OPAL_CHECK_OFI configury Update the OPAL_CHECK_OFI configury macro: - Make it safe to call the macro multiple times: - The checks only execute the first time it is invoked - Subsequent invocations, it just emits a friendly "checking..." message so that configure output is sensible/logical - With the goal of ultimately removing opal/mca/common/ofi, rename the output variables from OPAL_CHECK_OFI to be opal_ofi_{happy\|CPPFLAGS\|LDFLAGS\|LIBS}. - Update btl/ofi, btl/usnic, and mtl/ofi for these new conventions. - Also, don't use AC_REQUIRE to invoke OPAL_CHECK_OFI because that causes the macro to be invoked at a fairly random time, which makes configure stdout confusing / hard to grok. - Remove a little left-over kruft in OPAL_CHECK_OFI, too (which resulted in an indenting change, making the change to opal_check_ofi.m4 look larger than it really is). Thanks Alastair McKinstry for the report and initial fix. Thanks Rashika Kheria for the reminder. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-02-07 06:29:58 -08:00
Jeff Squyres	b556cabfe9	btl/ofi/Makefile.am: down with tabs! Replace all tabs with spaces. No code or logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-02-07 06:29:58 -08:00
Jeff Squyres	8de786f5a4	btl/openib: So long / farewell / it's time to say goodnight So long BTL openib! After many years of (mostly) faithful service, it is time to remove the openib BTL. It has been fully replaced by other components, such as the UCX PML and OFI MTL. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2019-02-07 05:34:19 -08:00
Aravind Gopalakrishnan	786e686d43	btl/ofi: Fix valgrind complaints on uninitialized pointer use It doesn't seem like the BTL was using uninitialized pointer. But simply setting the rcache pointer to NULL after destroying it makes the valgrind errors go away. Fixes Issue #6345 Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>	2019-02-01 14:03:23 -08:00
Nathan Hjelm	b51c8f888c	btl/vader: fix fragment sizes used by free lists This commit fixes a bug introduced in `f62d26ddbc`. That commit changed how vader allocates fragment memory from the shared memory segment. Unfortunately, the values used for the fragment sizes did not include space for the fragment header. This can cause an overrun of data from one fragment to the header of the next fragment. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2019-01-30 12:31:34 -07:00
Nathan Hjelm	f62d26ddbc	btl/vader: use basic mpool type to handle frag/fbox allocation This commit updates btl/vader to use an mpool for handling all shared memory allocations (frags, fboxes). Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2019-01-14 15:54:16 -07:00
Nathan Hjelm	edaf08bf6d	btl/vader: minor correction to match ompi coding style Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2019-01-07 14:24:58 -07:00
Nathan Hjelm	30b8336cb4	btl/vader: don't try to set reachabilty in add_procs if not requested This commit fixes a bug where add_procs can incorrectly return an error when going through the dynamic add_procs path. This doesn't happen normally, only when pml/ob1 is not in use. References #6201 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2019-01-07 13:00:48 -07:00
Gilles Gouaillardet	b89deeb1bb	btl/uct: fix a typo in configure.m4 remove whitespace around '=' when setting btl_uct_LIBS Thanks Ake Sandgren for reporting this Refs. open-mpi/ompi#6173 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-12-11 14:23:53 +09:00
Gilles Gouaillardet	78aa6fdd1d	btl/uct: fix a warning Use the PRIsize_t macro to correctly print a size_t Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-12-07 16:16:35 +09:00
Nathan Hjelm	e07a64c52d	btl/uct: fix some issues when using UCX over ugni Though not a recommended configuration it is possible to use Open MPI over UCX over uGNI. This configuration had some issues related to the connection management and tl selection. This commit fixes those issues. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-12-05 16:30:54 -07:00
Gilles Gouaillardet	0a2ce58040	btl/openib: delay UCX warning to add_procs() If UCX is available, then pml/ucx will be used instead of pml/ob1 + btl/openib, so there is no need to warn about btl/openib not supporting Infiniband. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-11-29 16:20:55 +09:00
KAWASHIMA Takahiro	cacd6f389c	datatype: Remove `#if HAVE_[TYPE]` for C99 types Now Open MPI requires a C99 compiler. Checking availability of the following types is no more needed. - `long long` (`signed` and `unsigned`) - `long double` - `float _Complex` - `double _Complex` - `long double _Complex` Furthermore, the `#if HAVE_[TYPE]` style checking is not correct. Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`. `AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h` if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]` (instead of defining as `0`) if it is not available. So even if we need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`. I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac` since someone may use `HAVE_[TYPE]` macros somewhere. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2018-11-14 09:32:52 +09:00
Thananon Patinyasakdikul	d9bd54c628	btl/ofi: fixed compiler warning on OSX. This commit closes #6049 Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>	2018-11-06 15:37:25 -05:00
Howard Pritchard	8126779a35	btl/openib: fix a problem with ib query Under certain circumstances, ibv_exp_query_device was returning an error due to uninitialized fields in the extended attributes struct. Fixes: #5810 Fixes: #5914 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-10-31 14:50:15 -06:00
Brian Barrett	a1e85b03aa	btl tcp: Fix compile error in IPv6 In `457f058` I broke the TCP BTL with --enable-ipv6. This patch fixes the compile error, so IPv6 works again. Fixed #5996 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-30 12:31:04 -07:00
Gilles Gouaillardet	b715dd2657	btl/uct: fix AC_CHECK_DECLS usage AC_CHECK_DECLS take a comma separated list of macros/symbols, so replace the whitespace separator with a comma. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2018-10-26 15:36:02 +09:00
Nathan Hjelm	1b37328ba8	btl/uct: update for UCT_CB_FLAG_SYNC removal Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-21 18:57:42 -06:00
Brian Barrett	457f058e73	btl tcp: Simplify modex address selection Simplify selection of the address to publish for a given BTL TCP module in the module exchange code. Rather than looping through all IP addresses associated with a node, looking for one that matches the kindex of a module, loop over the modules and use the address stored in the module structure. This also happens to be the address that the source will use to bind() in a connect() call, so this should eliminate any confusion (read: bugs) when an interface has multiple IPs associated with it. Refs #5818 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-18 02:23:21 +00:00
Brian Barrett	4f19221af2	btl tcp: Simplify module address storage Today, a btl tcp module is associated with exactly one IP address (IPv4 or IPv6). There's no need to reserve space for both an IPv4 and IPv6 address in the module structure, since the module will only be associated with one or the other. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-17 16:21:17 +00:00
Brian Barrett	2acc4b7e7f	btl tcp: Add workaround for "dropped connection" issue Work around a race condition in the TCP BTL's proc setup code. The Cisco MTT results have been failing on TCP tests due to a "dropped connection" message some percentage of the time. Some digging shows that the issue happens in a combination of multiple NICs and multiple threads. The race is detailed in https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032. This patch doesn't fix the race, but avoids it by forcing the MPI layer to complete all calls to add_procs across the entire job before any process leaves MPI_INIT. It also reduces the scalability of the TCP BTL by increasing start-up time, but better than hanging. The long term fix is to do all endpoint setup in the first call to add_procs for a given remote proc, removing the race. THis patch is a work around until that patch can be developed. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-16 18:33:30 -07:00
Nathan Hjelm	37a3f32e47	Merge pull request #5913 from hjelmn/uct_btl_fixes btl/uct: fix deadlock in connection code	2018-10-16 19:11:54 -06:00
Nathan Hjelm	707d35deeb	btl/uct: fix deadlock in connection code This commit fixes a deadlock that can occur when using a TL that supports the connect to endpoint model. The deadlock was occurring while processing an incoming connection requests. This was done from an active-message callback. For some unknown reason (at this time) this callback was sometimes hanging. To avoid the issue the connection active-message is saved for later processing. At the same time I cleaned up the connection code to eliminate duplicate messages when possible. This commit also fixes some bugs in the active-message send path: - Correctly set all fragment fields in prepare_src. - Fix bug when using buffered-send. We were not reading the return code correctly (which is in bytes). This resulted in a message getting sent multiple times. - Don't try to progress sends from the btl_send function when in an active-message callback. It could lead to deep recursion and an eventual crash if we get a trace like send->progress->am_complete->ob1_callback->send->am_complete... Closes #5820 Closes #5821 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-16 18:28:47 -06:00
Brian Barrett	da1189d771	Merge pull request #5916 from bwbarrett/revert/6acebc4 Revert "Handle error cases in TCP BTL"	2018-10-16 13:54:18 -07:00
Jeff Squyres	cef6cf0ac5	opal: update some string handling Make various string handling locations a little more robust. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-14 16:04:28 -07:00
Brian Barrett	5162011428	Revert "Handle error cases in TCP BTL" This reverts commit `6acebc40a1`. This patch is causing numerous "Socket closed" messages which are causing most of the failures on Cisco's MTT run. See https://github.com/open-mpi/ompi/issues/5849 for more information. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-12 15:01:54 -07:00
Nathan Hjelm	6ed68da870	btl/uct: use the correct tl interface attributes It is apparently possible for different instances of the same UCT transport to have different limits (max short put for example). To account for this we need to store the attributes per TL context not per TL. This commit fixes the issue. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-11 11:33:17 -06:00
Brian Barrett	902b57919c	btl tcp: Print number of endpoints on error While trying to debug #3035, it's not clear whether there is an issue with the modex data or printing the address list. Print the number of endpoints on the error, which will help determine which case is happening to Cisco. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-10 08:29:48 -07:00
Brian Barrett	bd07cc6707	btl tcp: Improve initialization debugging When creating TCP BTL modules, print more information about the module's ethernet association, including the first address associated with the device, as debug output. Fix a flipped output string for IPv4 and IPv6 addresses in the modex send code. Add the addresses being published in the modex to the debugging output in modex send, to help match failures in endpoint match. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-10 08:29:48 -07:00
Nathan Hjelm	39be6ec15c	btl/uct: bug fixes and general improvements This commit updates the uct btl to change the transports parameter into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the priority list. This will give better out of the box performance for multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5 lock inside libmlx5_rdmav2. This commit also fixes a number of leaks and a possible deadlock when using RDMA. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-09 15:15:45 -06:00
Brian Barrett	e9e4d2a4bc	Handle asprintf errors with opal_asprintf wrapper The Open MPI code base assumed that asprintf always behaved like the FreeBSD variant, where ptr is set to NULL on error. However, the C standard (and Linux) only guarantee that the return code will be -1 on error and leave ptr undefined. Rather than fix all the usage in the code, we use opal_asprintf() wrapper instead, which guarantees the BSD-like behavior of ptr always being set to NULL. In addition to being correct, this will fix many, many warnings in the Open MPI code base. Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-10-08 16:43:53 -07:00
Nathan Hjelm	8291f6722d	btl/vader: fix race condition in writing header Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-10-05 16:30:06 -06:00
Geoff Paulsen	56a55cdb4b	Merge pull request #5786 from jsquyres/pr/string-madness Replace strcpy() and strncpy() with (new) opal_string_copy()	2018-10-04 16:12:46 -05:00
Nathan Hjelm	dfa8d3a81a	btl/vader: work around Oracle compiler bug This commit works around an Oracle C compiler bug in 5.15 (not sure when it was introduced). The bug is triggered when we chain assignments of atomic variables. Ex: _Atomic intptr x, y; intptr_t z = 0; x = y = z; Will produce a compiler error of the form: operand cannot have void type: op "=" assignment type mismatch: long "=" void To work around the issue we are removing the chain assignment and setting the head and tail on different lines. Fixes #5814 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-03 08:55:51 -06:00
Nathan Hjelm	66a7dc4c72	btl/vader: ensure the fast box tag is always read first On some platfoms reading a 64-bit value is non-atomic and it is possible that the two 32-bit values are read in the wrong order. To ensure the tag is always read first this commit reads the tag before reading the full 64-bit value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-10-03 07:17:32 -06:00
George Bosilca	a3a492b42c	Small pedantic fixes. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-02 12:08:18 -04:00
George Bosilca	9164e26e2f	Provide the correct socklen to bind. Get Brian's patch from #5825 and his log message: Fix a failure in binding the initiating side of a connection on MacOS. MacOS doesn't like passing the size of the storage structure (sockaddr_storage) instead of the expected size of the structure (sockaddr_in or sockaddr_in6), which was causing bind() failures. This patch simply changes the structure size to the expected size. Add a more clear error message in debug mode. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-10-02 12:06:40 -04:00
Jeff Squyres	5dae086f7e	btl/tcp: output the IP address correctly Per https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673, it looks like the IP address for a given interface is being stashed in two places: on the endpoint and on the module. 1. On the endpoint, it is storing the moral equivalent of a (struct sockaddr_in.sin_addr). 2. On the module, it is storing a full (struct sockaddr_storage). The call to opal_net_get_hostname() expects a full (struct sockaddr*) -- not just the stripped-down (struct sockaddr_in.sin_addr). Hence, when the original code was passing in the endpoint's (struct sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it like a (struct sockaddr), hilarity ensued (i.e., we got the wrong output). This commit eliminates the call to opal_net_get_hostname() and just calls inet_ntop() directly to convert the (struct sockaddr_in.sin_addr) to a string. NOTE: Per the github comment cited above, there can be a disparity between the IP address cached on the endpoint vs. the IP address cached on the module. This only happens with interfaces that have more than one IP address. This commit does not fix that issue. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-10-01 16:12:57 -07:00
Jeff Squyres	28e51ad4d4	opal: convert from strncpy() -> opal_string_copy() In many cases, this was a simple string replace. In a few places, it entailed: 1. Updating some comments and removing now-redundant foo[size-1]='\0' statements. 2. Updating passing (size-1) to (size) (because opal_string_copy() wants the entire destination buffer length). This commit actually fixes a bunch of potential (yet quite unlikely) bugs where we could have ended up with non-null-terminated strings. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-27 11:56:18 -07:00
Jeff Squyres	6bb356ab87	Squash a bunch of harmless compiler warnings. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-26 12:15:21 -07:00
Nathan Hjelm	c6230657a2	opal: fix warning Fixes #5713 Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2018-09-21 00:51:40 -06:00
Brian Barrett	9344afd485	openib: Disable CUDA async by default Disable async receive for CUDA under OpenIB. While a performance optimization, it also causes incorrect results for transfers larger than the GPUDirect RDMA limit. This change has been validated and approved by Akshay. References #3972 Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2018-09-20 07:58:10 -07:00
Howard Pritchard	dccf780546	Merge pull request #5737 from hppritcha/topic/remove_scif_support SCIF: remove it	2018-09-19 11:38:27 -06:00
Howard Pritchard	b9ac3d8931	SCIF: remove it KNC is effectively dead. Remove corresponding SCIF support in Open MPI. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-09-19 10:39:52 -06:00
bosilca	b0e2975f86	Merge pull request #5701 from ICLDisco/export/errors_ib Adding error handling in OpenIB BTL	2018-09-19 09:45:58 -04:00
bosilca	464e1abbab	Merge pull request #5700 from ICLDisco/export/tcp_errors Handle error cases in TCP BTL	2018-09-19 09:44:38 -04:00
Michael Kuron	17b0f1fcc3	Deal with EOPNOTSUPP returned from getsockopt() This can be returned when running on QEMU user-mode emulation, which does not support getsockopt with SO_RCVTIMEO. Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>	2018-09-16 14:55:28 +02:00
Jeff Squyres	8f2620d3af	misc: compiler warning fixes A variety of small compiler warning fixes. The 2 PMIx fixes are already committed upstream. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-15 06:04:13 -07:00
Nathan Hjelm	6363889cd0	Merge pull request #5696 from hjelmn/vader_5375 btl/vader: ensure that the send tag is always written last	2018-09-14 12:33:35 -06:00
Nathan Hjelm	1071d72130	Merge pull request #5445 from hjelmn/asm_type Update opal to use C11 atomics if available	2018-09-14 12:32:56 -06:00
Nathan Hjelm	000f9eed4d	opal: add types for atomic variables This commit updates the entire codebase to use specific opal types for all atomic variables. This is a change from the prior atomic support which required the use of the volatile keyword. This is the first step towards implementing support for C11 atomics as that interface requires the use of types declared with the _Atomic keyword. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-14 10:48:55 -06:00
Nathan Hjelm	850fbff441	btl/vader: ensure that the send tag is always written last To ensure fast box entries are complete when processed by the receiving process the tag must be written last. This includes a zero header for the next fast box entry (in some cases). This commit fixes two instances where the tag was written too early. In one case, on 32-bit systems it is possible for the tag part of the header to be written before the size. The second instance is an ordering issue. The zero header was being written after the fastbox header. Fixes #5375, #5638 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-14 10:35:35 -06:00
Selvin Xavier	a53a6f7650	Add support for different Broadcom HCAs Adds device ids of different Broadcom adapters from BCM57XXX and BCM58XXX family of HCAs. Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>	2018-09-12 02:22:34 -07:00
Jeff Squyres	a1b879d176	Merge pull request #5651 from jsquyres/pr/dont-let-openib-yell-about-no-nics btl/openib: don't complain about no NICs	2018-09-07 15:00:24 -04:00
Howard Pritchard	dc02e54320	Merge pull request #5516 from thananon/ofi_send btl/ofi: Added 2 sided communication support.	2018-09-06 18:39:23 -06:00
Jeff Squyres	098ec55e37	btl/openib: don't complain about no NICs Since openib is on its long, slow way out the door, don't let it complain about not being able to find any NICs at run time. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-09-06 11:26:58 -07:00
Nathan Hjelm	36c206d2d6	btl/uct: add missing opal_mem_hooks_unregister_release call This commit fixes a bug when using the UCT btl with the UCX memory hooks disabled. We were misssing a call to opal_mem_hooks_unregister_release to remove the btl memory hook callback. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-09-05 13:01:45 -06:00
Jeff Squyres	fe0852bcb4	Miscellaneous compiler warning stomps. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-08-24 07:39:14 -07:00
Nathan Hjelm	c74cf666a9	btl/vader: clean up debuging and squash warning References #5512 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-22 10:57:32 -06:00
Aurélien Bouteiller	e46c907468	Adding error handling in OpenIB BTL bugfix: major: openib send credits returned correctly after a fault for pending frags to dead processes; also tweak the default IB retry timeouts tomake this happen faster Make it compile in non-debug builds Mark the IB endpoint as failed when invoking an error; this resolves UDCM connection deadlocks Changing the default IB retry timeouts is not a good idea. We'll need to find another way to speedup credit recovery in failure cases. Remove ULFM specific cases Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2018-08-16 16:53:17 -04:00
Aurelien Bouteiller	6acebc40a1	Handle error cases in TCP BTL When an error is returned by the socket operations, trigger the appropriate error path in the PML to give an opportunity for rerouting/error handling. Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>	2018-08-14 15:35:24 -04:00
Nathan Hjelm	dca3516765	btl/vader: move memory barrier to where it belongs The write memory barrier was intended to precede setting a fast-box header but instead follows it. This commit moves the memory barrier to the intended location. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-08-13 10:14:34 -06:00
George Bosilca	6d11a45f44	Remove few warnings identified by @rhc in #5514 . Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2018-08-03 16:21:06 -04:00
Thananon Patinyasakdikul	080115d440	btl/ofi: Added 2 side communication support. The 2 sided communication support is added for non-tagmatching provider to take advantage of this BTL and PML OB1. The current state is "functional" and not optimized for performance. Two sided support is disabled by default and can be turned on by mca parameter: "mca_btl_ofi_mode". Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-08-03 12:30:03 -07:00
Nathan Hjelm	47ed8e8830	btl/uct: fix compile warnings/errors Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-23 14:04:38 -06:00
Howard Pritchard	4447738098	Merge pull request #5414 from hppritcha/topic/iwarp_only_by_default btl/openib: only look for iwarp/roce by default	2018-07-17 20:08:00 -06:00
Howard Pritchard	6818272392	btl/openib: only look for iwarp/roce by default Due to decreasing support by vendors/other orgs for the OpenIB BTL, only look for iWarp/RoCE devices by default. Allow IB HCAs with ports configured for ethernet. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2018-07-17 19:11:37 -06:00
Thananon Patinyasakdikul	033c364ee0	btl/ofi: Added FI_CONTEXT as requirement. OFI BTL uses context for completion but never ask for it in fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT to eliminate any potential error. Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>	2018-07-17 12:18:43 -07:00
Nathan Hjelm	9d3a79925b	btl/vader: fix bugs in rma emulation This commit fixes two bugs in the RMA/atomic emulation code: 1) Fix a fragment leak when using AMO emulation. 2) Always initialize the single-copy emulation code. This is required to use the AMO support. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-12 15:50:50 -06:00
Nathan Hjelm	bd5cd62df9	btl/ugni: fix up some warnings Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-03 16:30:44 -06:00
Nathan Hjelm	d8916a4672	btl/ugni: fix race condition in completing frags The descriptor flags field in a fragment were being ready after the fragment may have been freed. This commit reads the flags before calling the user callback. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-03 10:48:54 -06:00
Nathan Hjelm	87d41da62b	btl/vader: add support for atomics and emulated rdma This commit adds support for atomic operations as well as rdma for systems without rdma support. This support is implemented using an internal send tag. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2018-07-02 13:57:11 -06:00
Nathan T. Weeks	08f9ae97ee	btl/ugni: update BTL_VERBOSE argument list Signed-off-by: Nathan T. Weeks <weeks@iastate.edu>	2018-07-02 09:23:30 -06:00
Jeff Squyres	7b0dd03e92	tcp/btl: fix a cast The current cast is functional, but isn't really the way it should be done. This commit makes the cast the way it should be done. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2018-06-29 07:25:46 -07:00

1 2 3 4 5 ...

963 Коммитов