These changes fix issue https://github.com/open-mpi/ompi/issues/1336
- improve abstractions: the opal/memory/linux component should be the single place that operates with
  Memory Allocation Hooks.
- avoid collisions in case of dynamic component open/close: this is safe because the component is linked statically.
- does not change the original behaviour.
The send code in the ugni btl has an optimization that enables it to
return 1 (fragment gone) in some cases. This optimization involved
removing the btl ownership and callback flags to ensure the fragment
stuck around long enough for its completion flag to be checked. This
works fine for the single-threaded case but not in the multi-threaded
case. It is possible that a fragment will be completed by another
thread while a thread is in mca_btl_ugni_send. This competition can
lead to a leaked fragment, missed callback, or both. To fix the issue
without removing the optimization a reference count has been added to
the fragment. Callbacks and fragment release will not be made until
the fragment reference count has reached 0. The count is incremented
before sending the frag and decremented after the completion flag has
been checked. The fix has been verified to work using a multi-threaded
RMA benchmark with the osc/pt2pt component.
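
A minimal sketch of the reference-counting pattern described above, with hypothetical fragment/field/helper names (not the actual mca_btl_ugni_* structures):

```c
/* Hedged sketch: hypothetical names, not the real ugni btl code. */
#include <stdatomic.h>
#include <stdbool.h>

struct frag {
    atomic_int ref_count;      /* keeps the frag alive across send/completion */
    volatile bool complete;    /* set by the completion handler */
    void (*callback)(struct frag *);
};

static void frag_deref(struct frag *f)
{
    /* callback and release happen only once the count reaches 0 */
    if (atomic_fetch_sub(&f->ref_count, 1) == 1) {
        if (f->callback) {
            f->callback(f);
        }
        /* return the frag to the free list here */
    }
}

static int send_frag(struct frag *f)
{
    atomic_fetch_add(&f->ref_count, 1);    /* incremented before the send */
    /* ... post the send; a completion thread may run at any time ... */
    bool done = f->complete;               /* check the completion flag */
    frag_deref(f);                         /* decremented after the check */
    return done ? 1 : 0;                   /* 1 == fragment gone */
}
```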
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a race condition that can cause an endpoint to be
added to the wait list multiple times. To fix the issue an additional
check has been added to ensure the endpoint is not on the wait list
after the wait list lock is held. The wait list processing code has
also been updated to keep the wait list lock until all wait listed
endpoints have been handled. This reduces the chance that an endpoint
being processed by the wait list code is re-added to the list by a
competing send.
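
A minimal sketch of the check-after-locking pattern described here, using hypothetical endpoint and wait-list names (not the actual btl internals):

```c
/* Hedged sketch: membership is re-checked only after the wait list lock
 * is held, and the lock is kept across the whole processing pass. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct endpoint {
    bool on_wait_list;
    struct endpoint *next;
};

static pthread_mutex_t wait_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct endpoint *wait_list_head = NULL;

static void wait_list_add(struct endpoint *ep)
{
    pthread_mutex_lock(&wait_list_lock);
    /* re-check under the lock so a competing thread cannot add ep twice */
    if (!ep->on_wait_list) {
        ep->on_wait_list = true;
        ep->next = wait_list_head;
        wait_list_head = ep;
    }
    pthread_mutex_unlock(&wait_list_lock);
}

static void wait_list_process(void (*progress)(struct endpoint *))
{
    /* hold the lock until every wait-listed endpoint has been handled so
     * competing sends cannot re-add an endpoint mid-pass */
    pthread_mutex_lock(&wait_list_lock);
    while (wait_list_head != NULL) {
        struct endpoint *ep = wait_list_head;
        wait_list_head = ep->next;
        ep->on_wait_list = false;
        progress(ep);
    }
    pthread_mutex_unlock(&wait_list_lock);
}
```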
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Three minor updates from the code review of
https://github.com/open-mpi/ompi-release/pull/933:
* Remove an extra blank line in a show_help message
* We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so
change the flag to REGINT_GE_ONE
* Change "num_blocks" definition to be in terms of block_len (not
eq_size)
A bunch of empirical testing has shown that increasing the retransmit
timeout from 1ms to 5ms doesn't adversely affect performance, yet
decreases the number of gratuitous retransmissions.
Add endpoints in blocks so that we don't overrun the
fi_av_insert() event queue. Also make the AV EQ length an MCA param,
and report it in mca_btl_base_verbose >=5 output.
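
A minimal sketch of the blocked insertion described above, assuming a hypothetical wait_for_av_completions() helper and a caller-supplied block_len; the real EQ and error handling is more involved:

```c
/* Hedged sketch: insert addresses into the AV in fixed-size blocks so the
 * AV event queue never has to hold more than block_len pending insertions. */
#include <stddef.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* hypothetical helper: drain the AV event queue for 'count' insertions */
extern void wait_for_av_completions(struct fid_av *av, size_t count);

static int insert_addrs_blocked(struct fid_av *av, const void *addrs,
                                size_t addr_len, size_t num_addrs,
                                fi_addr_t *fi_addrs, size_t block_len)
{
    for (size_t done = 0; done < num_addrs; done += block_len) {
        size_t count = num_addrs - done;
        if (count > block_len) {
            count = block_len;
        }
        int ret = fi_av_insert(av, (const char *) addrs + done * addr_len,
                               count, fi_addrs + done, 0, NULL);
        if (ret < 0) {
            return ret;
        }
        /* block until this batch has completed before posting the next one */
        wait_for_av_completions(av, count);
    }
    return 0;
}
```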
Sequence numbers will wrap around; it is not sufficient to check for
(seq-1) -- must use the SEQ_DIFF macro to properly handle the
wraparound.
This bug wasn't serious; it just meant we might retransmit one or two
extra times when retransmits were triggered and the sequence numbers
wrapped around their sliding windows.
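
A minimal sketch of a wraparound-safe sequence comparison, assuming 16-bit sequence numbers; the actual SEQ_DIFF macro in the usnic BTL may differ:

```c
/* Hedged sketch: signed difference of two wrapping 16-bit sequence numbers.
 * Casting the unsigned difference to a signed type of the same width gives
 * the correct result even after the counters have wrapped around. */
#include <stdint.h>
#include <stdio.h>

#define SEQ_DIFF16(a, b) ((int16_t)((uint16_t)(a) - (uint16_t)(b)))

int main(void)
{
    uint16_t newer = 3, older = 65533;          /* newer wrapped past 65535 */
    printf("%d\n", SEQ_DIFF16(newer, older));   /* prints 6, not -65530 */
    return 0;
}
```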
`cm_message_event_active == 1` but the main thread has already stopped
processing messages; in this situation one message is left unhandled,
leading to a hang.
Per open-mpi/ompi#1230, add a comment explaining why we write to a
temporary file and then rename(2) the file, just so that future code
maintainers don't wonder why we do this seemingly-useless step.
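
A minimal sketch of the write-to-temp-then-rename pattern referenced above (hypothetical function and file names; the actual code and its error handling live elsewhere):

```c
/* Hedged sketch: write the new contents to a temporary file in the same
 * directory, then rename() it over the destination.  rename(2) is atomic,
 * so readers never observe a partially written file. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int write_file_atomically(const char *path, const char *contents)
{
    char tmp_path[4096];
    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp.XXXXXX", path);

    int fd = mkstemp(tmp_path);
    if (fd < 0) {
        return -1;
    }

    size_t len = strlen(contents);
    if (write(fd, contents, len) != (ssize_t) len || close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* the atomic step: either the old file or the complete new file exists */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```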
The problem was in mca_btl_openib_proc_create. This function may be called
from several places simultaneously:
* from the main thread when somebody wants to do `MPI_Send()` (for example) for
the first time;
* from udcm if the counterpart peer is trying to connect and `mca_btl_openib_get_ep()`
is called.
In this case one of the threads may add an uninitialized proc structure
to `mca_btl_openib_component.ib_procs` while the other reads it and
treats it as initialized.
This commit turns ib_proc initialization into a single atomic operation.
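
A minimal sketch of the lookup-or-create pattern this implies, with hypothetical proc and list names (not the actual openib data structures):

```c
/* Hedged sketch: the proc is fully constructed before it becomes visible,
 * and lookup and insertion happen under one lock so two threads cannot
 * race and publish a half-built entry. */
#include <pthread.h>
#include <stdlib.h>

struct ib_proc {
    int peer_id;
    /* ... endpoint data filled in during initialization ... */
    struct ib_proc *next;
};

static pthread_mutex_t ib_procs_lock = PTHREAD_MUTEX_INITIALIZER;
static struct ib_proc *ib_procs = NULL;

static struct ib_proc *proc_create(int peer_id)
{
    pthread_mutex_lock(&ib_procs_lock);

    /* look for an existing, fully constructed entry first */
    for (struct ib_proc *q = ib_procs; q != NULL; q = q->next) {
        if (q->peer_id == peer_id) {
            pthread_mutex_unlock(&ib_procs_lock);
            return q;
        }
    }

    /* construct the new proc completely before publishing it */
    struct ib_proc *p = calloc(1, sizeof(*p));
    if (p != NULL) {
        p->peer_id = peer_id;
        /* ... initialize endpoint data here ... */
        p->next = ib_procs;
        ib_procs = p;    /* publish only after initialization */
    }

    pthread_mutex_unlock(&ib_procs_lock);
    return p;
}
```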