openmpi

Author	SHA1	Message	Date
Jeff Squyres	584b80147d	usnic: set MCA_BTL_FLAGS_SINGLE_ADD_PROCS The btl_recv.h:lookup_sender() function uses the hashed ORTE proc name to determine the sender of the packet. With add_procs_cutoff>0, the usnic BTL may not have knowledge of all the senders. Until the usNIC BTL can be adjusted to do something like the openib/ugni BTLs (i.e., use opal_proc_for_name() to lookup unknown sender proc names), set MCA_BTL_FLAGS_SINGLE_ADD_PROCS, which means that ob1 will only call add_procs() once -- with all the procs in it. Also in this commit, adapt the connectivity checker to not rely on knowing all the senders (which is a bit easier than adapting the main BTL send path): the receiving connectivity agent will simply echo the same PING message (which contains the sender's IP address+UDP port) back to the sender without checking that it knows who the sender is. If the sender receives the echoed PING back on the expected interface, it will find a match in the pending pings list. If the sender receives the echoed PING back on an unexpected interface, a match will not be found, and the incoming PING message will be dropped. Fixes open-mpi/ompi#1440	2016-03-08 17:41:42 -08:00
Jeff Squyres	4975fdcd5c	usnic: allow connect(2) to fail temporarily When connecting the connectivity checker client to its agent fails with ECONNREFUSED, just delay a little and try again a few more times.	2016-03-08 15:35:34 -08:00
Nathan Hjelm	2ef2763f72	btl/openib: fix inconsistency in the default settings This commit fixes an inconsistency between btl_openib_receive_queues, btl_openib_max_send_size and btl_openib_eager_limit. Before this commit, if the ini file specified a set of default receive queues that happened not to contain one large enough for the default max_send_size or eager_limit, users would see an error like: WARNING: The largest queue pair buffer size specified in the btl_openib_receive_queues MCA parameter is smaller than the maximum send size (i.e., the btl_openib_max_send_size MCA parameter), meaning that no queue is large enough to receive the largest possible incoming message fragment. The OpenFabrics (openib) BTL will therefore be deactivated for this run. Local host: somehost Largest buffer size: 65536 Maximum send fragment size: 131072 This commit adds code that detects the source of the max_send_size and eager_limit values and sets either or both of them to the size supported by the largest queue pair if both 1) the value is larger than the largest queue pair size, and 2) the value was not set by the user or an MCA configuration file. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-03-08 16:17:33 -07:00
Nathan Hjelm	d2f5fca82a	configure: add a summary section at the end of configure output This commit adds two m4 macros: OPAL_SUMMARY_ADD, OPAL_SUMMARY_PRINT. OPAL_SUMMARY_ADD adds an item to a section in the summary. For example OPAL_SUMMARY_ADD([[Transports]],[[Foo]],...,[yes]) will add the following to the summary: Transports ----------------------- Foo: yes With this commit two sections are added: Transports, Resource Managers. The OPAL_SUMMARY_PRINT macro is called after AC_OUTPUT and prints out some information about the build (version, projects, etc) and then the summary sections. It will additionally print a warning if internal debugging is enabled. Example output: Open MPI configuration: ----------------------- Version: 3.0.0 a1 Build Open Platform Abstration project: yes Build Open Runtime project: yes Build Open MPI project: yes Build Open SHMEM project: no MPI C++ bindings (deprecated): no MPI Fortran bindings: mpif.h, use mpi, use mpi_f08 Debug build: yes Transports ----------------------- Cray uGNI (Gemini/Aries): no Intel Omnipath (PSM2): no KNEM Shared Memory: no Linux CMA IPC: no Mellanox MXM: no Open UCX: no OpenFabrics libfabric: no OpenFabrics Verbs: no portals4: no QLogic Infinipath (PSM): no tcp: yes XPMEM Shared Memory: no Resource Managers ----------------------- Cray Alps: no Grid Engine: no LSF: no Slurm: yes Torque: yes INTERNAL DEBUGGING IS ENABLED. DO NOT USE THIS BUILD FOR PERFORMANCE MEASUREMENTS! Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2016-03-08 10:04:15 -07:00
Nathan Hjelm	2a0b3a5700	btl/vader: various threading fixes This commit fixes several threading bugs: - Add an additional lock to the btl_base_endpoint_t structure to lock the list of pending frags. This allows the progress function to attempt to send pending frags without needing to drop/reacquire the lock. This should provide a small improvement in performance and fixes a potential race between adding and removing items from the pending list. - Ensure fast boxes are only set up once by updating the send count using atomics when needed and do not set the fast box buffer pointer until the fast box is set up. Closes open-mpi/ompi#1408 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-03-02 10:50:59 -07:00
Gilles Gouaillardet	477991b5aa	btl/openib: fix abstraction violation and use opal_memory->memoryc_set_alignment	2016-02-24 09:50:13 +09:00
Nathan Hjelm	2031bb6f01	btl/openib: XRC save SRQ#s on the loopback endpoint This commit fixes a bug that can occur when communicating via XRC to peers on the same node. UDCM was not saving the SRQ numbers on the loopback endpoint (which shares its ib_addr info with all local peers) so any messages to local peers use an invalid SRQ number. Fixes open-mpi/ompi#1383 Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2016-02-18 20:59:11 -07:00
Nathan Hjelm	371df45bf8	btl/openib: fix locking bugs with XRC ib_addr lock This commit fixes two issues with the ib_addr lock: - The ib_addr lock must always be obtained regardless of opal_using_threads() as the CPC is run in a separate thread. - The ib_addr lock is held in mca_btl_openib_endpoint_connected when calling back into the CPC start_connect on any pending connections. This will attempt to obtain the ib_addr lock again. Since this is not a performance-critical part of the code the lock has been changed to be recursive. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-18 15:55:34 -07:00
Nathan Hjelm	4dc73d7765	btl/openib: XRC fix bug that could cause an invalid SRQ# to be used This commit fixes a bug that occurs when attempting a get or put operation on an endpoint that is not already connected. In this case the remote_srqn may be set to an invalid value as the rem_srqs array on the endpoint is not populated. This commit moves the usage of the rem_srqs array to the internal put/get functions where it is guaranteed this array is populated. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-18 15:44:29 -07:00
Nathan Hjelm	4f4ea96940	btl/openib/udcm: fix local XRC connections This commit ensures ib_addr->remote_xrc_rcv_qp_num value is set when creating the loopback queue pair. This is needed when communicating with any other local peer. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-17 14:54:19 -07:00
Nathan Hjelm	bf8360388f	btl/openib/udcm: fix XRC support This commit fixes two bugs in XRC support - When dynamic add_procs support was added to master the remote process name was added to the non-XRC request structure. The same value was not added to the XRC xconnect structure. This error was not caught because the send/recv code was incorrectly using the wrong structure member. This commit adds the member and ensures the xconnect code uses the correct structure. - XRC loopback QP support has been fixed. It was 1) not setting the correct fields on the endpoint structure, 2) calling udcm_xrc_recv_qp_connect, and 3) was not initializing the endpoint data. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-16 16:49:04 -07:00
Nathan Hjelm	201c280e6c	btl/openib: fix error in param check in mca_btl_openib_put mca_btl_openib_put incorrectly checks the qp inline max before allowing an inline put. This check will always fail for an endpoint that has not been connected. The commit changes the check to use the btl_put_local_registration_threshold instead. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-16 16:46:32 -07:00
Nathan Hjelm	123a39ac3c	btl/openib: fix regression in XRC support Commit open-mpi/ompi@400af6c52d introduced a regression in XRC support. The commit reversed the ordering of shared receive queue (SRQ) and completion queue (CQ) creation. CQ creation must always precede SRQ creation when using XRC as the CQs are needed to create the SRQs. This commit fixes the ordering so that CQs are always created before SRQs. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-16 16:46:20 -07:00
igor-ivanov	d9eefefa74	Merge pull request #1351 from igor-ivanov/pr/issue-1336 opal/memory: Move Memory Allocation Hooks usage from openib	2016-02-15 14:07:36 +04:00
Ralph Castain	aa9e5a1a27	Add support for Singularity containers, including a .m4 file for checking if Singularity is available and an orte/schizo component for setting the proper support if a container was given as the executable Cleanup the configury so we properly check for Singularity under the various typical use-cases Bring the Singularity support online. We have to turn "off" the sm BTL as it segfaults from inside the container - root cause remains unclear. Also turned "off" the various OPAL shmem components in case they are involved and someone else tries to use them. Happily, the vader BTL works just fine!	2016-02-13 04:40:22 -08:00
Igor Ivanov	8b05f308f9	opal/memory: Move Memory Allocation Hooks usage from openib These changes fix issue https://github.com/open-mpi/ompi/issues/1336 - improve abstractions: the opal/memory/linux component should be the single place that operates with Memory Allocation Hooks. - avoid collisions in case of dynamic component open/close: it is safe because it is linked statically. - does not change original behaviour.	2016-02-11 14:46:35 +02:00
Jeff Squyres	8d0a592563	usnic: update a few verbose reachability messages	2016-02-06 03:28:48 -08:00
Jeff Squyres	87dbe6ce01	usnic: add high-verbose reachability messages	2016-02-06 03:28:47 -08:00
Jeff Squyres	dac2fe1589	usnic: ensure to use ntohl() for network-order values	2016-02-06 03:28:47 -08:00
Jeff Squyres	51240394a7	usnic: ensure to init module->av_eq_num	2016-02-06 03:28:47 -08:00
Jeff Squyres	89eea51075	usnic: fix calculation for number of blocks	2016-02-02 16:56:34 -08:00
Nathan Hjelm	cd11fc3081	btl/ugni: fix race condition that causes completions to be dropped The send code in the ugni btl has an optimization that enables it to return 1 (fragment gone) in some cases. This optimization involved removing the btl ownership and callback flags to ensure the fragment stuck around long enough for its completion flag to be checked. This works fine for the single-threaded case but not in the multi-threaded case. It is possible that a fragment will be completed by another thread while a thread is in mca_btl_ugni_send. This competition can lead to a leaked fragment, missed callback, or both. To fix the issue without removing the optimization a reference count has been added to the fragment. Callbacks and fragment release will not be made until the fragment reference count has reached 0. The count is incremented before sending the frag and decremented after the completion flag has been checked. The fix has been verified to work using a multi-threaded RMA benchmark with the osc/pt2pt component. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-02 12:14:31 -07:00
Nathan Hjelm	14704201e2	btl/ugni: fix race condition when adding endpoint to wait list This commit fixes a race condition that can cause an endpoint to be added to the wait list multiple times. To fix the issue an additional check has been added to ensure the endpoint is not on the wait list after the wait list lock is held. The wait list processing code has also been updated to keep the wait list lock until all wait listed endpoints have been handled. This reduces the chance that an endpoint that is being processed by the wait list code is re-added to the list by a competing send. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-02 12:13:49 -07:00
Jeff Squyres	9f3ed00125	usnic: minor updates from code review Three minor updates from the code review of https://github.com/open-mpi/ompi-release/pull/933: * Remove an extra blank line in a show_help message * We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so change the flag to REGINT_GE_ONE * Change "num_blocks" definition to be in terms of block_len (not eq_size)	2016-02-01 11:14:30 -08:00
Jeff Squyres	c2615a4732	usnic: change retrans timeout to 5ms A bunch of empirical testing has shown that increasing the retransmit timeout from 1ms to 5ms doesn't adversely affect performance, yet decreases the number of gratuitous retransmissions.	2016-01-30 10:49:14 -08:00
Jeff Squyres	797d5026c8	usnic: better av_eq_num default value handling	2016-01-30 10:46:14 -08:00
Jeff Squyres	db825abc00	usnic: don't overrun the fi_av_insert() EQ Add endpoints in a blocked manner so that we don't overrun the fi_av_insert() event queue. Also make the AV EQ length an MCA param, and report it in mca_btl_base_verbose >=5 output.	2016-01-30 08:33:48 -08:00
Jeff Squyres	d624e0d60f	usnic: fix wraparound sequence number issue Sequence numbers will wrap around; it is not sufficient to check for (seq-1) -- must use the SEQ_DIFF macro to properly handle the wraparound. This bug wasn't serious; it just meant we might retransmit one or two extra times when retransmits were triggered and the sequence numbers wrapped around their sliding windows.	2016-01-30 08:32:13 -08:00
Jeff Squyres	4de4a263f5	usnic: ensure all messages are sent on the data channel Messages should go on the data channel, even if they're short. Only ACKs go on the priority channel.	2016-01-30 08:31:21 -08:00
Jeff Squyres	348ac507c2	usnic: explain why we still have OPAL_HAVE_HWLOC Put in a comment explaining why btl_usnic_compat.h still defines OPAL_HAVE_HWLOC, even though master/v2.x no longer does.	2016-01-16 04:11:05 -08:00
Jeff Squyres	0f5fcf9029	usnic: fix common symbol	2016-01-16 03:55:27 -08:00
Jeff Squyres	60ffe713b8	common syms: whitelist bison-generated common symbols Bison generates some common symbols that we can't do anything about, so whitelist them.	2016-01-16 03:53:14 -08:00
Artem Polyakov	84e4fb308b	Fix race condition in UDCM where service thread sees that `cm_message_event_active == 1` but main thread has already stopped processing messages and thus we will have the situation where one message was left unhandled leading to a hang.	2016-01-08 23:56:21 +06:00
Jeff Squyres	6d073a8da4	btl_sm: add a comment explaining why we rename(2) Per open-mpi/ompi#1230, add a comment explaining why we write to a temporary file and then rename(2) the file, just so that future code maintainers don't wonder why we do this seemingly-useless step.	2016-01-04 14:51:52 -05:00
Artem Polyakov	2abb2972ac	Fix Mellanox copyrights with respect to the following PRs: * https://github.com/open-mpi/ompi/pull/1184 * https://github.com/open-mpi/ompi/pull/1188 * https://github.com/open-mpi/ompi/pull/1197 * https://github.com/open-mpi/ompi/pull/1202 * https://github.com/open-mpi/ompi/pull/1210 * https://github.com/open-mpi/ompi/pull/1216 * https://github.com/open-mpi/ompi/pull/1236 * https://github.com/open-mpi/ompi/pull/1237 * https://github.com/open-mpi/ompi/pull/1248 * https://github.com/open-mpi/ompi/pull/1260 * https://github.com/open-mpi/ompi/pull/1264	2015-12-30 00:12:19 +06:00
Gilles Gouaillardet	fec973efda	configury: test portability replace test ... -o ... with test ... || test ... and test ... -a ... with test ... && test ...	2015-12-28 13:58:45 +09:00
Nathan Hjelm	700a21022a	Merge pull request #1260 from artpol84/openib_proc_account_fix Openib proc accounting fix	2015-12-27 15:19:52 -07:00
Artem Polyakov	a20826e6b4	Fix vader resource leak. This nasty bug was nicely masked. It was causing `mca_btl_vader_component.vader_frags_user` overflow and, as a result, rare hangs of ompi-test-suite.	2015-12-28 00:41:45 +06:00
Gilles Gouaillardet	2d9aa38e6a	btl/openib: fix heterogeneous support	2015-12-25 16:31:35 +09:00
Artem Polyakov	3031affdb7	Fix openib process accounting if procs were dynamically added.	2015-12-24 17:56:35 +06:00
Artem Polyakov	400af6c52d	openib addproc improvements: 1. finer grained locks; 2. separate srq creation from cq adjustments.	2015-12-24 17:56:35 +06:00
Artem Polyakov	41c325f15a	Shift common code for calculating a port count and btl_rank in openib into the static function	2015-12-24 17:56:35 +06:00
Gilles Gouaillardet	5fa63f086a	btl/tcp: add missing #include <unistd.h> Thanks Marco Atzeri for contributing the original patch	2015-12-24 14:41:46 +09:00
Gilles Gouaillardet	15ed7ad9f5	btl/sm: add missing #include <unistd.h> Thanks Marco Atzeri for contributing the original patch	2015-12-24 14:41:41 +09:00
Gilles Gouaillardet	42313acd58	btl/usnic: add missing #include <alloca.h>	2015-12-24 14:33:58 +09:00
Nathan Hjelm	84d890b7e7	Merge pull request #1248 from artpol84/openib_proc_init_race Openib dynamic add proc race conditions	2015-12-22 21:48:05 -07:00
Artem Polyakov	08ad8357a8	Fix local process accounting in openib when dynamic add_proc is on.	2015-12-22 22:44:46 +06:00
Artem Polyakov	3c2f6d5560	Protect openib_btl->device data with explicit opal_mutex locks.	2015-12-22 18:33:26 +06:00
Gilles Gouaillardet	607d7c7545	btl/sm: rename file after file descriptor has been closed. Thanks George for spotting this.	2015-12-22 13:56:53 +09:00
Artem Polyakov	e06bffe213	Fix ib_proc locking	2015-12-21 18:52:31 +06:00


680 commits
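
Commit d624e0d60f in the log above notes that checking for (seq-1) is not sufficient once sequence numbers wrap around. A minimal sketch of a wraparound-safe comparison follows, assuming 16-bit sequence numbers. The names seq_t, seq_diff, is_next_naive, and is_next_wrapsafe are hypothetical, chosen for illustration only; this is not the usnic BTL's actual SEQ_DIFF macro or its sequence number type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 16-bit sequence number type (an assumption for illustration). */
typedef uint16_t seq_t;

/* Signed distance from b to a, modulo 2^16. The difference is first reduced
 * modulo 2^16 and then reinterpreted as a signed value; this relies on the
 * usual two's-complement conversion that mainstream compilers provide. */
static inline int16_t seq_diff(seq_t a, seq_t b)
{
    return (int16_t)(seq_t)(a - b);
}

/* Naive check: integer promotion makes prev + 1 equal 65536 when prev is
 * 65535, so the comparison fails across the wrap. */
static inline bool is_next_naive(seq_t prev, seq_t cur)
{
    return cur == prev + 1;
}

/* Wraparound-safe check: cur directly follows prev if the modular
 * distance between them is exactly 1. */
static inline bool is_next_wrapsafe(seq_t prev, seq_t cur)
{
    return seq_diff(cur, prev) == 1;
}
```

With these definitions, the pair (65535, 0) is rejected by the naive check but accepted by the wraparound-safe one, which is the kind of mismatch that could lead to the extra retransmissions the commit message describes.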