The math for checking the number of QPs and CQs per usNIC/VF was
incorrect, allowing you to run MPI processes even when usNICs (i.e.,
VIC VFs) had fewer QPs and CQs than were necessary. This led to a
confusing error later when fi_enable(3) failed (because we lazily
create QPs). Fixing the math here ensures that we actually print a
helpful error message telling the user specifically what is wrong.
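A hedged sketch of the kind of sufficiency check involved (the field
and variable names below are hypothetical, not the actual usnic BTL
code):

    /* Hypothetical sketch: each local MPI process needs some number of
     * QPs and CQs on this VF; fail early with a helpful show_help-style
     * message instead of letting fi_enable(3) fail later. */
    static bool usnic_vf_has_enough_qps_cqs(int qp_per_vf, int cq_per_vf,
                                            int num_local_procs,
                                            int channels_per_proc)
    {
        int needed = num_local_procs * channels_per_proc;
        return (qp_per_vf >= needed) && (cq_per_vf >= needed);
    }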
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Add primitive magic number and version checking in the connectivity
checker protocol. These checks don't *guarantee* that we won't get
false PINGs and ACKs, but they do significantly reduce the possibility
of interpreting random incoming fragments as PINGs or ACKs.
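As a rough illustration of the idea (the constants, struct layout, and
names below are invented for this sketch, not the real usnic protocol
definitions):

    #include <stdint.h>
    #include <stdbool.h>

    #define CHECKER_MAGIC   0x43484543u  /* hypothetical magic value */
    #define CHECKER_VERSION 1            /* hypothetical protocol version */

    struct checker_msg {
        uint32_t magic;
        uint8_t  version;
        uint8_t  type;                   /* PING or ACK */
        /* ...sender IP address, UDP port, payload... */
    };

    /* Drop anything without the expected magic/version.  This cannot
     * prove the fragment is ours, but it makes misinterpreting a random
     * incoming fragment as a PING or ACK far less likely. */
    static bool checker_msg_plausible(const struct checker_msg *msg)
    {
        return CHECKER_MAGIC == msg->magic &&
               CHECKER_VERSION == msg->version;
    }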
The btl_recv.h:lookup_sender() function uses the hashed ORTE proc name
to determine the sender of the packet. With add_procs_cutoff>0, the
usnic BTL may not have knowledge of all the senders.
Until the usNIC BTL can be adjusted to do something like the
openib/ugni BTLs (i.e., use opal_proc_for_name() to lookup unknown
sender proc names), set MCA_BTL_FLAGS_SINGLE_ADD_PROCS, which means
that ob1 will only call add_procs() once -- with all the procs in it.
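A minimal sketch of what that looks like in the module setup (the
surrounding variable is illustrative; only the flag name comes from
this change):

    /* Tell ob1 that this BTL must receive a single add_procs() call
     * containing all procs, since we cannot yet look up unknown sender
     * proc names on the fly. */
    module->super.btl_flags |= MCA_BTL_FLAGS_SINGLE_ADD_PROCS;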
Also in this commit, adapt the connectivity checker to not rely on
knowing all the senders (which is a bit easier than adapting the main
BTL send path): the receiving connectivity agent will simply echo
the same PING message (which contains the sender's IP address+UDP
port) back to the sender without checking that it knows who the sender
is. If the sender receives the echoed PING back on the expected
interface, it will find a match in the pending pings list. If the
sender receives the echoed PING back on an unexpected interface, a match
will not be found, and the incoming PING message will be dropped.
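A hedged sketch of that receive-side behavior (helper names such as
pending_ping_find() are placeholders, not the real agent code):

    #include <stddef.h>
    #include <sys/socket.h>

    /* Agent side: echo the PING back to wherever it came from, without
     * needing to know who the sender is. */
    static void agent_echo_ping(int sockfd, const void *ping, size_t len,
                                const struct sockaddr *from,
                                socklen_t fromlen)
    {
        (void) sendto(sockfd, ping, len, 0, from, fromlen);
    }

    /* Sender side: only an echo arriving on the expected interface will
     * match an entry in the pending-pings list; anything else finds no
     * match and is dropped. */
    static void sender_handle_echo(const void *echo, size_t len)
    {
        void *match = pending_ping_find(echo, len);  /* hypothetical helper */
        if (NULL != match) {
            pending_ping_mark_acked(match);          /* hypothetical helper */
        } /* else: drop */
    }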
Fixes open-mpi/ompi#1440
Three minor updates from the code review of
https://github.com/open-mpi/ompi-release/pull/933:
* Remove an extra blank line in a show_help message
* We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so
change the flag to REGINT_GE_ONE
* Change "num_blocks" definition to be in terms of block_len (not
eq_size)
A bunch of empirical testing has shown that increasing the retransmit
timeout from 1ms to 5ms doesn't adversely affect performance, yet
decreases the number of gratuitous retransmissions.
Add endpoints in blocks so that we don't overrun the
fi_av_insert() event queue. Also make the AV EQ length an MCA param,
and report it in mca_btl_base_verbose >=5 output.
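A rough sketch of the blocked insertion, assuming an AV bound to an
event queue with FI_EVENT; block_len plays the role of the new
btl_usnic_av_eq_num MCA parameter, and the variable names are
illustrative:

    #include <rdma/fi_domain.h>
    #include <rdma/fi_eq.h>

    static int insert_endpoints_blocked(struct fid_av *av, struct fid_eq *eq,
                                        void *addrs, size_t addrlen,
                                        size_t num_addrs, fi_addr_t *fi_addrs,
                                        size_t block_len)
    {
        size_t done = 0;
        while (done < num_addrs) {
            size_t this_block = num_addrs - done;
            if (this_block > block_len) {
                this_block = block_len;
            }

            /* Submit one block; with FI_EVENT, completion is reported on
             * the EQ bound to the AV. */
            int ret = fi_av_insert(av, (char *) addrs + done * addrlen,
                                   this_block, fi_addrs + done, 0, NULL);
            if (ret < 0) {
                return ret;
            }

            /* Wait for the FI_AV_COMPLETE event before submitting the
             * next block so we never overrun the AV event queue. */
            struct fi_eq_entry entry;
            uint32_t event;
            ssize_t n = fi_eq_sread(eq, &event, &entry, sizeof(entry), -1, 0);
            if (n < 0 || FI_AV_COMPLETE != event) {
                return -1;
            }

            done += this_block;
        }
        return 0;
    }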
Sequence numbers will wrap around; it is not sufficient to check for
(seq-1) -- must use the SEQ_DIFF macro to properly handle the
wraparound.
This bug wasn't serious; it just meant we might retransmit one or two
extra times when retransmits were triggered and the sequence numbers
wrapped around their sliding windows.
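For illustration, a wraparound-safe comparison along the lines of
SEQ_DIFF (the macro name comes from the BTL, but its exact definition
and the sequence type width here are assumptions):

    #include <stdint.h>

    typedef uint16_t usnic_seq_t;        /* assumed width for this sketch */

    /* Signed difference a - b that stays correct across wraparound:
     * e.g. SEQ_DIFF(0x0001, 0xFFFF) == 2 even though 0x0001 < 0xFFFF. */
    #define SEQ_DIFF(a, b) ((int16_t) ((usnic_seq_t) (a) - (usnic_seq_t) (b)))

    /* So "is seq exactly one past expected" must be written as
     *     SEQ_DIFF(seq, expected) == 1
     * rather than comparing against (expected - 1) or (expected + 1)
     * with plain integer arithmetic. */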
1. Fix: old v1.6-era code reset the stats-emitting event to fire twice
for each time period.
2. Add the usNIC device name to the output for differentiating the
output in multi-rail scenarios.
In the v1.10 opal_progress_thread emulation, ensure we create a
blocking libevent timer far in the future. Without that,
opal_event_loop() will return immediately (and therefore the progress
thread spins hard, stealing CPU cycles).
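A minimal sketch of the idea using plain libevent 2.x calls (the actual
emulation goes through the opal_event_* wrappers):

    #include <event2/event.h>

    static void dummy_timeout_cb(evutil_socket_t fd, short what, void *arg)
    {
        /* Intentionally empty: this event exists only so that the event
         * base always has something pending and event_base_loop()
         * blocks instead of returning immediately. */
        (void) fd; (void) what; (void) arg;
    }

    static struct event *add_far_future_timer(struct event_base *base)
    {
        /* Fire (essentially) never: a year in the future. */
        struct timeval tv = { .tv_sec = 60 * 60 * 24 * 365, .tv_usec = 0 };
        struct event *ev = event_new(base, -1, EV_PERSIST,
                                     dummy_timeout_cb, NULL);
        if (NULL != ev) {
            event_add(ev, &tv);
        }
        return ev;
    }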
We try to keep the source code the same between master and v1.10. So
put the #if's back for OPAL_HAVE_HWLOC (and just hard-code it to 1 on
master) so that this code is also compilable in v1.10.
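That is, something along these lines on master (a sketch of the intent,
not the exact header contents):

    /* On master, hwloc support is always present, so just hard-code the
     * macro that v1.10 gets from configure... */
    #define OPAL_HAVE_HWLOC 1

    /* ...which lets the shared v1.10-style guards compile unchanged. */
    #if OPAL_HAVE_HWLOC
    /* hwloc-dependent code */
    #endif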
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
This code really had no purpose; just assign FI_VERSION(1, 1). This
fixes CID 1315274.
Also clarify the comment about why we still retain libfabric v1.0.0
compatibility code, even though configure.m4 requires libfabric >= v1.1.0.
In short applications, it's possible that the agent (i.e., local rank
0) will finalize after non-local rank 0 procs detect the connectivity
checker named socket, but before they complete a connect() on it. As
such, their connect() gets ECONNREFUSED.
This commit adds a simple counter in the agent that won't let it quit
before it accept()'s from all local procs, or 10 seconds goes by
(whichever occurs first). This is similar to the timeout for the
clients: they'll exit if they don't see the expected named socket
within 10 seconds.
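A hedged sketch of the agent-side guard (the names and the polling
style are illustrative, not the real agent code):

    #include <time.h>
    #include <unistd.h>

    /* Wait until we have accept()ed a connection from every expected
     * local process, or until the timeout elapses, whichever comes
     * first.  This mirrors the clients' own 10-second timeout waiting
     * for the named socket to appear. */
    static void agent_wait_for_all_clients(volatile int *num_accepted,
                                           int expected, int timeout_sec)
    {
        time_t deadline = time(NULL) + timeout_sec;
        while (*num_accepted < expected && time(NULL) < deadline) {
            usleep(1000);  /* let the accept() callbacks keep running */
        }
    }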
There's no longer any need for the usnic BTL to have its own progress
thread: it can use the opal_progress_thread() infrastructure. This
commit removes the code to start up and shut down the usnic-BTL-specific
progress thread and instead just adds its events to the OPAL-wide
progress thread.
This necessitated a small change in the finalization step.
Previously, we would stop the progress thread and then tear down the
events. We can no longer stop the progress thread, and if we start
tearing down events, this will cause shutdown/hangups to be sent
across sockets, potentially firing some of the still-remaining events
while some (but not all) of the data structures have been torn down.
Chaos ensues.
Instead, queue up an event to tear down all the pending events. Since
the progress thread will only fire one event at a time, having a
teardown event means that it can tear down all the pending events
"atomically" and not have to worry that one of those events will get
fired in the middle of the teardown process.
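A rough sketch of the teardown-event idea in terms of raw libevent
calls (the real code uses the OPAL event/progress-thread wrappers and
its own bookkeeping of registered events):

    #include <event2/event.h>

    struct teardown_ctx {
        struct event **events;     /* all still-registered BTL events */
        int            num_events;
    };

    static void teardown_cb(evutil_socket_t fd, short what, void *arg)
    {
        struct teardown_ctx *ctx = (struct teardown_ctx *) arg;
        (void) fd; (void) what;

        /* The progress thread fires one callback at a time, so deleting
         * all remaining events from inside this single callback is
         * effectively atomic: none of them can fire mid-teardown. */
        for (int i = 0; i < ctx->num_events; ++i) {
            event_del(ctx->events[i]);
            event_free(ctx->events[i]);
        }
    }

    static void queue_teardown(struct event_base *base,
                               struct teardown_ctx *ctx)
    {
        /* One-shot, immediately-due event: run the teardown inside the
         * progress thread rather than from the finalizing thread. */
        struct timeval now = { 0, 0 };
        event_base_once(base, -1, EV_TIMEOUT, teardown_cb, ctx, &now);
    }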
Ensure that we check for non-NULL at all levels of pointers, which will
save us if there are exitable errors very early during component /
module initialization.
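For example, guarded cleanup along these lines (the struct and field
names are hypothetical):

    struct usnic_channel { void *cq; };                        /* hypothetical */
    struct usnic_module  { struct usnic_channel *channels; };  /* hypothetical */

    /* Only dereference inner pointers if every outer level was actually
     * allocated before the early error hit. */
    static void usnic_module_cleanup(struct usnic_module *module)
    {
        if (NULL != module &&
            NULL != module->channels &&
            NULL != module->channels[0].cq) {
            /* safe to tear down the CQ here */
        }
    }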