openmpi

Author	SHA1	Message	Date
Jeff Squyres	9f3ed00125	usnic: minor updates from code review Three minor updates from the code review of https://github.com/open-mpi/ompi-release/pull/933: * Remove an extra blank line in a show_help message * We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so change the flag to REGINT_GE_ONE * Change "num_blocks" definition to be in terms of block_len (not eq_size)	2016-02-01 11:14:30 -08:00
Jeff Squyres	c2615a4732	usnic: change retrans timeout to 5ms A bunch of empirical testing has shown that increasing the retransmit timeout from 1ms to 5ms doesn't adversely affect performance, yet decreases the number of gratuitous retransmissions.	2016-01-30 10:49:14 -08:00
Jeff Squyres	797d5026c8	usnic: better av_eq_num default value handling	2016-01-30 10:46:14 -08:00
Jeff Squyres	db825abc00	usnic: don't overrun the fi_av_insert() EQ Add endpoints in a blocked manner so that we don't overrun the fi_av_insert() event queue. Also make the AV EQ length an MCA param, and report it in mca_btl_base_verbose >=5 output.	2016-01-30 08:33:48 -08:00
Jeff Squyres	d624e0d60f	usnic: fix wraparound sequence number issue Sequence numbers will wrap around; it is not sufficient to check for (seq-1) -- must use the SEQ_DIFF macro to properly handle the wraparound. This bug wasn't serious; it just meant we might retransmit one or two extra times when retransmits were triggered and the sequence numbers wrapped around their sliding windows.	2016-01-30 08:32:13 -08:00
Jeff Squyres	4de4a263f5	usnic: ensure all messages are sent on the data channel Messages should go on the data channel, even if they're short. Only ACKs go on the priority channel.	2016-01-30 08:31:21 -08:00
Jeff Squyres	348ac507c2	usnic: explain why we still have OPAL_HAVE_HWLOC Put in a comment explaining why btl_usnic_compat.h still defines OPAL_HAVE_HWLOC, even though master/v2.x no longer does.	2016-01-16 04:11:05 -08:00
Jeff Squyres	0f5fcf9029	usnic: fix common symbol	2016-01-16 03:55:27 -08:00
Gilles Gouaillardet	42313acd58	btl/usnic: add missing #include <alloca.h>	2015-12-24 14:33:58 +09:00
Jeff Squyres	2b9341a38a	usnic: fix embarrassing typo	2015-12-15 19:01:19 -08:00
Jeff Squyres	944d5061a6	usnic: sendto() can return EPERM if we send too fast If we send too fast, sendto() can run out of resources and return EPERM. So delay a little and try again.	2015-12-15 15:31:29 -08:00
Jeff Squyres	ab1bbca5b9	usnic: improve error message When sendto() fails, it would be helpful to see the errno value.	2015-12-15 15:04:25 -08:00
Jeff Squyres	c1a6beac8d	usnic: fix error message There were too many "%s" instances. Re-order the output so that we show file, line, and then the error message.	2015-12-15 14:48:38 -08:00
Jeff Squyres	300cff2b89	usnic: fix/update the usnic stats 1. Fix: old v1.6-era code reset the stats-emitting event to fire twice for each time period. 2. Add the usNIC device name to the output for differentiating the output in multi-rail scenarios.	2015-11-06 12:05:34 -08:00
Jeff Squyres	7bf364ff04	usnic: create blocking libevent for progress thread emulation In the v1.10 opal_progress_thread emulation, ensure to create a blocking libevent timer far in the future. Without that, opal_event_loop() will return immediately (and therefore the progress thread spins hard, stealing CPU cycles).	2015-09-18 11:53:03 -07:00
Jeff Squyres	7cb6c2fcc2	usnic: put the HWLOC #if's back to preserve compat with v1.10 We try to keep the source code the same between master and v1.10. So put the #if's back for OPAL_HAVE_HWLOC (and just hard-code it to 1 on master) so that this code is also compilable in v1.10.	2015-09-18 11:53:03 -07:00
Jeff Squyres	f7d90abf42	usnic: update for new add_procs() downcall behavior	2015-09-10 08:55:55 -06:00
Jeff Squyres	f782a7640e	usnic: minor re-order of Makefile.am sources Put the hwloc.c file alphabetically in the list.	2015-09-05 05:02:00 -07:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
Jeff Squyres	8558458bb9	usnic: adjust for new PMIX argument type	2015-08-31 14:55:58 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Jeff Squyres	14340770c4	usnic: remove some logically dead code This code really had no purpose; just assign FI_VERSION(1, 1). This fixes CID 1315274. Also clarify the comment about why we still retain libfabric v1.0.0 compatibility code, even though configure.m4 requires libfabric >= v1.1.0.	2015-08-12 05:21:18 -07:00
Jeff Squyres	236cf7ff62	usnic: add v1.8/v1.10 compat code Add compat code so that I can sync master against the v1.10 branch.	2015-08-10 16:27:38 -07:00
Jeff Squyres	7da1c4b875	usnic: avoid race condition in connectivity checker In short applications, it's possible that the agent (i.e., local rank 0) will finalize after non-local rank 0 procs detect the connectivity checker named socket, but before they complete a connect() on it. As such, their connect() gets ECONNREFUSED. This commit adds a simple counter in the agent that won't let it quit before it accept()'s from all local procs, or 10 seconds goes by (whichever occurs first). This is similar to the timeout for the clients: they'll exit if they don't see the expected named socket within 10 seconds.	2015-08-10 15:40:33 -07:00
Jeff Squyres	bad508687e	usnic: move cchecker to OPAL-wide progress thread There's no longer any need for the usnic BTL to have its own progress thread: it can use the opal_progress_thread() infrastructure. This commit removes the code to startup/shutdown the usnic-BTL-specific progress thread and instead, just adds its events to the OPAL-wide progress thread. This necessitated a small change in the finalization step. Previously, we would stop the progress thread and then tear down the events. We can no longer stop the progress thread, and if we start tearing down events, this will cause shutdown/hangups to be sent across sockets, potentially firing some of the still-remaining events while some (but not all) of the data structures have been torn down. Chaos ensues. Instead, queue up an event to tear down all the pending events. Since the progress thread will only fire one event at a time, having a teardown event means that it can tear down all the pending events "atomically" and not have to worry that one of those events will get fired in the middle of the teardown process.	2015-08-10 15:40:33 -07:00
Jeff Squyres	b5c37dbfe2	CSCuv67889: usnic: fix an error corner case Ensure that we have non-NULL on all levels of pointers, which will save us if there are exitable errors very early during component / module initialization.	2015-08-06 10:54:28 -07:00
Jeff Squyres	cbcd16b399	usnic: remove a stale shell variable name	2015-07-31 18:53:54 -07:00
Jeff Squyres	0ee8295e6e	usnic: ensure that we have libfabric >= v1.1	2015-07-31 18:53:54 -07:00
Jeff Squyres	2e7f794aae	usnic: convert to use fi_recvmsg / FI_MORE Minor optimization to post 16 receive buffers at a time (vs. 1).	2015-07-31 12:45:40 -07:00
Jeff Squyres	ec3a38384f	Merge pull request #688 from jsquyres/pr/usnic-libfabric-msg-prefix-fix usnic fixes for differences between libfabric v1.0.0 and v1.1.0	2015-07-21 10:18:36 -04:00
Ralph Castain	683efcb850	Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.	2015-07-11 10:08:19 -07:00
Ralph Castain	61fb067f14	Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base	2015-07-11 06:42:23 -07:00
Jeff Squyres	633da6641e	usnic: gracefully handle when we can't alloc an ACK The comment didn't match the debugging code (which was ugly, and apparently never happens, anyway). Just return and let the sender retransmit.	2015-07-10 14:19:33 -07:00
Jeff Squyres	3327fa56b5	usnic: minor code cleanups	2015-07-10 10:10:43 -07:00
Jeff Squyres	f9c65a701e	usnic: "sin" assignment needs to be outside the #if The "sin" variable is used below; need to ensure that it is assigned for all builds (not just debug builds).	2015-07-10 06:51:03 -07:00
Jeff Squyres	cd87c8ad41	usnic: misc compiler warnings fixes	2015-07-10 06:51:03 -07:00
Jeff Squyres	ba429dc890	usnic: temporarily disable the BTL put method The usnic BTL put method is currently broken. Disable it until we can fix it properly.	2015-07-10 06:51:03 -07:00
Jeff Squyres	f265358fbe	usnic: handle FI_MSG_PREFIX differences libfabric v1.0.0->v1.1.0 In libfabric v1.0.0 (i.e., API v1.0), the usnic provider handled FI_MSG_PREFIX inconsistently between sends and receives. This has been fixed in libfabric v1.1.0 (i.e., API v1.1): FI_MSG_PREFIX is handled consistently for both sends and receives. Run-time detect which libfabric we are running with and adapt behavior appropriately.	2015-07-10 06:51:03 -07:00
Jeff Squyres	ddd0de6cfc	usnic: make more OS-bypass memory Valgrind-defined This helps reduce false positives when running MPI apps through Valgrind.	2015-07-10 06:51:03 -07:00
Jeff Squyres	9bc7a54e0c	usnic: correctly count CRC errors Handle the differences between libfabric v1.0.0 and v1.1.0 in the return value of fi_cq_readerr(). Also consolidate CRC and truncation errors into the same handling block, since truncation errors are typically another symptom of CRC errors. This ensures that buffers get reposted properly.	2015-07-10 06:51:03 -07:00
Jeff Squyres	fc686f5538	usnic: make configure complain if libfabric cannot be found Instead of silently determining that the usnic BTL can't be built, announce that usnic is checking for libfabric support, and then AC_MSG_RESULT the result of that check.	2015-07-10 06:45:33 -07:00
Jeff Squyres	a172bd161e	usnic: switch to use the new libfabric common library The usnic BTL configure.m4 no longer needs to OPAL_CHECK_LIBFABRIC; it just uses the results from opal/mca/common/libfabric's configure.m4. We also now don't need to link against libfabric -- they just link against the opal_common_libfabric library.	2015-06-25 13:33:15 -07:00
Nathan Hjelm	4d92c9989e	more c99 updates This commit does two things. It removes checks for C99 required headers (stdlib.h, string.h, signal.h, etc). Additionally it removes definitions for required C99 types (intptr_t, int64_t, int32_t, etc). Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-25 10:14:13 -06:00
Jeff Squyres	dfa36197ea	usnic/Makefile.am: ensure static builds include -lfabric	2015-06-17 08:15:29 -07:00
Jeff Squyres	44e7646de9	usnic/configure.m4: convert to use external libfabric Use the new OPAL_CHECK_LIBFABRIC macro.	2015-06-15 15:17:06 -07:00
Jeff Squyres	ec57aa1805	usnic: also update btl_usnic_compat.c for versioning	2015-05-26 18:56:29 -07:00
Jeff Squyres	9c19dd4e5b	usnic: make v1.10 and v2.0 fit in version checking scheme v1.10 is now in the same compatibility level as v1.7/v1.8 (there is/will be no v1.9 series). v2.0 now takes over for what used to be called v1.9.	2015-05-26 18:42:33 -07:00
Jeff Squyres	95a2b14543	usnic: avoid some compiler warnings Followup to open-mpi/ompi@65b66ab: if we're not debugging, then #if out an entire block so that the compiler doesn't warn about variables that are assigned and not used.	2015-05-26 14:27:26 -07:00
Dave Goodell	65b66ab4ae	usnic: use fi_getname in newer libfabric When using an external libfabric (or really any libfabric newer than libfabric commit 607e863), we must use fi_getname to determine the local port of our endpoint. Without this fix, OMPI will hang endlessly while retransmitting packets to port 0 on the remote host.	2015-05-21 08:51:03 -07:00
Jeff Squyres	df5043bc3f	usnic: adjust for libfabric API return value change	2015-04-27 10:18:49 -07:00
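
The wraparound fix in commit d624e0d60f rests on a general technique: compare sequence numbers by their modular difference instead of testing for (seq-1) directly. The following is a minimal sketch of that idea, not the BTL's actual SEQ_DIFF macro. The 16-bit width and the names seq_t, seq_diff, seq_is_next, and seq_self_check are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sequence number type; the 16-bit width is an assumption. */
typedef uint16_t seq_t;

/*
 * Signed distance from b to a, computed modulo 2^16.
 * The result is meaningful only while the true distance between the two
 * numbers is less than half the sequence space (32768 here).
 */
static int32_t seq_diff(seq_t a, seq_t b)
{
    uint16_t d = (uint16_t)(a - b);   /* wraps modulo 2^16 */
    return (d < 0x8000u) ? (int32_t)d : (int32_t)d - INT32_C(0x10000);
}

/* True if candidate immediately follows prev, even across the wrap. */
static bool seq_is_next(seq_t candidate, seq_t prev)
{
    return seq_diff(candidate, prev) == 1;
}

/* Illustrates one way a plain (seq-1) check can miss the wraparound case. */
static void seq_self_check(void)
{
    seq_t prev = 65535;
    seq_t next = 0;                   /* 65535 + 1 wraps to 0 */

    /* Integer promotion makes (next - 1) equal -1, which is not 65535. */
    assert((next - 1) != prev);

    /* The modular difference still sees next as the successor of prev. */
    assert(seq_is_next(next, prev));
}
```

Calling seq_self_check() from a test harness exercises both cases. The half-the-space restriction is why schemes like this are usually paired with a sliding window that is much smaller than the sequence space.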

193 Commits