openmpi

Author	SHA1	Message	Date
Jeff Squyres	cb7cc171f9	usnic: update README.txt notes Update notes about copying the usnic BTL between master and the v1.8 branch.	2015-02-03 15:54:36 -08:00
Jeff Squyres	edf7232e00	usnic: enable building with an external libfabric	2015-02-03 13:46:06 -08:00
Jeff Squyres	bfa54d5d7b	usnic: update to match new libfabric	2015-02-03 13:46:06 -08:00
Jeff Squyres	436223959d	usnic: update to match new libfabric APIs	2015-01-24 05:49:36 -08:00
Jeff Squyres	65a279019e	usnic: fix typo in memchecker usage	2015-01-16 09:42:19 -08:00
Jeff Squyres	d13c14ec82	CSCus22527: fix off-by-one error in checking the number of VFs Ensure to count this process when checking for how many VFs we need on the local server. (cherry picked from commit 386c01934e98cb8dcb48ff648ecdfb0c8677baa9)	2015-01-15 11:44:29 -08:00
Jeff Squyres	e4e5e7dbc0	usnic: ensure to clean up nicely in case of low resources If there are not enough resources (e.g., low VFs), we can end up calling finalize_one_channel() on the same channel multiple times. So ensure to NULL out fields that we have freed already so that we do not try to free them a second time. Fixes CSCus26648.	2015-01-13 14:37:31 -08:00
Jeff Squyres	d00cede718	usnic: fix if_include/exclude of CIDR-specified networks Fix the ordering so that we obtain the usnic netmask information before we do the filtering based on CIDR-specified networks. Also requires upstream Github libfabric commit 3976745. Fixes CSCus22495.	2015-01-13 12:04:51 -08:00
Jeff Squyres	a220b92cf8	usnic: fix function name in opal_output	2015-01-13 12:04:07 -08:00
Jeff Squyres	5ed688a074	usnic: ensure that we only get "usnic"-named providers Also, a minor update to a verbose message.	2015-01-12 13:21:22 -08:00
Jeff Squyres	881b1dcf19	usnic: document libfabric abstractions Handy tips to remember the libfabric abstractions and what they correspond to in usnic/VIC terms.	2015-01-09 15:21:51 -08:00
Gilles Gouaillardet	194d9f84d3	btl/usnic: move call to check_reg_mem_basics() avoid annoying memlock related messages when there is no usnic device.	2015-01-09 11:37:45 +09:00
Dave Goodell	49069bc661	usnic: fix fi_av_insert (ARP resolution) bugs We had several problems in the old code: 1. We were specifying an arbitrary timeout (100 ms) and then abandoning all remaining pending AV insert operations. We would then free the endpoint buffer that we gave to fi_av_insert(), usually causing libfabric's progress thread to write to a freed buffer. 2. We were claiming in a show_help message that the timeout was controllable via an MCA parameter. This commit removes that parameter, since there's no good method for us to specify a timeout like this to libfabric right now. 3. We also weren't waiting for the correct number of fi_av_insert() operations to complete. We were waiting for nprocs, which is accidentally fine for 2 procs on separate hosts, but not for most other proc counts. Reviewed-by: Jeff Squyres <jsquyres@cisco.com>	2015-01-07 08:25:17 -08:00
Jeff Squyres	c621d1e622	libfabric: don't LIBADD the common library in the static case Adding the libfabric common library in the --disable-dlopen case will result in duplicate symbols.	2014-12-18 11:04:08 -08:00
Jeff Squyres	d6f059f538	configury: add some descriptive output messages in configure Ensure that the ofi MTL and the usnic BTL have good descriptive output messages in configure.	2014-12-17 13:36:01 -08:00
Jeff Squyres	95da4a5a0e	usnic: no longer use opal_using_threads() Instead, use the flag that is passed in.	2014-12-16 08:49:01 -08:00
Jeff Squyres	cd0a54d76f	usnic: short term fix to enable builds on non-libfabric platforms This isn't quite the Right fix yet, because it doesn't address usnic for external libfabric builds. I'll fix that separately / later.	2014-12-09 09:19:26 -08:00
Jeff Squyres	6e24a1eb85	usnic: update for libfabric API change Use FI_ADDR_UNSPEC for posting a receive from an unspecified source.	2014-12-09 06:06:52 -08:00
Jeff Squyres	9547345b18	usnic: fix show_help message Rename a few symbols to use libfabric-friendly names. Fix a show_help message when fi_av_insert times out.	2014-12-08 11:39:07 -08:00
Jeff Squyres	8e49cc754f	usnic: update to latest libfabric API changes	2014-12-08 11:37:37 -08:00
Jeff Squyres	984982790a	usnic: convert from verbs to libfabric (yay!) This commit represents the conversion of the usnic BTL from verbs to libfabric. For the moment, libfabric is embedded in Open MPI (currently in the usnic BTL). This is because the libfabric API is still changing, and also has not yet been released. Ultimately, this embedded copy of libfabric will likely disappear and the usnic BTL will rely on an external installation of libfabric. New configure options: * --with-libfabric: will cause configure to fail if libfabric support cannot be built * --without-libfabric: will prevent libfabric support from being built * --with-libfabric=DIR: use an external libfabric installation * --with-libfabric-libdir=LIBDIR: when paired with --with-libfabric=DIR, use LIBDIR for the libfabric installation library dir The --with-libnl3[-libdir] arguments are now gone.	2014-12-08 11:37:37 -08:00
Nathan Hjelm	1b564f62bd	Revert "Merge pull request #275 from hjelmn/btlmod" This reverts commit ccaecf0fd6c862877e6a1e2643f95fa956c87769, reversing changes made to 6a19bf85dde5306f559f09952cf3919d97f52502.	2014-11-19 23:22:43 -07:00
Nathan Hjelm	ec33374339	btl: remove des_remote/des_remote_count from the mca_btl_base_descriptor_t structure This structure member was originally used to specify the remote segment for an RDMA operation. Since the new btl interface no longer uses descriptors for RDMA this member no longer has a purpose. In addition to removing these members the local segment information has been renamed to des_segments/des_segment_count.	2014-11-19 11:33:02 -07:00
Ralph Castain	780c93ee57	Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL. We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.	2014-11-11 17:00:42 -08:00
Jeff Squyres	ec4268b59c	usnic: do not send zero-length modex message If there are no usnic BTL modules, then just avoid sending any modex message at all (other BTLs do this; it's safe to do). The change is smaller than it looks: I added a "if 0 ==..." check at the top to return immediately if there are no BTL modules. Then I removed some now-unnecessary conditionals and un-indented as appropriate. Fixes #248	2014-10-22 11:11:58 -07:00
Jeff Squyres	c22e1ae33b	configury: new OPAL_SET_LIB_PREFIX/ORTE_SET_LIB_PREFIX macros These two macros set the prefix for the OPAL and ORTE libraries, respectively. Specifically, the OPAL library will be named libPREFIXopen-pal.la and the ORTE library will be named libPREFIXopen-rte.la. These macros must be called, even if the prefix argument is empty. The intent is that Open MPI will call these macros with an empty prefix, but other projects (such as ORCM) will call these macros with a non-empty prefix. For example, ORCM libraries can be named liborcm-open-pal.la and liborcm-open-rte.la. This scheme is necessary to allow running Open MPI applications under systems that use their own versions of ORTE and OPAL. For example, when running MPI applications under ORTE, if the ORTE and OPAL libraries between OMPI and ORCM are not identical (which, because they are released at different times, are likely to be different), we need to ensure that the OMPI applications link against their ORTE and OPAL libraries, but the ORCM executables link against their ORTE and OPAL libraries.	2014-10-22 10:32:19 -07:00
Jeff Squyres	51027a6635	usnic: fix minor typo Change harmless-but-weird comma to semicolon. Found during code review.	2014-10-15 05:32:36 -07:00
Ralph Castain	fd6a044b7f	Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages. Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.	2014-10-03 16:02:57 -06:00
Jeff Squyres	733316372b	usnic: remove suggestion of enabling no-drop in the fabric Reviewed by Reese Faucette cmr=v1.8.3:reviewer=ompi-rm1.8 This commit was SVN r32628.	2014-08-28 23:56:56 +00:00
Jeff Squyres	b0dfb9f401	usnic: avoid a possible race condition Per #4874, code review revealed a possible race condition in the module struct and the connectivity agent. Move the setup of the connectivity agent listener until the module struct has been fully setup. This commit was SVN r32573.	2014-08-22 02:34:24 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “pmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs. 
This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Jeff Squyres	ac7c907f8d	usnic: ensure to have a safe destruction of an opal_list_item_t It turns out that we ''can'' get to the endpoint destructor with the endpoint still on the "endpoints needing ACKs" list. So if it's on the list, remove it first, and then DESTRUCT the opal_list_item_t. This prevents an assert() fail in debug builds. We'd like to let this soak over the weekend. cmr=v1.8.2:reviewer=dgoodell This commit was SVN r32546.	2014-08-15 21:52:36 +00:00
Jeff Squyres	1cdcb7290b	usnic: no need to check before calling this function This function is intentionally always safe to call -- no need for a double redundant check. This commit was SVN r32545.	2014-08-15 21:39:29 +00:00
Jeff Squyres	082ab15d19	usnic: increase the listen() backlog size Rarely -- but it happens -- the connectivity client gets ECONNREFUSED because the connectivity agent listen() backlog is too small. Rather than put in a loop on the client side, take the simple way out for now: increase the backlog size to an arbitrarily-large number. Reviewed by Dave Goodell. cmr=v1.8.2:reviewer=ompi-rm1.8 This commit was SVN r32543.	2014-08-15 19:12:18 +00:00
Jeff Squyres	9373d6420e	usnic: when a module is finalized, "unlisten" the connectivity checker Instead of waiting to destroy the connectivity agent during component shutdown, have the module shutdown send an "unlisten" command to the cagent that will tell it to stop listening on a given interface. This commit was SVN r32536.	2014-08-15 00:52:43 +00:00
Jeff Squyres	6b592d3016	usnic: convert some BTL_ERRORs to more descriptive show_help messages 1. After we receive N abnormally-short messages (meaning: corrupted), print a show_help message about it. N defaults to 25. N can be set to 0 to disable the message via btl_usnic_max_short_packets. 2. If we receive a completion error for something other than a receive, display a show_help message. Reviewed by Dave Goodell. CMR'ing to v1.8.3, but it will require a custom patch because of the OMPI->OPAL BTL move. cmr=v1.8.3 This commit was SVN r32522.	2014-08-13 15:01:20 +00:00
Jeff Squyres	65767aff68	usnic: remove errant OMPI header file This commit was SVN r32469.	2014-08-08 20:34:50 +00:00
Jeff Squyres	323b9f346c	usnic: update connectivity checker help message Show an example of using the btl_usnic_connectivity_map option. Also, mention that another reason for the "total connectivity failure" may be due to asymmetric / unexpected routing. Reviewed by Dave Goodell. cmr=v1.8.2:reviewer=ompi-rm1.8 This commit was SVN r32465.	2014-08-08 17:18:29 +00:00
Jeff Squyres	6bf28a6940	usnic: update help messages These messages were committed in the v1.8 branch in r32341, but were never committed to the trunk (because we were waiting for the OPAL BTL move). This commit brings the trunk and v1.8 help messages in line with each other. This commit was SVN r32445. The following SVN revision numbers were found above: r32341 --> open-mpi/ompi@5e752b4aba	2014-08-07 20:50:29 +00:00
Jeff Squyres	70f5a10128	usnic: fix typo from r32438 This commit was SVN r32440. The following SVN revision numbers were found above: r32438 --> open-mpi/ompi@d2e31ac647	2014-08-06 19:29:46 +00:00
Jeff Squyres	d2e31ac647	usnic: Fix connectivity checker pointer mismatch Ensure that the connectivity checker agent only uses pointers from the client that is the same process as the agent. Not necessary for the v1.8 branch -- this is a trunk/v1.9-only problem. This commit was SVN r32438.	2014-08-05 23:07:01 +00:00
Jeff Squyres	34897cee9f	usnic: unify teardown between trunk and v1.8 branches Make the del_procs, module finalize, and endpoint destructors be the same between trunk and v1.8, with one exception: the very beginning of v1.8 module_finalize calls del_procs for each proc to simulate/pretend the trunk/v1.9 PML behavior of calling del_procs before module_finalize. This commit was SVN r32437.	2014-08-05 22:31:55 +00:00
Jeff Squyres	1a8d72119f	usnic: Fix configure.ac typo This commit was SVN r32436.	2014-08-05 22:31:07 +00:00
Dave Goodell	13b104bdef	usnic: fix endpoint destruction on the trunk Fixes an assertion failure in --enable-debug builds and SEGVs in normal builds. I'm not 100% sure I like this model, but it at least seems to be consistent. Some variation on this scheme will need to be adapted to the trunk, where usnic_del_procs() is called by the PML instead of internally in usnic_finalize(). A related bug (but with different mechanics) is #4832. This commit was SVN r32424.	2014-08-04 21:30:21 +00:00
Dave Goodell	490c484f8c	usnic: fix uninitialized param to accept(2) This commit was SVN r32423.	2014-08-04 21:30:08 +00:00
Dave Goodell	61a9b49d5b	usnic: fix usnic breakage in ORCM repo This commit was SVN r32416.	2014-08-04 19:34:55 +00:00
Jeff Squyres	ff4717b727	usnic: cagent now checks that incoming pings are expected Previously, the connectivity agent was pretty dumb: it took whatever pings it got and ACKed them. Then we added an agent check to ensure that the ping actually came from the source interface that it said it came from. Now we add another check such that when a ping is received on interface X that corresponds to usnic module Y, we ensure that the source interface of the ping is on the all_endpoints list for module Y (i.e., module Y expects to be able to talk to that peer interface). This detects cases where peers have come to different conclusions about which interfaces should be used to communicate (which is bad!). This usually reflects a network misconfiguration. Fixes CSCuq05389. This commit was SVN r32383.	2014-07-31 22:30:20 +00:00
Ralph Castain	db89071dc2	Cleanup the moved component's Makefile.am to use the opal instead of ompi directories This commit was SVN r32370.	2014-07-31 04:41:04 +00:00
Jeff Squyres	959bdace3c	usnic: check that connectivity pings came from where they said they came from Ensure that incoming "ping" messages came from the IP address that they think they came from. If they don't, drop them (because it is probably a routing error), which will likely eventually cause the connectivity checker to time out, and therefore cause the job to abort. This commit was SVN r32368.	2014-07-30 21:03:56 +00:00
Jeff Squyres	20349da03b	usnic: minor cleanup This commit was SVN r32367.	2014-07-30 20:56:49 +00:00


160 Commits