openmpi

Author	SHA1	Message	Date
Nathan Hjelm	3e7ef48c13	pml/ob1: do not cache leave_pinned This commit fixes a bug that disabled both the RDMA pipeline and RDMA protocols in ob1. ob1 was internally caching the values of opal_leave_pinned and opal_leave_pinned_pipeline at init time. This is no longer valid as opal_leave_pinned may be set by any call to a btl's add_procs. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-14 09:00:40 -06:00
Ralph Castain	330b11c8ab	Merge pull request #3156 from rhc54/topic/tm Update the TM module to support regex passing	2017-03-14 00:25:19 -07:00
Ralph Castain	b1a01d77ae	Update the TM module to support regex passing Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-13 21:50:40 -07:00
Nathan Hjelm	9410574253	Merge pull request #3149 from hjelmn/btl_ugni_2_0 Improve multi-threaded RMA performance of the ugni btl	2017-03-13 16:28:41 -06:00
Ralph Castain	e4a35f2dbf	Merge pull request #3152 from rhc54/topic/setup Update launchers to get correct regex	2017-03-13 14:23:43 -07:00
Nathan Hjelm	d5aaeb74b6	btl/ugni: return a descriptor from sendi Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-13 14:56:54 -06:00
Nathan Hjelm	a19e7023d1	btl/ugni: always check local SMSG CQ This commit removes the local operation count check from the local SMSG completion queue. This check was leading to hangs due to an undocumented feature of the ugni library. The local SMSG CQ is used to send credit return messages back to the sender. The ugni library never checks for the completion itself but relies on the SMSG user to periodically check the CQ. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-13 14:56:54 -06:00
Nathan Hjelm	d5cdeb81d0	btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Clean up the endpoint structure by removing unnecessary fields. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer needs the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component no longer exists there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dynamically. This is needed to support spreading RMA operations across multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-13 14:46:06 -06:00
Nathan Hjelm	12bf38a25c	btl/ugni: add MPI_T performance variables for ugni counters This commit exposes ugni statistics for use with MPI_T. There is no overhead to providing these counters. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-13 14:42:58 -06:00
Jeff Squyres	086748bb70	Merge pull request #3102 from omor1/master Add missing definition of MPI_T_PVAR_SESSION_NULL (resolve #2652)	2017-03-13 15:27:05 -04:00
Ralph Castain	bb574a41df	Update launchers to get correct regex Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-13 11:21:44 -07:00
Ralph Castain	41d7a5c7d9	Merge pull request #3148 from rhc54/topic/cov Silence Coverity warnings	2017-03-13 11:12:14 -07:00
Ralph Castain	105fb152e1	Silence Coverity warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-13 08:38:51 -07:00
Ralph Castain	b9f5cab710	Add a minor debug statement Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-12 18:15:44 -07:00
Gilles Gouaillardet	23d44a5284	sensor/base: initialize orte_sensor_base global variable Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-03-13 09:39:43 +09:00
Ralph Castain	59bcad5f8e	Merge pull request #3146 from rhc54/topic/alps Update alps module to new APIs	2017-03-12 10:35:29 -07:00
Ralph Castain	6d6bc9bd07	Update alps module to new APIs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-12 09:43:07 -07:00
Ralph Castain	fb27bd1b4a	Merge pull request #3143 from rhc54/topic/odls Enable parallel fork/exec of local procs by providing the option of multiple odls progress threads	2017-03-12 07:29:11 -07:00
Ralph Castain	70591bf4dc	Enable parallel fork/exec of local procs by providing the option of multiple odls progress threads Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 20:48:04 -08:00
Ralph Castain	3afadbad89	Merge pull request #3142 from rhc54/topic/sensor Restore sensor framework	2017-03-11 19:53:45 -08:00
Ralph Castain	ab50665222	Restore sensor framework Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 17:46:32 -08:00
Ralph Castain	74125ecc7a	Merge pull request #3141 from rhc54/topic/sync Sync to latest PMIx master and PMIx reference server	2017-03-11 15:53:18 -08:00
Ralph Castain	c6bc3ccb76	Sync to latest PMIx master and PMIx reference server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 12:50:38 -08:00
Howard Pritchard	df8df0d2f3	Merge pull request #3137 from hppritcha/topic/swap_rmaps_compiler_warning rmaps/base: swat compiler warning	2017-03-09 15:06:01 -07:00
Howard Pritchard	f8183f71f7	rmaps/base: swat compiler warning gcc was complaining about variables possibly used uninitialized Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-03-09 14:30:06 -06:00
Yossi	1a95633e40	Merge pull request #2717 from alex-mikheev/topic/sshmem_ucx oshmem: sshmem: adds UCX allocator	2017-03-09 12:58:06 +02:00
Jeff Squyres	16ee880c4e	README: Remove coll/ml verbiage Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-03-08 15:58:54 -05:00
Jeff Squyres	17a34b489b	Merge pull request #3121 from jsquyres/pr/master/readme-updates-from-2.x master: README: sync with v2.x	2017-03-08 12:58:19 -05:00
Yossi	327d5a8ac4	Merge pull request #3125 from alex-mikheev/topic/pml_ucx_req_init_fix ompi: pml ucx: fix persistent request initialization	2017-03-08 19:08:12 +02:00
Ralph Castain	97287f6568	Merge pull request #2916 from rhc54/topic/sim Create an alternative mapping method	2017-03-08 07:08:51 -08:00
Jeff Squyres	dc12ae008b	Merge pull request #3122 from hjelmn/patcher_madvise memory/patcher: do not hook madvise	2017-03-08 09:46:45 -05:00
Alex Mikheev	c081239f88	ompi: pml ucx: fix persistent request init CR changes Signed-off-by: Alex Mikheev <alexm@mellanox.com>	2017-03-08 13:26:29 +02:00
Alex Mikheev	c113c37a7a	ompi: pml ucx: fix persistent request initialization Signed-off-by: Alex Mikheev <alexm@mellanox.com>	2017-03-08 10:59:41 +02:00
Ralph Castain	48fc339718	Create an alternative mapping method that pushes responsibility onto the backend daemons. By default, let mpirun only pack the app_context info and send that to the backend daemons where the mapping will be done. This significantly reduces the computational time on mpirun as it isn't running up/down the topology tree computing thousands of binding locations, and it reduces the launch message to a very small number of bytes. When running -novm, fall back to the old way of doing things where mpirun computes the entire map and binding, and then sends the full info to the backend daemon. Add a new cmd line option/mca param --fwd-mpirun-port that allows mpirun to dynamically select a port, but then passes that back to all the other daemons so they will use that port as a static port for their own wireup. In this mode, we no longer "phone home" directly to mpirun, but instead use the static port to wireup at daemon start. We then use the routing tree to rollup the initial launch report, and limit the number of open sockets on mpirun's node. Update ras simulator to track the new nidmap code. Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found. Update gadget platform file. Initialize the range count when starting a new range. Fix the no-np case in managed allocation. Ensure DVM node usage gets cleaned up after each job. Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-07 20:43:12 -08:00
Nathan Hjelm	3caeda21dc	memory/patcher: do not hook madvise It is not possible to hook madvise at this time due to a deadlock when using glibc. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-07 16:26:53 -07:00
Jeff Squyres	3a6b297bd5	README: sync with v2.x The README on master had grown very, very stale. This commit copies the README from the tip of the v2.x branch (from https://github.com/open-mpi/ompi/pull/3119) and preserves a few minor differences between master and the v2.x branch. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> [skip ci] bot:notest	2017-03-07 18:08:26 -05:00
Nathan Hjelm	7240bee0e0	Merge pull request #3110 from hjelmn/osc_pt2pt osc/pt2pt: flush pending fragments on lock ack	2017-03-07 14:44:09 -07:00
Joshua Ladd	e2ba60b778	Merge pull request #3111 from jladd-mlnx/topic/cx5-device-param Adding latest ConnectX-5 adapter vendor part id to OpenIB device params.	2017-03-07 13:55:46 -05:00
Nathan Hjelm	15ea9c5524	Merge pull request #3013 from hjelmn/rcache_lifo rcache/base: do not free memory with the vma lock held	2017-03-07 09:11:04 -07:00
Jeff Squyres	c2adf359cf	Merge pull request #3083 from ggouaillardet/topic/hwloc_v15 hwloc: add support for hwloc v1.5	2017-03-07 10:01:24 -05:00
Joshua Ladd	b28647857f	Adding latest ConnectX-5 adapter vendor part id to OpenIB device params. Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>	2017-03-07 00:19:54 +02:00
Nathan Hjelm	0195d15401	osc/pt2pt: flush pending fragments on lock ack This commit addresses an issue that can occur in cases where a lot of fragments are outstanding. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-06 13:58:46 -07:00
Ralph Castain	79540fec08	Merge pull request #3104 from rhc54/topic/minor Fix some minor compatibility issues	2017-03-06 10:14:22 -08:00
Edgar Gabriel	607dc2c039	Merge pull request #3103 from edgargabriel/pr/sharedfp-name-collision-fix sharedfp/lockedfile and sm: fix the name collision	2017-03-05 14:46:20 -06:00
Ralph Castain	aca7091114	Fix some minor compatibility issues by ensuring job-level data gets stored against wildcard rank in the cray, s1, and s2 components, and that the ext1 component translates all wildcard rank requests into the peer's rank since v1.x of PMIx doesn't understand wildcard ranks Closes #3101 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-05 10:30:59 -08:00
Edgar Gabriel	2d462b3b80	sharedfp/lockedfile and sm: fix name collision This fixes the issue reported by Nicolas Joly on the mailing list: the sharedfp/lockedfile component does not currently support a scenario where multiple jobs read from the same input file, due to a collision of the filenames used for the sharedfp handle. Although not part of the original report, the same occurs for the sharedfp/sm component. Therefore, add the jobid to the lockedfile/sm file name. Use the OMPI_CAST_RTE_NAME macro to determine the jobid. Fixes: #3098 Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>	2017-03-05 11:28:28 -06:00
Omri Mor	20ab37a297	Add missing MPI_T_PVAR_SESSION_NULL to mpi.h MPI_T_pvar_session_free() should reject null sessions and set *session to MPI_T_PVAR_SESSION_NULL Signed-off-by: Omri Mor <omri50@gmail.com>	2017-03-05 09:03:30 -06:00
Mike Dubman	2f8c759b73	Merge pull request #3100 from artpol84/fix_ucx_req/master ompi/pml/ucx: Fix uninitialized UCX request field.	2017-03-05 08:53:18 +01:00
Artem Polyakov	9448814c40	ompi/pml/ucx: Fix uninitialized UCX request field. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-03-05 03:06:30 +07:00
Edgar Gabriel	d1fed77781	Merge pull request #3094 from edgargabriel/pr/master-lustre-priority io/ompio: adjust the priority of the OMPIO component on lustre	2017-03-03 09:29:14 -06:00

26790 commits