openmpi

Author	SHA1	Message	Date
Jeff Squyres	af336ac0e8	pmix/configure.m4: always use embedded mode Looks like embedded mode was mistakenly disabled when --with-devel-headers was specified. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-05-04 10:01:41 -07:00
Ralph Castain	a737d0f963	Merge pull request #3430 from bosilca/topic/tcp_hostname Use the OPAL function to get the hostname.	2017-05-03 06:42:02 -07:00
Brian Barrett	3b991498be	btl tcp: Don't set socket buffer size by default Set the default send and receive socket buffer size to 0, which means Open MPI will not try to set a buffer size during startup. The default behavior since near day one of the TCP BTL has been to set the send and receive socket buffer sizes to 128 KiB. A number that works great on 1 GbE, but not so great on 10 GbE fabrics of any real size. Modern TCP stacks, particularly on Linux, have gotten much smarter about buffer sizes and are much less efficient if a buffer size is set (even if set to something large). Signed-off-by: Brian Barrett <bbarrett@amazon.com>	2017-04-28 14:14:49 -07:00
George Bosilca	2d8943d920	Use the OPAL function to get the hostname.	2017-04-28 02:48:15 -04:00
Nathan Hjelm	387467c358	btl/ugni: remove erroneous mca_btl_ugni_frag_return call Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-04-27 09:14:51 -06:00
Ralph Castain	8b1f01dfe6	Set the default modex parameters back to full blocking modex while we continue to test and debug the slow modex - it seems to be having issues on the Cray Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-22 15:19:46 -07:00
Howard Pritchard	f2a27cc991	Merge pull request #3396 from hppritcha/topic/swat_compiler_warning btl/sm: swat a compiler warning	2017-04-22 14:31:21 -06:00
Ralph Castain	f2ed293ecd	Merge pull request #3398 from rhc54/topic/modex Implement a background fence that collects all data during modex operation	2017-04-21 15:15:49 -07:00
Ralph Castain	9fc3079ac2	Implement a background fence that collects all data during modex operation The direct modex operation is slow, especially at scale for even modestly-connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kick off the modex operation in the background, and then trap any modex_recv calls until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called. Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always outperform the time spent doing a modex. However, this provides a decent compromise in the interim. This PR changes the default settings of a few relevant params to make "background modex" the default behavior: * pmix_base_async_modex -> defaults to true * pmix_base_collect_data -> continues to default to true (no change) * async_mpi_init -> defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything. The logic in MPI_Init is: * if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation * if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed * if async_modex is not set, then we block until the fence completes (regardless of collecting data or not) * if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation. * if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case. HTH Ralph Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-21 10:29:23 -07:00
Jeff Squyres	1d5e08f44a	usnic: more iov_limit fixes Follow on to `7bd2de9960`: move setting the iov_limit to 1 earlier in the startup sequence. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-04-21 09:14:28 -07:00
Howard Pritchard	782f1bb9af	btl/sm: swat a compiler warning gnu 6.3.1 complaining about uninitialized variable Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-04-21 10:02:56 -05:00
Jeff Squyres	e9e89e502b	Merge pull request #3245 from hjelmn/auto_bool mca/base: accept y and n for bool and auto bool enumerator	2017-04-21 10:41:10 -04:00
Howard Pritchard	462342d148	Merge pull request #3311 from hppritcha/topic/libfabric_moves_to_ofi common/libfabric: move libfabric to ofi	2017-04-21 07:50:38 -06:00
Jeff Squyres	7bd2de9960	usnic: ensure to set the iov_limit to 1 The usNIC BTL does not use more than 1 iov, so be sure to set it to 1 so that we don't allocate cq/rq/sq entries based on a default (i.e., >1) number of iovs per entry. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-04-20 13:28:15 -07:00
Howard Pritchard	841192645b	common/libfabric: move libfabric to ofi This PR renames the common library for OFI libfabric from libfabric to ofi. There are a number of reasons this is good to do: 1) it's shorter and replaces nine characters with three for function names for what may eventually be a fairly extensive interface 2) OFI is the term used for MTL and RML components that use the OFI libfabric interface 3) A planned OSC component will also use the OFI term. 4) Other HPC libraries that can use OFI libfabric tend to use the term "ofi" internally and also in their configure options relevant to OFI libfabric (e.g., MPICH/CH4, Intel MPI, Sandia SHMEM) There seem to be comments in places in the Open MPI source code that indicate that this common library will be going away. Far from it, as we will want to be able to share things like AV objects between OMPI and possibly OSHMEM components that use the OFI libfabric interface. This PR also adds a synonym to the --with-libfabric(-libdir) configury options: --with-ofi and --with-ofi-libdir. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-04-20 13:07:16 -06:00
Ralph Castain	c86f71376a	Increase fine grain of timing info Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-20 00:17:40 -07:00
Gilles Gouaillardet	b18745589f	Merge pull request #1665 from ggouaillardet/topic/OMPI_DATATYPE_INIT_UNAVAILABLE_BASIC_TYPE ompi/datatype: define OMPI_DATATYPE_INIT_UNAVAILABLE_BASIC_TYPE macro	2017-04-20 09:10:07 +09:00
Nathaniel Graham	34b4aeb17f	Merge pull request #3339 from nrgraham23/mpirun_help_improvements Additional mpirun --help changes	2017-04-19 14:05:07 -06:00
Nathaniel Graham	01312b2f90	Additional mpirun --help changes This commit recategorizes several mpirun arguments, and moves the information for mpirun --help arguments to the bottom of the general help message. I also added the OPAL_CMD_LINE_OTYPE field to two commands that were missed initially because they were not in the same area as the others. Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>	2017-04-19 11:43:45 -06:00
Gilles Gouaillardet	cc8a655fe6	configury: remove now-obsolete reference to OPAL_PTRDIFF_TYPE Since Open MPI now requires a C99 compiler, and the ptrdiff_t type is part of C99, there is no longer any need for the abstract OPAL_PTRDIFF_TYPE type. Thanks George, Nathan and Paul for the help. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-04-19 13:42:45 +09:00
Gilles Gouaillardet	fa5cd0dbe5	use ptrdiff_t instead of OPAL_PTRDIFF_TYPE Since Open MPI now requires a C99 compiler, and the ptrdiff_t type is part of C99, there is no longer any need for the abstract OPAL_PTRDIFF_TYPE type. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-04-19 13:41:56 +09:00
bosilca	872cf44c28	Improve the opal_pointer_array & more (#3369) * Complete rewrite of opal_pointer_array Instead of a cache-oblivious linear search, use a bit array to speed up the management of the free space. As a result we slightly increase the memory used by the structure, but we get a significant boost in performance. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> * Do not register datatypes in the f2c translation table. The registration is now moved up into the Fortran layer, by forcing a call to MPI_Type_c2f. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-04-18 21:41:26 -04:00
Jeff Squyres	a0543616ee	dl/dlopen: add libs to wrapper LIBS Without this, libs (e.g., "-ldl") are not added to the wrapper LIBS flags. This may work on some platforms, but on at least RHEL 7.3, it does not (i.e., compiling MPI applications fails because it can't find dlopen). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-04-15 09:30:18 -07:00
Ralph Castain	ffbfd22d84	Fix event registration - need to increment the event index and record the number of codes in the event handler Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-13 17:35:10 -07:00
Joshua Hursey	742d452c62	opal_info: Add --show-failed CLI option * `ompi_info --show-failed` will include the failed components along with information about why they failed. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-04-12 16:06:40 -05:00
Joshua Hursey	3ad3d4e3e7	opal_info: Add ability to report load failures * Add a path for failed component load information to be reported up. * This allows ompi_info to display this information inline to make it easier for folks to see if the component is present but failed for some reason. Most likely a missing library, but could be a libnl conflict. * Add MCA parameter to enable this feature: - `mca_base_component_track_load_errors` takes a boolean - Default: `false` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-04-12 16:06:21 -05:00
Ralph Castain	9f73974fe1	Update to latest PMIx master, including disabling the pmi-1 and pmi-2 backward compatibility as these interfere with the s1,s2 components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-12 12:34:27 -07:00
Ralph Castain	dadc924cde	Cleanup warnings when timing is not enabled Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-11 17:29:27 -07:00
Artem Polyakov	4477b87e1d	Merge pull request #3303 from karasevb/timing2/master OMPI timings	2017-04-11 07:52:40 -07:00
Boris Karasev	d132eab4a5	ompi/timings: fixed the error of opal timings env import Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2017-04-11 12:08:48 +06:00
Ralph Castain	95ae0d1df3	Cleanup timing macros for portability across compilers. Rename the --enable-timing configure option to be --enable-pmix-timing so it doesn't pick up external timing requests. Remove a stale function reference in PMIx so it can compile with timing enabled. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-10 12:56:38 +06:00
Nathaniel Graham	5e44e40ca9	Merge pull request #3293 from nrgraham23/mpirun_help_parsable Add parsable option to help arguments	2017-04-07 11:35:49 -06:00
Boris Karasev	36a0e71f2d	ompi/timings: prepare for production state Adds: - enabling/disabling of timings through the environment variable `OMPI_TIMING_ENABLE` - output format: [file name]:[function name]:[description]: avg/min/max - dynamic extension of the results array when the initially allocated size is exhausted - catching and collecting of errors - cleanup Note: to use this feature, configure with `--enable-timings` and set the environment variable `OMPI_TIMING_ENABLE=1` Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2017-04-07 21:16:57 +06:00
Artem Polyakov	45898a9c65	opal/timing: add the draft of env-based timings This commit adds a new timing feature that uses environment variables to expose timing information. This allows easy access to this data (if timing is enabled) from any other part of the application for subsequent postprocessing. In particular, this will be integrated with the OMPI-level timing framework, which will use MPI_Reduce functionality to provide more compact and easy-to-use information. This commit also adds an example of how to use this framework by annotating the rte_init function. The result is not used anywhere for now; it will be postprocessed in subsequent commits. NOTE: this functionality is currently disabled until it is verified at runtime Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-04-07 21:16:22 +06:00
Artem Polyakov	88ed79ea25	opal/timing: remove old framework Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-04-07 21:16:22 +06:00
Nathaniel Graham	36d660e07a	Add parsable option to help arguments This commit adds a "parsable" option to the help arguments, which prints out a machine readable list of all the mpirun options. Fixes #3279 Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>	2017-04-05 17:01:43 -06:00
Gilles Gouaillardet	10ea991d0a	hwloc: add CUDA include dir to CPPFLAGS so hwloc configury can find nvml.h when CUDA support is built Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-04-05 11:46:22 +09:00
Gilles Gouaillardet	8d7541f766	hwloc: disable nvml if CUDA support is not built in Open MPI Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-04-05 11:07:34 +09:00
Nathaniel Graham	7063f3021f	Merge pull request #3231 from nrgraham23/revamp_mpirun_help mpirun --help output revamp	2017-04-04 12:32:20 -06:00
Nathaniel Graham	19e5d15491	mpirun --help output revamp This commit modifies the output from the mpirun --help command. The options have been split into groups, to make the output smaller and more readable. The groups are: general, debug, output, input, mapping, ranking, binding, devel, compatibility, launch, dvm, and unsupported. There is also a special "full" command that can be used to get the old behaviour of printing out all of the options. Unsupported options may only be seen with this full output. This commit also adds a special case for the help argument. It makes it possible for the user to enter 0 or 1 arguments instead of having to always enter an argument. This defaults to printing out the "general" help options so the user can then see what help arguments there are. Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>	2017-04-04 10:59:32 -06:00
Ralph Castain	92c996487c	Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well. Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap. Get the DVM running again. Fix direct modex by eliminating a race condition caused by releasing data while sending it. Up the size limit before compressing. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 19:25:15 -07:00
Ralph Castain	2cc5fea8be	Update to PMIx v2.0alpha Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 10:02:29 -07:00
Gilles Gouaillardet	81062b7cd2	hwloc: update hwloc to 1.11.6 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-03-31 13:35:16 +09:00
Ralph Castain	7dd34d0c9a	Use the correct callback data - the callback function was expecting a bool, not a pmix_ptl_sr_t. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-28 17:21:47 -07:00
Nathan Hjelm	676cfe2a35	mca/base: accept y and n for bool and auto bool enumerator Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-28 09:20:14 -06:00
Ralph Castain	b398d721d5	Merge pull request #3236 from rhc54/topic/craycleanups Silence a flood of warnings when compiling with gcc on Cray	2017-03-24 13:33:46 -07:00
Ralph Castain	ecc8000136	Silence a flood of warnings when compiling with gcc on Cray Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 13:37:11 -06:00
Ralph Castain	470452cba0	Correctly check the sa_family and cast the data correctly before passing it to inet_ntop, and don't be quite as fancy with the pointer arithmetic, as the combination was causing us to segfault every time this debug message was called. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 11:42:57 -07:00
Ralph Castain	35f817911e	Fix coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 08:09:46 -07:00
Ralph Castain	c0bcd11bcf	Fix permissions - no CI required Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-23 08:05:52 -07:00
Ralph Castain	55e4fba5f5	If we lose connection to the server after initiating a send/recv in PMIx (e.g., in PMIx_Abort), then we need to "resolve" all pending recvs to avoid hanging. Fixes #3225 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-23 02:53:21 -07:00
Ralph Castain	d645557fa0	Update to include the PMIx 2.0 APIs for monitoring and job control. Include required integration, but leave the monitors off for now. Move the sensor framework out of ORTE as it is being absorbed into PMIx. Fix typo and silence warnings. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 17:47:08 -07:00
Ralph Castain	4b6d220a83	You cannot include both pmi.h and pmi2.h as they have conflicting defines in them. Thanks to Kilian Cavalotti for pointing it out Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-19 11:53:54 -07:00
Jeff Squyres	ce0e1cd32c	Merge pull request #3201 from hppritcha/jjhursey-topic/timer-gettimeofday Jjhursey topic/timer gettimeofday	2017-03-18 20:12:36 -04:00
Jeff Squyres	b8dfd49e97	hwloc: re-enable use of autogen.pl in a tarball Commit `fec519a793` broke the ability to run autogen.pl in a distribution tarball. This commit restores that ability by also distributing opal/mca/hwloc/autogen.options in the tarball. Skipping CI because CI does not test this functionality: [skip ci] bot:notest Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-03-17 11:41:17 -07:00
Jeff Squyres	5219054d29	Merge pull request #3185 from jsquyres/pr/master/compiler-warning-squashes Compiler warning squashes	2017-03-16 10:12:08 -04:00
Jeff Squyres	b51c4e2797	memory/patcher: fix a compiler warning Don't define the madvise intercept functions since we're not currently intercepting madvise. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-03-16 05:43:51 -07:00
Jeff Squyres	616f20c52c	timer/linux: rename component-specific functions Several component-specific functions were named with a prefix of "opal_timer_base", which was quite confusing. Rename them to have a prefix "opal_timer_linux" to make it clear that they are here in this component (and different than actual opal_timer_base symbols). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-03-15 21:03:13 -05:00
Jeff Squyres	290d4598df	timer/linux: remove global variable This variable is only used in one file, so make it static. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-03-15 21:03:06 -05:00
Howard Pritchard	db2e1298fb	OSx: remove built-in atomics support It was decided to remove support for os-x builtin atomics Fixes #2668 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-03-15 12:45:33 -06:00
Nathan Hjelm	6b210fa2c4	btl/ugni: do not return a frag from sendi if an endpoint is waitlisted This fixes a hang that can occur when running bandwidth tests. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-14 10:14:13 -06:00
Nathan Hjelm	2e42b0afbd	btl/ugni: move connection check into sync event This commit makes datagram checks time based and reduces their frequency when only the wildcard datagram is posted. This change improves latency on knl systems. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-14 10:10:05 -06:00
Nathan Hjelm	d5aaeb74b6	btl/ugni: return a descriptor from sendi Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-13 14:56:54 -06:00
Nathan Hjelm	a19e7023d1	btl/ugni: always check local SMSG CQ This commit removes the local operation count check from the local SMSG completion queue. This check was leading to hangs due to an undocumented feature of the ugni library. The local SMSG CQ is used to send credit return messages back to the sender. The ugni library never checks for the completion itself, but relies on the SMSG user to periodically check the CQ. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-13 14:56:54 -06:00
Nathan Hjelm	d5cdeb81d0	btl/ugni: improve multi-threaded performance This commit updates the ugni btl to make use of multiple device contexts to improve the multi-threaded RMA performance. This commit contains the following: - Cleanup the endpoint structure by removing unnecessary fields. The structure now also contains all the fields originally handled by the common/ugni endpoint. - Clean up the fragment allocation code to remove the need to initialize the my_list member of the fragment structure. This member is not initialized by the free list initializer function. - Remove the (now unused) common/ugni component. btl/ugni no longer needs the component. common/ugni was originally split out of btl/ugni to support bcol/ugni. As that component no longer exists, there is no reason to keep this component. - Create wrappers for the ugni functionality required by btl/ugni. This was done to ease supporting multiple device contexts. The wrappers are thread safe and currently use a spin lock instead of a mutex. This produces better performance when using multiple threads spread over multiple cores. In the future this lock may be replaced by another serialization mechanism. The wrappers are located in a new file: btl_ugni_device.h. - Remove unnecessary device locking from serial parts of the ugni btl. This includes the first add-procs and module finalize. - Clean up fragment wait list code by moving enqueue into a common function. - Expose the communication domain flags as an MCA variable. The defaults have been updated to reflect the recommended setting for knl and haswell. - Avoid allocating fragments for communication with already overloaded peers. - Allocate RDMA endpoints dynamically. This is needed to support spreading RMA operations across multiple contexts. - Add support for spreading RMA communication over multiple ugni device contexts. This should greatly improve the threading performance when communicating with multiple peers. By default the number of virtual devices depends on 1) whether opal_using_threads() is set, 2) how many local processes are in the job, and 3) how many bits are available in the pid. The last is used to ensure that each CDM is created with a unique id. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-13 14:46:06 -06:00
Nathan Hjelm	12bf38a25c	btl/ugni: add MPI_T performance variables for ugni counters This commit exposes ugni statistics for use with MPI_T. There is no overhead to providing these counters. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-13 14:42:58 -06:00
Ralph Castain	c6bc3ccb76	Sync to latest PMIx master and PMIx reference server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 12:50:38 -08:00
Nathan Hjelm	3caeda21dc	memory/patcher: do not hook madvise It is not possible to hook madvise at this time due to a deadlock when using glibc. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-03-07 16:26:53 -07:00
Joshua Ladd	e2ba60b778	Merge pull request #3111 from jladd-mlnx/topic/cx5-device-param Adding latest ConnectX-5 adapter vendor part id to OpenIB device params.	2017-03-07 13:55:46 -05:00
Nathan Hjelm	15ea9c5524	Merge pull request #3013 from hjelmn/rcache_lifo rcache/base: do not free memory with the vma lock held	2017-03-07 09:11:04 -07:00
Jeff Squyres	c2adf359cf	Merge pull request #3083 from ggouaillardet/topic/hwloc_v15 hwloc: add support for hwloc v1.5	2017-03-07 10:01:24 -05:00
Joshua Ladd	b28647857f	Adding latest ConnectX-5 adapter vendor part id to OpenIB device params. Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>	2017-03-07 00:19:54 +02:00
Ralph Castain	aca7091114	Fix some minor compatibility issues by ensuring job-level data gets stored against wildcard rank in the cray, s1, and s2 components, and that the ext1 component translates all wildcard rank requests into the peer's rank since v1.x of PMIx doesn't understand wildcard ranks Closes #3101 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-05 10:30:59 -08:00
Ralph Castain	1de72ff023	Silence an unnecessary error log Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-02 17:18:34 -08:00
Gilles Gouaillardet	7e01be60d9	hwloc: add support for hwloc v1.5 hwloc v1.5 does not support HWLOC_OBJ_OSDEV_COPROC nor hwloc_topology_dup(), so for this version : - do not search for coprocessors - do not try hwloc_topology_dup(), note this is not used anywhere in the code base Thanks Jeff for helping with the wording Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-03-03 09:39:24 +09:00
Ralph Castain	83199979ba	Remove the stale opal/sec framework Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-02 15:41:56 -08:00
Jeff Squyres	5b484c91f4	btl/tcp: use show_help to print the dropped-TCP warning Make the message more friendly / more detailed, and de-duplicate it (just in case it happens a lot). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-03-01 16:31:29 -08:00
George Bosilca	b0f8d2c460	Never free the statically allocated buffer. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-03-01 13:21:03 -05:00
George Bosilca	ec4a235e6a	Allow a TCP proc release during the create. This is mostly for error cases, where we need to release the newly created proc. Currently the code deadlocks because the endpoint lock is held at the release and the lock is not recursive. Also added some code to print the IP addresses that don't match during the TCP connection step. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-03-01 13:17:54 -05:00
Jeff Squyres	d5266aba90	Merge pull request #2955 from jsquyres/pr/hwloc-external-fixes Fix --with-hwloc=external	2017-02-28 14:57:07 -05:00
Josh Hursey	0006f0d7c5	Merge pull request #2773 from jjhursey/topic/hook-fwk Add a 'hook' framework	2017-02-28 12:29:50 -06:00
Jeff Squyres	fec519a793	hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h Per a prior commit, the presence of "hwloc.h" can cause ambiguity when using --with-hwloc=external (i.e., whether to include opal/mca/hwloc/hwloc.h or whether to include the system-installed hwloc.h). This commit: 1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h. 2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc. 3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the rest of the code base. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-28 07:48:42 -08:00
Joshua Hursey	c10bbfded6	ompi/hook: Add the hook/license framework * Include a 'demo' component that shows some of the features. * Currently has hooks for: - MPI_Initialized - top, bottom - MPI_Init_thread - top, bottom - MPI_Finalized - top, bottom - MPI_Init - top (pre-opal_init), top (post-opal_init), error, bottom - MPI_Finalize - top, bottom * Other places in ompi can 'register' to hook into any one of these places by passing back a component structure filled with function pointers. * Add a `MCA_BASE_COMPONENT_FLAG_REQUIRED` flag to the MCA structure that is checked by the `hook` framework. If a required, static component has been excluded then the `hook` framework will fail to initialize. - See note in `opal/mca/mca.h` as to why this is checked in the `hook` framework and not in `opal/mca/base/mca_base_component_find.c` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-02-27 12:05:53 -05:00
Gilles Gouaillardet	af0b5cffb4	asm: rename the AMD64 into X86_64 in this context, AMD64 really means amd64 or em64t, so let's rename this into X86_64 in order to avoid any confusion Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-27 15:10:50 +09:00
Jeff Squyres	d7dd4d769e	openmpi-mca-params.conf: Fix comment Make sure to specify "--level 9" to ompi_info to see all MCA params. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-24 07:09:06 -08:00
Clement Foyer	f371cc0a43	Fix minor typo Return value in comment about opal_list_item_compare_fn_t typedef when a < b is indicated to be 11 instead of -1. Signed-off-by: Clement Foyer <clement.foyer@inria.fr>	2017-02-23 16:10:32 +01:00
Ralph Castain	e86a0dbf39	Update to PMIx master to include dlopen fixes and addition of libltdl support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-22 11:54:33 -08:00
Nathan Hjelm	60ad9d1817	rcache/base: do not free memory with the vma lock held This commit makes the vma tree garbage collection list a lifo. This way we can avoid having to hold any lock when releasing vmas. In theory this should finally fix the hold-and-wait deadlock detailed in #1654. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-02-21 21:04:46 -07:00
Ralph Castain	8cffdcf127	Ensure that the pmix headers and lib get installed when --with-devel-headers is given so that PMIx applications can be built and executed against the "embedded" PMIx version Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-21 13:46:46 -08:00
Gilles Gouaillardet	4184c01be5	Merge pull request #2393 from bosilca/topic/no_predefined_ddt_refcount Don't refcount the predefined datatypes.	2017-02-21 09:38:11 +09:00
Gilles Gouaillardet	bb2481a84b	pmix2x: synchronize to the latest PMIx master pmix/master@f57d9b2953 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-20 10:45:17 +09:00
Ralph Castain	f49118eaab	Fix some pmix configuration code Remove stale file reference that caused a check to always fail. Update psm2 function check to new libs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-16 10:54:47 -08:00
Howard Pritchard	b272f87926	Merge pull request #2968 from hjelmn/pmix_cray pmix/cray: performance improvements and cleanup	2017-02-16 11:41:59 -07:00
Ralph Castain	201f8571ca	Ensure we retain the peer object until we are done with it, then detect that the socket has closed due to a lost connection and cleanly release the message event Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 18:30:55 -08:00
Ralph Castain	223495325d	Fix binding policy bug and support pe=1 modifier Allow someone to specify the "pe=N" modifier to a mapping policy when N=1. This equates to just "bind-to core", but helps people who use a script to set the PE policy. Fix a bug where setting the binding policy left a lingering "if-supported" flag that shouldn't be there. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 14:55:17 -08:00
Ralph Castain	9cd7349d7c	Instead of completely free'ing the event base, pause the PMIx progress thread before tearing down the infrastructure, and then release the event base at the end of the procedure. This allows any infrastructure objects holding events to delete them prior to free'ing the event base. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 05:02:43 -08:00
Ralph Castain	f7fe2f7189	Merge pull request #2977 from rhc54/topic/spawn Fix comm_spawn by registering nspace info only when needed	2017-02-15 04:31:54 -08:00
Ralph Castain	68b53e2179	Fix comm_spawn by registering nspace info only when needed - either when we have local procs, or when job-level info is required by connecting jobs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 19:47:56 -08:00
Ralph Castain	404fe327be	Merge pull request #2973 from rhc54/topic/cleanups Update to newest PMIx master (includes configuration cleanups). Silence trivial Coverity warning in hwloc base.	2017-02-14 17:38:18 -08:00
Ralph Castain	0c8609ca16	Update to newest PMIx master (includes configuration cleanups). Silence trivial Coverity warning in hwloc base. Cleanup a race condition segfault during finalize by ensuring the PMIx progress thread is stopped prior to starting to tear down the messaging components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 15:14:00 -08:00


4755 Commits