openmpi

Author	SHA1	Message	Date
Gilles Gouaillardet	aeee48357a	btl/sm: correctly handle nodes with zero NUMA hwloc object the hwloc topology might not contain a NUMA object with hwloc < v2 if the node is not NUMA, so force the NUMA object count to one in order to correctly allocate mca_btl_sm_component.sm_mpools. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-12 11:45:29 +09:00
George Bosilca	c2cd717f82	Don't refcount the predefined datatypes. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-01-11 16:48:59 -05:00
Nysal Jan K.A	97801028ba	asm/ppc: Fix a regression in powerpc atomics Add a missing constraint to the input operand list. This fixes a regression caused by d4be138a7b. Thanks to Orion Poplawski for reporting the issue. Refs #2610 Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>	2017-01-11 11:00:11 -05:00
Ralph Castain	31a8476223	Merge pull request #2702 from rhc54/topic/cov Silence Coverity CID 1398541	2017-01-10 17:50:23 -08:00
Nathan Hjelm	79cabc92fd	rcache/base: do not release vma structures in vma_tree_delete This commit fixes a deadlock that can occur when the libc version holds a lock when calling munmap. In this case we could end up calling free() from vma_tree_delete which would in turn try to obtain the lock in libc. To avoid the issue put any deleted vmas in a new list on the vma module and release them on the next call to vma_tree_insert. This should be safe as this function is not called from the memory hooks. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-01-10 16:58:07 -07:00
Ralph Castain	e568b211e4	Silence Coverity CID 1398541 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-10 15:30:50 -08:00
Jeff Squyres	b980e334dc	usnic: add completion stats This should probably not go to the v2.x branch, since it changes the output format of the usnic stats. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:54 -08:00
Jeff Squyres	706f53bb01	usnic: ensure that stats string is always truncated Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:54 -08:00
Jeff Squyres	1fdd0fe228	usnic: add missing params to show_help() call Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:54 -08:00
Jeff Squyres	7048adec04	usnic: add some assert()s Add some run-time assert checks for debug builds. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:32 -08:00
Jeff Squyres	2d28ccb5fd	usnic: add verbose output of queue lengths Show the actual RX/TX and CQ length returned by libfabric in verbose output. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:32 -08:00
Jeff Squyres	bd5b8ed754	usnic: ensure that queues are long enough Double check the queue lengths that we get back from libfabric to ensure that they are at least as long as we need. They should never be shorter than we need, but let's just check to be sure. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:32 -08:00
Jeff Squyres	53dc75a89c	usnic: ensure to reset flags on returned frags Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:31 -08:00
Jeff Squyres	c4d7876ca0	usnic: check send credits on data channel for data frags Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:31 -08:00
Jeff Squyres	879d25e5df	usnic: ensure to check send credits for ACKs Don't just blindly send ACKs; ensure that we have send credits before doing so. If we don't have any send credits, just don't send the ACK (it'll come again soon enough; it's not a tragedy if we don't send it now). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:06:31 -08:00
Jeff Squyres	7787dad4db	usnic: ensure CQs are long enough The libfabric usnic provider may give you back TX/RX queues that are longer than you asked for. So just use the TX/RQ/CQ lengths that we asked for, regardless of what length comes back. Additionally, keep the length of the priority channel CQ separate from the length of the data CQ. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:03:53 -08:00
Jeff Squyres	b02d8c48f5	usnic: make the releasing safer Since the usnic BTL is single-threaded in this area, there really is no danger, but don't use one of the pointers hanging off the frag after we return it to the freelist. Instead, save the endpoint pointer before returning the frag. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:03:53 -08:00
Jeff Squyres	e25b860627	usnic: clarify types The types are technically typedef equivalent, but it's less confusing to use the types that agree with the name of the constructor. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 12:03:53 -08:00
Jeff Squyres	40fe575132	usnic: trivial updates (no code/logic changes) - Add more explanatory comments - Trivial whitespace / style updates - Rename opal_btl_usnic_force_retrans() -> opal_btl_usnic_fast_retrans() Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-10 10:40:02 -08:00
Nathan Hjelm	3593fad4d2	Merge pull request #2679 from hjelmn/cpuid_fix master: amd64: save/restore all 64 bits of rbx around cpuid	2017-01-10 09:12:28 -07:00
Gilles Gouaillardet	6d59b476de	Merge pull request #2686 from ggouaillardet/topic/pmix2x_ptl_base_sendrecv pmix2x: ptl/base: send header and message data together via writev()	2017-01-10 16:26:10 +09:00
Gilles Gouaillardet	44c1ff60f1	Merge pull request #2672 from ggouaillardet/topic/misc_memory_leaks Plug misc memory leaks	2017-01-10 13:16:04 +09:00
Gilles Gouaillardet	a01960bee5	pmix2x: ptl/base: send header and message data together via writev() On Linux, sending the header and then the message data severely impacts the performance of ptl/tcp: on the receiver, reading the data can often result in a PMIX_ERR_RESOURCE_BUSY or PMIX_ERR_WOULD_BLOCK, which ends up degrading performance. This commit sends both header and message data at the same time via writev() and makes ptl/tcp virtually as efficient as ptl/usock. A short writev() generally occurs when the kernel buffer is full, so there is no point in retrying in this case. FWIW, no such degradation was observed on OSX. Refs open-mpi/ompi#2657 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-10 13:07:39 +09:00
Nathan Hjelm	d6bd69dc93	mca/base: account for NULL string_value in verbose set The MCA variable code calls the string from value function with a NULL string to verify values. The verbosity enumerator was not correctly checking for a non-NULL value before trying to set the string. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-01-09 11:52:31 -07:00
Ralph Castain	67fce2861b	Merge pull request #2685 from rhc54/topic/cov Resolve Coverity issues	2017-01-07 13:11:40 -08:00
Ralph Castain	e25e69dc2f	Resolve Coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-07 10:45:52 -08:00
George Bosilca	cfeeecd381	Remove the tcp_local field from the TCP component. Instead use the OPAL process name to get the name of the local process. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-01-07 13:24:18 -05:00
Ralph Castain	822e2680ba	Cleanup some configure stuff for static builds - still can't get wrapper extra libs to be recognized Signed-off-by: Ralph Castain <rhc@open-mpi.org> pmix2x: minor configure updates Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-07 08:37:36 -08:00
Nathan Hjelm	5b70ae3ec0	amd64: save/restore all 64 bits of rbx around cpuid This commit fixes a bug in the timer check. When -fPIC is used we need to save/restore ebx. The code copied from patcher was meant for 32-bit systems and did not work correctly on 64-bit systems. This commit updates the save/restore to use rbx instead of ebx. Fixes #2678 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-01-06 18:54:20 -07:00
Gilles Gouaillardet	189d7b9480	opal/dss: revamp opal_value_unload() to keep valgrind happy reorder tests to avoid valgrind complaining about uninitialized variables Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 17:10:39 +09:00
Ralph Castain	444f5fa35d	Raise the priority of the usock component so it gets preferentially picked Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-05 22:53:04 -08:00
Gilles Gouaillardet	c2ddb1e2fc	mca/base: plug a memory leak register mca_base_var_enum_value_flag_t so they can be free'd upon finalize Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:36 +09:00
Gilles Gouaillardet	6d5cb9fe0d	event: plug a leak when closing the event framework Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:35 +09:00
Gilles Gouaillardet	b3a2bdda7b	opal/threads: manually invoke thread-specific key destructors on the main thread. There is no such thing as pthread_join(main_thread), so key destructors are never invoked on the main thread, which causes valgrind to report some memory leaks. Manually store and then invoke the key destructors to make valgrind happy. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:35 +09:00
Gilles Gouaillardet	6ef281e163	pmix/base: fix misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:35 +09:00
Gilles Gouaillardet	a59dfd7b14	sec/munge: plug a memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:35 +09:00
Gilles Gouaillardet	c612499bc1	opal: mca/base: fix a memory leak in the mca_base_var_enum_flag_t destructor Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:59 +09:00
Gilles Gouaillardet	7e5da7382e	btl/tcp: plug leaks when closing component remove tcp_local from the tcp_procs table, and release it Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:59 +09:00
Gilles Gouaillardet	507623d6b1	mpool/hugepage: plug a memory leak on finalize Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:58 +09:00
Gilles Gouaillardet	51021028d6	mpool/base: plug a memory leak on finalize Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:58 +09:00
Ralph Castain	6509f60929	Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node. Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given. Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example: $ mpirun -npernode 2 ./mpi_memprobe Sampling memory usage after MPI_Init Data for node rhc001 Daemon: 12.483398 Client: 6.514648 Data for node rhc002 Daemon: 11.865234 Client: 4.643555 Sampling memory usage after MPI_Barrier Data for node rhc001 Daemon: 12.520508 Client: 6.576660 Data for node rhc002 Daemon: 11.879883 Client: 4.703125 Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-05 10:32:17 -08:00
Ralph Castain	91d714fe93	Add flags to direct PMIx to only use one listener, but without directing which one (tcp or usock) to use. This allows the user to set PMIX_MCA_ptl in their environment to select the transport method. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-04 09:16:44 -08:00
Ralph Castain	f355fb926d	Continue cleanup of notifications. Resolve a race condition that can result in attempt to send a message on a closed socket Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-04 09:16:33 -08:00
Ralph Castain	9eab9a1ed3	Remove stale global variables Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers. Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation). Begin first cut at memory profiler Some minor cleanups of memprobe Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-02 14:04:24 -08:00
Ralph Castain	e8aea2ebfc	Minor cleanups Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-30 16:19:42 -08:00
Ralph Castain	08c76a42bb	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org> Plug a minor memory leak. Tell the PMIx server not to create a dstore memory region for the daemon job as there is nobody to share it with. Signed-off-by: Ralph Castain <rhc@open-mpi.org> Protect users of hwloc membind functions Signed-off-by: Ralph Castain <rhc@open-mpi.org> Update PMIx to include NULL string protection Signed-off-by: Ralph Castain <rhc@open-mpi.org> Update to PMIx master to include key overwrite protection Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-30 12:44:47 -08:00
Ralph Castain	fe68f23099	Only instantiate the HWLOC topology in an MPI process if it actually will be used. There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced: * shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string instead of the topology itself, if available, thus avoiding instantiating the topology * openib BTL. This uses the distance matrix. At present, I haven't developed a method for replacing that reference. Thus, this component will instantiate the topology * usnic BTL. Uses the distance matrix. * treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate the topology * ess base functions. If a process is direct launched and not bound at launch, this code attempts to bind it. Thus, procs in this scenario will instantiate the topology Note that instantiating the topology on complex chips such as KNL can consume megabytes of memory. Fix pernode binding policy Properly handle the unbound case Correct pointer usage Do not free static error messages! Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-29 10:33:29 -08:00
Ralph Castain	3a2d6a5ab6	Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-28 09:14:26 -08:00
Ralph Castain	d3aa3777f3	Per @jsquyres: avoid mangling user-provided CFLAGS by using the new PMIX_FLAGS_UNIQ autoconf script in place of PMIX_UNIQ Refs #2636 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-27 09:00:59 -08:00
Gilles Gouaillardet	22db1d36b6	pmix2x: silence misc warnings Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-26 13:35:17 +09:00
Nysal Jan K A	19e3be31e5	Merge pull request #2421 from nysal/master mpit: Fix MPI_T_pvar_get_index	2016-12-22 15:33:51 +05:30
Gilles Gouaillardet	54c84196a6	btl/vader: plug a memory leak as reported by Coverity with CID 1362691 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-22 16:04:36 +09:00
Nysal Jan K.A	25ba507ada	mpit: Fix MPI_T_pvar_get_index MPI_T_pvar_get_index was returning an incorrect index. The index was never set correctly while registering the performance variables. Additionally fix a missing case in the mca_base_var_type_t to MPI datatype conversion. This type is currently used for control variables registered by mxm, fca and hcoll components. Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>	2016-12-22 12:30:21 +05:30
Jeff Squyres	3571c3c5bb	hwloc external: minor fixes to `9649c44` - Fix capitalization typos - Make comment more correct / flow better - Use AM_CPPFLAGS, not DEFAULT_INCLUDES - Remove extra "hwloc/" from external hwloc.h specification Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-12-21 09:06:24 -08:00
Gilles Gouaillardet	9649c44fa0	hwloc: correctly handle --with-hwloc=external - simply #include "hwloc.h" to use the external hwloc header - do use the external hwloc header instead of opal/mca/hwloc/hwloc.h Thanks Orion Poplawski for the report Fixes open-mpi/ompi#2616 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-21 11:58:10 +09:00
Ralph Castain	ea133206ec	Sync the internal OMPI component to PMIx master Update external PMIx v2.x component Add missing Makefile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 19:14:16 -08:00
Ralph Castain	256b5adac5	Transfer across final fixes from debugger attach work Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 00:34:27 -08:00
Ralph Castain	c6f6f40529	Transfer debugger support changes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-17 18:14:46 -08:00
rhc54	54c4925f3f	Merge pull request #2598 from rhc54/topic/debugger Transfer back changes from debugger attach work	2016-12-17 13:09:38 -08:00
Nathan Hjelm	16a2f09cd5	Merge pull request #2596 from hjelmn/x86_rtdtsc opal/timer: add code to check if rdtsc is core invariant	2016-12-17 11:14:49 -07:00
Ralph Castain	269753f5c1	Transfer back changes from debugger attach work Silence warning Remove debug Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-17 10:00:52 -08:00
Ralph Castain	215d6290e0	Add a flux component for LLNL Fine tuning of flux component Fix a few minor issues with the initial cut: * Job id could be obtained from the PMI kvsname like SLURM, but simpler to getenv (FLUX_JOB_ID) * Flux pmi-1 doesn't define PMI_BOOL, PMI_TRUE, PMI_FALSE * Flux pmi-1 maps the deprecated PMI_Get_kvs_domain_id() to PMI_KVS_Get_my_name() internally, so just call that instead. * Drop residual slurm references. Add wrappers for PMI functions so that if HAVE_FLUX_PMI_LIBRARY is not defined, the component can dlopen libpmi.so at location specified by the FLUX_PMI_LIBRARY_PATH env variable, which adds flexibility. If HAVE_FLUX_PMI_LIBRARY is defined, link with libpmi.so at build time in the usual way. Update configury for flux component Update m4 so the configure options work as follows: --with-flux-pmi Build Flux PMI support (default: yes) --with-flux-pmi-library Link Flux PMI support with PMI library at build time. Otherwise the library is opened at runtime at location specified by FLUX_PMI_LIBRARY_PATH environment variable. Use this option to enable Flux support when building statically or without dlopen support (default: no) If the latter option is provided, the library/header is located at build time using the pkg-config module 'flux-pmi'. Otherwise there is no library/header dependency. Handle the case where ompi is configured with --disable-dlopen or --enable-static. In those cases, don't build the component unless --with-flux-pmi-library is provided. It is fatal if the user explicitly requests --with-flux-pmi but it cannot be built (e.g. due to --disable-dlopen). Add a schizo/flux component Update schizo/flux component Eliminate slurm-specific usage cases. Since the module is only loaded if FLUX_JOB_ID is set, there are only two cases to handle: 1) App was launched indirectly through mpirun. This is not yet supported with Flux, but hook remains in case this mode is supported in the future. 2) App was launched directly by Flux, with Flux providing CPU binding, if any. Fix up white space in pmix/flux component Drop non-blocking fence from pmix:flux component The flux PMI-1 library is not thread safe, therefore register a regular blocking fence callback instead of the thread-shifting fencenb(). pmix/flux component avoids extra PMI_KVS_Gets Keys stored into the base cache under the wildcard rank are not intended to be part of the global key namespace. These keys therefore should not trigger a PMI_KVS_Get() if they are not found in the cache. Minor pmix/flux component cleanup pmix/flux: drop code for fetching unused pmix_id pmix/flux: err_exit must return error Problem: in flux_init(), although 'ret' (variable holding err_exit return code) is initialized to OPAL_ERROR, the variable is reused as a temporary result code, so if there are some successes followed by a failure that doesn't set 'ret', flux_init() could return success with PMI not initialized. Ensure that a "goto err_exit" returns OPAL_ERROR if 'ret' is not set to some other error code. pmix/flux: don't mix OPAL_ and PMI_ return codes Problem: flux_init() can return both PMI_ and OPAL_ return codes. Although OPAL_SUCCESS and PMI_SUCCESS are both defined as 0, other codes are not compatible. Ensure that flux_init() consistently uses 'rc' for PMI_ return codes and 'ret' for OPAL_ return codes. pmix/flux: factor out repeated code for cache put Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-16 18:26:38 -08:00
Nathan Hjelm	a718743a5c	opal/timer: add code to check if rdtsc is core invariant Newer x86 processors have a core invariant tsc. On these systems it is safe to use the rdtsc instruction as a monotonic timer. This commit adds a new function to the opal timer code to check if the timer backend is monotonic. On x86 it checks the appropriate bit and on other architectures it parrots back the OPAL_TIMER_MONOTONIC value. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-12-16 15:11:50 -07:00
rhc54	00b87ea829	Merge pull request #2584 from rhc54/topic/warnings Reduce the flood of warnings due to uninitialized variables, mismatch…	2016-12-15 10:09:01 -08:00
Ralph Castain	585540bcee	Reduce the flood of warnings due to uninitialized variables, mismatched types, and unused things to a more bearable trickle Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 16:33:50 -08:00
Gilles Gouaillardet	a019095b84	pmix2x/class: correctly handle concurrent class initialization (back-ported from upstream commit pmix/master@ceedbd67fd) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-15 09:07:24 +09:00
Ralph Castain	884fb7fcf2	Update the PMIx2 support to include the latest shared memory optimizations Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn Update to track master Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 15:00:10 -08:00
Ralph Castain	1961a1c22a	Use the server tmpdir instead of the system tmpdir for tool contact files Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 08:42:09 -08:00
Ralph Castain	fbed2d794a	Update to latest PMIx master + PTL branch Update the usock component to disable it Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 20:47:44 -08:00
Ralph Castain	d51821cbc7	Update pmix check headers to support Open BSD Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 19:37:06 -08:00
Ralph Castain	f91f8ce494	Protect against NULL param Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 21:19:32 -08:00
Ralph Castain	f633d5c1b1	Initialize var Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 21:08:45 -08:00
rhc54	003f7d308f	Merge pull request #2504 from hppritcha/topic/fix_pmix_base_help_file pmix: Fix pmix base help file.	2016-12-03 11:57:07 -08:00
Ralph Castain	a43fae74a5	Avoid hanging in show_help if PMIx was unable to initialize Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 10:12:58 -08:00
Ralph Castain	a4e3f615e3	Fix executable modes of pmix config scripts Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 20:50:59 -08:00
Ralph Castain	342dbfcf4e	Update Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 20:33:03 -08:00
Ralph Castain	8e64382edf	Update to correct tarball Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 19:16:07 -08:00
Ralph Castain	1a0bccb536	Now that PMIx has settled on its release strategy and numbering, update the OPAL pmix framework to track Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 15:44:43 -08:00
Howard Pritchard	3049848731	Fix pmix base help file. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-12-02 15:03:22 -06:00
Ralph Castain	6041467df0	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-01 14:47:44 -08:00
Gilles Gouaillardet	3a76a78bff	btl/openib: plug a memory leak in btl_openib_register_mca_params() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-01 14:24:30 +09:00
Gilles Gouaillardet	c9aeccb84e	opal/if: open the if framework once in opal_init_util The if framework is no longer opened in opal_if*, which plugs several memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-01 14:24:30 +09:00
Gilles Gouaillardet	45732fd764	hwloc/base: fix a memory leak in buffer_cleanup() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-01 14:24:29 +09:00
Gilles Gouaillardet	2739346a18	opal: invoke mca_base_close() in opal_finalize_util() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-01 14:24:29 +09:00
Howard Pritchard	7ce3ca25ef	Merge pull request #2458 from hppritcha/topic/pmix_cray_no_dlopen pmix/cray: abort job if using aprun for general case	2016-11-25 14:03:20 -07:00
Howard Pritchard	eee9f7ae3a	pmix/cray: abort job if using aprun for general case It turns out that there is an incompatibility between the Cray PMI library and the default configuration for building Open MPI (master). To work around this, we now disable use of aprun for direct launch of Open MPI jobs except under specific conditions. The problem is that there are now (on master) packages getting initialized that do not work properly across a fork operation. As part of a constructor in the Cray PMI library, a fork operation is done to simplify use of shared memory between the processes in a job on the same node. This ends up thoroughly messing up the Open MPI initialization process in the case that dlopen support is enabled. The initialization process gets about half-way through when the PMIX framework is opened and components are loaded, which triggers the Cray PMI constructor and hence the fork operation. There are two workarounds for this: 1) configure Open MPI for Cray XE/XC systems using aprun with the --disable-dlopen option 2) set the PMI_NO_FORK environment variable in the shell in which the aprun command is run. Without taking these measures, an Open MPI job will just hang at job startup in the first attempt to "thread-shift" the PMIx fence_nb operation. Additional hangs occur at shutdown if this problem is worked around, again due to the insertion of a fork operation halfway through the Open MPI initialization procedure. This commit detects if the conditions that bring out the hang situation are present, and if so, prints out a message and aborts the job launch. Note on systems using slurm, the PMI_NO_FORK environment variable is set as part of the srun job launch, hence this issue is avoided on those systems. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-25 06:28:19 -07:00
Gilles Gouaillardet	8fd1c3f0df	opal/util: handle a race condition in opal_os_dirpath_destroy A file might have been destroyed by another task between readdir() and stat(), so simply ignore stat() failure. That typically occurs when one task is removing the job_session_dir and another task is still removing its proc_session_dir. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-24 10:45:48 +09:00
Gilles Gouaillardet	1a279c4ee9	btl/self: fix fragment segment length in mca_btl_self_prepare_src() opal_convertor_pack() might pack fewer bytes than requested, so always set frag->segments[0].seg_len. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-24 10:44:56 +09:00
Ralph Castain	1e2019ce2a	Revert "Update to sync with OMPI master and cleanup to build" This reverts commit `cb55c88a8b`.	2016-11-22 15:03:20 -08:00
Ralph Castain	cb55c88a8b	Update to sync with OMPI master and cleanup to build Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-22 14:24:54 -08:00
Howard Pritchard	2bb4cffffd	Merge pull request #2447 from hppritcha/topic/compiler_warning_swats btl/ugni:vader swat some compiler warnings	2016-11-22 05:28:20 -07:00
Howard Pritchard	09f47fcf8e	btl/ugni:vader swat some compiler warnings Swat some compiler warnings. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-21 14:58:34 -06:00
Howard Pritchard	2cbc0e8472	pmix/cray: fix disable-dlopen problem PR open-mpi/ompi#2432 introduced a regression where configure and build with --disable-dlopen caused build failure owing to unresolved alps lli symbols in the libopen-pal shared library. This commit fixes this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-21 13:45:10 -06:00
Howard Pritchard	0bbb319246	Merge pull request #2444 from hppritcha/topic/cray_pmix_ws_cleanup pmix/cray: whitespace cleanup	2016-11-21 06:03:56 -07:00
Howard Pritchard	08dce4f161	pmix/cray: whitespace cleanup Get rid of tabs. This is anti-ompi style. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-18 19:30:40 -07:00
Howard Pritchard	44f4663d0d	Merge pull request #2432 from hppritcha/topic/fix_info_etc_w_alps pmix/cray: set some envars for MPI_INFO_ENV object	2016-11-18 14:23:53 -07:00
Howard Pritchard	de3de131af	pmix/cray: set some envars for MPI_INFO_ENV object Enhance the cray pmix component to set some OMPI internal env. variables used to set some key/value pairs on the MPI_INFO_ENV object. This allows more of the ompi-tests ibm unit tests to pass when using aprun/srun direct launch and Cray PMI. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-16 17:37:52 -06:00
Joshua Ladd	4907085c6f	Add the ConnectX-5 device ID to openib BTL. Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>	2016-11-16 21:42:37 +02:00
Gilles Gouaillardet	8ef538adeb	Merge pull request #2398 from bosilca/topic/tcp_endpoints_mutex Protect the tcp_endpoints list from concurrent accesses.	2016-11-14 22:13:29 -07:00
Howard Pritchard	703b464c03	pmix: fix a typo in a help file Fixes #2391 Thanks to @njoly for reporting Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-12 11:49:15 -07:00
George Bosilca	d0dddef53d	Protect the tcp_endpoints list from concurrent accesses. Thanks Gilles for your help. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2016-11-11 00:06:03 -05:00
Gilles Gouaillardet	a49422fe84	btl/tcp: get rid of the MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD macro since pthreads are now mandatory, the MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD is always true and hence can be safely removed Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-08 14:00:05 +09:00
Gilles Gouaillardet	11dc86f26b	cleanup: always #include <pthread.h> pthreads are now mandatory, so there is no more need to Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-08 13:07:45 +09:00
Aboorva Devarajan	fb8e074583	powerpc: Add support for powerpcle in timer/pstat. Signed-off-by: Aboorva Devarajan <abodevar@in.ibm.com>	2016-11-07 02:35:44 -05:00
Gilles Gouaillardet	7a2894f1e0	event/libevent2022: cleanup dependencies to the embedded libevent lib configury force event/libevent2022 to be built as a static module, so simplify Makefile.am Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-07 14:57:52 +09:00
Gilles Gouaillardet	6f7ed1f552	event/libevent2022: add missing dependencies to the embedded libevent lib force the libevent2022 component rebuild if the embedded libevent is updated Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-04 11:13:44 +09:00
Jeff Squyres	a4ffa590c8	Merge pull request #2308 from hjelmn/vader_mem btl/vader: reduce memory footprint when using xpmem	2016-11-02 10:28:26 -04:00
Steve Wise	7050969d47	openib btl: remove BTL_OPENIB_FAILOVER_ENABLED code Remove BTL_OPENIB_FAILOVER_ENABLED code in the openib btl source. Remove the failover-specific files from the openib btl. Update the openib/Makefile.am accordingly. Remove the -enable-openib-failover config logic. Signed-off-by: Steve Wise <swise@opengridcomputing.com>	2016-11-01 14:45:36 -07:00
Ralph Castain	64873487b4	Remove the max_connections parameter from the radix component as it is confusing. Modify PMIx client init so that it simply returns the nspace/rank if called by a server - this allows the server to retrieve its assigned ID. Register the server's nspace so client-side operations can succeed Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-01 12:17:11 -07:00
Jeff Squyres	149b660666	btl/usnic: fix compiler warning Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-10-28 07:36:20 -07:00
rhc54	698dac108b	Merge pull request #2309 from rhc54/topic/pmix Update to latest PMIx master - mostly updates example codes, but incl…	2016-10-27 14:21:46 -07:00
Ralph Castain	f4a55118e6	Update to latest PMIx master - mostly updates example codes, but includes one critical cleanup during finalize. NOTE: set dstore shared memory storage option to "on" by default Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-27 11:16:03 -07:00
Nathan Hjelm	9d92075e60	btl/self: rewrite to decrease memory usage (#2307) This commit rewrites much of the btl/self component to fix a long-standing memory usage bug. Before this commit the prepare_src path would always allocate a max send fragment (256kB). This caused the rank to allocate 32 * 256k useless buffers from one send. This commit makes the following changes: - Add the MCA_BTL_FLAGS_GET flag by default. No reason not to set it. - Reduce the eager limit, max send size, buffers per allocation, and maximum buffer count per fragment size. These changes should have no noticeable effect on performance but should greatly reduce the memory usage of the component. - Implement the sendi function. This should reduce self send latency somewhat. - Rewrite prepare_src to never allocate an eager or max send fragment for contiguous data. - add_procs needs to return something in the peer array for the proc self, not just set the reachability bit. Now stores (void *) 1. - Various cleanups. Removed an unused file. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-10-27 12:34:54 -04:00
Nathan Hjelm	a652a193ea	btl/vader: reduce memory footprint when using xpmem The vader btl kept a per-peer registration cache to keep track of attachments. This is not really a problem with small numbers of local ranks but can be a problem with large SMP machines. To reduce the footprint there is now one registration cache for all xpmem attachments. This will probably increase the lookup time for large transfers but is a worthwhile trade-off. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-10-27 10:09:43 -06:00
Ralph Castain	f298f294e1	Update PMIx to latest master tarball. Ensure we set the HNP name for orted's so that PMIx_Lookup can find the server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-26 15:48:56 -07:00
Gilles Gouaillardet	8cc3f288c9	opal: fix opal_class_finalize() usage the class system can be initialized/finalized as many times as we like, so there is no more need to have opal_class_finalize() invoked in a destructor Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-26 15:15:54 +09:00
Gilles Gouaillardet	f2a80dc09f	configury: check libnl version and abort in case of conflict libnl and libnl-3 are known to conflict with each other, so detect and abort if these two libs are both used directly (e.g. Open MPI uses libnl-3) or indirectly (e.g. libibverbs.so might depend on libnl)	2016-10-25 09:23:59 +09:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
rhc54	900ae15d49	Merge pull request #2221 from bharatpotnuri/master btl/openib: remove unwanted ompi header inclusion in opal code.	2016-10-21 14:05:55 -05:00
Ralph Castain	9131eca9c6	Update to latest PMIx master	2016-10-20 21:13:40 -07:00
Ralph Castain	be3197fe27	Ensure that the libevent headers are installed for external libevent when --with-devel-headers is given. Correct the path for opal_config.h in the external hwloc header	2016-10-20 20:57:50 -07:00
Ralph Castain	2f966bf3bf	Cleanup external PMIx v3 component for copy/paste errors - component and module require unique names	2016-10-20 09:11:46 -07:00
Ralph Castain	8113a8d1b0	Now that we are hiding symbols in the internal PMIx component, we cannot reuse that component for integration to the external PMIx master as the symbols don't match. So create a new "ext3x" component and copy the PMIx v3 integration over there. Also, remove a couple of build-product files from the pmix3x component.	2016-10-18 13:15:32 -07:00
Ralph Castain	50c9f3de55	Ensure the PMIx progress thread is stopped prior to tearing anything down. Thanks to Gilles for spotting this error!	2016-10-18 00:27:52 -07:00
Gilles Gouaillardet	4e19cd51b1	hwloc/external: add a missing include file	2016-10-14 09:27:33 +09:00
Ralph Castain	6f65d0a173	Repair event notification support. Cleanup the long-suffering "epoll: warning" coming out of libevent whenever a process abnormally terminated. Add changes to test program Sync to PMIx master	2016-10-13 16:27:39 -07:00
Ralph Castain	6417f217e1	Turn PMIx dstore off by default as MTT was effectively broken	2016-10-13 08:14:51 -07:00
Potnuri Bharat Teja	29f1aa836f	btl/openib: remove unwanted ompi header inclusion in opal code. OMPI header cannot be included in OPAL source code, hence removed it. Fixes: (`740b636db`) btl/openib: Disqualify rdmacm CPC if MPI_THREAD_MULTIPLE. Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>	2016-10-13 16:21:36 +05:30
Nathan Hjelm	5b40fd267f	Merge pull request #2204 from hjelmn/arm64 asm/arm64: ensure instruction ordering on timer	2016-10-12 11:22:28 -06:00
Nathan Hjelm	9a50ce6364	asm/arm64: ensure instruction ordering on timer Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-10-12 09:25:21 -06:00
Ralph Castain	8f05beb1ec	Sync pmix/master@cb53105	2016-10-11 20:54:59 -07:00
rhc54	ad156e3e91	Merge pull request #2207 from rhc54/topic/pmixupdate Update PMIx support to latest PMIx master	2016-10-11 18:57:11 -05:00
Jeff Squyres	bcbf0bc4f9	usnic: s/OMPI/OPAL/ Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-10-11 16:43:35 -07:00
Ralph Castain	6ce4b6d098	Eliminate -Wall from being hardcoded	2016-10-11 12:50:31 -07:00
Ralph Castain	1859b03416	Enable PMIx shared memory support by default	2016-10-11 12:18:01 -07:00
Ralph Castain	1d7d7c201b	Update PMIx support to latest PMIx master	2016-10-11 10:17:23 -07:00
Ralph Castain	5b1484a836	Implement the backend support for process-generated event notification	2016-10-08 09:24:28 -07:00
Gilles Gouaillardet	0d24fad307	opal: always run opal_class_finalize in the opal_cleanup destructor if MPI_Init[_thread]/MPI_Finalize and MPI_T_init_thread/MPI_T_finalize are balanced, opal_initialized is zero, and hence the opal_cleanup destructor never invokes opal_class_finalize. if neither MPI_Init[_thread] nor MPI_T_init_thread has been called, classes is NULL, so opal_class_finalize does nothing	2016-10-08 16:58:20 +09:00
Gilles Gouaillardet	b55dd2442a	libevent2022: rename _event_strlcpy	2016-10-08 16:58:20 +09:00
Gilles Gouaillardet	c92e9a5406	use the new OPAL_HASH_TABLE_FOREACH convenience macro	2016-10-08 16:58:20 +09:00
Gilles Gouaillardet	23a8f764bd	opal: add the OPAL_HASH_TABLE_FOREACH macro this is a convenience macro similar to the OPAL_LIST_FOREACH macro, that can be used to iterate on all the key/value pairs of an opal_hash_table_t	2016-10-08 16:58:20 +09:00
Gilles Gouaillardet	014f917462	opal: fix comment in OPAL_LIST_FOREACH macro. no code change.	2016-10-08 16:58:19 +09:00
Gilles Gouaillardet	f1f1fb15eb	pmix3x: configury: output major, minor and release version after checking them and hence fix the configure output (back-ported from upstream commit pmix/master@7b7cdda2de)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	f3af799608	pmix3x: misc fixes to get pmix to build on Solaris - replace MAXHOSTNAMELEN with hardcoded 1024. unlike Linux, Solaris #defines MAXHOSTNAMELEN in <netdb.h>, so use a hard coded value to keep the test simple - stdout cannot be assigned on Solaris, so use freopen instead (back-ported from upstream commit pmix/master@a63f6e53f4)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	5cbfddb8f1	pmix3x: fix misc memory leaks (back-ported from upstream commit pmix/master@1eff526929)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	b4e4e4a5f1	pmix3x: enhance pmix_nspace_t destructor PMIX_RELEASE all elements stored in the internal and modex hash tables (back-ported from upstream commit pmix/master@b90674fc52)	2016-10-08 13:01:27 +09:00
Gilles Gouaillardet	f1dc033767	pmix3x: add the PMIX_HASH_TABLE_FOREACH macro this is a convenience macro similar to the PMIX_LIST_FOREACH macro, that can be used to iterate on all the key/value pairs of a pmix_hash_table_t (back-ported from upstream commit pmix/master@349971c68c)	2016-10-08 13:01:27 +09:00
Jeff Squyres	67684be7c9	usnic: fix one last stray fabric_attr->name --> linux_device_name Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-10-04 18:17:38 -07:00
Jeff Squyres	8b77359cac	usnic: remove some legacy libfabric 1.0/1.1 code We only support running with libfabric v1.3 or greater. So it's safe to remove the legacy/adaptive cq_readerr() behavior. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-10-03 11:59:41 -07:00
Jeff Squyres	345c07a252	usnic: require libfabric >= v1.3 at run time There are critical usnic libfabric AV insert bugs before v1.3, so don't allow any version prior to v1.3 at run time (still allow compiling with earlier versions, though, since the ABI guarantees allow us to compile with an earlier libfabric and run with a later libfabric). Switch to using fi_version() to check the version (instead of calling fi_getinfo()) as a potentially lighter-weight / simpler solution. This allows us to only call fi_getinfo() once. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-10-03 11:59:41 -07:00

4632 commits