1
1
Граф коммитов

3507 Коммитов

Автор SHA1 Сообщение Дата
Sergey Oblomov
1223b05811 MCA/COMMON/UCX: fixed build scripts
- updated evaluation of UCX lib - used call from UCX v1.3
- updated makefile compilation flags

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-27 11:10:25 +03:00
Thananon Patinyasakdikul
be76896f7c btl/ofi: progress now happens after a threshold.
This commit changed the way btl/ofi call progress. Before, we force
progression with every rdma/atomic call. This gives performance boost in
some case and slow down on others. Now we only force progression after
some number of rdma calls which result in better performance overall.

Also added new MCA parameter 'mca_btl_ofi_progress_threshold' to set
the threshold number. The new default is 64.

Also:
Added FI_DELIVERY_COMPLETE to tx_rtx flags to ensure that the completion
is generated after the message has been received on the remote side.

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-06-26 10:39:45 -07:00
Nathan Hjelm
b0ac6276a6 btl/ugni: improve multi-threaded RDMA performance
This commit improves the injection rate and latency for RDMA
operations. This is done by the following improvements:

 - If C11's _Thread_local keyword is available then always use the
   same virtual device index for the same thread when using RDMA. If
   the keyword is not available then attempt to use any device that
   isn't already in use. The binding support is enabled by default but
   can be disabled via the btl_ugni_bind_devices MCA variable.

 - When posting FMA and RDMA operations always attempt to reap
   completions after posting the operation. This allows us to
   better balance the work of reaping completions across all
   application threads.

 - Limit the total number of outstanding BTE transactions. This
   fixes a performance bug when using many threads.

 - Split out RDMA and local SMSG completion queue sizes. The RDMA
   queue size is better tuned for performance with RMA-MT.

 - Split out put and get FMA limits. The old btl_ugni_fma_limit MCA
   variable is deprecated. The new variable names are:
   btl_ugni_fma_put_limit and btl_ugni_fma_get_limit.

 - Change how post descriptors are handled. They are no longer
   allocated seperately from the RDMA endpoints.

 - Some cleanup to move error code out of the critical path.

 - Disable the FMA sharing flag on the CDM when we detect that there
   should be enough FMA descriptors for the number of virtual devices
   we plan will create. If the user sets this flag we will not unset
   it. This change should improve the small-message RMA performance by
   ~ 10%.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-26 11:31:35 -06:00
Ralph Castain
0ddbc75ce5
Merge pull request #4930 from kizill/fix-ipv6
fixed ipv6 OOB connection problems (fix issue #1585)
2018-06-26 09:13:53 -07:00
Nathan Hjelm
abb87f9137
Merge pull request #5338 from ggouaillardet/topic/uct
btl/uct: misc fixes
2018-06-26 08:56:40 -06:00
Yossi Itigin
ee873f4f79
Merge pull request #5322 from hoopoepg/topic/mca-ucx-common
MCA/UCX: added common module
2018-06-26 13:54:12 +03:00
Gilles Gouaillardet
b40b835a70 btl/uct: remove debug code
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-06-26 16:03:16 +09:00
Gilles Gouaillardet
552d0809aa btl/uct: add missing include file
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-06-26 14:53:02 +09:00
Nathan Hjelm
6c089518e7 btl/uct: make uct endpoints array a flexible array member
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-25 18:14:58 -06:00
Nathan Hjelm
c5c5b42307 btl: add a new btl for the UCT layer in OpenUCX
This commit adds a new btl for one-sided and two-sided. This btl
uses the uct layer in OpenUCX. This btl makes use of multiple uct
contexts and per-thread device pinning to provide good performance
when using threads and osc/rdma. This btl has been tested extensively
with osc/rdma and passes all MTT tests on aries and IB hardware.

For now this new component disables itself but can be enabled by
setting the btl_ucx_transports MCA variable with a comma-delimited
list of supported memory domains/transport layers. For example:
--mca btl_uct_memory_domains ib/mlx5_0. The specific transports used
can be selected using --mca btl_uct_transports. The default is to use
any available transport.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-25 18:14:58 -06:00
Sergey Oblomov
bf7fd480e9 MCA/COMMON/UCX: added non-blocking implementations of atomics
- added implementation of swap/cswap/fadd operations
- blocking add64 is replaced by non-blocking routine

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-25 12:25:31 +03:00
Sergey Oblomov
63e7ba6843 MCA/COMMON/UCX: added parameter for UCX/opal progress
- added parameter to set UCX/opal progresses
- minor refactoring of request wait routines

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-25 11:00:12 +03:00
Jeff Squyres
3767ce27c0 btl/tcp: trivial whitespace clean
No code/logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-23 08:04:12 -07:00
Jeff Squyres
9034717876 btl/tcp: use a hash map for kernel IP interface indexes
The giant size of the TCP proc struct is causing a problem in some
environments (because it is allocated on the stack), and it was too
big, anyway.

Instead, use a hash map.  That way, it starts small and can grow if it
needs to.  It also makes no assumptions about the values of the kernel
interface indexes.

Fixes #5292.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-23 08:03:30 -07:00
Jeff Squyres
e3d6c5ce3a pmix3/pmix_server.c: minor compiler warning stomp
Submitted upstream https://github.com/pmix/pmix/pull/776.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-23 06:35:09 -07:00
Sergey Oblomov
d57ae62dee MCA/UCX: added common module
- implemented non-blocking routines for flush operations

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-06-22 16:41:09 +03:00
Jeff Squyres
4603852740 orterun: use consistent CLI option name for --bind-to
Since the new binding option is tied to the --cpu-list orterun CLI
option, make the --bind-to option reflect the same name (vs. the
--cpu-set CLI option, which is entirely different).  For example:

    mpirun --bind-to cpu-list:ordered ...

Note that "--bind-to cpulist:ordered" is accepted as a synonym,
because people will be lazy.

Also add some minor updates to the orterun.1in man page for
clarification.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-21 08:22:00 -07:00
Ralph Castain
f17d47087a Define a new binding method and qualifier
Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-20 21:26:09 -07:00
Ralph Castain
5ac2ce6346 Cover all the PMIx data types
Cover all data types for OPAL-to-PMIx conversion, generating error logs when we hit something we don't support

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-20 09:06:19 -07:00
Boris Karasev
39c9cb12bb pmix/ext2x: fixed detection PMIx v2.0 by pmix component
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-06-20 13:23:51 +03:00
Ralph Castain
08707c9762 Sync to updated PMIx v3.0.0rc
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-19 21:25:43 -07:00
Noah Evans
a64abadf97 Fix mca_base_var_files separator
In the opal list parsing behavior paths should be separated by ':' while files are separated by ','. In the opal and pmix code (the pmix fix is in a separate commit) there was a mistake in the parsing such that files were being separated by ':' when they should be separated by ','s. This commit attempts to address this mismatch.

Signed-off-by: Noah Evans <noah.evans@gmail.com>
2018-06-19 12:59:07 -06:00
Ralph Castain
f0a0d606a0 Correct accounting for tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1be080f7b92bad39745f42628a8cb6afefad2d2a)
2018-06-18 13:24:25 -07:00
Thananon Patinyasakdikul
13f58f3191
Merge pull request #5274 from thananon/ofi_sep
btl/ofi: add scalable endpoint support.
2018-06-18 08:41:06 -07:00
Jeff Squyres
266d5b2110
Merge pull request #5277 from jsquyres/pr/cygwin-patch
external libevent: fix for Cygwin
2018-06-18 11:08:30 -04:00
Ralph Castain
7981818b84 Update PMIx atomics
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-17 10:03:49 -07:00
Ralph Castain
fa18ba395d Sync to latest PMIx v3.0rc
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-17 02:41:46 -07:00
Ralph Castain
ac7bb15505 Fix other typo in help message
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-16 16:30:52 -07:00
Ralph Castain
8cfce583c0 Correct typo to properly check for PMIx v4
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-16 16:29:05 -07:00
Jeff Squyres
07c8ec6a3c external libevent: fix for Cygwin
Fix from Marco Atzeri for building on Cygwin.

Signed-off-by: Marco Atzeri <marco.atzeri@gmail.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-16 09:08:58 -07:00
Ralph Castain
e2e6da4379
Merge pull request #5258 from rhc54/topic/pmix4
Sync to PMIx v3.0rc and add ext4x
2018-06-15 09:41:08 -07:00
Thananon Patinyasakdikul
dae3c9447c btl/ofi: add scalable endpoint support.
This commit add support for scalable endpoint to enhance multithreaded
application performance. The BTL will detect the support from ofi
provider and will fallback to normal usage of scalable endpoint is not
supported.

NEW MCA parameters:
- mca_btl_ofi_disable_sep: force the btl to not use scalable endpoint.
- mca_btl_ofi_num_contexts_per_module: number of communication context
  to create (should be the same as number of thread).

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-06-14 15:44:29 -07:00
Howard Pritchard
7dcab6e4a4
Merge pull request #5269 from hppritcha/topic/squash_gcc7.3.0_warnings
topo/treematch - quash compiler warning
2018-06-13 21:13:04 -05:00
Howard Pritchard
64de269cc3 topo/treematch - quash compiler warning
quash a compiler warning showing up with gcc 7.3

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-06-13 16:34:17 -05:00
Thananon Patinyasakdikul
390d72addd
Merge pull request #4885 from davideberius/spc_pr
Initial Software-based Performance Counters PR
2018-06-12 14:04:49 -07:00
Jeff Squyres
591b225527
Merge pull request #5245 from PeterGottesman/hwloc-fix
Ensure required hwloc directories are in dist tarballs
2018-06-12 13:08:35 -04:00
Peter Gottesman
afe363b73b Ensure required hwloc directories are in dist tarballs
Fixes MTT failure when running autogen on a tarball

Signed-off-by: Peter Gottesman <pgottesm@cisco.com>
2018-06-12 08:33:50 -07:00
Howard Pritchard
b43f94895f
Merge pull request #5261 from thananon/ofi_new_mr
btl/ofi: change required mr mode bits.
2018-06-11 20:54:09 -06:00
David Eberius
d377a6b6f4 Added Software-based Performance Counters driver code along with several counters.
This code is the implementation of Software-base Performance Counters as described in the paper 'Using Software-Base Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf).  More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.

All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined.  The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper.  Added a --with-spc configure option to enable SPCs in the Open MPI build.  This defines SOFTWARE_EVENTS_ENABLE.  Added an MCA parameter, mpi_spc_enable, for turning on specific counters.  Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize.  Added an SPC test and example.

Signed-off-by: David Eberius <deberius@vols.utk.edu>
2018-06-11 22:48:16 -04:00
Thananon Patinyasakdikul
b6e07bfac6 btl/ofi: change required mr mode bits.
FI_MR_UNSPEC is not supposed to be used beyond ofi version 1.5. This
commit replaces FI_MR_UNSPEC with the new FI_MR_BASIC mode bits
(FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR).

The btl functionality remains the same.

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-06-11 09:25:57 -07:00
Ralph Castain
48f27655a6 Sync to PMIx v3.0rc and add ext4x
Sync to the draft rc for PMIx v3.0. Add an external component for PMIx master, which is at v4.0

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-11 05:54:23 -07:00
Nicolas Morey-Chaisemartin
038ceb0b61 btl: openib: Add support for Broadcom Cumulus RoCE HCA
Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
2018-06-07 15:16:58 +02:00
Nicolas Morey-Chaisemartin
d3416345ab btl: openib: add support for QLogic RoCE HCA
Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
2018-06-07 15:16:41 +02:00
Arm
ce4d769aee
Merge pull request #5235 from thananon/ofi_einr_fix
btl/ofi: handles FI_EINTR properly.
2018-06-06 11:30:48 -07:00
Ralph Castain
840fb42f93 PMIx rte component does support dynamics
Minor cleanups

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-05 21:55:19 -07:00
Thananon Patinyasakdikul
f34a73af70 btl/ofi: handles FI_EINTR properly.
OFI sockets provider will sometime return FI_EINTR. Apparently this flag
does not exist in OFI version 1.5. This commit added an ifdef check.

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-06-05 18:40:00 -07:00
Nathan Hjelm
8192eb9aa6 patch/overwrite: add support for aarch64
This commit adds patcher support of aarch64. The current
implementation uses mov, movk, and br to perform the jump. This uses 5
instructions.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-04 16:04:26 -06:00
Arm
555029b323
Merge pull request #5216 from thananon/btl_ofi
btl/ofi: the component now compiles with libfabric < 1.5.
2018-06-04 13:47:00 -07:00
Thananon Patinyasakdikul
c18cf7315f btl/ofi: configury: will not build if OFI ver < 1.5.
This commit fixes problem where users have libfabric version < 1.5
installed and build failed because the new MR modebits does not exist.

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-06-04 10:35:50 -07:00
Ralph Castain
014bb3c8de Fix external hwloc builds
Remove spurious comma in header file definition. Remove unused variables

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-03 11:24:21 -07:00
Ralph Castain
55ac526a67 Enable the PMIx ompi/rte component
Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using:

* autogen --no-orte

* with external libevent, external hwloc, and external PMIx master

* configuring PMIx master with the same libevent and hwloc

* execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun

Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)
2018-06-03 07:25:12 -07:00
Ralph Castain
4ceaed3a76 Merge https://github.com/bgoglin/ompi into topic/rte2 2018-06-03 07:24:12 -07:00
Howard Pritchard
623e36de8a
Merge pull request #5205 from thananon/btl_ofi
new btl/ofi: RDMA only btl using libfabric.
2018-06-02 13:00:50 -06:00
Thananon Patinyasakdikul
6efb8069ec new btl/ofi: RDMA only btl using libfabric.
This commit added new transport layer to be used with osc rdma module.
This BTL provides put, get, atomic and fetch atomic operations. It can
be used with multiple hardware vendors as long as they have their
provider under Libfabric and have the right capabilities.

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-06-01 15:22:04 -07:00
Jeff Squyres
fb0473acb5 pmix3x: compiler warning stomp
This fix was already included in pmix upstream (https://github.com/pmix/pmix/commit/fb7af8af2).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-05-30 10:14:37 -07:00
Sylvain Jeaugey
4eb75623ef cuda: add option to remove warning about missing libcuda.
Signed-off-by: Sylvain Jeaugey <sjeaugey@nvidia.com>
2018-05-24 14:56:46 -07:00
Brice Goglin
847f2e9933 opal/hwloc: remove now unused available field from opal_hwloc_obj_data_t
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
b260600450 opal/hwloc: simplify df_search() and make it work with hwloc 2.x NUMA nodes
Don't do a recursive search (hence no need for *idx anymore).
Find the level depth, to hide cache-issues first.
Then iterate over that level to find the objects we want.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
a06fc74664 opal/hwloc: remove an obsolete comment about offlines CPUs etc
Only online/available objects are enabled in OMPI now.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
369a7ea279 opal/hwloc: remove df_search_cores and fix things for hwloc 2.x NUMA nodes
Just iterate over cores inside the given object cpuset.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
0cd0c12111 opal/hwloc: remove min_bound() functions
df_search_min_bound() would need to be fixed for hwloc 2.0,
but it's only used in opal_hwloc_base_find_min_bound_target_under_obj()
which isn't used anymore. So just remove all of them.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
d12ef324c9 hwloc 2.0 doesn't have hwloc/myriexpress.h anymore
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
33ea2f0de4 fix OPAL_HWLOC_WANT_SHMEM management in opal/mca/hwloc/external/external.h
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
bd08a6ead9 hwloc: fix hwloc/shmem.h in the external case
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Jeff Squyres
af4299ebc5 hwloc: updates for hwloc 2.0.x API
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-05-24 11:53:07 +02:00
Brice Goglin
77cc3fcda5 hwloc: update to hwloc 2.0.1
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:52:59 +02:00
bosilca
2ab628b92e
Merge pull request #5074 from bosilca/topic/remove_warnings
Remove warnings identified by clang.
2018-05-15 11:15:23 -04:00
Howard Pritchard
db45d61dfa
Merge pull request #5147 from hppritcha/topic/plug_debug_hole_in_verbs
btl/openib: add conditional around an assert
2018-05-05 08:12:53 -06:00
Howard Pritchard
30eed9f035 btl/openib: addition conditional around an assert
A user trying to build Open MPI with explicit use
of CFLAGS on the make command line hit problems.

This fixes one of the problems.

https://www.mail-archive.com/users@lists.open-mpi.org//msg32241.html

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2018-05-04 14:17:07 -06:00
Ralph Castain
4ff61450a4 Ensure pmix_cleanup finalizes the class system
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-05-04 06:22:36 -07:00
Gilles Gouaillardet
edb8fe8e4b pmix/ext1x: fix index handling when populating an info array
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-26 11:06:43 +09:00
Ralph Castain
f424aa367e Fix external PMIx v1.2.5 support
As @hjelmn and I discussed, this is a little hacky. However, it is the only solution that can be done solely from the OMPI side.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-25 13:42:36 -07:00
Howard Pritchard
0751578cbf
Merge pull request #4949 from hppritcha/topic/memkind_update
mpool/memkind: refactor to use the current API
2018-04-25 07:24:28 -06:00
Howard Pritchard
824197f886 mpool/memkind: refactor to use the current API
The mpool/memkind component was using a deprecated "partitions" API.
This commit refactors the memkind component to make use of the
supported public API.

The public API uses 3 parameters to specify a mpool "kind":

- a memkind type (which for now is just default or HBM)
- a memkind policy
- a memkind_bits (partly to specify pagesize)

The MCA parameters were changed to reflect these memkind
parameters.

Add a make check test for sanity checking of the memkind component.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-04-24 22:11:21 -06:00
Nathan Hjelm
69456c8962
Merge pull request #5042 from Stonesjtu/patch-1
Fix typo
2018-04-16 12:13:35 -06:00
George Bosilca
6ff11267fb
Remove warnings identified by clang.
Plus minor spacing and indentation issues.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-04-14 17:14:12 -04:00
Todd Kordenbrock
55c6918316
Merge pull request #5053 from tkordenbrock/topic/master/btl-portals4.del_proc.fix
master: btl-portals4: don't free module resources when proc count goes to zero
2018-04-12 12:12:34 -05:00
Gilles Gouaillardet
37e7bca867 pmix/ext1x: fix misc build time errors
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-12 14:58:55 +09:00
Kaiyu Shi
b7c5e65d4f Fix typo
Signed-off-by: Kaiyu Shi <skyisno.1@gmail.com>
2018-04-11 10:29:43 +08:00
Todd Kordenbrock
b569633ddf btl-portals4: don't free module resources when proc count goes to zero
This commit fixes a segfault in btl-portals4 add_procs().  The segfault
occurs if add_procs() is called after a del_procs() call that reduces
the proc count to zero which would cause PT and NI resources to be
freed.  This commit resolves the segfault by using a common
initiailization boolean and only freeing module resources in
finalize().

Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
2018-04-10 14:20:22 -05:00
Jeff Squyres
45922c4e81 pmix/base: set PMIx to follow OPAL's mca_component_show_load_errors
Have Open MPI's PMIx component to set PMIx's "show_load_errors" to do
the same thing that Open MPI's "show_load_errors" does.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-10 10:24:35 -07:00
Jeff Squyres
f200b866df btl/tcp: roll back parts of 40afd525f8
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior).  Roll
back these messages to be opal_output_verbose() kinds of messages.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-07 12:28:10 -07:00
Jeff Squyres
a2fc1ace09
Merge pull request #4992 from jsquyres/pr/pmix-version-info-mca-vars
pmix: add "pmix*_library_version" info MCA var
2018-04-04 17:29:06 -04:00
Ralph Castain
cd52ccdb68 Move past the '.' when getting jobstepid
The strtoul function returns the pointer to the first non-digit character, which is a '.' in this case. Calling strtoul at that point will always yield a zero - you have to move past it to get the remaining number

Thanks to Greg Lee for the detailed analysis of the problem.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-04 11:22:38 -07:00
Jeff Squyres
9f472d8a7b pmix: add "pmix*_library_version" info MCA var
Simple MCA vars for ext1, ext2, and pmix3 components to reflect what
the underlying PMIx library version is.  For example:

```
$ ompi_info --param pmix pmix3x --parsable --level 9 | grep
library_version
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:value:PMIx library version 3.0.0 (embedded in Open MPI)
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:source:default
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:status:writeable
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:level:4
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:help:Version of the underlying PMIx library
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:deprecated:no
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:type:string
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:disabled:false
```

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-29 14:21:07 -07:00
Jeff Squyres
8c419294a8 btl/tcp: fix CID 710596
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long.  Don't copy beyond the end of that source buffer.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:22 -07:00
Jeff Squyres
3003be14f3 btl/sm: fix CID 1415105
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:21 -07:00
Jeff Squyres
a17f4afdc7 btl/tcp: fix CID 1416634
Fix resource leak in the TCP BTL.  Also add a little defensive programming.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:21 -07:00
Ralph Castain
e443adc7a1 Reset OMPI master to PMIx master
Track PMIx master instead of the reference server - fixes problem of external PMIx master builds.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-25 08:36:46 -07:00
Jeff Squyres
c3adcb05eb Miscellaneous compiler warnings fixes
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-23 11:45:30 -07:00
Jeff Squyres
023a4a82d3
Merge pull request #4942 from jsquyres/pr/tcp-btl-help-message-updates
TCP help message updates
2018-03-22 08:53:04 -05:00
Jeff Squyres
a15d8233c9
Merge pull request #3434 from dsharma283/pr-3431
ompi/opal: add support for HDR link speeds
2018-03-21 21:57:20 -05:00
Jeff Squyres
40afd525f8 btl/tcp: make error messages more specific
Convert some verbose messages to opal_show_help() messages.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-21 19:34:03 -07:00
Devesh Sharma
90e9b22196 ompi/opal: add support for HDR link speeds
This patch enables to use adapters with HDR speeds.
issue id 3431

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
2018-03-21 19:15:41 -07:00
Howard Pritchard
ade280eb7c
Merge pull request #3292 from markalle/pr/ibv_reg_mr__fork
IB fork
2018-03-21 09:39:08 -06:00
Boris Karasev
85ce76fa36
Merge pull request #4926 from karasevb/pmix_dstore_enable
pmix: dstore returned for direct modex
2018-03-21 14:24:51 +07:00
Stanislav Kirillov
c2bfca19ba
fix ipv6 missing copypaste
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2018-03-21 02:25:04 +03:00
Stanislav Kirillov
86061fbf8d
fixed ipv6 OOB connection problems
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2018-03-20 16:07:53 +00:00
Boris Karasev
dca3dd2ea4 pmix: dstore returned for direct modex
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-03-20 04:56:48 +02:00
Jeff Squyres
cc4bb433bc mpool/memkind: fix typo in partition page sizes
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-19 16:33:08 -05:00