1
1
Граф коммитов

27690 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
ed508010b4 Remove stale tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-18 07:30:47 -07:00
Ralph Castain
79f82f2c6d Merge pull request #4217 from rhc54/topic/dvm
Complete the fix of the ORTE DVM.
2017-09-16 14:53:24 -07:00
Ralph Castain
3c914a7a97 Complete the fix of the ORTE DVM. We will now use "prun" instead of "orterun -hnp foo" to execute jobs. This provides the feature of automatic discovery of the orte-dvm so you don't need to manually enter URI's or contact file locations. All IO is forwarded to prun.
Still in the "needs to be done" category:

* mapping/ranking/binding options aren't correctly supported

* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-16 13:13:07 -07:00
Brian Barrett
bffcc3bca0 util: move graph solver from usnic to util
Cisco wrote a bipartite graph solver to properly solve
interface pair selection for usNIC.  Using the reachable
framework, the TCP BTL (and possibly the runtime network
code) can use the graph solver to make more optimal pair
selection.  Jeff was happy to have the code more broadly
used, but didn't have time to do the move, hence this
commit.

There are a couple of minor changes to the code compared
to the usNIC version.  Obviously, the functions have
been renamed to match naming convention for their new
home.  Since it's easier to write unit tests for
util/ code, the unit tests have been made first class
tests run at "make check" time.  This last bit required
moving some of the definitions into a new header,
bipartite_graph_internal.h, so that they could be
included in both the library code and the test code.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-15 15:08:47 -07:00
Ralph Castain
f69466d633 Merge pull request #4213 from rhc54/topic/dvm2
Backport changes from PMIx reference server
2017-09-14 13:17:53 -07:00
Ralph Castain
7c7d8a69a0 Backport changes from PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-14 11:48:56 -07:00
Nathan Hjelm
0851122cce btl/openib/udcm: add support for connection across subnets
This commit adds the code necessary to support forming connections across
subnets. The primary changes are to 1) add the gid to the modex, and 2)
use the gid to create the address handle.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2017-09-14 06:42:06 -10:00
Ralph Castain
8d336ddcc0 Merge pull request #4209 from rhc54/topic/foobar
Only build prun if building --with-devel-headers
2017-09-13 13:07:29 -07:00
Ralph Castain
3f8908871b Since the DVM is now tied to prun, don't build the DVM either unless prun can be built
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-13 11:55:10 -07:00
Ralph Castain
589cc03d8e Only build prun if building --with-devel-headers
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-13 11:38:11 -07:00
Ralph Castain
0a3d8af4c2 Merge pull request #4202 from anandhis/master
Choosing provider when user requests generic transport "fabric"
2017-09-13 11:21:24 -07:00
Ralph Castain
27f15b67d7 Merge pull request #4210 from rhc54/topic/pup
Update to track PMIx master
2017-09-13 11:15:22 -07:00
Ralph Castain
691237801b Update to track PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-13 10:21:44 -07:00
Ralph Castain
df4bd83fcb Merge pull request #4206 from rhc54/topic/prun
Add a new launcher "prun" for starting applications against the ORTE DVM.
2017-09-13 06:55:30 -07:00
Ralph Castain
bbd83fd4c0 Add a new launcher "prun" for starting applications against the ORTE DVM.
Unlike "orterun", "prun" is a PMIx-only program that discovers the DVM connection instead of requiring that we explicitly provide it. Only build "prun" if PMIx v2.x is available.

This gets the DVM working again, but still is showing problems for multiple executions. I'll detail those in a separate issue. Thus, the DVM should still be considered "broken".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-12 21:40:41 -07:00
anandhi
4d7de8882f Checking for generic transport "fabric" in mca parameter rml_ofi_transports
to choose the first available non-socket provider.
	modified:   orte/mca/rml/ofi/rml_ofi_component.c
	modified:   orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi Jayakumar <anandhi.s.jayakumar@intel.com>
2017-09-12 15:39:55 -07:00
Ralph Castain
d41069795f Merge pull request #4200 from rhc54/topic/cov
Silence coverity warnings
2017-09-12 10:29:32 -07:00
Ralph Castain
88eac797fb Silence coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-12 09:14:36 -07:00
Brian Barrett
637ebf60f9 atomics: Remove requirement of 64 bit atomics
Remove two of the three  instances of components requiring
64 bit atomics, even on 32 bit systems.  The SM OSC component
also uses 64 bit atomics, but is a more complicated fix that
will follow this one.  Currently, no one is testing on
platforms that don't provide 64 bit atomics (even in 32 bit
mode), but with the removal of the non-inline assembly for
IA32, the older compilers on Absoft's test systems now
result in no practical way to call cmpxchg8 in 32 bit mode.
At that point, these failures started popping up.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-11 19:50:10 -07:00
Ralph Castain
6775b2a9c6 Merge pull request #4198 from rhc54/topic/dvmrepair
Repair the ORTE DVM
2017-09-11 18:40:06 -07:00
Ralph Castain
3477079804 Repair the ORTE DVM
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-11 17:38:21 -07:00
Nathan Hjelm
7cdda24206 osc/sm: do not require 64-bit atomic math
This commit fixes a compile issue on 32-bit systems that do not
support 64-bit atomic math. The active target path was using 64-bit
atomics exclusively to support PSCW. This commit updates the code to
use either 32 or 64-bit atomic math depending on what is available.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-09-11 14:10:38 -10:00
Brian Barrett
29a53b0269 git: Ignore OSHMEM C++ wrapper artifacts
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-08 08:54:08 -07:00
Josh Hursey
392129063b Merge pull request #4191 from jjhursey/fix/global_rank
orte/pmix: Always seed environment with global rank
2017-09-08 09:39:50 -05:00
Joshua Hursey
420ca65f4f orte/pmix: Always seed environment with global rank
* Even if we are only launching one app context, we might call spawn
   later and the remote groups might want their global rank information.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-09-08 08:53:49 -05:00
Brian Barrett
5602d3b9c2 atomics: Remove cmpset_64 on IA32
The recent changes to remove non-inline atomics have caused
a cascade of issues with cmpset_64 on IA32.  cmpxchg8 requires
the use of a bunch of registers (2 for every operand, 3 operands),
and one of them is ebx, which is used by the compiler to do
shared library things.  Some compilers don't deal well with
ebx being clobbered (I'm looking at you, gcc 4.1).  Rather than
continue trying to fight, remove cmpset_64 from the supported
atomic operations on IA32.  Other 32 bit platforms (MIPS32,
SPARC32, ARM, etc.) already don't support a 64 bit compare-and-
swap, so while this might slightly reduce performance, it will
at least be correct.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-07 12:19:34 -07:00
Ralph Castain
afe7f6983b Merge pull request #4184 from rhc54/topic/pmix
Update to track PMIx master
2017-09-06 15:19:01 -07:00
Brian Barrett
ff3ff28a00 NEWS: Remove duplicate "master" items
Both the C++ and Vampir notes appear in release branch notes
already, so remove from the "not on release branch" section.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2017-09-06 13:31:30 -07:00
Nathan Hjelm
4bba8774f4 monitoring: fix MPI_T regression
The monitoring code causes MPI_T based tools to segfault when
monitoring is disabled. This happens because the performance
variables remain registered after the common/monitoring
component is dlclosed due to a missing variable registration
flag. This commit adds the necessary flag to all the registered
performance variables.

The issue on github is #4162. Close when applied to master.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-09-06 14:24:35 -06:00
Ralph Castain
cbc114e923 Update to track PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-06 13:15:24 -07:00
Jeff Squyres
41c7230bc4 Merge pull request #4179 from jsquyres/pr/opal-path-nfs-razzem-frazzem
opal_path_nfs: ensure arrays are always long enough
2017-09-06 11:16:44 -04:00
Jeff Squyres
dee8cfbfd0 opal_path_nfs: ensure arrays are always long enough
This test used to have fixed-sized arrays for the mounts that it was
checking.  However, we periodically run across machines with more
mounts than can fit into those fixed-size arrays.  Rather than
periodically increasing the size of those arrays (after re-discovering
that the error is due to fixed-size arrays), just count how many
entries there are and make arrays that are big enough.

Additionally, add a check to ensure that we don't go over the max size
of the array when reading/filling them.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-09-06 07:01:45 -07:00
bosilca
dc538e9675 Merge pull request #1177 from bosilca/topic/large_msg
Topic/large msg
2017-09-05 13:30:19 -04:00
Mike Dubman
62739c6513 Merge pull request #4165 from alinask/topic/spml-ucx-estimated-num-eps
SPML_UCX: use ompi_proc_world_size() to set the estimated_num_eps value
2017-09-05 18:36:42 +03:00
Alina Sklarevich
007b1803ec SPML_UCX: use ompi_proc_world_size() to set the estimated_num_eps value
before this fix, mca_spml_ucx_component_open was using
oshmem_num_procs() to set the value of params.estimated_num_eps for UCX.
The oshmem_num_procs() function uses oshmem_group_all which will be
initialized after the call to mca_spml_ucx_component_open and therefore,
cannot be used there.

Signed-off-by: Alina Sklarevich <alinas@mellanox.com>
2017-09-04 14:46:00 +03:00
Gilles Gouaillardet
3b8b8c52c5 Merge pull request #1432 from ggouaillardet/topic/memchecker
Fix misc memchecker issues
2017-09-04 13:14:40 +09:00
Gilles Gouaillardet
ecb6b81a05 mpi: correctly handle MPI_IN_PLACE by memchecker in neighborhood collectives
MPI_IN_PLACE is not a valid send buffer for neighborhood collectives, so do not
invoke memchecker in this case.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-09-04 11:21:32 +09:00
Gilles Gouaillardet
66c9485e77 MPI_Isend: memchecker do not mark send buffer as unaccessible after pml isend invokation
Today's MPI standard mandates the send buffer remains accessible during the send operation.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-09-04 11:21:32 +09:00
Gilles Gouaillardet
af8242a121 pml/ob1: have memchecker make recv buffer defined again when mca_pml_ob1_recv completes
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-09-04 11:18:05 +09:00
Gilles Gouaillardet
6ee9366243 MPI_Wait: correctly handle MPI_STATUS_IGNORE in MEMCHECKER
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-09-04 11:18:05 +09:00
Ralph Castain
c1ce233eaf Merge pull request #4143 from aravindksg/psm2_cuda
Add support for GPU buffers for PSM2 MTL
2017-09-01 21:09:55 -07:00
Ralph Castain
7b22207599 Merge pull request #4163 from rhc54/topic/pmix21
Roll to track PMIx master
2017-09-01 17:36:20 -07:00
Aravind Gopalakrishnan
2e83cf15ce Add support for GPU buffers for PSM2 MTL
PSM2 enables support for GPU buffers and CUDA managed memory and it can
directly recognize GPU buffers, handle copies between HFIs and GPUs.
Therefore, it is not required for OMPI to handle GPU buffers for pt2pt cases.
In this patch, we allow the PSM2 MTL to specify when
it does not require CUDA convertor support. This allows us to skip CUDA
convertor init phases and lets PSM2 handle the memory transfers.

This translates to improvements in latency.
The patch enables blocking collectives and workloads with GPU contiguous,
GPU non-contiguous memory.

Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
2017-09-01 16:59:03 -07:00
George Bosilca
d10522a01c
Set a hard limit on the TCP max fragment size.
Some OSes have hardcoded limits to prevent overflowing over an int32_t.
We can either detect this at configure (which might be a nicer but
incomplete solution), or always force the pipelined protocol over TCP.
As it only covers data larger than 1GB, no performance penalty is to be
expected.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-09-01 18:52:48 -04:00
George Bosilca
866899e836
Always abide to the RDMA pipeline limit.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-09-01 18:52:48 -04:00
George Bosilca
050bd3b6d7
Make the pipeline depth an int instead of a size_t. While
they are supposed to be unsigned, casting them to a signed
value for all atomic operations is as errorprone as handling
them as signed entities.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-09-01 18:52:48 -04:00
George Bosilca
c340da2586
A first cut at the large data problem with TCP. As long
as the writev and readv support a sum larger than a uint32_t
this version will work. For the other OSes a different patch
is required. This patch is a slight modification of the one
proposed by @ggouaillardet.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-09-01 18:52:48 -04:00
George Bosilca
4db3730a25
Be consistent for atomic operations and add an entity
of the same type.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-09-01 18:52:48 -04:00
Ralph Castain
2c723f4338 Roll to track PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-01 12:30:34 -07:00
Nathan Hjelm
79fc9d54dc Revert "* Some recent versions of GCC try very hard to make it impossible to"
This reverts commit b5ea5e0994

This commit reverts a change that is hopefully not necessary. If this
is the case this will fix #4146.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-09-01 08:47:29 -06:00