1
1
Граф коммитов

3387 Коммитов

Автор SHA1 Сообщение Дата
Howard Pritchard
db45d61dfa
Merge pull request #5147 from hppritcha/topic/plug_debug_hole_in_verbs
btl/openib: add conditional around an assert
2018-05-05 08:12:53 -06:00
Howard Pritchard
30eed9f035 btl/openib: addition conditional around an assert
A user trying to build Open MPI with explicit use
of CFLAGS on the make command line hit problems.

This fixes one of the problems.

https://www.mail-archive.com/users@lists.open-mpi.org//msg32241.html

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2018-05-04 14:17:07 -06:00
Ralph Castain
4ff61450a4 Ensure pmix_cleanup finalizes the class system
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-05-04 06:22:36 -07:00
Gilles Gouaillardet
edb8fe8e4b pmix/ext1x: fix index handling when populating an info array
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-26 11:06:43 +09:00
Ralph Castain
f424aa367e Fix external PMIx v1.2.5 support
As @hjelmn and I discussed, this is a little hacky. However, it is the only solution that can be done solely from the OMPI side.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-25 13:42:36 -07:00
Howard Pritchard
0751578cbf
Merge pull request #4949 from hppritcha/topic/memkind_update
mpool/memkind: refactor to use the current API
2018-04-25 07:24:28 -06:00
Howard Pritchard
824197f886 mpool/memkind: refactor to use the current API
The mpool/memkind component was using a deprecated "partitions" API.
This commit refactors the memkind component to make use of the
supported public API.

The public API uses 3 parameters to specify a mpool "kind":

- a memkind type (which for now is just default or HBM)
- a memkind policy
- a memkind_bits (partly to specify pagesize)

The MCA parameters were changed to reflect these memkind
parameters.

Add a make check test for sanity checking of the memkind component.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-04-24 22:11:21 -06:00
Nathan Hjelm
69456c8962
Merge pull request #5042 from Stonesjtu/patch-1
Fix typo
2018-04-16 12:13:35 -06:00
Todd Kordenbrock
55c6918316
Merge pull request #5053 from tkordenbrock/topic/master/btl-portals4.del_proc.fix
master: btl-portals4: don't free module resources when proc count goes to zero
2018-04-12 12:12:34 -05:00
Gilles Gouaillardet
37e7bca867 pmix/ext1x: fix misc build time errors
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-12 14:58:55 +09:00
Kaiyu Shi
b7c5e65d4f Fix typo
Signed-off-by: Kaiyu Shi <skyisno.1@gmail.com>
2018-04-11 10:29:43 +08:00
Todd Kordenbrock
b569633ddf btl-portals4: don't free module resources when proc count goes to zero
This commit fixes a segfault in btl-portals4 add_procs().  The segfault
occurs if add_procs() is called after a del_procs() call that reduces
the proc count to zero which would cause PT and NI resources to be
freed.  This commit resolves the segfault by using a common
initiailization boolean and only freeing module resources in
finalize().

Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
2018-04-10 14:20:22 -05:00
Jeff Squyres
45922c4e81 pmix/base: set PMIx to follow OPAL's mca_component_show_load_errors
Have Open MPI's PMIx component to set PMIx's "show_load_errors" to do
the same thing that Open MPI's "show_load_errors" does.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-10 10:24:35 -07:00
Jeff Squyres
f200b866df btl/tcp: roll back parts of 40afd525f8
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior).  Roll
back these messages to be opal_output_verbose() kinds of messages.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-07 12:28:10 -07:00
Jeff Squyres
a2fc1ace09
Merge pull request #4992 from jsquyres/pr/pmix-version-info-mca-vars
pmix: add "pmix*_library_version" info MCA var
2018-04-04 17:29:06 -04:00
Ralph Castain
cd52ccdb68 Move past the '.' when getting jobstepid
The strtoul function returns the pointer to the first non-digit character, which is a '.' in this case. Calling strtoul at that point will always yield a zero - you have to move past it to get the remaining number

Thanks to Greg Lee for the detailed analysis of the problem.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-04 11:22:38 -07:00
Jeff Squyres
9f472d8a7b pmix: add "pmix*_library_version" info MCA var
Simple MCA vars for ext1, ext2, and pmix3 components to reflect what
the underlying PMIx library version is.  For example:

```
$ ompi_info --param pmix pmix3x --parsable --level 9 | grep
library_version
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:value:PMIx library version 3.0.0 (embedded in Open MPI)
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:source:default
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:status:writeable
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:level:4
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:help:Version of the underlying PMIx library
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:deprecated:no
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:type:string
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:disabled:false
```

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-29 14:21:07 -07:00
Jeff Squyres
8c419294a8 btl/tcp: fix CID 710596
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long.  Don't copy beyond the end of that source buffer.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:22 -07:00
Jeff Squyres
3003be14f3 btl/sm: fix CID 1415105
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:21 -07:00
Jeff Squyres
a17f4afdc7 btl/tcp: fix CID 1416634
Fix resource leak in the TCP BTL.  Also add a little defensive programming.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:21 -07:00
Ralph Castain
e443adc7a1 Reset OMPI master to PMIx master
Track PMIx master instead of the reference server - fixes problem of external PMIx master builds.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-25 08:36:46 -07:00
Jeff Squyres
c3adcb05eb Miscellaneous compiler warnings fixes
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-23 11:45:30 -07:00
Jeff Squyres
023a4a82d3
Merge pull request #4942 from jsquyres/pr/tcp-btl-help-message-updates
TCP help message updates
2018-03-22 08:53:04 -05:00
Jeff Squyres
a15d8233c9
Merge pull request #3434 from dsharma283/pr-3431
ompi/opal: add support for HDR link speeds
2018-03-21 21:57:20 -05:00
Jeff Squyres
40afd525f8 btl/tcp: make error messages more specific
Convert some verbose messages to opal_show_help() messages.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-21 19:34:03 -07:00
Devesh Sharma
90e9b22196 ompi/opal: add support for HDR link speeds
This patch enables to use adapters with HDR speeds.
issue id 3431

Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
2018-03-21 19:15:41 -07:00
Howard Pritchard
ade280eb7c
Merge pull request #3292 from markalle/pr/ibv_reg_mr__fork
IB fork
2018-03-21 09:39:08 -06:00
Boris Karasev
85ce76fa36
Merge pull request #4926 from karasevb/pmix_dstore_enable
pmix: dstore returned for direct modex
2018-03-21 14:24:51 +07:00
Boris Karasev
dca3dd2ea4 pmix: dstore returned for direct modex
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-03-20 04:56:48 +02:00
Jeff Squyres
cc4bb433bc mpool/memkind: fix typo in partition page sizes
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-19 16:33:08 -05:00
Boris Karasev
36a0c6a794 pmix: fixed the direct modex request
This commit fixes the case when local client asks for the key from the
process on the remote node. The local server don't have commit count for
remote ranks, it is maintained by another PMIx server, so commit count
should be ignored for remote requests.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-03-19 11:51:03 +02:00
Jordan Cherry
2f0e8153a5
Merge pull request #4247 from jocherry/btlTcpLinksBugFix
tcp btl: Fix multiple-link connection establishment.
2018-03-07 08:40:37 -08:00
Ralph Castain
7241043809 Modify the internal logic for resolve nodes/peers
The current code path for PMIx_Resolve_peers and PMIx_Resolve_nodes executes a threadshift in the preg components themselves. This is done to ensure thread safety when called from the user level. However, it causes thread-stall when someone attempts to call the regex functions from _inside_ the PMIx code base should the call occur from within an event.

Accordingly, move the threadshift to the client-level functions and make the preg components just execute their algorithms. Create a new pnet/test component to verify that the prge code can be safely accessed - set that component to be selected only when the user directly specifies it. The new component will be used to validate various logical extensions during development, and can then be discarded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 456ac7f7af3d9ba09888e3c899eb001daaa24aef)
2018-03-02 02:00:31 -08:00
Ralph Castain
17c40f4cea Implement support for proctable queries
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Ralph Castain
0434b615b5 Update ORTE to support PMIx v3
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":

* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch

* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.

* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.

* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Gilles Gouaillardet
9fedf2836e btl/vader: handle unexpected short read/write in process_vm_{read,write}v
Important note :

According to the man page
"On success, process_vm_readv() returns the number of bytes read and
process_vm_writev() returns the number of bytes written.  This return
value may be less than the total number of requested bytes, if a
partial read/write occurred.  (Partial transfers apply at the
granularity of iovec elements.  These system calls won't perform a
partial transfer that splits a single iovec element.)"

So since we use a single iovec element, the returned size should either
be 0 or size, and the do loop should not be needed here.
We tried on various Linux kernels with size > 2 GB, and surprisingly,
the returned value is always 0x7ffff000 (fwiw, it happens to be the size
of the larger number of pages that fits a signed 32 bits integer).
We do not know whether this is a bug from the kernel, the libc or even
the man page, but for the time being, we do as is process_vm_readv() could
return any value.

Thanks Heiko Bauke for the bug report.

Refs. open-mpi/ompi#4829

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-03-02 13:20:46 +09:00
Nathan Hjelm
5ed2fc2d48 mca/base: add support for additional variable types
This commit adds long, int32_t, uint32_t, int64_t, and uint64_t as
possible MCA variable types.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-03-01 20:42:27 -07:00
Jordan Cherry
d7e7e3acb7 tcp btl: Fix multiple-link connection establishment.
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
    Three issues were resulting in hangs during large message transfer:
      * The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
        address check was binary, rather than a count
      * The accept handler would not skip a btl module that was already in use, resulting in all
        connections for a given address being vectored to a single btl
      * Multiple addresses in the same subnet caused connections to be
        stalled, as the receiver would always use the same (first) address
        found.  Binding the outgoing connection solves this issue
     *  Lastly fix race condition created by connections being started at the exact same time
        by accpeting connections not in the closed state, allowing endpoint_accept to resolve
        dispute

    Signed-off-by: Jordan Cherry <cherryj@amazon.com>
2018-02-27 16:36:44 +00:00
Nathan Hjelm
5380d7cce5 mpool/hugepage: add missing header
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-02-26 13:35:56 -07:00
Nathan Hjelm
38d9b10db8 rcache/base: update VMA tree to use opal_interval_tree_t
This commit replaces the current VMA tree implementation with one that
uses the new opal_interval_tree_t class. Since the VMA tree lock is no
longer used this commit also updates rcache/grdma and btl/vader to
take better care when searching for existing registrations.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-02-26 13:35:56 -07:00
Ralph Castain
60e6440603 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-02-19 09:20:13 -08:00
Jeff Squyres
b452991ad8 btl/usnic: missed a preprocessor check in d36648b
Missed updating one instance of `==` to `>=` in d36648b.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-02-19 07:03:05 -08:00
Jeff Squyres
d36648b547 btl/usnic: update BTL_VERSION handling
Follow-on to 8097d09858: now that BTL_VERSION is defined in btl.h, be
a little smarter about whether we define it or not.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-02-16 13:20:36 -08:00
Ralph Castain
8097d09858 Silence usnic warnings - BTL version has changed
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-02-16 10:00:18 -08:00
Nathan Hjelm
9aa21f4467
Merge pull request #4796 from hjelmn/btl_v3.1
opal/btl: add support for flushing RDMA/atomic operations
2018-02-15 12:44:18 -07:00
Nathan Hjelm
072a6a4850 opal/btl: add support for flushing RDMA/atomic operations
This commit adds a new optional function to the BTL module:
btl_flush. This function takes an optional BTL endpoint. When called
this function completes all outstanding RDMA and atomic operations
started prior to the call to btl_flush.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-02-13 12:49:51 -07:00
Ralph Castain
1a7dfd7d54 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-02-07 12:16:51 -08:00
Ralph Castain
9fe8153d38 Sync to IOF branch and continue fix of request for job info from unknown nspace
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 02400d30d79ce3c7e7e28f9a08f7062a5b6f4c51)
2018-02-03 19:56:35 -08:00
Howard Pritchard
1adf0873f7
Merge pull request #4766 from hppritcha/topic/squash_grdma_comp_warning
rcache/grdma: squash a compiler warning
2018-02-02 18:58:32 -07:00
Gilles Gouaillardet
43700faba1 pmix/ext3x: remove autogenerated ext3x.h header file
This header file was meant to be autogenerated, and for
some reasons, was never removed from the repository.
Update .gitignore as well

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-01-31 23:45:42 +09:00