1
1

6935 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
7ddffc627d
Merge pull request #4776 from rhc54/topic/rte
Correct abstraction break and update ignores
2018-01-31 04:34:05 -08:00
Ralph Castain
8e8a9aecc5 Correct abstraction break - direct reference to ORTE
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-30 21:19:14 -08:00
Gilles Gouaillardet
34b45cc879 osc/sm: fix the osc_free callback
If component selection fails, then module->bases might be unallocated
when ompi_osc_sm_free() in invoked, so test it before trying to free()
module->bases[0].

Thanks Martin Binder for the report.

Refs open-mpi/ompi#4770

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-01-31 11:23:21 +09:00
Sergey Oblomov
7a5811d0a8 request/state: update state for canceled request
- fixed issue in set state for canceled request

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-01-29 18:26:20 +02:00
Edgar Gabriel
bcf26d419f fs/ufs and fs/lustre: remove erroneous return statement
an erroneous return statement has creeped in commit 1885d99
which leads to some processes not resetting stripe_size
and stripe_count correctly. This can lead in 3.0.x to different
fcoll modules being selected. The impact is not that dramatic on
master and 3.1.x, but could lead to problems as well.

Fixes #4745

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-01-24 14:07:21 -06:00
Xin Zhao
72ff2b1135 OMPI/OSC/UCX: adding atomic lock for fetch_and_op and compare_and_swap
Signed-off-by: Xin Zhao <xinz@mellanox.com>
2018-01-18 00:36:22 +02:00
Yossi Itigin
f2851fd502
Merge pull request #4724 from alex-mikheev/topic/ucx_as_default
ompi/oshmem: ucx is selected over yalla/ikrit by default
2018-01-17 17:41:49 +02:00
Alex Mikheev
640e945b9c ompi: pml/ucx: blocking send using ucp_tag_send_nbr
Signed-off-by: Alex Mikheev <alexm@mellanox.com>
2018-01-17 15:54:18 +02:00
Alex Mikheev
ae326546f4
ompi/oshmem: ucx is selected over yalla/ikrit by default
Signed-off-by: Alex Mikheev <alexm@mellanox.com>
2018-01-17 15:08:04 +02:00
Matias Cabral
8049c06a96
Merge pull request #4580 from matcabral/fix_comments_pr_425_osc_rmda
osc/rmda: fix missing opal_argv_free in mtls search.
2018-01-12 09:20:11 -08:00
Matias A Cabral
009ba475e1 osc/rmda: fix missing opal_argv_free in mtls search.
Use asprintf in description message to avoid missing default values
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
2018-01-11 14:29:16 -08:00
bosilca
ef38ca5663
Merge pull request #4644 from bosilca/topic/treematch
Fix treematch topology assert
2018-01-02 21:21:54 -05:00
Alex Mikheev
e7bf0617cf
ompi: pml ucx: improve recv latency
Signed-off-by: Alex Mikheev <alexm@mellanox.com>
2017-12-26 16:24:16 +02:00
Aravind Gopalakrishnan
fb68726baf MTL OFI: Allow retries in MTL progress for interrupted syscalls
This fixes a regression in sockets provider which could return -EINTR value
from fi_cq_read() due to a syscall being interrupted. The error value is
currently interpreted as fatal condition. Relax the rule so that we can retry
fi_cq_read() operation.

Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
2017-12-20 14:58:49 -08:00
George Bosilca
38455845db
Fix asserts.
In both cases we were comparing with the wrong size, it should be either
the number of local processes or the number of nodes, and not the size
of the communicator.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-12-20 11:51:35 -05:00
George Bosilca
808f865e9d
Force all output to use OMPI infrastructure.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-12-20 11:50:51 -05:00
Matias Cabral
2c86b8723d
Merge pull request #4510 from matcabral/mtl_psm2_shadow_vars
New flag for MCA parameters that allows a behaving with a default value of "unset".
2017-12-04 12:25:37 -08:00
Howard Pritchard
b160cf6339
Merge pull request #4533 from hppritcha/topic/ofi_mtl_mprobe_fixes
mtl/ofi: fix problem with mprobe/mrecv
2017-12-04 09:11:47 -07:00
Howard Pritchard
2233e44848
Merge pull request #4534 from hppritcha/topic/fix_a_segv_in_request
pml/cm: check for request comp. before completing bsend
2017-12-04 09:09:41 -07:00
Gilles Gouaillardet
2f5b1e9fe0
Merge pull request #4551 from ggouaillardet/topic/communicator_mutex_c_lock
Make usage of ompi_communicator_t, ompi_file_t and ompi_win_t mutex consistent
2017-12-04 09:20:52 +09:00
Edgar Gabriel
1f151be6d2 io/ompio: introduce a new function to retrieve mca parameter values
ompio has the unique problem, that mca parameters set in the io/ompio component
have to be accessible from other frameworks as well. This is mostly done to avoid
a replication in the parameter names and to reduce the number of mca parameters that
and end-user has to worry about.

This commit introduces a generic function to retrieve ompio mca parameters, the function pointer
is stored on the file handle. It replaces two functions that used the same concept already for
one parameter each.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-12-01 10:00:23 -06:00
Gilles Gouaillardet
5f1a967351 ompi/file: rename ompi_file_t's f_mutex into f_lock
in order to use a consistent name between ompi_file_t,
ompi_win_t and ompi_communicator_t

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-12-01 16:06:22 +09:00
Nathan Hjelm
7893248c5a opal/asm: add fetch-and-op atomics
This commit adds support for fetch-and-op atomics. This is needed
because and and or are irreversible operations so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is not atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:23 -07:00
Nathan Hjelm
1282e98a01 opal/asm: rename existing arithmetic atomic functions
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:22 -07:00
Nathan Hjelm
9d0b3fe9f4 opal/asm: remove opal_atomic_bool_cmpset functions
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:22 -07:00
Nathan Hjelm
45db3637af osc/rdma: bug fixes
This commit fixes the following bugs:

 - Allow a btl to be used for communication if it can communicate with
   all non-self peers and it supports global atomic visibility. In
   this case CPU atomics can be used for self and the btl for any
   other peer.

 - It was possible to get into a state where different threads of an
   MPI process could issue conflicting accumulate operations to a
   remote peer. To eliminate this race we now update the peer flags
   atomically.

 - Queue up and re-issue put operations that failed during a BTL
   callback. This can occur during an accumulate operation. This was
   an unhandled error case.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-29 12:43:58 -07:00
Nathan Hjelm
67e26b6e5a
Merge pull request #4482 from matcabral/osc_rdma_skip_mtls
osc/rdma: mca parameter to list MTLs that lower osc rdma priority
2017-11-29 09:34:33 -07:00
Nathan Hjelm
647b40f3f2
Merge pull request #4442 from bosilca/topic/ob1_pvar
Topic/ob1 pvar
2017-11-29 09:31:07 -07:00
Ralph Castain
7ad6886a30 Add a new OMPI rte component to support direct-launch using PMIx.
Cleanup several places where abstraction violations crept into OMPI layer (direct reference of ORTE). Add some missing includes that were exposed by this change.

Note that this compiles, but I haven't tested it for execution yet. Handing it over to Noah Evans for completion

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-28 12:05:01 -08:00
Howard Pritchard
3285325884 pml/cm: check for request comp. before completing bsend.
Turns out there are edge cases where an MTL's isend
method may end up marking a send request complete prior
to returning to the CM code.  The would end up causing
problems in the bsend path since the ompi_request_complete
would end up getting invoked a second time on this request.
This ended up causing segfaults, etc. in ompi_request_complete .

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-11-27 12:27:43 -07:00
Ralph Castain
3906aaf41a Silence warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-25 11:50:18 -08:00
Howard Pritchard
cd48eccbae mtl/ofi: fix problem with mprobe/mrecv
At least with some providers (sockets and GNI), the mprobe/mrecv
ofi mtl methods were incorrect.  For these two providers at least
one must supply the original tag and mask bits used with the
prior FI_PEEK | FI_CLAIM request that had been used to probe for
the message.

These providers take a strict interpretation of the following sentence
from the libfabric fi_tagged man page:

```
 Claimed messages can only be retrieved using a subsequent, paired receive  operation  with  the  FI_CLAIM  flag  set.
```

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-11-24 08:11:18 -07:00
Edgar Gabriel
75ab006ec0 io/ompio: add a new option to disable amode overwriting
ompio has historically changed the WRONLY flag provided by the applicaiton
to RDWR to allow for the data sieving optimization within the two-phase I/O
fcoll component. This change did not have a performance impact
on regular UNIX file systems, but seems to hurt performance on NFS (and maybe Lustre?)

So provide an option that allows to keep the WRONLY option, and raise an error
if tha fcoll/two-phase would actually like to use the data sieving.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-11-17 13:13:38 -06:00
Matias A Cabral
1fad59465f New flag for MCA parameters that allows a behaving with a default value
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.

Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
2017-11-16 16:28:50 -08:00
Matias Cabral
d1869a725a
Merge pull request #4467 from matcabral/master
mtl/ofi: Set data and control progress options default values to FI_PROGRESS_UNSPEC
2017-11-13 07:35:39 -08:00
Jeff Squyres
a8686a6813 mtl ofi: squelch compiler warnings
gcc 5.2 complains:

```
mtl_ofi_component.c: In function ‘ompi_mtl_ofi_finalize’:
mtl_ofi_component.c:613:5: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
     if (ret = fi_close((fid_t)ompi_mtl_ofi.fabric)) {
     ^
```

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-11 05:07:11 -08:00
Jeff Squyres
5a6ddf42d6 mtl ofi: it is not an error to return no data from fi_getinfo()
Before this commit, the presence of usNIC devices -- which will
(currently) return no data when fi_getinfo() is queried for tagged
matching providers -- would cause an error message to be displayed.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-11 05:07:11 -08:00
Jeff Squyres
f910f554f7 mtl ofi: show the positive value of the error
The value of ret is negative (e.g., -61), but it is displayed in the
help message as `%zd`, which renders as unsigned (i.e., a giant
positive value).  So make sure to negate the negative value before
rendering it (e.g., so we display "61", not "4294967235").

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-11 05:07:11 -08:00
Jeff Squyres
e8c13ef286 mtl ofi: fix trivial comment whitespace
No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-11 05:07:10 -08:00
Jeff Squyres
bed1930df8 mtl ofi: fix formatting of help message
No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-11 05:07:05 -08:00
bosilca
3f43e2c8df
Merge pull request #4419 from thananon/pr_newob1
pml/ob1: match callback will now queue wrong sequence frag and return.
2017-11-10 20:10:53 -05:00
Matias A Cabral
80c8858c5a osc/rdma: add an mca parameter to list MTLs for which osc pt2pt should have
higher priority than rdma and default to psm2.

Context: the Intel Omni-path driver (hfi1) has verbs support, so the openib
btl is available to use. However, at a bad performance. Without this
change osc rdma using btl openib is the default choice when running on Intel
Omni-path, with a lower performance than osc pt2pt over mtl psm2.

Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
2017-11-09 11:54:11 -08:00
George Bosilca
409638bdf4 Keep the out-of-sequence fragment ordered.
Rework the logic to handle the out-of-sequence fragments on the receiver
side. A large number of OOS messages are still arriving even in single
threaded scenarios.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-11-08 14:27:13 -05:00
Matias Cabral
b76bb42ac1 mtl/ofi: Set data and control progress options default
values to FI_PROGRESS_UNSPEC so each provider will use its default.

Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
2017-11-08 08:24:33 -08:00
Thomas Naughton
86bb6f8bac
Merge pull request #4444 from naughtont3/tjn-fix-plm-monitoring-configury
configury: single quote to avoid trouble with BSD
2017-11-08 10:56:37 -05:00
Edgar Gabriel
0c3aa44ba3
Merge pull request #4454 from edgargabriel/topic/large-file-bug-fix
io/ompio: fix a bug in handling large write/read operations
2017-11-07 09:46:47 -06:00
Gilles Gouaillardet
18d3897479 fs/ufs: add some more error codes on file_open
set proper error codes in mca_fs_ufs_file_open by mapping the errno value to
the MPI error code.

Refs. open-mpi/ompi#4443

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-11-07 11:09:32 +09:00
Edgar Gabriel
c9bb049d00 io/ompio: fix a bug in handling large write/read operations
This is a bug fix based on a problem reported on the mailing list.
For very large read/write operations, ompio breaks the operation
down into multiple cycles. The problem was that
one of the variables required to maintain its values
across the different cycles did not do that, and because
of that the calculations of the memory offsets was wrong.

Fixes issue #4453

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-11-06 11:48:13 -06:00
Edgar Gabriel
91881c4fdc fs/lustre: update component to match fs/ufs
the fs/lustre component has missed out on a number of updates to the fs/ufs component.
This commit tries to import all the changes performed on the fs/ufs component
w.r.t to the file_open operation, including updates on how the amode is set,
error is propegated and setting the fs_block_size value (which is required for
locking purposes).

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-11-03 16:11:46 -05:00
Edgar Gabriel
1885d99ac7 fs/ufs: set proper error codes on file_open
set proper error codes in mca_fs_ufs_file_open by mapping the errno value to
the MPI error code. Fixes an issue reported on the mailing by Wei-keng Liao

Fixes Issue #4443

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-11-03 16:06:31 -05:00