1
1

6498 Коммитов

Автор SHA1 Сообщение Дата
Edgar Gabriel
da640f98df fcoll/two_phase: data sieving has to occur at offset 0 as well
data sieving has to occur for any offset provided that is larger
or equal zero for this implementation to work correctly.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-03-10 11:23:09 -06:00
Edgar Gabriel
c83b47c266 io/romio314: mark datatypes of size 0 as contiguous
this commit fixes an issue observed with romio314 and the hdf5 1.10.x testsuite.
The ADIOI_Datatype_iscontig() routine in romio314/src/io_romio314_module.c
will now return for a datatype of size 0 that it is contiguous, even if the extent
of the datatype is non-zero. This avoids a segmentation fault observed in the
ADIOI_Flatten routine, and fixes this particular with the hdf5 1.10.x testsuite in
OpenMPI with romio314.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-03-08 09:10:09 -06:00
bosilca
9944d63de1
Merge pull request #4852 from thananon/pr/ob1_oos_fix
pml/ob1: fixed out of sequence bug.
2018-02-28 13:02:03 -05:00
Thananon Patinyasakdikul
09cba8b30b pml/ob1: fixed out of sequence bug.
This commit fixes #4795

- Fixed typo that sometimes causes deadlock in change of protocol.
- Redesigned out of sequence ordering and address the overflow case of
  sequence number from uint16_t.

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
2018-02-27 13:49:40 -05:00
Valentin Petrov
bf4e694a96 coll/hcoll: Fix return codes
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2018-02-22 17:48:29 +02:00
Matias Cabral
0a822f8f99
Merge pull request #4821 from nrspruit/OFI_mtl_multi_event_progress
MTL OFI: Added support for reading multiple CQ events in ofi progress
2018-02-20 14:59:47 -08:00
Jeff Squyres
9ef0f3d83a ompi/monitoring: add .sh versionig to common monitoring lib
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-02-20 07:07:23 -08:00
Spruit, Neil R
e7bff501cd MTL OFI: Added support for reading multiple CQ events in ofi progress
-Updated ompi_mtl_ofi_progress to use an array to read CQ events up to a
threshold that can be set by the Open MPI User.

-Users can adjust the number of events that can be handled in the
ompi_mtl_ofi_progress by setting "--mca mtl_ofi_progress_event_cnt #".

-The default value for the the number of CQ events that can be read in a
single call to ofi progress is 100 which is an average
based off workload usecase anaylsis showing 70-128 as the range of
multiple events returned during ofi progress.

Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
2018-02-15 09:41:14 -05:00
Nathan Hjelm
0e83568466 coll/libnbc: do not take lock in progress if there are no requests
This commit fixes a flaw in the progress function for libnbc. The
function was unconditionally taking a lock even if there are no
requests to process. This lock was showing up in vtune traces of
multi-threaded benchmarks.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-02-13 09:51:01 -07:00
Edgar Gabriel
a3a734b6d2 io/ompio: correctly reset the request
after performing the final OBJ_RELEASE on the request,
reset the user level variable to MPI_REQUEST_NULL.
Otherwise the c_2_f translation step in the fortran
interface fails.

Fixes issue #4807

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-02-13 09:18:25 -06:00
Jeff Squyres
e7f91f8068
Merge pull request #4527 from clementFoyer/osc-no-includes
Remove inter-dependencies between OSC modules.
2018-02-09 15:49:56 -05:00
Nathan Hjelm
da9f833f4a pml/ob1: ignore the eager limit of RDMA-only btls
This commit fixes a flaw in the eager limit check in pml/ob1. The
check was incorrectly checking if RDMA-only BTLs (BTLs without the
send flag) has a valid eager limit. This commit fixes the check by
adding an additional check for the send flag on the BTL module.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-02-07 12:42:44 -07:00
Clement Foyer
f5b4fc05f8 Remove inter-dependencies between OSC modules.
The osc monitoring component needed to include other OSC components
header in order to be able tu access communicator through the
component specific ompi_osc_*_module_t structures. This commit remove
the dependency, and resolve the issue #4523.

Extend the common monitoring API.

  * Now it's possible to translate from local rank to world rank from
    both the communicator and the group.
  * Remove useless hashtable as we directly use the w_group contained
    in window structure.

Add automatic generation at config time.

The templates are expanded at configure time. It creates a new header
file that generates all the variables/functions needed. Adding this
during the autogen automagicaly generates for each of the available
modules the proper functions.

Only keep a generated argv-style array.

Following Jeff's advice, the configure.m4 file generate a simple array
of module variables to be iterated over to find the proper module.

Signed-off-by: Clement Foyer <clement.foyer@inria.fr>
2018-02-07 11:52:00 +00:00
Ralph Castain
7ddffc627d
Merge pull request #4776 from rhc54/topic/rte
Correct abstraction break and update ignores
2018-01-31 04:34:05 -08:00
Ralph Castain
8e8a9aecc5 Correct abstraction break - direct reference to ORTE
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-30 21:19:14 -08:00
Gilles Gouaillardet
34b45cc879 osc/sm: fix the osc_free callback
If component selection fails, then module->bases might be unallocated
when ompi_osc_sm_free() in invoked, so test it before trying to free()
module->bases[0].

Thanks Martin Binder for the report.

Refs open-mpi/ompi#4770

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-01-31 11:23:21 +09:00
Sergey Oblomov
7a5811d0a8 request/state: update state for canceled request
- fixed issue in set state for canceled request

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-01-29 18:26:20 +02:00
Edgar Gabriel
bcf26d419f fs/ufs and fs/lustre: remove erroneous return statement
an erroneous return statement has creeped in commit 1885d99
which leads to some processes not resetting stripe_size
and stripe_count correctly. This can lead in 3.0.x to different
fcoll modules being selected. The impact is not that dramatic on
master and 3.1.x, but could lead to problems as well.

Fixes #4745

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2018-01-24 14:07:21 -06:00
Xin Zhao
72ff2b1135 OMPI/OSC/UCX: adding atomic lock for fetch_and_op and compare_and_swap
Signed-off-by: Xin Zhao <xinz@mellanox.com>
2018-01-18 00:36:22 +02:00
Yossi Itigin
f2851fd502
Merge pull request #4724 from alex-mikheev/topic/ucx_as_default
ompi/oshmem: ucx is selected over yalla/ikrit by default
2018-01-17 17:41:49 +02:00
Alex Mikheev
640e945b9c ompi: pml/ucx: blocking send using ucp_tag_send_nbr
Signed-off-by: Alex Mikheev <alexm@mellanox.com>
2018-01-17 15:54:18 +02:00
Alex Mikheev
ae326546f4
ompi/oshmem: ucx is selected over yalla/ikrit by default
Signed-off-by: Alex Mikheev <alexm@mellanox.com>
2018-01-17 15:08:04 +02:00
Matias Cabral
8049c06a96
Merge pull request #4580 from matcabral/fix_comments_pr_425_osc_rmda
osc/rmda: fix missing opal_argv_free in mtls search.
2018-01-12 09:20:11 -08:00
Matias A Cabral
009ba475e1 osc/rmda: fix missing opal_argv_free in mtls search.
Use asprintf in description message to avoid missing default values
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
2018-01-11 14:29:16 -08:00
bosilca
ef38ca5663
Merge pull request #4644 from bosilca/topic/treematch
Fix treematch topology assert
2018-01-02 21:21:54 -05:00
Alex Mikheev
e7bf0617cf
ompi: pml ucx: improve recv latency
Signed-off-by: Alex Mikheev <alexm@mellanox.com>
2017-12-26 16:24:16 +02:00
Aravind Gopalakrishnan
fb68726baf MTL OFI: Allow retries in MTL progress for interrupted syscalls
This fixes a regression in sockets provider which could return -EINTR value
from fi_cq_read() due to a syscall being interrupted. The error value is
currently interpreted as fatal condition. Relax the rule so that we can retry
fi_cq_read() operation.

Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
2017-12-20 14:58:49 -08:00
George Bosilca
38455845db
Fix asserts.
In both cases we were comparing with the wrong size, it should be either
the number of local processes or the number of nodes, and not the size
of the communicator.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-12-20 11:51:35 -05:00
George Bosilca
808f865e9d
Force all output to use OMPI infrastructure.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-12-20 11:50:51 -05:00
Matias Cabral
2c86b8723d
Merge pull request #4510 from matcabral/mtl_psm2_shadow_vars
New flag for MCA parameters that allows a behaving with a default value of "unset".
2017-12-04 12:25:37 -08:00
Howard Pritchard
b160cf6339
Merge pull request #4533 from hppritcha/topic/ofi_mtl_mprobe_fixes
mtl/ofi: fix problem with mprobe/mrecv
2017-12-04 09:11:47 -07:00
Howard Pritchard
2233e44848
Merge pull request #4534 from hppritcha/topic/fix_a_segv_in_request
pml/cm: check for request comp. before completing bsend
2017-12-04 09:09:41 -07:00
Gilles Gouaillardet
2f5b1e9fe0
Merge pull request #4551 from ggouaillardet/topic/communicator_mutex_c_lock
Make usage of ompi_communicator_t, ompi_file_t and ompi_win_t mutex consistent
2017-12-04 09:20:52 +09:00
Edgar Gabriel
1f151be6d2 io/ompio: introduce a new function to retrieve mca parameter values
ompio has the unique problem, that mca parameters set in the io/ompio component
have to be accessible from other frameworks as well. This is mostly done to avoid
a replication in the parameter names and to reduce the number of mca parameters that
and end-user has to worry about.

This commit introduces a generic function to retrieve ompio mca parameters, the function pointer
is stored on the file handle. It replaces two functions that used the same concept already for
one parameter each.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-12-01 10:00:23 -06:00
Gilles Gouaillardet
5f1a967351 ompi/file: rename ompi_file_t's f_mutex into f_lock
in order to use a consistent name between ompi_file_t,
ompi_win_t and ompi_communicator_t

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-12-01 16:06:22 +09:00
Nathan Hjelm
7893248c5a opal/asm: add fetch-and-op atomics
This commit adds support for fetch-and-op atomics. This is needed
because and and or are irreversible operations so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is not atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:23 -07:00
Nathan Hjelm
1282e98a01 opal/asm: rename existing arithmetic atomic functions
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:22 -07:00
Nathan Hjelm
9d0b3fe9f4 opal/asm: remove opal_atomic_bool_cmpset functions
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:22 -07:00
Nathan Hjelm
45db3637af osc/rdma: bug fixes
This commit fixes the following bugs:

 - Allow a btl to be used for communication if it can communicate with
   all non-self peers and it supports global atomic visibility. In
   this case CPU atomics can be used for self and the btl for any
   other peer.

 - It was possible to get into a state where different threads of an
   MPI process could issue conflicting accumulate operations to a
   remote peer. To eliminate this race we now update the peer flags
   atomically.

 - Queue up and re-issue put operations that failed during a BTL
   callback. This can occur during an accumulate operation. This was
   an unhandled error case.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-29 12:43:58 -07:00
Nathan Hjelm
67e26b6e5a
Merge pull request #4482 from matcabral/osc_rdma_skip_mtls
osc/rdma: mca parameter to list MTLs that lower osc rdma priority
2017-11-29 09:34:33 -07:00
Nathan Hjelm
647b40f3f2
Merge pull request #4442 from bosilca/topic/ob1_pvar
Topic/ob1 pvar
2017-11-29 09:31:07 -07:00
Ralph Castain
7ad6886a30 Add a new OMPI rte component to support direct-launch using PMIx.
Cleanup several places where abstraction violations crept into OMPI layer (direct reference of ORTE). Add some missing includes that were exposed by this change.

Note that this compiles, but I haven't tested it for execution yet. Handing it over to Noah Evans for completion

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-28 12:05:01 -08:00
Howard Pritchard
3285325884 pml/cm: check for request comp. before completing bsend.
Turns out there are edge cases where an MTL's isend
method may end up marking a send request complete prior
to returning to the CM code.  The would end up causing
problems in the bsend path since the ompi_request_complete
would end up getting invoked a second time on this request.
This ended up causing segfaults, etc. in ompi_request_complete .

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-11-27 12:27:43 -07:00
Ralph Castain
3906aaf41a Silence warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-25 11:50:18 -08:00
Howard Pritchard
cd48eccbae mtl/ofi: fix problem with mprobe/mrecv
At least with some providers (sockets and GNI), the mprobe/mrecv
ofi mtl methods were incorrect.  For these two providers at least
one must supply the original tag and mask bits used with the
prior FI_PEEK | FI_CLAIM request that had been used to probe for
the message.

These providers take a strict interpretation of the following sentence
from the libfabric fi_tagged man page:

```
 Claimed messages can only be retrieved using a subsequent, paired receive  operation  with  the  FI_CLAIM  flag  set.
```

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-11-24 08:11:18 -07:00
Edgar Gabriel
75ab006ec0 io/ompio: add a new option to disable amode overwriting
ompio has historically changed the WRONLY flag provided by the applicaiton
to RDWR to allow for the data sieving optimization within the two-phase I/O
fcoll component. This change did not have a performance impact
on regular UNIX file systems, but seems to hurt performance on NFS (and maybe Lustre?)

So provide an option that allows to keep the WRONLY option, and raise an error
if tha fcoll/two-phase would actually like to use the data sieving.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-11-17 13:13:38 -06:00
Matias A Cabral
1fad59465f New flag for MCA parameters that allows a behaving with a default value
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.

Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
2017-11-16 16:28:50 -08:00
Matias Cabral
d1869a725a
Merge pull request #4467 from matcabral/master
mtl/ofi: Set data and control progress options default values to FI_PROGRESS_UNSPEC
2017-11-13 07:35:39 -08:00
Jeff Squyres
a8686a6813 mtl ofi: squelch compiler warnings
gcc 5.2 complains:

```
mtl_ofi_component.c: In function ‘ompi_mtl_ofi_finalize’:
mtl_ofi_component.c:613:5: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
     if (ret = fi_close((fid_t)ompi_mtl_ofi.fabric)) {
     ^
```

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-11 05:07:11 -08:00
Jeff Squyres
5a6ddf42d6 mtl ofi: it is not an error to return no data from fi_getinfo()
Before this commit, the presence of usNIC devices -- which will
(currently) return no data when fi_getinfo() is queried for tagged
matching providers -- would cause an error message to be displayed.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-11 05:07:11 -08:00