1
1
Граф коммитов

25191 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
e968ddfe64 start bug fixes (#1729)
* mpi/start: fix bugs in cm and ob1 start functions

There were several problems with the implementation of start in Open
MPI:

 - There are no checks whatsoever on the state of the request(s)
   provided to MPI_Start/MPI_Start_all. It is erroneous to provide an
   active request to either of these calls. Since we are already
   looping over the provided requests there is little overhead in
   verifying that the request can be started.

 - Both ob1 and cm were always throwing away the request on the
   initial call to start and start_all with a particular
   request. Subsequent calls would see that the request was
   pml_complete and reuse it. This introduced a leak as the initial
   request was never freed. Since the only pml request that can
   be mpi complete but not pml complete is a buffered send the
   code to reallocate the request has been moved. To detect that
   a request is indeed mpi complete but not pml complete isend_init
   in both cm and ob1 now marks the new request as pml complete.

 - If a new request was needed the callbacks on the original request
   were not copied over to the new request. This can cause osc/pt2pt
   to hang as the incoming message callback is never called.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

* osc/pt2pt: add request for gc after starting a new request

Starting a new receive may cause a recursive call into the pt2pt
frag receive function. If this happens and the prior request is
on the garbage collection list it could cause problems. This commit
moves the gc insert until after the new request has been posted.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-02 20:22:40 -04:00
George Bosilca
e1c6b0e4a7 Some compilers are more than picky. 2016-06-03 09:04:34 +09:00
Matias Cabral
a15b648e65 Merge pull request #1747 from matcabral/master
Adding owner file for PSM2 MTL.
2016-06-02 16:32:26 -07:00
Matias A Cabral
29ab28f4f6 Adding owner.txt file for PSM2 MTL. 2016-06-02 16:26:16 -07:00
Nathan Hjelm
d9fc855955 Merge pull request #1743 from hjelmn/gcc_atomics_fix
atomic/gcc: add check for 128-bit CAS being lock-free
2016-06-02 16:55:31 -06:00
Nathan Hjelm
d86e41ea13 atomic/gcc: add check for 128-bit CAS being lock-free
Compiler implementations are free to include support for atomics that
use locks. Unfortunately lock-free and lock atomics do not mix. Older
versions of llvm on OS X use locks to provide
__atomic_compare_exchange on 128-bit values but are lock-free on
64-bit values. This screws up our lifo implementation which mixes
64-bit and 128-bit atomics on the same values to improve
performance. This commit adds a configure-time check if 128-bit
atomics are lock free. If they are not then the 128-bit __atomic CAS
is disabled and we check for the __sync version as a fallback.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-02 15:59:05 -06:00
Nathan Hjelm
5aab4b2d51 Merge pull request #1662 from ggouaillardet/topic/amd64_atomic
amd64/atomic: silence warnings
2016-06-02 14:10:20 -06:00
George Bosilca
d577e12dd0 Fix comment. 2016-06-03 00:57:31 +09:00
George Bosilca
87b1d17e7e Remove warnings.
clang 7.0 with the picky option on is extremely verbose, and complains
about almost everything. Trying to make him happy, at least regarding
the datatype engine.
2016-06-03 00:56:24 +09:00
George Bosilca
fc5d458249 Consistency in handling OPAL_ENABLE_FT_CR.
I am not sure if we should continue to maintain the request support
for FT_CR, but I tried here to simplify the code while maintaining
the same meaning.
2016-06-03 00:54:24 +09:00
rhc54
483b9c370a Merge pull request #1741 from rhc54/topic/pmix114
Update to 1.1.4rc3
2016-06-02 06:57:37 -07:00
Nathan Hjelm
fc26d9c69f Merge pull request #1734 from hjelmn/progress_threading
opal/progress: make progress function registration mt safe
2016-06-02 06:35:59 -06:00
Nathan Hjelm
b001184e63 request: fix warnings (#1742)
Fix warnings introduced by request rework.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-02 04:53:16 -04:00
Ralph Castain
ecea1e3bb5 Update to 1.1.4rc3 2016-06-01 20:56:07 -07:00
George Bosilca
bfcf145613 Refactor the request test and wait functions. 2016-06-02 11:58:25 +09:00
Nathan Hjelm
2fad3b9bc6 opal/progress: make progress function registration mt safe
This commit fixes a bug in opal progress registration that can cause
crashes when a progress function is registered while another thread is
in opal_progress(). Before this commit realloc is used to allocate
more space for progress functions but it is possible for a thread in
opal_progress() to try to read from the array that is freed by realloc
before the array is re-assigned when realloc returns. To prevent this
race use malloc + memcpy to fill the new array and atomically swap out
the old and new array pointers.

Per suggestion we now allocate a default of 8 slots for callbacks and
double the current number when we run out of space.

This commit also fixes leaking the callbacks_lp array.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-01 20:57:19 -06:00
George Bosilca
d9fb59bea5 Update the synchronization primitive
Add comments and make sure we correctly return the status of the
synchronization primitive, especially if it was completed with error.
2016-06-02 11:53:56 +09:00
George Bosilca
2e1b1d34c6 Safety first ! 2016-06-02 11:52:43 +09:00
George Bosilca
50cec456fb ompi_request_complete with signal
Rewrite the ompi_request_complete function to take in account the
with_signal argument. Change the comment to explain the expected
behavior.
Alter all the ompi_request_complete uses to make sure the status of the
request is set before calling ompi_request_complete.

bot🏷️enhancement
2016-06-02 11:49:12 +09:00
George Bosilca
223d75595d Give a boost to MPI_Barrier.
Based on current implementation it is faster to use a blocking
send than the non-blocking version. Switch the exchange function
used in the barrier to use the blocking version combined with
the non-blocking version of the receive.
2016-06-02 11:45:25 +09:00
rhc54
3b68c1f8db Merge pull request #1740 from rhc54/topic/async
Add an experimental ability to skip the RTE barriers at the end of MPI_Init and the beginning of MPI_Finalize
2016-06-01 18:31:35 -07:00
Nathan Hjelm
f33bbfd381 atomic: add support for __atomic builtins (#1735)
* atomic: add support for __atomic builtins

This commit adds support for the gcc __atomic builtins. The __sync
builtins are deprecated and have been replaced by these atomics. In
addition, the new atomics support atomic exchange which was not
supported by __sync.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

* atomic: add support for transactional memory

This commit adds support for using transactional memory when using
opal atomic locks. This feature is enabled if the __HLE__ feature is
available and the gcc builtin atomics are in use.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-01 21:23:47 -04:00
Ralph Castain
2c086e56be Add an experimental ability to skip the RTE barriers at the end of MPI_Init and the beginning of MPI_Finalize 2016-06-01 17:01:15 -07:00
rhc54
b85a5e62ab Merge pull request #1739 from rhc54/topic/pmix
Split the pmix external component into one for the 1.1.4 release, and…
2016-06-01 16:24:44 -07:00
Nathan Hjelm
d844442683 Merge pull request #1738 from hjelmn/ob1_req_fix
pml/ob1: fix race on pml completion of send requests
2016-06-01 15:21:52 -06:00
Ralph Castain
12ecf972af Split the pmix external component into one for the 1.1.4 release, and another for the upcoming 2.0 release. Clean up the configury so the components look for a series-specific function instead of running a program.
NOTE: the changes for the 2.0 series are not yet in the PMIx master.
2016-06-01 14:15:24 -07:00
Jeff Squyres
873cebb4c0 Merge pull request #1727 from jsquyres/pr/mpirun-timeout-and-friends
mpirun.1in: add descriptions of new options
2016-06-01 17:11:44 -04:00
Nathan Hjelm
ceb2912838 Merge pull request #1736 from hjelmn/ugni_fixes
ugni BTL fixes
2016-06-01 14:59:55 -06:00
Nathan Hjelm
086ffc1838 pml/ob1: fix race on pml completion of send requests
The request code was setting the request as pml_complete before
calling MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE. This was causing
MCA_PML_OB1_SEND_REQUEST_RETURN to be called twice in some cases. The
code now mirrors the recvreq code and only sets the request as pml
complete if the request has not already been freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-01 13:36:06 -06:00
Jeff Squyres
2c3d522147 Merge pull request #1737 from jsquyres/pr/fix-hwloc-valgrind-check
fix hwloc valgrind check
2016-06-01 11:14:02 -04:00
Jeff Squyres
d175fd692d README.ompi: track patches added to hwloc
Track post-v1.11.3-release patches applied to the hwloc copy embedded
in Open MPI.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-01 07:17:05 -07:00
Jeff Squyres
3867bd3640 hwloc.m4: only check for valgrind in non-embedded mode
This fixes https://github.com/open-mpi/ompi/issues/1732: i.e., the
case where the outer project has its own check for
<valgrind/valgrind.h>, but also supplements CPPFLAGS (to find
Valgrind's header files) before doing that check.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Ideally, we would tell OMPI to disable autoconf's caching of our
valgrind check result so that its check gets the right result after
adding CPPFLAGS. Not sure if we can do that.

For now, just disable our Valgrind code in embedded mode.
This will keep the x86 backend enabled under Valgrind but
it will auto-disable itself when finding identical APIC ids anyway
(because CPUID returns same outputs for all PUs).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>

Fixes open-mpi/ompi#1732

(cherry picked from commit open-mpi/hwloc@8b44fb1c81)
2016-06-01 06:58:53 -07:00
Gilles Gouaillardet
57978a75d0 Merge pull request #1717 from ggouaillardet/topic/lex_cleanup
configury: clean the flex generated .c files
2016-06-01 13:06:21 +09:00
Nathan Hjelm
5d4bcce042 Merge pull request #1700 from shamisp/topic/cma_config
CMA: Fixing logic for CMA system call detection
2016-05-31 20:33:48 -06:00
Nathan Hjelm
340152a635 Merge pull request #1720 from shamisp/topic/vader/max_addr
VADER: Adjusting VADER_MAX_ADDRESS for non x86 platforms.
2016-05-31 20:33:28 -06:00
Gilles Gouaillardet
5f565dfec3 configury: clean the flex generated .c files 2016-06-01 11:13:31 +09:00
Jeff Squyres
cf27ec36b3 mpirun.zsh: add options to zsh shell completion
Add the following to zsh shell completion:

* --get-stack-traces
* --report-state-upon-timeout
* --timeout

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-31 16:33:46 -07:00
Jeff Squyres
e9ce11c6a7 help-orterun.txt: minor word smything
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-31 16:33:46 -07:00
Jeff Squyres
347497cc7e mpirun.1in: add descriptions of new options
Add descriptions for the new --report-state-on-timeout and
--get-stack-traces options.

Also add --timeout, and cross-reference MPIEXEC_TIMEOUT with it.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-31 16:33:46 -07:00
Nathan Hjelm
bf10d79914 btl/ugni: remove erroneous unlock
The endpoint lock was being released twice in mca_btl_ugni_get_ep.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-31 16:52:53 -06:00
Nathan Hjelm
cc96097873 btl/ugni: fix bug when attempting unaligned get on aries
This commit fixes a programming error when using an aries nic. The
documentation of ugni shows that only the local alignment restriction
for get was lifted on aries. There is still a remote address alignment
restriction.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-31 16:52:09 -06:00
Jeff Squyres
17202e5177 Merge pull request #1733 from jsquyres/pr/hwloc1113-fix
hwloc1113: add missing file to Makefile.am
2016-05-31 13:59:08 -04:00
Jeff Squyres
5cfee95ea4 hwloc1113: add missing file to Makefile.am
Lack of this file causes a failure when you run autogen.pl on a
distribution tarball.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-31 09:57:50 -07:00
rhc54
93ff4ce36d Merge pull request #1731 from rhc54/topic/timeout
Provide ETIMEDOUT as the mpirun exit code if the timeout limit was hit
2016-05-31 08:41:21 -07:00
Ralph Castain
0cd0ccb7fd Provide ETIMEDOUT as the mpirun exit code if the timeout limit was hit 2016-05-31 07:45:31 -07:00
Gilles Gouaillardet
1bbc5fadee ompi/win: silence an other warning 2016-05-31 13:18:39 +09:00
Gilles Gouaillardet
c41321b9e5 ompi/win: silence warning 2016-05-31 13:03:20 +09:00
rhc54
0965cb3d41 Merge pull request #1730 from rhc54/topic/pmixext
Patch from Gilles - modify detection of PMIx version for external libraries
2016-05-30 18:50:12 -07:00
Ralph Castain
7b115a9e0b Patch from Gilles - modify detection of PMIx version for external libraries 2016-05-30 14:30:10 -07:00
George Bosilca
d2abff583e Fix race condition during BTL TCP tear-down.
bot🏷️bug
bot:assign:@hjelmn
2016-05-30 10:47:14 -05:00