Commit graph

30136 commits

Author SHA1 Message Date
Jeff Squyres
26705efad0 opal_check_alps: fix configure output
There was a path where OPAL_CHECK_ALPS would exit its testing but
still leave `opal_check_cray_alps_happy` blank.  Fix that by setting
it to "no".

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-18 11:30:00 -07:00
Edgar Gabriel
dce203ffc6
Merge pull request #7057 from edgargabriel/topic/romio321-status-set-elements-fix
MPIR_Status_set_bytes: fix for large counts
2019-10-18 08:16:36 -05:00
Nathan Hjelm
b1ef5a40fa
Merge pull request #7016 from hjelmn/fix_btl_uct_from_yet_another_unannounced_api_break_in_the_openucx_uct_layer
btl/uct: add support for OpenUCX v1.8 API changes
2019-10-17 06:27:18 -07:00
Jeff Squyres
b6c4d5c118
Merge pull request #7060 from jsquyres/pr/usnic-mca-updates
BTL usnic MCA updates
2019-10-15 10:48:10 -04:00
Jeff Squyres
e1e6d8b85e
Merge pull request #7076 from ftab/pr/my-superlative-fix
README: Remove info for plugins that aren't used anymore
2019-10-10 14:52:36 -04:00
Jeff Squyres
65fd12feff
Merge pull request #7081 from msbrowning/pr/fixed-README
Removed text block from line 883 of README.
2019-10-10 14:52:23 -04:00
Mark Browning
77b3ff9d38 Remove the stale cr MPI extension
Also removed text block from line 883 of README.

Signed-off-by: Mark Browning <marksbrowning3@gmail.com>
2019-10-10 13:24:30 -04:00
Jeff Squyres
f7ee4463b3
Merge pull request #7079 from CalebProvost/hacktoberfest
Edit README
2019-10-10 13:18:54 -04:00
Jeff Squyres
896ce76b64
Merge pull request #7082 from kizill/master
Fix ipv6 improper address copy bug
2019-10-10 12:01:44 -04:00
Jeff Squyres
8f3583d3bd
Merge pull request #7073 from Joe-Downs/pr/fix-README
README: edit "dist_graph topologies" to "communicator topologies"
2019-10-10 11:55:43 -04:00
Jeff Squyres
d736253079
Merge pull request #7074 from classicsman/pr/fix-README
Deleted paragraph
2019-10-10 11:50:38 -04:00
Jeff Squyres
836a0766ae
Merge pull request #7072 from bfitzgit23/pr/fix-README
README-fixed-bfitzgit23
2019-10-10 11:39:06 -04:00
CalebProvost
634054fb37 README: minor grammar fixes
Signed-off-by: CalebProvost <DHX664@gmail.com>
2019-10-10 11:23:55 -04:00
Rick Gleitz
0c923c5428 README: deleted stale paragraph about fca component
Signed-off-by: Rick Gleitz <rgleitz@jefflibrary.org>
2019-10-10 11:07:53 -04:00
Jeff Squyres
f77c3327d8
Merge pull request #7070 from Cfoster01/pr/fix-README
updated readme to remove double space on line 297
2019-10-10 11:06:52 -04:00
Jeff Squyres
69eca3c599
Merge pull request #7069 from summonholmes/master
Fix a typo: slopen -> dlopen
2019-10-10 11:06:12 -04:00
Jeff Squyres
f10e582f71
Merge pull request #7078 from santa65/pr/readme-fix
fix address
2019-10-10 10:32:27 -04:00
bfitzgit23
38da109217 README: Removed stale sentence about --enable-mpi-thread-multiple
Signed-off-by: bfitzgit23 <dfitz@me.com>
2019-10-10 10:22:09 -04:00
shanekimble
72b6292b69 Fix a typo: slopen -> dlopen
Signed-off-by: shanekimble <skimble@edjanalytics.com>
2019-10-10 10:04:09 -04:00
santa magar
3bbf870fde README: fix Knem URL
Signed-off-by: santa magar <santa65thapa@yahoo.com>
2019-10-10 09:34:33 -04:00
Stanislav Kirillov
0e0763e006
fix ipv6 btl connection bug
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2019-10-10 11:20:37 +00:00
Dennis Field
e72b93bf60 README: Remove info for plugins that aren't used anymore
Signed-off-by: Dennis Field <fury@xibase.com>
2019-10-09 20:38:50 -04:00
Joe Downs
dd6a4f3950 README: edit to "communicator topologies"
Signed-off-by: Joe Downs <joedowns502@gmail.com>
2019-10-09 20:22:26 -04:00
Foster
252e98c474 updated readme to remove double space on line 297
Signed-off-by: Foster <CCF6703@yum.com>
2019-10-09 20:17:46 -04:00
Geoff Paulsen
4e1e6f8972
Merge pull request #6993 from awlauria/fix_warnings_master
Fix miscellaneous compiler warnings.
2019-10-09 09:17:02 -05:00
Gilles Gouaillardet
096da7b3b5
Merge pull request #7061 from ggouaillardet/topic/ucx_zero_size_ddt
pml/ucx: correctly handle zero size datatypes
2019-10-09 17:28:18 +09:00
Gilles Gouaillardet
33361aa124 pml/ucx: correctly handle zero size datatypes
zero-size derived datatypes are now flagged as OPAL_DATATYPE_FLAG_CONTIGUOUS
so update mca_pml_ucx_init_datatype() to correctly handle them.
Since 'size' is a 'size_t', the assertion can simply be removed.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-09 16:54:00 +09:00
Gilles Gouaillardet
8906f8cdc6
Merge pull request #7062 from ggouaillardet/topic/travis_distcheck
travis: fix make distcheck
2019-10-09 13:30:14 +09:00
Gilles Gouaillardet
d37f35244f travis: fix make distcheck
bad side effect occurs when CPPFLAGS is set in the environment,
so set it (and LDFLAGS too) on the configure command line.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-09 12:57:54 +09:00
Jeff Squyres
3080033a8c btl/usnic: set retrans_timeout back down to 5ms
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-08 11:17:54 -07:00
Jeff Squyres
132e4cab3b btl/usnic: set ack_iteration_delay default to 4
It was previously accidentally set to 0.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-08 11:17:30 -07:00
Edgar Gabriel
8a3abbf803 MPIR_Status_set_bytes: fix for large count sizes
Change the ncounts argument to MPI_Count and use
MPI_Status_set_elements_x for enabling read/write operations beyond
the 2GB limit.

Thanks to Richard Warren from the HDF5 group for reporting the issue
and providing the suggested fix for romio.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-10-08 10:47:02 -05:00
Edgar Gabriel
ea7e1ea859
Merge pull request #7046 from edgargabriel/pr/hdf5-2gb-bug
comomn_ompio_file_read/write: fix 2GB limiting issue
2019-10-05 10:39:09 -05:00
Edgar Gabriel
a130f569df comomn_ompio_file_read/write: fix 2GB limiting issue
individual read/write operations exceeding 2GB fail in ompio
due to improper conversions from size_t to int in two different
locations. This commit fixes an issue reported by Richard Warren
from the HDF5 group.

Fixes Issue #397

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-10-05 09:50:02 -05:00
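Note on the commit above: the class of bug it describes (a size_t byte count passed through an int) can be sketched in isolation. This is not the ompio change itself; `clamp_to_int` and `count_chunks` are hypothetical names used only to show how a request larger than INT_MAX can be split into pieces that an int can represent.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: largest piece of nbytes that an int-typed
 * length argument can express without truncation. */
static size_t clamp_to_int(size_t nbytes)
{
    return nbytes > (size_t)INT_MAX ? (size_t)INT_MAX : nbytes;
}

/* Hypothetical helper: number of int-sized pieces needed to cover
 * a request of total bytes. */
static size_t count_chunks(size_t total)
{
    size_t chunks = 0;
    while (total > 0) {
        size_t piece = clamp_to_int(total); /* never exceeds INT_MAX */
        total -= piece;
        chunks++;
    }
    return chunks;
}
```

Keeping the length in a size_t all the way through avoids the need for clamping; the chunked form is only one way to stay correct when an int-typed interface cannot be changed.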
Jeff Squyres
a49ae7f034
Merge pull request #7038 from jsquyres/pr/usnic-fixes-and-optimizations
btl/usnic fixes and optimizations
2019-10-04 19:38:50 -04:00
Jeff Squyres
fe7f772f21 btl/usnic: properly size freelist items
Move the prefix area from the head to the body in relevant size
computations.  This fixes a problem in high traffic situations where
usNIC may have sent from unregistered memory.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 14:40:56 -07:00
Jeff Squyres
27e3040dfe btl/usnic: cap the number of resends per progress iteration
New MCA param: btl_usnic_max_resends_per_iteration.  This is the max
number of resends we'll do in a single pass through usNIC component
progress.  This prevents progress from getting stuck in an endless
loop of retransmissions (i.e., if more retransmissions are triggered
during the sending of retransmissions).  Specifically: we need to
leave the resend loop to allow receives to happen (which may ACK
messages we have sent previously, and therefore cause pending resends
to be moot).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
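Note on the commit above: the idea of capping resends per progress pass can be sketched as follows. This is a simplified model, not the usNIC BTL code; `resends_this_pass` and `passes_to_drain` are hypothetical names, and the cap argument stands in for the btl_usnic_max_resends_per_iteration parameter.

```c
#include <stddef.h>

/* Hypothetical helper: how many pending resends one progress pass
 * is allowed to handle under the cap. */
static size_t resends_this_pass(size_t pending, size_t max_per_iteration)
{
    return pending < max_per_iteration ? pending : max_per_iteration;
}

/* Hypothetical helper: number of progress passes needed to drain the
 * pending resends, assuming no new resends are queued meanwhile. */
static size_t passes_to_drain(size_t pending, size_t max_per_iteration)
{
    size_t passes = 0;
    if (max_per_iteration == 0) {
        return 0; /* a zero cap would never make progress */
    }
    while (pending > 0) {
        pending -= resends_this_pass(pending, max_per_iteration);
        passes++; /* between passes, receives (and ACKs) can run */
    }
    return passes;
}
```

The point of leaving the loop after each capped batch is that incoming ACKs processed between passes may make some of the remaining resends unnecessary.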
Jeff Squyres
3cc95d86b2 btl/usnic: increase default retrans_timeout
Significantly increase the default retrans timeout.  If the
retrans timeout is too soon, we can end up in a retransmission storm
where the logic will continually re-transmit the same frames during a
single run through the usNIC progress function (because the timer for
a single frame expires before we have run through re-transmitting all
the frames pending re-transmission).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
Jeff Squyres
968b1a51b5 btl/usnic: clarifications and fixes regarding ACKs
New MCA parameter: btl_usnic_ack_iteration_delay.  Set this to the
number of times through the usNIC component progress function before
sending a standalone ACK (vs. piggy-backing the ACK on any other send
going to the target peer).

Use "ticks" language to clarify that we're really counting the number
of times through the usNIC component DATA_CHANNEL completion check (to
check for incoming messages) -- it has no relation to wall clock time
whatsoever.

Also slightly change the channel-checking scheme in usNIC component
progress: only check the PRIORITY channel once (vs. checking it once,
not finding anything, and then falling through the progress_2() where we
check PRIORITY again and then check the DATA channel).

As before, if our "progress" libevent fires, increment the tick
counter enough to guarantee that all endpoints that need an ACK will
get triggered to send standalone ACKs the next time through progress,
if necessary.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
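Note on the commit above: counting "ticks" (passes through the progress function) instead of wall clock time can be sketched like this. The function name and its arguments are hypothetical; delay_ticks stands in for the btl_usnic_ack_iteration_delay parameter, and the real component tracks this state per endpoint.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: a standalone ACK is due once at least
 * delay_ticks progress passes have elapsed since the ACK became
 * necessary without a regular send to piggy-back it on. */
static bool standalone_ack_due(uint64_t now_ticks,
                               uint64_t ack_needed_at,
                               uint64_t delay_ticks)
{
    return now_ticks - ack_needed_at >= delay_ticks;
}
```

With a delay of 0 the check is true immediately, so every needed ACK is sent standalone and none waits to be piggy-backed.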
Jeff Squyres
ce2910a28a btl/usnic: s/get_nsec/get_nticks/g
Rename "get_nsec()" to "get_ticks()" to more accurately reflect that
this function has no correlation to wall clock time at all.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
Jeff Squyres
f3429d7a44 btl/usnic: pack a wire data struct
Might as well save a few bytes when sending this struct across the
network via the __opal_attribute_packed__ attribute.

That being said, also re-order the elements in this struct so that
there's no holes to begin with.  Do this so that the compiler/runtime
won't effect (slow) unaligned reads/writes because of the
__opal_attribute_packed__ attribute.

The "packed" attribute is really more about defensive programming
(e.g., if we make a mistake and have a hole, "packed" will remove it
for us).

*** Do not bring this commit back to existing/already-released release
branches: it will cause incompatibility, since it effectively changes
the usNIC BTL wire protocol.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
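Note on the commit above: the effect of reordering struct members so that "packed" has nothing left to remove can be sketched as follows. Both structs are hypothetical and are not the usNIC wire struct; the sketch assumes GCC or Clang and uses `__attribute__((packed))` directly rather than the `__opal_attribute_packed__` macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with holes: the compiler inserts padding
 * before seq and len so that each stays naturally aligned. */
struct wire_with_holes {
    uint8_t  type;
    uint64_t seq;
    uint8_t  flags;
    uint32_t len;
    uint16_t channel;
};

/* Same members ordered largest first: every member already sits at
 * a naturally aligned offset, so packing changes nothing here and
 * only guards against a hole being introduced later. */
struct wire_reordered {
    uint64_t seq;     /* offset 0  */
    uint32_t len;     /* offset 8  */
    uint16_t channel; /* offset 12 */
    uint8_t  type;    /* offset 14 */
    uint8_t  flags;   /* offset 15 */
} __attribute__((packed));
```

Comparing sizeof on the two structs shows the padding that the first ordering costs; the reordered struct is exactly the sum of its member sizes.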
Josh Hursey
b774b47428
Merge pull request #7032 from jjhursey/fix-sigkill-wait
Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion
2019-10-02 14:48:27 -05:00
Joshua Hursey
0e8a97c598 Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion.
 * The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait
   30 seconds before sending SIGTERM then another 30 seconds before sending
   SIGKILL to remaining processes. This usually happens on an abnormal
   termination. Sometimes the user wants to delay the cleanup to give the
   system time to write out corefile or run other diagnostics.
 * The problem is that child processes may be completing while ORTE is
   in this loop. The SIGCHLD will interrupt the `sleep` system call.
   Without the loop the sleep could effectively be ignored in this case.
   - Sleep returns the amount of time remaining to sleep. If it was
     interrupted by a signal then it is a positive number less than or
     equal to the parameter passed to it. If it slept the whole time
     then it returns 0.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-10-02 14:41:34 -04:00
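Note on the commit above: the loop it describes can be sketched with the POSIX sleep() call, which returns 0 after a full sleep or the number of unslept seconds when a signal handler interrupts it. The helper name `sleep_fully` is hypothetical and this is not the ORTE code; it only shows the retry pattern.

```c
#include <unistd.h>

/* Hypothetical helper: sleep for the whole interval even if signals
 * such as SIGCHLD interrupt sleep() early. Returns how many times
 * the sleep was interrupted. */
static unsigned int sleep_fully(unsigned int seconds)
{
    unsigned int interruptions = 0;
    unsigned int remaining = seconds;

    while (remaining > 0) {
        /* sleep() returns the unslept seconds if interrupted,
         * or 0 if the full interval elapsed. */
        unsigned int left = sleep(remaining);
        if (left > 0) {
            interruptions++;
        }
        remaining = left; /* retry with whatever time is left */
    }
    return interruptions;
}
```

A single unguarded sleep() call would return early on the first SIGCHLD, which is how the configured timeout could end up effectively ignored.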
Austen Lauria
0d4004cc3c Fix miscellaneous compiler warnings.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2019-10-01 16:27:25 -04:00
Jeff Squyres
7ddfa6950b
Merge pull request #7028 from jsquyres/pr/fix-ofi-mtl
mtl/ofi: replace OMPI_UNLIKELY with OPAL version
2019-10-01 13:40:21 -04:00
Howard Pritchard
d6d73b7724 mtl/ofi: replace OMPI_UNLIKELY with OPAL version
one off patch for v4.0.x.  for some reason commit on master
didn't have this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 5f3dbdb5c8)

Note that this commit is actually a cherry-pick from the v4.0.x
branch.  This is the opposite direction than what we normally do: we
usually commit to master first and then cherry-pick to the release
branches (vs. the other way around).

As is probably evident from the original commit message above, through
a comedy of errors, this commit was actually applied to the v4.0.x
branch first and then cherry-picked back to master (i.e., the problem
*did* exist in the original master commit 3aca4af548, but it was not
recognized at the time).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-01 09:52:27 -07:00
Gilles Gouaillardet
280856928a
Merge pull request #7026 from ggouaillardet/topic/openpmix_refresh
pmix/pmix4x: refresh to the latest open PMIx master
2019-10-01 18:15:53 +09:00
Gilles Gouaillardet
1c4a3598d0 pmix/pmix4x: refresh to the latest open PMIx master
refresh to openpmix/openpmix@ea3b29b1a4

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-01 14:27:22 +09:00
Ralph Castain
f4371f7f94
Merge pull request #7022 from rhc54/topic/oob2
Remove stale references to orte_oob_base.ev_base
2019-09-29 19:53:12 -07:00
Ralph Castain
7444b32494
Remove stale references to orte_oob_base.ev_base
The oob is restricted to the main event base

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-29 18:52:03 -07:00