Commit graph

30156 commits

Author SHA1 Message Date
Mark Allen
6855ebb84b Adding -mca comm_method to print table of communication methods
This is closely related to Platform-MPI's old -prot feature.

The long format of the table it prints could look like this:
>   Host 0 [myhost001] ranks 0 - 1
>   Host 1 [myhost002] ranks 2 - 3
>   Host 2 [myhost003] ranks 4
>   Host 3 [myhost004] ranks 5
>   Host 4 [myhost005] ranks 6
>   Host 5 [myhost006] ranks 7
>   Host 6 [myhost007] ranks 8
>   Host 7 [myhost008] ranks 9
>   Host 8 [myhost009] ranks 10
>
>    host | 0    1    2    3    4    5    6    7    8
>   ======|==============================================
>       0 : sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       1 : tcp  sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       2 : tcp  tcp  self tcp  tcp  tcp  tcp  tcp  tcp
>       3 : tcp  tcp  tcp  self tcp  tcp  tcp  tcp  tcp
>       4 : tcp  tcp  tcp  tcp  self tcp  tcp  tcp  tcp
>       5 : tcp  tcp  tcp  tcp  tcp  self tcp  tcp  tcp
>       6 : tcp  tcp  tcp  tcp  tcp  tcp  self tcp  tcp
>       7 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  self tcp
>       8 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp  self
>
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

In this example, hosts 0 and 1 each had multiple ranks, so "sm" was more
meaningful than "self" for identifying how the ranks on those hosts talk to
each other. Hosts 2..8 each had a single rank, so "self" was the more
meaningful BTL to show for them.

Above a certain number of hosts (12 by default) the full table gets too big,
so we switch to an abbreviated table that shows the same data:
>    host | 0 1 2 3 4       8
>   ======|====================
>       0 : A C C C C C C C C
>       1 : C A C C C C C C C
>       2 : C C B C C C C C C
>       3 : C C C B C C C C C
>       4 : C C C C B C C C C
>       5 : C C C C C B C C C
>       6 : C C C C C C B C C
>       7 : C C C C C C C B C
>       8 : C C C C C C C C B
>   key: A == sm
>   key: B == self
>   key: C == tcp

Then above 36 hosts we stop printing the 2d table entirely and just print the
summary:
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

The options to control it are:
    -mca comm_method 1   :   print the above table at the end of MPI_Init
    -mca comm_method 2   :   print the above table at the beginning of MPI_Finalize
    -mca comm_method_max <n> :  maximum number of hosts <n> for which to print a full-size 2d table
    -mca comm_method_brief 1 :  only print summary output, no 2d table
    -mca comm_method_fakefile <filename> :  for debugging only

* printing at init vs finalize

The most important difference between these two is that when printing the table
during MPI_Init(), we send extra messages to make sure all hosts are connected to
each other. So the table ends up working against the idea of on-demand connections
(although it only forces on the order of n^2 connections, where n is the number of
hosts rather than the total number of ranks). When printing at MPI_Finalize() we
don't create any connections that didn't already exist, so the table is more likely
to have "n/a" entries if some hosts never connected to each other.

* how many hosts <n> for which to print a full-size 2d table

The option -mca comm_method_max <n> can be used to specify a number of hosts <n>
(default 12) that controls at which host count the unabbreviated or abbreviated
2d table gets printed:
    1 - n      : full size 2d table
    n+1 - 3n   : shortened 2d table
    3n+1 - inf : summary only, no 2d table

* brief

The option -mca comm_method_brief 1 can be used to skip printing the 2d
table and only show the short summary.

* fakefile

This is a debugging option that allows easier testing of all the printout
routines by letting all the detected communication methods between the hosts
be overridden by fake data from a file.

The source of the information used in the table is the .mca_component_name field of each component.

In the case of BTLs, the module always had a .btl_component linking back to the
component. The variables mca_pml_base_selected_component and ompi_mtl_base_selected_component
offer similar functionality for the PML and MTL.

With the component identified, we can access its name with code
like this:
    mca_pml_base_selected_component.pmlm_version.mca_component_name
See the three lookup_{pml,mtl,btl}_name() functions in hook_comm_method_fns.c,
and their use in comm_method() to parse the strings and produce an integer
to represent the connection type being used.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2019-10-31 16:23:57 -04:00
Gilles Gouaillardet
631a43581f
Merge pull request #7117 from ggouaillardet/topic/f08_bind_c_constants_revamp_misc_fixes
fortran/use-mpi-f08: misc fixes
2019-10-30 10:42:33 +09:00
Edgar Gabriel
007b773cd7
Merge pull request #7122 from edgargabriel/pr/simple-aggr-mode-fix
common/ompio: fix calculation in simple-grouping option
2019-10-29 13:38:24 -05:00
Josh Hursey
312c55edaa
Merge pull request #7092 from sam6258/smiller_rsh_chdir
plm/rsh: Add chdir option to change directory before orted exec
2019-10-29 13:34:25 -05:00
Edgar Gabriel
ad5d0df4e9 common/ompio: fix calculation in simple-grouping option
This is based on a bug reported on the mailing list using a netcdf testcase.
The problem occurs if processes are using a custom file view, but on some
of them it appears as if the default file view is being used. Because of that,
the simple-grouping option led to a different number of aggregators on different
processes, and ultimately to a deadlock. This patch fixes the problem by no longer
using the file_view size for the calculation in the simple-grouping option,
and using the contiguous chunk size instead (which is identical on all processes).

Fixes issue #7109

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-10-29 12:30:41 -05:00
Gilles Gouaillardet
fda4d040da fortran/use-mpi-f08: misc fixes
- fix typos from open-mpi/ompi@b10a60a5a9
- remove remaining references to OMPI_PROTECTED from open-mpi/ompi@df6d763a53

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-29 15:00:51 +09:00
Jeff Squyres
8343a289f2
Merge pull request #7112 from wbailey2/pr/fix-HACKING
Changed the final URL to https://github.com/westes/flex
2019-10-28 09:25:21 -04:00
Jeff Squyres
e59e6f714c
Merge pull request #7105 from ggouaillardet/topic/f08_bind_c_constants_revamp
fortran/use-mpi-f08: revamp mpi_f08 constants
2019-10-28 09:14:23 -04:00
William Bailey
caf1d9292c Changed the final URL to https://github.com/westes/flex
Signed-off-by: William Bailey <wbailey2@nd.edu>
2019-10-27 22:33:50 -04:00
Gilles Gouaillardet
51e23f8cb6 fortran/use-mpi-f08: remove bind(C) constants.
Remove unused bind(C) constants in ompi/mpi/fortran/use-mpi-f08/constants.{c,h}
(and break ABI compatibility).

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-28 10:28:17 +09:00
Gilles Gouaillardet
df6d763a53 configury: remove references to unused OMPI_PROTECTED
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-28 10:28:17 +09:00
Gilles Gouaillardet
b10a60a5a9 fortran/use-mpi-f08: revamp constant declarations
In order to work around an issue with flang-based compilers,
avoid declaring bind(C) constants and use plain Fortran parameters
instead.

For example,
type(MPI_Comm), bind(C, name="ompi_f08_mpi_comm_world") OMPI_PROTECTED :: MPI_COMM_WORLD
is changed to
type(MPI_Comm), parameter :: MPI_COMM_WORLD = MPI_Comm(OMPI_MPI_COMM_WORLD)

Note that in order to preserve ABI compatibility, ompi/mpi/fortran/use-mpi-f08/constants.{c,h}
have been kept even though their symbols are no longer referenced by Open MPI.

Refs. open-mpi/ompi#7091

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-28 10:01:17 +09:00
Austen Lauria
aa8be9c12d
Merge pull request #6284 from devreal/ompi-rdma-memalign
Ensure proper alignment of memory provided by MPI
2019-10-25 12:27:58 -04:00
Austen Lauria
ecd990a67c
Merge pull request #6933 from devreal/osc-ucx-excl-lock
UCX osc: properly release exclusive lock to avoid lockup
2019-10-25 09:16:51 -04:00
Austen Lauria
96f55b0b32
Merge pull request #7096 from jsquyres/pr/fix-alps-configure-output
opal_check_alps: fix configure output
2019-10-25 09:02:25 -04:00
Jeff Squyres
d8f17aea69
Merge pull request #7097 from mcoil1/pr/README-fix2
README: Use "--" notation for CLI options
2019-10-18 16:32:29 -04:00
Maxwell Coil
7e07346524 README: Use "--" notation for CLI options
Signed-off-by: Maxwell Coil <mcoil@nd.edu>
2019-10-18 15:44:23 -04:00
Jeff Squyres
26705efad0 opal_check_alps: fix configure output
There was a path where OPAL_CHECK_ALPS would exit its testing but
still leave `opal_check_cray_alps_happy` blank.  Fix that by setting
it to "no".

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-18 11:30:00 -07:00
Edgar Gabriel
dce203ffc6
Merge pull request #7057 from edgargabriel/topic/romio321-status-set-elements-fix
MPIR_Status_set_bytes: fix for large counts
2019-10-18 08:16:36 -05:00
Nathan Hjelm
b1ef5a40fa
Merge pull request #7016 from hjelmn/fix_btl_uct_from_yet_another_unannounced_api_break_in_the_openucx_uct_layer
btl/uct: add support for OpenUCX v1.8 API changes
2019-10-17 06:27:18 -07:00
Scott Miller
c1b8599528 plm/rsh: Add chdir option to change directory before orted exec
Signed-off-by: Scott Miller <scott.miller1@ibm.com>
2019-10-15 17:19:30 -04:00
Jeff Squyres
b6c4d5c118
Merge pull request #7060 from jsquyres/pr/usnic-mca-updates
BTL usnic MCA updates
2019-10-15 10:48:10 -04:00
Jeff Squyres
e1e6d8b85e
Merge pull request #7076 from ftab/pr/my-superlative-fix
README: Remove info for plugins that aren't used anymore
2019-10-10 14:52:36 -04:00
Jeff Squyres
65fd12feff
Merge pull request #7081 from msbrowning/pr/fixed-README
Removed text block from line 883 of README.
2019-10-10 14:52:23 -04:00
Mark Browning
77b3ff9d38 Remove the stale cr MPI extension
Also removed text block from line 883 of README.

Signed-off-by: Mark Browning <marksbrowning3@gmail.com>
2019-10-10 13:24:30 -04:00
Jeff Squyres
f7ee4463b3
Merge pull request #7079 from CalebProvost/hacktoberfest
Edit README
2019-10-10 13:18:54 -04:00
Jeff Squyres
896ce76b64
Merge pull request #7082 from kizill/master
Fix ipv6 improper address copy bug
2019-10-10 12:01:44 -04:00
Jeff Squyres
8f3583d3bd
Merge pull request #7073 from Joe-Downs/pr/fix-README
README: edit "dist_graph topologies" to "communicator topologies"
2019-10-10 11:55:43 -04:00
Jeff Squyres
d736253079
Merge pull request #7074 from classicsman/pr/fix-README
Deleted paragraph
2019-10-10 11:50:38 -04:00
Jeff Squyres
836a0766ae
Merge pull request #7072 from bfitzgit23/pr/fix-README
README-fixed-bfitzgit23
2019-10-10 11:39:06 -04:00
CalebProvost
634054fb37 README: minor grammar fixes
Signed-off-by: CalebProvost <DHX664@gmail.com>
2019-10-10 11:23:55 -04:00
Rick Gleitz
0c923c5428 README: deleted stale paragraph about fca component
Signed-off-by: Rick Gleitz <rgleitz@jefflibrary.org>
2019-10-10 11:07:53 -04:00
Jeff Squyres
f77c3327d8
Merge pull request #7070 from Cfoster01/pr/fix-README
updated readme to remove double space on line 297
2019-10-10 11:06:52 -04:00
Jeff Squyres
69eca3c599
Merge pull request #7069 from summonholmes/master
Fix a typo: slopen -> dlopen
2019-10-10 11:06:12 -04:00
Jeff Squyres
f10e582f71
Merge pull request #7078 from santa65/pr/readme-fix
fix address
2019-10-10 10:32:27 -04:00
bfitzgit23
38da109217 README: Removed stale sentence about --enable-mpi-thread-multiple
Signed-off-by: bfitzgit23 <dfitz@me.com>
2019-10-10 10:22:09 -04:00
shanekimble
72b6292b69 Fix a typo: slopen -> dlopen
Signed-off-by: shanekimble <skimble@edjanalytics.com>
2019-10-10 10:04:09 -04:00
santa magar
3bbf870fde README: fix Knem URL
Signed-off-by: santa magar <santa65thapa@yahoo.com>
2019-10-10 09:34:33 -04:00
Stanislav Kirillov
0e0763e006 fix ipv6 btl connection bug
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2019-10-10 11:20:37 +00:00
Dennis Field
e72b93bf60 README: Remove info for plugins that aren't used anymore
Signed-off-by: Dennis Field <fury@xibase.com>
2019-10-09 20:38:50 -04:00
Joe Downs
dd6a4f3950 README: edit to "communicator topologies"
Signed-off-by: Joe Downs <joedowns502@gmail.com>
2019-10-09 20:22:26 -04:00
Foster
252e98c474 updated readme to remove double space on line 297
Signed-off-by: Foster <CCF6703@yum.com>
2019-10-09 20:17:46 -04:00
Geoff Paulsen
4e1e6f8972
Merge pull request #6993 from awlauria/fix_warnings_master
Fix miscellaneous compiler warnings.
2019-10-09 09:17:02 -05:00
Gilles Gouaillardet
096da7b3b5
Merge pull request #7061 from ggouaillardet/topic/ucx_zero_size_ddt
pml/ucx: correctly handle zero size datatypes
2019-10-09 17:28:18 +09:00
Gilles Gouaillardet
33361aa124 pml/ucx: correctly handle zero size datatypes
Zero-size derived datatypes are now flagged as OPAL_DATATYPE_FLAG_CONTIGUOUS,
so update mca_pml_ucx_init_datatype() to handle them correctly.
Since 'size' is a 'size_t', the assertion can simply be removed.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-09 16:54:00 +09:00
Gilles Gouaillardet
8906f8cdc6
Merge pull request #7062 from ggouaillardet/topic/travis_distcheck
travis: fix make distcheck
2019-10-09 13:30:14 +09:00
Gilles Gouaillardet
d37f35244f travis: fix make distcheck
A bad side effect occurs when CPPFLAGS is set in the environment,
so set it (and LDFLAGS too) on the configure command line instead.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-09 12:57:54 +09:00
Jeff Squyres
3080033a8c btl/usnic: set retrans_timeout back down to 5ms
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-08 11:17:54 -07:00
Jeff Squyres
132e4cab3b btl/usnic: set ack_iteration_delay default to 4
It was previously accidentally set to 0.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-08 11:17:30 -07:00
Edgar Gabriel
8a3abbf803 MPIR_Status_set_bytes: fix for large count sizes
Change the ncounts argument to MPI_Count and use
MPI_Status_set_elements_x for enabling read/write operations beyond
the 2GB limit.

Thanks to Richard Warren from the HDF5 group for reporting the issue
and providing the suggested fix for romio.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-10-08 10:47:02 -05:00