1
1

30312 Коммитов

Автор SHA1 Сообщение Дата
Josh Hursey
b50d568004
Merge pull request #7191 from gvallee/returncode_check
Add the missing code to check a return code
2019-11-26 09:02:23 -06:00
Josh Hursey
f828990bcf
Merge pull request #7192 from gvallee/contiaing_typo
Fix typo in comment: contiaing -> containing
2019-11-25 14:40:08 -06:00
Edgar Gabriel
eaf0828643
Merge pull request #7104 from edgargabriel/topic/gpfs
Topic/gpfs
2019-11-25 12:32:11 -06:00
raafatfeki
7b2d83c898 mca/fs: Remove unused functions and prototypes & reduce recurent code through all components
I removed the implementation and/or prototypes of all unused functions defined for all components.
To reduce recurrent code, I created functions under base for the management of error codes and setting of file permission and amode.
Then, I replaced these recurrent code by those function for all components.

Signed-off-by: raafatfeki <fekiraafat@gmail.com>

add a missing header file

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-11-25 09:01:38 -06:00
raafatfeki
5988297bd8 fs/gpfs: Assign file creation to one process and management of error codes translation.
When O_CREAT and O_EXCL are set, open() shall fail if the file exists. Therefore, I assigned the file creation to the root process only.
I also translated the errno codes to their corresponding MPI error codes.

Signed-off-by: raafatfeki <fekiraafat@gmail.com>
2019-11-25 09:01:38 -06:00
raafatfeki
a2b1a03e7b fs/gpfs: Update component to compile with latest gpfs versions
I updated the gpfs component to match the latest structures and functions definitions and remove compiler wranings.
Disabled mpi info setting for the gpfs option "Data Shipping" since it is no longer supported in the latest gpfs versions.

Signed-off-by: raafatfeki <fekiraafat@gmail.com>
2019-11-25 09:01:38 -06:00
Edgar Gabriel
715a6e8c27 fs/gpfs: update component and module file
to match what we are doing meanwhile in the other fs compomenents.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-11-25 09:01:38 -06:00
Edgar Gabriel
8d31ba3a09 fs/gpfs: update configure logic
update the configure logic of the gpfs component
based on what we learned from user feedback over the last
two years for the other components

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-11-25 09:01:38 -06:00
XuanWang1982
cd76e53479 Changed parameter name from siox to gpfs
Adding get_info function for gpfs module
annotate printf information

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-11-25 09:01:38 -06:00
Christoph Niethammer
d3c11396fd First code review round of the GPFS module.
Delete check for amode which should go to a higler layer, e.g. ompi_file_open.
Only perform Info value check if key is found.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-11-25 09:01:38 -06:00
XuanWang1982
b1dc58eeb2 First version for GPFS module. To be tested
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-11-25 09:01:38 -06:00
Jeff Squyres
25aa2b9a8c
Merge pull request #7190 from gvallee/insertted_is_inserted
Fix a typo in comments: insertted -> inserted
2019-11-24 12:17:05 -05:00
Geoffroy Vallee
127573cf44
Fix typo in comment: contiaing -> containing
Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>
2019-11-24 11:06:41 -05:00
Geoffroy Vallee
de6f130b4a
Add the missing code to check a return code
Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>
2019-11-24 11:04:10 -05:00
Geoffroy Vallee
98de17c6da
Fix a type in comments: insertted -> inserted
Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>
2019-11-24 00:36:06 -05:00
Jeff Squyres
7bcd5fc13d SQUASHME Minor code de-duplication
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-11-23 08:18:02 -08:00
Tomislav Janjusic
ebb985dea9 oshmem:ucx, fix race condition and add context recycling
1) Race condition: Do not add private contexts to active list.
Private contexts are only visible to the user.
2) Recycled contexts: Destroyed contexts are put on an idle list until
finalize, continuous context creation will lead to oom condition.
Instead, check if context from idle list meets new context requirements
and reuse it.

Co-authored with: Artem Y. Polyakov <artemp@mellanox.com>,
                  Manjunath Gorentla Venkata <manjunath@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2019-11-23 04:39:05 +02:00
Howard Pritchard
ebf003581b
Merge pull request #7183 from hppritcha/topic/quiet_lustre
lustre: squash some compiler warnings
2019-11-22 19:19:08 -07:00
Howard Pritchard
75749d9a8b
Merge pull request #7185 from hppritcha/topic/pacify_clang_compiler
clang: pacify clang wrt ompi use of c11 atomics
2019-11-22 19:18:39 -07:00
Jeff Squyres
0aa8ee3bd7
Merge pull request #7075 from bartonski/pr/my_awsum_fix
Remove mxm
2019-11-22 16:23:08 -07:00
Howard Pritchard
441bad9a75 cray ftn: modify fortran module loc checker
to support the Cray Fortran compiler.  Cray Fortran compiler does not
contain all symbol info in the module file, have to link with the *.o
created as part of module file compilation.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-11-22 16:13:52 -06:00
Barton Chittenden
47816ef83b Remove mxm, yalla and ikrit
Signed-off-by: Barton Chittenden <bartonski@gmail.com>
2019-11-22 13:40:16 -08:00
Michael Lass
67490118ad dims_create: fix calculation of factors for odd squares
Until now sqrt(n) was missed as a factor for odd square numbers n. This
lead to suboptimal results of MPI_Dims_create for input numbers like 9,
25, 49, ... Fix the results by including sqrt(n) in the search for
factors.

Refs: #7186

Signed-off-by: Michael Lass <bevan@bi-co.net>
2019-11-22 20:12:58 +01:00
Edgar Gabriel
451cbdcc7a
Merge pull request #7126 from edgargabriel/pr/two-phase-aggr-calc-32bits-bug
fcoll/two_phase: fix error in calculating aggregators in 32bit mode
2019-11-22 09:49:13 -07:00
Edgar Gabriel
ea1355beae fcoll/two_phase: fix error in calculating aggregators in 32bit mode
In fcoll_two_phase_supprot_fns.c: calculation of the aggregator index
failed for large offsets on 32bit machine, due to improper handling of
64bit offsets.

Fixes Issue #7110

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2019-11-22 09:45:19 -06:00
Howard Pritchard
7c1aafb6e1 clang: pacify clang wrt ompi use of c11 atomics
The clang compiler (or at least the one used by the Cray CCE 9 and newer)
doesn't like what we're doing passing non _Atomic pointers to C11 atomics.
Fix the ones that keep vader from compiling using Cray CCE 9 and 10.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-11-21 15:14:45 -06:00
Howard Pritchard
e66a7cef11 lustre: squash some compiler warnings
Compiling OMPI on cray systems using latest Cray compilers (clang based)
yielded some compiler warnings from ompio/lustre.  Squash these warnings.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-11-21 14:14:40 -06:00
Geoff Paulsen
119b1c312e
Merge pull request #7170 from jjhursey/jsm-schizo
Add detection for JSM direct launch
2019-11-18 08:53:28 -06:00
Joshua Hursey
4f1de51371
Add detection for JSM direct launch
* Adds the `schizo/jsm` component that detects if the process was
   direct launched with IBM's Job Step Manager (JSM). JSM is a PMIx
   enhanced runtime environment so flag it as such.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-11-17 08:56:13 -06:00
Maxwell Coil
9b73f6ac83 Fix misleading error message with missing #! interpreter
This change fixes the misleading error message. I added a conditional to
determine whether the error is due to a missing file or a bad interpreter.
If it is the latter, a new, more precise error message will be displayed.

Fixes #4528

Signed-off-by: Maxwell Coil <mcoil@nd.edu>
2019-11-14 18:06:11 -05:00
Geoff Paulsen
9e17f028b5
Merge pull request #5507 from markalle/prot_table
Adding -mca comm_method to print table of communication methods
2019-11-14 14:24:06 -06:00
Ralph Castain
db52da40c3
Merge pull request #7162 from rhc54/topic/iof
Update IOF redirection options
2019-11-13 14:37:41 -07:00
Austen Lauria
edcd6d8aeb
Merge pull request #7146 from bgoglin/master
fix typos hlwoc->hwloc
2019-11-13 15:38:00 -05:00
Ralph Castain
b0a487a3c7
Update IOF redirection options
Provide both "--output-directory" and "--output-filename" options but do
not allow both to be given at the same time. Output-directory allows
specification of a directory, with output redirected into files of form
"<directory>/<jobid>/rank.<vpid>/stdout[err]". This option also supports two
directives: nojobid (removes the jobid directory layer) and nocopy (do
not copy the output to the terminal).

Output-filename is the "old" behavior that names the output files as
"<filename>.rank" with both stdout and stderr redirected into it. This
option only supports one directive: nocopy (do not copy the output to
the terminal).

Fix both the --help and man documentation.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-11-13 12:13:06 -08:00
Nathan Hjelm
09dd383f8b
Merge pull request #7108 from devreal/btl-ugni-deadlock
uGNI: Fix potential deadlock when processing outstanding transfers
2019-11-11 10:56:56 -08:00
George Bosilca
3de636dc6f Swap the 2 fields to maintain the size of the struct.
Thanks @devreal for catching this.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-11-07 11:25:03 -05:00
George Bosilca
59fb02618e Prevent overflow when dealing with datatype count.
This patch fixes #7147 by preventing overflow when multiplying
the count and the blocklen. The count reflects MPI count and is
therefore bound to the size of an int (it is an uint32_t) while the
blocklen can be merged together to represent the largest contiguous
memory layout and it is therefore promoted to a size_t.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-11-07 11:25:03 -05:00
Yossi Itigin
40e2fbb093
Merge pull request #7114 from brminich/topic/mlx_scat_tuning
COLL/TUNED: Add linear scatter using isend for mlnx platform
2019-11-07 13:38:21 +02:00
Mikhail Brinskii
f2cbd4806e COLL/TUNED: Add linear scatter using isend for mlnx platform
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
2019-11-07 11:04:39 +02:00
Howard Pritchard
6fc5a4e033
Merge pull request #7142 from hppritcha/topic/swat_issue_7128
btl/uct: restrict to using UCT 1.7 or older API for now
2019-11-06 16:07:54 -07:00
Howard Pritchard
9d345d9aa0 btl/uct: add UCT API version check to configury
related to #7128

The UCX crew is no longer guaranteeing that the UCT API is going to be frozen,
so this is kind of a whack-a-mole problem trying to keep the BTL UCT working
with various changing UCT APIs.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-11-06 14:27:58 -07:00
Brice Goglin
5c6bd7ea4e fix typos hlwoc->hwloc
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2019-11-06 10:42:36 +01:00
Austen Lauria
1c33d5aecd
Merge pull request #7141 from devreal/shmem_memheap_alloc_band
Shmem: use bitwise and instead of logical and to check for allocator capabilities
2019-11-05 18:21:23 -05:00
Nathan Hjelm
30171cbd27
Merge pull request #7144 from hjelmn/btl_uct_fix_compilation_issue_for_ucx_1_7_because_the_api_break_got_into_this_release
btl/uct: fix compilation for UCX 1.7.0
2019-11-05 14:54:30 -08:00
Nathan Hjelm
a3026c016a btl/uct: fix compilation for UCX 1.7.0
Ref #7128

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2019-11-05 12:53:26 -08:00
Joseph Schuchart
9f2c6a42c3 Shmem: use bitwise and instead of logical and to check for allocator capabilities
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-11-05 15:44:59 +01:00
bosilca
6e3ff5e9b4
Merge pull request #7131 from bosilca/fix/tcp_errors
Correctly report TCP connect errors.
2019-10-31 22:12:27 -04:00
George Bosilca
476562752f
Correctly report TCP connect errors.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-10-31 18:33:15 -04:00
Mark Allen
6855ebb84b Adding -mca comm_method to print table of communication methods
This is closely related to Platform-MPI's old -prot feature.

The long-format of the tables it prints could look like this:
>   Host 0 [myhost001] ranks 0 - 1
>   Host 1 [myhost002] ranks 2 - 3
>   Host 2 [myhost003] ranks 4
>   Host 3 [myhost004] ranks 5
>   Host 4 [myhost005] ranks 6
>   Host 5 [myhost006] ranks 7
>   Host 6 [myhost007] ranks 8
>   Host 7 [myhost008] ranks 9
>   Host 8 [myhost009] ranks 10
>
>    host | 0    1    2    3    4    5    6    7    8
>   ======|==============================================
>       0 : sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       1 : tcp  sm   tcp  tcp  tcp  tcp  tcp  tcp  tcp
>       2 : tcp  tcp  self tcp  tcp  tcp  tcp  tcp  tcp
>       3 : tcp  tcp  tcp  self tcp  tcp  tcp  tcp  tcp
>       4 : tcp  tcp  tcp  tcp  self tcp  tcp  tcp  tcp
>       5 : tcp  tcp  tcp  tcp  tcp  self tcp  tcp  tcp
>       6 : tcp  tcp  tcp  tcp  tcp  tcp  self tcp  tcp
>       7 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  self tcp
>       8 : tcp  tcp  tcp  tcp  tcp  tcp  tcp  tcp  self
>
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

In this example hosts 0 and 1 had multiple ranks so "sm" was more
meaningful than "self" to identify how the ranks on the host are
talking to each other. While host 2..8 were one rank per host so
"self" was more meaningful as their btl.

Above a certain number of hosts (12 by default) the above table gets too big
so we shrink to a more abbreviated looking table that has the same data:
>    host | 0 1 2 3 4       8
>   ======|====================
>       0 : A C C C C C C C C
>       1 : C A C C C C C C C
>       2 : C C B C C C C C C
>       3 : C C C B C C C C C
>       4 : C C C C B C C C C
>       5 : C C C C C B C C C
>       6 : C C C C C C B C C
>       7 : C C C C C C C B C
>       8 : C C C C C C C C B
>   key: A == sm
>   key: B == self
>   key: C == tcp

Then above 36 hosts we stop printing the 2d table entirely and just print the
summary:
>   Connection summary:
>     on-host:  all connections are sm or self
>     off-host: all connections are tcp

The options to control it are
    -mca comm_method 1   :   print the above table at the end of MPI_Init
    -mca comm_method 2   :   print the above table at the beginning of MPI_Finalize
    -mca comm_method_max <n> :  number of hosts <n> for which to print a full size 2d
    -mca comm_method_brief 1 :  only print summary output, no 2d table
    -mca comm_method_fakefile <filename> :  for debugging only

* printing at init vs finalize:

The most important difference between these two is that when printing the table
during MPI_Init(), we send extra messages to make sure all hosts are connected to
each other. So the table ends up working against the idea of on-demand connections
(although it's only forcing the n^2 connections in the number of hosts, not the
total ranks).  If printing at MPI_Finalize() we don't create any connections that
aren't already connected, so the table is more likely to have "n/a" entries if
some hosts never connected to each other.

* how many hosts <n> for which to print a full size 2d table

The option -mca comm_method_max <n> can be used to specify a number of hosts <n>
(default 12) that controls at what host-count the unabbreviated / abbreviated
2d tables get printed:
    1 - n      : full size 2d table
    n+1 - 3n   : shortened 2d table
    3n+1 - inf : summary only, no 2d table

* brief

The option -mca comm_method_brief 1 can be used to skip the printing of the 2d
table and only show the short summary

* fakefile

This is a debugging option that allows easeir testing of all the printout
routines by letting all the detected communication methods between the hosts
be overridden by fake data from a file.

The source of the information used in the table is the .mca_component_name

In the case of BTLs, the module always had a .btl_component linking back to the
component. The vars mca_pml_base_selected_component and ompi_mtl_base_selected_component
offer similar functionality for pml/mtl.

So with the ability to identify the component, we can then access
the component name with code like this
    mca_pml_base_selected_component.pmlm_version.mca_component_name
See the three lookup_{pml,mtl,btl}_name() functions in hook_comm_method_fns.c,
and their use in comm_method() to parse the strings and produce an integer
to represent the connection type being used.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2019-10-31 16:23:57 -04:00
Joseph Schuchart
5471d59af4 Fix OPAL_ALIGN_MIN to work on 32-bit systems
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-10-30 10:33:10 +01:00