
30358 Commits

Author SHA1 Message Date
Austen Lauria
3c0fa537c2
Merge pull request #7113 from devreal/osc_rdma_cas
RDMA osc: perform CAS in shared memory if possible
2020-01-28 10:59:17 -05:00
Brian Barrett
d768d82231
Merge pull request #7167 from wckzhang/reachable_netlinks
Reachable documentation change
2020-01-27 15:39:14 -08:00
Brian Barrett
fc8c7a5869
Merge pull request #7134 from wckzhang/btl_tcp_interface_match
btl tcp: Use reachability and graph solving for global interface matching
2020-01-27 15:38:49 -08:00
Austen Lauria
10f6a77640
Merge pull request #7315 from abouteiller/export/tcp_errors_v2
Handle error cases in TCP BTL (v2)
2020-01-27 17:03:07 -05:00
Aurelien Bouteiller
76021e35ee
Adding a description of the FIN message for future reference.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:32:34 -05:00
Aurelien Bouteiller
395fcc4253
Disable inband PML error reporting during MPI Finalize as it interferes with the Finalize process.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:32:34 -05:00
Aurelien Bouteiller
93846fd0ee
Remove the pending event when socket is TCP_FAILED
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:32:34 -05:00
Aurelien Bouteiller
6b3be224d4
Adding a FIN message to differentiate normal TCP closing from failures
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:32:34 -05:00
Aurelien Bouteiller
b7be64482a
Revert "Revert "Handle error cases in TCP BTL""
This reverts commit 51620114281720c9e375b95cadc45272b3135837.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:31:06 -05:00
Austen Lauria
969eb0286c
Merge pull request #6970 from ggouaillardet/topic/cuda_no_orte
coll/cuda: remove unnecessary references to ORTE
2020-01-27 13:12:32 -05:00
Austen Lauria
4f6978466d
Merge pull request #7284 from bosilca/fix/monitoring_registration
Minor cleanup in the monitoring PML.
2020-01-27 13:01:30 -05:00
Josh Hursey
01a671349b
Merge pull request #7337 from jjhursey/no-ssh-core
plm/rsh: Fix segv on missing agent.
2020-01-27 10:32:41 -06:00
Joshua Hursey
62d0058738 plm/rsh: Fix segv on missing agent.
* Additionally, fixes the `NULL` option to `OMPI_MCA_plm_rsh_agent`,
   which would also lead to a segv. Now it operates as intended by
   disqualifying the `rsh` component and falling back onto the `isolated`
   component.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-01-24 14:02:53 -05:00
Howard Pritchard
e9a54e8e0e
Merge pull request #7268 from hppritcha/topic/fix_6539_master
make mpifort obey disable-wrapper-runpath
2020-01-23 16:37:37 -07:00
Artem Polyakov
548ed56bef
Merge pull request #7332 from hoopoepg/topic/fixed-coverity-issues
SPML/UCX: fixed coverity issues
2020-01-23 11:54:00 -08:00
Jeff Squyres
f16d2903aa
Merge pull request #7328 from DDNStorage/ime_fs_missing_header
fs/ime: fix compilation errors due to missing header inclusion
2020-01-23 14:13:59 -05:00
Sergey Oblomov
8543860689 SPML/UCX: fixed coverity issues
- fixed sizeof(char***) by variable datatype
- fixed resource leak in proc_add

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2020-01-23 14:02:08 +02:00
Artem Polyakov
e04575e8b6
Merge pull request #7274 from janjust/master-oshmem-perf-multi-worker
oshmem/ucx: multi-threaded spml ucx performance improvements
2020-01-22 14:24:54 -08:00
Tomislav Janjusic
3d6bf9fd8e oshmem/ucx: improves spml ucx performance for multi-threaded
applications.

Improves multi-threaded performance by adding the option to create
multiple ucx workers in threaded applications.

Co-authored with:
Artem Y. Polyakov <artemp@mellanox.com>,
Manjunath Gorentla Venkata <manjunath@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-01-22 21:41:09 +02:00
Artem Polyakov
611a2bdf68
Merge pull request #7273 from janjust/master-oshmem-perf-progress
oshmem/ucx: Fix progress in iput/iget: periodically poke progress to prevent hardware stalls when using DCT transport.
2020-01-22 11:39:57 -08:00
Sylvain Didelot
01ae0c22b8 fs/ime: fix compilation errors due to missing header inclusion
OpenMPI doesn't compile anymore with IME because the header
file "ompi/mca/fs/base/base.h" needs to be included in every
file where mca_fs_base_get_mpi_err() is used.

Signed-off-by: Sylvain Didelot <sdidelot@ddn.com>
2020-01-22 17:56:55 +01:00
Tomislav Janjusic
1b58e3d073 oshmem/ucx: Improves performance for non-blocking put/get operations.
Improves the performance when excess non-blocking operations are posted
by periodically calling progress on ucx workers.

Co-authored with:
Artem Y. Polyakov <artemp@mellanox.com>,
Manjunath Gorentla Venkata <manjunath@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-01-22 00:59:26 +02:00
Jeff Squyres
d14d8ad30d
Merge pull request #7326 from jsquyres/pr/make-opal-output-safe-before-opal-init
opal: ensure opal_gethostname() always returns a value
2020-01-21 16:14:14 -05:00
Jeff Squyres
f84a692af0
Merge pull request #7324 from jsquyres/pr/advance-hwloc-submodule
hwloc2: advance hwloc git submodule
2020-01-21 14:03:08 -05:00
William Zhang
e958f3cf22 btl tcp: Use reachability and graph solving for global interface matching
Previously we used a fairly simple algorithm in
mca_btl_tcp_proc_insert() to pair local and remote modules. This was a
point in time solution rather than a global optimization problem (where
global means all modules between two peers). The selection logic would
often fail due to pairing interfaces that are not routable for traffic.
The complexity of the selection logic was Θ(n^n), which was expensive.
Due to poor scalability, this logic was only used when the number of
interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More
details can be found in this ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates
in the ticket do not match what I calculated from the function)
As a fallback, when interfaces surpassed this threshold, a brute force
O(n^2) double for loop was used to match interfaces.

This commit solves two problems. First, the point-in-time solution is
turned into a global optimization solution. Second, the reachability
framework was used to create a more realistic reachability map. We
switched from using IP/netmask to using the reachability framework,
which supports route lookup. This will help many corner cases as well as
utilize any future development of the reachability framework.

The solution implemented in this commit has a complexity mainly derived
from the bipartite assignment solver. If the local and remote peer both
have the same number of interfaces (n), the complexity of matching will
be O(n^5).

With the decrease in complexity to O(n^5), I calculated and tested
that initialization costs would be 5000 microseconds with 30 interfaces
per node (Likely close to the maximum realistic number of interfaces we
will encounter). For additional data points, up to 300 interfaces (a very
unrealistic number) were simulated. Up to 150 interfaces, the matching
cost stays below 1 second, climbing to 10 seconds with 300 interfaces.
Reflecting on these results, I removed
the suboptimal O(n^2) fallback logic, as it no longer seems necessary.

Data was gathered comparing the scaling of initialization costs with
ranks. For a low number of interfaces, the impact of initialization is
negligible. At an interface count of 7-8, the new code has slightly
faster initialization costs. At an interface count of 15, the new code
has slower initialization costs. However, all initialization costs
scale linearly with the number of ranks.

In order to use the reachable function, we populate local and remote
lists of interfaces. We then convert the interface matching problem
into a graph problem. We create a bipartite graph with the local and
remote interfaces as vertices and use negative reachability weights as
costs. Using the bipartite assignment solver, we generate the matches
for the graph. To ensure that both the local and remote process have
the same output, we ensure we mirror their respective inputs for the
graphs. Finally, we store the endpoint matches that we created earlier
in a hash table. This is stored with the btl_index as the key and a
struct mca_btl_tcp_addr_t* as the value. This is then retrieved during
insertion time to set the endpoint address.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-01-21 18:24:08 +00:00
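
A minimal sketch of the matching step described in commit e958f3cf22 above, written in Python for brevity. It only illustrates the formulation: local and remote interfaces form the two sides of a bipartite graph, and negated reachability weights serve as costs so that a minimum-cost assignment maximizes total reachability. The function name `match_interfaces`, the weight matrix layout, and the convention that a weight of 0 means "unreachable" are assumptions made for this illustration. The assignment here is found by exhaustive search over permutations, which stands in for the bipartite assignment solver mentioned in the commit and is only practical for a handful of interfaces.

```python
from itertools import permutations


def match_interfaces(weights):
    """Pair local and remote interfaces so total reachability is maximal.

    weights[i][j] is a hypothetical reachability weight between local
    interface i and remote interface j (higher is better, 0 = unreachable).
    Returns a list of (local_index, remote_index) pairs.
    """
    n_local = len(weights)
    n_remote = len(weights[0]) if weights else 0

    # Negate the weights: minimizing cost then maximizes reachability.
    cost = [[-w for w in row] for row in weights]

    best_pairs, best_cost = [], float("inf")
    if n_local <= n_remote:
        # Assign every local interface to a distinct remote interface.
        for perm in permutations(range(n_remote), n_local):
            total = sum(cost[i][perm[i]] for i in range(n_local))
            if total < best_cost:
                best_cost = total
                best_pairs = [(i, perm[i]) for i in range(n_local)]
    else:
        # More local than remote interfaces: assign each remote one instead.
        for perm in permutations(range(n_local), n_remote):
            total = sum(cost[perm[j]][j] for j in range(n_remote))
            if total < best_cost:
                best_cost = total
                best_pairs = sorted((perm[j], j) for j in range(n_remote))

    # Drop pairs that are not reachable at all.
    return [(i, j) for i, j in best_pairs if weights[i][j] > 0]


if __name__ == "__main__":
    # A greedy choice (local 0 -> remote 0, weight 10) would leave local 1
    # with a weight-1 link; the global optimum pairs them crosswise.
    print(match_interfaces([[10, 9], [9, 1]]))
```

The example input shows why a global solution matters: picking the single best link first yields a total weight of 11, whereas the crosswise pairing reaches 18.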
Jeff Squyres
8c819e2a85 opal: ensure opal_gethostname() always returns a value
We initially thought it was a safe bet that opal_gethostname() would
never be called before opal_init().  However, it turns out that there
are some cases -- e.g., developer debugging -- where it is useful to
call opal_output() (which calls opal_gethostname()) before
opal_init().

Hence, we need to guarantee that opal_gethostname() always returns a
valid value.  If opal_gethostname() finds NULL in
opal_process_info.nodename, simply call the internal function to
initialize opal_process_info.nodename.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-01-21 09:56:52 -08:00
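
The lazy-initialization pattern described in commit 8c819e2a85 above can be sketched as follows. This is a Python illustration of the idea, not the Open MPI C code: the module-level `_nodename` stands in for `opal_process_info.nodename`, the helper names are hypothetical, and the `"localhost"` fallback is an assumption added for the example.

```python
import socket

# Stands in for opal_process_info.nodename; None means "not yet initialized".
_nodename = None


def _init_hostname():
    """Populate the cached hostname (hypothetical internal initializer)."""
    global _nodename
    # Assumption for this sketch: fall back to "localhost" if the system
    # reports an empty hostname.
    _nodename = socket.gethostname() or "localhost"


def get_hostname():
    """Always return a usable hostname, even before explicit initialization."""
    if _nodename is None:
        _init_hostname()
    return _nodename
```

Callers such as an early logging routine can call `get_hostname()` at any time; the first call pays the initialization cost and later calls return the cached value.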
Jeff Squyres
ed753afbc0 hwloc2: advance hwloc git submodule
Advance to hwloc-2.1.0rc2-33-g38433c0f, which includes a .gitignore
update that we want here in Open MPI.

Be warned; this is actually 33 commits beyond the hwloc v2.1.0 tag.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-01-21 09:42:39 -08:00
Ralph Castain
1d0b87e170
Merge pull request #7317 from rhc54/topic/agen
Check the return from "chdir" to avoid infinite loop
2020-01-18 15:04:01 -07:00
Ralph Castain
7de5f76b18
Check the return from "chdir" to avoid infinite loop
When autogen attempts to change to a new directory while processing a
subdirectory, it can get into an infinite loop if that directory
doesn't exist as it will remain in the top-level directory, see itself
there (as "autogen.pl"), and re-execute itself. Check the return code on
"chdir" and error out if it fails.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-01-18 11:35:08 -08:00
Austen Lauria
df9745a251
Merge pull request #7299 from awlauria/fix_warnings
Fix some compiler warnings.
2020-01-17 11:33:41 -05:00
Jeff Squyres
69bd54c83b
Merge pull request #7310 from itemko/artemry/azure_ci_updates_for_review
Fixed several Mellanox Open MPI CI issues.
2020-01-16 09:25:43 -05:00
Artem Ryabov
fd7e94022b Fixed several Mellanox Open MPI CI issues.
The following issues have been fixed:
- Corrected the link to CI status badge in README after some renaming.
- Updated agent capabilities.
- Switched to jenkins_scripts from master branch.
- Corrected support e-mail.

Signed-off-by: Artem Ryabov <artemry@mellanox.com>
2020-01-16 13:43:28 +03:00
Nathan Hjelm
037b0bd9ee
Merge pull request #7304 from hjelmn/btl_vader_fix_max_address_on_aarch64
btl/vader: modify how the max attachment address is determined
2020-01-15 17:02:33 -07:00
Nathan Hjelm
728d51f9f3 btl/vader: modify how the max attachment address is determined
This PR removes the constant defining the max attachment address and
replaces it with the largest address that shows up in /proc/self/maps.
This should address issues found on AARCH64 where the max address
may differ based on the configuration.

Since the calculated max address may differ between processes the
max address is sent as part of the modex and stored in the endpoint
data.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2020-01-14 15:15:36 -08:00
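
As a rough illustration of the approach in commit 728d51f9f3 above, the following Python sketch scans lines in the `/proc/self/maps` format and returns the largest address that appears. The function name `max_mapped_address` is hypothetical, and the sketch makes no claim about which mappings the real component includes or excludes; it simply takes the highest end address of any listed range.

```python
def max_mapped_address(maps_lines):
    """Return the largest end address found in /proc/self/maps-style lines.

    Each line starts with a range such as "7ffc0a5e8000-7ffc0a609000",
    where both bounds are hexadecimal and the end bound is exclusive.
    """
    highest = 0
    for line in maps_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        _start, sep, end = fields[0].partition("-")
        if not sep:
            continue  # not an address range; ignore
        highest = max(highest, int(end, 16))
    return highest
```

On Linux the function could be fed the live mapping with `open("/proc/self/maps")`. Because the result can differ between processes, each process would compute its own value and share it with its peers, as the commit message describes.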
Nathan Hjelm
61f96b3d6d
Merge pull request #7283 from hjelmn/fix_issues_in_both_vader_and_opal_interval_tree_t_that_were_causing_issue_6524
Fix issues in both vader and opal_interval_tree_t that were causing issue 6524
2020-01-14 14:54:51 -07:00
Jeff Squyres
b3fe523150
Merge pull request #7124 from devreal/fix-opal-align-min
Fix OPAL_ALIGN_MIN to work on 32-bit systems
2020-01-14 15:18:55 -05:00
Jeff Squyres
02f02e7c1e
Merge pull request #7300 from itemko/artemry/mellanox-ci-with-azure-pipelines-review
Reworked Mellanox Open MPI CI with Azure Pipelines
2020-01-14 14:11:34 -05:00
Artem Ryabov
98bfe87ee5 Reworked Mellanox OpenMPI CI with Azure Pipelines.
Signed-off-by: Artem Ryabov <artemry@mellanox.com>
2020-01-14 20:20:51 +03:00
Jeff Squyres
25931ea8bf
Merge pull request #7200 from cpshereda/master-opal_gethostname-change
Fix unsafe use of gethostname()
2020-01-13 16:07:43 -05:00
Jeff Squyres
887400c878
Merge pull request #7174 from jsquyres/pr/ofi-mtl-fi-version-bump
mtl/ofi: increase the FI_VERSION requested to 1.5 and make sure to check for FI_LOCAL_COMM
2020-01-13 13:22:22 -05:00
Charles Shereda
cbc6feaab2 Created opal_gethostname() as safer gethostname substitute.
The opal_gethostname() function provides a more robust mechanism
to retrieve the hostname than gethostname(), which can return
results that are not null-terminated, and which can vary in its
behavior from system to system.

opal_gethostname() just returns the value in opal_process_info.nodename;
this is populated in opal_init_gethostname() inside opal_init.c.

-Changed all gethostname calls in opal subtree to opal_gethostname
-Changed all gethostname calls in orte subtree to opal_gethostname
-Changed all gethostname calls in ompi subdir to opal_gethostname
-Changed all gethostname calls in oshmem subdir to opal_gethostname
-Changed opal_if.c in test subdir to use opal_gethostname
-Changed opal_init.c to include opal_init_gethostname. This function
 returns an int and directly sets opal_process_info.nodename per
 jsquyres' modifications.

Relates to open-mpi#6801

Signed-off-by: Charles Shereda <cpshereda@lanl.gov>
2020-01-13 08:52:17 -08:00
Robert Wespetal
49128a7adb mtl/ofi: Add workaround for EFA local/remote capabilities bug
Some versions of Libfabric contain a bug in EFA where FI_REMOTE_COMM and
FI_LOCAL_COMM are not advertised. In order to work around this, we need to call
fi_getinfo() without those capability bits to see if EFA is available first.

Also move around some of the provider include/exclude list logic so we can skip
this workaround if applicable.

Signed-off-by: Robert Wespetal <wesper@amazon.com>
2020-01-13 08:26:01 -08:00
Jeff Squyres
21bc9042e1 mtl/ofi: check for FI_LOCAL_COMM+FI_REMOTE_COMM
Make sure to get an RDM provider that can provide both local and
remote communication.  We need this check because some providers could
be selected via RXD or RXM, but can't provide local communication, for
example.

Add OPAL_CHECK_OFI_VERSION_GE() m4 macro to check that the Libfabric
we're building against is >= a target version.  Use this check in two
places:

1. MTL/OFI: Make sure it is >= v1.5, because the FI_LOCAL_COMM /
   FI_REMOTE_COMM constants were introduced in Libfabric API v1.5.
2. BTL/usnic: It already had similar configury to check for Libfabric
   >= v1.1, but the usnic component was checking for >= v1.3.  So
   update the btl/usnic configury to use the new macro and check for
   >= v1.3.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-01-13 08:19:53 -08:00
George Bosilca
05093f9cb1
Minor cleanup in the monitoring PML.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2020-01-13 09:24:00 -05:00
Austen Lauria
b65ec27307 Fix some compiler warnings.
Silence unused variables, incompatible pointer types,
un-initialized variables, and signed/unsigned comparisons.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-01-10 13:10:53 -05:00
Jeff Squyres
dd2d7d2866
Merge pull request #7189 from michaellass/fix-dims_create
dims_create: fix calculation of factors for odd squares
2020-01-10 09:47:09 -05:00
Jeff Squyres
c471f1cb0b
Merge pull request #7286 from jsquyres/pr/add-git-submodule-status-to-github-issue-template
Github issue template: updates
2020-01-08 12:53:57 -05:00
Jeff Squyres
a26a6c8e42 Github issue template: updates
1. Add more recent release version numbers in the examples
2. Add request for output from `git submodule status` when
   building/installing from a git clone

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2020-01-08 11:47:35 -05:00
Brian Barrett
f853971bc1
Merge pull request #6821 from jsquyres/pr/make-hwloc201-tarball-a-submodule
hwloc v2.1.0: use a git submodule
2020-01-08 07:41:37 -08:00
bosilca
2d659323b9
Merge pull request #7233 from bosilca/fix/unused_fortran_protected
Remove unused variable.
2020-01-08 08:04:06 -05:00