1
1
Граф коммитов

160 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
a210f8046f
Cleanup ompi/dpm operations
Do some code cleanup in the connect/accept code. Ensure that the OMPI
layer has access to the PMIx identifier for the process. Add macros for
converting PMIx names to/from strings. Cleanup a few of the simple test
programs. Add a little more info to a btl/tcp error message.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-08 08:37:25 -07:00
Howard Pritchard
f136a20cae
Merge pull request #6578 from hppritcha/topic/thread_framework2
Implement a MCA framework for threads
2020-03-27 15:55:48 -06:00
Noah Evans
ee3517427e Add threads framework
Add a framework to support different types of threading models including
user space thread packages such as Qthreads and argobot:

https://github.com/pmodels/argobots

https://github.com/Qthreads/qthreads

The default threading model is pthreads.  Alternate thread models are
specificed at configure time using the --with-threads=X option.

The framework is static.  The theading model to use is selected at
Open MPI configure/build time.

mca/threads: implement Argobots threading layer

config: fix thread configury

- Add double quotations
- Change Argobot to Argobots
config: implement Argobots check

If the poll time is too long, MPI hangs.

This quick fix just sets it to 0, but it is not good for the
Pthreads version. Need to find a good way to abstract it.

Note that even 1 (= 1 millisecond) causes disastrous performance
degradation.

rework threads MCA framework configury

It now works more like the ompi/mca/rte configury,
modulo some edge items that are special for threading package
linking, etc.

qthreads module
some argobots cleanup

Signed-off-by: Noah Evans <noah.evans@gmail.com>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-03-27 10:15:45 -06:00
Ralph Castain
33ab928e1b ompi_proc_t size reduction: part 1
We currently save the hostname of a proc when we create the ompi_proc_t for it. This was originally done because the only method we had for discovering the host of a proc was to include that info in the modex, and we had to therefore store it somewhere proc-local. Obviously, this ccarried a memory penalty for storing all those strings, and so we added a "cutoff" parameter so that we wouldn't collect hostnames above a certain number of procs.

Unfortunately, this still results in an 8-byte/proc memory cost as we have a char* pointer in the opal_proc_t that is contained in the ompi_proc_t so that we can store the hostname of the other procs if we fall below the cutoff. At scale, this can consume a fair amount of memory.

With the switch to relying on PMIx, there is no longer a need to cache the proc hostnames. Using the "optional" feature of PMIx_Get, we restrict the retrieval to be purely proc-local - i.e., we retrieve the info either via shared memory or from within the proc-internal hash storage (depending upon the active PMIx components). Thus, the retrieval of a hostname is purely a local operation involving no communication.

All RM's are required to provide a complete hostname map of all procs at startup. Thus, we have full access to all hostnames without including them in a modex or having to cache them on each proc. This allows us to remove the char* pointer from the opal_proc_t, saving us 8-bytes/proc.

Unfortunately, PMIx_Get does not currently support the return of a static pointer to memory. Thus, even though PMIx has the hostname in its memory, it can only return a malloc'd version of it. I have therefore ensured that the return from opal_get_proc_hostname is consistently malloc'd and free'd wherever used. This shouldn't be a burden as the hostname is only used in one of two circumstances:

(a) in an error message
(b) in a verbose output for debugging purposes

Thus, there should be no performance penalty associated with the malloc/free requirement. PMIx will eventually be returning static pointers, and so we can eventually simplify this method and return a "const char*" - but as noted, this really isn't an issue even today.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 12:49:44 -07:00
Gilles Gouaillardet
174e967dbc
Remove ORTE project
Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.

remove orte: misc fixes

 - UCX fixes
 - VPATH issue
 - oshmem fixes
 - remove useless definition
 - Add PRRTE submodule
 - Get autogen.pl to traverse PRRTE submodule
 - Remove stale orcm reference
 - Configure embedded PRRTE
 - Correctly pass the prefix to PRRTE
 - Correctly set the OMPI_WANT_PRRTE am_conditional
 - Move prrte configuration to the end of OMPI's configure.ac
 - Make mpirun a symlink to prun, when available
 - Fix makedist with --no-orte/--no-prrte option
 - Add a `--no-prrte` option which is the same as the legacy
   `--no-orte` option.
 - Remove embedded PMIx tarball. Replace it with new submodule
   pointing to OpenPMIx master repo's master branch
 - Some cleanup in PRRTE integration and add config summary entry
 - Correctly set the hostname
 - Fix locality
 - Fix singleton operations
 - Fix support for "tune" and "am" options

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-02-07 18:20:06 -08:00
Austen Lauria
824dbcbcf3 Protect use of _Static_assert().
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-02-04 13:46:58 -05:00
Aurelien Bouteiller
9f4365fef6
Merge pull request #6783 from abouteiller/export/macos-epipe
Prevent EPIPE on OSX.
2020-01-28 11:18:46 -05:00
Brian Barrett
fc8c7a5869
Merge pull request #7134 from wckzhang/btl_tcp_interface_match
btl tcp: Use reachability and graph solving for global interface matching
2020-01-27 15:38:49 -08:00
Aurelien Bouteiller
76021e35ee
Adding a description of the FIN message for future reference.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:32:34 -05:00
Aurelien Bouteiller
93846fd0ee
Remove the pending event when socket is TCP_FAILED
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:32:34 -05:00
Aurelien Bouteiller
6b3be224d4
Adding a FIN message to differentiate normal TCP closing from failures
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:32:34 -05:00
Aurelien Bouteiller
b7be64482a
Revert "Revert "Handle error cases in TCP BTL""
This reverts commit 5162011428.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-01-27 13:31:06 -05:00
William Zhang
e958f3cf22 btl tcp: Use reachability and graph solving for global interface matching
Previously we used a fairly simple algorithm in
mca_btl_tcp_proc_insert() to pair local and remote modules. This was a
point in time solution rather than a global optimization problem (where
global means all modules between two peers). The selection logic would
often fail due to pairing interfaces that are not routable for traffic.
The complexity of the selection logic was Θ(n^n), which was expensive.
Due to poor scalability, this logic was only used when the number of
interfaces was less than MAX_PERMUTATION_INTERFACES (default 8). More
details can be found in this ticket:
https://svn.open-mpi.org/trac/ompi/ticket/2031 (The complexity estimates
in the ticket do not match what I calculated from the function)
As a fallback, when interfaces surpassed this threshold, a brute force
O(n^2) double for loop was used to match interfaces.

This commit solves two problems. First, the point-in-time solution is
turned into a global optimization solution. Second, the reachability
framework was used to create a more realistic reachability map. We
switched from using IP/netmask to using the reachability framework,
which supports route lookup. This will help many corner cases as well as
utilize any future development of the reachability framework.

The solution implemented in this commit has a complexity mainly derived
from the bipartite assignment solver. If the local and remote peer both
have the same number of interfaces (n), the complexity of matching will
be O(n^5).

With the decrease in complexity to O(n^5), I calculated and tested
that initialization costs would be 5000 microseconds with 30 interfaces
per node (Likely close to the maximum realistic number of interfaces we
will encounter). For additional datapoints, data up to 300 (a very
unrealistic number) of interfaces was simulated. Up until 150
interfaces, the matching costs will be less than 1 second, climbing to
10 seconds with 300 interfaces. Reflecting on these results, I removed
the suboptimal O(n^2) fallback logic, as it no longer seems necessary.

Data was gathered comparing the scaling of initialization costs with
ranks. For low number of interfaces, the impact of initialization is
negligible. At an interface count of 7-8, the new code has slightly
faster initialization costs. At an interface count of 15, the new code
has slower initialization costs. However, all initialization costs
scale linearly with the number of ranks.

In order to use the reachable function, we populate local and remote
lists of interfaces. We then convert the interface matching problem
into a graph problem. We create a bipartite graph with the local and
remote interfaces as vertices and use negative reachability weights as
costs. Using the bipartite assignment solver, we generate the matches
for the graph. To ensure that both the local and remote process have
the same output, we ensure we mirror their respective inputs for the
graphs. Finally, we store the endpoint matches that we created earlier
in a hash table. This is stored with the btl_index as the key and a
struct mca_btl_tcp_addr_t* as the value. This is then retrieved during
insertion time to set the endpoint address.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2020-01-21 18:24:08 +00:00
George Bosilca
476562752f
Correctly report TCP connect errors.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-10-31 18:33:15 -04:00
Stanislav Kirillov
0e0763e006
fix ipv6 btl connection bug
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2019-10-10 11:20:37 +00:00
Austen Lauria
0d4004cc3c Fix miscellaneous compiler warnings.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2019-10-01 16:27:25 -04:00
William Zhang
4ebb37a26c opal/util: Change opal/util/if.h macro IF_NAMESIZE to OPAL_IF_NAMESIZE
Due to IF_NAMESIZE being a reused and conditionally defined macro,
issues could arise from macro mismatches. In particular, in cases where
opal/util/if.h is included, but net/if.h is not, IF_NAMESIZE will be 32.
If net/if.h is included on Linux systems, IF_NAMESIZE will be 16. This
can cause a mismatch when using the same macro on a system. Thus
different parts of the code can have differring ideas on the size of a
structure containing a char name[IF_NAMESIZE]. To avoid this error case,
we avoid reusing the IF_NAMESIZE macro and instead define our own as
OPAL_IF_NAMESIZE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2019-07-29 21:24:39 +00:00
William Zhang
8c3b8a87c5 btl tcp: Fix error path memory leak
After the OPAL_MODEX_RECV call, remote_addrs was not freed in the error
path. Moved the free call into cleanup to ensure we always free this
memory before leaving the function.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2019-07-15 22:35:04 +00:00
George Bosilca
4620c351ea
Prevent EPIPE on OSX.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-06-28 15:32:59 -04:00
bosilca
b54fdf5dd9
Merge pull request #6541 from bwbarrett/bugfix/enotconn
btl/tcp: Skip printing error message in racy cleanup path
2019-03-28 22:42:52 -04:00
Brian Barrett
d5360711fa btl/tcp: Skip printing error message in racy cleanup path
Avoid printing an error message about ENOTCONN return codes from
getpeername() when handling an incoming connection request.  At
this point in the receive state machine, the remote process has
been verified to be a valid OMPI instance.  In all-to-all startup
at 4k rank scale, we're seeing this error message when the remote
side drops the connection because it realizes it's the "loser"
in the connection race.  We were already doing all the right things,
other than printing a scary error message.  So skip the error
message and call it good.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2019-03-28 23:12:35 +00:00
Dmitry Gladkov
9920da4992 btl/tcp: Fix copy-paste misprint
Signed-off-by: Dmitry Gladkov <dmitrygla@mellanox.com>
2019-02-20 11:18:02 +02:00
Brian Barrett
a1e85b03aa btl tcp: Fix compile error in IPv6
In 457f058 I broke the TCP BTL with --enable-ipv6.  This patch
fixes the compile error, so IPv6 works again.

Fixed #5996

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-30 12:31:04 -07:00
Brian Barrett
457f058e73 btl tcp: Simplify modex address selection
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code.  Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure.  This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.

Refs #5818

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-18 02:23:21 +00:00
Brian Barrett
4f19221af2 btl tcp: Simplify module address storage
Today, a btl tcp module is associated with exactly one IP
address (IPv4 or IPv6).  There's no need to reserve space
for both an IPv4 and IPv6 address in the module structure,
since the module will only be associated with one or the
other.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-17 16:21:17 +00:00
Brian Barrett
2acc4b7e7f btl tcp: Add workaround for "dropped connection" issue
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads.  The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.

This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT.  It also
reduces the scalability of the TCP BTL by increasing start-up
time, but better than hanging.

The long term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race.  THis patch is a work around until that patch can
be developed.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-16 18:33:30 -07:00
Brian Barrett
da1189d771
Merge pull request #5916 from bwbarrett/revert/6acebc4
Revert "Handle error cases in TCP BTL"
2018-10-16 13:54:18 -07:00
Jeff Squyres
cef6cf0ac5 opal: update some string handling
Make various string handling locations a little more robust.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:04:28 -07:00
Brian Barrett
5162011428 Revert "Handle error cases in TCP BTL"
This reverts commit 6acebc40a1.

This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run.  See
https://github.com/open-mpi/ompi/issues/5849 for more information.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 15:01:54 -07:00
Brian Barrett
902b57919c btl tcp: Print number of endpoints on error
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Brian Barrett
bd07cc6707 btl tcp: Improve initialization debugging
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.

Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.

Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
George Bosilca
a3a492b42c
Small pedantic fixes.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:08:18 -04:00
George Bosilca
9164e26e2f
Provide the correct socklen to bind.
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.

Add a more clear error message in debug mode.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:06:40 -04:00
Jeff Squyres
5dae086f7e btl/tcp: output the IP address correctly
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.

1. On the endpoint, it is storing the moral equivalent of a
   (struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).

The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr).  Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).

This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.

NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module.  This only happens with interfaces that have
more than one IP address.  This commit does not fix that issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 16:12:57 -07:00
bosilca
464e1abbab
Merge pull request #5700 from ICLDisco/export/tcp_errors
Handle error cases in TCP BTL
2018-09-19 09:44:38 -04:00
Michael Kuron
17b0f1fcc3 Deal with EOPNOTSUPP returned from getsockopt()
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.

Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
2018-09-16 14:55:28 +02:00
Jeff Squyres
fe0852bcb4 Miscellaneous compiler warning stomps.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-24 07:39:14 -07:00
Aurelien Bouteiller
6acebc40a1
Handle error cases in TCP BTL
When an error is returned by the socket operations, trigger the
appropriate error path in the PML to give an opportunity for
rerouting/error handling.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-08-14 15:35:24 -04:00
Jeff Squyres
7b0dd03e92 tcp/btl: fix a cast
The current cast is *functional*, but isn't really the way it should
be done.  This commit makes the cast the way it should be done.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-29 07:25:46 -07:00
Jeff Squyres
57bc657e7f btl/tcp: fix hash map usage
Fix two facepalms:

1. The "uint32" in the hash map functions refer to the *key* size, not
   the *value* size.  The values are always 64 bits.
2. Pass the straight value to the "set" functions -- not the pointer
   to the value.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-28 15:29:41 -07:00
Ralph Castain
0ddbc75ce5
Merge pull request #4930 from kizill/fix-ipv6
fixed ipv6 OOB connection problems (fix issue #1585)
2018-06-26 09:13:53 -07:00
Jeff Squyres
3767ce27c0 btl/tcp: trivial whitespace clean
No code/logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-23 08:04:12 -07:00
Jeff Squyres
9034717876 btl/tcp: use a hash map for kernel IP interface indexes
The giant size of the TCP proc struct is causing a problem in some
environments (because it is allocated on the stack), and it was too
big, anyway.

Instead, use a hash map.  That way, it starts small and can grow if it
needs to.  It also makes no assumptions about the values of the kernel
interface indexes.

Fixes #5292.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-23 08:03:30 -07:00
George Bosilca
6ff11267fb
Remove warnings identified by clang.
Plus minor spacing and indentation issues.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-04-14 17:14:12 -04:00
Jeff Squyres
f200b866df btl/tcp: roll back parts of 40afd525f8
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior).  Roll
back these messages to be opal_output_verbose() kinds of messages.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-07 12:28:10 -07:00
Jeff Squyres
8c419294a8 btl/tcp: fix CID 710596
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long.  Don't copy beyond the end of that source buffer.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:22 -07:00
Jeff Squyres
a17f4afdc7 btl/tcp: fix CID 1416634
Fix resource leak in the TCP BTL.  Also add a little defensive programming.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-26 14:21:21 -07:00
Jeff Squyres
40afd525f8 btl/tcp: make error messages more specific
Convert some verbose messages to opal_show_help() messages.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-21 19:34:03 -07:00
Stanislav Kirillov
c2bfca19ba
fix ipv6 missing copypaste
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2018-03-21 02:25:04 +03:00