1
1

462 Коммитов

Автор SHA1 Сообщение Дата
Orivej Desh
78b7e342bd Fix oob_tcp tcp_component_close segfault with active listeners
oob_tcp in non-HNP mode shares libevent event_base with oob_base [1].
orte_oob_base_close calls:
(1) oob_tcp component_shutdown, then
(2) opal_progress_thread_finalize, then
(3) oob_tcp tcp_component_close [2].
opal_progress_thread_finalize calls tracker_destructor [3] that frees the
event_base [4]. If any oob_tcp event listeners are active at this time, oob_tcp
will crash trying to delete them at [5] [6].

This change moves oob_tcp event listener cleanup from component_close to
component_shutdown so that it happens before the event_base is freed.

[1] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L160
[2] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/base/oob_base_frame.c#L95
[3] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L232
[4] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L65
[5] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_component.c#L192
[6] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L955

Signed-off-by: Orivej Desh <orivej@gmx.fr>
2019-07-04 20:45:47 +00:00
Ralph Castain
e56ee1e06a Remove the remaining cruft from dual oob transport
* When we moved to allowing dual rml/oob transports, we added a bunch of
stuff that is no longer needed. Remove it so as to simplify the
messaging system.

* Fix the routed/radix component so it correctly returns the parent's
vpid

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:12:31 -08:00
Ralph Castain
125d236173 Move from the use of regex to compression
We've been fighting the battle of trying to create a regex generator and
parser that can handle arbitrary hostname schemes - without long-term
success. The worst of it is that there is no way of checking to see if
the computed regex is correct short of parsing it and doing a
character-by-character comparison with the original string. Ugh...there
has to be a better solution.

One option is to investigate using 3rd-party regex libraries as
those are coming from communities whose sole focus is resolving that
problem. However, someone would need to spend the time to investigate
it, and we'd have to find a license-friendly implementation.

Another option is to quit beating our heads against the wall and just
compress the information. It won't be as much of a reduction, but we
also won't keep hitting scenarios where things break. In this case, it
seems that "perfection" is definitely the enemy of "good enough".

This PR implements the compression option while retaining the
possibility of people adding regex-generating components. The
compression code used in ORTE is consolidated into the opal/compress
framework. That framework currently held bzip and gzip components for
use in compressing checkpoint files - since we no longer support C/R, I
have .opal_ignore'd those components.

However, I have left the original framework APIs alone in case someone
ever decides to redo C/R. The APIs of interest here are added to the
framework - specifically, the "compress_block" and "decompress_block"
functions. I then moved the ORTE zlib compression code into a new
component in this framework.

Unfortunately, the framework currently is a single-select one - i.e.,
only one active component at a time. Since I .opal_ignore'd the other
two and made the priority of zlib high, this isn't a problem. However,
if someone wants to re-enable bzip/gzip or add another component, they
might need to transition opal/compress to a multi-select framework.

Included changes:

* Consolidate the compression code into the opal/compress framework

* Move the ORTE zlib compression code into a new opal/compress/zlib
  component

* Ignore the bzip and gzip components in opal/compress framework

* Add a "compress_base_limit" MCA param to set the threshold above which
  we compress data - defaults to 4096 bytes

* Delete stale brucks and rcd components from orte/grpcomm framework

* Delete the orte/regx framework

* Update the launch system to use opal/compress instead of string regex

* Provide a default module if no zlib is available

* Fix some misc multi-node issues

* Properly generate the nidmap in response to a "connection warmup"
  message so the remote daemon knows the children it needs to launch.

* Remove stale references to orte_node_regex

* opal_byte_object_t's are not OPAL objects - properly release allocated
  memory.

* Set the topology

* Currently only handling homogeneous case

* Update the compress framework files to conform

* Consolidate open/close into one "frame" file. Ensure we open/close the
  framework

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:14 -08:00
Jeff Squyres
a85bad37df orte: strncpy() -> opal_string_copy()
Fairly straightforward conversion of strncpy() calls to
opal_string_copy().

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:10:20 -07:00
Ralph H Castain
fc81d0d519 Replace asprintf with opal_asprintf
Silence the flood of warnings from ORTE

Signed-off-by: Ralph H Castain <rhc@open-mpi.org>
2018-10-06 19:32:37 +00:00
Ralph Castain
0ddbc75ce5
Merge pull request #4930 from kizill/fix-ipv6
fixed ipv6 OOB connection problems (fix issue #1585)
2018-06-26 09:13:53 -07:00
Jeff Squyres
0f8077ace6 oob/tcp: add show_help message about version mismatch
Be more explicit about version mismatch between ORTE processes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-21 20:18:28 -07:00
Stanislav Kirillov
86061fbf8d
fixed ipv6 OOB connection problems
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2018-03-20 16:07:53 +00:00
Joshua Hursey
e1d079544b mca: Dynamic components link against project lib
* Resolves #3705
 * Components should link against the project level library to better
   support `dlopen` with `RTLD_LOCAL`.
 * Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
   with the appropriate project level library:
```
MCA components in ompi/
       $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
       $(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
       $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
       $(top_builddir)/oshmem/liboshmem.la"
```

Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-08-24 11:56:16 -04:00
Ralph Castain
b225366012 Bring the ofi/rml component online by completing the wireup protocol for the daemons. Cleanup the current confusion over how connection info gets created and
passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors.

Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.

Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi
ll be beneficial at large scales. Leave it "off" by default.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-20 21:01:57 -07:00
Gilles Gouaillardet
626e94b689 oob/tcp: make mca_oob_tcp_msg_type_t an uint8_t
so no conversion is required when heterogeneous mode is enabled

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-12 10:27:45 +09:00
Ralph Castain
8979bfe71e Silence Coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-06 06:07:28 -07:00
Ralph Castain
85f8eb4c6b Stop all progress threads prior to releasing the peer objects to avoid a race condition whereby a lost connection could be reported after a peer object was freed and before the threads were stopped.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-29 15:48:18 -07:00
Ralph Castain
952726c121 Update to latest PMIx master - equivalent to 2.0rc2. Update the thread support in the opal/pmix framework to protect the framework-level structures.
This now passes the loop test, and so we believe it resolves the random hangs in finalize.

Changes in PMIx master that are included here:

* Fixed a bug in the PMIx_Get logic
* Fixed self-notification procedure
* Made pmix_output functions thread safe
* Fixed a number of thread safety issues
* Updated configury to use 'uname -n' when hostname is unavailable

Work on cleaning up the event handler thread safety problem
Rarely used functions, but protect them anyway
Fix the last part of the intercomm problem
Ensure we don't cover any PMIx calls with the framework-level lock.
Protect against NULL argv comm_spawn

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-20 09:02:15 -07:00
Ralph Castain
93cf3c7203 Update OPAL and ORTE for thread safety
(I swear, if I look this over one more time, I'll puke)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 12:30:57 -07:00
Ralph Castain
3a434d75d6 By default, use the system default snd/recv buffer sizes
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-05 09:58:05 -07:00
Gilles Gouaillardet
c38ef3d46f oob/tcp: fix short writev handling in send_msg()
Fixes open-mpi/ompi#3414

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-27 10:24:38 +09:00
Ralph Castain
95ae0d1df3 Cleanup timing macros for portability across compilers. Rename the --enable-timing configure option to be --enable-pmix-timing so it doesn't pickup external timing requests. Remove a stale function reference in PMIx so it can compile with timing enabled.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-10 12:56:38 +06:00
Artem Polyakov
79100de014 opal/timing: Remove oob tracing
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-04-07 21:16:21 +06:00
Ralph Castain
48fc339718 Create an alternative mapping method that pushes responsibility
onto the backend daemons. By default, let mpirun only pack the app_context
info and send that to the backend daemons where the mapping will
be done. This significantly reduces the computational time on mpirun as it isn't
running up/down the topology tree computing thousands of binding
locations, and it reduces the launch message to a very small number of
bytes.

When running -novm, fall back to the old way of doing things
where mpirun computes the entire map and binding, and then sends
the full info to the backend daemon.

Add a new cmd line option/mca param --fwd-mpirun-port that allows
mpirun to dynamically select a port, but then passes that back to
all the other daemons so they will use that port as a static port
for their own wireup. In this mode, we no longer "phone home" directly
to mpirun, but instead use the static port to wireup at daemon
start. We then use the routing tree to rollup the initial
launch report, and limit the number of open sockets on mpirun's node.

Update ras simulator to track the new nidmap code

Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found.

Update gadget platform file

Initialize the range count when starting a new range

Fix the no-np case in managed allocation

Ensure DVM node usage gets cleaned up after each job

Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-07 20:43:12 -08:00
Ralph Castain
83199979ba Remove the stale opal/sec framework
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-02 15:41:56 -08:00
Joshua Hursey
df0f8e95cd oob/tcp: Adjust TCP keepalive default values
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-02-17 11:02:25 -06:00
Nathan Hjelm
1c4b735f5f oob/tcp: cleanup peers before event bases
This commit fixes an error in teardown where the event bases are town
down before the peer structures are released. This causes us to call
event_del on an invalid event base. At best this makes valgrind
complain and at worst this causes aborts or segvs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-02-03 15:18:41 -07:00
Ralph Castain
b661275dba For performance, try to send the oob/tcp message a few times before dropping back into the event library
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-02-02 06:44:15 -08:00
Ralph Castain
b59ae14a2a Fix static port and partial allocation operations
Fix static port wireup by recording the TCP port mpirun is using and correctly passing the regex of hosts to the daemons. Do a better job of closing sockets on failed connection attempts. Correctly identify the remote host in the associated error message.

Fix partial allocation operations by not attempting to set #slots on nodes that were not used, and thus don't have a daemon or topology assigned to them

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-28 10:09:44 -08:00
Ralph Castain
d672fad849 Repair rsh/ssh tree spawn
Repair rsh/ssh tree spawn by unpacking and updating the nidmap in remote_spawn.

Add more specific error messages so the cause of a messaging problem is a little clearer. Remove some stale code. Ensure we stop trying to send a message after a few times.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 11:35:00 -08:00
Ralph Castain
184ccc8e91 Cleanup some code so it is clear that it is executing in an event. Ensure that peer event base is properly set on incoming connections
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-25 06:55:11 -08:00
Ralph Castain
ac4fcd3f97 Ensure that oob/base level data is always accessed in the oob/base event thread. Make debruijn the default routed component
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-22 10:33:32 -08:00
Ralph Castain
466cbd4d29 Rework the threading in oob/tcp so that daemons (including mpirun) use multiple progress threads to get messages out to their children, and so that the oob/base uses a separate one to setup sends. This allows the daemon cmd processor to execute in parallel with relay of messages, which significantly reduces launch times at scale
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-21 13:26:19 -08:00
Ralph Castain
668421b6ec Compress the xcast message if bigger than a defined size to further improve launch performance at scale
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-19 22:08:02 -08:00
Ralph Castain
e5f687f896 Speed-up the OOB/TCP communications by using writev instead of writing the header, and then separately write the body
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-19 13:03:44 -08:00
Gilles Gouaillardet
24c61b0625 oob/tcp: plug a memory leak in mca_oob_tcp_component_lost_connection()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 11:35:59 +09:00
Ralph Castain
dd491db21f Fix IOF when outputing to files - the remote orteds were failing to output stdout/err from their procs.
Silence a warning in orted_submit

Protect against a free'd value in an error path when forming oob tcp connections

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-01 14:12:47 -08:00
Artem Polyakov
58300afff2 orte/oob/tcp: Plug the memory leak.
Plug coverity defect CID 1396541.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2016-12-01 06:48:25 +07:00
Ralph Castain
30ff8be9c9 Silence minor warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 08:33:22 -08:00
Artem Polyakov
ada93e0c02 orte/oob/tcp: Fix message dropping in case of concurrent connection.
The problem was observed for direct modex used with recursive doubling
algorithm (used for collective ID calculation prior to d52a2d081e9598a9ac9a50fb4b013a6d2a72375b)
that has pairwise nature and counter-connections are highly likely.

The following scenario was uncovering the issue:
* ranks `x` and `y` want to communicate with each other, `x` < `y`;
* rank `x` initiates the connection and sends the ack;
* rank `y` starts to `connect()` and gets the ack from `x`;
* `y` identifies that it already started connecting and `y` > `x` so it rejects incoming connection.
* `x` sees that his connection was rejected in `mca_oob_tcp_peer_recv_connect_ack()` when trying to
read the message header using `tcp_peer_recv_blocking()` which calls `mca_oob_tcp_peer_close()`
that effectively flushes all the messages in the peer->send_queue.
* `y` send the ack to `x` and the connection is established, however all the messages for the peer
at `x` are vanished (except the front one in peer->send_msg).

This commit introduces a "nack" function that will be used at `y` side to tell `x` that `y` has the
priority and `x`'s connection should be closed. This allows to avoid "guessing" on the unexpectedly
closed connection.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2016-11-27 04:58:34 +07:00
Ralph Castain
188880be3f Since static ports are only used by ORTE if the runtime option is given,
there is no need for a configure option as well - so remove the
--enable-orte-static-ports configure option. When decoding the daemon
nidmap, mark new daemons as ALIVE by default - we will discover dead
ones as we go.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-04 05:01:42 -07:00
Gilles Gouaillardet
da0c873e14 oob/tcp: enhance debugging output
display the hop node used to send a message
(if the message is sent directly, then the hop is the destination)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-11-04 14:16:06 +09:00
Gilles Gouaillardet
30298cc83c oob/tcp: remove debug that should have never been commited
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-31 16:41:14 +09:00
Gilles Gouaillardet
75e96004a4 oob/tcp: fix a typo in mca_oob_tcp_component_no_route()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-31 16:30:24 +09:00
Gilles Gouaillardet
3d4285b04d oob/tcp: silence valgrind warning
fully initialize allocated memory to keep valgrind happy

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-27 17:12:46 +09:00
Ralph Castain
649301a3a2 Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Ralph Castain
a2919174d0 Bring the RML modifications across. This is the first step in a revamp of the ORTE messaging subsystem to support fabric-based communications during launch and wireup phases. When completed, the grpcomm and plm frameworks will each have their own "conduit" for communication - each conduit corresponds to a particular RML messaging transport. This can be the active OOB-based component, or a provider from within the RML/OFI component. Messages sent down the conduit will flow across the associated transport.
Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted.

While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress
2016-10-11 16:01:02 -07:00
Gilles Gouaillardet
c92e9a5406 use the new OPAL_HASH_TABLE_FOREACH convenience macro 2016-10-08 16:58:20 +09:00
Ralph Castain
de7b1494d9 Clean out old cruft from the ORCM project 2016-09-21 00:13:30 -07:00
Gilles Gouaillardet
e84b35217f oob/tcp: plug a memory leak
as reported by Coverity with CID 1196711
2016-09-08 18:50:18 +09:00
Ralph Castain
f85dcaee2a Fixes CID 1369067 and CID 1196684
Fixes CID 1369648

    Fixes CID 1372409
2016-09-06 08:43:15 -07:00
Ralph Castain
a4c8e8c28a Cleanup the proposed change:
* qos framework is moving to the scon layer and is no longer required in ORTE

* remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought

* no need for separating the "base" from "API" module definition. The two are identical

* move the "stub" functions into their own file for cleanliness

* general cleanup to meet coding standards

* cleanup some logic in the stubs
2016-03-10 13:14:17 -08:00
Ralph Castain
351070659e Correct ordering when checking for privileged ports 2016-02-14 09:43:01 -08:00
Ralph Castain
233bd085ca Protect against a non-privileged port connecting to us when we are running as root
Don't close the listener socket upon error unless we are giving up

Cleanup the incoming socket
2016-02-13 08:07:27 -08:00