Commit graph

459 commits

Author SHA1 Message Date
Alina Sklarevich
e4c4e7df5e Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)

This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.

Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and therefore it failed).
2015-02-25 10:58:50 +02:00
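The ordering constraint described in e4c4e7df5e can be illustrated with a toy model. This is only a sketch: `VerbsContext`, `fork_init`, `create_qp` and `init_component` are hypothetical Python stand-ins, not the libibverbs API, and the failure rule is taken from the commit message alone (the call fails once queue pairs or completion queues exist, and repeating a successful call is harmless).

```python
class VerbsContext:
    """Toy stand-in for a verbs library; NOT the real libibverbs API."""

    def __init__(self):
        self.resources_created = False
        self.fork_safe = False

    def fork_init(self):
        # Assumption from the commit message: the call only takes effect
        # before any QP/CQ exists, and calling it again is safe.
        if self.fork_safe:
            return True
        if self.resources_created:
            return False
        self.fork_safe = True
        return True

    def create_qp(self):
        # Creating any resource closes the window for fork_init().
        self.resources_created = True


def init_component(ctx, want_fork_support=True):
    """Each verbs-using component requests fork support first, then creates resources."""
    ok = ctx.fork_init() if want_fork_support else True
    ctx.create_qp()
    return ok
```

In this model a component that creates a queue pair first can no longer obtain fork support, while any number of components that call `init_component` in sequence all succeed.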
Jeff Squyres
a85a392896 Merge pull request #422 from jsquyres/topic/coverity-fixes
Some Coverity fixes
2015-02-24 17:00:10 -05:00
Jeff Squyres
71ae0ad5ec oob_tcp_component: add #if OPAL_ENABLE_IPV6 around IPv6-specific code
This was CID 1196629
2015-02-24 15:24:11 -05:00
Jeff Squyres
0bd2783b91 oob_usock: don't try to close the socket if it didn't open
This was CID 1196663
2015-02-24 15:24:09 -05:00
Nathan Hjelm
ed78553512 Update opal_free_list_t usage to reflect new class interface.
Please verify your components have been updated correctly. Keep in
mind that in terms of threading:

OPAL_FREE_LIST_GET -> opal_free_list_get_st
OPAL_FREE_LIST_RETURN -> opal_free_list_return_st

I used the opal_using_threads() variant anytime it appeared multiple
threads could be operating on the free list. If this is not the case
update to _st. If multiple threads are always in use change to _mt.
2015-02-24 10:05:44 -07:00
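The threading note in ed78553512 distinguishes three call styles: a single-threaded variant (`_st`), a thread-safe variant (`_mt`), and one that decides at run time whether locking is needed. A minimal Python sketch of that split follows; the `FreeList` class and its method names are illustrative assumptions, not the opal_free_list_t interface.

```python
import threading


class FreeList:
    """Illustrative free list with single-threaded and locked accessors."""

    def __init__(self, items):
        self._items = list(items)
        self._lock = threading.Lock()

    def get_st(self):
        # Single-threaded variant: no locking at all.
        return self._items.pop() if self._items else None

    def get_mt(self):
        # Multi-threaded variant: always takes the lock.
        with self._lock:
            return self.get_st()

    def get(self, using_threads):
        # Run-time variant: lock only when threads may be in use.
        return self.get_mt() if using_threads else self.get_st()

    def return_item(self, item, using_threads):
        if using_threads:
            with self._lock:
                self._items.append(item)
        else:
            self._items.append(item)
```

The run-time variant costs one extra branch per call, which is why the commit message suggests switching to `_st` or `_mt` when the threading situation is known in advance.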
Howard Pritchard
bf89131f9e add owner files to opal/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
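The parsing script mentioned in bf89131f9e is not shown in this log. As a rough sketch of the idea, the code below assumes a hypothetical `key: value` layout for the owner file (the commit message only says there are two fields, an owner and a status) and produces a simple pipe-delimited table. The function names and the table layout are assumptions as well.

```python
def parse_owner_file(text):
    """Parse a hypothetical 'key: value' owner file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def wiki_table(components):
    """Render {component name: fields} as a pipe-delimited table."""
    rows = ["| Component | Owner | Status |", "| --- | --- | --- |"]
    for name in sorted(components):
        fields = components[name]
        rows.append("| {} | {} | {} |".format(
            name, fields.get("owner", "?"), fields.get("status", "?")))
    return "\n".join(rows)
```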
Ralph Castain
d2938a144f Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch 2015-01-12 05:31:02 -08:00
Artem Polyakov
01601f3284 Merge pull request #305 from artpol84/timing
Timing framework improvement
2014-12-16 15:13:48 +06:00
Ralph Castain
f4ff791335 Close oob/usock connections upon exec 2014-12-13 20:24:09 -08:00
Ralph Castain
6c4d5a51c4 Close tcp sockets upon exec 2014-12-13 20:23:53 -08:00
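Closing descriptors upon exec, as in f4ff791335 and 6c4d5a51c4, is normally done by setting the close-on-exec flag on each descriptor. The sketch below shows the mechanism with Python's `fcntl` module on a POSIX system; it is an illustration of the flag, not the code from those commits, and the helper names are made up.

```python
import fcntl


def set_close_on_exec(fd):
    """Set FD_CLOEXEC so the descriptor is closed automatically on exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_close_on_exec(fd):
    """Report whether FD_CLOEXEC is currently set on the descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

Without the flag, a child program started via exec inherits the open socket and can keep the connection alive after the parent has gone.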
Artem Polyakov
8ffad75a0a Introduce timing interval measurement facility in timing framework 2014-12-10 16:47:49 +06:00
Ralph Castain
d6d69e2b13 Get the direct routed component to work with both TCP and USOCK OOB components. We previously had set up the direct component so it would only support direct-launched applications. Thus, all routes went direct between processes. However, if the job had been launched by mpirun, this made no sense - what you wanted instead was to have each app proc talk directly to its daemon, but have the daemons all directly connect to each other.
So we need all the routing code for dealing with cross-job communications, lifelines, etc. The HNP will be directly connected to all daemons as they must callback at startup, and so we need to track those children correctly so we know when it is okay to terminate.

We still have to support direct launch, though, as this is the only component we can use in that scenario. So if the app doesn't have daemon URI info, then it must fall back to directly connecting to everything.
2014-12-07 09:11:48 -08:00
Nadezhda Kogteva
8dd21c7736 OOB UD: fix case when multiple oob components were specified in command line (checking of uri). 2014-11-25 11:48:11 +02:00
Gilles Gouaillardet
a6744b8177 fix misc memory leaks specific to the master 2014-11-25 13:52:10 +09:00
Ralph Castain
2e00e335b9 Add missing header to tarball. Remove stale opal_unignore 2014-11-21 17:35:11 -08:00
Nadezhda Kogteva
05b2eb1270 OOB UD: opal_ignore removed from oob ud component: component is compilable. Added support of new RML API, support of opal_buffer as input data. Added usage of routed component. 2014-11-20 10:20:35 +02:00
Ralph Castain
780c93ee57 Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
2014-11-11 17:00:42 -08:00
Gilles Gouaillardet
652ecdb888 oob/tcp: always include a missing header file
improve open-mpi/ompi@c9d1e16a9e
2014-10-29 13:39:23 +09:00
Gilles Gouaillardet
c9d1e16a9e oob/tcp: include a missing header file
warning can be seen under cygwin without the missing header file
2014-10-28 13:56:25 +09:00
Ralph Castain
4fc4a8346b Fix a couple of minor issues. Ensure usock isn't used if the session dirs aren't setup. Protect an oddball case where orte_xml_fp is NULL. 2014-10-09 20:58:46 -07:00
Gilles Gouaillardet
9661e4537f oob/tcp: fix a race condition
Mimic the btl/tcp protocol to solve the race condition that happens
when two peers try to connect to each other at the same time

cmr=v1.8.4:reviewer=rhc

This commit was SVN r32799.
2014-09-26 06:54:30 +00:00
Gilles Gouaillardet
5fa2b6c59c oob/tcp: fix a race condition
Refs trac:4909

This commit was SVN r32754.

The following Trac tickets were found above:
  Ticket 4909 --> https://svn.open-mpi.org/trac/ompi/ticket/4909
2014-09-18 08:17:25 +00:00
Ralph Castain
3a437cbdb3 Silence set-but-not-used warning when timing isn't enabled
This commit was SVN r32749.
2014-09-17 00:40:10 +00:00
Ralph Castain
414f4e9783 Try to provide a real hostname for the remote host to aid in debugging
Refs trac:4908

This commit was SVN r32748.

The following Trac tickets were found above:
  Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908
2014-09-17 00:39:49 +00:00
Jeff Squyres
9dc49c5f92 oob_tcp_connection: print "<unknown>" instead of "NULL"
"NULL" doesn't mean anything to the user, and is somewhat confusing
to see in an error message.  "<unknown>" at least indicates that
there's an error, and we know who the peer is.

This commit was SVN r32747.
2014-09-16 22:47:57 +00:00
Ralph Castain
09aecea55a Can't use show_help as the RML has already been enabled, but we haven't successfully connected back to the HNP. So use opal_output instead and hardwire the message.
Refs trac:4908

This commit was SVN r32746.

The following Trac tickets were found above:
  Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908
2014-09-16 22:21:02 +00:00
Ralph Castain
4bbc9a28d6 Try to resolve the simultaneous connection problem by being a little more careful about the choice of returned status when a connection is refused. As before, have the higher vpid of the two peers retry the connection, while the lower one waits. This can happen in a couple of places, so try to hit them all. Since this is hard to test, will ask Gilles to give it a try since he's the one who is seeing it.
cmr=v1.8.3:reviewer=rhc

This commit was SVN r32744.
2014-09-16 18:59:36 +00:00
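The tie-break rule stated in 4bbc9a28d6 (the higher vpid of the two peers retries, the lower one waits) can be written as a small decision function. This is a sketch of the rule only; the function name and the `RETRY`/`WAIT` constants are illustrative and do not come from the OOB/TCP source.

```python
RETRY = "retry"
WAIT = "wait"


def on_connection_refused(my_vpid, peer_vpid):
    """Decide what to do when a simultaneous connect attempt is refused."""
    if my_vpid == peer_vpid:
        raise ValueError("a process does not connect to itself")
    # Exactly one side retries, so the two peers cannot collide again.
    return RETRY if my_vpid > peer_vpid else WAIT
```

Because both peers evaluate the same comparison with the arguments swapped, they always reach opposite decisions.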
Ralph Castain
a74428513d Provide a better help message when we are unable to complete a connection due to a firewall.
cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32743.
2014-09-16 16:28:29 +00:00
Ralph Castain
dfb952fa78 [Contribution from Artem - moved it to svn from git for him]
Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup.

This commit was SVN r32738.
2014-09-15 18:00:46 +00:00
Ralph Castain
4d186e6402 Properly protect the MCA parameters being registered by the OOB/TCP component when IPv6 is enabled
cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32662.
2014-09-02 14:53:00 +00:00
Ralph Castain
e49ca05f11 Remove unused variable
This commit was SVN r32651.
2014-08-31 03:11:50 +00:00
Ralph Castain
5cdbc00136 Re-enable the usock oob component. Ensure the TCP component promotes messages for other procs to the OOB base so that other components have a chance to send the relay. Seems to be passing MTT, so let's see how it works for others.
This commit was SVN r32650.
2014-08-30 19:33:46 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:   Merge the PMIx branch into the devel repo, creating a new
        OPAL “pmix” framework to abstract PMI support for all RTEs.
        Replace the ORTE daemon-level collectives with a new PMIx
        server and update the ORTE grpcomm framework to support
        server-to-server collectives

WHY:    We’ve had problems dealing with variations in PMI implementations,
        and need to extend the existing PMI definitions to meet exascale
        requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
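The signature scheme described in aec5cd08bd, where a collective is identified by the set of participating procs rather than by a global id, can be sketched as follows. The `CollectiveTracker` class is a hypothetical illustration, and representing a process name as a `(jobid, vpid)` tuple is an assumption made for the example.

```python
class CollectiveTracker:
    """Track active collectives keyed by the set of participating procs."""

    def __init__(self):
        self._active = set()

    @staticmethod
    def signature(procs):
        # Order-independent key built from the participants' names.
        return tuple(sorted(set(procs)))

    def start(self, procs):
        sig = self.signature(procs)
        if sig in self._active:
            raise RuntimeError("a collective is already active for these procs")
        self._active.add(sig)
        return sig

    def finish(self, sig):
        self._active.discard(sig)
```

A proc may appear in several signatures at once; only starting a second collective over the identical combination of procs is rejected.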
Ralph Castain
b4511913f6 Remove an unnecessary optimization that can cause more trouble than it's worth - just try all the addresses that are given to us.
Refs trac:4870

This commit was SVN r32558.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-20 20:58:07 +00:00
Ralph Castain
fa28710d53 Track down the last piece of the connection problem. It appears that
providing a netmask of 0 to opal_net_samenetwork results in everything
looking like it is on the same network. Hence, we were not retaining any
of the alternative addresses, so we had no other way to check them.

Refs trac:4870

This commit was SVN r32556.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-20 16:55:36 +00:00
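The effect reported in fa28710d53 follows directly from how a same-network test works: both addresses are masked and compared, and a mask of zero erases every bit. The sketch below shows this for IPv4; `same_network` is an illustrative helper, not `opal_net_samenetwork` itself.

```python
import ipaddress


def same_network(addr_a, addr_b, prefix_len):
    """Compare two IPv4 addresses under a netmask of prefix_len bits."""
    a = int(ipaddress.IPv4Address(addr_a))
    b = int(ipaddress.IPv4Address(addr_b))
    # prefix_len == 0 yields mask 0, so every pair of addresses matches.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return (a & mask) == (b & mask)
```

With a prefix length of 0, unrelated addresses such as 10.0.0.1 and 192.168.1.1 compare as being on the same network, which is why the alternative addresses were being discarded.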
Ralph Castain
343038af7b Frazzle-frump! Missed that we reset the peer state just before the new check.
Refs trac:4870

This commit was SVN r32554.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-19 22:34:49 +00:00
Ralph Castain
0a91fdf85f If an initial address fails to connect, record that fact and attempt the next address for that proc. If nothing succeeds, then declare failure.
cmr=v1.8.2:reviewer=edgar

This commit was SVN r32553.
2014-08-19 19:48:24 +00:00
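The fallback behaviour described in 0a91fdf85f (record a failed address, move on to the next one, and declare failure only when none works) is sketched below. `connect_to_peer` and the `try_connect` callback are hypothetical names; the callback stands in for whatever actually opens the socket.

```python
def connect_to_peer(addresses, try_connect):
    """Try each address in turn; return the first that works plus the failures."""
    failed = []
    for addr in addresses:
        if try_connect(addr):
            return addr, failed
        failed.append(addr)  # remember the failure and move on
    raise ConnectionError(
        "no address for this peer accepted a connection: {}".format(failed))
```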
Ralph Castain
5db717f090 Some small leak cleanups
cmr=v1.8.3:reviewer=artpol

This commit was SVN r32358.
2014-07-30 15:46:02 +00:00
Ralph Castain
f3cb124e50 Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed.
This commit was SVN r32089.

The following SVN revision numbers were found above:
  r32070 --> open-mpi/ompi@12d92d0c22
  r32082 --> open-mpi/ompi@aa6438ef7a
2014-06-25 20:43:28 +00:00
Adrian Reber
4aca7095dc fix a syntax error in the FT code
This commit was SVN r32087.
2014-06-25 20:35:50 +00:00
Ralph Castain
12d92d0c22 Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS
This commit was SVN r32070.
2014-06-24 17:05:11 +00:00
Ralph Castain
ba926d8635 The TCP component will have set the hash table entry to NULL, but that doesn't remove the key. So the hash_table retrieval function will return success, but with a NULL pointer - protect against that scenario
Patch provided by Gilles - reviewed by rhc.

RM-approved

cmr=v1.8.2:reviewer=ompi-gk1.8

This commit was SVN r31971.
2014-06-09 17:46:22 +00:00
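The pitfall fixed in ba926d8635 is that a lookup can succeed while the stored value is empty, because clearing a value does not remove its key. A Python dict with a `None` value behaves the same way. The helper names below are illustrative, not the OPAL hash table API.

```python
def get_value(table, key):
    """Mimic a C-style lookup that returns (success, value)."""
    if key in table:
        return True, table[key]
    return False, None


def lookup_peer(table, key):
    """Return the peer, or None if the key is missing OR its value was cleared."""
    ok, peer = get_value(table, key)
    if not ok or peer is None:
        return None
    return peer
```

Checking only the success flag would hand a null peer to the caller, so both conditions are tested.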
Ralph Castain
e21bfeadcd Now that the BTLs are moving down to OPAL and becoming available to ORTE, there no longer is a need/desire to push performance in the OOB/TCP component. So we don't need multiple modules driving NICs in parallel, and can drop all the complicated distribution logic. Fall back to the simplified single module model, but retain the ability to run that module in its own progress thread if so directed.
This should eliminate the connectivity issues that have been reported, and will make maintenance of this component much easier.

cmr=v1.8.2:reviewer=jsquyres:subject=simplify the OOB/TCP component

This commit was SVN r31956.
2014-06-06 02:24:17 +00:00
Ralph Castain
7df500ecf5 Break the loop caused by retrying to send a message to a hop that is unknown by the TCP oob component. We attempt to provide a way for other components to try, but need to mark that the TCP component is not able to reach that process so the OOB base will know to give up.
This commit was SVN r31928.
2014-06-02 15:00:33 +00:00
Nathan Hjelm
59d09ad9de orte: fix several small memory leaks
grpcomm: fix memory leaks

We were leaking the caddy object used to pass data to the callback
function. This commit fixes these leaks.

oob,rml: fix memory leaks

This commit fixes several leaks:

 - Both the oob/base and oob/tcp were leaking objects on their peer
   hash tables. Iterate on the hash tables and free any objects.

 - Leaked sent messages because of missing OBJ_RELEASE. I placed the
   release in ORTE_RML_SEND_COMPLETE to catch all the possible
   paths.

ess/base: close the state framework

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31776.
2014-05-15 15:06:27 +00:00
Ralph Castain
ad0e8f841d Just pick a module to handle the incoming connection if no direct interface is identified. Siegmar hit it because his IP/netmask is disjoint, but a router was able to make the connection.
Refs trac:4627

This commit was SVN r31763.

The following Trac tickets were found above:
  Ticket 4627 --> https://svn.open-mpi.org/trac/ompi/ticket/4627
2014-05-14 19:23:02 +00:00
Ralph Castain
e605e73379 Close the incoming socket if we aren't going to accept it
cmr=v1.8.2:reviewer=rhc

This commit was SVN r31759.
2014-05-14 16:51:59 +00:00
Ralph Castain
5602156a1c Use the correct abstraction layer name for the data dirs
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091 The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
a8e2d6c3a6 The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature:
top_ompi_srcdir  ->  OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR

We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.

Only thing left is ompilibdir being treated similar to what we did for srcdir/builddir. Coming soon.

This commit was SVN r31678.
2014-05-07 21:48:53 +00:00