
66 Commits

Author SHA1 Message Date
Jeff Squyres
ea21c31f44 * MCA param btl_openib_use_eager_rdma can now override the
INI file use_eager_rdma value (fixes trac:1169)
 * fixed a typo in an MCA param help message
 * made the check for enabling short/eager RDMA more robust in the
   presence of progress threads; it now emits a show_help warning

This commit was SVN r18723.

The following Trac tickets were found above:
  Ticket 1169 --> https://svn.open-mpi.org/trac/ompi/ticket/1169
2008-06-24 18:31:46 +00:00
Jeff Squyres
e0545460ff Fixes trac:1355: allow INI file to set max_inline_data value, and if not
specified, probe for max value supported by device.

This commit was SVN r18720.

The following Trac tickets were found above:
  Ticket 1355 --> https://svn.open-mpi.org/trac/ompi/ticket/1355
2008-06-24 17:18:07 +00:00
Jeff Squyres
ed17b51204 Adjust the max_inline default size down so that it can be accepted on
multiple adapters (e.g., Chelsio T3).

But we need to figure out how to determine a good value for the
resident adapter(s) at runtime.  It's problematic because, for
example, Mellanox ConnectX and Chelsio T3 report max_inline values
differently at run-time.  If you ibv_create_qp with a max_inline value
of 0, ConnectX reports back a value that is a formula based on a few
other values (e.g., max_send_sge and max_recv_sge).  But T3 always
reports back "64".

We're looking into this to figure out the best way -- reducing the
default right now should allow other adapters to run while we figure
it out.

This commit was SVN r18697.
2008-06-20 18:24:04 +00:00
Pavel Shamis
4537827973 Making the qp allocation more optimized.
- sq parameter was replaced with max_inline parameter
- inline is allocated only for relevant QPs

This commit was SVN r18675.
2008-06-19 08:40:39 +00:00
Ralph Castain
9613b3176c Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.

I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.

This commit was SVN r18619.
2008-06-09 14:53:58 +00:00
Pavel Shamis
7b9024bc05 Updating Mellanox's Copyright in files touched in 2008
This commit was SVN r18592.
2008-06-05 13:40:26 +00:00
Jeff Squyres
69d78c6739 Fixes trac:1215: adds specific show_help messages about PP vs. SRQ/XRC RNR
retry exceeded errors.

This commit was SVN r18554.

The following Trac tickets were found above:
  Ticket 1215 --> https://svn.open-mpi.org/trac/ompi/ticket/1215
2008-06-02 11:03:48 +00:00
Jeff Squyres
64f61ebd07 Fixes trac:1285. Really.
This commit has the same commit message as r18450, but without the
extra bonus memory corruption that was introduced.

This commit was SVN r18467.

The following SVN revision numbers were found above:
  r18450 --> open-mpi/ompi@5295902ebe

The following Trac tickets were found above:
  Ticket 1285 --> https://svn.open-mpi.org/trac/ompi/ticket/1285
2008-05-20 21:53:42 +00:00
Jeff Squyres
76fc8dd188 Revert r18450 -- there is some memory badness in there somewhere...
This commit was SVN r18451.

The following SVN revision numbers were found above:
  r18450 --> open-mpi/ompi@5295902ebe
2008-05-18 19:11:45 +00:00
Jeff Squyres
5295902ebe Fixes trac:1285:
* allow receive_queues to be specified in the INI file 
 * detect when multiple different receive_queues are specified and 
   gracefully abort 

However, accomplishing these goals ran into multiple difficulties. By 
putting receive_queues in the INI file: 

 1. we may not find the value until we've already traversed multiple HCAs 
 1. we may find multiple different receive_queues values

But since the openib btl initializes as it discovers each HCA/port/LID
(including the BSRQ data), if we find a new receive_queues value late
in the discovery process, then all the BSRQ data that was previously
initialized will likely be invalid. So I had to pull all the BSRQ
initialization out until after the rest of the discovery /
initialization process.

Additionally, note that if the user specifies the MCA parameter
btl_openib_receive_queues, it trumps whatever was in the INI file. So
in this case, there can never be a receive_queues conflict.  This
commit does the following (Jon wrote part of this, too):

 * adapt _ini.c to accept the "receive_queues" field in the file 
 * move 90% of _setup_qps() from _ini.c to _component.c 
 * move what was left of _setup_qps() into the main 
   _register_mca_params() function 
 * adapt init_one_hca() to detect conflicting receive_queues values 
   from the INI file 
 * after the _component.c loop calling init_one_hca(): 
   * call setup_qps() to parse the final receive_queues string value 
   * traverse all resulting btls and initialize their HCAs (if they
     weren't already): setup some lists and call prepare_hca_for_use()
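
The precedence rules described in this message can be pictured with the
following Python sketch. It is an illustration only, not the C code in
_component.c; the function name `resolve_receive_queues` and the fallback to a
built-in default when nothing is specified are assumptions.

```python
from typing import Iterable, Optional


def resolve_receive_queues(mca_param: Optional[str],
                           ini_values: Iterable[Optional[str]],
                           default: str) -> str:
    """Pick the single receive_queues string used for all HCAs.

    mca_param  -- value of btl_openib_receive_queues if the user set it
    ini_values -- one entry per discovered HCA, None if the INI file
                  has no receive_queues field for that HCA
    default    -- hypothetical built-in default (an assumption here)
    """
    # A user-supplied MCA parameter trumps the INI file, so no
    # conflict is possible in that case.
    if mca_param is not None:
        return mca_param

    # Otherwise every HCA that specifies a value must agree.
    distinct = {v for v in ini_values if v is not None}
    if len(distinct) > 1:
        raise ValueError("conflicting receive_queues values in INI file: "
                         + ", ".join(sorted(distinct)))
    if distinct:
        return distinct.pop()
    return default
```

Because the final value is only known after all HCAs have been examined, a
function like this can only be called once discovery has finished, which
mirrors why the BSRQ setup had to move after the init_one_hca() loop.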

I tested this code on a dual-HCA system where I artificially put in 
differing receive_queues values in the INI file for the two different 
types of HCAs that I have and it all seemed to work.

This commit was SVN r18450.

The following Trac tickets were found above:
  Ticket 1285 --> https://svn.open-mpi.org/trac/ompi/ticket/1285
2008-05-18 18:50:56 +00:00
Jeff Squyres
e7ecd56bd2 This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.

= ORTE Job-Level Output Messages =

Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace on
the existing ORTE / OMPI layers):

 * orte_output(): (and corresponding friends ORTE_OUTPUT,
   orte_output_verbose, etc.)  This function sends the output directly
   to the HNP for processing as part of a job-specific output
   channel.  It supports all the same outputs as opal_output()
   (syslog, file, stdout, stderr), but for stdout/stderr, the output
   is sent to the HNP for processing and output.  More on this below.
 * orte_show_help(): This function is a drop-in-replacement for
   opal_show_help(), with two differences in functionality:
   1. the rendered text help message output is sent to the HNP for
      display (rather than outputting directly into the process' stderr
      stream)
   1. the HNP detects duplicate help messages and does not display them
      (so that you don't see the same error message N times, once from
      each of your N MPI processes); instead, it counts "new" instances
      of the help message and displays a message every ~5 seconds when
      there are new ones ("I got X new copies of the help message...")

opal_show_help and opal_output still exist, but they only output in
the current process.  The intent for the new orte_* functions is that
they can apply job-level intelligence to the output.  As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not their opal_* functions.
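
To make the aggregation behavior concrete, here is a small Python sketch of the
duplicate detection described above. It is not the ORTE implementation: the
class name `HelpAggregator`, the choice of (file, topic) as the duplicate key,
and the wording of the summary line are all assumptions.

```python
from typing import Dict, List, Tuple


class HelpAggregator:
    """Toy model of the HNP side of orte_show_help aggregation."""

    def __init__(self, interval: float = 5.0) -> None:
        self.interval = interval      # seconds between summary lines
        self.last_flush = 0.0
        # (file, topic) -> copies received since the last summary
        self.new_copies: Dict[Tuple[str, str], int] = {}

    def receive(self, filename: str, topic: str, rendered: str) -> List[str]:
        """Handle one help message; return the lines to display now."""
        key = (filename, topic)
        if key not in self.new_copies:
            # First instance: display it immediately.
            self.new_copies[key] = 0
            return [rendered]
        # Duplicate: only count it.
        self.new_copies[key] += 1
        return []

    def flush(self, now: float) -> List[str]:
        """Called periodically; summarize duplicates seen since last time."""
        if now - self.last_flush < self.interval:
            return []
        self.last_flush = now
        lines = []
        for (filename, topic), count in self.new_copies.items():
            if count:
                lines.append(f"{count} more copies of help message "
                             f"{filename} / {topic}")
                self.new_copies[(filename, topic)] = 0
        return lines
```

With N processes reporting the same error, the first call to `receive()`
returns the message text and the remaining N-1 calls return nothing; the next
`flush()` after the interval reports the N-1 suppressed copies.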

=== New code ===

For ORTE and OMPI programmers, here's what you need to do differently
in new code:

 * Do not include opal/util/show_help.h or opal/util/output.h.
   Instead, include orte/util/output.h (this one header file has
   declarations for both the orte_output() series of functions and
   orte_show_help()).
 * Effectively s/opal_output/orte_output/gi throughout your code.
   Note that orte_output_open() takes a slightly different argument
   list (as a way to pass data to the filtering stream -- see below),
   so if you explicitly call opal_output_open(), you'll need to
   slightly adapt to the new signature of orte_output_open().
 * Literally s/opal_show_help/orte_show_help/.  The function signature
   is identical.

=== Notes ===

 * orte_output'ing to stream 0 will behave much like
   opal_output'ing did, so leaving a hard-coded "0" as the first
   argument is safe.
 * For systems that do not use ORTE's RML or the HNP, the effect of
   orte_output_* and orte_show_help will be identical to their opal
   counterparts (the additional information passed to
   orte_output_open() will be lost!).  Indeed, the orte_* functions
   simply become trivial wrappers to their opal_* counterparts.  Note
   that we have not tested this; the code is simple but it is quite
   possible that we mucked something up.

= Filter Framework =

Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr.  The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations.  The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc.  This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).

Filtering is not active by default.  Filter components must be
specifically requested, such as:

{{{
$ mpirun --mca filter xml ...
}}}

There can only be one filter component active.

= New MCA Parameters =

The new functionality described above introduces two new MCA
parameters:

 * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
   help messages will be aggregated, as described above.  If set to 0,
   all help messages will be displayed, even if they are duplicates
   (i.e., the original behavior).
 * '''orte_base_show_output_recursions''': An MCA parameter to help
   debug one of the known issues, described below.  It is likely that
   this MCA parameter will disappear before v1.3 final.

= Known Issues =

 * The XML filter component is not complete.  The current output from
   this component is preliminary and not real XML.  A bit more work
   needs to be done so that configure.m4 searches for an appropriate XML
   library, links it in, and uses it at run time.
 * There are possible recursion loops in the orte_output() and
   orte_show_help() functions -- e.g., if RML send calls orte_output()
   or orte_show_help().  We have some ideas how to fix these, but
   figured that it was ok to commit before feature freeze with known
   issues.  The code currently contains sub-optimal workarounds so
   that this will not be a problem, but it would be good to actually
   solve the problem rather than have hackish workarounds before v1.3 final.

This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
Jeff Squyres
ba5615a18f Merge in /tmp-public/cpc3 branch to trunk. oob/xoob still remains the
default CPC.

This commit was SVN r18356.
2008-05-02 11:52:33 +00:00
Jeff Squyres
c40740947f Fix minor spelling error.
This commit was SVN r18229.
2008-04-22 13:11:50 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Pavel Shamis
a0d12a9c92 Adding support for APM over different ports
This commit was SVN r17521.
2008-02-20 13:44:05 +00:00
Pavel Shamis
7d83f34eb0 Protecting the apm code with OMPI_HAVE_THREADS.
This commit was SVN r17284.
2008-01-28 16:10:18 +00:00
Pavel Shamis
28a3917306 Adding APM support (over different lids).
This commit was SVN r17280.
2008-01-28 10:38:08 +00:00
Gleb Natapov
c9a1b06771 Remove trailing whitespaces. No code changes in this commit.
This commit was SVN r17167.
2008-01-21 12:11:18 +00:00
Gleb Natapov
b06d92bdab OpenIB BTL has three channels through which data can be received (eager rdma,
high prio QPs and low prio QPs) and because not all of them are polled each time
progress() is called (to save on latency) starvation is possible. The commit
fixes this. Now each channel is polled, but higher priority channels are polled
more often. Three new parameters are introduced that control polling ratios 
between different channels.
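
One way to picture ratio-based polling is the Python sketch below. It is not
the code from this commit: the class name, the channel names, and the idea of
expressing each ratio as "poll once every N progress calls" are assumptions
made for illustration.

```python
from typing import Dict, List


class ProgressPoller:
    """Poll every channel eventually, higher-priority ones more often."""

    def __init__(self, periods: Dict[str, int]) -> None:
        # periods maps a channel name to N, meaning "poll this channel
        # on every Nth progress() call". A finite period for every
        # channel is what rules out starvation.
        if any(p < 1 for p in periods.values()):
            raise ValueError("each polling period must be >= 1")
        self.periods = dict(periods)
        self.calls = 0

    def progress(self) -> List[str]:
        """Return the channels that would be polled on this call."""
        self.calls += 1
        return [name for name, period in self.periods.items()
                if self.calls % period == 0]


# Hypothetical ratios: eager RDMA every call, high-priority QPs every
# 2nd call, low-priority QPs every 4th call.
poller = ProgressPoller({"eager_rdma": 1, "high_prio": 2, "low_prio": 4})
```

Smaller periods favor latency on the high-priority channels, while any finite
period on the low-priority channel guarantees it is still serviced.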

This commit was SVN r17024.
2007-12-23 12:29:34 +00:00
Gleb Natapov
64a95f63cd Fix error reporting in openib if parameter value is out of range.
This commit was SVN r16971.
2007-12-16 14:04:36 +00:00
Gleb Natapov
8b511b969d Introduce a new BTL parameter btl_rndv_eager_limit which determines size of a
first fragment of rendezvous protocol. Remove no longer used btl_min_send_size
parameter.

This commit was SVN r16969.
2007-12-16 08:35:17 +00:00
Gleb Natapov
5313a2baa7 Message coalescing for openib BTL. If fragment is waiting to be transmitted in
a pending queue pack another message into it if there is enough space there.
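
The idea can be sketched in a few lines of Python. This is a simplified model,
not the BTL code: it ignores per-message headers, only tries the most recently
queued fragment (to keep messages in order), and the names `Fragment` and
`PendingQueue` are hypothetical.

```python
from collections import deque
from typing import Deque, List


class Fragment:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity          # bytes available in the fragment
        self.used = 0
        self.messages: List[bytes] = []

    def try_pack(self, msg: bytes) -> bool:
        """Append msg if it still fits; report whether it was packed."""
        if self.used + len(msg) > self.capacity:
            return False
        self.messages.append(msg)
        self.used += len(msg)
        return True


class PendingQueue:
    def __init__(self, frag_capacity: int) -> None:
        self.frag_capacity = frag_capacity
        self.frags: Deque[Fragment] = deque()

    def enqueue(self, msg: bytes) -> None:
        # Coalesce: reuse the fragment that is already waiting, if any
        # and if it has room.
        if self.frags and self.frags[-1].try_pack(msg):
            return
        frag = Fragment(self.frag_capacity)
        if not frag.try_pack(msg):
            raise ValueError("message larger than a fragment")
        self.frags.append(frag)
```

Coalescing only happens while a fragment is stuck in the pending queue, so it
adds no delay when the send path is not congested.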

This commit was SVN r16900.
2007-12-09 14:05:13 +00:00
Jon Mason
20294e7800 There is a double call to ompi_btl_openib_connect_base_open in
mca_btl_openib_mca_setup_qps().  It looks like someone just forgot to
clean-up the previous call when they added the check for the return
code.

I ran a quick IMB test over IB to verify everything is still working.

This commit was SVN r16870.
2007-12-06 17:25:38 +00:00
Gleb Natapov
a774cd98f8 Put send completions to low prio CQ. Receive is more important.
This commit was SVN r16817.
2007-12-02 14:46:37 +00:00
Gleb Natapov
b17f5b7480 Change how default receive queues parameters are calculated. Current default
parameters don't make any sense. Credits are never piggybacked. Also make
default queue sizes to be calculated from eager_limit and max_send_size values.

This commit was SVN r16816.
2007-12-02 14:43:28 +00:00
Gleb Natapov
b46c9cc7bc Make xrc use srq_qp unions instead of the xrc_qp which is exactly like srq_qp.
This commit was SVN r16789.
2007-11-28 07:20:26 +00:00
Gleb Natapov
bd47da4699 Initial XRC support by Mellanox.
This commit was SVN r16787.
2007-11-28 07:18:59 +00:00
Gleb Natapov
5463eb892c Send all explicit credits for PP QPs of all orders over smallest PP qp.
This commit was SVN r16781.
2007-11-28 07:13:34 +00:00
Gleb Natapov
a9f864d15c If there is an eager rdma credit, but there is no WQE to send a packet we add it
to a pending queue of eager rdma QP instead of correct pending list. This patch
fixes this by getting rid of the "eager rdma qp" notion. A packet is always sent
over its order QP. The patch also adds two pending queues for high and low prio
packets. Only high prio packets are sent over eager RDMA channel.

This commit was SVN r16780.
2007-11-28 07:12:44 +00:00
Jeff Squyres
94b1e9cff9 Update to use BTL_VERBOSE and BTL_ERROR instead of opal_output'ing to
the mca_btl_base_output stream directly (and relying on it to be -1 if
we didn't want any output).

This commit was SVN r16449.
2007-10-15 17:53:02 +00:00
Gleb Natapov
c7105eadc7 Update Voltaire copyright.
This commit was SVN r16189.
2007-09-24 10:11:52 +00:00
Jeff Squyres
33955a0ed0 Oops -- when converted from uint to int, -1 (the default value,
meaning "infinite") is no longer larger than the minimum required
size.  So put in an appropriate test to ensure that "infinite" was not
requested. 

This commit was SVN r16142.
2007-09-17 19:28:21 +00:00
Jeff Squyres
130a272cec Fix some compiler warnings about signed/unsigned comparisons.
This commit was SVN r16139.
2007-09-17 13:08:45 +00:00
Jeff Squyres
6004e177e0 Fixes trac:1133: if you specify a max freelist size that is too small,
you'll get a helpful error message and the openib BTL will deactivate
itself.

This commit was SVN r16133.

The following Trac tickets were found above:
  Ticket 1133 --> https://svn.open-mpi.org/trac/ompi/ticket/1133
2007-09-14 21:42:56 +00:00
Brian Barrett
59b22533f2 Enable RDMA for heterogeneous situations. Currently done by overloading
the ompi_convertor_need_buffers function to only return 0 if the convertor
is homogeneous (which it never does on the trunk, but does on v1.2, but
that's a different issue).  Only enable the heterogeneous rdma code for
a btl if it supports it (via a flag), as some btls need some work for this
to work properly.  Currently only TCP and OpenIB have been extensively tested.

This commit was SVN r15990.
2007-08-28 21:23:44 +00:00
Gleb Natapov
33196d972b post_send() function is called without endpoint lock held from explicit credits
update function so eager_rdma_remote.head has to be updated in a thread safe
manner.

This commit was SVN r15966.
2007-08-27 11:37:01 +00:00
Jeff Squyres
d7c5fea096 * Fix problem caused by r15848: the text parser was looking for
   semicolons but the new specification string used colons.  The text
   parser now looks for colons.
 * Changed all opal_output() error messages to
   much-more-helpful/descriptive opal_show_help() messages.
 * A few minor style/indenting fixes

This commit was SVN r15850.

The following SVN revision numbers were found above:
  r15848 --> open-mpi/ompi@dd30597f39
2007-08-14 14:46:13 +00:00
Jeff Squyres
dd30597f39 Change the default receive_queues value per
http://www.open-mpi.org/community/lists/devel/2007/08/2100.php.

This commit was SVN r15848.
2007-08-13 21:51:05 +00:00
Jeff Squyres
50bae9c603 Bring in the modular-wireup stuff for the openib BTL (from
/tmp/jms-modular-wireup branch):

 * This commit moves all the openib BTL connection code out of
   btl_openib_endpoint.c and into a connect "pseudo-component" area,
   meaning that different schemes for making OFA connections can
   be chosen via function pointer (i.e., MCA parameter) at run-time.
 * The connect/connect.h file includes comments describing the
   specific interface for the connect pseudo-component.
 * Two pseudo-components are in this commit (more can certainly be
   added).
   * oob: use the same old oob/rml scheme for creating OFA connections
     that we've had forever; this now just puts the logic into this
     self-contained pseudo-component.
   * rdma_cm: a currently-empty set of functions (that currently
     return NOT_IMPLEMENTED) that will someday use the RDMA connection
     manager to make OFA connections.

This commit was SVN r15786.
2007-08-06 23:40:35 +00:00
Gleb Natapov
9c20d67301 1) Return IB header to its previous size by using char for cm_seen field.
2) Allow the user to specify rd_win/rd_rsv parameters, but make them optional.

This commit was SVN r15719.
2007-08-01 12:10:56 +00:00
Jeff Squyres
015fc08ff4 Remove the ib_static_rate MCA parameter; it will be replaced with a
dynamic mechanism to adjust the rate only if necessary (e.g., two
ports of differing speeds are connected).

This commit was SVN r15653.
2007-07-26 21:10:51 +00:00
Galen Shipman
438a56e0d7 update copyrights for ib_multifrag commit
This commit was SVN r15612.
2007-07-25 15:03:34 +00:00
Jeff Squyres
f2a2b2c0f9 A little more error checking; clean up the invalid MCA help message
This commit was SVN r15589.
2007-07-24 20:57:40 +00:00
Jeff Squyres
8ace07efed This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
   BTL.
1. Pasha's new implementation of asynchronous HCA event handling.

Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.  

Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artificial changes).  :-(

== Fine-grain control of queue pair resources ==

This adds Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).

Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments.  When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers.  One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.

The new design allows multiple QPs to be specified at runtime.  Each
QP can be set up to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified.  The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:

{{{
-mca btl_openib_receive_queues \
     "P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}

Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma).  The above
example therefore describes 4 QPs.

The first QP is:

    P,128,16,4

Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes).  The third field indicates the number of receive buffers
to allocate to the QP (16).  The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).

The second QP is:

    S,1024,256,128,32

Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP.  The second, third and fourth fields are the same as in the
per-peer based QP.  The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32).  This provides a
"good enough" mechanism of flow control for some regular communication
patterns.

QPs MUST be specified in ascending receive buffer size order.  This
requirement may be removed prior to 1.3 release.
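
As a rough illustration of the syntax described above, the following Python
sketch splits a specification string into QP descriptions and checks the
ascending-size rule. It is not the BTL's parser (which is C code); the names
`QueueSpec` and `parse_receive_queues` are hypothetical, and equal buffer sizes
are accepted here as an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueSpec:
    kind: str                  # "P" (per-peer) or "S" (shared receive queue)
    buffer_size: int           # receive buffer size in bytes
    num_buffers: int           # receive buffers allocated to the QP
    low_watermark: int         # repost receive buffers at this level
    max_pending_sends: Optional[int] = None  # fifth field, SRQ QPs only


def parse_receive_queues(spec: str) -> List[QueueSpec]:
    queues = []
    # QP descriptions are separated by ";", fields by ",".
    for item in spec.split(";"):
        fields = item.split(",")
        kind = fields[0]
        if kind == "P" and len(fields) == 4:
            size, num, low = (int(f) for f in fields[1:])
            queues.append(QueueSpec(kind, size, num, low))
        elif kind == "S" and len(fields) == 5:
            size, num, low, sends = (int(f) for f in fields[1:])
            queues.append(QueueSpec(kind, size, num, low, sends))
        else:
            raise ValueError(f"malformed QP description: {item!r}")

    # QPs must be listed in ascending receive buffer size order.
    sizes = [q.buffer_size for q in queues]
    if sizes != sorted(sizes):
        raise ValueError("QPs must be in ascending receive buffer size order")
    return queues


example = "P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
queues = parse_receive_queues(example)
```

For the example string from this message, `queues` holds one per-peer QP with
128-byte buffers followed by three SRQ QPs with 1024, 4096 and 65536-byte
buffers.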

This commit was SVN r15474.
2007-07-18 01:15:59 +00:00
Gleb Natapov
b88b7dedfe Rename btl_rdma_offset to btl_pipeline_send_length.
This commit was SVN r15153.
2007-06-21 07:12:40 +00:00
Jeff Squyres
930a9b7682 Make the help messages for if_include/if_exclude a little better.
This commit was SVN r15134.
2007-06-19 13:38:58 +00:00
Jeff Squyres
1e18265c16 Bring over the functionality from the /tmp/jnysal-openib-wireup
branch:

 * Support btl_openib_if_include and btl_openib_if_exclude MCA
   parameters, similar to those supported by other BTLs.  Each takes a
   comma-delimited list of identifiers.  Identifiers can be HCA
   interface names (e.g., ipath0, mthca1, etc.)  or an HCA interface
   name and port numbers (e.g., ipath0:1, mthca1:2, etc.).  It is an
   error to specify both _include and _exclude.  If you specify a
   non-existent (or non-ACTIVE) HCA and/or port, you'll get a warning
   unless you disable the warning by setting the MCA parameter
   btl_openib_warn_nonexistent_if to 0.
 * Start updating to use BEGIN_C_DECLS and END_C_DECLS
 * A few other minor fixes that were picked up along the way.
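
The include/exclude semantics described in the first bullet can be sketched in
Python as follows. This is an illustration under assumptions, not the BTL code:
`select_ports` and its arguments are hypothetical names, and `active` stands for
the list of (HCA, port) pairs that were found in the ACTIVE state.

```python
from typing import List, Optional, Tuple

Port = Tuple[str, int]                 # (HCA name, port number)
Pattern = Tuple[str, Optional[int]]    # port is None when only the HCA is named


def _parse(spec: str) -> List[Pattern]:
    patterns = []
    for ident in spec.split(","):
        name, sep, port = ident.strip().partition(":")
        patterns.append((name, int(port) if sep else None))
    return patterns


def _matches(entry: Port, pattern: Pattern) -> bool:
    return entry[0] == pattern[0] and pattern[1] in (None, entry[1])


def select_ports(active: List[Port],
                 if_include: Optional[str] = None,
                 if_exclude: Optional[str] = None,
                 warn_nonexistent: bool = True) -> Tuple[List[Port], List[str]]:
    """Return (ports to use, identifiers that matched nothing)."""
    # Specifying both lists is an error.
    if if_include and if_exclude:
        raise ValueError("if_include and if_exclude are mutually exclusive")
    spec = if_include or if_exclude
    if not spec:
        return list(active), []

    patterns = _parse(spec)
    hit = [e for e in active if any(_matches(e, p) for p in patterns)]
    selected = hit if if_include else [e for e in active if e not in hit]

    # Identifiers naming a non-existent (or non-ACTIVE) HCA/port are
    # reported unless the warning is disabled.
    warnings = []
    if warn_nonexistent:
        for name, port in patterns:
            if not any(_matches(e, (name, port)) for e in active):
                warnings.append(name if port is None else f"{name}:{port}")
    return selected, warnings
```

Passing `warn_nonexistent=False` plays the role of setting
btl_openib_warn_nonexistent_if to 0.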

This commit was SVN r15063.
2007-06-14 01:59:25 +00:00
Galen Shipman
5340f5e320 Try to cleanup the flow control logic a bit
Renamed a few variables 
Initialize the reserve receive buffers to 1, prior to this they were initialized
to zero. 

This commit was SVN r14919.
2007-06-06 18:51:09 +00:00
Jeff Squyres
81df632e29 Clarification to MCA parameter help messages
This commit was SVN r14765.
2007-05-24 19:18:29 +00:00
Gleb Natapov
3ebaff8dfe Implement new BTL parameters:
We eagerly send data up to btl_*_eager_limit with the match
Upon ACK of the MATCH we start using send/receives of size
btl_*_max_send_size up to the btl_*_rdma_pipeline_offset
After the btl_*_rdma_pipeline_offset we begin using RDMA writes of
size btl_*_rdma_pipeline_frag_size.

Now, on a per message basis we only use the above protocol if the
message is larger than btl_*_min_rdma_pipeline_size

btl_*_eager_limit -> same
btl_*_max_send_size -> same
btl_*_rdma_pipeline_offset -> btl_*_min_rdma_size
btl_*_rdma_pipeline_frag_size -> btl_*_max_rdma_size


btl_*_min_rdma_pipeline_size is new..
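
The way these parameters carve up a single message can be illustrated with the
Python sketch below. It is a simplified reading of the description above, not
the PML/BTL code: `plan_transfer` is a hypothetical name, the pipeline offset is
treated as an absolute offset into the message, and what happens to messages at
or below btl_*_min_rdma_pipeline_size is left out because this message does not
say.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, int, int]   # (method, offset into message, length)


def plan_transfer(msg_len: int,
                  eager_limit: int,
                  max_send_size: int,
                  rdma_pipeline_offset: int,
                  rdma_pipeline_frag_size: int,
                  min_rdma_pipeline_size: int) -> Optional[List[Step]]:
    if min(eager_limit, max_send_size, rdma_pipeline_frag_size) < 1:
        raise ValueError("sizes must be positive")

    # The pipeline protocol is only used for messages larger than
    # min_rdma_pipeline_size; other messages are not modeled here.
    if msg_len <= min_rdma_pipeline_size:
        return None

    # 1. Eager data, sent together with the match.
    offset = min(msg_len, eager_limit)
    plan: List[Step] = [("eager", 0, offset)]

    # 2. After the ACK: send/receive fragments up to the pipeline offset.
    send_end = min(msg_len, rdma_pipeline_offset)
    while offset < send_end:
        n = min(max_send_size, send_end - offset)
        plan.append(("send", offset, n))
        offset += n

    # 3. Beyond the pipeline offset: RDMA writes.
    while offset < msg_len:
        n = min(rdma_pipeline_frag_size, msg_len - offset)
        plan.append(("rdma_write", offset, n))
        offset += n
    return plan
```

For example, with deliberately tiny illustrative values,
`plan_transfer(100, 10, 20, 50, 30, 64)` yields one eager step of 10 bytes, two
sends of 20 bytes, and RDMA writes of 30 and 20 bytes.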

This patch also moves all BTL common parameters initialisation into
btl_base_mca.c file.

This commit was SVN r14681.
2007-05-17 07:54:27 +00:00