1
1
Граф коммитов

133 Коммитов

Автор SHA1 Сообщение Дата
George Bosilca
00d24bf8ab Scalability patch, or slim-fast effect #1. All BML structures just
got a whole lot smaller, decreasing the memory footprint of the
running application. How much it's a good question. Here is a
breakdown:

- in mca_bml_base_endpoint_t: 3 *size_t + 1 * uint32_t
- in mca_bml_base_btl_t: 1 * int + 1 * double - 1 * float
                         + 6 * size_t + 9 * (void*)

The decrease in mca_bml_base_endpoint_t is for each peer and the
decrease in mca_bml_base_btl_t is for each BTL for each peer.
So, if we consider the most convenient case where there is only
one network between all peers, this decrease the memory foot print
per peer by
9*size_t + 9*(void*) + 2 * int32_t + 1 * double - 1 * float.
On a 64 bits machine this will be 156 bytes per peer.

Now we access all these fields directly from the underlying BTL
structure, and as this structure is common to multiple BML endpoint,
we are a lot more cache friendly. Even if this do not improve the
latency, it makes the SM performance graph a lot smoother.

This commit was SVN r19659.
2008-09-30 21:02:37 +00:00
Rainer Keller
e84f1f6fdf - Mark the variable bytes_delivered as being unused
(it is just set within MCA_PML_OB1_RECV_REQUEST_UNPACK)
   Iff Coverity's prevent makes usage of __attribute__(unused),
   this should get rid of warning.
   Relates to CID1060

   Would then apply to a many int _rc; definitions, that are
   used in other macros in similar fashion...

This commit was SVN r19179.
2008-08-06 13:46:23 +00:00
George Bosilca
3ba0a8c0c1 In the case where the environment is homogeneous we can ALWAYS create
the receiver convertor when we create the request (as we know all
architectures are identical).

This commit was SVN r18934.
2008-07-17 04:57:55 +00:00
George Bosilca
939fa3001d Small cleanups. Remove some switch cases that cannot be reached. Rename
a struct field.

This commit was SVN r18931.
2008-07-17 04:50:39 +00:00
George Bosilca
319a8b3219 Once matched the proc attached to the request should be the source
of the message and not the first on the list. This fix the ticket
#1386.

This commit was SVN r18929.
2008-07-17 03:04:28 +00:00
George Bosilca
3de0488410 Fix the truncation problem. This close the #211.
This commit was SVN r18850.
2008-07-09 17:38:41 +00:00
Galen Shipman
dbd282fcad doh.. fix GET protocol..
This commit was SVN r18623.
2008-06-09 19:45:44 +00:00
Ralph Castain
9613b3176c Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.

I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.

This commit was SVN r18619.
2008-06-09 14:53:58 +00:00
George Bosilca
e361bcb64c Send optimizations.
1. The send path get shorter. The BTL is allowed to return > 0 to specify that the
   descriptor was pushed to the networks, and that the memory attached to it is 
   available again for the upper layer. The MCA_BTL_DES_SEND_ALWAYS_CALLBACK flag
   can be used by the PML to force the BTL to always trigger the callback.
   Unmodified BTL will continue to work as expected, as they will return OMPI_SUCCESS
   which force the PML to have exactly the same behavior as before. Some BTLs have
   been modified: self, sm, tcp, mx.
2. Add send immediate interface to BTL.
   The idea is to have a mechanism of allowing the BTL to take advantage of
   send optimizations such as the ability to deliver data "inline". Some
   network APIs such as Portals allow data to be sent using a "thin" event
   without packing data into a memory descriptor. This interface change
   allows the BTL to use such capabilities and allows for other optimizations
   in the future. All existing BTLs except for Portals and sm have this interface
   set to NULL.

This commit was SVN r18551.
2008-05-30 03:58:39 +00:00
Galen Shipman
4da4c44210 Receive side changes, basically uses multiple active message callbacks rather
than using a single receive callback followed by a switch on the header.
Also fast pathed the matching for small fragments. 

This commit was SVN r18549.
2008-05-30 01:29:09 +00:00
Jeff Squyres
e7ecd56bd2 This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.

= ORTE Job-Level Output Messages =

Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):

 * orte_output(): (and corresponding friends ORTE_OUTPUT,
   orte_output_verbose, etc.)  This function sends the output directly
   to the HNP for processing as part of a job-specific output
   channel.  It supports all the same outputs as opal_output()
   (syslog, file, stdout, stderr), but for stdout/stderr, the output
   is sent to the HNP for processing and output.  More on this below.
 * orte_show_help(): This function is a drop-in-replacement for
   opal_show_help(), with two differences in functionality:
   1. the rendered text help message output is sent to the HNP for
      display (rather than outputting directly into the process' stderr
      stream)
   1. the HNP detects duplicate help messages and does not display them
      (so that you don't see the same error message N times, once from
      each of your N MPI processes); instead, it counts "new" instances
      of the help message and displays a message every ~5 seconds when
      there are new ones ("I got X new copies of the help message...")

opal_show_help and opal_output still exist, but they only output in
the current process.  The intent for the new orte_* functions is that
they can apply job-level intelligence to the output.  As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.

=== New code ===

For ORTE and OMPI programmers, here's what you need to do differently
in new code:

 * Do not include opal/util/show_help.h or opal/util/output.h.
   Instead, include orte/util/output.h (this one header file has
   declarations for both the orte_output() series of functions and
   orte_show_help()).
 * Effectively s/opal_output/orte_output/gi throughout your code.
   Note that orte_output_open() takes a slightly different argument
   list (as a way to pass data to the filtering stream -- see below),
   so you if explicitly call opal_output_open(), you'll need to
   slightly adapt to the new signature of orte_output_open().
 * Literally s/opal_show_help/orte_show_help/.  The function signature
   is identical.

=== Notes ===

 * orte_output'ing to stream 0 will do similar to what
   opal_output'ing did, so leaving a hard-coded "0" as the first
   argument is safe.
 * For systems that do not use ORTE's RML or the HNP, the effect of
   orte_output_* and orte_show_help will be identical to their opal
   counterparts (the additional information passed to
   orte_output_open() will be lost!).  Indeed, the orte_* functions
   simply become trivial wrappers to their opal_* counterparts.  Note
   that we have not tested this; the code is simple but it is quite
   possible that we mucked something up.

= Filter Framework =

Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr.  The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations.  The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc.  This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).

Filtering is not active by default.  Filter components must be
specifically requested, such as:

{{{
$ mpirun --mca filter xml ...
}}}

There can only be one filter component active.

= New MCA Parameters =

The new functionality described above introduces two new MCA
parameters:

 * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
   help messages will be aggregated, as described above.  If set to 0,
   all help messages will be displayed, even if they are duplicates
   (i.e., the original behavior).
 * '''orte_base_show_output_recursions''': An MCA parameter to help
   debug one of the known issues, described below.  It is likely that
   this MCA parameter will disappear before v1.3 final.

= Known Issues =

 * The XML filter component is not complete.  The current output from
   this component is preliminary and not real XML.  A bit more work
   needs to be done to configure.m4 search for an appropriate XML
   library/link it in/use it at run time.
 * There are possible recursion loops in the orte_output() and
   orte_show_help() functions -- e.g., if RML send calls orte_output()
   or orte_show_help().  We have some ideas how to fix these, but
   figured that it was ok to commit before feature freeze with known
   issues.  The code currently contains sub-optimal workarounds so
   that this will not be a problem, but it would be good to actually
   solve the problem rather than have hackish workarounds before v1.3 final.

This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
Shiqing Fan
8393fb5d47 Use the new memchecker_call function for memory checking of non-blocking communication.
This commit was SVN r18399.
2008-05-07 12:28:51 +00:00
Ralph Castain
fa082cafa9 Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex.
Note: this is an early preliminary step in the movement of portions of the datatype engine to the opal layer.

This commit was SVN r18198.
2008-04-17 20:43:56 +00:00
Shiqing Fan
a1e5df1cc9 Use the new memchecker function call which is based on convertor.
Remove one unnecessary call.

This commit was SVN r18085.
2008-04-07 07:52:04 +00:00
Gleb Natapov
cf40674369 Decide if sends should be throttled at the receiver and pass this to the sender
in an ACK message. The decision can't be done reliably at the sender.

This commit was SVN r17987.
2008-03-27 08:56:43 +00:00
George Bosilca
8943ae0b4e Cleanup plus some typos.
This commit was SVN r17858.
2008-03-18 03:03:33 +00:00
Ralph Castain
d70e2e8c2b Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately.
Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer

This commit was SVN r17632.
2008-02-28 01:57:57 +00:00
George Bosilca
fa31ec81d0 Add the ownership flags to the PML/BTL interface. The layer
owning the descriptor is responsible for releasing it once
the descriptor is not in use anymore.

This commit was SVN r17497.
2008-02-18 17:39:30 +00:00
Gleb Natapov
354c5bc5e1 Don't call progress() from OB1 fragment scheduling functions. They don't serve
any purpose and case recursion calls to progress engine.

This commit was SVN r17478.
2008-02-17 12:42:32 +00:00
Gleb Natapov
0a1fa2cb56 req_match_received is set inside MCA_PML_OB1_RECV_REQUEST_MATCHE().
This commit was SVN r17442.
2008-02-13 08:34:39 +00:00
Shiqing Fan
54c7b71cfd Use the correct way of including memchecker.h, which will work with '--with-devel-headers'.
This commit was SVN r17435.
2008-02-12 18:01:17 +00:00
Shiqing Fan
f5792bbda5 merging the memchecker into trunk.
This commit was SVN r17424.
2008-02-12 08:46:27 +00:00
Gleb Natapov
b37ff74a24 Make function that is used only in one file static. Remove static functions
declaration.

This commit was SVN r17080.
2008-01-09 09:54:35 +00:00
Ethan Mallove
f32dcb1636 The Sun Studio 12 compilers need to have inline specified as
`static` in cases where a function is not part of a separate
compilation unit (such as `append_recv_req_to_queue`).

This commit was SVN r17069.
2008-01-08 18:45:51 +00:00
George Bosilca
b58dae00db Allow PERUSE to compile correctly.
This commit was SVN r17008.
2007-12-21 06:18:19 +00:00
Gleb Natapov
35bf8c7c46 Rewrite OB1 matching logic. Get rid of macros, make the code shorter.
This commit was SVN r16993.
2007-12-19 09:16:20 +00:00
Gleb Natapov
5cd38b8b06 Better encapsulate heterogeneous arch handling in ob1.
This commit was SVN r16970.
2007-12-16 08:45:44 +00:00
Gleb Natapov
e2e211f23b Add flags parameter to btl_alloc() and btl_prepare_src() functions. If BTL
knows at the time of allocation priority of a descriptor it may do some
optimizations.

This commit was SVN r16901.
2007-12-09 14:08:01 +00:00
Gleb Natapov
2d784752dd Remove descriptor caching form BML. With descriptor caching some optimizations
are impossible.

This commit was SVN r16897.
2007-12-09 13:58:17 +00:00
Gleb Natapov
097b17d30e Prevent a receive request from been freed while other thread holds a reference
to it or there is an outstanding completion for the request.

This commit was SVN r16153.
2007-09-18 16:18:47 +00:00
Brian Barrett
59b22533f2 Enable RDMA for heterogeneous situations. Currently done by overloading
the ompi_convertor_need_buffers function to only return 0 if the convertor
is homogeneous (which it never does on the trunk, but does to on v1.2, but
that's a different issue).  Only enable the heterogeneous rdma code for
a btl if it supports it (via a flag), as some btls need some work for this
to work properly.  Currently only TCP and OpenIB extensively tested

This commit was SVN r15990.
2007-08-28 21:23:44 +00:00
Gleb Natapov
fa69c5cc10 If a memory on a sender's size is not registered don't register it on a receive
side too. Otherwise a content of the recvreq->req_rdma array is replaced later
without freeing previous content and refcount on registration in mpool become
wrong.

This commit was SVN r15978.
2007-08-28 07:43:06 +00:00
Gleb Natapov
e1a1d9d90e Receive request converter can be accessed in parallel by a thread that receives
data and a thread that run RDMA schedule function. Protect access to the
converter by a lock.

This commit was SVN r15967.
2007-08-27 11:41:42 +00:00
Gleb Natapov
065d04dfde Do not free recvreq while schedule function is running in another thread.
This commit was SVN r15964.
2007-08-27 11:31:40 +00:00
Rainer Keller
1b5fa48a29 - Add missing PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q when matching
from the posted generic_recv-queue.
 - Move the PERUSE_COMM_MSG_MATCH_POSTED_REQ from
   MCA_PML_OB1_RECV_REQUEST_MATCHED to
   mca_pml_ob1_recv_frag_match() as suggested by Terry Dontje
   Only post, if this is not a probe/iprobe request.
 - Do not post PERUSE_COMM_REQ_MATCH_UNEX for probes / iprobes and
   do in correct order before PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q

This commit was SVN r15947.
2007-08-23 07:09:43 +00:00
Gleb Natapov
afac5eb93f Guard recv request with lock against simultaneous access from different
threads.

This commit was SVN r15681.
2007-07-30 12:50:38 +00:00
George Bosilca
e19777e910 A more consistent version. As we now share the send and receive queue, we
have to construct/destruct only once. Therefore, the construction will
happens before digging for a PML, while the destruction just before
finalizing the component.

Add some OPAL_LIKELY/OPAL_UNLIKELY.

This commit was SVN r15347.
2007-07-10 23:45:23 +00:00
George Bosilca
433f8a7694 This patch bring full support for message queues in Open MPI. Now the send and
receive queues are shared among all PMLs, they are declared in the base PML,
and the selected PML is in charge of initializing and releasing them. 

The CM PML is slightly different compared with OB1 or DR. Internally it use
2 different types of requests: light and heavy. However, now with this patch
both types of requests are stored in the same queue, and cast appropriately
on the allocation macro. This means we might use less memory than we allocate,
but in exchange we got full support for most of the parallel debuggers.

Another thing with this patch, is that now for all PML (CM included) the basic
PML requests start with the same fields, and they are declared in the same order
in the request structure. Moreover, the fields have been moved in such a way
that only one volatile/atomic will exist per line of cache (hopefully).

This commit was SVN r15346.
2007-07-10 22:16:38 +00:00
George Bosilca
11656e20aa Remove few warnings.
This commit was SVN r15250.
2007-07-01 16:16:05 +00:00
Gleb Natapov
77e54ebc7e Schedule RDMA op on the last BTL that got completion.
This commit was SVN r15249.
2007-07-01 11:35:55 +00:00
Gleb Natapov
e74aa6b295 Schedule RDMA traffic between BTLs in accordance with relative bandwidths of
each BTL. Precalculate what part of a message should be send via each BTL in
advance instead of doing it during scheduling.

This commit was SVN r15247.
2007-07-01 11:31:26 +00:00
Gleb Natapov
b88b7dedfe Rename btl_rdma_offset to btl_pipeline_send_length.
This commit was SVN r15153.
2007-06-21 07:12:40 +00:00
Gleb Natapov
10266fb467 Fix deadlock in OB1 protocol by by sending memory by copying if registration
fails.

This commit was SVN r14842.
2007-06-03 08:31:58 +00:00
Gleb Natapov
a25e1e7b15 Implement new function mca_pml_ob1_send_requst_copy_in_out(req, offset, len)
that allows to send any range of a request by send/recv instaed of RDMA
and use it to send data from the end of a request in pipeline protocol. 

This commit was SVN r14841.
2007-06-03 08:30:07 +00:00
Galen Shipman
3401bd2b07 Add optional ordering to the BTL interface.
This is required to tighten up the BTL semantics. Ordering is not guaranteed,
but, if the BTL returns a order tag in a descriptor (other than
MCA_BTL_NO_ORDER) then we may request another descriptor that will obey
ordering w.r.t. to the other descriptor.


This will allow sane behavior for RDMA networks, where local completion of an
RDMA operation on the active side does not imply remote completion on the
passive side. If we send a FIN message after local completion and the FIN is
not ordered w.r.t. the RDMA operation then badness may occur as the passive
side may now try to deregister the memory and the RDMA operation may still be
pending on the passive side. 

Note that this has no impact on networks that don't suffer from this
limitation as the ORDER tag can simply always be specified as
MCA_BTL_NO_ORDER.

This commit was SVN r14768.
2007-05-24 19:51:26 +00:00
Gleb Natapov
3ebaff8dfe Implement new BTL parameters:
We eagerly send data up to btl_*_eager_limit with the match
Upon ACK of the MATCH we start using send/receives of size
btl_*_max_send_size up to the btl_*_rdma_pipeline_offset
After the btl_*_rdma_pipeline_offset we begin using RDMA writes of
size btl_*_rdma_pipeline_frag_size.

Now, on a per message basis we only use the above protocol if the
message is larger than btl_*_min_rdma_pipeline_size

btl_*_eager_limit - > same
btl_*_max_send_size -> same
btl_*_rdma_pipeline_offset -> btl_*_min_rdma_size
btl_*_rdma_pipeline_frag_size -> btl_*_max_rdma_size


btl_*_min_rdma_pipeline_size is new..

This patch also moves all BTL common parameters initialisation into
btl_base_mca.c file.

This commit was SVN r14681.
2007-05-17 07:54:27 +00:00
Gleb Natapov
2562253678 Do more work at RDMA frag preparation time and less work at RDMA frag sending
time.

This commit was SVN r14627.
2007-05-09 12:11:51 +00:00
Galen Shipman
d7e428909e two fixes, one mine, the other gleb's, I'm committing for gleb due to
time difference...  

1) The PML makes an assumption on local/remote completion semantics of the BTL
which Self BTL does not obey, nor should it, so we fix the PML
2) The Get protocol must handle the case when sender and reciever do not agree
on wheter the data is contiguous 

This commit was SVN r14313.
2007-04-11 22:03:06 +00:00
George Bosilca
1cb26e3b9c Finally the convertor export a convenience function to allow a consistent
computation of the current location on the pack/unpack process. This can
be used both for retrieving the pointer to the first byte (in the special
case of the cached RDMA protocol) and for getting the current
position (for the pipelined protocol).

I modified all BTLs, but most of them are still untested.

This commit was SVN r14180.
2007-03-30 22:02:45 +00:00
Galen Shipman
a78672be2b fix mpi_leave_pinned case for arbitrary datatypes
George will be streamlining this with a new convertor function soon... 

This commit was SVN r14174.
2007-03-30 02:06:08 +00:00