1
1

58 Коммитов

Автор SHA1 Сообщение Дата
Jeff Squyres
671f0c379d Remove a whole pile of orte/util/show_help.h's that I missed. :-(
This commit was SVN r18437.
2008-05-14 11:32:33 +00:00
Jeff Squyres
e7ecd56bd2 This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.

= ORTE Job-Level Output Messages =

Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):

 * orte_output(): (and corresponding friends ORTE_OUTPUT,
   orte_output_verbose, etc.)  This function sends the output directly
   to the HNP for processing as part of a job-specific output
   channel.  It supports all the same outputs as opal_output()
   (syslog, file, stdout, stderr), but for stdout/stderr, the output
   is sent to the HNP for processing and output.  More on this below.
 * orte_show_help(): This function is a drop-in-replacement for
   opal_show_help(), with two differences in functionality:
   1. the rendered text help message output is sent to the HNP for
      display (rather than outputting directly into the process' stderr
      stream)
   1. the HNP detects duplicate help messages and does not display them
      (so that you don't see the same error message N times, once from
      each of your N MPI processes); instead, it counts "new" instances
      of the help message and displays a message every ~5 seconds when
      there are new ones ("I got X new copies of the help message...")

opal_show_help and opal_output still exist, but they only output in
the current process.  The intent for the new orte_* functions is that
they can apply job-level intelligence to the output.  As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.

=== New code ===

For ORTE and OMPI programmers, here's what you need to do differently
in new code:

 * Do not include opal/util/show_help.h or opal/util/output.h.
   Instead, include orte/util/output.h (this one header file has
   declarations for both the orte_output() series of functions and
   orte_show_help()).
 * Effectively s/opal_output/orte_output/gi throughout your code.
   Note that orte_output_open() takes a slightly different argument
   list (as a way to pass data to the filtering stream -- see below),
   so you if explicitly call opal_output_open(), you'll need to
   slightly adapt to the new signature of orte_output_open().
 * Literally s/opal_show_help/orte_show_help/.  The function signature
   is identical.

=== Notes ===

 * orte_output'ing to stream 0 will do similar to what
   opal_output'ing did, so leaving a hard-coded "0" as the first
   argument is safe.
 * For systems that do not use ORTE's RML or the HNP, the effect of
   orte_output_* and orte_show_help will be identical to their opal
   counterparts (the additional information passed to
   orte_output_open() will be lost!).  Indeed, the orte_* functions
   simply become trivial wrappers to their opal_* counterparts.  Note
   that we have not tested this; the code is simple but it is quite
   possible that we mucked something up.

= Filter Framework =

Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr.  The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations.  The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc.  This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).

Filtering is not active by default.  Filter components must be
specifically requested, such as:

{{{
$ mpirun --mca filter xml ...
}}}

There can only be one filter component active.

= New MCA Parameters =

The new functionality described above introduces two new MCA
parameters:

 * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
   help messages will be aggregated, as described above.  If set to 0,
   all help messages will be displayed, even if they are duplicates
   (i.e., the original behavior).
 * '''orte_base_show_output_recursions''': An MCA parameter to help
   debug one of the known issues, described below.  It is likely that
   this MCA parameter will disappear before v1.3 final.

= Known Issues =

 * The XML filter component is not complete.  The current output from
   this component is preliminary and not real XML.  A bit more work
   needs to be done to configure.m4 search for an appropriate XML
   library/link it in/use it at run time.
 * There are possible recursion loops in the orte_output() and
   orte_show_help() functions -- e.g., if RML send calls orte_output()
   or orte_show_help().  We have some ideas how to fix these, but
   figured that it was ok to commit before feature freeze with known
   issues.  The code currently contains sub-optimal workarounds so
   that this will not be a problem, but it would be good to actually
   solve the problem rather than have hackish workarounds before v1.3 final.

This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
Donald Kerr
843a35094f adding local work queue accounting
This commit was SVN r18352.
2008-05-01 21:01:51 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
George Bosilca
fa31ec81d0 Add the ownership flags to the PML/BTL interface. The layer
owning the descriptor is responsible for releasing it once
the descriptor is not in use anymore.

This commit was SVN r17497.
2008-02-18 17:39:30 +00:00
Donald Kerr
58bf7f5a1d add uintptr_t to prevent the possibility of a signed extension occuring
This commit was SVN r17456.
2008-02-14 19:16:34 +00:00
Donald Kerr
5f884b1ca4 fix for #1130 - adds support for multi-rail configurations
This commit was SVN r17152.
2008-01-17 17:30:50 +00:00
George Bosilca
6310ce955c The first patch related to the Active Message stuff. So far, here is what we have:
- the registration array is now global instead of one by BTL.
- each framework have to declare the entries in the registration array reserved. Then
  it have to define the internal way of sharing (or not) these entries between all
  components. As an example, the PML will not share as there is only one active PML
  at any moment, while the BTLs will have to. The tag is 8 bits long, the first 3
  are reserved for the framework while the remaining 5 are use internally by each
  framework.
- The registration function is optional. If a BTL do not provide such function,
  nothing happens. However, in the case where such function is provided in the BTL
  structure, it will be called by the BML, when a tag is registered.

Now, it's time for the second step... Converting OB1 from a switch based PML to an
active message one.

This commit was SVN r17140.
2008-01-15 05:32:53 +00:00
George Bosilca
906e8bf1d1 Replace the ompi_pointer_array with opal_pointer_array. The next step
(sometimes after the merge with the ORTE branch), the opal_pointer_array
will became the only pointer_array implementation (the orte_pointer_array
will be removed).

This commit was SVN r17007.
2007-12-21 06:02:00 +00:00
Donald Kerr
d05d3afaed clean up and make consistent the reporting out from the udapl btl; report out readeable event string instead of just a number
This commit was SVN r16954.
2007-12-13 15:32:26 +00:00
Gleb Natapov
e2e211f23b Add flags parameter to btl_alloc() and btl_prepare_src() functions. If BTL
knows at the time of allocation priority of a descriptor it may do some
optimizations.

This commit was SVN r16901.
2007-12-09 14:08:01 +00:00
Gleb Natapov
7364b7cf47 Add endpoint parameter to btl_alloc() function. Enables various optimizations
inside BTL.

This commit was SVN r16898.
2007-12-09 14:00:42 +00:00
Rich Graham
6e77414a68 changes to the ompi_free_list_ex - called ompi_free_list_ex_new, for now.
This commit was SVN r16803.
2007-11-29 21:18:37 +00:00
Jeff Squyres
8ace07efed This commit brings in two major things:
1. Galen's fine-grain control of queue pair resources in the openib
   BTL.
1. Pasha's new implementation of asychronous HCA event handling.

Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.  

Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artifical changes).  :-(

== Fine-grain control of queue pair resources ==

Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).

Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments.  When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers.  One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.

The new design allows multiple QPs to be specified at runtime.  Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified.  The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:

{{{
-mca btl_openib_receive_queues \
     "P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}

Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma).  The above
example therefore describes 4 QPs.

The first QP is:

    P,128,16,4

Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes).  The third field indicates the number of receive buffers
to allocate to the QP (16).  The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).

The second QP is:

    S,1024,256,128,32

Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP.  The second, third and fourth fields are the same as in the
per-peer based QP.  The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32).  This provides a
"good enough" mechanism of flow control for some regular communication
patterns.

QPs MUST be specified in ascending receive buffer size order.  This
requirement may be removed prior to 1.3 release.

This commit was SVN r15474.
2007-07-18 01:15:59 +00:00
Donald Kerr
88c9dfdf9f improve message to user when dat_ia_open fails
This commit was SVN r15362.
2007-07-11 15:20:35 +00:00
Gleb Natapov
b88b7dedfe Rename btl_rdma_offset to btl_pipeline_send_length.
This commit was SVN r15153.
2007-06-21 07:12:40 +00:00
Donald Kerr
91c9b7b6f9 don't call dat_evd_resize if new value is less than or equal to current because ofed stack does not return DAT_INVALID_STATE
This commit was SVN r14792.
2007-05-29 20:08:16 +00:00
Galen Shipman
3401bd2b07 Add optional ordering to the BTL interface.
This is required to tighten up the BTL semantics. Ordering is not guaranteed,
but, if the BTL returns a order tag in a descriptor (other than
MCA_BTL_NO_ORDER) then we may request another descriptor that will obey
ordering w.r.t. to the other descriptor.


This will allow sane behavior for RDMA networks, where local completion of an
RDMA operation on the active side does not imply remote completion on the
passive side. If we send a FIN message after local completion and the FIN is
not ordered w.r.t. the RDMA operation then badness may occur as the passive
side may now try to deregister the memory and the RDMA operation may still be
pending on the passive side. 

Note that this has no impact on networks that don't suffer from this
limitation as the ORDER tag can simply always be specified as
MCA_BTL_NO_ORDER.

This commit was SVN r14768.
2007-05-24 19:51:26 +00:00
Donald Kerr
23280bd7da remove an assignment which is not required
This commit was SVN r14692.
2007-05-18 01:33:02 +00:00
Donald Kerr
588d5bd6a9 clean up compile warnings
This commit was SVN r14691.
2007-05-17 23:37:47 +00:00
Gleb Natapov
3ebaff8dfe Implement new BTL parameters:
We eagerly send data up to btl_*_eager_limit with the match
Upon ACK of the MATCH we start using send/receives of size
btl_*_max_send_size up to the btl_*_rdma_pipeline_offset
After the btl_*_rdma_pipeline_offset we begin using RDMA writes of
size btl_*_rdma_pipeline_frag_size.

Now, on a per message basis we only use the above protocol if the
message is larger than btl_*_min_rdma_pipeline_size

btl_*_eager_limit - > same
btl_*_max_send_size -> same
btl_*_rdma_pipeline_offset -> btl_*_min_rdma_size
btl_*_rdma_pipeline_frag_size -> btl_*_max_rdma_size


btl_*_min_rdma_pipeline_size is new..

This patch also moves all BTL common parameters initialisation into
btl_base_mca.c file.

This commit was SVN r14681.
2007-05-17 07:54:27 +00:00
Donald Kerr
c40307fd27 add user warning message to inform when udapl btl is no longer able to register memory
This commit was SVN r14678.
2007-05-16 21:04:50 +00:00
Donald Kerr
2ed72bf2e2 break evd_qlen into individual qlens (async,dto,conn); add checks based on udapl limits and number of peers
This commit was SVN r14659.
2007-05-15 17:47:00 +00:00
Donald Kerr
436d370d51 latency improvements: use ompi_free_list_init_ex, create optimal alignment parameter, remove rdma guarantee path, replace dat_lmt_sync_rdma with use of volatile
This commit was SVN r14634.
2007-05-09 19:41:25 +00:00
Donald Kerr
80d984441f change so that we only check connection queue when expecting a connection; create a mca parameter that controls frequency at which the async queue is checked
This commit was SVN r14511.
2007-04-25 17:46:25 +00:00
Donald Kerr
cae24fcde1 move mca parameter registration into own .c and .h files
This commit was SVN r14493.
2007-04-24 18:34:16 +00:00
Donald Kerr
3f428af7b8 couple of minor changes to fix #973 and seperated eager rdma fragments into structure only and data only area
This commit was SVN r14470.
2007-04-23 17:41:34 +00:00
George Bosilca
1cb26e3b9c Finally the convertor export a convenience function to allow a consistent
computation of the current location on the pack/unpack process. This can
be used both for retrieving the pointer to the first byte (in the special
case of the cached RDMA protocol) and for getting the current
position (for the pipelined protocol).

I modified all BTLs, but most of them are still untested.

This commit was SVN r14180.
2007-03-30 22:02:45 +00:00
Josh Hursey
dadca7da88 Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD).
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.

This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.

This commit closes trac:158

More details to follow.

This commit was SVN r14051.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r13912

The following Trac tickets were found above:
  Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
2007-03-16 23:11:45 +00:00
Donald Kerr
ed097d17c1 fix for bug #749, though I can not confirm without a linux compiler
This commit was SVN r13090.
2007-01-11 22:25:13 +00:00
Donald Kerr
80f2cbb498 add udapl rdma capabilities into the udapl btl
This commit was SVN r13082.
2007-01-11 15:22:08 +00:00
Brian Barrett
48ec0b2071 Revert out r12974, 12976, and 12991 as George has provided a less intrusive fix
for now...

This commit was SVN r12997.

The following SVN revision numbers were found above:
  r12974 --> open-mpi/ompi@27cea44a9c
2007-01-04 22:07:37 +00:00
Brian Barrett
27cea44a9c Fix a number of issues with the ompi_ptr_t:
* Make sure that the pval always writes to the correct portion of the
    lval.  This only matters on 32 bit big endian machines.
  * On 32 bit machines when assigning to pval, the other 4 bytes of lval
    weren't being written, which could lead to bogus data

We use macros so that there aren't casts all over the code and the pval
assignment can occur to the correct 4 bytes.  Refs trac:587

This commit was SVN r12974.

The following Trac tickets were found above:
  Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
2007-01-03 19:47:48 +00:00
Donald Kerr
899297c8f4 udapl btl was not compiling after r12878 on 12/17/2006, some minor changes to allow btl to compile
This commit was SVN r12963.

The following SVN revision numbers were found above:
  r12878 --> open-mpi/ompi@190e7a27cd
2007-01-02 21:44:12 +00:00
Gleb Natapov
190e7a27cd Merge with gleb-mpool branch. All RDMA components use same mpool now (rdma).
udapl/openib/vapi/gm mpools a deprecated. rdma mpool has parameter that allows
to limit its size mpool_rdma_rcache_size_limit (default is 0 - unlimited).

This commit was SVN r12878.
2006-12-17 12:26:41 +00:00
George Bosilca
126a68dc9a Big datatype commit. Remove all unused features of the datatype engine. As the memory
allocation logic is completely done outside the data-type engine (in the PML) there is
no need for any special case inside the data-type engine. There is less arguments for
the ompi_convertor_pack and ompi_convertor_unpack as well (the last field free_after is
not required anymore as there is no memory allocated in the engine itself). This change
affect all components using datatypes. I test most of them, but it might happens that I
miss some ... If it's the case please let me know (don't shoot the pianist!!).

This commit was SVN r12331.
2006-10-26 23:11:26 +00:00
George Bosilca
640178c4b3 Grepping through the source files I found these calls to the data-type engine
with the wrong type of arguments.

This commit was SVN r12148.
2006-10-17 21:05:04 +00:00
Terry Dontje
d636db5832 Fixed bug trac #213 by moving the udapl btl header to being a footer.
Also fixed bug trac #346.

This commit was SVN r11760.
2006-09-22 19:28:09 +00:00
Dan Lacher
f2526d60ed Minor fix for a dropped comma.
This commit was SVN r11259.
2006-08-18 17:55:57 +00:00
Galen Shipman
e5c594c211 More updates for the async error handler for btl's
In order to provide backwards compatability the framework versions are bumped
and the handler registeration function is at the end of the btl struct.
Testing done on sm, openib, and gm.. 

This commit was SVN r11256.
2006-08-17 22:02:01 +00:00
Galen Shipman
3b49953ce2 Add error callback to the btl interface, this allows error to be delivered to
the upperlayer assynchronously although there are some issues with this.. such
as there are multiple consumers of the btl's.. who get's the

This commit was SVN r11232.
2006-08-16 20:21:38 +00:00
Donald Kerr
2e5e01a8df Remove dependency on known port range and allow udapl to provide the port number.
This commit was SVN r11040.
2006-07-28 13:58:21 +00:00
Andrew Friedley
c68c6ac122 A number of fixes and the usual cleanup..
- Added some basic flow control to limit number of posted sends.
- Merged endpoint send/recv lock into single endpoint lock.
- Set the LMR triplet length in the send path, not at allocation time.
  This has to be done because upper layers might send less than the
  amount allocated.
- Alter the tie-breaker if statement protecting the second call
  to dat_ep_connect().  The logic was reversed compared to the tie-
  breaker for the first dat_ep_connect(), making it possible for
  3 or more processes to form a deadlock loop.
- Some asserts were added for debugging purposes.. leaving them
  in place for now.

This commit was SVN r10317.
2006-06-12 22:42:01 +00:00
Andrew Friedley
8a3d0862ca I can commit! *happy dance*
Trying to remember what I did here.. eager/max messages should work now, no RDMA yet.  A number of other fixes and cleanups.

I do know of two problems:
 Bad stuff happens when flooded with send frags too quickly - the BTL doesn't handle flow control.
 Certain IBM tests turn up a length assertion in the datatype engine - needs more investigation.

This commit was SVN r10070.
2006-05-25 15:47:59 +00:00
Andrew Friedley
345551cb36 Checkpoint before starting work on max-sized frags (maybe user too?).
- Some initial work on prepare_src
- Move some fragment initialization around
- Fix a union casting issue on picky compilers, identified by Don Kerr
- Other small cleanups/bugfixes

This commit was SVN r9662.
2006-04-19 22:20:22 +00:00
Andrew Friedley
d461b55696 - Implement OOB connection handshaking via the ORTE RML. To start a connect,
we send our local addr_t OOB.  Remote side then matches endpoints and calls
  dat_ep_connect().  Everything should be the same as before from here, except
  that client/server roles are reversed.
- Properly set our buffer size when posting receives.  When the frag used to
  transfer address information is recycled by the free list, the wrong buffer
  size was being used, which caused buffer overflow errors.
- Finally put the uDAPL error handling stuff in the mpool component.
- Remove a few more OPAL_OUTPUTs.

This commit was SVN r9569.
2006-04-07 15:26:05 +00:00
Andrew Friedley
74b2f77a4c The expected cleanup/refactoring commit..
Not much got tested that wasn't already - I've uncovered a connection
establishment deadlock and wanted to get these changes committed before I
attack it.

The big changes:
 - Moved much of the connection code from btl_udapl_component.c to
   btl_udapl_endpoint.c.
 - Cleaned up initialization of various fragment members.
 - MCA_BTL_UDAPL_ERROR macro, which is compiled in/out appropriately.

This commit was SVN r9496.
2006-03-31 16:25:19 +00:00
Andrew Friedley
0eba366b07 Various pieces all over to make basic small message send/recv work. Next step
is clean up the code.. it is in need of refactoring and testing.

Thanks to Brian for help in troubleshooting!

This commit was SVN r9466.
2006-03-29 21:55:41 +00:00
Andrew Friedley
48d61cd99a Mostly fragment/LMR handling fixes:
- Grab the mpool_registration in _frag_common_constructor()
 - Save the LMR context in the segment key
 - No need for cookie variables - can just cast the frag
 - No need to memcpy() data when recv'ing
 - Add an LMR triplet to the fragment structure and initialize it
   in btl_udapl_alloc().
 - Whitespace/typo fixes, remove some opal_output() calls

Looks like I can use triplets describing sub-regions of registered LMR's.  So I
do this - prior to this patch I was sending the entire free list memory over,
which isn't correct :)

Back to an earlier problem - when sending address information right after
connection establishment, the receiving end receives a DTO completion event and
appears to have good data.  But the sending end never receives a DTO completion
event indicating the send completed, and never completes the client side of the
connection.

This commit was SVN r9386.
2006-03-23 16:21:08 +00:00
Andrew Friedley
cf9246f7b9 Long overdue commit.. many changes.
In short, I'm very close to having connection establishment and eager send/recv working.

Part of the connection process involves sending address information from the
client to server.  For some reason, I am never receiving an event indicating
completetion of the send on the client side.  Otherwise, connection
establishment is working and eager send/recv should be trivial from here.


Some more detailed changes:
 - Send partially implemented, just handles starting up new connections.
 - Several support functions implemented for establishing connection.  Client
   side code went in btl_udapl_endpoint.c, server side in btl_udapl_component.c
 - Frags list and send/recv locks added to the endpoint structure.
 - BTL sets up a public service point, which listens for new connections.
   Steps over ports that are already bound, iterating through a range of ports.
 - Remove any traces of recv frags, don't think I need them after all.
 - Pieces of component_progress() implemented for connection establishment.
 - Frags have two new types for connection establishment - CONN_SEND and
   CONN_RECV.
 - Many other minor cleanups not affecting functionality

This commit was SVN r9345.
2006-03-21 00:12:55 +00:00