1
1

382 Коммитов

Автор SHA1 Сообщение Дата
Gleb Natapov
52c6160252 MCA_PML_BASE_REQUEST_MPI_COMPLETE() macro does nothing except call to
ompi_request_complete(). Remove the macro and call the function directly.

This commit was SVN r16498.
2007-10-18 14:20:24 +00:00
Gleb Natapov
e0a3a7e53e Move duplicated code all over the code to a single function ompi_request_wait_completion().
This commit was SVN r16494.
2007-10-18 12:33:21 +00:00
Gleb Natapov
807f49ed7f If there are more then one BTL present we may divide payload between them in
such a way that converter will not be able to pack some of it. This commit adds
handling of such cases. If converter can't pack any data for a BTL the data is
sent over another BTL that has data to send.

This commit was SVN r16493.
2007-10-18 12:07:37 +00:00
Gleb Natapov
1330974e5e eager_limit is no longer needed in OB1 PML. Remove it.
This commit was SVN r16442.
2007-10-15 09:26:42 +00:00
Ralph Castain
54b2cf747e These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component.

This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:

As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.

In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.

The incoming changes revamp these procedures in three ways:

1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.

The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.

Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.


2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.

The size of this data has been reduced in three ways:

(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.

To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.

(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.

(c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.

While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.


3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.

It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.

Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.


There are a few minor additional changes in the commit that I'll just note in passing:

* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.

* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.

* cleanup of some stale header files

This commit was SVN r16364.
2007-10-05 19:48:23 +00:00
Gleb Natapov
097b17d30e Prevent a receive request from been freed while other thread holds a reference
to it or there is an outstanding completion for the request.

This commit was SVN r16153.
2007-09-18 16:18:47 +00:00
George Bosilca
05ae27c68b Don't segfault if we receive a fragment for a non existing communicator.
Instead, drop it by now.

This commit was SVN r16105.
2007-09-12 17:52:02 +00:00
Shiqing Fan
a0660f4deb - Just some type casts.
This commit was SVN r16100.
2007-09-12 15:29:58 +00:00
Gleb Natapov
07c8fddeef Fix scheduling of pending send request. It should be scheduled req_lock times.
This commit was SVN r16096.
2007-09-12 07:08:38 +00:00
George Bosilca
d8fed2cfa1 Set a default value so that some compilers stop complaining about
uninitialized values.

This commit was SVN r16094.
2007-09-11 18:00:53 +00:00
Gleb Natapov
79011279e5 Remove debug output.
This commit was SVN r16016.
2007-08-30 13:29:41 +00:00
Gleb Natapov
690fb95bda Cleanup send scheduling code.
This commit was SVN r16014.
2007-08-30 12:10:04 +00:00
Gleb Natapov
0b0f9d14aa Mark send request complete on PML level only when absolutely sure there is
no more work associated with this request. No more outstanding completions or
packets and send scheduling isn't running in another thread.

This commit was SVN r16013.
2007-08-30 12:08:33 +00:00
Gleb Natapov
eac2674f66 The inner voice tells me this is a typo.
This commit was SVN r16004.
2007-08-29 13:28:47 +00:00
Brian Barrett
59b22533f2 Enable RDMA for heterogeneous situations. Currently done by overloading
the ompi_convertor_need_buffers function to only return 0 if the convertor
is homogeneous (which it never does on the trunk, but does to on v1.2, but
that's a different issue).  Only enable the heterogeneous rdma code for
a btl if it supports it (via a flag), as some btls need some work for this
to work properly.  Currently only TCP and OpenIB extensively tested

This commit was SVN r15990.
2007-08-28 21:23:44 +00:00
Gleb Natapov
fa69c5cc10 If a memory on a sender's size is not registered don't register it on a receive
side too. Otherwise a content of the recvreq->req_rdma array is replaced later
without freeing previous content and refcount on registration in mpool become
wrong.

This commit was SVN r15978.
2007-08-28 07:43:06 +00:00
Gleb Natapov
e1a1d9d90e Receive request converter can be accessed in parallel by a thread that receives
data and a thread that run RDMA schedule function. Protect access to the
converter by a lock.

This commit was SVN r15967.
2007-08-27 11:41:42 +00:00
Gleb Natapov
065d04dfde Do not free recvreq while schedule function is running in another thread.
This commit was SVN r15964.
2007-08-27 11:31:40 +00:00
Rainer Keller
1b5fa48a29 - Add missing PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q when matching
from the posted generic_recv-queue.
 - Move the PERUSE_COMM_MSG_MATCH_POSTED_REQ from
   MCA_PML_OB1_RECV_REQUEST_MATCHED to
   mca_pml_ob1_recv_frag_match() as suggested by Terry Dontje
   Only post, if this is not a probe/iprobe request.
 - Do not post PERUSE_COMM_REQ_MATCH_UNEX for probes / iprobes and
   do in correct order before PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q

This commit was SVN r15947.
2007-08-23 07:09:43 +00:00
Rainer Keller
c175801f98 - Initialize in the order of mca_pml_ob1_comm_proc_t...
This commit was SVN r15946.
2007-08-23 05:56:22 +00:00
Rainer Keller
b0df55d53b - For MPI_Probe/MPI_Iprobe, we should not have a
PERUSE_COMM_REQ_ACTIVATE event.
   Therefore move the PERUSE_TRACE_COMM_EVENT for this event from
   MCA_PML_BASE_SEND_REQUEST_INIT / MCA_PML_BASE_RECV_REQUEST_INIT
   to the proper places into pml_ob1_isend.c / pml_ob1_irecv.c right
   after the MCA_PML_OB1_SEND_REQUEST_INIT /
   MCA_PML_OB1_RECV_REQUEST_INIT.

This commit was SVN r15945.
2007-08-23 05:52:33 +00:00
Gleb Natapov
5596aa5f53 The sizes of mca_pml_ob1_send_request_t and mca_pml_ob1_recv_request_t depend
on a parameter and are determined in runtime. r15346 removed calculation of
correct sizes for this structures. This patch adds it back and fixes trac:1116, #1114.

This commit was SVN r15932.

The following SVN revision numbers were found above:
  r15346 --> open-mpi/ompi@433f8a7694

The following Trac tickets were found above:
  Ticket 1116 --> https://svn.open-mpi.org/trac/ompi/ticket/1116
2007-08-20 12:06:27 +00:00
Mohamad Chaarawi
59a7bf8a9f Merging in the Sparse Groups..
This commit includes config changes..

This commit was SVN r15764.
2007-08-04 00:41:26 +00:00
Gleb Natapov
627d9bc8ed Delay freeing of a send request if scheduling function is running by other
thread.

This commit was SVN r15722.
2007-08-01 12:19:16 +00:00
Gleb Natapov
afac5eb93f Guard recv request with lock against simultaneous access from different
threads.

This commit was SVN r15681.
2007-07-30 12:50:38 +00:00
Gleb Natapov
21dd061696 Init req_send_range_lock. Found by Terry Dontje.
This commit was SVN r15677.
2007-07-30 08:21:52 +00:00
Brian Barrett
39a6057fc6 A number of improvements / changes to the RML/OOB layers:
* General TCP cleanup for OPAL / ORTE
  * Simplifying the OOB by moving much of the logic into the RML
  * Allowing the OOB RML component to do routing of messages
  * Adding a component framework for handling routing tables
  * Moving the xcast functionality from the OOB base to its own framework

Includes merge from tmp/bwb-oob-rml-merge revisions:

    r15506, r15507, r15508, r15510, r15511, r15512, r15513

This commit was SVN r15528.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r15506
  r15507
  r15508
  r15510
  r15511
  r15512
  r15513
2007-07-20 01:34:02 +00:00
Rich Graham
0991c3d5f5 move buffered send component clean up out of the pml to ompi_mpi_finalize.
This commit was SVN r15463.
2007-07-17 14:50:52 +00:00
Rich Graham
1a4ce2a961 move setting of the component used to managed buffer sends out of the
pmls, and into ompi_mpi_init.  This is the first of several steps to pull
buffered send management out of the pmls.

This commit was SVN r15451.
2007-07-16 21:52:25 +00:00
George Bosilca
8643f38adf Don't allow the BTL to be closed before the end of the process. Count the
number of times the BTLs are opened, and then don't remove them until
close was called the same number of times.

This commit was SVN r15376.
2007-07-11 22:21:04 +00:00
George Bosilca
e19777e910 A more consistent version. As we now share the send and receive queue, we
have to construct/destruct only once. Therefore, the construction will
happens before digging for a PML, while the destruction just before
finalizing the component.

Add some OPAL_LIKELY/OPAL_UNLIKELY.

This commit was SVN r15347.
2007-07-10 23:45:23 +00:00
George Bosilca
433f8a7694 This patch bring full support for message queues in Open MPI. Now the send and
receive queues are shared among all PMLs, they are declared in the base PML,
and the selected PML is in charge of initializing and releasing them. 

The CM PML is slightly different compared with OB1 or DR. Internally it use
2 different types of requests: light and heavy. However, now with this patch
both types of requests are stored in the same queue, and cast appropriately
on the allocation macro. This means we might use less memory than we allocate,
but in exchange we got full support for most of the parallel debuggers.

Another thing with this patch, is that now for all PML (CM included) the basic
PML requests start with the same fields, and they are declared in the same order
in the request structure. Moreover, the fields have been moved in such a way
that only one volatile/atomic will exist per line of cache (hopefully).

This commit was SVN r15346.
2007-07-10 22:16:38 +00:00
Brian Barrett
8b9e8054fd Move modex from pml base to general ompi runtime, sicne it's used by more
than just the PML/BTLs these days.  Also clean up the code so that it
handles the situation where not all nodes register information for a given
node (rather than just spinning until that node sends information, like
we do today).

Includes r15234 and r15265 from the /tmp/bwb-modex branch.

This commit was SVN r15310.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r15234
  r15265
2007-07-09 17:16:34 +00:00
Tim Prins
f3ac4ac20e Fix order of function arguments
This commit was SVN r15304.
2007-07-08 16:37:51 +00:00
Rainer Keller
cff1b6a71b - PERUSE_COMM_REQ_XFER_BEGIN should be emited for first fragment
of larger message as well.

This commit was SVN r15299.
2007-07-06 15:02:36 +00:00
George Bosilca
951e4929b9 Usually it's unlikely to have additional fragments.
This commit was SVN r15253.
2007-07-01 16:19:53 +00:00
George Bosilca
c435094639 Only trigger the PERUSE_COMM_REQ_XFER_BEGIN event on the initial fragment.
This commit was SVN r15252.
2007-07-01 16:19:13 +00:00
George Bosilca
60319f99ac Make sure in case of error what we return is clean (set to NULL).
This commit was SVN r15251.
2007-07-01 16:17:43 +00:00
George Bosilca
11656e20aa Remove few warnings.
This commit was SVN r15250.
2007-07-01 16:16:05 +00:00
Gleb Natapov
77e54ebc7e Schedule RDMA op on the last BTL that got completion.
This commit was SVN r15249.
2007-07-01 11:35:55 +00:00
Gleb Natapov
54b40aef91 Schedule SEND traffic of pipeline protocol between BTLs in accordance with
relative bandwidths of each BTL. Precalculate what part of a message should
be send via each BTL in advance instead of doing it during scheduling.

This commit was SVN r15248.
2007-07-01 11:34:23 +00:00
Gleb Natapov
e74aa6b295 Schedule RDMA traffic between BTLs in accordance with relative bandwidths of
each BTL. Precalculate what part of a message should be send via each BTL in
advance instead of doing it during scheduling.

This commit was SVN r15247.
2007-07-01 11:31:26 +00:00
Gleb Natapov
1c7141df4d Remove unused struct.
This commit was SVN r15228.
2007-06-28 11:58:16 +00:00
Gleb Natapov
b88b7dedfe Rename btl_rdma_offset to btl_pipeline_send_length.
This commit was SVN r15153.
2007-06-21 07:12:40 +00:00
Josh Hursey
7fd1805e97 Fix a couple of compile warnings that Tim P brought to by attention.
This commit was SVN r15132.
2007-06-19 00:46:16 +00:00
Rainer Keller
ca09aae2cc - Get PERUSE compile again with latest RDMA changes in r14768/r14842.
This commit was SVN r15042.

The following SVN revision numbers were found above:
  r14768 --> open-mpi/ompi@3401bd2b07
  r14842 --> open-mpi/ompi@10266fb467
2007-06-13 12:47:47 +00:00
Brian Barrett
84d1512fba Add the potential for doing some basic error checking on mutexes during
single threaded builds.  In its default configuration, all this does
is ensure that there's at least a good chance of threads building
based on non-threaded development (since the variable names will be
checked).  There is also code to make sure that a "mutex" is never
"double locked" when using the conditional macro mutex operations.
This is off by default because there are a number of places in both
ORTE and OMPI where this alarm spews mega bytes of errors on a
simple test.  So we have some work to do on our path towards
thread support.

Also removed the macro versions of the non-conditional thread locks,
as the only places they were used, the author of the code intended
to use the conditional thread locks.  So now you have upper-case
macros for conditional thread locks and lowercase functions for
non-conditional locks.  Simple, right? :).

This commit was SVN r15011.
2007-06-12 16:25:26 +00:00
Gleb Natapov
423f404c34 Shut up compiler warning. Ugly, but I can see better way except changing
converter to use uint64_t(ssize_t?) for offset.

This commit was SVN r14950.
2007-06-07 11:33:28 +00:00
Gleb Natapov
9f9b64db4e Revert r14947 as this doesn't solve the problem.
This commit was SVN r14949.

The following SVN revision numbers were found above:
  r14947 --> open-mpi/ompi@5b9fe28e3f
2007-06-07 11:24:24 +00:00
Gleb Natapov
5b9fe28e3f Fix warning on 32bit systems.
This commit was SVN r14947.
2007-06-07 08:57:34 +00:00