parameters don't make any sense. Credits are never piggybacked. Also make
default queue sizes to be calculated from eager_limit and max_send_size values.
This commit was SVN r16816.
ompi_mtl_portals_finalize(). Previous code was cleaning up Portals resources
that hadn't been allocated, which caused valid handles used elsewhere to be
freed, which broke cnos_barrier() for the Portals btl.
This commit was SVN r16801.
needed instead of creating it and then canceling if it is not needed. Change
error handling during finalize so that it will not skip async thread
destruction. Otherwise async thread may segfault during openib module unloading.
This commit was SVN r16782.
to a pending queue of eager rdma QP instead of correct pending list. This patch
fixes this by getting reed of "eager rdma qp" notion. Packet is always send
over its order QP. The patch also adds two pending queues for high and low prio
packets. Only high prio packets are sent over eager RDMA channel.
This commit was SVN r16780.
main idea (except of cleanup) is to save on initialisation of unneeded fields
and to use C type checking system to catch obvious errors.
This commit was SVN r16779.
has his own range which is defined by a min value and a range. By default
there is no limitation on the port range, which is exactly the same
behavior as before.
This commit was SVN r16584.
such a way that converter will not be able to pack some of it. This commit adds
handling of such cases. If converter can't pack any data for a BTL the data is
sent over another BTL that has data to send.
This commit was SVN r16493.
the use of the --mca btl_base_verbose flag. The
btl framework now matches all the other frameworks.
Slightly modify error messages for clarity.
This commit was SVN r16443.
PML base will take care of the registration with the event library.
Otherwise, (and this apply for the CM case) the MTL are in charge of
registering their own progress function.
This commit was SVN r16415.
* Fix some missing includes in a few places.
* Add the cr_request() functionality to the BLCR CRS component.
We are now dependent upon the 0.6.* series of BLCR.
* Made the CR notification mechanism a registered function.
This way we can have an OPAL-only version and it can be replaced at
runtime with the ORTE version.
* Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only
CR functionality when the user wants it. Default: Disabled.
* Fix the placement of a checkpoint request check in MPI_Init
* Pull the OPAL notification mechanism into the SnapC framework.
* We no longer fork/exec the 'opal-checkpoint' command for local
checkpointing, the Local coordinator in the orted does this directly.
* The Local and Application coordinator talk together bypassing the OPAL
notifiation mechanism.
* Optimized the Local <-> App Coordinator communication.
* Improved the structure used to track vpid_snapshots in the local coord.
* Fix a race condition in which an application under heavy communication load
may produce an inconsistent global checkpoint.
This commit was SVN r16389.
yesterday. This actually exposed a very, very long-standing bug where
part of the coll base was incorrectly checking the coll API version
against the MCA API version. When coll went to v1.1 (yesterday) and
was no longer the same as the MCA v1.0, the test started failing.
This commit fixes to check for v1.1 everywhere in the coll base, and
to ensure to check coll framework/API version numbers against coll
framework/API version numbers (vs. against the MCA API version
number).
This commit was SVN r16373.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
Check if an exclusion string (i.e. '-mca btl ^sm) was provided; if so OFUD just disables itself.
This commit was SVN r16307.
The following Trac tickets were found above:
Ticket 1154 --> https://svn.open-mpi.org/trac/ompi/ticket/1154
Each one of them has a field to store QP type, but this is redundant.
Store qp type only in one structure (the component one).
This commit was SVN r16272.
meaning "infinite") is no longer larger than the minimum required
size. So put in an appropriate test to ensure that "infinite" was not
requested.
This commit was SVN r16142.
numbers for tuning.
Switch the bookmark_recv to be non-blocking. If this is blocking then for
process counts >= 32 slight process delays were causing cascading performance
delays in the protocol. This lead to checkpoints either taking about 3 sec or
45 sec (or more) for 64 procs due to the cascading delays. With the nonblocking
receive version this is no longer the case we get the speedup we expect for this
part of the protocol.
More tuning to come.
This commit was SVN r16137.
you'll get a helpful error message and the openib BTL will deactivate
itself.
This commit was SVN r16133.
The following Trac tickets were found above:
Ticket 1133 --> https://svn.open-mpi.org/trac/ompi/ticket/1133
NOTE: This build system does not work with the current autogen.sh. Modified one is under heavy testing to make sure it does not have side effects
This commit was SVN r16110.
Basically revert this part of r16015.
This commit was SVN r16029.
The following SVN revision numbers were found above:
r16015 --> open-mpi/ompi@435e7d80e9
no more work associated with this request. No more outstanding completions or
packets and send scheduling isn't running in another thread.
This commit was SVN r16013.
the ompi_convertor_need_buffers function to only return 0 if the convertor
is homogeneous (which it never does on the trunk, but does to on v1.2, but
that's a different issue). Only enable the heterogeneous rdma code for
a btl if it supports it (via a flag), as some btls need some work for this
to work properly. Currently only TCP and OpenIB extensively tested
This commit was SVN r15990.
side too. Otherwise a content of the recvreq->req_rdma array is replaced later
without freeing previous content and refcount on registration in mpool become
wrong.
This commit was SVN r15978.
to always first check for a NULL frag pointer before trying to send the
fragment. This avoids an issue in multi-threaded execution in which
multiple threads working on the same endpoint can result in a thread
finding itself here with nothing to send.
This commit was SVN r15963.
from the posted generic_recv-queue.
- Move the PERUSE_COMM_MSG_MATCH_POSTED_REQ from
MCA_PML_OB1_RECV_REQUEST_MATCHED to
mca_pml_ob1_recv_frag_match() as suggested by Terry Dontje
Only post, if this is not a probe/iprobe request.
- Do not post PERUSE_COMM_REQ_MATCH_UNEX for probes / iprobes and
do in correct order before PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q
This commit was SVN r15947.
PERUSE_COMM_REQ_ACTIVATE event.
Therefore move the PERUSE_TRACE_COMM_EVENT for this event from
MCA_PML_BASE_SEND_REQUEST_INIT / MCA_PML_BASE_RECV_REQUEST_INIT
to the proper places into pml_ob1_isend.c / pml_ob1_irecv.c right
after the MCA_PML_OB1_SEND_REQUEST_INIT /
MCA_PML_OB1_RECV_REQUEST_INIT.
This commit was SVN r15945.
one HCA. Multiple ports, LMC, multiple BTLs per one LID. Having only one CQ for
all of them substantially reduce polling time.
This commit was SVN r15933.
on a parameter and are determined in runtime. r15346 removed calculation of
correct sizes for this structures. This patch adds it back and fixes trac:1116, #1114.
This commit was SVN r15932.
The following SVN revision numbers were found above:
r15346 --> open-mpi/ompi@433f8a7694
The following Trac tickets were found above:
Ticket 1116 --> https://svn.open-mpi.org/trac/ompi/ticket/1116
used at nce (up to one unique collective module per collective function).
Matches r15795:15921 of the tmp/bwb-coll-select branch
This commit was SVN r15924.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15795
r15921
A subset of this patch needs to be applied to v1.2
Refs trac:928
This commit was SVN r15918.
The following Trac tickets were found above:
Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928
only static libraries. Previously, we were linking the libraries into
directly into the common, btl, and mtl code. This seemed to work fine
for me on my Opteron Fedora box, but caused Lisa some issues (PtlNIInit
would succeed, but the network handle would fail when used with
PtlEQAlloc).
Instead, link the portals libraries directly into libmpi and not at
all into the common, btl, or mtl components. THen use some linker
tricks to force the linker to bring in the public interface for the
reference implementation (which thankfully is pretty small).
This commit was SVN r15902.
Fixed a condition test while checking that all segments are empty.
Without this fix, a NULL segment pointer could make it past the
test, resulting in a SegV when dereferenced.
This commit was SVN r15891.
The following Trac tickets were found above:
Ticket 1134 --> https://svn.open-mpi.org/trac/ompi/ticket/1134
* Code cleanup and rationalization
* Fixed: mca_pml_base_send/recv_request are now allocated before recreation by the PML-V
* Fixed: pointer arithmetic bug in sender based that crashed
* Changed: directory structure. This is one step forward using autogen.sh to build static-components.h (it needs to have the directory structure of a mca framework for this).
This commit was SVN r15878.
semicolons but the new specitifcation string used colons. The text
parser now looks for colons.
* Changed all opal_output() error messages to
much-more-helpful/descriptive opal_show_help() messages.
* A few minor style/indenting fixes
This commit was SVN r15850.
The following SVN revision numbers were found above:
r15848 --> open-mpi/ompi@dd30597f39
/tmp/jms-modular-wireup branch):
* This commit moves all the openib BTL connection code out of
btl_openib_endpoint.c and into a connect "pseudo-component" area,
meaning that different schemes for doing OFA connection schemes can
be chosen via function pointer (i.e., MCA parameter) at run-time.
* The connect/connect.h file includes comments describing the
specific interface for the connect pseudo-component.
* Two pseudo-components are in this commit (more can certainly be
added).
* oob: use the same old oob/rml scheme for creating OFA connections
that we've had forever; this now just puts the logic into this
self-contained pseudo-component.
* rdma_cm: a currently-empty set of functions (that currently
return NOT_IMPLEMENTED) that will someday use the RDMA connection
manager to make OFA connections.
This commit was SVN r15786.
the Sandia reference implementation of Portals, and doesn't have the cnos
functions. This file should never be compiled (and wasn't being compiled)
on the Cray machines, so doesn't need to be updated to support CNL.
This commit was SVN r15778.
The following SVN revision numbers were found above:
r15756 --> open-mpi/ompi@755658694e
Application Level Placement Scheduler (ALPS).
This commit was tested under two Cray machines at ORNL: Jaguar (Catamount)
and Rizzo (CNL Test cage). Both machines performed as they should across
the commit.
It is likely that mor changes will follow this the work and environment
stabilizes.
Most of the infrastructure works the same for Catamount and CNL
except for a few bits. Below are the highlights:
Default IFACE Change:
On Catamount we can use PTL_IFACE_DEFAULT, but on the CNL system we have access
to will fail on this interface, and should be set to:
IFACE_FROM_BRIDGE_AND_NALID(PTL_BRIDGE_UK,PTL_IFACE_SS).
So if we detect that we are running with YOD then use the former interface
and if we detect that we are running with ALPS then use the latter.
We will want to pursue a more elegant solution if this interface continues to
change across machines.
PtlGetId and cnos_register_ptlid:
The header suggests that these should never be called when launching with YOD.
But in the ALPS environment the cnos_barrier() will hang forever if these
functions are not called after PtlNIInit(). Since these functions only need to
be called once, and the orte rmgr/cnos component is loaded before the ompi
common/portals componet then just call these functions once in the rmgr/cnos
component.
cnos_barrier_init():
This is a noop for YOD, but critical for ALPS. So be sure to call it before
calling the first barrier in the rmgr/cnos component.
cnos_barrier vs cnos_pm_barrier:
It is suggested the cnos_pm_barrier only be used during finalization
as it will indicate to the launcher (yod or aprun) that the app is about
to complete. It was suggested that we use the regular cnos_barrier() instead.
I want to look into this a bit more to make sure there are not adverse
side effects. A note has been placed in the code to indicate this reasoning.
This commit was SVN r15756.
2GB to 4GB-1 by using long instead of size_t for the sm size.
* it is done to prevent user from running into the ftruncate() in
common sm component (and possibly others) problem that ftruncate
takes an off_t which is a signed long integer. If we use an
unsigned long, it'll run into an invalid argument errno=22.
* See trac #1117
This commit was SVN r15752.
mpi_show_mpi_alloc_mem_leaks
When activated, MPI_FINALIZE displays a list of memory allocations
from MPI_ALLOC_MEM that were not freed by MPI_FREE_MEM (in each MPI
process).
* If set to a positive integer, display only that many leaks.
* If set to a negative integer, display all leaks.
* If set to 0, do not show any leaks.
This commit was SVN r15736.
This mpool will have no btl module owner there was no btl created for
the HCA with no ports, but it will still be tracked in the mpool
framework (i.e., it's available).
If MPI_ALLOC_MEM is called by the app, one of two things will happen:
1. if there's an HCA on the host with some active ports, the openib
btl component will still be in the process space, and therefore
the "mpool with no btl" (MWNB) module will still be able to call
the reg/dereg functions, and all will be fine. However, if
MPI_FREE_MEM is never invoked to free the memory, bad things will
happen during MPI_FINALIZE. The pml is finalized, which finalizes
all the btls. The btls finalize all their mpools and all is fine.
But later we close down the mpool framework which then finalizes
any left over mpool modules, such as MWNB. However, the openib
BTL module functions that the MWNB was registered with are no
longer in the process space, and it segv's while trying deregister
the memory.
2. if there are *no* HCA's on the host with active ports, then the
openib btl will have been unloaded, and when the MWNM tries to
register the memory, the functions it tries to call (in the openib
btl) are no longer there, and we segv.
This commit was SVN r15735.
it to at least compile. If you actually get to the point of invoking
the openib btl progress thread, you'll get a big opal_output warning
that it is pretty much guaranteed not to work.
This commit was SVN r15628.
Fixed: All copyrights are now correct up to 2007
Fixed: Build system now works with VPATHs
Changed: protocol_example is now ignored by default
This commit was SVN r15627.
sender piggybacks a number of credit messages it received from a peer. A number
of outstanding credit messages is limited. This is needed to never ever fall
back to HW flow control.
This commit was SVN r15580.
eager RDMA receive path and checks internally from where it was called from to
perform different tasks. Leave only common code in there and move other code
to appropriate places.
This commit was SVN r15579.
* General TCP cleanup for OPAL / ORTE
* Simplifying the OOB by moving much of the logic into the RML
* Allowing the OOB RML component to do routing of messages
* Adding a component framework for handling routing tables
* Moving the xcast functionality from the OOB base to its own framework
Includes merge from tmp/bwb-oob-rml-merge revisions:
r15506, r15507, r15508, r15510, r15511, r15512, r15513
This commit was SVN r15528.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15506
r15507
r15508
r15510
r15511
r15512
r15513
Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point.
Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings.
This commit was SVN r15517.
It will prevent the error failure in openib finalize
but it doesn't resolve the actual issue. I guess that
oneside tests some how allocates memory (mpool?) and doesn't
release it. Need to check it.
This commit was SVN r15488.
* bml.h had a change that introduced a variable named "_order" to
avoid a conflict with a local variable. The namespace starting
with _ belongs to the os/compiler/kernel/not us. So we can't start
symbols with _. So I replaced it with arg_order, and also updated
the threaded equivalent of the macro that was modified.
* in btl_openib_proc.c, one opal_output accidentally had its string
reverted from "ompi_modex_recv..." to
"mca_pml_base_modex_recv....". This was fixed.
* The change to ompi/runtime/ompi_preconnect.c was entirely
reverted; it was an artifact of debugging.
This commit was SVN r15475.
The following SVN revision numbers were found above:
r15474 --> open-mpi/ompi@8ace07efed
1. Galen's fine-grain control of queue pair resources in the openib
BTL.
1. Pasha's new implementation of asychronous HCA event handling.
Pasha's new implementation doesn't take much explanation, but the new
"multifrag" stuff does.
Note that "svn merge" was not used to bring this new code from the
/tmp/ib_multifrag branch -- something Bad happened in the periodic
trunk pulls on that branch making an actual merge back to the trunk
effectively impossible (i.e., lots and lots of arbitrary conflicts and
artifical changes). :-(
== Fine-grain control of queue pair resources ==
Galen's fine-grain control of queue pair resources to the OpenIB BTL
(thanks to Gleb for fixing broken code and providing additional
functionality, Pasha for finding broken code, and Jeff for doing all
the svn work and regression testing).
Prior to this commit, the OpenIB BTL created two queue pairs: one for
eager size fragments and one for max send size fragments. When the
use of the shared receive queue (SRQ) was specified (via "-mca
btl_openib_use_srq 1"), these QPs would use a shared receive queue for
receive buffers instead of the default per-peer (PP) receive queues
and buffers. One consequence of this design is that receive buffer
utilization (the size of the data received as a percentage of the
receive buffer used for the data) was quite poor for a number of
applications.
The new design allows multiple QPs to be specified at runtime. Each
QP can be setup to use PP or SRQ receive buffers as well as giving
fine-grained control over receive buffer size, number of receive
buffers to post, when to replenish the receive queue (low water mark)
and for SRQ QPs, the number of outstanding sends can also be
specified. The following is an example of the syntax to describe QPs
to the OpenIB BTL using the new MCA parameter btl_openib_receive_queues:
{{{
-mca btl_openib_receive_queues \
"P,128,16,4;S,1024,256,128,32;S,4096,256,128,32;S,65536,256,128,32"
}}}
Each QP description is delimited by ";" (semicolon) with individual
fields of the QP description delimited by "," (comma). The above
example therefore describes 4 QPs.
The first QP is:
P,128,16,4
Meaning: per-peer receive buffer QPs are indicated by a starting field
of "P"; the first QP (shown above) is therefore a per-peer based QP.
The second field indicates the size of the receive buffer in bytes
(128 bytes). The third field indicates the number of receive buffers
to allocate to the QP (16). The fourth field indicates the low
watermark for receive buffers at which time the BTL will repost
receive buffers to the QP (4).
The second QP is:
S,1024,256,128,32
Shared receive queue based QPs are indicated by a starting field of
"S"; the second QP (shown above) is therefore a shared receive queue
based QP. The second, third and fourth fields are the same as in the
per-peer based QP. The fifth field is the number of outstanding sends
that are allowed at a given time on the QP (32). This provides a
"good enough" mechanism of flow control for some regular communication
patterns.
QPs MUST be specified in ascending receive buffer size order. This
requirement may be removed prior to 1.3 release.
This commit was SVN r15474.
switching:
0 0
/ \ \ / \ \
1 \ \ --> 4 \ \
/ \ \ / \ \
3 2 \ 3 2 \
4 1
(duh). The first form is the bmtree suitable for bcast, but the latter is better for reduce.
Updating default decision function accordingly.
This commit was SVN r15422.
instead of just the procs for MCW (in MCW order). Should make resolving
ptl_process_id_t structures for arbitrary communicators easier for
applications that need it.
This commit was SVN r15393.
that exactly describes the buffer to be used as the target of the
operation
* Use the above flag to disable components setting the flag from being
used for real RDMA operations for the one-sided component (the
BTLs will still be used for RDMA transfers for the PML and for
send/receive communication for the OSC component)
This commit was SVN r15375.
have to construct/destruct only once. Therefore, the construction will
happens before digging for a PML, while the destruction just before
finalizing the component.
Add some OPAL_LIKELY/OPAL_UNLIKELY.
This commit was SVN r15347.
receive queues are shared among all PMLs, they are declared in the base PML,
and the selected PML is in charge of initializing and releasing them.
The CM PML is slightly different compared with OB1 or DR. Internally it use
2 different types of requests: light and heavy. However, now with this patch
both types of requests are stored in the same queue, and cast appropriately
on the allocation macro. This means we might use less memory than we allocate,
but in exchange we got full support for most of the parallel debuggers.
Another thing with this patch, is that now for all PML (CM included) the basic
PML requests start with the same fields, and they are declared in the same order
in the request structure. Moreover, the fields have been moved in such a way
that only one volatile/atomic will exist per line of cache (hopefully).
This commit was SVN r15346.
VxWorks. Still some issues remaining, I'm sure.
Refs trac:1010
This commit was SVN r15320.
The following Trac tickets were found above:
Ticket 1010 --> https://svn.open-mpi.org/trac/ompi/ticket/1010
than just the PML/BTLs these days. Also clean up the code so that it
handles the situation where not all nodes register information for a given
node (rather than just spinning until that node sends information, like
we do today).
Includes r15234 and r15265 from the /tmp/bwb-modex branch.
This commit was SVN r15310.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15234
r15265
The problem is that in the case of threaded builds for every fifo
a head and tail lock will be allocated inside the shared memory
segment and the ptr is stored inside the fifo. In the case that the sm backend
file will be mapped in all processes at the same address (mostly the
case for non-thread builds) this is fine, but in the cases when the
processes map the file at different addresses this addresses cause big
trouble in other processes than the one that allocted the locks.
Therefore the send lock addresses have to be recalculated to match
the local mapping of the processes that use them.
This commit was SVN r15291.
relative bandwidths of each BTL. Precalculate what part of a message should
be send via each BTL in advance instead of doing it during scheduling.
This commit was SVN r15248.
each BTL. Precalculate what part of a message should be send via each BTL in
advance instead of doing it during scheduling.
This commit was SVN r15247.
argument to the query for the line speed. This function is still not
documented, and it really look strange that we have to respecify the
nic_id (it's already attached to the endpoint).
This commit was SVN r15241.
* Fix potential race condition with starting a new lock epoch if we
were releasing a lock
* Increment the shared counter if we start a shared lock session during
the unlock code
This commit was SVN r15186.
a WIN_FREE
* Fix race condition in threaded builds with pending unlocks and
finishing an epoch
* Fix memory leak due to use of OBJ_DESTRUCT instead of OBJ_RELEASE
* Fix race condition between releasing multiple shared locks and
starting a new lock
* Need to incremement the shared count if starting a new shared
lock once an exclusive lock finishes
This commit was SVN r15185.
OBJ_NEW
* Need to single when the passive unlock has left an expose epoch for
the win_free case
* Clean up some debugging output
* fix missing variable initialization
This commit was SVN r15167.
flex (which, incidentally, emit ''more'' warnings than earlier
versions). Grumble.
This commit was SVN r15166.
The following SVN revision numbers were found above:
r15158 --> open-mpi/ompi@57d09c10f7
- adding linear algorithm with synchronization for gather.
This algorithm prevents congestion at root process, but introduces
synchronization (serializes non-root processes, but allows messages
to arrive from two processes at the same time).
It performed better than binomial and linear algorithms for large message,
and intermediate and large communicator sizes.
- Updating MPI_Gather decision function to reflect performance results
from MX. I will perform more measurements though - so this one can
change.
This commit was SVN r15165.
reason it's that we don't have the nice configure stuff, so detecting
when to enable the CR PML it's kind of hard. Keep it defined and at
least it compile smoothly.
This commit was SVN r15116.
branch:
* Support btl_openib_if_include and btl_openib_if_exclude MCA
parameters, similar to those supported by other BTLs. Each take a
comma-delimited lists of identifiers. Identifiers can be HCA
interface names (e.g., ipath0, mthca1, etc.) or an HCA interface
name and port numbers (e.g., ipath0:1, mthca1:2, etc.). It is an
error to specify both _include and _exclude. If you specify a
non-existant (or non-ACTIVE) HCA and/or port, you'll get a warning
unless you disable the warning by setting the MCA parameter
btl_openib_warn_nonexistent_if to 0.
* Start updating to use BEGIN_C_DECLS and END_C_DECLS
* A few other minor fixes that were picked up along the way.
This commit was SVN r15063.
Set bandwidth for all ports of mthca0:
--mca btl_openib_bandwidth_mthca0 1000
Set bandwidth for port 1 of mthca1:
--mca btl_openib_bandwidth_mthca1:1 1000
Set latency for port 2 lid 123 on mthca0:
--mca btl_openib_latency_mthca0:2:123 20
This commit was SVN r15041.
even look at the status code and basically guarantee that the aio
function was never called, so there's really no point in AC_TRY_RUN
over AC_COMPILE_IFELSE...
This commit was SVN r15033.
single threaded builds. In its default configuration, all this does
is ensure that there's at least a good chance of threads building
based on non-threaded development (since the variable names will be
checked). There is also code to make sure that a "mutex" is never
"double locked" when using the conditional macro mutex operations.
This is off by default because there are a number of places in both
ORTE and OMPI where this alarm spews mega bytes of errors on a
simple test. So we have some work to do on our path towards
thread support.
Also removed the macro versions of the non-conditional thread locks,
as the only places they were used, the author of the code intended
to use the conditional thread locks. So now you have upper-case
macros for conditional thread locks and lowercase functions for
non-conditional locks. Simple, right? :).
This commit was SVN r15011.
structures in the system. Instead of using memcmp, use the ns function.
This won't cause a problem as long as all three elements of the name are
ints, but if they have different sizes, alignment and padding rules
can cause memcmp() to compare padding space, which rarely holds a sane
value.
This commit was SVN r14998.
id based on the last half of the mapper MAC. This allow us to figure out how
to connect peers. This allow the MX BTL to be used in a cluster of cluster
configuration where each cluster have MX internally as well as on a multi
rail MX system.
This commit was SVN r14932.
symbols in them and environ is defined only in the final application
(probably in crt1.o). Apple provides a function for getting at the
environment, so use that instead if it's available.
This commit was SVN r14857.
have the SRQ interface.
* Instead of setting AC_DEFINEs per MCA component, set per test. THe
answers can never be difference, and this will speed sed just a teeny
bit
This commit was SVN r14856.
that allows to send any range of a request by send/recv instaed of RDMA
and use it to send data from the end of a request in pipeline protocol.
This commit was SVN r14841.
compile failed because of the wrong variable name.
This commit was SVN r14807.
The following SVN revision numbers were found above:
r14806 --> open-mpi/ompi@7e57bbb0ef
This is required to tighten up the BTL semantics. Ordering is not guaranteed,
but, if the BTL returns a order tag in a descriptor (other than
MCA_BTL_NO_ORDER) then we may request another descriptor that will obey
ordering w.r.t. to the other descriptor.
This will allow sane behavior for RDMA networks, where local completion of an
RDMA operation on the active side does not imply remote completion on the
passive side. If we send a FIN message after local completion and the FIN is
not ordered w.r.t. the RDMA operation then badness may occur as the passive
side may now try to deregister the memory and the RDMA operation may still be
pending on the passive side.
Note that this has no impact on networks that don't suffer from this
limitation as the ORDER tag can simply always be specified as
MCA_BTL_NO_ORDER.
This commit was SVN r14768.
fix (r14749) and then backed it out (r14753).
As we are unable to send more than a 32 bits length over TCP in one go, there
is no reason to have an uint64 length in the header. This reduce the size
of the TCP header.
This commit was SVN r14755.
The following SVN revision numbers were found above:
r14749 --> open-mpi/ompi@48c026ce6b
r14753 --> open-mpi/ompi@28ed850b4c
r14703 for the point-to-point component.
* Associate the list of long message requests to poll with the
component, not the individual modules
* add progress thread that sits on the OMPI request structure
and wakes up at the appropriate time to poll the message
list to move long messages asynchronously.
* Instead of calling opal_progress() all over the place, move
to using the condition variables like the rest of the project.
Has the advantage of moving it slightly further along in the
becoming thread safe thing.
* Fix a problem with the passive side of unlock where it could
go recursive and cause all kinds of problems, especially
when progress threads are used. Instead, have two parts of
passive unlock -- one to start the unlock, and another to
complete the lock and send the ack back. The data moving
code trips the second at the right time.
This commit was SVN r14751.
The following SVN revision numbers were found above:
r14703 --> open-mpi/ompi@2b4b754925
mca_btl_tcp_hdr_t struct and remove the need for the heterogeneous
padding by changing the type of the "size" member to be uint32_t
(vs. uint64_t). The value would never be greater than 32 bits anyway,
so having the type be uint64_t was wasteful.
This commit was SVN r14749.
* Remove unused declaration
* remove unused variable warning when not using progress threads
* If we're using progress threads, we want to lock, not trylock
when in progress, since it was called from the wakeup thread
and not the progress function
This commit was SVN r14739.
to do while(...) { } then we can't change the variables in the ...
atomically, but should do it while holding the module lock.
* Fix dumb communicator creation error when we don't create the progress
stuff (because a window already exists), where we would accidently
jump to the error case.
This commit was SVN r14715.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
This commit was SVN r14711.
* Combine polling of the long requests and buffer requests into
one type, and in one place
* Associate the list of requests to poll with the component, not
the individual modules
* add progress thread that sits on the OMPI request structure
and wakes up at the appropriate time to poll the message
list. Not the best, but without some asynch notification
from the PML that a given set of requests has completed, there
isn't much better
* Instead of calling opal_progress() all over the place, move
to using the condition variables like the rest of the project.
Has the advantage of moving it slightly futher along in the
becoming thread safe thing
* Fix a problem with the passive side of unlock where it could
go recursive and cause all kinds of problems, especially
when progress threads are used. Instead, have two parts of
passive unlock -- one to start the unlock, and another to
complete the lock and send the ack back. The data moving
code trips the second at the right time.
This commit was SVN r14703.
We eagerly send data up to btl_*_eager_limit with the match
Upon ACK of the MATCH we start using send/receives of size
btl_*_max_send_size up to the btl_*_rdma_pipeline_offset
After the btl_*_rdma_pipeline_offset we begin using RDMA writes of
size btl_*_rdma_pipeline_frag_size.
Now, on a per message basis we only use the above protocol if the
message is larger than btl_*_min_rdma_pipeline_size
btl_*_eager_limit - > same
btl_*_max_send_size -> same
btl_*_rdma_pipeline_offset -> btl_*_min_rdma_size
btl_*_rdma_pipeline_frag_size -> btl_*_max_rdma_size
btl_*_min_rdma_pipeline_size is new..
This patch also moves all BTL common parameters initialisation into
btl_base_mca.c file.
This commit was SVN r14681.
* Move ipv6comat.h code into opal_config_bottom.h and change into some
more intelligent testing of structures
* Change opal's if interface to use sockaddr instead of sockaddr_storage,
as the RFCs suggest we do
* Move the networking code in opal that isn't directly related to if
detection into net.h
* Add quicky function to get the port out of either a sockaddr_in
or sockaddr_in6, saving a bunch of code in the oob.
* Update TCP oob and btl with new interface
This commit was SVN r14679.
without any context so we need to save a context somewhere so it can be
retrieved given only buffer pointer. This patch saves context (pointer to
frag) just before start of a buffer so it can be be easily retrieved.
This commit was SVN r14664.
The following SVN revision numbers were found above:
r13921 --> open-mpi/ompi@90fb58de4f
* Require Autoconf 2.60 or higher and remove some cruft
required for AC 2.59 or the AC 2.59 / AC 2.60 mix
* Remove a bunch of now unnecessary AC_SUBST calls
* Use the libtool-provided variables for the -I and
library to use when compiling against ltdl
Fixes trac:1000
This commit was SVN r14652.
The following Trac tickets were found above:
Ticket 1000 --> https://svn.open-mpi.org/trac/ompi/ticket/1000
avoid alignmment issues. This commit fixes trac:1009.
This commit was SVN r14608.
The following Trac tickets were found above:
Ticket 1009 --> https://svn.open-mpi.org/trac/ompi/ticket/1009
via the visibility feature that is provided by some compilers.
Per default this feature is disabled, to enable it you need to
configure with --enable-visibility and obviously you need a compiler
with visibility support. Please refer to the wiki for more information.
https://svn.open-mpi.org/trac/ompi/wiki/Visibility
This commit was SVN r14582.
with RDMAing the rest of it. Also more than one RDMA writes can be performed
simultaneously by different threads. To make this code thread safe this patch
clones original request convertor for each RDMA fragment.
This commit was SVN r14574.
- Removing "small" message size limit because it really does not relate to the eager size
accross the board.
Now, the leaf nodes in generalized reduce will use blocking send (DEFAULT/ORIGINAL BEHAVIOR)
either when the maximum number of outstanding requests is 0 or
when the total number of segments is less than the maximum number of outstanding requests.
Otherwise, it will send messages using non-blocking synchronized send operation.
This commit was SVN r14572.
during the IPv6 patch. The most important is the multi BTL support. There
was a quite interesting bug. Instead of setting up the multiple connections
over different physical devices, based on the time when these connections
were created most of the time they were all using the same physical network.
Which, of course, was not the intended goal, as we top at the maximum
bandwidth available over one device instead of gathering all available
bandwidth from all devices.
Second, the IPv6 RFC suggest to use sockaddr_storage as a holder for the
IP information, but use a sockaddr* when we pass it to functions. This is
only partially corrected by this patch.
Some other minor cleanups.
This commit was SVN r14544.
This "feature" is disabled by default and it should not affect the current performance.
In case when the message size is large and segment size is smaller than eager size for particular interface,
the leaf nodes in generalized reduce function can overflood parent nodes by sending all segments without
any synchronization. This can cause the parent to have HIGH number of unexpected messages (think 16MB
message with 1KB segments for example). In case of binomial algorithm root node always has at least one
child which is leaf, so this can potentially affect the root's performance significantly [Especially in
large communicators where root may have quite a few children (binomial tree for example)].
When the segment size is bigger than the eager size, rendezvous protocol ensures that this does
not happen so it is not necessary.
Originally, the problem was exposed in "infinite" bucket allocator clean up time for "small" segment sizes
(which may explain some "deadlocks" on Thunderbird tests).
To prevent this, we allow user to specify mca parameter "--mca coll_tuned_reduce_algorithm_max_requests NUM"
this limits number of outstanding messages from a leaf node in generalized reduce to the parent to NUM.
Messages are sent as non-blocking synchrnous messages, so syncronization happens at "wait" time.
The synchronization actually improved performance of pipeline and binomial algorithm for large message sizes
with 1KB segments over MX, but I need to test it some more to make sure it is consistent.
Since there is no easy way to find out what is "the eager" size for particular btl, I set the limit to 4000B.
If message/individual segment size is greater than 4000B - we will not use this feature. This variable may
or may not be exposed as mca parameter later...
I did not have any problems running it and both "default" and "synchronous" tests passed Intel Reduce* tests
up to 80 processes (over MX).
This commit was SVN r14518.
- make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6
so that it works for IPv4 and IPv6 addresses, and remove a whole bunch
of #ifs in the OOOB code.
- Fix a compiler warning in the TCP BTL due to run-time determined
array size by making it a dynamicly allocated array.
- Fix the unpacking code of IPv4 addresses when using IPv6 support, so
that the address is in the correct location (instead of in an IPv6
structure, use an IPv4 structure). Refs trac:1005.
This commit was SVN r14514.
The following Trac tickets were found above:
Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005
Make sure that the wrapper selection is compiled out if not enabling FT. Before the
logic would skip over it since the conditional if statements would not be satisfied,
now there are no additional if statements when compiled out.
With this modification the selection logic looks nearly identical to pre-r14051
with the exception of the non-FT related improvements.
This commit was SVN r14491.
The following SVN revision numbers were found above:
r14051 --> open-mpi/ompi@dadca7da88
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
This also involved a slight change to the oob.xcast API, so propagated that as required.
Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)
Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.
This commit was SVN r14475.
It's a bit longer, but much more clear in it's implementation I believe.
Fundamentally it is the same, but is much more solid in the implementation.
I created quite a few directed tests that this version of the implementation
now passes.
This commit was SVN r14457.
finally brings in functionality that is already on the 1.2 branch, and
was developed and tested in the v1.2ofed branch (and other places).
Short version of new features:
* Support for ibv_fork_init()
* Automatically fill in the openib BTL bandwidth value by
querying the HCA port
* Installdirs functionality
* Fixes to always use -I in the Fortran wrapper compilers (#924)
* Gleb's mpool updates
* Remove some kruft in btl/openib/configure.m4, therefore
fixing the harmless warnings noted in #665
* Bunches of updates to the Linux RPM spec file
I.e., effectively the same thing that r14411 brought to the v1.2
branch.
Also effectively brought in r14432 and r14433 (some fixes on top of
the original r14411 commit to v1.2). Still need to bring in the moral
equivalent of r14445 after this commit (fixes to installdirs).
This commit was SVN r14449.
The following SVN revision numbers were found above:
r14411 --> open-mpi/ompi@83b31314ae
r14432 --> open-mpi/ompi@a48f160595
r14433 --> open-mpi/ompi@68f346d2bc
r14445 --> open-mpi/ompi@13d366b827
- Move the PML Modex stuff out of the BML -- Abstraction violation.
- Also fix the location of the add_procs with respect to the stage gates.
This commit was SVN r14422.
to move it into the unexpected queue. Instead pack the data in
only one buffer. Now the code look more optimized and clear, but
I have a doubt about who's using this functionality. I think that
all BTLs always return only one memory segment attached to the
matching fragment (i.e. there is no unexpected iov type receive).
This commit was SVN r14416.
function to make it more scalable. The memory fragmentation is still high, but
at least in most of the cases (where all ressources are correctly released
before the cleanup) the code is now highly efficient. Before the code execute
in (N * (N-1))!, which take a while when the number of allocated ressources
increase (which is the case when a lot of unexpected messages are created).
The fix consist of checking if all items are freed and if it's the case
then do not recreate the free items list (as we know that everything will
be released). If this condition is not true, we fall back on the
original execution path (which is still sub-sub-sub ... optimal).
This commit was SVN r14406.
same "restrict" check as the top-level OMPI configure.ac script so
that it will guarantee to always get the same result. Therefore, the
#define for restrict will always have the same value in both
opal_config.h and romioconf.h, and we get 7 less warnings (6 in the IO
ROMIO component, 1 in ROMIO itself) when compiling with icc on Linux
(because PAC_C_RESTRICT and AC_C_RESTRICT would get different values
for the "restrict" #define in this case).
This commit was SVN r14387.
Fix for memory corruption in the restarted process stack. This stemed from
the brute force method we were previously using. This commit fixes this by
using a lighter weight solution focused in the r2 BML instead of above the PML.
This is a more efficient and flexible solution, and it solves the original
problem.
In the process I pulled out the ft_event function in the tcp BTL and r2 BML
into a set of *_ft.[c|h] files just to keep any updates to these code paths
as isolated as possible to make merging easier on everyone.
This commit was SVN r14371.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
The following Trac tickets were found above:
Ticket 977 --> https://svn.open-mpi.org/trac/ompi/ticket/977
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search patch is checking
$pkglibdir).
This commit was SVN r14345.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r14289
the real interest is for small to middle size unexpected messages. The unexpected messages are copied
by the PML in it's own unexpected buffers. Therefore, there is no reason to make a first copy in the
TCP BTL. The BTL can handle to the PML it's own buffer, and can be sure that once the callback
completed it can reuse the buffer, no matter what happened with the fragment.
This commit was SVN r14320.
time difference...
1) The PML makes an assumption on local/remote completion semantics of the BTL
which Self BTL does not obey, nor should it, so we fix the PML
2) The Get protocol must handle the case when sender and reciever do not agree
on wheter the data is contiguous
This commit was SVN r14313.
Sorry for configure changes during the day; I totally forgot about
that. :-(
This commit was SVN r14288.
The following SVN revision numbers were found above:
r14286 --> open-mpi/ompi@0083eba18e
The top-level OMPI configure script already checks for "restrict" and
will issue a #define for it. PAC_C_RESTRICT would also check for
restrict, but sometimes come up with a different answer than the
top-level OMPI configure script, thereby resulting in conflicting
#define's for "restrict" (e.g., icc 9.0/9.1 on linux x86-64).
So it's easiest just to remove this test from ROMIO's configure.in
script.
This commit was SVN r14286.
Remove a redundant statement in the r2 BML.
This commit was SVN r14228.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
- Remove an old comment from crcp_base_fns.c
- Let ob1 have its very own ft_event function (which I'll fill in shortly)
- Make sure ob1 finalizes the bsend stuff so we don't leave a bunch of memory sitting around
- PML base - destruct the array upon finalize. Shrink the include search so it stops after finding a match
This commit was SVN r14222.
computation of the current location on the pack/unpack process. This can
be used both for retrieving the pointer to the first byte (in the special
case of the cached RDMA protocol) and for getting the current
position (for the pipelined protocol).
I modified all BTLs, but most of them are still untested.
This commit was SVN r14180.
some time ago (last summer) that included checking for M_TRIM_THRESHOLD and
M_MMAP_MAX, unfortunately we didn't include <malloc.h> which is where these
are define, so disabling sbrk for the registration cache has been busted for
some time.
This commit was SVN r14169.
mca_mpool_base_page_size_log. They are exported by the mpool/base/base.h,
if some other code need them, then it should include this file
instead of having it's own redefinition of these externals.
This commit was SVN r14156.
Back out r14073 - it speeds up TCP latency / bandwidth but at the same time
it kills ROMIO and one-sided performance when using only TCP. The problem
is that it only allows those two to be progressed every couple of seconds,
leading to what looks like hangs in the one-sided tests (and the ROMIO stuff,
although people seem to not notice that at this point).
This commit was SVN r14144.
The following SVN revision numbers were found above:
r14073 --> open-mpi/ompi@64fbbc20b8
r14142 --> open-mpi/ompi@241545a098
it kills ROMIO and one-sided performance when using only TCP. The problem
is that it only allows those two to be progressed every couple of seconds,
leading to what looks like hangs in the one-sided tests (and the ROMIO stuff,
although people seem to not notice that at this point).
This commit was SVN r14142.
The following SVN revision numbers were found above:
r14073 --> open-mpi/ompi@64fbbc20b8
it limits the number of circular buffers allocated between each pair of peers.
This allows for more tight memory usage control.
This commit was SVN r14120.
waist slightly more memory, but prevents problem when fifo cannot be allocated
later during a job run when memory resource is exhausted.
This commit was SVN r14119.
if less than or equal pml_ob1_unexpected_limit just buffer in the PML level recv
fragment else allocate a buffer via the bucket allocator
This commit was SVN r14117.
latency is high and the network relatively fast. This will allow for more kernel
level buffering, which allow overlap between system calls and communications.
Somehow, even on fast clusters there is an improvement (non significant).
This patch create multiple modules for the same device, which in turn will
create multiple sockets between the peers. By default the number of BTL by
device is set to 1, so there is no fundamental difference with the current
version. Change the value of btl_tcp_links to enable multiple links between
peers.
This commit was SVN r14076.