idea at the time, but led to logistical difficulties in importing new
versions of ROMIO:
* We are effectively eliminating the ROMIO file prefix rule hacks in
the ROMIO component, which create symlinks from foo.c to
io_romio_foo.c. In reality, the file name conflict potential will
be small.
* Additionally, we are effectively eliminating the ROMIO function
prefix rule in the ROMIO component. This is another place where
there are generally problems with the merge up new versions of ROMIO
and/or patches from the user community (for their own local builds).
In reality, since other major MPI implementations provides the same
exact symbols, it won't cause any practical problems for users.
In return, we make it ''much'' simpler to apply ROMIO patches to Open
MPI. The problem right now is that any patch will have filenames such
as ad_panfs.c, but Open MPI will only have io_romio_ad_panfs.c, making
things extremely difficult for users. I believe, for example, that
this would make it possible for LANL to have applied their patches
without too much hassle on either their part or our part. It will
also make things easier for OMPI when we/they want to do the next
ROMIO upgrade (this was one of the sources of problems on each
upgrade).
This commit was SVN r17436.
We loop over all peer addresses and accept when one of them matches.
Note that this might break functionality: mca_btl_tcp_proc_insert now
always inserts the same endpoint. (is the lack of endpoints the problem?
should there be one for every remote address?)
Re #1206
This commit was SVN r17307.
- the registration array is now global instead of one by BTL.
- each framework have to declare the entries in the registration array reserved. Then
it have to define the internal way of sharing (or not) these entries between all
components. As an example, the PML will not share as there is only one active PML
at any moment, while the BTLs will have to. The tag is 8 bits long, the first 3
are reserved for the framework while the remaining 5 are use internally by each
framework.
- The registration function is optional. If a BTL do not provide such function,
nothing happens. However, in the case where such function is provided in the BTL
structure, it will be called by the BML, when a tag is registered.
Now, it's time for the second step... Converting OB1 from a switch based PML to an
active message one.
This commit was SVN r17140.
for dynamic selection of cpc methods based on what is available. It
also allows for inclusion/exclusions of methods. It even futher allows
for modifying the priorities of certain cpc methods to better determine
the optimal cpc method.
This patch also contains XRC compile time disablement (per Jeff's
patch).
At a high level, the cpc selections works by walking through each cpc
and allowing it to test to see if it is permissable to run on this
mpirun. It returns a priority if it is permissable or a -1 if not. All
of the cpc names and priorities are rolled into a string. This string
is then encapsulated in a message and passed around all the ompi
processes. Once received and unpacked, the list received is compared
to a local copy of the list. The connection method is chosen by
comparing the lists passed around to all nodes via modex with the list
generated locally. Any non-negative number is a potentially valid
connection method. The method below of determining the optimal
connection method is to take the cross-section of the two lists. The
highest single value (and the other side being non-negative) is selected
as the cpc method.
svn merge -r 16948:17128 https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/ .
This commit was SVN r17138.
header as double-word aligned and prevents bus errors on SPARC
based servers. This is part of fix for #1148.
Refs trac:1148
This commit was SVN r17090.
The following Trac tickets were found above:
Ticket 1148 --> https://svn.open-mpi.org/trac/ompi/ticket/1148
ibv_dereg_mr() for instance) from ptmalloc callback. Call to free() from the
callback causes deadlock. Notice what should be unregistered inside the callback
and do actual cleanup at the next call to mpool->register().
This commit was SVN r17064.
Also introduces work in progress "convertor" sender based copy algorithm. This algorithm cannot be selected without other modifications in the convertor (not currently available in trunk). The default old synchronous copy algorithm is selected by default.
This commit was SVN r17063.
high prio QPs and low prio QPs) and because not all of them are polled each time
progrgess() is called (to save on latency) starvation is possible. The commit
fixes this. Now each channel is polled, but higher priority channels are polled
more often. Three new parameters are introduced that control polling ratios
between different channels.
This commit was SVN r17024.
(sometimes after the merge with the ORTE branch), the opal_pointer_array
will became the only pointer_array implementation (the orte_pointer_array
will be removed).
This commit was SVN r17007.
so that it is higher than the new TCP BTL exclusivity as of r16942.
The portals BTL maintainer may want to do the same...
This commit was SVN r16995.
The following SVN revision numbers were found above:
r16942 --> open-mpi/ompi@80e9730100
about linkers, have all OPAL, ORTE, and OMPI components '''not'' link
against the OPAL, ORTE, or OMPI libraries.
See ttp://www.open-mpi.org/community/lists/users/2007/10/4220.php for
details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a
better-formatted version of the same info).
This commit was SVN r16968.
This commit also cleans up the checkpoint and terminate case making it more
precise than before. Previously the application could make a small amount of
progress between checkpoint completion and application termination. Now the
application will make no progress at all in this time span.
Additional minor change:
- Start using OPAL_INT_TO_BOOL instead of if/else logic
This commit was SVN r16952.
should be passed via commandline. However, there is a slight coding
bug in the openib connect code. When registering the name of the
option, mca_base_param_reg_string will prepend the relevant info
("btl_openib_" in this case). The existing code will require
"btl_openib_btl_openib_connect" instead of "btl_openib_connect".
This patch corrects this.
This commit was SVN r16937.
smaller then allocated size.
2. If reserve zero don't allocate coalesced frag since it will be RDMAed, not
send. The logic was other way around.
This commit was SVN r16928.
mca_btl_openib_mca_setup_qps(). It looks like someone just forgot to
clean-up the previous call when they added the check for the return
code.
I ran a quick IMB test over IB to verify everything is still working.
This commit was SVN r16870.
parameters don't make any sense. Credits are never piggybacked. Also make
default queue sizes to be calculated from eager_limit and max_send_size values.
This commit was SVN r16816.
ompi_mtl_portals_finalize(). Previous code was cleaning up Portals resources
that hadn't been allocated, which caused valid handles used elsewhere to be
freed, which broke cnos_barrier() for the Portals btl.
This commit was SVN r16801.
needed instead of creating it and then canceling if it is not needed. Change
error handling during finalize so that it will not skip async thread
destruction. Otherwise async thread may segfault during openib module unloading.
This commit was SVN r16782.
to a pending queue of eager rdma QP instead of correct pending list. This patch
fixes this by getting reed of "eager rdma qp" notion. Packet is always send
over its order QP. The patch also adds two pending queues for high and low prio
packets. Only high prio packets are sent over eager RDMA channel.
This commit was SVN r16780.
main idea (except of cleanup) is to save on initialisation of unneeded fields
and to use C type checking system to catch obvious errors.
This commit was SVN r16779.
has his own range which is defined by a min value and a range. By default
there is no limitation on the port range, which is exactly the same
behavior as before.
This commit was SVN r16584.
such a way that converter will not be able to pack some of it. This commit adds
handling of such cases. If converter can't pack any data for a BTL the data is
sent over another BTL that has data to send.
This commit was SVN r16493.
the use of the --mca btl_base_verbose flag. The
btl framework now matches all the other frameworks.
Slightly modify error messages for clarity.
This commit was SVN r16443.
PML base will take care of the registration with the event library.
Otherwise, (and this apply for the CM case) the MTL are in charge of
registering their own progress function.
This commit was SVN r16415.
* Fix some missing includes in a few places.
* Add the cr_request() functionality to the BLCR CRS component.
We are now dependent upon the 0.6.* series of BLCR.
* Made the CR notification mechanism a registered function.
This way we can have an OPAL-only version and it can be replaced at
runtime with the ORTE version.
* Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only
CR functionality when the user wants it. Default: Disabled.
* Fix the placement of a checkpoint request check in MPI_Init
* Pull the OPAL notification mechanism into the SnapC framework.
* We no longer fork/exec the 'opal-checkpoint' command for local
checkpointing, the Local coordinator in the orted does this directly.
* The Local and Application coordinator talk together bypassing the OPAL
notifiation mechanism.
* Optimized the Local <-> App Coordinator communication.
* Improved the structure used to track vpid_snapshots in the local coord.
* Fix a race condition in which an application under heavy communication load
may produce an inconsistent global checkpoint.
This commit was SVN r16389.
yesterday. This actually exposed a very, very long-standing bug where
part of the coll base was incorrectly checking the coll API version
against the MCA API version. When coll went to v1.1 (yesterday) and
was no longer the same as the MCA v1.0, the test started failing.
This commit fixes to check for v1.1 everywhere in the coll base, and
to ensure to check coll framework/API version numbers against coll
framework/API version numbers (vs. against the MCA API version
number).
This commit was SVN r16373.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
Check if an exclusion string (i.e. '-mca btl ^sm) was provided; if so OFUD just disables itself.
This commit was SVN r16307.
The following Trac tickets were found above:
Ticket 1154 --> https://svn.open-mpi.org/trac/ompi/ticket/1154
Each one of them has a field to store QP type, but this is redundant.
Store qp type only in one structure (the component one).
This commit was SVN r16272.
meaning "infinite") is no longer larger than the minimum required
size. So put in an appropriate test to ensure that "infinite" was not
requested.
This commit was SVN r16142.
numbers for tuning.
Switch the bookmark_recv to be non-blocking. If this is blocking then for
process counts >= 32 slight process delays were causing cascading performance
delays in the protocol. This lead to checkpoints either taking about 3 sec or
45 sec (or more) for 64 procs due to the cascading delays. With the nonblocking
receive version this is no longer the case we get the speedup we expect for this
part of the protocol.
More tuning to come.
This commit was SVN r16137.
you'll get a helpful error message and the openib BTL will deactivate
itself.
This commit was SVN r16133.
The following Trac tickets were found above:
Ticket 1133 --> https://svn.open-mpi.org/trac/ompi/ticket/1133
NOTE: This build system does not work with the current autogen.sh. Modified one is under heavy testing to make sure it does not have side effects
This commit was SVN r16110.
Basically revert this part of r16015.
This commit was SVN r16029.
The following SVN revision numbers were found above:
r16015 --> open-mpi/ompi@435e7d80e9
no more work associated with this request. No more outstanding completions or
packets and send scheduling isn't running in another thread.
This commit was SVN r16013.
the ompi_convertor_need_buffers function to only return 0 if the convertor
is homogeneous (which it never does on the trunk, but does to on v1.2, but
that's a different issue). Only enable the heterogeneous rdma code for
a btl if it supports it (via a flag), as some btls need some work for this
to work properly. Currently only TCP and OpenIB extensively tested
This commit was SVN r15990.
side too. Otherwise a content of the recvreq->req_rdma array is replaced later
without freeing previous content and refcount on registration in mpool become
wrong.
This commit was SVN r15978.
to always first check for a NULL frag pointer before trying to send the
fragment. This avoids an issue in multi-threaded execution in which
multiple threads working on the same endpoint can result in a thread
finding itself here with nothing to send.
This commit was SVN r15963.
from the posted generic_recv-queue.
- Move the PERUSE_COMM_MSG_MATCH_POSTED_REQ from
MCA_PML_OB1_RECV_REQUEST_MATCHED to
mca_pml_ob1_recv_frag_match() as suggested by Terry Dontje
Only post, if this is not a probe/iprobe request.
- Do not post PERUSE_COMM_REQ_MATCH_UNEX for probes / iprobes and
do in correct order before PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q
This commit was SVN r15947.
PERUSE_COMM_REQ_ACTIVATE event.
Therefore move the PERUSE_TRACE_COMM_EVENT for this event from
MCA_PML_BASE_SEND_REQUEST_INIT / MCA_PML_BASE_RECV_REQUEST_INIT
to the proper places into pml_ob1_isend.c / pml_ob1_irecv.c right
after the MCA_PML_OB1_SEND_REQUEST_INIT /
MCA_PML_OB1_RECV_REQUEST_INIT.
This commit was SVN r15945.
one HCA. Multiple ports, LMC, multiple BTLs per one LID. Having only one CQ for
all of them substantially reduce polling time.
This commit was SVN r15933.
on a parameter and are determined in runtime. r15346 removed calculation of
correct sizes for this structures. This patch adds it back and fixes trac:1116, #1114.
This commit was SVN r15932.
The following SVN revision numbers were found above:
r15346 --> open-mpi/ompi@433f8a7694
The following Trac tickets were found above:
Ticket 1116 --> https://svn.open-mpi.org/trac/ompi/ticket/1116
used at nce (up to one unique collective module per collective function).
Matches r15795:15921 of the tmp/bwb-coll-select branch
This commit was SVN r15924.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15795
r15921
A subset of this patch needs to be applied to v1.2
Refs trac:928
This commit was SVN r15918.
The following Trac tickets were found above:
Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928
only static libraries. Previously, we were linking the libraries into
directly into the common, btl, and mtl code. This seemed to work fine
for me on my Opteron Fedora box, but caused Lisa some issues (PtlNIInit
would succeed, but the network handle would fail when used with
PtlEQAlloc).
Instead, link the portals libraries directly into libmpi and not at
all into the common, btl, or mtl components. THen use some linker
tricks to force the linker to bring in the public interface for the
reference implementation (which thankfully is pretty small).
This commit was SVN r15902.
Fixed a condition test while checking that all segments are empty.
Without this fix, a NULL segment pointer could make it past the
test, resulting in a SegV when dereferenced.
This commit was SVN r15891.
The following Trac tickets were found above:
Ticket 1134 --> https://svn.open-mpi.org/trac/ompi/ticket/1134