- the registration array is now global instead of one by BTL.
- each framework have to declare the entries in the registration array reserved. Then
it have to define the internal way of sharing (or not) these entries between all
components. As an example, the PML will not share as there is only one active PML
at any moment, while the BTLs will have to. The tag is 8 bits long, the first 3
are reserved for the framework while the remaining 5 are use internally by each
framework.
- The registration function is optional. If a BTL do not provide such function,
nothing happens. However, in the case where such function is provided in the BTL
structure, it will be called by the BML, when a tag is registered.
Now, it's time for the second step... Converting OB1 from a switch based PML to an
active message one.
This commit was SVN r17140.
for dynamic selection of cpc methods based on what is available. It
also allows for inclusion/exclusions of methods. It even futher allows
for modifying the priorities of certain cpc methods to better determine
the optimal cpc method.
This patch also contains XRC compile time disablement (per Jeff's
patch).
At a high level, the cpc selections works by walking through each cpc
and allowing it to test to see if it is permissable to run on this
mpirun. It returns a priority if it is permissable or a -1 if not. All
of the cpc names and priorities are rolled into a string. This string
is then encapsulated in a message and passed around all the ompi
processes. Once received and unpacked, the list received is compared
to a local copy of the list. The connection method is chosen by
comparing the lists passed around to all nodes via modex with the list
generated locally. Any non-negative number is a potentially valid
connection method. The method below of determining the optimal
connection method is to take the cross-section of the two lists. The
highest single value (and the other side being non-negative) is selected
as the cpc method.
svn merge -r 16948:17128 https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/ .
This commit was SVN r17138.
Also introduces work in progress "convertor" sender based copy algorithm. This algorithm cannot be selected without other modifications in the convertor (not currently available in trunk). The default old synchronous copy algorithm is selected by default.
This commit was SVN r17063.
(sometimes after the merge with the ORTE branch), the opal_pointer_array
will became the only pointer_array implementation (the orte_pointer_array
will be removed).
This commit was SVN r17007.
about linkers, have all OPAL, ORTE, and OMPI components '''not'' link
against the OPAL, ORTE, or OMPI libraries.
See ttp://www.open-mpi.org/community/lists/users/2007/10/4220.php for
details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a
better-formatted version of the same info).
This commit was SVN r16968.
such a way that converter will not be able to pack some of it. This commit adds
handling of such cases. If converter can't pack any data for a BTL the data is
sent over another BTL that has data to send.
This commit was SVN r16493.
PML base will take care of the registration with the event library.
Otherwise, (and this apply for the CM case) the MTL are in charge of
registering their own progress function.
This commit was SVN r16415.
* Fix some missing includes in a few places.
* Add the cr_request() functionality to the BLCR CRS component.
We are now dependent upon the 0.6.* series of BLCR.
* Made the CR notification mechanism a registered function.
This way we can have an OPAL-only version and it can be replaced at
runtime with the ORTE version.
* Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only
CR functionality when the user wants it. Default: Disabled.
* Fix the placement of a checkpoint request check in MPI_Init
* Pull the OPAL notification mechanism into the SnapC framework.
* We no longer fork/exec the 'opal-checkpoint' command for local
checkpointing, the Local coordinator in the orted does this directly.
* The Local and Application coordinator talk together bypassing the OPAL
notifiation mechanism.
* Optimized the Local <-> App Coordinator communication.
* Improved the structure used to track vpid_snapshots in the local coord.
* Fix a race condition in which an application under heavy communication load
may produce an inconsistent global checkpoint.
This commit was SVN r16389.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
NOTE: This build system does not work with the current autogen.sh. Modified one is under heavy testing to make sure it does not have side effects
This commit was SVN r16110.
no more work associated with this request. No more outstanding completions or
packets and send scheduling isn't running in another thread.
This commit was SVN r16013.
the ompi_convertor_need_buffers function to only return 0 if the convertor
is homogeneous (which it never does on the trunk, but does to on v1.2, but
that's a different issue). Only enable the heterogeneous rdma code for
a btl if it supports it (via a flag), as some btls need some work for this
to work properly. Currently only TCP and OpenIB extensively tested
This commit was SVN r15990.
side too. Otherwise a content of the recvreq->req_rdma array is replaced later
without freeing previous content and refcount on registration in mpool become
wrong.
This commit was SVN r15978.
from the posted generic_recv-queue.
- Move the PERUSE_COMM_MSG_MATCH_POSTED_REQ from
MCA_PML_OB1_RECV_REQUEST_MATCHED to
mca_pml_ob1_recv_frag_match() as suggested by Terry Dontje
Only post, if this is not a probe/iprobe request.
- Do not post PERUSE_COMM_REQ_MATCH_UNEX for probes / iprobes and
do in correct order before PERUSE_COMM_MSG_REMOVE_FROM_UNEX_Q
This commit was SVN r15947.
PERUSE_COMM_REQ_ACTIVATE event.
Therefore move the PERUSE_TRACE_COMM_EVENT for this event from
MCA_PML_BASE_SEND_REQUEST_INIT / MCA_PML_BASE_RECV_REQUEST_INIT
to the proper places into pml_ob1_isend.c / pml_ob1_irecv.c right
after the MCA_PML_OB1_SEND_REQUEST_INIT /
MCA_PML_OB1_RECV_REQUEST_INIT.
This commit was SVN r15945.
on a parameter and are determined in runtime. r15346 removed calculation of
correct sizes for this structures. This patch adds it back and fixes trac:1116, #1114.
This commit was SVN r15932.
The following SVN revision numbers were found above:
r15346 --> open-mpi/ompi@433f8a7694
The following Trac tickets were found above:
Ticket 1116 --> https://svn.open-mpi.org/trac/ompi/ticket/1116
A subset of this patch needs to be applied to v1.2
Refs trac:928
This commit was SVN r15918.
The following Trac tickets were found above:
Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928
* Code cleanup and rationalization
* Fixed: mca_pml_base_send/recv_request are now allocated before recreation by the PML-V
* Fixed: pointer arithmetic bug in sender based that crashed
* Changed: directory structure. This is one step forward using autogen.sh to build static-components.h (it needs to have the directory structure of a mca framework for this).
This commit was SVN r15878.
Fixed: All copyrights are now correct up to 2007
Fixed: Build system now works with VPATHs
Changed: protocol_example is now ignored by default
This commit was SVN r15627.
* General TCP cleanup for OPAL / ORTE
* Simplifying the OOB by moving much of the logic into the RML
* Allowing the OOB RML component to do routing of messages
* Adding a component framework for handling routing tables
* Moving the xcast functionality from the OOB base to its own framework
Includes merge from tmp/bwb-oob-rml-merge revisions:
r15506, r15507, r15508, r15510, r15511, r15512, r15513
This commit was SVN r15528.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15506
r15507
r15508
r15510
r15511
r15512
r15513
Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point.
Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings.
This commit was SVN r15517.
have to construct/destruct only once. Therefore, the construction will
happens before digging for a PML, while the destruction just before
finalizing the component.
Add some OPAL_LIKELY/OPAL_UNLIKELY.
This commit was SVN r15347.
receive queues are shared among all PMLs, they are declared in the base PML,
and the selected PML is in charge of initializing and releasing them.
The CM PML is slightly different compared with OB1 or DR. Internally it use
2 different types of requests: light and heavy. However, now with this patch
both types of requests are stored in the same queue, and cast appropriately
on the allocation macro. This means we might use less memory than we allocate,
but in exchange we got full support for most of the parallel debuggers.
Another thing with this patch, is that now for all PML (CM included) the basic
PML requests start with the same fields, and they are declared in the same order
in the request structure. Moreover, the fields have been moved in such a way
that only one volatile/atomic will exist per line of cache (hopefully).
This commit was SVN r15346.
than just the PML/BTLs these days. Also clean up the code so that it
handles the situation where not all nodes register information for a given
node (rather than just spinning until that node sends information, like
we do today).
Includes r15234 and r15265 from the /tmp/bwb-modex branch.
This commit was SVN r15310.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r15234
r15265
relative bandwidths of each BTL. Precalculate what part of a message should
be send via each BTL in advance instead of doing it during scheduling.
This commit was SVN r15248.
each BTL. Precalculate what part of a message should be send via each BTL in
advance instead of doing it during scheduling.
This commit was SVN r15247.
reason it's that we don't have the nice configure stuff, so detecting
when to enable the CR PML it's kind of hard. Keep it defined and at
least it compile smoothly.
This commit was SVN r15116.
single threaded builds. In its default configuration, all this does
is ensure that there's at least a good chance of threads building
based on non-threaded development (since the variable names will be
checked). There is also code to make sure that a "mutex" is never
"double locked" when using the conditional macro mutex operations.
This is off by default because there are a number of places in both
ORTE and OMPI where this alarm spews mega bytes of errors on a
simple test. So we have some work to do on our path towards
thread support.
Also removed the macro versions of the non-conditional thread locks,
as the only places they were used, the author of the code intended
to use the conditional thread locks. So now you have upper-case
macros for conditional thread locks and lowercase functions for
non-conditional locks. Simple, right? :).
This commit was SVN r15011.
that allows to send any range of a request by send/recv instaed of RDMA
and use it to send data from the end of a request in pipeline protocol.
This commit was SVN r14841.
This is required to tighten up the BTL semantics. Ordering is not guaranteed,
but, if the BTL returns a order tag in a descriptor (other than
MCA_BTL_NO_ORDER) then we may request another descriptor that will obey
ordering w.r.t. to the other descriptor.
This will allow sane behavior for RDMA networks, where local completion of an
RDMA operation on the active side does not imply remote completion on the
passive side. If we send a FIN message after local completion and the FIN is
not ordered w.r.t. the RDMA operation then badness may occur as the passive
side may now try to deregister the memory and the RDMA operation may still be
pending on the passive side.
Note that this has no impact on networks that don't suffer from this
limitation as the ORDER tag can simply always be specified as
MCA_BTL_NO_ORDER.
This commit was SVN r14768.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
This commit was SVN r14711.
We eagerly send data up to btl_*_eager_limit with the match
Upon ACK of the MATCH we start using send/receives of size
btl_*_max_send_size up to the btl_*_rdma_pipeline_offset
After the btl_*_rdma_pipeline_offset we begin using RDMA writes of
size btl_*_rdma_pipeline_frag_size.
Now, on a per message basis we only use the above protocol if the
message is larger than btl_*_min_rdma_pipeline_size
btl_*_eager_limit - > same
btl_*_max_send_size -> same
btl_*_rdma_pipeline_offset -> btl_*_min_rdma_size
btl_*_rdma_pipeline_frag_size -> btl_*_max_rdma_size
btl_*_min_rdma_pipeline_size is new..
This patch also moves all BTL common parameters initialisation into
btl_base_mca.c file.
This commit was SVN r14681.
via the visibility feature that is provided by some compilers.
Per default this feature is disabled, to enable it you need to
configure with --enable-visibility and obviously you need a compiler
with visibility support. Please refer to the wiki for more information.
https://svn.open-mpi.org/trac/ompi/wiki/Visibility
This commit was SVN r14582.
with RDMAing the rest of it. Also more than one RDMA writes can be performed
simultaneously by different threads. To make this code thread safe this patch
clones original request convertor for each RDMA fragment.
This commit was SVN r14574.
Make sure that the wrapper selection is compiled out if not enabling FT. Before the
logic would skip over it since the conditional if statements would not be satisfied,
now there are no additional if statements when compiled out.
With this modification the selection logic looks nearly identical to pre-r14051
with the exception of the non-FT related improvements.
This commit was SVN r14491.
The following SVN revision numbers were found above:
r14051 --> open-mpi/ompi@dadca7da88
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
This also involved a slight change to the oob.xcast API, so propagated that as required.
Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)
Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.
This commit was SVN r14475.
- Move the PML Modex stuff out of the BML -- Abstraction violation.
- Also fix the location of the add_procs with respect to the stage gates.
This commit was SVN r14422.
to move it into the unexpected queue. Instead pack the data in
only one buffer. Now the code look more optimized and clear, but
I have a doubt about who's using this functionality. I think that
all BTLs always return only one memory segment attached to the
matching fragment (i.e. there is no unexpected iov type receive).
This commit was SVN r14416.
Fix for memory corruption in the restarted process stack. This stemed from
the brute force method we were previously using. This commit fixes this by
using a lighter weight solution focused in the r2 BML instead of above the PML.
This is a more efficient and flexible solution, and it solves the original
problem.
In the process I pulled out the ft_event function in the tcp BTL and r2 BML
into a set of *_ft.[c|h] files just to keep any updates to these code paths
as isolated as possible to make merging easier on everyone.
This commit was SVN r14371.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
The following Trac tickets were found above:
Ticket 977 --> https://svn.open-mpi.org/trac/ompi/ticket/977
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search patch is checking
$pkglibdir).
This commit was SVN r14345.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r14289
time difference...
1) The PML makes an assumption on local/remote completion semantics of the BTL
which Self BTL does not obey, nor should it, so we fix the PML
2) The Get protocol must handle the case when sender and reciever do not agree
on wheter the data is contiguous
This commit was SVN r14313.
Remove a redundant statement in the r2 BML.
This commit was SVN r14228.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
- Remove an old comment from crcp_base_fns.c
- Let ob1 have its very own ft_event function (which I'll fill in shortly)
- Make sure ob1 finalizes the bsend stuff so we don't leave a bunch of memory sitting around
- PML base - destruct the array upon finalize. Shrink the include search so it stops after finding a match
This commit was SVN r14222.
computation of the current location on the pack/unpack process. This can
be used both for retrieving the pointer to the first byte (in the special
case of the cached RDMA protocol) and for getting the current
position (for the pipelined protocol).
I modified all BTLs, but most of them are still untested.
This commit was SVN r14180.
if less than or equal pml_ob1_unexpected_limit just buffer in the PML level recv
fragment else allocate a buffer via the bucket allocator
This commit was SVN r14117.
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.
This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.
This commit closes trac:158
More details to follow.
This commit was SVN r14051.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r13912
The following Trac tickets were found above:
Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
- Implement the BML/r2 finialize funciton
- Cleanup the btl close routine
- Wire up a pml_base_verbose MCA parameter so you can actually watch the PML selection logic if you really want to.
- Fix a potental segfault in the selection logic.
ompi_pointer_array_get_item() may return NULL, so we have to check for it
This commit was SVN r13734.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
when on the Cray machine (aka when the NULL GPR is in use).
This commit was SVN r13638.
The following SVN revision numbers were found above:
r13582 --> open-mpi/ompi@041beeb1b6
adding new procs that the remote proc's pml is the same as our local pml.
Turns the hangs from mismatched PMLs into an abort, which is better,
I think.
This commit was SVN r13582.
open MTL MX and BTL MX and initialize them at the same time. The problem is
that both call mx_init and mx_finalize, solution is to add an external entity
that does the init and finalize (based on ref counting).
This commit was SVN r13576.
needlessly registered in multiple different places, and none of them
had a good help string. There was also an inconsistent check for
setting both mpi_leave_pinned and mpi_leave_pinned_pipeline (i.e., it
was only in ob1). This commit moves the registration of these params
to one central place (ompi/runtime/ompi_mpi_params.c, with all other
mpi_* MCA params) and uses globals to propagate the values as
relevant. The error check was also moved to the central location to
ensure that we can consistency everywhere.
This commit was SVN r13226.
return the buffer address from Fortran. It is not expected
behavior. For MPI_Buffer_attach, adjust the address of
the buffer handed in so it is always aligned.
Refs trac:750
Buffer detach reviewed by Jeff Squyres
Buffer attach alignment reviewed by George Bosilca
This commit was SVN r13205.
The following Trac tickets were found above:
Ticket 750 --> https://svn.open-mpi.org/trac/ompi/ticket/750
OB1 always use first element from array of BTLs available for RDMA. The patch
change the array creation algorithm, it puts different BTL in the first element
in round robin fashion.
This commit was SVN r13174.
components that use configure.m4 for configuration or are always built.
The macro has not been needed since moving to configure types other than
configure.stub
Fixes trac:590
This commit was SVN r13031.
The following Trac tickets were found above:
Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
George wrote the initial patch, I extended it slightly and am responsible for all bugs found.
Refs trac:587
This commit was SVN r13023.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
* Make sure that the pval always writes to the correct portion of the
lval. This only matters on 32 bit big endian machines.
* On 32 bit machines when assigning to pval, the other 4 bytes of lval
weren't being written, which could lead to bogus data
We use macros so that there aren't casts all over the code and the pval
assignment can occur to the correct 4 bytes. Refs trac:587
This commit was SVN r12974.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
udapl/openib/vapi/gm mpools a deprecated. rdma mpool has parameter that allows
to limit its size mpool_rdma_rcache_size_limit (default is 0 - unlimited).
This commit was SVN r12878.
Move the req_mtl structure back to the end of each of the structures in
the CM PML. The req_mtl structure is cast into a mtl_*_request_structure
for each MTL, which is larger than the req_mtl itself. The cast will cause
the *_request to overwrite parts of the heavy requests if the req_mtl
isn't the *LAST* thing on each structure (hence the comment). This was
moved as an optimization at some point, which caused buffer sends to fail...
Refs trac:669
This commit was SVN r12873.
The following SVN revision numbers were found above:
r12871 --> open-mpi/ompi@597598b712
The following Trac tickets were found above:
Ticket 669 --> https://svn.open-mpi.org/trac/ompi/ticket/669
CM PML. The req_mtl structure is cast into a mtl_*_request_structure for
each MTL, which is larger than the req_mtl itself. The cast will cause
the *_request to overwrite parts of the heavy requests if the req_mtl
isn't the *LAST* thing on each structure (hence the comment). This was
moved as an optimization at some point, which caused buffer sends to
fail...
Refs trac:669
This commit was SVN r12871.
The following Trac tickets were found above:
Ticket 669 --> https://svn.open-mpi.org/trac/ompi/ticket/669
I found only two places that were looking at the tokens:
1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.
2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "stripped" the value structures of tokens and segment strings, and (c) correctly obtained process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).
This commit was SVN r12813.
* Do not add new procs to the global list during modex callback or
when sharing orte names during accept/connect. For modex, we
cache the modex info for later, in case that proc ever does get
added to the global proc list. For accept/connect orte name
exchange between the roots, we only need the orte name, so no
need to add a proc structure anyway. The procs will be added
to the global process list during the proc exchange later in
the wireup process
* Rename proc_get_namebuf and proc_get_proclist to proc_pack
and proc_unpack and extend them to include all information
needed to build that proc struct on a remote node (which
includes ORTE name, architecture, and hostname). Change
unpack to call pml_add_procs for the entire list of new
procs at once, rather than one at a time.
* Remove ompi_proc_find_and_add from the public proc
interface and make it a private function. This function
would add a half-created proc to the global proc list, so
making it harder to call is a good thing.
This means that there's only two ways to add new procs into the global proc list at this time: During MPI_INIT via the call to ompi_proc_init, where my job is added to the list and via ompi_proc_unpack using a buffer from a packed proc list sent to us by someone else. Currently, this is enough to implement MPI semantics. We can extend the interface more if we like, but that may require HNP communication to get the remote proc information and I wanted to avoid that if at all possible.
Refs trac:564
This commit was SVN r12798.
The following Trac tickets were found above:
Ticket 564 --> https://svn.open-mpi.org/trac/ompi/ticket/564
r12714) for supporting compilers / architectures with different
padding rules.
This commit was SVN r12749.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r12491
r12714
hits the buffer on the other side. For this kind of BTLs we need to send
FIN through the same BTL, PUT was performed with so network will handle
ordering for us. If we will use another BTL, receiver can get FIN before
data will hit the buffer and complete request prematurely. We mark such
problematic BTLs with MCA_BTL_FLAGS_FAKE_RDMA flag (this kind of RDMA
is really fake, because the real one guaranties that sender will see the
completion only after receiver's NIC confirmed that all the data was
received).
This commit was SVN r12732.
It calls mca_pml_ob1_send_fin_btl() which may fail and doesn't check return
code. This breaks all RDMA transports event when only one BTL is used. Revert
it for now, I am working on a real fix for the problem (I hope).
This commit was SVN r12731.
The following SVN revision numbers were found above:
r12720 --> open-mpi/ompi@3e3689320b
regresion from v1.1 was reviewed and put to v1.2 branch. So revert this part
of r12721 back.
This commit was SVN r12730.
The following SVN revision numbers were found above:
r12433 --> open-mpi/ompi@82f7c0dd69
r12721 --> open-mpi/ompi@3edd850d2e
protocol when multiple NICS are available between 2 peers. The fix force
the FIN message to take exactly the same path as the fragment it describe
(i.e. same path means same BTL). Otherwise, the FIN can be received by
the peer before the RDMA complete and the request will get freed
too early.
This commit was SVN r12720.
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
Same sort of problem and fix as described in r12323 - mca_pml_ob1_recv_frag_progress() was segfaulting due to a NULL req_proc pointer. The path leading to this was through the mca_pml_ob1_check_cantmatch_for_match() function, where we can match a frag using the same macros as mca_pml_ob1_frag_match() and never initialize the req_proc pointer.
This commit was SVN r12582.
The following SVN revision numbers were found above:
r12323 --> open-mpi/ompi@c752502dee
is that if one add "pml=" to the configuration file, really bad things
happen. All PMLs will get initialize, and each of them will initialize
all BTLs. This patch force the mca_pml_base_pml to get initialized in
all cases before we go out of the mca_pml_base_open function.
This commit was SVN r12527.
set it up before the match when we know the peer, saving some
time on the critical path. If the receive is ANY_SOURCE then
we initialize the convertor on _MATCHED. Anyway, we will set it
up only once per receive.
This commit was SVN r12484.
This commit essentially caches the invoking comm/win/file on the
ompi_request_t. This, paired with the req_type field, allows us to
retrieve the invoking MPI object and invoke the proper errhandler.
The patch is missing most updates for the MPI-2 one-sided stuff (i.e.,
the patch mainly fixes comms and files); I didn't really understand
that code and didn't want to hazard trying to figure it out when Brian
can probably do it much more quickly.
So #250 will still stay open, pending MPI-2 one-sided updates for this
stuff.
This commit was SVN r12339.
The following Trac tickets were found above:
Ticket 250 --> https://svn.open-mpi.org/trac/ompi/ticket/250
* Create a new request type: NOOP (described below)
* For all MPI_*_INIT functions, OBJ_NEW an ompi_request_t and set its
type to NOOP
* Ensure that the NOOP requests are OBJ_RELEASE'd when they are done
* MPI_START looks at the request type; if NOOP, just return success. If
not, call the PML start() function
* MPI_STARTALL always pass the entire array of requests back to the PML
(see next point)
* Make the PMLs only process PML requests (i.e., ignore/skip anything
that isn't of type PML -- such as the NOOP requests)
* Add a little more param error checking in STARTALL
This commit was SVN r12338.
The following Trac tickets were found above:
Ticket 529 --> https://svn.open-mpi.org/trac/ompi/ticket/529
allocation logic is completely done outside the data-type engine (in the PML) there is
no need for any special case inside the data-type engine. There is less arguments for
the ompi_convertor_pack and ompi_convertor_unpack as well (the last field free_after is
not required anymore as there is no memory allocated in the engine itself). This change
affect all components using datatypes. I test most of them, but it might happens that I
miss some ... If it's the case please let me know (don't shoot the pianist!!).
This commit was SVN r12331.
A segfault would occur in mca_pml_ob1_recv_request_progress() when trying to prepare the convertor for unpacking, because the request's req_proc field was NULL.
Turns out that we weren't setting the req_proc field in the MCA_PML_OB1_CHECK_SPECIFIC_AND_WILD_RECEIVES_FOR_MATCH macro. Instead of just setting it there I removed the other place req_proc was being set correctly, and instead took care of all the cases at once in mca_pml_ob1_recv_frag_match().
This commit was SVN r12323.
parameter. For optimisation purpose only this BTL is used to send packet
through instead of trying to send packets through all BTLs. But actually the
code was wrong. It simply used provided bml_btl and it may represent different
endpoint from packet's destination. The fixed code checks if packet's
destination is reachable through the BTL, finds appropriate bml_btl and only
then tries to send it through correct bml_btl.
This commit was SVN r12319.
mentioned in the comment the completion/callback of the triggered
send operation can happen before the call returns. If this happens and
if the pipeline depth is 0 before we triggered the send operation and
this is the last send operation of the request then the completion detection
code will decrement the pipeline depth and check it for equality to 0.
Because (0-1) != 0 the pml completion function for this request will
*not* be called.
This part 2 of the fix for ticket #246.
This commit was SVN r12292.
all platforms. The only exceptions (and I will not deal with them
anytime soon) are on Windows:
- the write functions which require the length to be an int when it's
a size_t on all UNIX variants.
- all iovec manipulation functions where the iov_len is again an int
when it's a size_t on most of the UNIXes.
As these only happens on Windows, so I think we're set for now :)
This commit was SVN r12215.
size and diplacement of data-type. After this patch all data can contain size_t bytes
and the displacements are defined as ptrdiff_t. All of the files I was able to compile
have been modified to match this requirement.
This commit was SVN r12146.
First, move the OPAL_THREAD_LOCK out to the same level as its corresponding UNLOCK. It was possible to hit the UNLOCK without ever acquiring the lock.
Since the OPAL_THREAD_ADD64() is now protected by this lock, we can just do the decrement non-atomically.
This commit was SVN r11958.
Don't try to acquire ompi_request_lock here, which in all cases is already held. Avoids deadlock that occurs even when threads are enabled and we're running a THREAD_SINGLE app.
Reviewed by Galen.
This commit was SVN r11957.
The following Trac tickets were found above:
Ticket 183 --> https://svn.open-mpi.org/trac/ompi/ticket/183
long ago) supposed to be used as a cache for accessing the PML procs. But in
all of the PMLs the PML proc contain only one field i.e. a pointer to the ompi_proc.
This pointer can be accessed using the c_remote_group easily. Therefore, there is no
meaning of keeping the PML procs around. Slim fast commit ...
This commit was SVN r11730.
if we want to be able to reuse the request. If not, the request will never be freed
even if the user call MPI_Request_free.
This commit was SVN r11717.
different macros, one for each project. Therefore, now we have OPAL_DECLSPEC,
ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project.
This commit was SVN r11270.
through the c_pml_procs, as it might be an intercommunicator and therefore
c_my_rank might not be a valid index.
Fixes trac:266.
This commit was SVN r11238.
The following Trac tickets were found above:
Ticket 266 --> https://svn.open-mpi.org/trac/ompi/ticket/266
so use the req_buff field for keeping track of the bsend buffer and the
req_addr field for the user buffer, the way the comments suggested we
were doing it
This commit was SVN r11233.
the upperlayer assynchronously although there are some issues with this.. such
as there are multiple consumers of the btl's.. who get's the
This commit was SVN r11232.
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).
Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).
I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).
In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...
Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.
This commit was SVN r11204.
than $(LN_S). This causes problems with with Windows and probably
elsewhere (re: #200). So use a slightly different trick to get the
right header selected for the MEMCPY and TIMER components.
* Using the same trick used to solve the AC_CONFIG_LINKS problem,
stop using a separate header file for direct calling in the
PML and MTL. This lets me remove some icky code in ompi_mca.m4
that was more fragile than I really liked.
This commit was SVN r10841.
We now set truncation error if we received more than we delivered for both
the OB1 and DR PMLs (the CM PML doesn't need such a fix, as the condition
is set at the MTL level)
This commit was SVN r10812.
The following Trac tickets were found above:
Ticket 172 --> https://svn.open-mpi.org/trac/ompi/ticket/172
all but buffered and persistent requests. Unfortunately we were note able to
reuse the pml_base_request_t as it was just too heavy for our needs. Lots of
code for 2/10 usec ;-)
This commit was SVN r10810.
There was some old code regarding the convertor which does not have to be there
(the problem was corrected a while ago). In the PML we already know how the progress
function is defined, so call the BML progress instead, which will save one function
call.
The macro MCA_PML_OB1_COMPUTE_SEGMENT_LENGTH is already defined in the pml_ob1.h
so it should not be in the endpoint.h.
Remove a double definition of the mca_pml_ob1_progress function in the pml_ob1.h.
This commit was SVN r10775.
is the one provided by the user. For the buffered send the real datatype used
for the communication is always MPI_BYTE and the count can be retrieved from
the req_bytes_packed field. This will decrease the size of the request by
one pointer and one size_t (8 bytes or 16 bytes depending on the architecture).
This commit was SVN r10680.
bsend_request_init, but not both. Otherwise, you don't free
some buffer space and end up leaking buffers and ending in
badness
* since you only call alloc() or init(), but not both, need to
restore reference counting in init()
This commit was SVN r10674.
interconnects that provide matching logic in the library.
Currently includes support for MX and some support for
Portals
* Fix overuse of proc_pml pointer on the ompi_proc structuer,
splitting into proc_pml for pml data and proc_bml for
the BML endpoint data
* bug fixes in bsend init code, which wasn't being used by
the OB1 or DR PMLs...
This commit was SVN r10642.
standard). This macro allow us to specify the length of the fragment. Now we are
able to know how the message is fragmented between the network devices or inside
the communication protocol.
This commit was SVN r10508.