interconnects that provide matching logic in the library.
Currently includes support for MX and some support for
Portals
* Fix overuse of proc_pml pointer on the ompi_proc structuer,
splitting into proc_pml for pml data and proc_bml for
the BML endpoint data
* bug fixes in bsend init code, which wasn't being used by
the OB1 or DR PMLs...
This commit was SVN r10642.
This moves the logic to create the symbolic links for:
- mpirun
- mpiexec
- ompi-ps
- ompi-clean
and their respective man pages to the ompi level from
the orte layer.
This is a bit pedantic, but orte shouldn't be doing the
work of ompi since that is a bit of an abstraction break.
Note: need to autogen.sh to get this. Sorry :(
This commit was SVN r10602.
yes this means it WAS possible for two nodes to choice two different algorithms
(discovered by Doug Gregor and figured out by George)
Also changed some names like size to comsize so we know which sizes we are using where
This should be updated in al versions
This commit was SVN r10601.
order the BTL depending on the real latency for the eager protocol. Starting from now, the
latency one can specify for the devices will be in micro-second, while the bandwidth is in Mbs
(as it was before).
This commit was SVN r10566.
mpi_leave_pinned when multiple OpenIB HCA ports are found.
Specifically, if mpi_leave_pinned == 1 and ultiple HCA ports are
found, the MCA parameter btl_openib_max_btls is set to 1. If the MCA
parameter btl_openib_warn_leave_pinned_multi_port is true, emit a
warning that this happened (having an MCA parameter to control the
warning allows users/sysadmins to turn it off instead of being nagged
for every run).
This commit was SVN r10521.
explicitly enabled at run-time with the mca parameter
io_romio_enable_parallel_optimizations set to something non-zero.
This will enable some magic flags in Panasas if the user didn't
set them (either on or off) and do some slightly better things
with strided collective writes.
This commit was SVN r10516.
standard). This macro allow us to specify the length of the fragment. Now we are
able to know how the message is fragmented between the network devices or inside
the communication protocol.
This commit was SVN r10508.
specified check that the put function is available for the BTL. Same safe check for
the GET function. At the end make sure that at least on communication protocol is
specified, otherwise force the send flag.
This commit was SVN r10507.
by the BTL (btl_max_rdma_size). Now the PUT protocol is pipelined even if there
is just one network between the 2 peers. Unfortunately, this problem is present
the 1.1 (no pipeline for the PUT protocol).
This commit was SVN r10499.
time.
UD is connectionless, and as long as peers are statically assigned to QPs,
there is no reason to set up the adressing information lazily.
Lots of code was axed, as endpoints no longer have state. Removed a
number of other elements in the endpoint struct to make it as lightweight
as possible.
I was able to remove an entire function call/branch in the send path,
which I believe is the main contributor to a 2us drop in NetPIPE latency.
Some whitespace cleanups as well.
Passes IBM test suite, and all but certain intel tests that were failing
before the change, over ob1 PML.
This commit was SVN r10494.
Moved a lot of the module-specific init from the component init to the module init.
Try keeping a pointer to reduce indexing, didn't seem to help - leaving in place
for now.
This commit was SVN r10485.
Playing around with OPAL_LIKELY/UNLIKELY, no real gains yet.
Reworked progress() to process many WC's at a time, as well
as immediately repost groups of receive buffers.
This commit was SVN r10481.
was smaller than the CACHE_LINE_SIZE. Here is the version that works.
In fact this works on 2 steps. First we set the element size to something
multiple of the desired alignment. Then when we allocate memory, we compute
the total size, and we will align each of the elements (we allocate
multiple of them every time) to the CACHE_LINE_SIZE.
This commit was SVN r10479.
bytes). The simplest way to make sure they are aligned is to update
the size of the basic element to a multiple of the desired alignment.
It will use a little bit more memory, but the improvements on the SM BTL
seems quite interesting.
This commit was SVN r10478.
cannot include the PMPI_WTIME|WTICK functions in the external and
double precision statements because some compilers complain about
this. Instead, we need to use the macro that is defined by
configure.ac (MPIF_H_PMPI_W_FUNCS). This unfortunately means that we
need to generate mpif.h (in addition to mpif-config.h) because the
"external" statement is toxic to F90 compilers.
This commit was SVN r10464.
with the other methodology even if there are no choice buffers and no
special constants. But it keeps the Makefile.am simple and the
methodology consistent.
This commit was SVN r10462.
was that declaring the type of MPI_WTICK and MPI_TIME in mpif-common.h
would allow the F90 bindings to call through to the back end f77
function and have the right return type. But upon reflection, that's
silly -- we were just declaring the variables MPI_WTICK and MPI_WTIME
that were of type double precision. Duh.
So add some fixed (non-generated) wrapper F90 functions to call the
back-end *C* MPI_WTICK and MPI_TIME functions (vs. the back end *F77*
functions). We have to call the back-end C functions because there's
a name conflict if we try to call the back-end F77 functions -- for
the same reasons that we can't "implicitly" define MPI_WTIME and
MPI_WTICK in the f90 module, we can't call such an implicitly-defined
function. So we had to add new back-end C functions that are directly
callable from Fortran, the easiest implementation of which was to
provide 4 one-line functions for each (rather than muck around with
weak symbols).
This commit was SVN r10448.
1. ompi/mca/btl/udapl/btl_udapl_proc.c should be including
btl_udapl_endpoint.h for mca_btl_udapl_proc_insert function.
2. btl_udapl_endpoint.c it looks like you are using
&endpoint->endpoint_lock when you should use &ep->endpoint_lock in a
OPAL_THREAD_LOCK call.
3. btl_udapl_frag.h has a couple opal_list_item_t's that should be
ompi_free_list_item_t in the _FRAG_ALLOC_{EAGER,MAX} macros.
This commit was SVN r10442.
call to the _UNPACK macro, in the case where the length of the received data
is zero. This might happens on the PUT protocol.
This commit was SVN r10431.
mpi_leave_pinned when multiple OpenIB HCA ports are found.
Specifically, if mpi_leave_pinned == 1 and ultiple HCA ports are
found, the MCA parameter btl_openib_max_btls is set to 1. If the MCA
parameter btl_openib_warn_leave_pinned_multi_port is true, emit a
warning that this happened (having an MCA parameter to control the
warning allows users/sysadmins to turn it off instead of being nagged
for every run).
This commit was SVN r10424.
send/receive queues if there is something available in the short message
rdma queues, then we have to poll *ALL* the rdma queues before exiting,
or we aren't fair about frag reception and fall into degenerate matching
cases.
This commit was SVN r10410.
where to look for registrations that were used in the alloc/free code don't
work (because the memory returned from malloc() -- whowever gets around to
calling it) might actually be registered already. So just call malloc
and free directly and avoid the whole issue when leave pinned is on. After
all, you have to pay the registration cost sometime, and if leave pinned
is on, you only have to pay it once. It makes things much simpler to
have that once be at first use rather than during ALLOC_MEM, and as far
as I can read, we're still standards conformant this way.
This commit was SVN r10406.
correctly compute the final checksum. This is not a bug in the case where
both the sender and the receiver execute EXACTLY the same checksum
computations but is definitively a problem if not (such as the buffered case).
This commit was SVN r10367.
with the last one was that the resized function only set the soft lb and ub
markers without actually moving the usefull data up to the correct
displacement. Using a struct instead solve the problem. Anyway, as defined
in the MPI standard we have to set the lower bound and the upper bound
of the new type to the correct values too.
This commit was SVN r10328.
- add more comments on the pack and unpack functions.
- remove all pack/unpack versions that are not used anymore.
- other various cleanups.
- update the safeguard macro (which compute theboundaries of the
datatype in order to protect us from accessing memory locations
outside of the data).
- for the contiguous (with or without gaps) pack and unpack correctly
compute the starting point.
This commit was SVN r10327.
- Added some basic flow control to limit number of posted sends.
- Merged endpoint send/recv lock into single endpoint lock.
- Set the LMR triplet length in the send path, not at allocation time.
This has to be done because upper layers might send less than the
amount allocated.
- Alter the tie-breaker if statement protecting the second call
to dat_ep_connect(). The logic was reversed compared to the tie-
breaker for the first dat_ep_connect(), making it possible for
3 or more processes to form a deadlock loop.
- Some asserts were added for debugging purposes.. leaving them
in place for now.
This commit was SVN r10317.
MPI_FILE_GET_INFO should return the info currently in use, not the one
used to create the file handle. ROMIO adds a bunch of keys, so you can
create a file handle with MPI_INFO_NULL and have MPI_FILE_GET_INFO return
something totatlly different.
This commit was SVN r10312.
Sender can receive and complete PUT request before it gets completion on the first rndv packet. senreq struct may be reused for the next MPI_Send and unexpected completion mess up the things. I sometimes got SEGV and sometimes data corruption.
This commit was SVN r10301.
lower_bound is now directly added to the user pointer when the convertor
is created, instead of having to add it all over the places inside the
pack/unpack functions.
This commit was SVN r10292.
* Change the type of Fortan's MPI_STATUSES_IGNORE to double complex
so that it will never possibly be mistaken for a real status (i.e.,
integer(MPI_STATUS_SIZE)), particularly in the F90 bindings. See
comment in mpif-common.h explaining this (analogous argument to
MPI_ARGVS_NULL for MPI_COMM_SPAWN_MULTIPLE).
* Add second interfaces for the following functions that take a double
complex (i.e., MPI_STATUSES_IGNORE). This required adding the second
interface in mpi-f90-interfaces.h[.sh] and then generating new wrapper
functions to call the back-end F77 function for each of these four, so
we added 4 new files in ompi/mpi/f90/scripts/ and updated the various
Makefile.am's to match:
* MPI_TESTALL
* MPI_TESTSOME
* MPI_WAITALL
* MPI_WAITSOME
The XSL is now not in sync with the scripts. Although I suppose that
that is becoming less and less important (because it does not impact
the end user at all -- to be 100% explicit, no release should ever be
held up because the XSL is out of sync), but it will probably be
important when we go to fix the "large" interface; so it's still worth
fixing... for now...
This commit was SVN r10281.
btl_openib_ib_mtu and btl_mvapi_ib_mtu MCA params by showing the valid
values what what they represent (got a question about this from Cisco
testing engineers).
This commit was SVN r10277.
UD is the Unreliable Datagram transport for Infiniband, specifically OpenIB. This BTL is derived from the existing openib BTL, which is RC (Reliable Connection) based.
Still a work in progress, as there is a lot of work left to do. Specifically, performance, scalability, and flow control need to be addressed.
Currently I'm playing around with different methods for handling receive buffers, as well as profiling to figure out where the time is going.
This commit was SVN r10271.
the cases. Instead replace it with a better solution, which work even for
fragments received not in order. However, this solution work only on the
current supported modes in ompi (homogeneous & heterogeneous with endianess).
The method is tricky. We will rely on 2 partial unpacks. First we will find
a byte that is not on the data to unpack, and we will pad the data with this
byte. Once we have the full length as expected, we will unpack the data, and
all the bytes in the unpacked form which do not match the unused byte will be
copied into the user buffer. This way we will reconstruct the unpacked data
in 2 times, once for the begining and once for the end.
This commit was SVN r10270.
they will be contiguous even when a multiple of them are send. This is the difference
between the NO_GAPS and CONTIGUOUS flags: contiguous one suppose that the data might
have gaps in the begining and/or at the end but the content of the data is contiguous.
This commit was SVN r10266.
pack is if the data has the BASIC flag which means it is predefined and contiguous.
For the unpack the convertor has to be homogeneous plus the same requirements
as for the pack.
This commit was SVN r10263.
1) don't need tree if memory is just malloc'd
2) fix memory and free list leak..
3) deregister first and then free... doh..
This commit was SVN r10251.
Added a tree to track memory allocation from MPI_Alloc_mem, this allows us to
free the registrations in a sane fashion.. also should be faster..
This commit was SVN r10248.
user specify a count equal to zero it will get back a datatype with the size, lb, ub,
true_lb and true_ub set to zero (very similar to the MPI_DATATYPE_NULL except it can
be used for communications).
This commit was SVN r10225.
to Fortran). Add a new global variable, which keep track of all MPI predefined types.
This variable include all optional types, and is depend on the system where OMPI is
compiled. Use this variable to correctly find out the size match type.
This commit was SVN r10204.
support for progress threads, so we shouldn't build them or try to use
them when support for progress threads has been requested. The TCP, GM,
SELF, and SM BTLs should have progress thread support, so they aren't
disabled. The Portals BTL isn't compiled on platforms with threads,
so it doens't need to be updated.
This commit was SVN r10156.
Do this rather than the my_list pointer because we need to do some
things that are somewhat special because we pre-pin eager fragments but
not send fragments. Also makes a couple ideas I have slightly easier to
play around with.
This commit was SVN r10127.
- Make the F90 bindings compile and link properly with gfortran 4.0,
4.1, Intel 9.0, PGI 6.1, Sun (don't know version offhand -- the most
current as of this writing, I think), and NAG 5.2, although some
have limitations (e.g., NAG can't seem to handle the medium and
large sizes)
- Building the F90 "small" module size is now the default, even for
developers
- Split up mpif.h into multiple files because parts of it were toxic
to the F90 bindings
- Properly specify unsized/unshaped arrays to make the bindings work
on all known compilers
- Make ompi_info show Fortran 90 bindings size
- XML somewhat lags the generated scripts as of this commit, but
functionality was my main goal -- the XML can be updated later (if
at all).
This commit was SVN r10118.
predefined datatypes) the optimized description was set to NULL instead of
pointing to some valid description. As for some data, having an optimized
version is not possible (as no optimizations bring any benefit), we have
to make sure this field (opt_desc) is always correctly initialized.
This commit was SVN r10112.
Instead of figuring out which free list the fragment belongs to based on size
we simply store a pointer to the list which it belongs in the fragment.
This was reviewed by Brian and should hit all the branches.
This commit was SVN r10072.
Trying to remember what I did here.. eager/max messages should work now, no RDMA yet. A number of other fixes and cleanups.
I do know of two problems:
Bad stuff happens when flooded with send frags too quickly - the BTL doesn't handle flow control.
Certain IBM tests turn up a length assertion in the datatype engine - needs more investigation.
This commit was SVN r10070.
- split mpif.h into mpif.h and mpif-common.h[.in]
- mpif-common.h is included by various f90 things and contains output
from configure
- mpif.h defines some f77-specific stuff and then includes
mpif-common.h
This commit was SVN r9997.
less than we posted for). We were passing by value, so this update was
not being propgated back up the stack and we could segfault. Make the
count argument a pointer so that updates will be passed as expected.
This needs to go to the v1.1 branch
This commit was SVN r9991.
comment in ompi_comm_invalid() in
source:/trunk/ompi/communicator/communicator.h.
Short version:
- ompi_comm_invalid() returns TRUE for MPI_COMM_NULL
- therefore MPI_COMM_C2F needs to explicitly check for MPI_COMM_NULL
(because it uses ompi_comm_invalid())
- make ~20 MPI functions only call ompi_comm_invalid() instead of
calling ompi_comm_invalid() *and* checking for MPI_COMM_NULL (~40 MPI
functions already only called ompi_comm_invalid() -- we should be
consistent)
- similar issue for ompi_win_invalid(), so I added a cross-referencing
comment in win.h and fixed MPI_WIN_SET_NAME to only call
ompi_win_invalid() (and not check for MPI_WIN_NULL)
This commit was SVN r9970.
2. fix for MPI_Free_mem, was calling deregister but never called mpool_free.. so
we leaked memory. Still an open issue here though, if the memory is alloc'd
and the mpool doesn't create and cache a registration, we will never find the
mpool to free with.
This commit was SVN r9944.
derefence through it. It is legal for endpoint_addr to be NULL in the
destructor because if btl_tcp_add_procs() -> btl_tcp_proc_insert()
returns UNREACH, then endpoint_addr will be NULL and we'll OBJ_RELEASE
it.
This commit was SVN r9940.
apparently don't work properly: r9869, r9868 (sm btl alignment issues)
This commit was SVN r9936.
The following SVN revision numbers were found above:
r9868 --> open-mpi/ompi@9b985c3216
r9869 --> open-mpi/ompi@adedf511fb
easier to do event accounting that way
* greatly increase receive event and buffer sizes. We're still about half
of what Cray defaults to, so I don't feel bad about the increases
* Implement a pre-pinning optimization for eager fragments - will be
pinned on first use and left pinned for the life of the fragment
* Since we can't have two receive frag callbacks fired at the same time,
don't have receive free list - just keep one receive fragment in the
module. Saves a big free list and all that interaction.
This commit was SVN r9915.
things I found:
- Locking should prevent it from happening (I think), but there was a
race condition in the component progress -- a callback could be
triggered that would free the request before it was off the outstanding
requests list.
- When pulling a request off the component free list, make sure to
reinitialize the free_called state on the IO request. This was
what was causing Edgar's failures
- In the request cleanup code, pull the request out of the per-
component free list before returning to the free list. This
probably would cause asserts to fire, although it looks like
I wrote the loops such that it would have been memory safe if
the asserts didn't fire. Not really sure why I did that, but
let's try it again...
This should go to the v1.0 and v1.1 branches.
This commit was SVN r9913.
(which is currently the default, although we may argue over this later
:-) ), a new field in the ompi_proc_t named proc_hostname will have
the string hostname of that peer. If 0, this field will be NULL.
This allows for printing nicer error messages in environments where
peer hostnames are not otherwise easily obtainable, such as the mvapi
BTL (requested by Sandia, who has both a *huge* number of nodes and
6GB of RAM per node, so they don't care about the extra memory usage
;-) ).
This commit was SVN r9902.
being null, if the constant MPI_ERRCODES_IGNORE is defined as (void *)
NULL;
- the communicator in file open has to be an intra-communicator.
This commit was SVN r9893.
right now. Some testing on large NUMA machines should be done in order
to make sure that we need to export this variable out to the MCA layer.
This commit was SVN r9868.
- We had a bad conditional choice, such that asking for pvfs2 would
result in pvfs trying to build as well, which was going to fail.
- We didn't try to link in the libray for PVFS2's adio component.
- We were clobbering romio_flags, so it was impossible to pass
flags to romio (like the selection of filesystems)
This commit was SVN r9854.
- LDFLAGS set at the top level of Open MPI were not passed to the
ROMIO configure script
- If ROMIO was explicitly required (with --enable-io-romio) and
not able to be built, abort OMPI's configure script.
This needs to go to the v1.0 and v1.1 branches.
This commit was SVN r9845.
- change MASK behavior for tags - we need the upper bit to be whether
the tag is reseved or not. MPI_ANY_TAG should not pull off any
reserved tag communication
- some other random debugging output to try to get some idea what is
spewing out of here.
This commit was SVN r9844.
message from a known peer (not MPI_ANY_SOURCE) then we can attach the
remote proc and initialize the convertor as soon as we know the data-type,
and the count (so basically in the _INIT macro). If it's not the case, then
create them in the _MATCHED macro (as in the original version). Of course,
beforeinitializing the convertor we check that there will be some data
in the message.
This commit, plus the convertor improvements from few days ago, lower the
latency for my test case environment (mvapi) by 0.1 microseconds. The convertor
now is as slim as it can be, I don't think there is anything else to
remove/improve.
This commit was SVN r9843.
function are stored in a single location, th master convertor. With the
old information (mainly the remote sizes for each predefined data-type)
now we know everything we need about the remote peers.
This commit was SVN r9829.