This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.
This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.
This commit closes trac:158
More details to follow.
This commit was SVN r14051.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r13912
The following Trac tickets were found above:
Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
MCA_BTL_SM_FRAG_SEND) and status success/fail in low bits of pointers we
are passing through circular buffer. The rank that receives ACK doesn't need
to look into data it received and this is a big win since this data is not in
the cache of the rank's CPU. (Note that we can use low bits of pointers because
free_list always return pointers aligned at least to cache line size).
This commit was SVN r13922.
allocated from mpool memory (which is registered memory for RDMA transports)
This is not a problem for a small jobs, but for a big number of ranks an
amount of waisted memory is big.
This commit was SVN r13921.
buffer fails. If cb is already allocated, but it is full and allocation of
additional cb fails, we spin waiting for receiver to free space in existing
cb.
This commit was SVN r13635.
* have the mpool size be based on MCW, not num procs
in other jobs we know about. Solves the problem of
the spawned job having a much bigger than needed
sm file
* Can't assume that "me" is in the list of procs
passed to addprocs, so need to use slightly different
logic and not go through all of add procs unless
there's a proc in my job that isn't me.
This seems to greatly improve the situation, although
there still seems to be more of a slowdown through
MPI_INIT for the children (if there are more than one
child) than MPI_INIT for the parent if there are 'n'
children compared to 'n' parents. Hopefully that
made sense ;)
This commit was SVN r13417.
allocation logic is completely done outside the data-type engine (in the PML) there is
no need for any special case inside the data-type engine. There is less arguments for
the ompi_convertor_pack and ompi_convertor_unpack as well (the last field free_after is
not required anymore as there is no memory allocated in the engine itself). This change
affect all components using datatypes. I test most of them, but it might happens that I
miss some ... If it's the case please let me know (don't shoot the pianist!!).
This commit was SVN r12331.
all platforms. The only exceptions (and I will not deal with them
anytime soon) are on Windows:
- the write functions which require the length to be an int when it's
a size_t on all UNIX variants.
- all iovec manipulation functions where the iov_len is again an int
when it's a size_t on most of the UNIXes.
As these only happens on Windows, so I think we're set for now :)
This commit was SVN r12215.
constrained:
* Make sure we always have a number of eager fragments available
that scales with the number of processes communicating with
a given proc over shared memory
* Use FREE_LIST_GET instead of FREE_LIST_WAIT to return an
error to the PML when resource exhaustion occurs
* Don't dereference the frag during alloc unless we're sure
it's not NULL
Reviewed by: Galen
Refs trac:413
This commit was SVN r12053.
The following Trac tickets were found above:
Ticket 413 --> https://svn.open-mpi.org/trac/ompi/ticket/413
In order to provide backwards compatability the framework versions are bumped
and the handler registeration function is at the end of the btl struct.
Testing done on sm, openib, and gm..
This commit was SVN r11256.
the upperlayer assynchronously although there are some issues with this.. such
as there are multiple consumers of the btl's.. who get's the
This commit was SVN r11232.
- move files out of toplevel include/ and etc/, moving it into the
sub-projects
- rather than including config headers with <project>/include,
have them as <project>
- require all headers to be included with a project prefix, with
the exception of the config headers ({opal,orte,ompi}_config.h
mpi.h, and mpif.h)
This commit was SVN r8985.
- Add big comment about a general overview of what the sm btl is doing
- random small code cleanups
- fix instances of mca_btl_sm[0] to mca_btl_sm[1] where relevant
- remove a lot of unused, confusing, and incorrect interface functions
from ompi_fifo.h and ompi_circular_buffer.h. These functions, if
they were used, would not work properly with the scheme that the sm
btl uses with the fifos (i.e., receiver makes right -- if necessary)
- add some missing offset computations in the fifo and circular buffers
- change the types of offsets to be ssize_t, not size_t
- remove an offset parameter from a function that didn't need it
This commit was SVN r8135.
- Ensure to convert base_shared_mem_flags to be a relative offset in
the global storage, and then to convert that back to an absolute
virtual address before we try to use it
- Don't double increment n_local_procs when calculating the peer rank
during bootstrapping of the different base address case
Something else is still wrong; if mmap() returns a different base
address, things don't work (i.e., segv or hang forever when you try to
send a message). More specifically, the bootstrapping now seems to
correctly handle the case when mmap() base addresses are different,
but the message passing does *not* -- it always assumes that the
mmap() base addresses are the same.
Still working on the fix for that -- want to checkpoint what has been
done so far to facilitate working on different machines...
This commit was SVN r8134.
memory initialization, call opal_progress() to push any pending events
around and possibly yield the processor if nothing entertaining is happening.
This should probably go to the 1.0 branch.
This commit was SVN r7808.
Correct the spring of the vpid problem (similar to the one in the SM PTL).
Add one more argument to the MCA_BTL_SM_FIFO_WRITE macro who will get passed down to the
MCA_BTL_SM_SIGNAL_PEER macro to allow it to have the fifo_fd file descriptor.
This commit was SVN r7305.
alignment of 0, then assume there will be no data segment and don't do
the checks to see if it will be beyond the end of the file.
This commit was SVN r6773.
The following SVN revision numbers were found above:
r6672 --> open-mpi/ompi@8b56769307
- Adjust btl sm to allocate just a few bytes extra to allow the common
sm component to assume that there will be a data segment (even though
the sm btl doesn't use the data segment in that portion of code)
This commit was SVN r6772.
multiple components to share a single mpool module (e.g., the
ptl/btl and coll sm components).
- Re-tool the ptl, btl, and coll sm components to first look for the
target mpool module, and if they don't find it, to create it.
- coll sm component now correctly identifies when it is supposed to
run or not (i.e., if all the processes in the communicator are on
the same host). Now we just need to fill in some algorithms. :-)
This commit was SVN r6530.
* rename ompi_malloc to opal_malloc
* rename ompi_numtostr to opal_numtostr
* start of rename of ompi_environ to opal_environ
This commit was SVN r6332.