http://www.open-mpi.org/community/lists/devel/2008/04/3779.php
{{{
svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play .
}}}
Any components not in the trunk, but in one of the effected frameworks *must* be
updated. Contact the list, look at the RFC, or look at the diff for how to do this.
Sorry for the early commit of this, but I wanted to get it in today (per RFC) and
didn't know if I would have a chance later today.
This commit was SVN r18381.
* Extension to the ESS framework to support C/R
* Fixed support for {{{snapc_base_establish_global_snapshot_dir}}}
* Fixed FileM support
* Misc. minor code modifications
There are some outstanding visability issues that I want to fix next.
This commit was SVN r17725.
This commit also cleans up the checkpoint and terminate case making it more
precise than before. Previously the application could make a small amount of
progress between checkpoint completion and application termination. Now the
application will make no progress at all in this time span.
Additional minor change:
- Start using OPAL_INT_TO_BOOL instead of if/else logic
This commit was SVN r16952.
and implementation. This has shown drastic performance benefit when
transferring Many files at roughly the same time.
I tested this for many different filem operations and everything was working
fine. Let me know if you have any problems with this functionality.
Some Notes:
- opal-checkpoint now has a 'quiet' flag to keep it from being too verbose.
- FileM RSH component is fully non-blocking.
- FileM RSH component has incomming connection throttling since by default
ssh only allows 10 concurrent scp connections to any single host. This
default can be adjusted via an MCA parameter.
{{{-mca filem_rsh_max_incomming 10}}}
- There is an MCA parameter for max outgoing connections, but it is currently
not implemented. If someone needs it then it should not be hard to implement.
{{{-mca filem_rsh_max_outgoing 10}}}
- Changed the FileM request structure so that it is a bit more explicit and
flexible.
- Moved the 'preload-binary' and 'preload-files' functionality into odls/base
allowing for code reuse in the 'process' and 'default' ODLS components.
- Fixed a bug in the process name resolution which broke the 'preload-*'
functionality due to GPR table structure changes.
- The FileM RSH component might be able to see even more speedup from using a
thread pool to operate on the work_pool structures, but that is for future
work.
- Added a 'opal-show-help' file to ODLS Base
This commit was SVN r16252.
performance characterization, and should not be used by anyone doing anything
else since it will not produce a globally consistent checkpoint in this mode.
This commit was SVN r16192.
It is masking a bug that I'm tracking down in the SNAPC FULL - FILEM interations
Also make sure to cleanout the filem structure before asking for another
checkpoint file when not storing the files in place.
This commit was SVN r16109.
r15390 - Changed the paradigm in which the runtime worked by enabling the mpirun
process to become an orted and spawn processes. This broke the C/R for this
special case as it required that the orted start the process, and that
the hierarchy remains.
The fix was to allow the global coordinator to be a local coordinator as well
for this case.
r15528 - Changed the selection logic for the RML. This caused the application to
segv if the 'ftrm' wrapper component was selected as it tried to modify a NULL
pointer.
The fix was to move the 'module swap' code into the init() function, and swap
when passed a NULL pointer. It sounds bad, but actually cleans up the code a bit
more.
Still have to fix the 'routed' framework.
This commit was SVN r15566.
The following SVN revision numbers were found above:
r15390 --> open-mpi/ompi@bd65f8ba88
r15528 --> open-mpi/ompi@39a6057fc6
Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point.
Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings.
This commit was SVN r15517.
Create a sentinel value in the metadata file to clearly indicate
that the sequence number is complete (versus in progress). This
way we do not try to restart from an invalid sequence number
which can lead to badness.
This commit was SVN r14423.
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.
This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.
This commit closes trac:158
More details to follow.
This commit was SVN r14051.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r13912
The following Trac tickets were found above:
Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158