Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework.
Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one.
This commit was SVN r26678.
readable by normal human beings, vs. having a bitmap of physical
PU's). Use the new hwloc base prettyprint functions to generate the
output.
This commit was SVN r26533.
exits with status 77, the whole job will be killed, but mpirun will
still return an exit status of 77, so MTT will report it as a skip
anyway.
This commit was SVN r26445.
The following SVN revision numbers were found above:
r26413 --> open-mpi/ompi@02aa36f2e5
nonzero status (we polled other MPI implementations since one one in
the OMPI community had a concrete opinion on what behavior to do here
-- all other MPI's seem to adhere to this behavior, too).
This commit adds an MCA parameter that allows us to tell ORTE to
''not'' kill jobs when a process exits with a status of 77, meaning
the GNU testing standard of "this test was skipped". In all the OMPI
tests, all procs will either return 77 or not. So if they all return
77, mpirun won't consider it an error, but will still return an exit
status of 77 (so that MTT can know that the test was cleanly skipped).
This commit was SVN r26413.
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well.
This commit was SVN r26380.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.
Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations.
Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way.
This commit was SVN r25497.
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-(
This commit was SVN r25285.
To enable the epochs and the resilient orte code, use the configure flag:
--enable-resilient-orte
This will define both:
ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE
This commit was SVN r25093.
refactoring to improve readability and to compute num_procs_alive
correctly and to remove the use of loop iteration variables for
two loops nested one inside another (causing MPI_Comm_spawn_multiple
to fail).
This commit was SVN r24903.
Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list.
This commit was SVN r24894.
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.
Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.
This commit was SVN r24860.
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epoch to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.
This commit was SVN r24829.
Rename the memusage sensor plugin to "resusage" as it will soon be updated to include full process stat monitoring.
Extend the heartbeat sensor to report node and process stats in the heartbeat.
Store the process and node stats in their respective orte_xxx_t object.
This commit was SVN r24629.
Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI.
This commit was SVN r24614.
I want to thank Hugo Meyer for reporting this/these bugs.
Notes:
* Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation).
* When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function.
* When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery.
* The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup.
* If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading.
This commit was SVN r24317.
Restore the use of override_oversubscribe to indicate that the data source for resources on the backend nodes used in mapping is unreliable. In this situation (e.g., data came from hostfile, or we are just using localhost because nothing was provided), we don't trust the oversubscribe condition passed by the mapper. Instead, we check locally to ensure we set sched_yield correctly.
This commit was SVN r24130.
I did not want to make this change globally since there could be good reason to keep the check before calling SIGKILL that I am not seeing at the moment.
This commit was SVN r23821.
This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change.
Please note that configure requirements on components HAVE CHANGED. For example. a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation.
This commit was SVN r23764.
extravaganza.
= Short version =
This commit does several things, but the short version is that it
re-orients the error message creation of the ODLS default module to
generate error strings in the child process for errors that occur
after the fork but before the exec (such errors are ''usually''
related to paffinity). A show_help string is rendered in the child
and then IPC'ed up to the parent, who displays the string through
normal ORTE show_help aggregation mechanisms. We also broke up the
ginormous paffinity-setting logic into a few separate functions, both
to help us understand the code, and hopefully to ease future
maintenance.
The logic for the ODLS default binding should not have changed -- this
is mainly a code reshuffle and improvement on error reporting.
= Rationale =
The reasoning for this commit is complex. As mentioned above, it's
the first step in some paffinity cleanup. Here's the line of dominoes
that must fall (in this order):
1. Add hwloc paffinity component (already done).
1. While testing hwloc, we discovered that the error reporting from
the ODLS default module was abysmal. So we fixed it.
1. Further, we reorganized the code in the odsl_default_module.c a bit
to help our understanding of it.
1. We also discovered a few bugs in the original ODLS default module
logic that existed before this code shuffle; separate tickets
will be filed to fix them.
1. Next up will be some improvements to paffinity / odls default to
make the act of binding to a core ensure to bind to ''all''
hardware threads contained in that core (similar for sockets:
binding to a socket will bind to ''all'' hardware threads in that
socket).
1. Next will be improvements to paffinity to expose binding to
hardware threads through the paffinity framework API.
1. Finally, we'll expose these binding controls to the user (e.g.,
through mpirun command line arguments, MCA parameters, etc.).
This commit represents the first few bullets; the last 4 bullets are
being worked on right now, but there is no definite timeline for
completion.
= Miscelaneous =
A few points worth mentioning:
* We have tested this new code a bunch; we're pretty sure it behaves
just like the trunk -- but with better / more precise error
reporting. More testing is needed on a wider array of platforms,
however.
* A big comment at the top of odls_default_module.c explains the
(new) general scheme for the error reporting.
* The error reporting in the parent process is now really dumb;
almost all the intelligence about creating error messages is in the
child.
* The show_help file was renamed to be more consistent with other
help files (help-odls-default.txt -> help-orte-odls-default.txt)
* Removed the use of sched_yield() because of recent changes in the
Linux 2.6.3x kernels. We already had an #else clause for
select()'ing for 1us if we didn't have sched_yield() -- that is now
the only code path. This is not a performance-critical section of
the code, so this shouldn't be controversial.
* Replaced the macro-based error reporting with function-based
reporting. It's a bit more bulky, but it helped us understand the
code and saved us multiple times with compile-time parameter
checking, etc.
* Cleaned up the use of several show_help messages to ensure that
they mapped to real messages in help*.txt files.
This commit was SVN r23652.
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php
Documentation:
http://osl.iu.edu/research/ft/
Major Changes:
--------------
* Added C/R-enabled Debugging support.
Enabled with the --enable-crdebug flag. See the following website for more information:
http://osl.iu.edu/research/ft/crdebug/
* Added Stable Storage (SStore) framework for checkpoint storage
* 'central' component does a direct to central storage save
* 'stage' component stages checkpoints to central storage while the application continues execution.
* 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
* 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
* Added Compression (compress) framework to support
* Add two new ErrMgr recovery policies
* {{{crmig}}} C/R Process Migration
* {{{autor}}} C/R Automatic Recovery
* Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
* Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
* {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
* {{{OMPI_CR_Restart}}}
* {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
* {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
* {{{OMPI_CR_Quiesce_start}}}
* {{{OMPI_CR_Quiesce_checkpoint}}}
* {{{OMPI_CR_Quiesce_end}}}
* {{{OMPI_CR_self_register_checkpoint_callback}}}
* {{{OMPI_CR_self_register_restart_callback}}}
* {{{OMPI_CR_self_register_continue_callback}}}
* The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
* Add a progress meter to:
* FileM rsh (filem_rsh_process_meter)
* SnapC full (snapc_full_progress_meter)
* SStore stage (sstore_stage_progress_meter)
* Added 2 new command line options to ompi-restart
* --showme : Display the full command line that would have been exec'ed.
* --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
* Deprecated some MCA params:
* crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
* snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
* snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
* snapc_base_store_in_place deprecated, replaced with different components of SStore
* snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
* snapc_base_establish_global_snapshot_dir deprecated, never well supported
* snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
Minor Changes:
--------------
* Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
* Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
* Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
* Fixes trac:2208 : Honor various TMPDIR varaibles instead of forcing {{{/tmp}}}
* Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
* opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
* Cleanup the CRS framework and components to work with the SStore framework.
* Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
* Add 'quiesce' hook to CRCP for a future enhancement.
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
* Add optional application level INC callbacks (registered through the CR MPI Ext interface).
* Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
* {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
* {{{opal-restart}}} also support local decompression before restarting
* {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
* {{{orte-restart}}} now uses the SStore framework to work with the metadata
* Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
* Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
* Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
* Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
* odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
* Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
* Improve the checks for 'already checkpointing' error path.
* A a recovery output timer, to show how long it takes to restart a job
* Do a better job of cleaning up the old session directory on restart.
* Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment)
* Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
This commit was SVN r23587.
The following Trac tickets were found above:
Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
orte_local_chip_type and orte_local_chip_model in MPI processes it the
appropriate sysinfo module found the values on the machine.
This commit was SVN r23581.
This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all.
Thanks to Dong Ahn (LLNL) for catching this problem!
Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm.
This commit was SVN r23300.
* Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic
OPAL_ERR_NOT_SUPPORTED case.
* When odls_default detects that processor affinity is not supported,
it prints a specific message about it, and then it suppressed a
generic HNP help message that would normally follow it (i.e., it's
easier to have the "processor affinity is not supported" show_help
message last).
* Use some symbolic names in odls_default instead of fixed int's,
just for slight readability improvements in the code.
* Introduce orte_show_help_suppress(), which gives the ability to
suppress any future showings of any arbitrary show_help() message.
This is useful if you display message X and want to suppress
message Y. This suppression *only* works in environments where
orte_show_help() does coalescing.
This commit was SVN r23249.
* If < 0, it's an OPAL_ERR_* value
* If >= 0, it's the actual output value of the function
This is problematic for the OPAL_SOS stuff. This commit changes those
functions to always return OPAL_* statuses and send the output value
back through output parameters (like 95% of the rest of the code
base). This avoids the confusion with OPAL_SOS stuff and makes
paffinity work again (e.g., mpirun --bind-to-core ...).
I updated all paffinitiy modules for the new function signatures, and
bumped the paffinity API version up to 2.0.1. I don't think the
version change will matter, though, because we'll be introducing
support for hardware threads soon, which will either bump the
paffinity version again or we'll replace paffinity with
a new framework.
This commit was SVN r23197.
(OMPI_ERR_* = OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
back the native error code.
* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
(OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
decode 'ret' to get the native error code.
This commit was SVN r23162.
Add new mca params to test:
orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch
orte_debugger_test_attach: Test debugger colaunch after debugger attachment
To test co-launch at job start, just set the orte_debugger_test_daemon param.
To test co-launch upon attach:
set orte_debugger_test_daemon
set orte_debugger_test_attach=1
set orte_enable_debug_cospawn_while_running=1
set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching
Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon.
This commit was SVN r23144.
It is okay to not have a paffinity module IF you aren't using paffinity anyway. So don't error out of MPI_Init because a paffinity module wasn't selected.
Cleanup error reporting in the odls default module to (once and for all!) eliminate messages originating in the fork'd process. Create some new error codes to allow us to pass enough info back to the parent process to provide useful error messages.
This commit was SVN r23106.
1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis
2. orte_max_local_restarts - default max number of local restarts, can be overridden
3. orte_max_global_restarts - default max number of relocates, can be overridden
Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code.
This commit was SVN r23057.
Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly.
This commit was SVN r23045.
1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified.
2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set.
This support must be enabled by configuring --enable-sensors.
ompi_info and orte-info have been updated to include the new framework.
Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized.
Implementation continues.
This commit was SVN r23043.
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.
* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.
* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.
* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons
* update the routed modules to reflect that process state is updated in the errmgr
* ensure that the orted's open the errmgr and select their appropriate module
* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place
* define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired.
* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios
This commit was SVN r23023.