This commit is the first of several steps in a paffinity makeover
extravaganza.

= Short version =

This commit does several things, but the short version is that it
re-orients the error message creation of the ODLS default module to
generate error strings in the child process for errors that occur
after the fork but before the exec (such errors are ''usually''
related to paffinity). A show_help string is rendered in the child
and then IPC'ed up to the parent, which displays the string through
the normal ORTE show_help aggregation mechanisms.

We also broke up the ginormous paffinity-setting logic into a few
separate functions, both to help us understand the code and,
hopefully, to ease future maintenance. The logic for the ODLS default
binding should not have changed -- this is mainly a code reshuffle
and an improvement in error reporting.

= Rationale =

The reasoning for this commit is complex. As mentioned above, it's
the first step in some paffinity cleanup. Here's the line of dominoes
that must fall (in this order):

1. Add hwloc paffinity component (already done).
1. While testing hwloc, we discovered that the error reporting from
   the ODLS default module was abysmal. So we fixed it.
1. Further, we reorganized the code in odls_default_module.c a bit
   to help our understanding of it.
1. We also discovered a few bugs in the original ODLS default module
   logic that existed before this code shuffle; separate tickets will
   be filed to fix them.
1. Next up will be some improvements to paffinity / odls default so
   that binding to a core binds to ''all'' hardware threads contained
   in that core (similar for sockets: binding to a socket will bind
   to ''all'' hardware threads in that socket).
1. Next will be improvements to paffinity to expose binding to
   hardware threads through the paffinity framework API.
1. Finally, we'll expose these binding controls to the user (e.g.,
   through mpirun command line arguments, MCA parameters, etc.).

This commit represents the first few bullets; the last 4 bullets are
being worked on right now, but there is no definite timeline for
completion.

= Miscellaneous =

A few points worth mentioning:

* We have tested this new code a bunch; we're pretty sure it behaves
  just like the trunk -- but with better / more precise error
  reporting. More testing is needed on a wider array of platforms,
  however.
* A big comment at the top of odls_default_module.c explains the
  (new) general scheme for the error reporting.
* The error reporting in the parent process is now really dumb;
  almost all the intelligence about creating error messages is in
  the child.
* The show_help file was renamed to be more consistent with other
  help files (help-odls-default.txt -> help-orte-odls-default.txt).
* Removed the use of sched_yield() because of recent changes in the
  Linux 2.6.3x kernels. We already had an #else clause for
  select()'ing for 1us if we didn't have sched_yield() -- that is now
  the only code path. This is not a performance-critical section of
  the code, so this shouldn't be controversial.
* Replaced the macro-based error reporting with function-based
  reporting. It's a bit more bulky, but it helped us understand the
  code and saved us multiple times with compile-time parameter
  checking, etc.
* Cleaned up the use of several show_help messages to ensure that
  they mapped to real messages in help*.txt files.

This commit was SVN r23652.
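The child-side error reporting described in the commit message can be illustrated with a small standalone sketch. This is not the code from odls_default_module.c (that diff is not shown on this page); the function name launch_with_child_errors and the message wording are hypothetical, and plain snprintf() stands in for ORTE's show_help rendering. It assumes a POSIX system and only shows the general pattern: the child renders the error text after fork() but before a successful exec, writes it up a pipe, and the parent simply relays whatever it reads.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not an ORTE function.  Launches `path`; returns the
 * child's pid on success, or -1 with the error text placed in `errbuf`. */
pid_t launch_with_child_errors(const char *path, char *const argv[],
                               char *errbuf, size_t errlen)
{
    int fds[2];
    assert(errbuf != NULL && errlen > 0);
    errbuf[0] = '\0';

    if (pipe(fds) != 0) {
        snprintf(errbuf, errlen, "pipe failed: %s", strerror(errno));
        return -1;
    }
    /* The write end closes automatically on a successful exec, so the
     * parent sees EOF with no data when the launch worked. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0) {
        snprintf(errbuf, errlen, "fcntl failed: %s", strerror(errno));
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        snprintf(errbuf, errlen, "fork failed: %s", strerror(errno));
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: binding and other pre-exec setup would happen here. */
        close(fds[0]);
        execv(path, argv);

        /* Only reached if exec failed: render the message in the child. */
        char msg[256];
        int n = snprintf(msg, sizeof(msg), "exec of %s failed: %s",
                         path, strerror(errno));
        if (n > 0) {
            size_t len = (size_t)n < sizeof(msg) ? (size_t)n : sizeof(msg) - 1;
            ssize_t w = write(fds[1], msg, len);
            (void)w; /* nothing more the child can do if this fails */
        }
        _exit(1);
    }

    /* Parent: deliberately "dumb"; it just collects the child's text. */
    close(fds[1]);
    size_t used = 0;
    while (used < errlen - 1) {
        ssize_t r = read(fds[0], errbuf + used, errlen - 1 - used);
        if (r > 0) {
            used += (size_t)r;
        } else if (r < 0 && errno == EINTR) {
            continue;
        } else {
            break; /* EOF or a hard read error */
        }
    }
    errbuf[used] = '\0';
    close(fds[0]);

    if (used == 0) {
        return pid; /* exec succeeded; the caller reaps the child later */
    }
    waitpid(pid, NULL, 0); /* reap the failed child */
    return -1;
}
```

A caller would pass a buffer for the message and, on a -1 return, hand that text to whatever display or aggregation layer it uses. Keeping the message construction in the child means the parent does not need to know which pre-exec step failed.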
Parent: 2c03554fe7
Commit: 207ca2d928
@@ -3,6 +3,7 @@
# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2010 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@@ -36,28 +37,39 @@ Will continue attempting to launch the process.
The xterm option was asked to display a rank that is larger
than the number of procs in the job:

Rank: %d
#procs: %d
Rank: %d
Num procs: %d

Note that ranks start with 0, not 1, and must be specified
accordingly.
#
[orte-odls-base:show-bindings]
System has detected external process binding to cores %04lx

System has detected external process binding to cores %04lx.
#
[orte-odls-base:warn-not-bound]
[warn not bound]
A request to bind the processes to a %s was made, but the operation
resulted in the processes being unbound. This was most likely caused
by the following:

%s
%s

This is only a warning that can be suppressed in the future by
setting the odls_warn_if_not_bound MCA parameter to 0. Execution
will continue.

Local host: %s
Action requested: %s
Slot list: %s
Application name: %s
Action requested: %s %s
#
[error not bound]
A request to bind the processes to a %s was made, but the operation
resulted in the processes being unbound. This was most likely caused
by the following:

%s

This is an error; your job will now abort.

Local host: %s
Application name: %s
Action requested: %s %s
@@ -9,6 +9,7 @@
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2010 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@@ -16,7 +17,7 @@
# $HEADER$
#

dist_pkgdata_DATA = help-odls-default.txt
dist_pkgdata_DATA = help-orte-odls-default.txt

sources = \
odls_default.h \
@@ -1,47 +0,0 @@
# -*- text -*-
#
# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English general help file for Open RTE's orted launcher.
#
[odls-default:could-not-kill]
WARNING: A process refused to die!

Host: %s
PID: %d

This process may still be running and/or consuming resources.
#
[odls-default:could-not-send-kill]
WARNING: A process refused the kill SIGTERM signal!
This should never happen unless the application is changing the
parent/child relationship permissions.

Host: %s
PID: %d
Errno: %d

This process may still be running and/or consuming resources.
#
[binding not supported]
Open MPI tried to bind a new process, but process binding is not
supported on the host where it was launched. The process was killed
without launching the target application.

Local host: %s
Application name: %s
96
orte/mca/odls/default/help-orte-odls-default.txt
Normal file
@@ -0,0 +1,96 @@
# -*- text -*-
#
# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
# Copyright (c) 2010 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is a US/English help file.
#
[execve error]
Open MPI tried to fork a new process via the "execve" system call but
failed. This is an unusual error because Open MPI checks many things
before attempting to launch a child process. This error may be
indicative of another problem on the target host. Your job will now
abort.

Local host: %s
Application name: %s
#
[binding not supported]
Open MPI tried to bind a new process, but process binding is not
supported on the host where it was launched. The process was killed
without launching the target application. Your job will now abort.

Local host: %s
Application name: %s
#
[binding generic error]
Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.

Local host: %s
Application name: %s
Error message: %s
Location: %s:%d
#
[bound to everything]
Open MPI tried to bind a new process to a specific set of processors,
but ended up binding it to *all* processors. This means that the new
process is effectively unbound.

This is only a warning -- your job will continue. You can suppress
this warning in the future by setting the odls_warn_if_not_bound MCA
parameter to 0.

Local host: %s
Application name: %s
Location: %s:%d
#
[slot list and paffinity_alone]
Open MPI detected that both a slot list was specified and the MCA
parameter "paffinity_alone" was set to true. Only one of these can be
used at a time. Your job will now abort.

Local host: %s
Application name: %s
#
[iof setup failed]
Open MPI tried to launch a child process but the "IOF child setup"
failed. This should not happen. Your job will now abort.

Local host: %s
Application name: %s
#
[not bound]
WARNING: Open MPI tried to bind a process but failed. This is a
warning only; your job will continue.

Local host: %s
Application name: %s
Error message: %s
Location: %s:%d
#
[syscall fail]
A system call failed that should not have. In this particular case,
a warning or error message was not displayed that should have been.
Your job may behave unpredictably after this, or abort.

Local host: %s
Application name: %s
Function: %s
Location: %s:%d
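Each topic in the help file above is a printf-style template whose placeholders are filled in by the caller. The following sketch shows how the [binding generic error] topic might be rendered. It is an illustration only: render_help and the BINDING_GENERIC_ERROR macro are hypothetical names, the template is pasted in as a string literal rather than looked up by file and topic as ORTE's show_help machinery does, and whitespace follows the text as displayed on this page. The format attribute (GCC/Clang only) is one way a function-based reporter can get the compile-time parameter checking the commit message mentions.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Body of the [binding generic error] topic, as shown in the diff above. */
#define BINDING_GENERIC_ERROR \
    "Open MPI tried to bind a new process, but something went wrong. The\n" \
    "process was killed without launching the target application. Your job\n" \
    "will now abort.\n" \
    "\n" \
    "Local host: %s\n" \
    "Application name: %s\n" \
    "Error message: %s\n" \
    "Location: %s:%d\n"

/* Hypothetical renderer, not an ORTE function.  With GCC or Clang, the
 * attribute makes the compiler check the arguments against the template. */
int render_help(char *out, size_t outlen, const char *tmpl, ...)
#if defined(__GNUC__)
    __attribute__((format(printf, 3, 4)))
#endif
    ;

int render_help(char *out, size_t outlen, const char *tmpl, ...)
{
    va_list ap;
    va_start(ap, tmpl);
    int n = vsnprintf(out, outlen, tmpl, ap);
    va_end(ap);
    /* Report truncation or an encoding error as failure. */
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

A caller would pass the host name, application name, error text, and the __FILE__ / __LINE__ of the failure site, matching the order of the placeholders in the topic.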
File diff suppressed because it is too large