# -*- text -*-
#
# Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2007-2010 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English general help file for Open RTE's orterun.
#
[orterun:init-failure]
Open RTE was unable to initialize properly. The error occurred while
attempting to %s. Returned value %d instead of ORTE_SUCCESS.
[orterun:usage]
%s (%s) %s

Usage: %s [OPTION]... [PROGRAM]...
Start the given program using Open RTE

%s

Report bugs to %s
[orterun:version]
%s (%s) %s

Report bugs to %s
[orterun:allocate-resources]
%s was unable to allocate enough resources to start your application.
This might be a transient error (too many nodes in the cluster were
unavailable at the time of the request) or a permanent error (you
requested more nodes than exist in your cluster).

While probably only useful to Open RTE developers, the error returned
was %d.
[orterun:error-spawning]
%s was unable to start the specified application. An attempt has been
made to clean up all processes that did start. The error returned was
%d.
[orterun:appfile-not-found]
Unable to open the appfile:

%s

Double check that this file exists and is readable.
[orterun:executable-not-specified]
No executable was specified on the %s command line.

Aborting.
[orterun:multi-apps-and-zero-np]
%s found multiple applications specified on the command line, with
at least one that failed to specify the number of processes to execute.
When specifying multiple applications, you must specify how many processes
of each to launch via the -np argument.
[orterun:nothing-to-do]
%s could not find anything to do.

It is possible that you forgot to specify how many processes to run
via the "-np" argument.
[orterun:call-failed]
%s encountered a %s call failure. This should not happen, and
usually indicates an error within the operating system itself.
Specifically, the following error occurred:

%s

The only other available information that may be helpful is the errno
that was returned: %d.
[orterun:environ]
%s was unable to set
%s = %s
in the environment. Returned value %d instead of ORTE_SUCCESS.
[orterun:precondition]
%s was unable to precondition transports
Returned value %d instead of ORTE_SUCCESS.
[orterun:attr-failed]
%s was unable to define an attribute
Returned value %d instead of ORTE_SUCCESS.
#
[orterun:proc-ordered-abort]
%s has exited due to process rank %lu with PID %lu on
node %s calling "abort". This may have caused other processes
in the application to be terminated by signals sent by %s
(as reported here).
#
[orterun:proc-exit-no-sync]
%s has exited due to process rank %lu with PID %lu on
node %s exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination".

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by %s (as reported here).

You can avoid this message by specifying -quiet on the %s command line.

#
[orterun:proc-exit-no-sync-unknown]
%s has exited due to a process exiting without calling "finalize",
but has no info as to the process that caused that situation. This
may have caused other processes in the application to be
terminated by signals sent by %s (as reported here).
#
[orterun:proc-aborted]
%s noticed that process rank %lu with PID %lu on node %s exited on signal %d.
#
[orterun:proc-aborted-unknown]
%s noticed that the job aborted, but has no info as to the process
that caused that situation.
#
[orterun:proc-aborted-signal-unknown]
%s noticed that the job aborted by signal, but has no info as
to the process that caused that situation.
#
[orterun:proc-aborted-strsignal]
%s noticed that process rank %lu with PID %lu on node %s exited on signal %d (%s).
#
[orterun:abnormal-exit]
WARNING: %s has exited before it received notification that all
started processes had terminated. You should double check and ensure
that there are no runaway processes still executing.
#
[orterun:sigint-while-processing]
WARNING: %s is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt %s while it is killing a job (proper
termination may not be guaranteed). Hit control-C again within 1
second if you really want to kill %s immediately.
#
[orterun:empty-prefix]
A prefix was supplied to %s that only contained slashes.

This is a fatal error; %s will now abort. No processes were launched.
#
[debugger-mca-param-not-found]
Internal error -- the orte_base_user_debugger MCA parameter was not able to
be found. Please contact the Open RTE developers; this should not
happen.
#
[debugger-orte_base_user_debugger-empty]
The MCA parameter "orte_base_user_debugger" was empty, indicating that
no user-level debuggers have been defined. Please set this MCA
parameter to a value and try again.
#
[debugger-not-found]
A suitable debugger could not be found in your PATH. Check the values
specified in the orte_base_user_debugger MCA parameter for the list of
debuggers that was searched.
#
[debugger-exec-failed]
%s was unable to launch the specified debugger. This is what was
launched:

%s

Things to check:

- Ensure that the debugger is installed properly
- Ensure that the "%s" executable is in your path
- Ensure that any required licenses are available to run the debugger
#
[orterun:sys-limit-pipe]
%s was unable to launch the specified application as it encountered an error:

Error: system limit exceeded on number of pipes that can be open
Node: %s

when attempting to start process rank %lu.

This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
asking the system administrator for that node to increase the system limit, or
by rearranging your processes to place fewer of them on that node.
#
[orterun:sys-limit-sockets]
Error: system limit exceeded on number of network connections that can be open

This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
#
[orterun:pipe-setup-failure]
%s was unable to launch the specified application as it encountered an error:

Error: pipe function call failed when setting up I/O forwarding subsystem
Node: %s

while attempting to start process rank %lu.
#
[orterun:sys-limit-children]
%s was unable to launch the specified application as it encountered an error:

Error: system limit exceeded on number of processes that can be started
Node: %s

when attempting to start process rank %lu.

This can be resolved by either asking the system administrator for that node to
increase the system limit, or by rearranging your processes to place fewer of them
on that node.
#
[orterun:failed-term-attrs]
%s was unable to launch the specified application as it encountered an error:

Error: reading tty attributes function call failed while setting up I/O forwarding system
Node: %s

while attempting to start process rank %lu.
#
[orterun:wdir-not-found]
%s was unable to launch the specified application as it could not
change to the specified working directory:

Working directory: %s
Node: %s

while attempting to start process rank %lu.
#
[orterun:exe-not-found]
%s was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank %lu; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a %s command
line parameter option (remember that %s interprets the first
unrecognized command line token as the executable).

Node: %s
Executable: %s
#
[orterun:exe-not-accessible]
%s was unable to launch the specified application as it could not access
or execute an executable:

Executable: %s
Node: %s

while attempting to start process rank %lu.
#
[orterun:pipe-read-failure]
%s was unable to launch the specified application as it encountered an error:

Error: reading from a pipe function call failed while spawning a local process
Node: %s

while attempting to start process rank %lu.
#
[orterun:proc-failed-to-start]
%s was unable to start the specified application as it encountered an
error:

Error name: %s
Node: %s

when attempting to start process rank %lu.
#
[orterun:proc-socket-not-avail]
%s was unable to start the specified application as it encountered an
error:

Error name: %s
Node: %s

when attempting to start process rank %lu.
#
[orterun:proc-failed-to-start-no-status]
%s was unable to start the specified application as it encountered an
error on node %s. More information may be available above.
#
[orterun:proc-failed-to-start-no-status-no-node]
%s was unable to start the specified application as it encountered an
error. More information may be available above.
#
[debugger requires -np]
The number of MPI processes to launch was not specified on the command
line.

The %s debugger requires that you specify a number of MPI processes to
launch on the command line via the "-np" command line parameter. For
example:

%s -np 4 %s

Skipping the %s debugger for now.
#
[debugger requires executable]
The %s debugger requires that you specify an executable on the %s
command line; you cannot specify application context files when
launching this job in the %s debugger. For example:

%s -np 4 my_mpi_executable

Skipping the %s debugger for now.
#
[debugger only accepts single app]
The %s debugger only accepts SPMD-style launching; specifying an
MPMD-style launch (with multiple applications separated via ':') is
not permitted.

Skipping the %s debugger for now.
#
[orterun:daemon-died-during-execution]
%s has detected that a required daemon terminated during execution
of the application with a non-zero status. This is a fatal error.
A best-effort attempt has been made to clean up. However, it is
-strongly- recommended that you execute the orte-clean utility
to ensure full cleanup is accomplished.
#
[orterun:no-orted-object-exit]
%s was unable to determine the status of the daemons used to
launch this application. Additional manual cleanup may be required.
Please refer to the "orte-clean" tool for assistance.
#
[orterun:unclean-exit]
%s was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
#
[orterun:event-def-failed]
%s was unable to define an event required for proper operation of
the system. The reason for this error was:

Error: %s

Please report this to the Open MPI mailing list users@open-mpi.org.
#
[orterun:ompi-server-filename-bad]
%s was unable to parse the filename where contact info for the
ompi-server was to be found. The option we were given was:

--ompi-server %s

This appears to be missing the required ':' following the
keyword "file". Please remember that the correct format for this
command line option is:

--ompi-server file:path-to-file

where path-to-file can be either relative to the cwd or absolute.
#
[orterun:ompi-server-filename-missing]
%s was unable to parse the filename where contact info for the
ompi-server was to be found. The option we were given was:

--ompi-server %s

This appears to be missing a filename following the ':'. Please
remember that the correct format for this command line option is:

--ompi-server file:path-to-file

where path-to-file can be either relative to the cwd or absolute.
#
[orterun:ompi-server-filename-access]
%s was unable to access the filename where contact info for the
ompi-server was to be found. The option we were given was:

--ompi-server %s

Please remember that the correct format for this command line option is:

--ompi-server file:path-to-file

where path-to-file can be either relative to the cwd or absolute, and that
you must have read access permissions to that file.
#
[orterun:ompi-server-file-bad]
%s was unable to read the ompi-server's contact info from the
given filename. The filename we were given was:

FILE: %s

Please remember that the correct format for this command line option is:

--ompi-server file:path-to-file

where path-to-file can be either relative to the cwd or absolute, and that
the file must have a single line in it that contains the Open MPI
uri for the ompi-server. Note that this is *not* a standard uri, but
a special format used internally by Open MPI for communications. It can
best be generated by simply directing the ompi-server to put its
uri in a file, and then giving %s that filename.
[orterun:multiple-hostfiles]
Error: More than one hostfile was passed for a single application
context, which is not supported at this time.
#
[orterun:conflicting-params]
%s has detected multiple instances of an MCA param being specified on
the command line, with conflicting values:

MCA param: %s
Value 1: %s
Value 2: %s

This MCA param does not support multiple values, and the system is unable
to identify which value was intended. If this was done in error, please
re-issue the command with only one value. You may wish to review the
output from ompi_info for guidance on accepted values for this param.

[orterun:server-not-found]
%s was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:

Server uri: %s
Timeout time: %ld

Error received: %s

Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
#
[orterun:ompi-server-pid-bad]
%s was unable to parse the PID of the %s to be used as the ompi-server.
The option we were given was:

--ompi-server %s

Please remember that the correct format for this command line option is:

--ompi-server PID:pid-of-%s

where PID can be either "PID" or "pid".
#
[orterun:ompi-server-could-not-get-hnp-list]
%s was unable to search the list of local %s contact files to find the
specified pid. You might check to see if your local session directory
is available and that you have read permissions on the top of that
directory tree.
#
[orterun:ompi-server-pid-not-found]
%s was unable to find an %s with the specified pid of %d that was to
be used as the ompi-server. The option we were given was:

--ompi-server %s

Please remember that the correct format for this command line option is:

--ompi-server PID:pid-of-%s

where PID can be either "PID" or "pid".
#
[orterun:write_file]
%s was unable to open a file to print out %s as requested. The file
name given was:

File: %s
#
[orterun:multiple-paffinity-schemes]
Multiple processor affinity schemes were specified (can only specify
one):

Slot list: %s
opal_paffinity_alone: true

Please specify only the one desired method.
#
[orterun:slot-list-failed]
We were unable to successfully process/set the requested processor
affinity settings:

Specified slot list: %s
Error: %s

This could mean that a non-existent processor was specified, or
that the specification had improper syntax.
#
[orterun:invalid-node-rank]
An invalid node rank was obtained - this is probably something
that should be reported to the OMPI developers.
#
[orterun:invalid-local-rank]
An invalid local rank was obtained - this is probably something
that should be reported to the OMPI developers.
#
[orterun:invalid-phys-cpu]
An invalid physical processor id was returned when attempting to
set processor affinity - please check to ensure that your system
supports such functionality. If so, then this is probably something
that should be reported to the OMPI developers.
#
[orterun:failed-set-paff]
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI
developers.
#
[orterun:topo-not-supported]
An attempt was made to bind a process to a specific hardware topology
mapping (e.g., binding to a socket) but the operating system does not
support such topology-aware actions. Talk to your local system
administrator to find out if your system can support topology-aware
functionality (e.g., Linux Kernels newer than v2.6.18).

Systems that do not support processor topology-aware functionality
cannot use "bind to socket" and other related functionality.

Local host: %s
Action attempted: %s %s
Application name: %s
#
[orterun:binding-not-avail]
A request to bind the processes if the operating system supports such
an operation was made, but the OS does not support this operation:

Local host: %s
Action requested: %s
Application name: %s

Because the request was made on an "if-available" basis, the job was
launched without taking the requested action. If this is not the
desired behavior, talk to your local system administrator to find out
if your system can support the requested action.
#
[orterun:not-enough-resources]
Not enough %s were found on the local host to meet the requested
binding action:

Local host: %s
Action requested: %s
Application name: %s

Please revise the request and try again.
#
[orterun:paffinity-missing-module]
A request to bind processes was made, but no paffinity module
was found:

Local host: %s

This is potentially a configuration problem. You can rerun your job without
requesting binding, or check the configuration.
#
[orterun:invalid-slot-list-range]
A slot list was provided that exceeds the boundaries on available
resources:

Local host: %s
Slot list: %s

Please check your boundaries and try again.
#
[orterun:proc-comm-failed]
A critical communication path was lost to:

Process name: %s
Node: %s
#
[orterun:proc-mem-exceeded]
A process exceeded memory limits:

Process name: %s
Node: %s
#
[orterun:proc-stalled]
One or more processes appear to have stalled - a monitored file
failed to show the required activity.
#
[orterun:proc-sensor-exceeded]
One or more processes have exceeded a specified sensor limit, but
no further info is available.
#
[orterun:proc-called-abort]
%s detected that one or more processes called %s_abort, thus causing
the job to be terminated.
#
[orterun:proc-heartbeat-failed]
%s failed to receive scheduled heartbeat communications from a remote
process:

Process name: %s
Node: %s