Per an off-list discussion, it appears possible for a system to report failure when executing getpwuid. There are several reasons this error can occur, most notably when the system uses a network-based authentication service (e.g., NIS) and that service gets overwhelmed when we launch on a lot of nodes.

There is no good way to recover from this scenario, and from past experience, using the user's name in the session directory (as opposed to the uid) is very helpful when things go wrong. So print a help message when this happens (it is extremely rare, but has happened at least once now) and return an error.

cmr:v1.7.3,reviewer=jsquyres
cmr:v1.6.5,reviewer=jsquyres

This commit was SVN r28658.
This commit is contained in:
Ralph Castain 2013-06-20 04:30:42 +00:00
parent 2e5c18195b
commit 13665bffe8
2 changed files with 13 additions and 3 deletions

@@ -42,6 +42,16 @@ to have the list of prohibited locations changed. Otherwise, please identify
 a different location to be used (use -h to see the cmd line option), or
 simply let the system pick a default location.
 #
+[orte:session:dir:nopwname]
+Open MPI was unable to obtain the username in order to
+create a path for its required temporary directories. This
+is usually caused by either the UID being removed from the
+passed file, or from use of network-based authentication
+service (e.g., NIS) on a large cluster that might suffer
+from congestion.
+Please consult your system administrator about these
+conditions and try again.
+#
 [orte_nidmap:too_many_nodes]
 An error occurred while trying to pack the information about the job. More nodes

@@ -147,9 +147,9 @@ orte_session_dir_get_name(char **fulldirpath,
     if (NULL != pwdent) {
         user = strdup(pwdent->pw_name);
     } else {
-        if (0 > asprintf(&user, "%d", uid)) {
-            return ORTE_ERR_OUT_OF_RESOURCE;
-        }
+        orte_show_help("help-orte-runtime.txt",
+                       "orte:session:dir:nopwname", true);
+        return ORTE_ERR_OUT_OF_RESOURCE;
     }
     /*