openmpi/orte/util
Ralph Castain 13665bffe8 Per an off-list discussion, it appears possible for a system to report failure when executing getpwuid. There are several reasons for this error to occur, most notably if the system uses a network-based authentication protocol (e.g., NIS) and that system gets overwhelmed when we launch on a lot of nodes.
There is no good way to recover from this scenario, and from past experience, using the user's name in the session directory (as opposed to the uid) is very helpful when things go wrong. So print a help message when this happens (it is extremely rare, but has happened at least once now) and return an error.

cmr:v1.7.3,reviewer=jsquyres
cmr:v1.6.5,reviewer=jsquyres

This commit was SVN r28658.
2013-06-20 04:30:42 +00:00
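The behavior described in the commit message above can be sketched roughly as follows. This is an illustrative sketch only, not the code in session_dir.c: the function name `sketch_get_user_name` and the `SKETCH_*` status codes are hypothetical, and `fprintf` stands in for whatever help-message mechanism the actual code uses. The only real API relied on is POSIX `getpwuid`, which returns NULL when the lookup fails.

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical status codes for this sketch; not ORTE's real constants. */
#define SKETCH_SUCCESS 0
#define SKETCH_ERROR   (-1)

/*
 * Copy the login name for `uid` into `buf` (capacity `len`).
 *
 * If getpwuid() fails, for example because a network-based lookup
 * service such as NIS is overloaded, report the problem and return an
 * error instead of silently falling back to the numeric uid.
 */
int sketch_get_user_name(uid_t uid, char *buf, size_t len)
{
    struct passwd *pwdent;
    int written;

    if (NULL == buf || 0 == len) {
        return SKETCH_ERROR;
    }

    pwdent = getpwuid(uid);
    if (NULL == pwdent) {
        /* Stand-in for a proper help message. */
        fprintf(stderr,
                "getpwuid failed for uid %lu; cannot determine user name\n",
                (unsigned long)uid);
        return SKETCH_ERROR;
    }

    /* snprintf always NUL-terminates; treat truncation as an error. */
    written = snprintf(buf, len, "%s", pwdent->pw_name);
    if (written < 0 || (size_t)written >= len) {
        return SKETCH_ERROR;
    }
    return SKETCH_SUCCESS;
}
```

A caller would pass `getuid()` and a buffer, and abort session directory setup if the return value is not `SKETCH_SUCCESS`. Returning an error rather than substituting the uid matches the commit's stated preference for keeping the user's name in the session directory.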
comm Sorry for mid-day commit, but I had promised on the call to do this upon my return. 2012-04-06 14:23:13 +00:00
dash_host Remove a bunch of dead code: gcc 4.7 warns of set-but-unused 2013-05-17 21:45:49 +00:00
hostfile Error out if we are filtering a hostfile and encounter a node that is not in the resource-managed allocation, giving an error message identifying the file and the node. Don't filter managed allocations thru a default hostfile as this can lead to "hidden" errors. 2012-09-05 19:42:00 +00:00
context_fns.c MCA/base: Add new MCA variable system 2013-03-27 21:09:41 +00:00
context_fns.h ... Delayed due to notifier commits earlier this day ... 2009-04-29 01:32:14 +00:00
error_strings.c Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data. 2013-02-26 17:50:04 +00:00
error_strings.h Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine. 2012-08-28 21:20:17 +00:00
help-regex.txt Fix regular expression analyzer for slurmd - use a slurm-specific version 2011-07-13 22:49:56 +00:00
hnp_contact.c Remove all remaining vestiges of the Windows integration 2013-02-28 17:31:47 +00:00
hnp_contact.h As requested by Aurelien at the July design meeting - long time coming, but finally got around to it. 2008-12-10 17:10:39 +00:00
Makefile.am Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes 2013-02-28 01:35:55 +00:00
name_fns.c Fix a small leak in orte/util/name_fns.c 2012-11-07 23:59:49 +00:00
name_fns.h Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required. 2012-06-27 14:53:55 +00:00
nidmap.c The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem. 2013-04-08 23:34:16 +00:00
nidmap.h Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine. 2012-08-28 21:20:17 +00:00
parse_options.c Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces. 2011-04-29 18:46:40 +00:00
parse_options.h Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces. 2011-04-29 18:46:40 +00:00
pre_condition_transports.c MCA/base: Add new MCA variable system 2013-03-27 21:09:41 +00:00
pre_condition_transports.h In the case of direct-launched processes running under slurm, psm requires that the pre_condition_transports MCA param be set. This is normally computed by mpirun and inserted into each proc's environ, but that doesn't work here. 2011-04-28 13:54:33 +00:00
proc_info.c MCA/base: Add new MCA variable system 2013-03-27 21:09:41 +00:00
proc_info.h MCA/base: Add new MCA variable system 2013-03-27 21:09:41 +00:00
regex.c Remove a bunch of dead code: gcc 4.7 warns of set-but-unused 2013-05-17 21:45:49 +00:00
regex.h Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long. 2011-07-07 18:54:30 +00:00
session_dir.c Per an off-list discussion, it appears possible for a system to report failure when executing getpwuid. There are several reasons for this error to occur, most notably if the system uses a network-based authentication protocol (e.g., NIS) and that system gets overwhelmed when we launch on a lot of nodes. 2013-06-20 04:30:42 +00:00
session_dir.h - Revert r20739 2009-03-05 21:56:03 +00:00
show_help.c Use a non-blocking send in show_help as it could be called from inside an event 2013-02-28 17:19:18 +00:00
show_help.h Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes 2013-02-28 01:35:55 +00:00