1
1
openmpi/orte/mca/plm
Ralph Castain 48c3de1865 Fix a problem in the plm "failed to start" code observed by Jeff. When we are unable to launch to a specific node because it doesn't exist or is down, the system would hang and/or segv. The reason for the hang was that we were "firing" the orted exit trigger prior to its timer event being defined - thus "locking" that one-shot and preventing it from firing when we actually were ready to use it.
The segv was caused by the fact that we don't really know which daemon failed to start (at least, in most cases), so we didn't set a pointer to the aborted proc object. All we really wanted, though, was to ensure that mpirun returned a non-zero exit status, so the fix was to simply return the default error status.

This commit was SVN r19754.
2008-10-16 14:21:37 +00:00
..
alps Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. 2008-09-08 15:45:11 +00:00
base Fix a problem in the plm "failed to start" code observed by Jeff. When we are unable to launch to a specific node because it doesn't exist or is down, the system would hang and/or segv. The reason for the hang was that we were "firing" the orted exit trigger prior to its timer event being defined - thus "locking" that one-shot and preventing it from firing when we actually were ready to use it. 2008-10-16 14:21:37 +00:00
bproc Fixes trac:1392, #1400 2008-07-28 22:40:57 +00:00
ccp Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. 2008-09-08 15:45:11 +00:00
lsf Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. 2008-09-08 15:45:11 +00:00
process Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. 2008-09-08 15:45:11 +00:00
rsh Fix a problem in the plm "failed to start" code observed by Jeff. When we are unable to launch to a specific node because it doesn't exist or is down, the system would hang and/or segv. The reason for the hang was that we were "firing" the orted exit trigger prior to its timer event being defined - thus "locking" that one-shot and preventing it from firing when we actually were ready to use it. 2008-10-16 14:21:37 +00:00
slurm Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. 2008-09-08 15:45:11 +00:00
submit Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. 2008-09-08 15:45:11 +00:00
tm Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. 2008-09-08 15:45:11 +00:00
tmd MOdify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required. 2008-09-25 13:39:08 +00:00
xgrid Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. 2008-09-08 15:45:11 +00:00
Makefile.am Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
plm_types.h When we hit an error prior to actually launching daemons, it would be nice if orterun didn't bark about daemons failing to launch, mpirun detecting a job failed, etc. 2008-08-19 15:19:30 +00:00
plm.h Fixes trac:1392, #1400 2008-07-28 22:40:57 +00:00