Normally, any non-Windows node should return NodeStatus_Unreachable, but that is not always the case. We have noticed that Scientific Linux cluster nodes in the same subnet as the Windows nodes return the NodeStatus_PendingApproval value, and the RAS CCP treats them as Windows nodes. So let's just use the "Ready" nodes.
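A minimal sketch of the allow-list idea described above, written in Python for illustration only. The two status names come from this message; the enum, the function name, and the node names are hypothetical and do not reflect the actual RAS CCP code.

```python
from enum import Enum


class NodeStatus(Enum):
    # Names mirror the statuses mentioned in the commit message.
    READY = "Ready"
    UNREACHABLE = "NodeStatus_Unreachable"
    PENDING_APPROVAL = "NodeStatus_PendingApproval"


def select_usable_nodes(nodes):
    """Keep only nodes reported as READY (allow-list instead of deny-list)."""
    return [name for name, status in nodes if status is NodeStatus.READY]


# Hypothetical cluster: one Windows node, one Scientific Linux node in the
# same subnet, and one node that is simply unreachable.
cluster = [
    ("win-node-01", NodeStatus.READY),
    ("sl-node-01", NodeStatus.PENDING_APPROVAL),
    ("offline-01", NodeStatus.UNREACHABLE),
]

print(select_usable_nodes(cluster))
```

Filtering on the one status that is known to be good avoids having to enumerate every status a non-Windows node might report.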
This commit was SVN r21510.
that now the HNP sends the messages using the routed component. In the case
of tree spawn, when an intermediary node spawns a child it doesn't know how
to forward a message to it, so when the node-map message comes from
the HNP (as there is nothing yet in the contact/routing table) the message
is sent back the way it came. As a result, the node-map message keeps bouncing
between the HNP and the first-level orteds.
The solution is to add a new option for the children, orte_parent_uri, which
is only set when the orted is _not_ directly spawned by the HNP. When this
option is present on the argument list, the orted will add the parent to
its routing table, and force the parent to update its routes (by sending the URI).
With this approach, the routing tree is built at the same time as the processes
are spawned, and all messages from the HNP can be routed to the leaves.
However, this is far from an optimal solution. Right now, this so-called tree
spawn only spawns the children in a tree without doing anything about the
"connect back to the HNP" step. The HNP is flooded with reports from all the
orteds. The total number of messages is higher than in the non-tree startup
scheme, so we do not expect this approach to be scalable in its current
incarnation. A complete overhaul of the tree startup is required in order
to improve the scalability. Stay tuned!
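The routing loop and its fix can be illustrated with a toy model. This is only a sketch in Python under simplifying assumptions: three nodes, dictionary-based next-hop tables, and a default hop toward the parent. The names deliver, register_child, and the node labels are hypothetical and are not part of the routed component.

```python
def deliver(routes, default_hop, src, dest, max_hops=6):
    """Follow next-hop tables from src toward dest and return the path.

    The walk stops after max_hops so that a routing loop shows up as a
    truncated path instead of an endless loop.
    """
    path = [src]
    current = src
    while current != dest and len(path) <= max_hops:
        # Unknown destination: fall back to the default hop (the way we came).
        current = routes[current].get(dest, default_hop[current])
        path.append(current)
    return path


def register_child(routes, parent, child):
    """Model of the fix: child learns its parent, parent learns the child."""
    routes[child][parent] = parent
    routes[parent][child] = child


# The HNP routes everything through the first-level orted; the orteds start
# with empty tables and default to sending messages back up the tree.
routes = {"hnp": {"orted1": "orted1", "orted2": "orted1"}, "orted1": {}, "orted2": {}}
default_hop = {"hnp": "hnp", "orted1": "hnp", "orted2": "orted1"}

looping_path = deliver(routes, default_hop, "hnp", "orted2")
print("before:", looping_path)

register_child(routes, "orted1", "orted2")
fixed_path = deliver(routes, default_hop, "hnp", "orted2")
print("after: ", fixed_path)
```

Before the child registers with its parent, the message alternates between the HNP and the first-level orted until the hop limit is hit; afterwards it reaches the leaf in two hops.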
This commit was SVN r21504.
orte_session_dir_finalize doesn't clean the right directories.
orte_session_dir_cleanup doesn't either.
This patch fixes several issues:
1. orte_session_dir_cleanup():
   1. when jobid is not a wildcard, jobid is used to build the job
      session dir (instead of ORTE_LOCAL_JOBID).
   2. ORTE_SUCCESS is unconditionally returned (instead of rc that
      might have been previously set to another value).
2. orte_session_dir_finalize():
   1. convert_jobid_to_string is not the right call to get the job
      session dir.
   2. in some places orte_process_info.top_session_dir is directly
      used, without being prefixed with the base directory.
Factored the code sections that build the job_session_dir out into a
single orte_build_job_session_dir() function that is now called by
both orte_session_dir_finalize() and orte_session_dir_cleanup().
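As a rough illustration of the refactoring, here is a sketch in Python rather than the actual C code. It assumes that the local part of a jobid is its low 16 bits and that the job session dir is laid out as base directory, then top session dir, then local jobid. The helper names, the directory names, and the remove_dir callback are hypothetical.

```python
import os


def local_jobid(jobid):
    # Assumption: the local part of a jobid is its low 16 bits.
    return jobid & 0xFFFF


def build_job_session_dir(base_dir, top_session_dir, jobid):
    """Single place that builds the job session dir, always under base_dir."""
    return os.path.join(base_dir, top_session_dir, str(local_jobid(jobid)))


def session_dir_cleanup(base_dir, top_session_dir, jobid, remove_dir):
    """Remove the job session dir and propagate any error code."""
    rc = 0  # stands in for ORTE_SUCCESS
    target = build_job_session_dir(base_dir, top_session_dir, jobid)
    err = remove_dir(target)
    if err != 0:
        rc = err
    return rc  # rc is returned, not an unconditional success


# Hypothetical removal callback that records what it was asked to delete.
removed = []


def fake_remove(path):
    removed.append(path)
    return -1 if "missing" in path else 0


ok = session_dir_cleanup("/tmp", "sessions-example", 0x12340005, fake_remove)
failed = session_dir_cleanup("/tmp", "missing", 1, fake_remove)
print(removed, ok, failed)
```

Building the path in one helper means the cleanup and finalize paths cannot drift apart on which directory they target.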
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
This commit was SVN r21498.
The open/select of the PLM is done in orte/mca/ess/base/ess_base_std_orted.c. It is done only when the PLM MCA param is set, directing that a specific PLM be selected. The function orte_plm_base_orted_append_basic_args clears any PLM selection passed to the HNP from the params passed to the daemon. Each PLM then adds a PLM directive if and only if backend PLM support is desired. At present, Torque, SLURM, and rsh all specify this support and direct that the backend orted open the "rsh" PLM.
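The strip-then-add behavior can be sketched as follows. This is a Python model of the idea only, not the body of orte_plm_base_orted_append_basic_args. It assumes the PLM selection travels as a "-mca plm <name>" triple on the argument list; the function names and the sample arguments are hypothetical.

```python
def strip_plm_selection(argv):
    """Return a copy of argv without any "-mca plm <value>" triple."""
    cleaned = []
    i = 0
    while i < len(argv):
        is_mca_flag = argv[i] in ("-mca", "--mca")
        if is_mca_flag and i + 2 < len(argv) and argv[i + 1] == "plm":
            i += 3  # skip the flag, the param name, and its value
            continue
        cleaned.append(argv[i])
        i += 1
    return cleaned


def build_orted_args(hnp_args, backend_plm=None):
    """Drop the HNP's PLM choice, then add one only if a backend PLM is wanted."""
    args = strip_plm_selection(hnp_args)
    if backend_plm is not None:
        args += ["-mca", "plm", backend_plm]
    return args


# Hypothetical argument list as seen by the HNP.
hnp_args = ["orted", "-mca", "plm", "slurm", "-mca", "ess", "env"]

print(build_orted_args(hnp_args, backend_plm="rsh"))
print(build_orted_args(hnp_args))
```

With no backend PLM requested, the daemon receives no PLM param at all, which matches the case where the open/select is skipped.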
This commit was SVN r21488.
The following SVN revision numbers were found above:
r21480 --> open-mpi/ompi@ed585bce8a
If we do not initialize the PLM, non-HNP daemons will not be able to use its functions. For example, RSH needs it when the tree_spawn mode is
enabled: daemons call the orte_plm.remote_spawn() function to spawn their children in the deployment tree.
This commit was SVN r21480.
Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn.
This commit was SVN r21390.
in this directory and it's causing "make dist" to break.
Shiqing -- is there a missing file in this directory? If so, please
add it and restore the EXTRA_DIST line I just removed. Thanks!
This commit was SVN r21340.
way to have no abort message is to pass NULL (the errmanager is smart
enough to handle this case and not emit any extra message).
This commit was SVN r21311.