fix hang in -np 3 --rank-by core
The following command hangs:

    % mpirun --rank-by core -np 3 --report-bindings hostname

because of a loop where i is supposed to cycle through an array of size
num_objs, but for some reason it's only looking at node->num_procs entries.

I changed the counter so it stays in the loop (stays on this node) until it
makes a full cycle through the array of objects without any assignments,
then it ends the loop so it can go to the next node.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
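The counter change described in the commit message can be sketched roughly as follows. This is a simplified toy model, not the ORTE code: `rank_round_robin`, the `waiting` array (number of unranked procs located on each object), and `ranks_out` are hypothetical names introduced only for illustration, and the stall test is written as `stalled < num_objs` rather than with the patch's exact counter arithmetic.

```c
#include <assert.h>

/* Toy model (an assumption, not the ORTE data structures):
 * waiting[k] is the number of still-unranked procs located on object k.
 * Returns how many procs were ranked; ranks_out[r] is the object whose
 * proc received rank r. */
static int rank_round_robin(int waiting[], int num_objs,
                            int num_procs, int ranks_out[])
{
    int cnt = 0;      /* procs ranked so far */
    int i = 0;        /* ever-increasing cursor, always used as i % num_objs */
    int stalled = 0;  /* consecutive objects visited without an assignment */

    assert(num_objs > 0);

    /* Keep cycling until every proc is ranked, or until a full pass over
     * the object array produced no assignment at all. */
    while (cnt < num_procs && stalled < num_objs) {
        int k = i % num_objs;
        if (waiting[k] > 0) {
            waiting[k]--;
            ranks_out[cnt] = k;  /* next rank goes to a proc on object k */
            cnt++;
            stalled = 0;         /* progress was made: reset the stall counter */
        } else {
            stalled++;
        }
        i++;
    }
    return cnt;
}
```

With `waiting = {1, 0, 2, 0}` and three procs to rank, the cursor wraps past the end of the array once and the third rank still lands on object 2. With nothing left to assign, the loop gives up after one full pass instead of spinning.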
Parent: bdd92a7a64
Commit: bf3980d70c
@@ -378,8 +378,25 @@ static int rank_by(orte_job_t *jdata,
      * Perhaps someday someone will come up with a more efficient
      * algorithm, but this works for now.
      */
+    // In 3.x this was two loops:
+    // while (cnt < app->num_procs)
+    // for (i=0; i<num_objs; ...)
+    // Then in 4.x it switched to
+    // while (cnt < app->num_procs && i < (int)node->num_procs)
+    // where that extra i part seems wrong to me. First of all if anything
+    // it seems like it should be i<num_objs since that's the array i is
+    // cycling through, but even then all the usage of i below is
+    // (i % num_objs) so I think i is intended to wrap and you should
+    // keep looping until you've made all the assignments you can for
+    // this node.
+    //
+    // So that's what I added the other loop counter for, figuring if it
+    // cycles through the whole array of objs without making an assignment
+    // it's time for this loop to end and the outer loop to take us to the
+    // next node.
     i = 0;
-    while (cnt < app->num_procs && i < (int)node->num_procs) {
+    int niters_of_i_without_assigning_a_proc = 0;
+    while (cnt < app->num_procs && niters_of_i_without_assigning_a_proc <= num_objs) {
         /* get the next object */
         obj = (hwloc_obj_t)opal_pointer_array_get_item(&objs, i % num_objs);
         if (NULL == obj) {
@@ -447,6 +464,7 @@ static int rank_by(orte_job_t *jdata,
                 return rc;
             }
             num_ranked++;
+            niters_of_i_without_assigning_a_proc = 0;
             /* track where the highest vpid landed - this is our
              * new bookmark
              */
@@ -455,6 +473,7 @@ static int rank_by(orte_job_t *jdata,
             break;
         }
         i++;
+        ++niters_of_i_without_assigning_a_proc;
     }
 }
 /* cleanup */
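To see how the two loop conditions in the diff can behave differently, here is a small single-node simulation. It is only a sketch under stated assumptions: `ranked_on_node`, `located`, and `MAX_OBJS` are hypothetical names, the model ignores everything in `rank_by` except the cursor and the stall counter, and the placement of procs on objects 0, 1, and 4 is invented for illustration (the commit message does not say where the procs were located).

```c
#include <assert.h>

#define MAX_OBJS 64  /* arbitrary cap for this toy model */

/* located[k]: number of procs placed on object k (hypothetical input).
 * use_old_bound != 0 models the 4.x condition  i < node_num_procs;
 * use_old_bound == 0 models the patched condition, with the counter
 * reset on assignment and incremented on every iteration as in the diff.
 * Returns how many procs on this node received a rank. */
static int ranked_on_node(const int located[], int num_objs,
                          int node_num_procs, int use_old_bound)
{
    int waiting[MAX_OBJS];
    int cnt = 0, i = 0, stalled = 0;

    assert(num_objs > 0 && num_objs <= MAX_OBJS);
    for (int k = 0; k < num_objs; k++) {
        waiting[k] = located[k];  /* work on a copy, leave the input intact */
    }

    while (cnt < node_num_procs) {
        int keep_going = use_old_bound ? (i < node_num_procs)
                                       : (stalled <= num_objs);
        if (!keep_going) {
            break;
        }
        int k = i % num_objs;
        if (waiting[k] > 0) {
            waiting[k]--;
            cnt++;
            stalled = 0;  /* mirrors the reset next to num_ranked++ */
        }
        i++;
        ++stalled;        /* mirrors the increment next to i++ */
    }
    return cnt;
}
```

With eight objects and three procs on objects 0, 1, and 4, the old bound stops after inspecting objects 0 through 2 and ranks only two procs, while the patched condition continues to object 4 and ranks all three. In this model a proc left unranked is what would keep an enclosing `cnt < app->num_procs` loop from finishing, which is one plausible reading of the reported hang.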