fix hang in -np 3 --rank-by core

The following command hangs:
  % mpirun --rank-by core -np 3 --report-bindings hostname
The hang comes from a loop in which i is meant to cycle through
an array of num_objs entries, but the loop condition only lets it
visit node->num_procs entries.

I changed the counter so the loop stays on this node until it
makes a full cycle through the array of objects without making
any assignments. The loop then ends so the outer loop can move
on to the next node.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
Mark Allen 2019-04-12 14:51:52 -04:00
parent bdd92a7a64
commit bf3980d70c

@@ -378,8 +378,25 @@ static int rank_by(orte_job_t *jdata,
  * Perhaps someday someone will come up with a more efficient
  * algorithm, but this works for now.
  */
+// In 3.x this was two loops:
+// while (cnt < app->num_procs)
+// for (i=0; i<num_objs; ...)
+// Then in 4.x it switched to
+// while (cnt < app->num_procs && i < (int)node->num_procs)
+// where that extra i part seems wrong to me. First of all if anything
+// it seems like it should be i<num_objs since that's the array i is
+// cycling through, but even then all the usage of i below is
+// (i % num_objs) so I think i is intended to wrap and you should
+// keep looping until you've made all the assignments you can for
+// this node.
+//
+// So that's what I added the other loop counter for, figuring if it
+// cycles through the whole array of objs without making an assignment
+// it's time for this loop to end and the outer loop to take us to the
+// next node.
 i = 0;
-while (cnt < app->num_procs && i < (int)node->num_procs) {
+int niters_of_i_without_assigning_a_proc = 0;
+while (cnt < app->num_procs && niters_of_i_without_assigning_a_proc <= num_objs) {
 /* get the next object */
 obj = (hwloc_obj_t)opal_pointer_array_get_item(&objs, i % num_objs);
 if (NULL == obj) {
@@ -447,6 +464,7 @@ static int rank_by(orte_job_t *jdata,
 return rc;
 }
 num_ranked++;
+niters_of_i_without_assigning_a_proc = 0;
 /* track where the highest vpid landed - this is our
  * new bookmark
  */
@@ -455,6 +473,7 @@ static int rank_by(orte_job_t *jdata,
 break;
 }
 i++;
+++niters_of_i_without_assigning_a_proc;
 }
 }
 /* cleanup */
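
The termination rule described in the commit message can be sketched in
isolation. The function below is a simplified model, not the ORTE code:
rank_on_node, pending, placed, and idle are hypothetical names, and each
object is reduced to a count of procs that still need a rank. It keeps
wrapping over the objects with i % num_objs and leaves the loop either
when every requested proc is ranked or when a full pass over the objects
ranks nothing. The patch itself uses a slightly different bound
(<= num_objs with an unconditional increment), so treat this as an
illustration of the idea only.

```c
#include <assert.h>

/* Hypothetical model, not ORTE code: pending[k] is the number of procs on
 * object k that still need a rank, and placed[r] receives the object
 * index chosen for rank r. Returns the updated count of ranked procs. */
static int rank_on_node(int *pending, int num_objs,
                        int cnt, int total, int *placed)
{
    int i = 0;      /* keeps growing; only i % num_objs is used as an index */
    int idle = 0;   /* consecutive object visits that ranked nothing */

    assert(num_objs > 0);
    while (cnt < total && idle < num_objs) {
        int k = i % num_objs;
        if (pending[k] > 0) {
            pending[k]--;
            placed[cnt++] = k;  /* assign the next rank to object k */
            idle = 0;           /* progress was made, restart the count */
        } else {
            idle++;             /* nothing left on this object */
        }
        i++;
    }
    /* Either everything is ranked or this node is exhausted; in the
     * second case the caller would move on to the next node. */
    return cnt;
}
```

With pending = {2, 0, 1, 0} and total = 5, the call ranks three procs (on
objects 0, 2, and 0) and returns after four idle visits, leaving the
remaining two procs for another node instead of spinning.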