2005-03-14 20:57:21 +00:00
|
|
|
/*
|
|
|
|
* Copyright (c) 2004-2005 The Trustees of Indiana University.
|
|
|
|
* All rights reserved.
|
|
|
|
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
|
|
|
* All rights reserved.
|
|
|
|
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
|
|
|
|
* University of Stuttgart. All rights reserved.
|
2005-03-24 12:43:37 +00:00
|
|
|
* Copyright (c) 2004-2005 The Regents of the University of California.
|
|
|
|
* All rights reserved.
|
2005-03-14 20:57:21 +00:00
|
|
|
* $COPYRIGHT$
|
|
|
|
*
|
|
|
|
* Additional copyrights may follow
|
|
|
|
*
|
|
|
|
* $HEADER$
|
|
|
|
*/
|
|
|
|
|
|
|
|
|
|
|
|
#include "orte_config.h"
|
|
|
|
#include "include/orte_constants.h"
|
|
|
|
|
|
|
|
#include "mca/mca.h"
|
|
|
|
#include "mca/base/base.h"
|
|
|
|
#include "mca/ras/base/base.h"
|
|
|
|
#include "mca/ras/base/ras_base_node.h"
|
|
|
|
#include "mca/rmgr/base/base.h"
|
2005-05-19 16:46:34 +00:00
|
|
|
#include "mca/errmgr/errmgr.h"
|
2005-03-14 20:57:21 +00:00
|
|
|
|
|
|
|
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
/*
|
|
|
|
* Allocate one process per node on a round-robin basis, looping back
|
|
|
|
* around to the beginning as necessary
|
2005-03-14 20:57:21 +00:00
|
|
|
*/
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
int orte_ras_base_allocate_nodes_by_node(orte_jobid_t jobid,
|
|
|
|
ompi_list_t* nodes)
|
2005-03-14 20:57:21 +00:00
|
|
|
{
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
ompi_list_t allocated;
|
|
|
|
ompi_list_item_t* item;
|
|
|
|
size_t num_requested = 0;
|
|
|
|
size_t num_allocated = 0;
|
|
|
|
size_t num_constrained = 0;
|
|
|
|
size_t slots;
|
|
|
|
bool oversubscribe = false;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
/* query for the number of process slots required */
|
|
|
|
if (ORTE_SUCCESS !=
|
|
|
|
(rc = orte_rmgr_base_get_job_slots(jobid, &num_requested))) {
|
|
|
|
return rc;
|
2005-03-14 20:57:21 +00:00
|
|
|
}
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
|
|
|
|
OBJ_CONSTRUCT(&allocated, ompi_list_t);
|
|
|
|
num_allocated = 0;
|
|
|
|
|
|
|
|
/* This loop continues until all procs have been allocated or we run
|
|
|
|
out of resources. There are two definitions of "run out of
|
|
|
|
resources":
|
|
|
|
|
|
|
|
1. All nodes have node_slots processes allocated to them
|
|
|
|
2. All nodes have node_slots_max processes allocated to them
|
|
|
|
|
|
|
|
We first map until condition #1 is met. If there are still
|
|
|
|
processes that haven't been allocated yet, then we continue
|
|
|
|
until condition #2 is met. If we still have processes that
|
|
|
|
haven't been allocated yet, then it's an "out of resources"
|
|
|
|
error. */
|
|
|
|
while (num_allocated < num_requested) {
|
|
|
|
num_constrained = 0;
|
|
|
|
|
|
|
|
/* loop over all nodes until either all processes are
|
|
|
|
allocated or they all become constrained */
|
|
|
|
for (item = ompi_list_get_first(nodes);
|
|
|
|
item != ompi_list_get_end(nodes) && num_allocated < num_requested;
|
|
|
|
item = ompi_list_get_next(item)) {
|
|
|
|
orte_ras_base_node_t* node = (orte_ras_base_node_t*)item;
|
|
|
|
|
|
|
|
/* are any slots available? */
|
|
|
|
slots = (oversubscribe ? node->node_slots_max : node->node_slots);
|
|
|
|
if (node->node_slots_inuse < slots ||
|
|
|
|
(oversubscribe && 0 == slots)) {
|
|
|
|
++num_allocated;
|
|
|
|
++node->node_slots_inuse; /* running total */
|
|
|
|
++node->node_slots_alloc; /* this job */
|
|
|
|
} else {
|
|
|
|
++num_constrained;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* if all nodes are constrained:
|
|
|
|
- if this is the first time through the loop, then set
|
|
|
|
"oversubscribe" to true, and we'll now start obeying
|
|
|
|
node_slots_max instead of node_slots
|
|
|
|
- if this is the second time through the loop, then all
|
|
|
|
nodes are full to the max, and therefore we can't do
|
|
|
|
anything more -- we're out of resources */
|
|
|
|
if (ompi_list_get_size(nodes) == num_constrained) {
|
|
|
|
if (oversubscribe) {
|
|
|
|
rc = ORTE_ERR_OUT_OF_RESOURCE;
|
|
|
|
goto cleanup;
|
|
|
|
} else {
|
|
|
|
oversubscribe = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* move all nodes w/ allocations to the allocated list */
|
|
|
|
item = ompi_list_get_first(nodes);
|
|
|
|
while(item != ompi_list_get_end(nodes)) {
|
|
|
|
orte_ras_base_node_t* node = (orte_ras_base_node_t*)item;
|
|
|
|
ompi_list_item_t* next = ompi_list_get_next(item);
|
|
|
|
if(node->node_slots_alloc) {
|
|
|
|
ompi_list_remove_item(nodes, item);
|
|
|
|
ompi_list_append(&allocated, item);
|
|
|
|
}
|
|
|
|
item = next;
|
|
|
|
}
|
|
|
|
|
|
|
|
rc = orte_ras_base_node_assign(&allocated, jobid);
|
|
|
|
if(ORTE_SUCCESS != rc) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
}
|
|
|
|
cleanup:
|
|
|
|
|
|
|
|
while(NULL != (item = ompi_list_remove_first(&allocated)))
|
|
|
|
ompi_list_append(nodes, item);
|
|
|
|
OBJ_DESTRUCT(&allocated);
|
|
|
|
return rc;
|
2005-03-14 20:57:21 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
/*
|
|
|
|
* Allocate processes to nodes, using all available slots on a node.
|
|
|
|
*/
|
|
|
|
int orte_ras_base_allocate_nodes_by_slot(orte_jobid_t jobid,
|
|
|
|
ompi_list_t* nodes)
|
2005-03-14 20:57:21 +00:00
|
|
|
{
|
|
|
|
ompi_list_t allocated;
|
|
|
|
ompi_list_item_t* item;
|
|
|
|
size_t num_requested = 0;
|
|
|
|
size_t num_allocated = 0;
|
|
|
|
size_t num_constrained = 0;
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
size_t available;
|
2005-03-14 20:57:21 +00:00
|
|
|
int rc;
|
|
|
|
|
|
|
|
/* query for the number of process slots required */
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
if (ORTE_SUCCESS !=
|
|
|
|
(rc = orte_rmgr_base_get_job_slots(jobid, &num_requested))) {
|
2005-03-14 20:57:21 +00:00
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
OBJ_CONSTRUCT(&allocated, ompi_list_t);
|
|
|
|
num_allocated = 0;
|
|
|
|
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
/* In the first pass, just grab all available slots (i.e., stay <=
|
|
|
|
node_slots) greedily off each node */
|
|
|
|
for (item = ompi_list_get_first(nodes);
|
|
|
|
item != ompi_list_get_end(nodes) && num_allocated < num_requested;
|
|
|
|
item = ompi_list_get_next(item)) {
|
|
|
|
orte_ras_base_node_t* node = (orte_ras_base_node_t*)item;
|
|
|
|
|
|
|
|
/* are any slots available? */
|
|
|
|
if (node->node_slots_inuse < node->node_slots) {
|
|
|
|
available = node->node_slots - node->node_slots_inuse;
|
|
|
|
if (num_requested - num_allocated < available) {
|
|
|
|
node->node_slots_inuse +=
|
|
|
|
(num_requested - num_allocated); /* running total */
|
|
|
|
node->node_slots_alloc +=
|
|
|
|
(num_requested - num_allocated); /* this job */
|
|
|
|
num_allocated = num_requested;
|
|
|
|
} else {
|
|
|
|
num_allocated += available;
|
|
|
|
node->node_slots_inuse += available; /* running total */
|
|
|
|
node->node_slots_alloc += available; /* this job */
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If we're not done, then we're in an oversubscribing situation.
|
|
|
|
Switch to a round-robin-by-node policy -- take one slot from
|
|
|
|
each node until we hit node_slots_max or we have no more
|
|
|
|
resources; whichever occurs first. */
|
|
|
|
while (num_allocated < num_requested) {
|
2005-04-15 13:28:17 +00:00
|
|
|
num_constrained = 0;
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
|
|
|
|
/* loop over all nodes until either all processes are
|
|
|
|
allocated or they all become constrained */
|
|
|
|
for (item = ompi_list_get_first(nodes);
|
|
|
|
item != ompi_list_get_end(nodes) && num_allocated < num_requested;
|
|
|
|
item = ompi_list_get_next(item)) {
|
2005-03-14 20:57:21 +00:00
|
|
|
orte_ras_base_node_t* node = (orte_ras_base_node_t*)item;
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
|
|
|
|
/* are any slots available? */
|
|
|
|
if (node->node_slots_inuse < node->node_slots_max ||
|
|
|
|
0 == node->node_slots_max) {
|
|
|
|
++num_allocated;
|
|
|
|
++node->node_slots_inuse; /* running total */
|
|
|
|
++node->node_slots_alloc; /* this job */
|
|
|
|
} else {
|
|
|
|
++num_constrained;
|
2005-03-14 20:57:21 +00:00
|
|
|
}
|
|
|
|
}
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
|
|
|
|
/* if all nodes are constrained, then we're out of resources
|
|
|
|
-- thanks for playing */
|
|
|
|
if (ompi_list_get_size(nodes) == num_constrained) {
|
2005-03-14 20:57:21 +00:00
|
|
|
rc = ORTE_ERR_OUT_OF_RESOURCE;
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
}
|
(copied from a mail that has a lengthy description of this commit)
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occurs. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix it for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There's still some more bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
don't make sense unless you remember that there is a difference
between the two (which makes total sense, but you have to think
about it in terms of both things):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and it's not clear from them
that you can get different allocation answers in all cases (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
This commit was SVN r5894.
2005-05-31 16:36:53 +00:00
|
|
|
|
2005-04-15 13:28:17 +00:00
|
|
|
/* move all nodes w/ allocations to the allocated list */
|
|
|
|
item = ompi_list_get_first(nodes);
|
|
|
|
while(item != ompi_list_get_end(nodes)) {
|
|
|
|
orte_ras_base_node_t* node = (orte_ras_base_node_t*)item;
|
|
|
|
ompi_list_item_t* next = ompi_list_get_next(item);
|
|
|
|
if(node->node_slots_alloc) {
|
|
|
|
ompi_list_remove_item(nodes, item);
|
|
|
|
ompi_list_append(&allocated, item);
|
|
|
|
}
|
|
|
|
item = next;
|
2005-03-18 23:40:08 +00:00
|
|
|
}
|
2005-05-19 16:46:34 +00:00
|
|
|
|
2005-03-14 20:57:21 +00:00
|
|
|
rc = orte_ras_base_node_assign(&allocated, jobid);
|
2005-05-19 16:46:34 +00:00
|
|
|
if(ORTE_SUCCESS != rc) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
}
|
|
|
|
|
2005-03-14 20:57:21 +00:00
|
|
|
cleanup:
|
|
|
|
|
|
|
|
while(NULL != (item = ompi_list_remove_first(&allocated)))
|
|
|
|
ompi_list_append(nodes, item);
|
|
|
|
OBJ_DESTRUCT(&allocated);
|
|
|
|
return rc;
|
|
|
|
}
|