call the memory pool to do special memory allocations, and extended
the mpool so that it will do the allocations and keep track of them in
a tree. Currently, if you pass MPI_INFO_NULL to MPI_Alloc_mem, we will
try to allocate the memory and register it with as many mpools as
possible. Alternatively, one can pass an info object with the names of
the mpools as keys, and from these we decide which mpools to register
the new memory with.
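For reference, a minimal usage sketch from the application side; the info key
and value below are hypothetical placeholders, since this commit only says that
mpool names are passed as keys in the info object:

/* Sketch of the two usages from the application side.  The info key
 * and value below are hypothetical placeholders -- the commit only
 * says that mpool names are passed as keys in the info object. */
#include <mpi.h>

static void alloc_mem_examples(void)
{
    void *buf1 = NULL, *buf2 = NULL;
    MPI_Info info;

    /* MPI_INFO_NULL: try to register the memory with as many mpools
     * as possible. */
    MPI_Alloc_mem(4096, MPI_INFO_NULL, &buf1);

    /* Info object: only register with the mpools named by its keys. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "sm", "true");   /* hypothetical key/value */
    MPI_Alloc_mem(4096, info, &buf2);
    MPI_Info_free(&info);

    MPI_Free_mem(buf2);
    MPI_Free_mem(buf1);
}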
- fixed some comments in the allocator and fixed a minor bug
- extended the red black tree test and made a minor correction
This commit was SVN r5902.
I spoke with Tim about this the other day -- he gave me the green
light to go ahead with this, but it turned into a bigger job than I
thought it would be. I revamped how the default RAS scheduling and
round_robin RMAPS mapping occur. The previous algorithms were pretty
brain dead, and ignored the "slots" and "max_slots" tokens in
hostfiles. I considered this a big enough problem to fix for the
beta (because there is currently no way to control where processes are
launched on SMPs).
There are still some bells and whistles that I'd like to implement,
but there's no hurry, and they can go on the trunk at any time. My
patches below are for what I considered "essential", and do the
following:
- honor the "slots" and "max-slots" tokens in the hostfile (and all
their synonyms), meaning that we allocate/map until we fill slots,
and if there are still more processes to allocate/map, we keep going
until we fill max-slots (i.e., only oversubscribe a node if we have
to).
- offer two different algorithms, currently supported by two new
options to orterun. Remember that there are two parts here -- slot
allocation and process mapping. Slot allocation controls how many
processes we'll be running on a node. After that decision has been
made, process mapping effectively controls where the ranks of
MPI_COMM_WORLD (MCW) are placed. Some of the examples given below
only make sense if you keep the difference between those two phases
in mind (a rough code sketch of both phases appears after the
examples):
1. "-bynode": allocates/maps one process per node in a round-robin
fashion until all slots on the node are taken. If we still have more
processes after all slots are taken, then keep going until all
max-slots are taken. Examples:
- The hostfile:
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -bynode -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 2
vogon: MCW ranks 1, 3, 4, 5
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4
vogon: MCW ranks 1, 3, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until each
node's max_slots is hit, of course)
- orterun -bynode -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 2, 4, 6
vogon: MCW ranks 1, 3, 5, 7, 8, 9, 10, 11
2. "-byslot" (this is the default if you don't specify -bynode):
greedily takes all available slots on a node for a job before moving
on to the next node. If we still have processes to allocate/schedule,
then oversubscribe all nodes equally (i.e., go round robin on all
nodes until each node's max_slots is hit). Examples:
- The hostfile
eddie slots=2 max-slots=4
vogon slots=4 max-slots=8
- orterun -np 6 -hostfile hostfile a.out
eddie: MCW ranks 0, 1
vogon: MCW ranks 2, 3, 4, 5
- orterun -np 8 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2
vogon: MCW ranks 3, 4, 5, 6, 7
-> the algorithm oversubscribes all nodes "equally" (until max_slots
is hit)
- orterun -np 12 -hostfile hostfile a.out
eddie: MCW ranks 0, 1, 2, 3
vogon: MCW ranks 4, 5, 6, 7, 8, 9, 10, 11
The above examples are fairly contrived, and they don't make it clear
that the two options can actually produce different allocations (the
mapping differences are obvious). Consider the following allocation
example:
- The hostfile
eddie count=4
vogon count=4
earth count=4
deep-thought count=4
- orterun -np 8 -hostfile hostfile a.out
eddie: 4 slots will be allocated
vogon: 4 slots will be allocated
earth: no slots allocated
deep-thought: no slots allocated
- orterun -bynode -np 8 -hostfile hostfile a.out
eddie: 2 slots will be allocated
vogon: 2 slots will be allocated
earth: 2 slots will be allocated
deep-thought: 2 slots will be allocated
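To make the two phases concrete, here is a small standalone sketch (an
illustration only, not the actual RAS/RMAPS code) that reproduces the
placements from the eddie/vogon examples above: allocate() plays the role of
slot allocation and map_ranks() plays the role of process mapping.

/* placement_sketch.c -- illustration only; not the real ORTE code. */
#include <stdio.h>

struct node {
    const char *name;
    int slots;      /* "slots" from the hostfile */
    int max_slots;  /* "max-slots" from the hostfile */
    int alloc;      /* phase 1 result: processes allocated here */
};

/* Phase 1 (slot allocation): decide how many processes each node runs.
 * byslot fills each node's "slots" greedily; bynode hands slots out
 * round-robin.  Either way, leftover processes are oversubscribed
 * round-robin until each node's max_slots is reached.  Returns 0 if
 * even the max_slots are not enough. */
static int allocate(struct node *n, int nn, int np, int bynode)
{
    int placed = 0;
    for (int pass = 0; pass < 2 && placed < np; ++pass) {
        int progress = 1;
        while (placed < np && progress) {
            progress = 0;
            for (int i = 0; i < nn && placed < np; ++i) {
                int limit = pass ? n[i].max_slots : n[i].slots;
                while (placed < np && n[i].alloc < limit) {
                    n[i].alloc++; placed++; progress = 1;
                    /* bynode: one per node per sweep; byslot is greedy
                     * only while filling "slots" (pass 0). */
                    if (bynode || pass) break;
                }
            }
        }
    }
    return placed == np;
}

/* Phase 2 (process mapping): assign MCW ranks to the allocated slots. */
static void map_ranks(struct node *n, int nn, int np, int bynode)
{
    int rank = 0;
    if (!bynode) {                  /* byslot: consecutive ranks per node */
        for (int i = 0; i < nn; ++i)
            for (int j = 0; j < n[i].alloc; ++j)
                printf("%s: MCW rank %d\n", n[i].name, rank++);
    } else {                        /* bynode: round-robin across nodes */
        int used[16] = { 0 };       /* sketch: assumes nn <= 16 */
        while (rank < np)
            for (int i = 0; i < nn && rank < np; ++i)
                if (used[i] < n[i].alloc) {
                    printf("%s: MCW rank %d\n", n[i].name, rank++);
                    used[i]++;
                }
    }
}

int main(void)
{
    struct node nodes[] = { { "eddie", 2, 4, 0 }, { "vogon", 4, 8, 0 } };
    int np = 8, bynode = 1;         /* "orterun -bynode -np 8 ..." */

    if (allocate(nodes, 2, np, bynode))
        map_ranks(nodes, 2, np, bynode);
    else
        printf("not enough max_slots for %d processes\n", np);
    return 0;
}

With np = 8 and bynode = 1 this prints eddie for MCW ranks 0, 2, 4 and vogon
for ranks 1, 3, 5, 6, 7; with bynode = 0 it prints eddie for 0, 1, 2 and vogon
for 3 through 7, matching the examples above.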
This commit was SVN r5894.
remove the MPI_ERR_INIT_FINALIZE() macro. Also check to see how we
invoke the errhandler if an error occurs (i.e., the action depends on
whether we're between MPI_INIT and MPI_FINALIZE or not).
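A rough sketch of the pattern being described; the flag names and the helper
below are hypothetical stand-ins, not the real OMPI internals:

/* Sketch only -- flag names and helper are hypothetical stand-ins. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

static int mpi_is_initialized = 0;  /* would be set by MPI_Init */
static int mpi_is_finalized   = 0;  /* would be set by MPI_Finalize */

static int handle_error(MPI_Comm comm, int errcode, const char *func)
{
    if (mpi_is_initialized && !mpi_is_finalized) {
        /* Between MPI_INIT and MPI_FINALIZE: hand the error to the
         * communicator's errhandler and let it decide what happens. */
        MPI_Comm_call_errhandler(comm, errcode);
        return errcode;
    }
    /* Before MPI_INIT or after MPI_FINALIZE there is no errhandler to
     * invoke, so all we can do is complain and abort. */
    fprintf(stderr, "%s called before MPI_INIT or after MPI_FINALIZE\n", func);
    abort();
}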
This commit was SVN r5891.
additions from his previous commit:
- Properly propagate the error upwards if we have a localhost+other_node
error
- Added logic to handle multiple instances of the same hostname
- Added logic to properly increment the slot count for multiple
instances. For example, a hostfile with:
foo.example.com
foo.example.com slots=4
foo.example.com slots=8
would result in a single host with a slot count of 13 (i.e., if no
slot count is specified, 1 is assumed). See the sketch after this
list.
- Revised the localhost logic a bit -- some cases are ok (e.g.,
specifying localhost multiple times is ok, as long as there are no
other hosts)
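A rough sketch of the bookkeeping described in the last three bullets
(illustrative only, not the actual hostfile component): repeated hostnames
collapse into a single entry whose slot counts accumulate, a line without a
slot count contributes 1, and localhost may not be mixed with other hosts.

/* Illustrative only -- not the real hostfile component. */
#include <stdio.h>
#include <string.h>

#define MAX_HOSTS 64

struct host { char name[256]; int slots; };

static struct host hosts[MAX_HOSTS];
static int num_hosts = 0;

/* Add one parsed hostfile line: hostname plus an optional slot count
 * (0 means "not specified", which counts as 1).  Returns 0 on success. */
static int add_host(const char *name, int slots)
{
    int count = (slots > 0) ? slots : 1;   /* unspecified => 1 */

    /* "localhost" may appear (even repeatedly), but only by itself. */
    int is_local = (0 == strcmp(name, "localhost"));
    if (num_hosts > 0) {
        int have_local = (0 == strcmp(hosts[0].name, "localhost"));
        if (is_local != have_local) {
            fprintf(stderr, "error: localhost cannot be mixed with other hosts\n");
            return -1;
        }
    }

    /* Multiple instances of the same hostname accumulate their slots,
     * e.g. "foo", "foo slots=4", "foo slots=8" => one host, 13 slots. */
    for (int i = 0; i < num_hosts; ++i) {
        if (0 == strcmp(hosts[i].name, name)) {
            hosts[i].slots += count;
            return 0;
        }
    }
    if (num_hosts == MAX_HOSTS) return -1;
    snprintf(hosts[num_hosts].name, sizeof(hosts[num_hosts].name), "%s", name);
    hosts[num_hosts].slots = count;
    num_hosts++;
    return 0;
}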
This commit was SVN r5886.
The problem was that the displacement was increased even when the current memcpy completely
succeeded. It's not a problem for most of the cases ... except when we completely finish a
datatype.
This commit was SVN r5885.
Long explanation: Jeff and I spent some time chasing this down today (mostly Jeff), and found that the Mac was having problems with the replacement of "localhost" with the local nodename when we read the hostfile. Jeff then found that the Linux documentation specifically warns about the vagueness of the value returned for "nodename" (see the man page for uname for details). Sooo....when we replaced "localhost" with the local "nodename", the system couldn't figure out what node we were referring to when we tried to launch.
Solution (borrowed from LAM): if the user includes "localhost" in the hostfile, then we do NOT allow any other entries in the hostfile - the presence of another entry will generate an error message and cause mpirun to gracefully exit. Obviously, if "localhost" is specified in the hostfile, we are running the application locally.
This commit was SVN r5881.
- creating the stack now works even for contiguous data (with gaps around it),
independent of the fragment size.
- add a TYPE argument to the PUSH_STACK macro. It's too obscure to explain here :)
- in dt_add we avoid surrounding a datatype with loops if we can handle it by increasing the
count of the datatype (only if the datatype contains one type element and if the extent
matches). This is enough to speed up the packing/unpacking of all composed predefined
datatypes (like MPI_COMPLEX and co.) quite a bit; see the sketch after this list.
- in dt_module.c improve the handling of the flags for all composed predefined
datatypes. There is still something to do for the Fortran datatypes, but it will be in
the next commit.
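A generic illustration of the dt_add shortcut mentioned above (not the actual
dt_add() code; the structures are simplified stand-ins): when the child
datatype is described by a single element that exactly fills its extent, N
contiguous copies can be folded into the element's count instead of being
wrapped in loop descriptors.

/* Illustration of the shortcut only -- not the actual dt_add() code. */
#include <stddef.h>

typedef struct {
    int    type;     /* id of a predefined type */
    size_t count;    /* how many of them */
    size_t extent;   /* extent of one element, padding included */
} elem_desc_t;

typedef struct {
    elem_desc_t *desc;    /* element descriptions (room assumed below) */
    size_t       used;    /* number of descriptions in use */
    size_t       extent;  /* extent of the whole datatype */
} dt_t;

/* Add 'count' back-to-back copies of 'child' to 'parent'.  Returns 1
 * when the cheap path (just scale the count) applies, 0 when proper
 * loop_start/loop_end descriptions would be needed (omitted here). */
static int dt_add_contig(dt_t *parent, const dt_t *child, size_t count)
{
    /* Cheap path: the child has a single element description and that
     * element exactly fills the child's extent, so N contiguous copies
     * are the same thing as one description with a larger count. */
    if (1 == child->used &&
        child->desc[0].count * child->desc[0].extent == child->extent) {
        parent->desc[parent->used] = child->desc[0];
        parent->desc[parent->used].count *= count;
        parent->used += 1;
        parent->extent += count * child->extent;
        return 1;
    }
    return 0;   /* would need to surround the child with a loop */
}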
This commit was SVN r5879.
entire explanation ;-) )
Our Abaqus friends just pointed out another bug to me. We have the
"-x" option to orterun to export environment variables to
newly-started processes. However, it doesn't work if the environment
variable is already set in the target environment. For example:
mpirun -x LD_LIBRARY_PATH -np 2 a.out
The app context correctly contains LD_LIBRARY_PATH and its value, and
that app context correctly propagates out to the orted and is present
when we fork/exec a.out. However, if LD_LIBRARY_PATH is already set
in the newly-started process' environment, the fork pls won't override
it with the value from the app context.
It really only has to do with the ordering of arguments in
ompi_environ_merge() -- when merging two env arrays together, we
"prefer" one set to the other if there are duplicate names. I think
that if the user wants to override variables (even variables like
LD_LIBRARY_PATH), we should let them -- it's necessary for some
applications (like in Abaqus' case). If they screw it up, it's their
fault (e.g., setting some LD_LIBRARY_PATH that won't work).
That being said, we should *not* allow them to override specific MCA
parameters that are necessary for startup -- that's easy to accomplish
by setting up that stuff *after* we merge in the context app
environment.
Also note that I am *only* speaking about the fork pls here -- so this
only applies to started ORTE job processes, not the orted.
So an easy re-ordering to do the following:
env_copy = merge(environ and context->app)
ompi_setenv(...MCA params necessary for startup..., env_copy)
execve(..., env_copy)
does what we want.
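In plain C terms, the re-ordering looks roughly like the sketch below; the env
helper and the MCA parameter name are illustrative stand-ins, not the real
ompi_environ_merge()/ompi_setenv() APIs.

/* Sketch of the re-ordering described above.  The helper below is an
 * illustrative stand-in, not the real ompi_environ_merge()/ompi_setenv(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

extern char **environ;

/* Put "NAME=value" into a growable NULL-terminated env array,
 * replacing an existing NAME if present ("prefer" the new value). */
static char **env_put(char **env, const char *entry)
{
    size_t namelen = strcspn(entry, "=");
    size_t n = 0;
    while (env[n] != NULL) {
        if (0 == strncmp(env[n], entry, namelen) && env[n][namelen] == '=') {
            env[n] = strdup(entry);          /* override duplicate name */
            return env;
        }
        ++n;
    }
    env = realloc(env, (n + 2) * sizeof(char *));
    env[n] = strdup(entry);
    env[n + 1] = NULL;
    return env;
}

static void launch(const char *cmd, char *const argv[], char **app_env)
{
    /* Start from a copy of our own environment ... */
    size_t n = 0;
    while (environ[n] != NULL) ++n;
    char **env_copy = malloc((n + 1) * sizeof(char *));
    memcpy(env_copy, environ, (n + 1) * sizeof(char *));

    /* 1. Merge in the app context, preferring its values: the user's
     *    "-x LD_LIBRARY_PATH" wins over whatever is already set. */
    for (size_t i = 0; app_env != NULL && app_env[i] != NULL; ++i)
        env_copy = env_put(env_copy, app_env[i]);

    /* 2. Only now force the MCA parameters needed for startup, so the
     *    user cannot clobber them (placeholder name below). */
    env_copy = env_put(env_copy, "OMPI_MCA_startup_param=required");

    /* 3. Launch with the assembled environment. */
    execve(cmd, argv, env_copy);
    perror("execve");                        /* only reached on error */
}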
This commit was SVN r5878.
split between OMPI and ORTE, added a lengthy comment to ompi_bitmap.h
explaining the reason why (and how it would be fine to re-merge them
-- if someone has the time) and references to it from all the other
relevant .h files.
This commit was SVN r5876.
Add the logic to properly assign new cellid's to hosts read in by the hostfile component. However, don't turn it on yet.
It seems that the code base has (unfortunately) assumed that cellid is always zero. When I turn on the cellid capability, the system "hangs" whenever the cellid is non-zero. I'll have to chase that problem down. For now, I've turned "off" the cellid assignment in the hostfile component.
This commit was SVN r5865.
1. Fixed the GPR search engine so that AND'ing keys worked, and so that multiple objects with the same key didn't mess up the search (a generic illustration follows this list).
2. Added an orte_bitmap function based on the existing ompi_bitmap one, but minus the fortran "pollution"
3. Added a new name service function called create_my_name to remove the duplicate name creation that was happening with the RML. Basically, the RML has to assign a name when a process makes first contact if the process doesn't already have a name. For processes that get a name passed into them, this was okay - the name was already assigned. For other processes (e.g., singletons), this was not okay - the first message to the seed daemon was to create a name, which caused the RML to assign one, and then the name service to assign another.
4. Changed orted so it gets its name the way everyone else does - during orte_init.
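For item 1, a generic illustration of what a correct AND match has to do (not
the actual GPR code): every requested key must be present in the container,
and duplicate instances of one key must not be counted as matches for the
others.

/* Generic illustration of the AND-match fix -- not the actual GPR code. */
#include <stdbool.h>
#include <string.h>

/* Does 'container' (an array of 'num_keyvals' keys, duplicates allowed)
 * satisfy an AND search for all 'num_wanted' keys in 'wanted'? */
static bool and_match(const char **container, int num_keyvals,
                      const char **wanted, int num_wanted)
{
    for (int w = 0; w < num_wanted; ++w) {
        bool found = false;
        for (int k = 0; k < num_keyvals && !found; ++k) {
            if (0 == strcmp(wanted[w], container[k])) {
                found = true;          /* count each wanted key once */
            }
        }
        if (!found) {
            return false;              /* AND: one missing key => no match */
        }
    }
    return true;
}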
This commit was SVN r5842.