We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch. which required modification
s to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunti
ng.
This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't curren
tly have a way of tagging the collective to a specific thread. From the comment in the code:
* In order to allow scalable
* generation of collective id's, they are formed as:
*
* top 32-bits are the jobid of the procs involved in
* the collective. For collectives across multiple jobs
* (e.g., in a connect_accept), the daemon jobid will
* be used as the id will be issued by mpirun. This
* won't cause problems because daemons don't use the
* collective_id
*
* bottom 32-bits are a rolling counter that recycles
* when the max is hit. The daemon will cleanup each
* collective upon completion, so this means a job can
* never have more than 2**32 collectives going on at
* a time. If someone needs more than that - they've got
* a problem.
*
* Note that this means (for now) that RTE-level collectives
* cannot be done by individual threads - they must be
* done at the overall process level. This is required as
* there is no guaranteed ordering for the collective id's,
* and all the participants must agree on the id of the
* collective they are executing. So if thread A on one
* process asks for a collective id before thread B does,
* but B asks before A on another process, the collectives will
* be mixed and not result in the expected behavior. We may
* find a way to relax this requirement in the future by
* adding a thread context id to the jobid field (maybe taking the
* lower 16-bits of that field).
This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives.
This commit was SVN r32203.
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.
Refs trac:4003
This commit was SVN r30033.
The following Trac tickets were found above:
Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003
Fix two problems that surfaced when using direct launch under SLURM:
1. locally store our own data because some BTLs want to retrieve
it during add_procs rather than use what they have internally
2. cleanup MPI_Abort so it correctly passes the error status all
the way down to the actual exit. When someone implemented the
"abort_peers" API, they left out the error status. So we lost
it at that point and *always* exited with a status of 1. This
forces a change to the API to include the status.
cmr:v1.7.3:reviewer=jsquyres:subject=Fix MPI_Abort and modex_recv for direct launch
This commit was SVN r29405.
The intercomm "merge" function can create a linkage between procs that was not reflected anywhere in a modex, and so at least some of the procs in the resulting communicator don't know how to talk to some of the new communicator's peers.
For example, consider the case where:
1. parent job A comm_spawns a process (job B) - these processes exchange modex and can communicate
2. parent job A now comm_spawns another process (job C) - again, these can communicate, but the proc in C knows nothing of B
3. do an intercomm merge across the communicators created by the two comm_spawns. This puts B and C into the same communicator, but they know nothing about how to talk to each other as they were not involved in any exchange of contact info. Hence, collectives on that communicator now fail.
This fix adds an API to the ompi/dpm framework that (a) exchanges the modex info across the procs in the merge to ensure all procs know how to communicate, and (b) calls add_procs to give the btl's a chance to select transports to any new procs.
cmr:v1.7.3:reviewer=jsquyres
This commit was SVN r29166.
The following Trac tickets were found above:
Ticket 2904 --> https://svn.open-mpi.org/trac/ompi/ticket/2904
* paccept - establish a persistent listening port for async connect requests
* pconnect - async connect to remote process that has posted a paccept port. Provides a timeout mechanism, and allows the underlying implementation to retry until timeout
* pclose - shuts down a prior paccept posting
Includes example programs paccept.c and pconnect.c in orte/test/mpi. New MPI extension interfaces coming...
This commit was SVN r29063.
additional functionality. Rationale (refs trac:3422):
* Normal MPI applications only ever use the MPI API. Hence, -lmpi is
sufficient (they'll never directly call ORTE or OPAL
functions). This is arguably the most common case.
* That being said, we do have some test programs (e.g., those in
orte/test/mpi) that call MPI functions but also call ORTE/OPAL
functions. I've also written the occasional MPI test program that
calls opal_output, for example (there even might be a few tests in
the IBM test suite that directly call ORTE/OPAL functions).
* Even though this is not a common case, these applications should
also compile/link with mpicc.
* So we should add a --openmpi:linkall option that will also link
in whatever is necessary to call ORTE/OPAL functions
* Yes, we could hard-code "-lopen-rte -lopen-pal" in Makefiles, but
we do reserve the right to change those library names and/or add
others someday, so it's better to abstract out the names and let
the wrapper supply whatever is necessary.
* ORTE programs, however, are different. They almost always call OPAL
functions (e.g., if they want to send a message, they must use the
OPAL DSS). As such, it seems like the ORTE programs should always
link in OPAL.
Therefore:
* Add undocumented --openmpi:linkall flag to the wrapper compilers.
See the comment in opal_wrapper.c for an explanation of what it
does. This flag is only intended for Open MPI developers -- not
end users. That's why it's undocumented.
* Update orte/test/mpi/Makefile.am to add --openmpi:linkall
* Make ortecc/ortec++'s wrapper data text files always explicitly
link in libopen-pal
This commit was SVN r27670.
The following SVN revision numbers were found above:
r27668 --> open-mpi/ompi@cf845897aa
The following Trac tickets were found above:
Ticket 3422 --> https://svn.open-mpi.org/trac/ompi/ticket/3422
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context
Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error is someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.
This commit was SVN r27005.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.
1. no binding support - indicated by a negative return code from get_cpubind
2. binding supported, but not bound - the bitset returned by get_cpubind is the same as the available cpuset
3. binding supported and bound - bitset from get_cpubind is a subset of available cpuset
4. only one cpu is available - in this case, get_cpubind matches the available cpuset, but we are effectively bound
This commit was SVN r25957.
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them.
Add a test provided by Philippe that tests this ability.
This commit was SVN r23438.
Add new mca params to test:
orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch
orte_debugger_test_attach: Test debugger colaunch after debugger attachment
To test co-launch at job start, just set the orte_debugger_test_daemon param.
To test co-launch upon attach:
set orte_debugger_test_daemon
set orte_debugger_test_attach=1
set orte_enable_debug_cospawn_while_running=1
set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching
Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon.
This commit was SVN r23144.