1. new functionality in the pls base to check for reusable daemons and launch upon them
2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED
3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse".
This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on.
This commit was SVN r12608.
1. use non-blocking sends to transmit commands (this was actually done in a prior commit)
2. have an "ack" message sent back from the orted when it completes the command
The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occassionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done.
Best of both worlds.
This commit was SVN r12605.
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
deploy: will create a full binary distribution (.msi file)
ompi-small: will create an update binary distribution(
no include files, nothing else except the libs and execs).
ompi_install.sh: sh script to install manually into a specific
target directory.
This commit was SVN r12589.
Add some debugger output to the ODLS default component.
Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time.
This commit was SVN r12585.
Same sort of problem and fix as described in r12323 - mca_pml_ob1_recv_frag_progress() was segfaulting due to a NULL req_proc pointer. The path leading to this was through the mca_pml_ob1_check_cantmatch_for_match() function, where we can match a frag using the same macros as mca_pml_ob1_frag_match() and never initialize the req_proc pointer.
This commit was SVN r12582.
The following SVN revision numbers were found above:
r12323 --> open-mpi/ompi@c752502dee
Make it so if -np was not passed and -pernode was, we map bynode
This commit was SVN r12580.
The following Trac tickets were found above:
Ticket 612 --> https://svn.open-mpi.org/trac/ompi/ticket/612
Add some debugging output to the ODLS default module, and the orted.
Remove the nodename data from the ODLS info report - that info is already stored in the registry by the RMAPS framework upon completing the mapping procedure.
Add another test program that does an ORTE-only dynamic spawn (gasp!). Looks just like comm_spawn - just no MPI involved.
Modify the ODLS to release the processor when we "kill" local procs in a more scalable fashion. It previously had a sleep in it that Jeff's prior commit removed. However, he introduced some Windows code into the non-Windows component (protected by "if"s, but unnecessary). This is a more general solution he proposed - included here so I could get things to compile properly.
This commit was SVN r12579.
The fix is to set opcode to SEND at the entrance to the send function before
checking credits and putting fragment to the pending list. We do the same thing
in put/get functions i.e setting opcode at the entrance to the function.
This commit was SVN r12559.
1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs.
2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system.
This commit was SVN r12558.
- consistent arguments checking (not allowing to select an algorithm which
is not available)
- consistent way of computing the segcount (number of datatypes by segment).
- small cleanups.
- more informative debugging messages.
This commit was SVN r12545.
description. Most of the bcast algorithms can be completed using this
generic function once we create the tree structure. Add all kind of
trees.
There are 2 versions of the generic bcast function. One using overlapping
between receives (for intermediary nodes) and then blocking sends to all
childs and another where all sends are non blocking. I still have to
figure out which one give the smallest overhead.
This commit was SVN r12530.
is that if one add "pml=" to the configuration file, really bad things
happen. All PMLs will get initialize, and each of them will initialize
all BTLs. This patch force the mca_pml_base_pml to get initialized in
all cases before we go out of the mca_pml_base_open function.
This commit was SVN r12527.