When we read the input buffer, we don't always get a complete printf output - we sometimes end mid stream. We still need to add the suffix and a <CR> to keep the output working right.
This commit was SVN r21706.
"critical" - any error at or above the critical severity will be reported (i.e., only critical errors)
"warning" - any error at or above the warning severity will be reported (i.e., warning and critical errors)
"notice" - pretty much everything will be reported
Default to "critical" to keep down the chatter.
Obviously, only places that call orte_notifier will be affected - all other error reporting (e.g., via opal_output calls) is unaffected.
This commit was SVN r21693.
Add ability to store the RM's jobid string to tag the notifier message so that the sys admin knows what job had the problem.
This commit was SVN r21687.
Remove all architecture references from ORTE and put them back in the modex using modex_send/recv calls.
Hetero operations are now fully supported again. Comm_spawn now works up to the point where it segfaults due to an error in the CID code - which now allows Edgar to dig further! :-)
This commit was SVN r21655.
making it possible to claim relative hosts from the hostfile/scheduler
by using +n# hostname, where 0 <= # < np
ex:
cat ~/work/svn/hpc/dev/test/Rankfile/rankfile
rank 0=+n0 slot=0
rank 1=+n0 slot=1
rank 2=+n1 slot=2
rank 3=+n1 slot=1
This commit was SVN r21557.
prepare the send buffer, and post the collective order to the local daemon. It
then register the callback and return fromthe modex exchange. It will only
wait for this modex completion when the modex_recv is called. Meanwhile, the
daemon will do the allgather.
This commit was SVN r21543.
Add two options for plm process module, i.e. remote_env_prefix for getting OMPI prefix on remote computer by reading its user environment variable (OPENMPI_HOME), and remote_reg_prefix is similar, but it reads the registry on the remote computer. Reading remote env prefix has a higher priority than reading reg prefix, so that user can use their own installation of OMPI.
This commit was SVN r21538.
It never set the daemon launch counter before launching.
In the output for 'total job launch time' make the message match that of the SLURM PLM for easier parsing.
This commit was SVN r21527.
coded binomial approach. Now, the PLM extract the children information from the
routed component, and the startup follow the routed topology. As a side effect,
we can now launch using linear or radix topologies, in addition to the previously
binomial topology.
This commit was SVN r21514.
the message to update the structures, but instead use the information from
the URI. The reason is that even the launch report messages can get routed.
Deal with the orted_cmd_line in a single location.
This commit was SVN r21513.
Normally, any non-windows nodes should return NodeStatus_Unreachable, but that's not true. We have noticed that Scientific Linux cluster nodes, which are in the same subnet as the Windows nodes, return NodeStatus_PendingApproval value, and the RAS CCP takes them as Windows nodes. So let's just use the "Ready" nodes.
This commit was SVN r21510.
that now the HNP send the messages using the routed component. In the case
of tree spawn, when a intermediary node spawn a child it doesn't know how
to forward a message to it, so when the node-map message is coming from
the HNP (as there is nothing yet in the contact/routing table) the message
is sent back the way it came. As a result the node-map message keeps jumping
between the HNP and the first level orteds.
The solution is to add a new option to the children orte_parent_uri, which
is only set when the orted is _not_ directly spawned by the HNP. When this
option is present on the argument list, the orted will add the parent to
its routing, and force the parent to update his routes (by sending the URI).
With this approach, the routing tree is build in same time as the processes
are spawned, and all messages from the HNP can be routed to the leaves.
However, this is far from an optimal solution. Right now, this so called tree
spawn, only spawn the children in a tree without doing anything about the
"connect back to the HNP" step. The HNP is flooded with reports from all the
orted. The total number of messages is higher than in the non tree startup
scheme, so we do not expect this approach to be scalable in the current
incarnation. A complete overhaul of the tree startup is required in order
improve the scalability. Stay tuned!
This commit was SVN r21504.
The open/select of the PLM is done in orte/mca/ess/base/ess_base_std_orted.c. It only is done when the PLM MCA param is set directing a specific PLM be selected. The function
orte_plm_base_orted_append_basic_args
clears the params passed to the daemon of any PLM selection passed to the HNP. Each PLM then adds a PLM directive if-and-only-if backend PLM support is desired. At present, Torque, SLURM, and rsh all specify this support and direct that the backend orted open the "rsh" PLM.
This commit was SVN r21488.
The following SVN revision numbers were found above:
r21480 --> open-mpi/ompi@ed585bce8a
If we do not initialize the PML, non-HNP daemons will not be able to use its functions. For example, RSH needs it when the tree_spawn mode is
enabled: daemons call orte_pml.remote_spawn() function to spawn their children in the deployment tree.
This commit was SVN r21480.
Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn.
This commit was SVN r21390.
in this directory and it's causing "make dist" to break.
Shiqing -- is there a missing file in this directory? If so, please
add it and restore the EXTRA_DIST line I just removed. Thanks!
This commit was SVN r21340.
way to have no abort message is to pass NULL (the errmanager is smart
enough to handle this case and not emit any extra message).
This commit was SVN r21311.
Emit a more informative error message when the file descriptor limit is
reached during an accept() call. Also, abort when the accept fails to
avoid an infinite loop.
Emit a more informative error message when the help file can't be opened.
This commit was SVN r21271.
The following Trac tickets were found above:
Ticket 1930 --> https://svn.open-mpi.org/trac/ompi/ticket/1930
messages when delivering a signal (like STOP or CONT)
to a non-existant process. This fixes trac:1929.
Also, only print one error message in the other cases.
This commit was SVN r21263.
The following Trac tickets were found above:
Ticket 1929 --> https://svn.open-mpi.org/trac/ompi/ticket/1929
Add a new tm ess module that exploits this capability.
Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function.
Additional fixes included:
1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with!
2. fix a duplicate free when attempting to execute a non-existent app
3. cleanup an typo in the comm utilities
4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps
Been tested on slurm and tm machines, using all tests in orte/test/mpi. May run into issue with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with addition of regexp generator to compress that info.
This commit was SVN r21248.
1. replacing mpi_paffinity_alone with opal_paffinity_alone - for back-compatibility, I have aliased mpi_paffinity_alone to the new param name. This caus
es a mild abstraction break in the opal/mca/paffinity framework - per the devel discussion...live with it. :-) I also moved the ompi_xxx global variable
that tracked maffinity setup so it could be properly closed in MPI_Finalize to the opal/mca/maffinity framework to avoid an abstraction break.
2. Added code to the odls/default module to perform paffinity binding and maffinity init between process fork and exec. This has been tested on IU's odi
n cluster and works for both MPI and non-MPI apps.
3. Revise MPI_Init to detect if affinity has already been set, and to attempt to set it if not already done. I have *not* tested this as I haven't yet f
igured out a way to do so - I couldn't get slurm to perform cpu bindings, even though it supposedly does do so.
This has only been lightly tested and would definitely benefit from a wider range of evaluation...
This commit was SVN r21209.
This causes the orteds in the routing tree to remain alive until all termination "acks" from orteds below them have passed through. Thus, if we use static ports, we no longer require a direct orted-to-mpirun connection.
Also modify the binomial routed module so it conforms to what all the other routed modules do and have all messages pass along the routing tree instead of short-circuiting between orteds. This further reduces the number of ports being opened on backend nodes.
This commit was SVN r21203.
Unfortunately, we assign the jobid during the plm.spawn procedure - which means it happens -after- control of the job has passed out of the range of mpirun (or whatever program is spawning the job), so it is too late for that main program to register a callback function. If the main program registers tha callback -after- we return from plm.spawn, then it (a) cannot get a callback for failed-to-start, and (b) will miss the callback if a proc aborts in the time between job launch and the call to errmgr.register_callback.
This commit fixes the problem by adding callback-related fields to the orte_job_t object. Thus, the main program can specify what job states should initiate a callback, what function is to be called, and what data is to be passed back by simply filling in the orte_job_t fields prior to calling plm.spawn.
Also, fully implement the "copy" function for the orte_job_t object.
NOTE: as a result of this change, the errmgr.register_callback API may no longer be of any value.
This commit was SVN r21200.
* Pass the sequence number of the checkpoint along with reference from the global to the local coordinator.
* 'orte-restart --apponly' now just generates the app context file, and does not run with it. This provides the user the ability to edit the file before launching.
* Add a OPAL_CRS_NONE state
* Split the INC into three distinct parts.
* Implement a restart mechanism for the 'none' component. If given a context it simply execvp()'s it.
This commit was SVN r21195.
* Add 'orte-checkpoint -l' option that lists all checkpoints currently available on the system.
* Add 'orte-restart -i' which prints information regarding the checkpoint targeted for restart.
* Add ability to extract the timing metadata.
* Fix show_help() in the orte-checkpoint and orte-restart tools. They should be using the opal versions instead of the orte versions (otherwise nothing is printed).
This commit was SVN r21194.
OMPI_* to OPAL_*. This allows opal layer to be used more independent
from the whole of ompi.
NOTE: 9 "svn mv" operations immediately follow this commit.
This commit was SVN r21180.
This patch contains the following items:
* Fix the flag passed to open() for the read side of the named pipe between the local and app coordinator. There is a race condition when using O_RDWR on a named pipe (not sure how that bug got in there in the first place).
* Adjust control in the C/R thread timing
* Clarify return code in BLCR component
* Allow the user to adjust the max wait time for the named pipes in the FileM local coordinator by using the MCA parameter "snapc_full_max_wait_time" (Default: 20 seconds)
* If the application terminates while there are active FileM operations, force mpirun to wait on these operations to complete.
* Allow the user to set the local copy command (Default: cp) via MCA parameter "filem_rsh_cp"
* Implement the ability to throttle the number of outgoing connections in FileM. At larger scales this type of explicit throttling helps prevent overwhelming the HNP machine. Default: 10, set via MCA parameter: {{{filem_rsh_max_outgoing}}}
This commit was SVN r21167.
The following SVN revision numbers were found above:
r21131 --> open-mpi/ompi@0deb009225
* Improved timing in SnapC Full Global Coordinator
* Improved scalability of the SnapC Full protocol
* Minor improvements to the error reporting mechanisms in SnapC and FileM
* Improved the memory usage of the metadata routines - now the owner of the data is more explicit.
* Added a FileM hint to indicate when files stored locally can be moved to/from a globally mounted file system using just the 'cp' command instead of the 'rcp/scp' command. Slightly improves performance, but not too drastically. Can be set using the following SnapC MCA parameter: {{{snapc_base_global_shared=1}}}
* Implement the ability to throttle the number of outgoing connections in FileM. At larger scales this type of explicit throttling helps prevent overwhelming the HNP machine. Default: 10, set via MCA parameter: {{{filem_rsh_max_outgoing}}}
* Add a few diagnostic/debugging features to SnapC and FileM.
This commit was SVN r21131.
Since I already had some changes in there, add in the rmaps rank_file changes - should work okay, but not fully tested.
This commit was SVN r21099.
The following SVN revision numbers were found above:
r21097 --> open-mpi/ompi@88ae934c26
- Delete unnecessary header files using
contrib/check_unnecessary_headers.sh after applying
patches, that include headers, being "lost" due to
inclusion in one of the now deleted headers...
In total 817 files are touched.
In ompi/mpi/c/ header files are moved up into the actual c-file,
where necessary (these are the only additional #include),
otherwise it is only deletions of #include (apart from the above
additions required due to notifier...)
- To get different MCAs (OpenIB, TM, ALPS), an earlier version was
successfully compiled (yesterday) on:
Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled
Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled
Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled
This commit was SVN r21096.
several header files (previously included by header-files)
now have to be moved "upward".
This is mainly system headers such as string.h, stdio.h and for
networking, but also some orte headers.
This commit was SVN r21095.
any arbitrary command as a notifier, potentially allowing just about
anything to be a notifier. This component forks a child during
orte_init() to avoid forking problems with some OS-bypass networks.
The following MCA parameters are available:
notifier_command_cmd:
Default: /sbin/initlog -f $s -n "Open MPI" -s "$S: $m (errorcode: $e)"
Command to execute, with substitution. $s = integer severity; $S =
string severity; $e = integer error code; $m = string message
notifier_command_timeout:
Default: 30
Timeout (in seconds) of the command
This commit was SVN r21076.
We currently apply all of the MCA params in the parent job to the child. This commit allows a user to specify additional params for the child job, and to override any pre-existing params with the new value so they can better control behavior of the child job.
This commit was SVN r20989.
generate mangled windex files. Made ompi-top.1 and ompi-iof.1 build
by default. Also, added the orte-top synonym to the ompi-top manpage.
This commit was SVN r20915.
As a safeguard, good coding style should never access directly opal_pointer_array_t->addr or opal_value_array_t->bytes_array. I found another instance of the same bug somewhere else and will commit a separate patch for it.
This commit fixes ticket #1858 and solves user case http://www.open-mpi.org/community/lists/devel/2009/03/5731.php .
Aurelien
This commit was SVN r20903.
- This patch solely _adds_ required headers and is rather localized
The next patch (after RFC) heavily removes headers (based on script)
- ompi/communicator/communicator.h: For sources that use
ompi_mpi_comm_world, don't require them to include "mpi.h"
- ompi/debuggers/ompi_common_dll.c: mca_topo_base_comm_1_0_0_t needs
#include "ompi/mca/topo/topo.h"
- ompi/errhandler/errhandler_predefined.h:
ompi/communicator/communicator.h depends on this header file!
To prevent recursion just have fwd declarations.
#include "ompi/types.h" for fwd declarations of the main structs.
- ompi/mca/btl/btl.h: #include "opal/types.h" for ompi_ptr_t
- ompi/mca/mpool/base/mpool_base_tree.c: We use ompi_free_list_t and
ompi_rb_tree_t, so have the proper classes
- ompi/mca/op/op.h:
Op is pretty self-contained: Nobody up to now has done
#include "opal/class/opal_object.h"
- ompi/mca/osc/pt2pt/osc_pt2pt_replyreq.h:
#include "opal/types.h" for ompi_ptr_t
- ompi/mca/pml/base/base.h:
We use opal_lists
- ompi/mca/pml/dr/pml_dr_vfrag.h:
#include "opal/types.h" for ompi_ptr_t
- ompi/mca/pml/ob1/pml_ob1_hdr.h:
#include "ompi/mca/btl/btl.h" for mca_btl_base_segment_t
- opal/dss/dss_unpack.c:
#include "opal/types.h"
- opal/mca/base/base.h:
#include "opal/util/cmd_line.h" for opal_cmd_line_t
- orte/mca/oob/tcp/oob_tcp.c:
#include "opal/types.h" for opal_socklen_t
- orte/mca/oob/tcp/oob_tcp.h:
#include "opal/threads/threads.h" for opal_thread_t
- orte/mca/oob/tcp/oob_tcp_msg.c:
#include "opal/types.h"
- orte/mca/oob/tcp/oob_tcp_peer.c:
#include "opal/types.h" for opal_socklen_t
- orte/mca/oob/tcp/oob_tcp_send.c:
#include "opal/types.h"
- orte/mca/plm/base/plm_base_proxy.c:
#include "orte/util/name_fns.h" for ORTE_NAME_PRINT
- orte/mca/rml/base/rml_base_receive.c:
#include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE
- orte/mca/rml/oob/rml_oob_recv.c:
#include "opal/types.h" for ompi_iov_base_ptr_t
- orte/mca/rml/oob/rml_oob_send.c:
#include "opal/types.h" for ompi_iov_base_ptr_t
- orte/runtime/orte_data_server.c
#include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE
- orte/runtime/orte_globals.h:
#include "orte/util/name_fns.h" for ORTE_NAME_PRINT
Tested on Linux/x86-64
This commit was SVN r20817.
In case we use memcmp, strlen, strup and friends include <string.h>
Also several constants.h are not included directly
- Let's have mca_topo_base_cart_create return ompi-errors in
ompi/mca/topo/base/topo_base_cart_create.c
This commit was SVN r20773.
(http://www.stafford.uklinux.net/libesmtp/) via the --with-esmtp(=DIR)
configure option. Several MCA parameters must be set in order to use
this component:
* notifier_smtp_server: SMTP server IP address or name; must be supplied
* notifier_smtp_port: port to talk to on the server; defaults to 25
* notifier_smtp_to: comma-delimited list of email addresses to send
the mail to; must be supplied
* notifier_smtp_from_name: free-form "name" who the mail is from;
defaults to "Open MPI Notifier"
* notifier_smtp_from_addr: email address from the mail is from; must
be supplied
* notifier_smtp_subject: subject of the mail; defaults to "Open MPI
notifier"
* notifier_smtp_body_prefix: prefix of the body of the mail; defaults
to a sensible value
* notifier_smtp_body_suffix: suffix of the body of the mail; defaults
to a sensible value
Also libesmtp supports SMTP AUTH protocols, this component does not.
If people want/need those kinds of features, they're relatively easy
to add -- I just didn't bother [yet] before I knew if anyone cared.
This commit was SVN r20749.
Adapt orte_process_info to orte_proc_info, and
change orte_proc_info() to orte_proc_info_init().
- Compiled on linux-x86-64
- Discussed with Ralph
This commit was SVN r20739.
get bitten by header depending on having already included
the corresponding [opal|orte|ompi]_config.h header.
When separating, things like [OPAL|ORTE|OMPI]_DECLSPEC
are missed.
Script to add the corresponding header in front of all following
(taking care of possible #ifdef HAVE_...)
- Including some minor cleanups to
- ompi/group/group.h -- include _after_ #ifndef OMPI_GROUP_H
- ompi/mca/btl/btl.h -- nclude _after_ #ifndef MCA_BTL_H
- ompi/mca/crcp/bkmrk/crcp_bkmrk_btl.c -- still no need for
orte/util/output.h
- ompi/mca/pml/dr/pml_dr_recvreq.c -- no need for mpool.h
- ompi/mca/btl/btl.h -- reorder to fit
- ompi/mca/bml/bml.h -- reorder to fit
- ompi/runtime/ompi_mpi_finalize.c -- reorder to fit
- ompi/request/request.h -- additionally need ompi/constants.h
- Tested on linux/x86-64
This commit was SVN r20720.
Correct an error wrt how jobids were being computed. Needed to ensure that the job family field was not overrun as we increment jobids for comm_spawn.
Update the slurm plm module so it uses the new slurm termination procedure (brings trunk back into alignment with 1.3 branch).
Update the slurmd ess component so it doesn't get selected if we are running a singleton inside of a slurm allocation.
Cleanup HNP init by moving some code that had been in orte_globals.c for historical reasons into the ess hnp module, and removing the call to that code from the ess_base_std_prolog
NOTE: this change allows orte to support an infinite aggregate number of comm_spawn's, with up to 64k being alive at any one instant. HOWEVER, the MPI layer currently does -not- support re-use of jobids. I did some prototype coding to revise the ompi_proc_t structures, but the BTLs are caching their own data, and there was no readily apparent way to update it. Thus, attempts to spawn more than the 64k limit will abort to avoid causing the MPI layer to hang.
This commit was SVN r20700.
a notifier module.
The Notifier framework was extended slightly to
convey more information about each event notice.
This works with the FTB v0.5 API.
To compile with FTB support, use --with-ftb=/path/to/ftb/install
CIFTS == Coordinated Infrastructure for Fault Tolerant Systems
FTB == Fault Tolerance Backplane
see http://wiki.mcs.anl.gov/cifts/index.php
This commit was SVN r20655.
Headers should be included in the .c directly.
This commit was SVN r20645.
The following SVN revision numbers were found above:
r20643 --> open-mpi/ompi@e46c512ee7
anyhow -- if oob functionality is neededm then orte/mca/oob/oob.h
Nevertheless compiles fine with -Wimplicit-function-declaration
This commit was SVN r20641.
Only proc_info.h-internal include file is opal/dss/dss_types.h
- In one case (orte/util/hnp_contact.c) had to add proc_info.h again.
- Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration
works fine, no errors.
Again, let's have MTT the last word.
This commit was SVN r20631.
Note: this requires that nodes not be shared by jobs/users. SLURM developers are working on an enhancement to remove this constraint.
Note 2: yes, the direct routed module returned! However, it is vastly different than the old one and has zero support for such things as comm_spawn. It is solely to support non-daemon, direct-launch environments.
This commit was SVN r20601.
Note that this assumes non-shared nodes...but only takes affect if there is no prior knowledge of how to talk to the specified peer. Thus, all daemon-based environments are unaffected.
This commit was SVN r20598.
Also create mca parameter to force daemonization (previous
behavior) which might be needed on larger clusters or
to make use of the -notify flag with qsub.
This fixes trac:1783.
This commit was SVN r20582.
The following Trac tickets were found above:
Ticket 1783 --> https://svn.open-mpi.org/trac/ompi/ticket/1783
Often, orte/util/show_help.h is included, although no functionality
is required -- instead, most often opal_output.h, or
orte/mca/rml/rml_types.h
Please see orte_show_help_replacement.sh commited next.
- Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration
actually showed two *missing* #include "orte/util/show_help.h"
in orte/mca/odls/base/odls_base_default_fns.c and
in orte/tools/orte-top/orte-top.c
Manually added these.
Let's have MTT the last word.
This commit was SVN r20557.
The prior ompi_proc_t structure had a uint8_t flag field in it, where only one
bit was used to flag that a proc was "local". In that context, "local" was
constrained to mean "local to this node".
This commit provides a greater degree of granularity on the term "local", to include tests
to see if the proc is on the same socket, PC board, node, switch, CU (computing
unit), and cluster.
Add #define's to designate which bits stand for which local condition. This
was added to the OPAL layer to avoid conflicting with the proposed movement of
the BTLs. To make it easier to use, a set of macros have been defined - e.g.,
OPAL_PROC_ON_LOCAL_SOCKET - that test the specific bit. These can be used in
the code base to clearly indicate which sense of locality is being considered.
All locations in the code base that looked at the current proc_t field have
been changed to use the new macros.
Also modify the orte_ess modules so that each returns a uint8_t (to match the
ompi_proc_t field) that contains a complete description of the locality of this
proc. Obviously, not all environments will be capable of providing such detailed
info. Thus, getting a "false" from a test for "on_local_socket" may simply
indicate a lack of knowledge.
This commit was SVN r20496.
Correct the orte-show-help file when a rank is out of bounds, and do that test where a wildcard doesn't get incorrectly flagged as out-of-bounds.
This commit was SVN r20398.
1. fix a race condition whereby a proc's output could trigger an event prior to the other outputs being setup, thus c ausing the IOF to declare the proc "terminated" too early. This was really rare, but could happen.
2. add a new "timestamp-output" option that timestamp's each line of output
3. add a new "output-filename" option that redirects each proc's output to a separate rank-named file.
4. add a new "xterm" option that redirects the output of the specified ranks to a separate xterm window.
This commit was SVN r20392.
SIGCONT to the a.outs. By default, they are not forwarded and
the behavior remains as it has always been. However, if one
runs with --mca orte_forward_job_control 1, then mpirun will
catch those two signals and forward them to the orteds which
will deliver them to the a.outs. We have had requests for
this feature.
This commit was SVN r20391.
y combination of comma-separated values and ranges. Daemons will use the first port in the range, MPI procs will use the other ports in the range assuming that they know their node rank in time and enough ports were specified.
NOTE: this capability only works under specific conditions. I will outline more about this in a note to devel as the remainder of the implementation progresses. For now, the only environment where this works is slurm. The linear routed module has also been adjusted to work with static ports so that all messaging flows strictly through the topology, including the initial daemon callback - thus limiting the number of sockets opened by mpirun.
This commit was SVN r20390.
If the --wdir option is given, check to see if the user provided a relative path. If so, convert it to an absolute path. This is needed to maintain consistent behavior across environements. Some environments automatically chdir to your current working directory when launching the remote orted, while others (e.g., ssh) don't. This levels the playing field and reduces user surprise.
This commit was SVN r20342.
Reverted r20306 since the fix caused 100% failues on our !BigRed system.
See the comments on ticket #1763 for the details.
This commit was SVN r20339.
The following SVN revision numbers were found above:
r20306 --> open-mpi/ompi@8c87e48721
The following Trac tickets were found above:
Ticket 1763 --> https://svn.open-mpi.org/trac/ompi/ticket/1763
* Improved the error propagation from a backend orted
* Fixed a hang in orterun due to failed files transferred
* Fix the movement of files with relative path names
* Improved error messages when a file cannot be moved
* Move file checks to FileM instead of embedding then in the ODLS
This commit Refs trac:1770
This commit was SVN r20331.
The following Trac tickets were found above:
Ticket 1770 --> https://svn.open-mpi.org/trac/ompi/ticket/1770
The problem was eventually traced to two bugs in the code:
1. the orted wasn't resetting the write event flag, thus preventing itself from turning it on again.
2. the HNP needed to check if the stdin was attached to tty or not before adding the delay for fairness. If it is attached to a tty, there is no need for the delay. This prevents some strangely slow typing response.
This patch needs to move to 1.3
This commit was SVN r20246.
Error output strings were changed to be unique per code site.
They are still pretty meaningless to the user, but at least now
developers might be able to find which unique place in the code
reported which error.
This commit was SVN r20238.
Consolidate the nid/pid lookup code with the rest of the nid/pid code so that changes are easier to track. Add the ability to send cluster profile info as part of the nidmap. Cleanup the setup and teardown of the new global nidmap and pidmap objects.
This commit was SVN r20219.
Also, per chat with Jeff, modified the Makefile.am's of a few orte tools so that they were consistent in the way we generate the ompi-equivalent cmds.
This commit was SVN r20165.