Libltdl erroneously returns an error string of "file not found" for
many reasons, even if the file really *is* there but failed to
dlopen(). So if lt_dlerror() returns "file not found", apply some
simple heuristics and, if we *do* find a file, print a slightly
better error message.
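
A minimal sketch of that kind of heuristic, for illustration only. The
helper name refine_dl_error() and the replacement message are
hypothetical and are not taken from the actual change; the only check
shown is "does the path exist on disk".

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not the code from this commit.
 * `path` is the file we tried to open, `err` is the string that a
 * call such as lt_dlerror() handed back. */
const char *refine_dl_error(const char *path, const char *err)
{
    /* Only second-guess the one message known to be misleading. */
    if (err != NULL && strcmp(err, "file not found") == 0 &&
        path != NULL && access(path, F_OK) == 0) {
        /* Illustrative wording; the file is there, so say so. */
        return "file found, but it could not be loaded";
    }
    return err;
}
```

Any other error string is passed through untouched, so callers can
wrap every lt_dlerror() result without special-casing.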
This commit was SVN r21214.
1. Replacing mpi_paffinity_alone with opal_paffinity_alone - for back-compatibility, I have aliased mpi_paffinity_alone to the new param name. This causes a mild abstraction break in the opal/mca/paffinity framework - per the devel discussion...live with it. :-) I also moved the ompi_xxx global variable that tracked maffinity setup so it could be properly closed in MPI_Finalize to the opal/mca/maffinity framework to avoid an abstraction break.
2. Added code to the odls/default module to perform paffinity binding and maffinity init between process fork and exec. This has been tested on IU's odin cluster and works for both MPI and non-MPI apps.
3. Revise MPI_Init to detect if affinity has already been set, and to attempt to set it if not already done. I have *not* tested this as I haven't yet figured out a way to do so - I couldn't get slurm to perform cpu bindings, even though it supposedly does do so.
This has only been lightly tested and would definitely benefit from a wider range of evaluation...
This commit was SVN r21209.
This causes the orteds in the routing tree to remain alive until all termination "acks" from orteds below them have passed through. Thus, if we use static ports, we no longer require a direct orted-to-mpirun connection.
Also modify the binomial routed module so it conforms to what all the other routed modules do and have all messages pass along the routing tree instead of short-circuiting between orteds. This further reduces the number of ports being opened on backend nodes.
This commit was SVN r21203.
Unfortunately, we assign the jobid during the plm.spawn procedure - which means it happens -after- control of the job has passed out of the range of mpirun (or whatever program is spawning the job), so it is too late for that main program to register a callback function. If the main program registers the callback -after- we return from plm.spawn, then it (a) cannot get a callback for failed-to-start, and (b) will miss the callback if a proc aborts in the time between job launch and the call to errmgr.register_callback.
This commit fixes the problem by adding callback-related fields to the orte_job_t object. Thus, the main program can specify what job states should initiate a callback, what function is to be called, and what data is to be passed back by simply filling in the orte_job_t fields prior to calling plm.spawn.
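
A rough sketch of the pattern, assuming a simplified stand-in for the job object. Every name below (job_t, cb_states, cbfunc, cbdata, spawn_job, the state flags) is hypothetical and chosen for illustration; none of them are the actual orte_job_t fields or the plm.spawn signature.

```c
#include <stddef.h>

/* Hypothetical state flags; the caller ORs together the ones it cares about. */
typedef unsigned int job_state_t;
#define JOB_STATE_FAILED_TO_START 0x1u
#define JOB_STATE_ABORTED         0x2u

struct job;
typedef void (*job_cbfunc_t)(struct job *job, job_state_t state, void *cbdata);

/* Simplified stand-in for the job object: the callback request travels
 * with the job, so it is already in place before a jobid exists. */
typedef struct job {
    int          jobid;      /* assigned inside spawn_job() */
    job_state_t  cb_states;  /* which states should trigger the callback */
    job_cbfunc_t cbfunc;     /* function to call */
    void        *cbdata;     /* opaque data handed back to cbfunc */
} job_t;

/* Fire the callback only for states the caller asked about. */
static void job_report_state(job_t *job, job_state_t state)
{
    if (job->cbfunc != NULL && (job->cb_states & state) != 0) {
        job->cbfunc(job, state, job->cbdata);
    }
}

/* Stand-in for the spawn call: assigns the jobid, then simulates a
 * launch failure so the failed-to-start callback path is exercised. */
int spawn_job(job_t *job)
{
    job->jobid = 1;
    job_report_state(job, JOB_STATE_FAILED_TO_START);
    return -1;
}

/* Example callback: counts how many times it was invoked. */
void count_cb(job_t *job, job_state_t state, void *cbdata)
{
    (void)job;
    (void)state;
    ++*(int *)cbdata;
}
```

Because the fields are filled in before the spawn call, a failure during launch can reach the caller even though no jobid existed when the request was made.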
Also, fully implement the "copy" function for the orte_job_t object.
NOTE: as a result of this change, the errmgr.register_callback API may no longer be of any value.
This commit was SVN r21200.
for us already.
* Slightly clarify the error message strings; now they match the new
error strings for btl_openib_ipaddr_in|exclude
This commit was SVN r21197.
* Pass the sequence number of the checkpoint along with reference from the global to the local coordinator.
* 'orte-restart --apponly' now just generates the app context file, and does not run with it. This provides the user the ability to edit the file before launching.
* Add an OPAL_CRS_NONE state
* Split the INC into three distinct parts.
* Implement a restart mechanism for the 'none' component. If given a context it simply execvp()'s it.
This commit was SVN r21195.
* Add 'orte-checkpoint -l' option that lists all checkpoints currently available on the system.
* Add 'orte-restart -i' which prints information regarding the checkpoint targeted for restart.
* Add ability to extract the timing metadata.
* Fix show_help() in the orte-checkpoint and orte-restart tools. They should be using the opal versions instead of the orte versions (otherwise nothing is printed).
This commit was SVN r21194.
case the first process of the group was not represented at all in the second
group. Also added some cleanup of the code w.r.t. booleans vs. ints.
Thanks to Geoffrey Irving for reporting the bug and providing the initial
solution.
This commit was SVN r21192.
subnet specifications (in addition to interface names). These
parameters now take a comma-delimited list of interface names and/or
a.b.c.d/x specifications (only IPv4 currently supported for subnet
specifications). For example:
mpirun --mca btl_tcp_if_include 10.10.30.0/8,eth0
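
For illustration, a small sketch of how an a.b.c.d/x entry could be
matched against an interface address. The function name
ipv4_in_subnet() is hypothetical and this is not the parsing code
from the commit; it only shows the usual prefix-mask comparison.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper. Returns 1 if dotted-quad `addr` lies inside
 * `cidr` ("a.b.c.d/x"), 0 if it does not, and -1 if either argument
 * is malformed (for example an interface name such as "eth0").
 * IPv4 only. */
int ipv4_in_subnet(const char *addr, const char *cidr)
{
    char net[INET_ADDRSTRLEN];
    const char *slash = strchr(cidr, '/');
    struct in_addr a, n;
    char *end;
    long prefix;
    uint32_t mask;

    if (slash == NULL || (size_t)(slash - cidr) >= sizeof(net)) {
        return -1;
    }
    memcpy(net, cidr, (size_t)(slash - cidr));
    net[slash - cidr] = '\0';

    prefix = strtol(slash + 1, &end, 10);
    if (end == slash + 1 || *end != '\0' || prefix < 0 || prefix > 32) {
        return -1;
    }
    if (inet_pton(AF_INET, addr, &a) != 1 ||
        inet_pton(AF_INET, net, &n) != 1) {
        return -1;
    }

    /* Build the mask in host order, then convert to network order so
     * it can be applied directly to s_addr. A /0 prefix matches all. */
    mask = (prefix == 0) ? 0 : htonl(0xFFFFFFFFu << (32 - prefix));
    return (a.s_addr & mask) == (n.s_addr & mask);
}
```

Only the first x bits are compared, so with the /8 entry in the
example above any 10.*.*.* address matches.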
This commit was SVN r21189.
- due to the <=, we could overrun the array
- we didn't correctly test at _all_, since we never marked the
ranks already excluded / included...
- when returning in error, we should free (elements_int_list)...
This commit was SVN r21186.
OMPI_* to OPAL_*. This allows the opal layer to be used more
independently of the rest of ompi.
NOTE: 9 "svn mv" operations immediately follow this commit.
This commit was SVN r21180.
In _correct_ programs only when (group->grp_proc_count - n) > 0,
we may fill ranks_included (callers of ompi_group_excl make sure)...
Therefore move the ranks_included loop into the true
block of the if (which is changed from "!= 0" to ">0").
Otherwise, the initialization of k=0 and ranks_included=NULL is good
for the ompi_group_incl (and submethods ompi_group_*).
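
A simplified sketch of the control flow described above, not the
actual ompi_group_excl code. The function name ranks_not_excluded()
and its signature are hypothetical; it assumes the excluded ranks are
distinct and valid, as they would be in a correct program.

```c
#include <stdlib.h>

/* Hypothetical helper. Builds the list of ranks in [0, size) that do
 * not appear in excluded[0..n). On success returns 0 and stores the
 * list and its length; the list stays NULL (length 0) when nothing
 * remains. Returns -1 if the allocation fails. */
int ranks_not_excluded(int size, const int *excluded, int n,
                       int **included_out, int *count_out)
{
    int *included = NULL;   /* stays NULL when nothing is left */
    int k = 0;              /* stays 0 when nothing is left */

    if (size - n > 0) {
        included = malloc((size_t)(size - n) * sizeof(int));
        if (included == NULL) {
            return -1;
        }
        /* The fill loop lives inside the ">0" branch, so it can
         * never write through a NULL or zero-length buffer. */
        for (int i = 0; i < size; i++) {
            int skip = 0;
            for (int j = 0; j < n; j++) {
                if (excluded[j] == i) {
                    skip = 1;
                    break;
                }
            }
            /* The bound check guards against bad input (duplicates). */
            if (!skip && k < size - n) {
                included[k++] = i;
            }
        }
    }

    *included_out = included;
    *count_out = k;
    return 0;
}
```

When every rank is excluded, the caller simply receives k=0 and a
NULL list, which matches the defaults mentioned above.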
Tested on Linux w/ mpi_test_suite and MPIch testsuite:
4 grouptest_coll
4 groupcreate
4 grouptest
This commit was SVN r21172.
Check for error in fcntl, as we depend on close-on-exec,
F_SETFD will result in -1 in case of error (stored in errno).
To not have a follow-up warning about not freeing filename, move up.
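
As an illustration of the check, a minimal sketch that sets
close-on-exec and reports failure. The wrapper name set_cloexec() is
hypothetical; fcntl(), F_GETFD, F_SETFD and FD_CLOEXEC are the
standard POSIX names.

```c
#include <fcntl.h>

/* Hypothetical wrapper. Returns 0 on success, -1 on failure with
 * errno left as set by fcntl(). */
int set_cloexec(int fd)
{
    /* Read the current descriptor flags so they are preserved. */
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1) {
        return -1;
    }
    /* F_SETFD returns -1 on error; do not ignore it. */
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
        return -1;
    }
    return 0;
}
```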
This commit was SVN r21171.
This patch contains the following items:
* Fix the flag passed to open() for the read side of the named pipe between the local and app coordinator. There is a race condition when using O_RDWR on a named pipe (not sure how that bug got in there in the first place).
* Adjust control in the C/R thread timing
* Clarify return code in BLCR component
* Allow the user to adjust the max wait time for the named pipes in the FileM local coordinator by using the MCA parameter "snapc_full_max_wait_time" (Default: 20 seconds)
* If the application terminates while there are active FileM operations, force mpirun to wait on these operations to complete.
* Allow the user to set the local copy command (Default: cp) via MCA parameter "filem_rsh_cp"
* Implement the ability to throttle the number of outgoing connections in FileM. At larger scales this type of explicit throttling helps prevent overwhelming the HNP machine. Default: 10, set via MCA parameter "filem_rsh_max_outgoing"
This commit was SVN r21167.
The following SVN revision numbers were found above:
r21131 --> open-mpi/ompi@0deb009225