Ralph Castain
a8a91b374e
Update component-level selection comments to match latest revisions
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30087.
2013-12-25 19:12:43 +00:00
Ralph Castain
d049731911
Add pubsub pmi component to list of components to avoid when indirect launch used
...
Refs trac:4032
This commit was SVN r30083.
The following Trac tickets were found above:
Ticket 4032 --> https://svn.open-mpi.org/trac/ompi/ticket/4032
2013-12-25 16:25:37 +00:00
Ralph Castain
85f2429819
Ensure the ipv6 lists get initialized and finalized
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30081.
2013-12-24 17:24:39 +00:00
Ralph Castain
2e08219cac
Silence the valgrind report from the OOB
...
Refs trac:4033
This commit was SVN r30080.
The following Trac tickets were found above:
Ticket 4033 --> https://svn.open-mpi.org/trac/ompi/ticket/4033
2013-12-24 17:06:45 +00:00
Ralph Castain
81df8d09ca
Avoid use of PMI components when launched via mpirun as this is just unnecessary overhead that can cause confusion.
...
cmr=v1.7.4:reviewer=miked:subject=Avoid use of PMI components when launched via mpirun
This commit was SVN r30078.
2013-12-24 16:32:31 +00:00
Ralph Castain
01ee5f380b
Remove debug - problem has been identified
...
Refs trac:4026
This commit was SVN r30075.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 15:22:18 +00:00
Jeff Squyres
ce02002a5e
Free minor memory leak / squash valgrind still-reachable warning.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30071.
2013-12-24 11:04:38 +00:00
Ralph Castain
38f46641ce
Ensure the recv handler has been initialized
...
Refs trac:4026
This commit was SVN r30068.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 06:09:45 +00:00
Ralph Castain
bb80625a8a
Add missing var initialization
...
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30063.
2013-12-24 00:02:22 +00:00
Ralph Castain
65228d3571
Don't use "size_t" for the nbytes field in the header - use uint32_t to ensure that ntohl/htonl correctly match it
...
Refs trac:4026
This commit was SVN r30062.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-23 21:39:49 +00:00
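Note on 65228d3571 above: htonl/ntohl operate on 32-bit values, while size_t is 64 bits wide on most 64-bit platforms, so a size_t field is both truncated by the conversion and laid out differently between hosts. The sketch below shows the fixed-width pattern the commit describes. The struct and function names are hypothetical, chosen for illustration; this is not the actual OOB header.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <stdint.h>
#include <string.h>

/* Hypothetical message header, for illustration only. */
typedef struct {
    uint32_t tag;
    uint32_t nbytes;  /* fixed 32-bit width, matching what htonl/ntohl expect */
} demo_hdr_t;

/* Convert each field to network byte order and copy it into a wire buffer. */
static void demo_hdr_pack(const demo_hdr_t *hdr, unsigned char *wire)
{
    demo_hdr_t net;

    net.tag = htonl(hdr->tag);
    net.nbytes = htonl(hdr->nbytes);
    memcpy(wire, &net, sizeof(net));
}

/* Copy the header out of the wire buffer and convert back to host order. */
static void demo_hdr_unpack(const unsigned char *wire, demo_hdr_t *hdr)
{
    demo_hdr_t net;

    memcpy(&net, wire, sizeof(net));
    hdr->tag = ntohl(net.tag);
    hdr->nbytes = ntohl(net.nbytes);
}
```

Because every field is a uint32_t, the header has the same size on 32-bit and 64-bit peers, and the pack/unpack pair round-trips the value unchanged.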
Ralph Castain
7d8c0459a4
Attempt to debug hang that is hitting some environments. Posting to 1.7.4 as a placeholder for the eventual solution
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30060.
2013-12-23 19:57:05 +00:00
Nathan Hjelm
3be4536d9b
Cleanup various leaks in ompi_info reported by valgrind.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30058.
2013-12-23 17:47:43 +00:00
George Bosilca
24879f9def
Code cleanup while chasing valgrind complaints.
...
This commit was SVN r30048.
2013-12-21 23:28:14 +00:00
George Bosilca
38cbaeaa82
Try to impose a little bit of consistency on how we parse lists of
modules by enforcing the use of OPAL list accessors.
This commit was SVN r30045.
2013-12-21 23:23:33 +00:00
Ralph Castain
264150872b
Add a bunch of debug output to the OOB connection completion code so we can track down a handshake problem. Available in optimized builds as well as debug ones by setting -mca oob_base_verbose 10
...
No review will be required as this is just debug code for those helping us debug the 1.7.4 release candidates
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30043.
2013-12-21 16:09:26 +00:00
Ralph Castain
9c768df8b8
Resolve an unexpected behavior in hostfile allocations. Now that we filter allocations to determine what will be used for mapping, let the initial global pool be the union of nodes from all sources (default hostfile, hostfiles, and dash-hosts). Each app will filter down to only those specified for it using its own hostfile and dash-host options.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Resolve an unexpected behavior in hostfile allocations
This commit was SVN r30040.
2013-12-21 01:38:27 +00:00
Adrian Reber
53a70fe87f
Trying to get the C/R code to compile again. (send_*_nb)
...
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* just replace the blocking calls with the non-blocking calls
* all #ifdef's introduced in V1 are gone
* send_* returns error code or ORTE_SUCCESS (not the number of bytes)
This commit was SVN r30036.
2013-12-20 21:58:28 +00:00
Adrian Reber
a3813d37c7
Trying to get the C/R code to compile again. (recv_*_nb)
...
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* only #ifdef out the code where the behaviour is changed
(used to be blocking; now non-blocking)
This commit was SVN r30035.
2013-12-20 21:05:40 +00:00
Ralph Castain
31248c0985
Correctly add support for the "env" MPI_Info key during comm_spawn, update the "map-by", "rank-by", and "bind-to" Info key behaviors to match the new mapping/ranking/binding system, and update all docs and comments to match.
...
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.
Refs trac:4003
This commit was SVN r30033.
The following Trac tickets were found above:
Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003
2013-12-20 20:42:39 +00:00
Ralph Castain
71b52fe861
Ensure that comm_spawn'd procs get user-specified forwarded envars
...
Thanks to Tim Miller for reporting the regression from the 1.6 series
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that comm_spawn'd procs get user-specified forwarded envars
This commit was SVN r30012.
2013-12-20 14:47:35 +00:00
Ralph Castain
d47d2569f3
We stripped the process info packing routine to minimize message size when sending the launch message, but tools still require all the info. So modify the tool-hnp handshake to explicitly add the missing info
...
Refs trac:3992
This commit was SVN r29989.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 20:42:20 +00:00
Ralph Castain
55cd65b149
Don't warn about binding (process and/or memory) if the node cannot do it or if we would overload, but it wasn't specifically requested by the user (i.e., it is the result of the default policy). Instead, just don't bind and quietly move along.
...
Reset topology usage for each node as we bind as multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff.
cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings
This commit was SVN r29978.
2013-12-19 16:31:45 +00:00
Ralph Castain
9b32dacb6c
Ensure we don't abort if a tool cannot send a message - the orte/util/comm library used by tools to query mpirun knows how to handle this situation.
...
Refs trac:3992
This commit was SVN r29975.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 07:10:36 +00:00
Ralph Castain
6239e64f36
Further cleanup of orte-ps so it doesn't abort when hitting a stale HNP - only report that event once and just keep working.
...
Refs trac:3992
This commit was SVN r29974.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 03:28:05 +00:00
Ralph Castain
bf5e314f76
Tools require their own errmgr and state components so they can handle any errors that occur in, for example, communication.
...
Refs trac:3992
This commit was SVN r29972.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 01:49:33 +00:00
Ralph Castain
3aaca16faa
Silence warnings that are no longer valid
...
Refs trac:3992
This commit was SVN r29970.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 00:40:36 +00:00
Ralph Castain
c5956e7b8c
Convert debug output to opal_output_verbose
...
Thanks to Tetsuya Mishima for reporting it
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29969.
2013-12-19 00:36:15 +00:00
Ralph Castain
39957df08e
Fixes trac:3963. Fix the tool ess procedure so it opens and selects the OOB framework, and have the OOB TCP module update the route to new connections (the routed modules know what to do).
...
Thanks to Dave Love and Ashley Pittman for pointing out the problem.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix tool communications with mpirun
This commit was SVN r29959.
The following Trac tickets were found above:
Ticket 3963 --> https://svn.open-mpi.org/trac/ompi/ticket/3963
2013-12-18 23:13:46 +00:00
Ralph Castain
77553f72be
Per this email thread:
http://www.open-mpi.org/community/lists/devel/2013/12/13412.php
fix the backtrace function to avoid async issues. Thanks to Takahiro Kawashima for the patch
This commit was SVN r29955.
2013-12-18 17:57:37 +00:00
Ralph Castain
ab4636c47b
Per email on devel list, change the default rank-by to slot unless map-by <obj> is specified, in which case use rank-by <obj>
...
Refs trac:3977
This commit was SVN r29945.
The following Trac tickets were found above:
Ticket 3977 --> https://svn.open-mpi.org/trac/ompi/ticket/3977
2013-12-18 00:48:50 +00:00
Ralph Castain
53cd00fe16
By setting a default mapping/ranking/binding policy that wasn't "none", we introduced a problem for users of the Mac and any other machine where sockets aren't defined and/or binding is not supported. Fix that by checking to see if the user specified the failing policy - if not, then fall back to the old map/rank by slot and no binding.
...
Refs trac:3977
This commit was SVN r29933.
The following Trac tickets were found above:
Ticket 3977 --> https://svn.open-mpi.org/trac/ompi/ticket/3977
2013-12-17 14:50:10 +00:00
Adrian Reber
b42aad44a3
Trying to get the C/R code to compile again. This patch
includes various fixes all over the C/R code which are
hard to group like the other patches.
Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values
Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)
This commit was SVN r29922.
2013-12-16 15:35:28 +00:00
Ralph Castain
8b6d117541
Per the OMPI devel conference that changed our default behaviors:
...
* default to bind-to core
* map-by slot if np=2
* map-by socket (balance across sockets on each node) if np > 2
* map-by <obj> will imply rank-by <obj> by default (leave default binding as above)
Fix a bug in the map-by <obj> mapper where we incorrectly compute the #procs to assign if the #slots > #procs
cmr=v1.7.4:reviewer=jsquyres:subject=Update default binding and mapping values
This commit was SVN r29919.
2013-12-15 17:25:54 +00:00
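Note on 8b6d117541 above: the default selection rules listed in that commit can be sketched as a small decision function. This is a minimal illustration, not the ORTE mapper code. All type and function names are hypothetical. The commit only states the np=2 and np > 2 cases, so treating np=1 like np=2 is an assumption, and the rank-by slot default for the unspecified case follows the later commit ab4636c47b.

```c
/* Hypothetical object kinds for illustration. */
typedef enum {
    DEMO_OBJ_NONE,    /* user gave no map-by option */
    DEMO_OBJ_SLOT,
    DEMO_OBJ_SOCKET,
    DEMO_OBJ_CORE
} demo_obj_t;

typedef struct {
    demo_obj_t map_by;
    demo_obj_t rank_by;
    demo_obj_t bind_to;
} demo_policy_t;

/* Pick default policies from the process count and any user map-by choice. */
static demo_policy_t demo_default_policy(int np, demo_obj_t user_map_by)
{
    demo_policy_t p;

    p.bind_to = DEMO_OBJ_CORE;              /* default to bind-to core */

    if (user_map_by != DEMO_OBJ_NONE) {
        p.map_by = user_map_by;
        p.rank_by = user_map_by;            /* map-by <obj> implies rank-by <obj> */
        return p;
    }

    /* Assumption: np < 2 is handled like np = 2. */
    p.map_by = (np > 2) ? DEMO_OBJ_SOCKET : DEMO_OBJ_SLOT;
    p.rank_by = DEMO_OBJ_SLOT;              /* per ab4636c47b */
    return p;
}
```

The fallback for machines without sockets or binding support (commit 53cd00fe16) is not modeled here.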
Jeff Squyres
770bf77149
Fix some minor memory leaks in error code paths.
...
Many thanks to Tom Fogal for the patch.
cmr=v1.7.4:reviewer=rhc:subject=Fix minor memory leaks in error code paths
This commit was SVN r29905.
2013-12-14 00:41:21 +00:00
Jeff Squyres
0ab48ad0d2
Fix some annoying flex warnings that have been there for years.
...
Many thanks to Tom Fogal for the initial patch.
cmr=v1.7.4:reviewer=rhc:subject=Fix annoying flex warnings
This commit was SVN r29904.
2013-12-14 00:36:12 +00:00
Jeff Squyres
2e7653e4c2
Add missing argv.h includes.
...
Noticed these as part of #3694: external libevents don't cause argv.h
to automatically get included.
Refs trac:3694
This commit was SVN r29897.
The following Trac tickets were found above:
Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
2013-12-13 21:17:36 +00:00
Brian Barrett
121ca26c59
Per discussion at Developer's Meeting, remove Solaris threads support. Solaris
will just fall back to pthreads, which should be no problem.
This commit was SVN r29893.
2013-12-13 20:07:11 +00:00
Ralph Castain
0e81959aae
Cleanup mindist error messages - already patched in 1.7
...
This commit was SVN r29869.
2013-12-12 15:30:29 +00:00
Ralph Castain
1ff12362da
Cleanup merge conflict that was incorrectly committed
...
This commit was SVN r29851.
2013-12-09 20:20:14 +00:00
Ralph Castain
83e59e6761
Once again, the Slurm folks have decided to redefine their envars, reversing what they had previously told us to do. So cleanup the Slurm allocation code, and also adjust to a change in srun behavior that now aborts a job if the ntasks-per-node doesn't get specified when ORTE calls it, but the user specified it when getting an allocation. Sigh.
...
cmr=v1.7.4:reviewer=miked:subject=Update Slurm allocation and launch
This commit was SVN r29849.
2013-12-09 17:58:46 +00:00
Mike Dubman
c208b858e7
improve error messages in mindist
...
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29846.
2013-12-09 06:34:38 +00:00
Ralph Castain
f2c49c6c19
Fix the map-by object mapper to handle cpus-per-proc by accounting for the request when computing the number of procs to put on each object. This ensures that the binding routine doesn't automatically overload the cores.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29843.
2013-12-08 16:59:25 +00:00
Ralph Castain
9604f36c3b
Specify units for the job completion timeout
...
This commit was SVN r29839.
2013-12-08 04:51:58 +00:00
Ralph Castain
62c9e5c64c
Really is better if we output a message indicating that the job was aborted due to hitting the execution time limit
...
Refs trac:3960
This commit was SVN r29833.
The following Trac tickets were found above:
Ticket 3960 --> https://svn.open-mpi.org/trac/ompi/ticket/3960
2013-12-07 15:33:56 +00:00
Ralph Castain
d44e4a311f
Per request from Dave Goodell, add support for MPIEXEC_TIMEOUT - if set in the environment, terminate the job after the specified number of seconds has passed. Equivalent to MPICH functionality.
...
cmr=v1.7.4:reviewer=dgoodell:subject=add support for MPIEXEC_TIMEOUT
This commit was SVN r29831.
2013-12-07 01:58:32 +00:00
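Note on d44e4a311f above: reading a timeout in seconds from MPIEXEC_TIMEOUT could look roughly like the sketch below. It is not the code from the commit. The function names are hypothetical, and rejecting zero, negative, or non-numeric values is an assumption about how bad input would be treated.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a timeout string; returns seconds, or -1 if unset or invalid.
 * Assumption: only positive whole numbers of seconds are accepted. */
static long demo_parse_timeout(const char *value)
{
    char *end = NULL;
    long secs;

    if (value == NULL || *value == '\0') {
        return -1;
    }
    errno = 0;
    secs = strtol(value, &end, 10);
    if (errno != 0 || *end != '\0' || secs <= 0) {
        return -1;
    }
    return secs;
}

/* Look up MPIEXEC_TIMEOUT in the environment and parse it. */
static long demo_timeout_from_env(void)
{
    return demo_parse_timeout(getenv("MPIEXEC_TIMEOUT"));
}
```

A launcher would then arm a timer for the returned number of seconds and terminate the job when it fires; a return of -1 means no timeout applies.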
Jeff Squyres
ed9aba3896
This patch fixes
error: void value not ignored as it ought to be
in the C/R code by ignoring the return value of functions which
no longer return a value (only void).
Signed-off-by: Adrian Reber <adrian.reber@hs-esslingen.de>
This commit was SVN r29816.
2013-12-06 14:40:10 +00:00
Ralph Castain
fb59b6b875
Silence compiler warning when --disable-orte-static-ports
...
This commit was SVN r29783.
2013-12-03 01:53:31 +00:00
Ralph Castain
617a0edbb8
Fix hostfile parsing for the case where RMs count slots by listing the node multiple times. Thanks to Tetsuya Mishima for reporting the problem and providing a patch.
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r29748.
2013-11-24 16:17:52 +00:00
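Note on 617a0edbb8 above: when a resource manager lists a node once per slot, the slot count for a node is the number of times its name appears. The sketch below illustrates that counting under simplifying assumptions (one bare hostname per line, no slots= keywords). The types and function are hypothetical, not the ORTE hostfile parser.

```c
#include <string.h>

/* Hypothetical node record for illustration. */
typedef struct {
    char name[64];
    int slots;
} demo_node_t;

/* Count slots by treating each repeated hostname as one more slot.
 * Returns the number of distinct nodes, or -1 if the table is full
 * or a name does not fit. */
static int demo_count_slots(const char *const *lines, int nlines,
                            demo_node_t *nodes, int max_nodes)
{
    int nnodes = 0;
    int i, j;

    for (i = 0; i < nlines; i++) {
        /* Look for a node we have already seen. */
        for (j = 0; j < nnodes; j++) {
            if (strcmp(nodes[j].name, lines[i]) == 0) {
                break;
            }
        }
        if (j < nnodes) {
            nodes[j].slots++;           /* repeated listing: one more slot */
            continue;
        }
        if (nnodes == max_nodes || strlen(lines[i]) >= sizeof(nodes[0].name)) {
            return -1;
        }
        strcpy(nodes[nnodes].name, lines[i]);
        nodes[nnodes].slots = 1;        /* first listing: one slot */
        nnodes++;
    }
    return nnodes;
}
```

For the input node0, node0, node1, node0 this yields two nodes, with three slots on node0 and one on node1.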
Ralph Castain
7c23a5ad65
Fix headers when building with ft enabled. Thanks to Adrian Reber for the patch!
...
This commit was SVN r29743.
2013-11-23 22:58:32 +00:00
Ralph Castain
7480beb7f0
Per request from Nathan, add an offset value to the job struct so we can construct a "global rank" that spans multiple jobs during dynamic launch operations. Store a new ORTE_DB_GLOBAL_RANK value for each process in the database, and ensure that we share our own value during connect_accept so both sides can see it.
...
This isn't being used yet - just enabling Nathan to do what he needs.
***** NOTE: any use of the OMPI_DB_GLOBAL_RANK database key must be protected by #ifdef OMPI_DB_GLOBAL_RANK as not all RTE's will define this key. *****
This commit was SVN r29708.
2013-11-14 17:01:43 +00:00
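Note on 7480beb7f0 above: the #ifdef protection requested in that commit might look like the sketch below. Only the OMPI_DB_GLOBAL_RANK macro name comes from the commit message. The helper functions are hypothetical stand-ins for a real database lookup, and computing the global rank as job offset plus local rank is an assumption based on the description of the offset field.

```c
#ifdef OMPI_DB_GLOBAL_RANK
/* Hypothetical stand-in for a database fetch keyed by OMPI_DB_GLOBAL_RANK.
 * Assumption: global rank = job offset + rank within the job. */
static int demo_fetch_global_rank(int local_rank, int job_offset)
{
    return job_offset + local_rank;
}
#endif

/* Return the global rank when the RTE defines the key,
 * otherwise fall back to the rank within the job. */
static int demo_rank_for_display(int local_rank, int job_offset)
{
#ifdef OMPI_DB_GLOBAL_RANK
    return demo_fetch_global_rank(local_rank, job_offset);
#else
    (void)job_offset;  /* unused when the key is not available */
    return local_rank;
#endif
}
```

With the guard in place, the same source builds against RTEs that do not define the key and simply reports the local rank there.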