Ralph Castain
fb9e427320
One last corner case - when encountering an overload condition (e.g., by comm_spawning more procs than we have cores) and we are using the default binding policy, do *not* bind the new procs to anything as this can cause major problems. Instead, let the spawn succeed since the user didn't specifically ask to be bound, and leave the new procs as unbound.
...
Refs trac:4077
This commit was SVN r30200.
The following Trac tickets were found above:
Ticket 4077 --> https://svn.open-mpi.org/trac/ompi/ticket/4077
2014-01-09 22:39:34 +00:00
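The rule described in r30200 above can be sketched as a small decision function. This is only an illustration under stated assumptions: the type and function names (`bind_policy_t`, `decide_binding`, and so on) are hypothetical and are not the ORTE data structures, and returning an error when the user explicitly requested binding is an assumption, since the commit message only covers the default-policy case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration; not the ORTE structures. */
typedef enum { BIND_NONE, BIND_CORE } bind_target_t;

typedef struct {
    bind_target_t target;  /* what we would bind each proc to */
    bool user_specified;   /* false when this is the built-in default policy */
} bind_policy_t;

typedef enum { DECIDE_BIND, DECIDE_LEAVE_UNBOUND, DECIDE_ERROR } bind_decision_t;

/* Decide what to do with newly spawned procs on one node. */
bind_decision_t decide_binding(bind_policy_t policy, int nprocs_on_node, int ncores)
{
    assert(nprocs_on_node >= 0 && ncores >= 0);

    if (policy.target == BIND_NONE) {
        return DECIDE_LEAVE_UNBOUND;
    }
    /* Overload: more procs on the node than cores to bind them to. */
    bool overloaded = nprocs_on_node > ncores;
    if (!overloaded) {
        return DECIDE_BIND;
    }
    /* Default policy: let the spawn succeed and leave the procs unbound.
     * Explicit request: treated as an error here (an assumption). */
    return policy.user_specified ? DECIDE_ERROR : DECIDE_LEAVE_UNBOUND;
}
```

With a default policy, 8 procs on a 4-core node come back as `DECIDE_LEAVE_UNBOUND`, so the spawn proceeds without binding.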
Ralph Castain
24e990e747
Fix comm_spawn for oversubscribed systems by correctly computing the number of available slots
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix comm_spawn for oversubscribed systems
This commit was SVN r30197.
2014-01-09 20:33:48 +00:00
Ralph Castain
9fcb46d85a
Correctly detect and handle oversubscription for comm_spawn
...
cmr=v1.7.4:reviewer=jsquyres:subject=Correctly detect and handle oversubscription for comm_spawn
This commit was SVN r30186.
2014-01-09 18:27:51 +00:00
Ralph Castain
6e5fedeb04
Oops - add verbose output to report that we cannot apply the default binding because no cores were detected
...
Refs trac:4074
This commit was SVN r30185.
The following Trac tickets were found above:
Ticket 4074 --> https://svn.open-mpi.org/trac/ompi/ticket/4074
2014-01-09 18:17:14 +00:00
Ralph Castain
4cdc291df1
Ensure slurm properly dies on abnormal termination
...
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure slurm properly dies on abnormal termination
This commit was SVN r30182.
2014-01-09 16:52:02 +00:00
Jeff Squyres
87e476ebd8
Clean up many references to "rank": usually change to "process" and/or
specifically delineate that we're referring to the process' rank in
MPI_COMM_WORLD.
Refs trac:4068
This commit was SVN r30181.
The following Trac tickets were found above:
Ticket 4068 --> https://svn.open-mpi.org/trac/ompi/ticket/4068
2014-01-09 16:37:49 +00:00
Ralph Castain
7e4748a0f1
Handle nodes that do not report cores, where our default binding policy fails even though binding is supported, by defaulting to not binding on those nodes.
...
Thanks to Paul Hargrove for reporting the problem on NetBSD.
cmr=v1.7.4:reviewer=jsquyres:subject=Handle the case of nodes that do not report cores
This commit was SVN r30180.
2014-01-09 16:27:58 +00:00
Ralph Castain
f179f2086b
Do a better job of reporting bindings - if someone gives a spec that binds us to all processors, then we are effectively unbound and should report it clearly instead of outputting a long line of B's.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Do a better job of reporting bindings
This commit was SVN r30179.
2014-01-09 16:16:16 +00:00
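The reporting rule in r30179 above amounts to a check that the binding mask covers every processor before any per-processor string is printed. The sketch below assumes a node with at most 64 processors represented as a bitmask; `describe_binding` and the `B`/`.` output format are hypothetical and do not reproduce the actual mpirun report format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write a short description of a binding into buf and return buf.
 * bound_mask has bit i set when the process is bound to processor i. */
const char *describe_binding(uint64_t bound_mask, int ncpus, char *buf, size_t len)
{
    assert(buf != NULL && ncpus > 0 && ncpus <= 64);
    assert(len > (size_t)ncpus && len >= 8);

    /* Mask with one bit set for every processor on the node. */
    uint64_t all = (ncpus == 64) ? UINT64_MAX : ((UINT64_C(1) << ncpus) - 1);

    /* Bound to every processor is effectively unbound: say so directly. */
    if ((bound_mask & all) == all) {
        snprintf(buf, len, "UNBOUND");
        return buf;
    }
    for (int i = 0; i < ncpus; i++) {
        buf[i] = ((bound_mask >> i) & 1u) ? 'B' : '.';
    }
    buf[ncpus] = '\0';
    return buf;
}
```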
Ralph Castain
2a0e4b5e62
Update the orterun help messages and man page to reflect new map/rank/bind options and defaults. Thanks to Paul Hargrove for reporting it.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30173.
2014-01-09 04:44:28 +00:00
Ralph Castain
bf453a2575
Reference the correct variable...sigh
...
Refs trac:4059
This commit was SVN r30163.
The following Trac tickets were found above:
Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 22:36:39 +00:00
Ralph Castain
80497d73cf
Need to mark the daemon as alive so that exit commands are properly routed during abnormal terminations. Also, remove stale references to the "selected oob component" as we no longer require that only one component be selected.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30162.
2014-01-08 22:35:48 +00:00
Ralph Castain
d5647394d8
Initialize variable so dash-host option gets correctly parsed
...
cmr=v1.7.4:reviewer=rolfv
This commit was SVN r30159.
2014-01-08 15:17:16 +00:00
Ralph Castain
e724d0d12d
Ensure comm_spawn'd jobs get treated the same wrt setting default mapping directives
...
Refs trac:4059
This commit was SVN r30158.
The following Trac tickets were found above:
Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 15:16:22 +00:00
Ralph Castain
fb650aed0c
Fix how we transfer mapping directives to the job, ensuring that directives that can be given outside of a mapping policy (e.g., oversubscribe and no-use-local) are retained.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix how we transfer mapping directives to the job
This commit was SVN r30155.
2014-01-08 04:25:43 +00:00
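One way to picture the r30155 fix above is a bit field in which the mapping policy and the standalone directives occupy separate bits, so that applying a default policy never clobbers directive bits the user set. The layout and names below (`MAP_POLICY_MASK`, `MAP_OVERSUBSCRIBE`, `apply_default_mapping`, and so on) are hypothetical and chosen for illustration; they are not the ORTE definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low byte = mapping policy, high byte = directives. */
#define MAP_POLICY_MASK    0x00ffu
#define MAP_BYSLOT         0x0001u
#define MAP_BYNODE         0x0002u
#define MAP_OVERSUBSCRIBE  0x0100u
#define MAP_NO_USE_LOCAL   0x0200u

/* Transfer the default mapping to a job while retaining any directives
 * (e.g., oversubscribe) that were given without a mapping policy. */
uint16_t apply_default_mapping(uint16_t job_mapping, uint16_t default_mapping)
{
    assert((default_mapping & MAP_POLICY_MASK) != 0);

    /* The job already names a policy: leave it untouched. */
    if ((job_mapping & MAP_POLICY_MASK) != 0) {
        return job_mapping;
    }
    /* No policy on the job: take the default policy bits and keep the
     * directive bits from both the job and the default. */
    return (uint16_t)((default_mapping & MAP_POLICY_MASK) |
                      (job_mapping & ~MAP_POLICY_MASK) |
                      (default_mapping & ~MAP_POLICY_MASK));
}
```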
Ralph Castain
bc75250951
Cleanup the sensor framework close - existing code was using incorrect object type. Don't start sensors if sample rate is zero. Don't add zero-byte data from resusage as it means nothing was measured.
...
cmr=v1.7.4:reviewer=hjelmn
This commit was SVN r30150.
2014-01-08 02:38:56 +00:00
Jeff Squyres
13b29cff2c
This commit complements/completes r30140. r30140 made all the
configury/Makefile.am changes; this commit renames the internal
installdirs.h framework struct field names to match the configury macro
names:
* pkgdatadir -> ompidatadir
* pkglibdir -> ompilibdir
* pkgincludedir -> ompiincludedir
This commit was SVN r30145.
The following SVN revision numbers were found above:
r30140 --> open-mpi/ompi@8b778903d8
2014-01-07 23:36:33 +00:00
Brian Barrett
8b778903d8
Fix longstanding issue with our multi-project support. Rather than using
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Mike Dubman
40aadab85f
re-enable map-by dist
...
after last refactoring in rmaps, map-by dist:hca was disabled.
reverting it back
found/fixed by Elena, reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30118.
2014-01-04 20:44:41 +00:00
Ralph Castain
9a855ff58e
Update sensor component for new OOB calls
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30117.
2014-01-03 22:37:15 +00:00
Ralph Castain
3f2b3c53ea
Ensure that rankfile-provided allocations are correctly handled
...
Fixes trac:4043
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that rankfile-provided allocations are correctly handled
This commit was SVN r30106.
The following Trac tickets were found above:
Ticket 4043 --> https://svn.open-mpi.org/trac/ompi/ticket/4043
2014-01-02 16:07:16 +00:00
Ralph Castain
d5a5caa7e0
Restore the bycore mpirun option for backward compatibility
...
Refs trac:4044
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30103.
The following Trac tickets were found above:
Ticket 4044 --> https://svn.open-mpi.org/trac/ompi/ticket/4044
2014-01-02 04:16:43 +00:00
Ralph Castain
a8a91b374e
Update component-level selection comments to match latest revisions
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30087.
2013-12-25 19:12:43 +00:00
Ralph Castain
d049731911
Add pubsub pmi component to list of components to avoid when indirect launch used
...
Refs trac:4032
This commit was SVN r30083.
The following Trac tickets were found above:
Ticket 4032 --> https://svn.open-mpi.org/trac/ompi/ticket/4032
2013-12-25 16:25:37 +00:00
Ralph Castain
85f2429819
Ensure the ipv6 lists get initialized and finalized
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30081.
2013-12-24 17:24:39 +00:00
Ralph Castain
2e08219cac
Silence the valgrind report from the OOB
...
Refs trac:4033
This commit was SVN r30080.
The following Trac tickets were found above:
Ticket 4033 --> https://svn.open-mpi.org/trac/ompi/ticket/4033
2013-12-24 17:06:45 +00:00
Ralph Castain
81df8d09ca
Avoid use of PMI components when launched via mpirun as this is just unnecessary overhead that can cause confusion.
...
cmr=v1.7.4:reviewer=miked:subject=Avoid use of PMI components when launched via mpirun
This commit was SVN r30078.
2013-12-24 16:32:31 +00:00
Ralph Castain
01ee5f380b
Remove debug - problem has been identified
...
Refs trac:4026
This commit was SVN r30075.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 15:22:18 +00:00
Jeff Squyres
ce02002a5e
Free minor memory leak / squash valgrind still-reachable warning.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30071.
2013-12-24 11:04:38 +00:00
Ralph Castain
38f46641ce
Ensure the recv handler has been initialized
...
Refs trac:4026
This commit was SVN r30068.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 06:09:45 +00:00
Ralph Castain
bb80625a8a
Add missing var initialization
...
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30063.
2013-12-24 00:02:22 +00:00
Ralph Castain
65228d3571
Don't use "size_t" for the nbytes field in the header - use uint32_t to ensure that ntohl/htonl correctly match it
...
Refs trac:4026
This commit was SVN r30062.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-23 21:39:49 +00:00
Ralph Castain
7d8c0459a4
Attempt to debug hang that is hitting some environments. Posting to 1.7.4 as a placeholder for the eventual solution
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30060.
2013-12-23 19:57:05 +00:00
Nathan Hjelm
3be4536d9b
Cleanup various leaks in ompi_info reported by valgrind.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30058.
2013-12-23 17:47:43 +00:00
George Bosilca
24879f9def
Code cleanup while chasing valgrind complaints.
...
This commit was SVN r30048.
2013-12-21 23:28:14 +00:00
George Bosilca
38cbaeaa82
Try to impose a little bit of consistency on how we parse lists of
modules by enforcing the use of OPAL list accessors.
This commit was SVN r30045.
2013-12-21 23:23:33 +00:00
Ralph Castain
264150872b
Add a bunch of debug output to the OOB connection completion code so we can track down a handshake problem. Available in optimized builds as well as debug ones by setting -mca oob_base_verbose 10
...
No review will be required as this is just debug code for those helping us debug the 1.7.4 release candidates
cmr-=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30043.
2013-12-21 16:09:26 +00:00
Ralph Castain
9c768df8b8
Resolve an unexpected behavior in hostfile allocations. Now that we filter allocations to determine what will be used for mapping, let the initial global pool be the union of nodes from all sources (default hostfile, hostfiles, and dash-hosts). Each app will filter down to only those specified for it using its own hostfile and dash-host options.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Resolve an unexpected behavior in hostfile allocations
This commit was SVN r30040.
2013-12-21 01:38:27 +00:00
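The two-stage scheme in r30040 above (build a global pool as the union of all sources, then filter per app) can be sketched with plain string arrays. The function names `pool_union` and `app_filter` are hypothetical, and treating an app that names no hosts as seeing the whole pool is an assumption not stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if name appears in list[0..n). */
static int contains(const char *const *list, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i], name) == 0) return 1;
    }
    return 0;
}

/* Add every node from one source (default hostfile, hostfile, or
 * dash-host list) to the global pool, skipping duplicates.
 * Returns the new pool size. */
size_t pool_union(const char **pool, size_t npool, size_t cap,
                  const char *const *src, size_t nsrc)
{
    for (size_t i = 0; i < nsrc; i++) {
        if (!contains(pool, npool, src[i])) {
            assert(npool < cap);
            pool[npool++] = src[i];
        }
    }
    return npool;
}

/* Filter the pool down to the nodes one app specified. An app that
 * specified nothing sees the whole pool (an assumption). out must
 * have room for npool entries. Returns the number of nodes kept. */
size_t app_filter(const char *const *pool, size_t npool,
                  const char *const *app, size_t napp, const char **out)
{
    size_t nout = 0;
    for (size_t i = 0; i < npool; i++) {
        if (napp == 0 || contains(app, napp, pool[i])) {
            out[nout++] = pool[i];
        }
    }
    return nout;
}
```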
Adrian Reber
53a70fe87f
Trying to get the C/R code to compile again. (send_*_nb)
...
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* just replace the blocking calls with the non-blocking calls
* all #ifdef's introduced in V1 are gone
* send_* returns error code or ORTE_SUCCESS (not the number of bytes)
This commit was SVN r30036.
2013-12-20 21:58:28 +00:00
Adrian Reber
a3813d37c7
Trying to get the C/R code to compile again. (recv_*_nb)
...
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* only #ifdef out the code where the behaviour is changed
(used to be blocking; now non-blocking)
This commit was SVN r30035.
2013-12-20 21:05:40 +00:00
Ralph Castain
31248c0985
Correctly add support for the "env" MPI_Info key during comm_spawn, update the "map-by", "rank-by", and "bind-to" Info key behaviors to match the new mapping/ranking/binding system, and update all docs and comments to match.
...
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.
Refs trac:4003
This commit was SVN r30033.
The following Trac tickets were found above:
Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003
2013-12-20 20:42:39 +00:00
Ralph Castain
71b52fe861
Ensure that comm_spawn'd procs get user-specified forwarded envars
...
Thanks to Tim Miller for reporting the regression from the 1.6 series
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that comm_spawn'd procs get user-specified forwarded envars
This commit was SVN r30012.
2013-12-20 14:47:35 +00:00
Ralph Castain
d47d2569f3
We stripped the process info packing routine to minimize message size when sending the launch message, but tools still require all the info. So modify the tool-hnp handshake to explicitly add the missing info
...
Refs trac:3992
This commit was SVN r29989.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 20:42:20 +00:00
Ralph Castain
55cd65b149
Don't warn about binding (process and/or memory) if the node cannot do it or if we would overload, but it wasn't specifically requested by the user (i.e., it is the result of the default policy). Instead, just don't bind and quietly move along.
...
Reset topology usage for each node as we bind, since multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff.
cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings
This commit was SVN r29978.
2013-12-19 16:31:45 +00:00
Ralph Castain
9b32dacb6c
Ensure we don't abort if a tool cannot send a message - the orte/util/comm library used by tools to query mpirun knows how to handle this situation.
...
Refs trac:3992
This commit was SVN r29975.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 07:10:36 +00:00
Ralph Castain
6239e64f36
Further cleanup of orte-ps so it doesn't abort when hitting a stale HNP - only report that event once and just keep working.
...
Refs trac:3992
This commit was SVN r29974.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 03:28:05 +00:00
Ralph Castain
bf5e314f76
Tools require their own errmgr and state components so they can handle any errors that occur in, for example, communication.
...
Refs trac:3992
This commit was SVN r29972.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 01:49:33 +00:00
Ralph Castain
3aaca16faa
Silence warnings that are no longer valid
...
Refs trac:3992
This commit was SVN r29970.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 00:40:36 +00:00
Ralph Castain
c5956e7b8c
Convert debug output to opal_output_verbose
...
Thanks to Tetsuya Mishima for reporting it
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29969.
2013-12-19 00:36:15 +00:00
Ralph Castain
39957df08e
Fixes trac:3963. Fix the tool ess procedure so it opens and selects the OOB framework, and have the OOB TCP module update the route to new connections (the routed modules know what to do).
...
Thanks to Dave Love and Ashley Pittman for pointing out the problem.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix tool communications with mpirun
This commit was SVN r29959.
The following Trac tickets were found above:
Ticket 3963 --> https://svn.open-mpi.org/trac/ompi/ticket/3963
2013-12-18 23:13:46 +00:00
Ralph Castain
77553f72be
Per this email thread:
http://www.open-mpi.org/community/lists/devel/2013/12/13412.php
fix the backtrace function to avoid async issues. Thanks to Takahiro Kawashima for the patch.
This commit was SVN r29955.
2013-12-18 17:57:37 +00:00