1
1
Граф коммитов

4108 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
590a87c730 You can't pass static buffer definitions to rml.send as it will attempt to release them upon completion - you need to send dynamically allocated buffers
This commit was SVN r30261.
2014-01-11 19:38:11 +00:00
Ralph Castain
286ff6d552 For large scale systems, we would like to avoid doing a full modex during MPI_Init so that launch will scale a little better. At the moment, our options are somewhat limited as only a few BTLs don't immediately call modex_recv on all procs during startup. However, for those situations where someone can take advantage of it, add the ability to do a "modex on demand" retrieval of data from remote procs when we launch via mpirun.
NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message!

Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place.

This behavior will *only* take effect under the following conditions:

1. launched via mpirun

2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX

3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to *want* this behavior to get it.

The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes:

1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data".

2. adjust the SM rendezvous logic to loop until the required file has been created

Creating a placeholder to bring this over to 1.7.5 when ready.

cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale

This commit was SVN r30259.
2014-01-11 17:36:06 +00:00
Ralph Castain
fb9e427320 One last corner case - when encountering an overload condition (e.g., by comm_spawning more procs than we have cores) and we are using the default binding policy, do *not* bind the new procs to anything as this can cause major problems. Instead, let the spawn succeed since the user didn't specifically ask to be bound, and leave the new procs as unbound.
Refs trac:4077

This commit was SVN r30200.

The following Trac tickets were found above:
  Ticket 4077 --> https://svn.open-mpi.org/trac/ompi/ticket/4077
2014-01-09 22:39:34 +00:00
Ralph Castain
24e990e747 Fix comm_spawn for oversubscribed systems by correctly computing the number of available slots
cmr=v1.7.4:reviewer=jsquyres:subject=Fix comm_spawn for oversubscribed systems

This commit was SVN r30197.
2014-01-09 20:33:48 +00:00
Ralph Castain
9fcb46d85a Correctly detect and handle oversubscription for comm_spawn
cmr=v1.7.4:reviewer=jsquyres:subject=Correctly detect and handle oversubscription for comm_spawn

This commit was SVN r30186.
2014-01-09 18:27:51 +00:00
Ralph Castain
6e5fedeb04 Oops - add verbose output to inform that cannot default bind due to no cores detected
Refs trac:4074

This commit was SVN r30185.

The following Trac tickets were found above:
  Ticket 4074 --> https://svn.open-mpi.org/trac/ompi/ticket/4074
2014-01-09 18:17:14 +00:00
Ralph Castain
4cdc291df1 Ensure slurm properly dies on abnormal termination
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure slurm properly dies on abnormal termination

This commit was SVN r30182.
2014-01-09 16:52:02 +00:00
Jeff Squyres
87e476ebd8 Clean up many references to "rank": usually change to "process" and/or
specifically delineate that we're referring to the process' rank in
MPI_COMM_WORLD.

Refs trac:4068

This commit was SVN r30181.

The following Trac tickets were found above:
  Ticket 4068 --> https://svn.open-mpi.org/trac/ompi/ticket/4068
2014-01-09 16:37:49 +00:00
Ralph Castain
7e4748a0f1 Handle the case of nodes that do not report cores, and thus our default binding policy will fail even though binding is supported by defaulting to not binding on those nodes.
Thanks to Paul Hargrove for reporting the problem on NetBSD.

cmr=v1.7.4:reviewer=jsquyres:subject=Handle the case of nodes that do not report cores

This commit was SVN r30180.
2014-01-09 16:27:58 +00:00
Ralph Castain
f179f2086b Do a better job of reporting bindings - if someone gives a spec that binds us to all processors, then we are effectively unbound and should report it clearly instead of outputting a long line of B's.
cmr=v1.7.4:reviewer=jsquyres:subject=Do a better job of reporting bindings

This commit was SVN r30179.
2014-01-09 16:16:16 +00:00
Ralph Castain
2a0e4b5e62 Update the orterun help messages and man page to reflect new map/rank/bind options and defaults. Thanks to Paul Hargrove for reporting it.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30173.
2014-01-09 04:44:28 +00:00
Ralph Castain
bf453a2575 Reference the correct variable...sigh
Refs trac:4059

This commit was SVN r30163.

The following Trac tickets were found above:
  Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 22:36:39 +00:00
Ralph Castain
80497d73cf Need to mark the daemon as alive so that exit commands are properly routed during abnormal terminations. Also, remove stale references to the "selected oob component" as we no longer require only one component be selected
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30162.
2014-01-08 22:35:48 +00:00
Ralph Castain
d5647394d8 Initialize variable so dash-host option gets correctly parsed
cmr=v1.7.4:reviewer=rolfv

This commit was SVN r30159.
2014-01-08 15:17:16 +00:00
Ralph Castain
e724d0d12d Ensure comm_spawn'd jobs get treated the same wrt setting default mapping directives
Refs trac:4059

This commit was SVN r30158.

The following Trac tickets were found above:
  Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 15:16:22 +00:00
Ralph Castain
fb650aed0c Fix how we transfer mapping directives to the job, ensuring that directives that can be given outside of a mapping policy (e.g., oversubscribe and no-use-local) are retained.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix how we transfer mapping directives to the job

This commit was SVN r30155.
2014-01-08 04:25:43 +00:00
Ralph Castain
bc75250951 Cleanup the sensor framework close - existing code was using incorrect object type. Don't start sensors if sample rate is zero. Don't add zero-byte data from resusage as it means nothing was measured.
cmr=v1.7.4:reviewer=hjelmn

This commit was SVN r30150.
2014-01-08 02:38:56 +00:00
Jeff Squyres
13b29cff2c This commit compliements/completes r30140. r30140 made all the
configury/Makefile.am changes; this commit renames the internal
installdirs.h framework struct field names to match the configry macro
names:

 * pkgdatdir ->	ompidatadir
 * pkglibdir -> ompilibdir
 * pkgincludedir -> ompiincludedir

This commit was SVN r30145.

The following SVN revision numbers were found above:
  r30140 --> open-mpi/ompi@8b778903d8
2014-01-07 23:36:33 +00:00
Brian Barrett
8b778903d8 Fix longstanding issue with our multi-project support. Rather than using
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi.  This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.

This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Mike Dubman
40aadab85f re-enable map-by dist
after last refactoring in rmaps, map-by dist:hca  was disabled.
reverting it back

found/fixed by Elena, reviewed by miked

cmr=v1.7.4:reviewer=ompi-rm1.7

This commit was SVN r30118.
2014-01-04 20:44:41 +00:00
Ralph Castain
9a855ff58e Update sensor component for new OOB calls
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30117.
2014-01-03 22:37:15 +00:00
Ralph Castain
3f2b3c53ea Ensure that rankfile-provided allocations are correctly handled
Fixes trac:4043

cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that rankfile-provided allocations are correctly handled

This commit was SVN r30106.

The following Trac tickets were found above:
  Ticket 4043 --> https://svn.open-mpi.org/trac/ompi/ticket/4043
2014-01-02 16:07:16 +00:00
Ralph Castain
d5a5caa7e0 Restore the bycore mpirun option for backward compatibility
Refs trac:4044

cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30103.

The following Trac tickets were found above:
  Ticket 4044 --> https://svn.open-mpi.org/trac/ompi/ticket/4044
2014-01-02 04:16:43 +00:00
Ralph Castain
a8a91b374e Update component-level selection comments to match latest revisions
cmr=v1.7.4:reviewer=rhc

This commit was SVN r30087.
2013-12-25 19:12:43 +00:00
Ralph Castain
d049731911 Add pubsub pmi component to list of components to avoid when indirect launch used
Refs trac:4032

This commit was SVN r30083.

The following Trac tickets were found above:
  Ticket 4032 --> https://svn.open-mpi.org/trac/ompi/ticket/4032
2013-12-25 16:25:37 +00:00
Ralph Castain
85f2429819 Ensure the ipv6 lists get initialized and finalized
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30081.
2013-12-24 17:24:39 +00:00
Ralph Castain
2e08219cac Silence the valgrind report from the OOB
Refs trac:4033

This commit was SVN r30080.

The following Trac tickets were found above:
  Ticket 4033 --> https://svn.open-mpi.org/trac/ompi/ticket/4033
2013-12-24 17:06:45 +00:00
Ralph Castain
81df8d09ca Avoid use of PMI components when launched via mpirun as this is just unnecessary overhead that can cause confusion.
cmr=v1.7.4:reviewer=miked:subject=Avoid use of PMI components when launched via mpirun

This commit was SVN r30078.
2013-12-24 16:32:31 +00:00
Ralph Castain
01ee5f380b Remove debug - problem has been identified
Refs trac:4026

This commit was SVN r30075.

The following Trac tickets were found above:
  Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 15:22:18 +00:00
Jeff Squyres
ce02002a5e Free minor memory leak / squash valgrind still-reachable warning.
cmr=v1.7.5:reviewer=rhc

This commit was SVN r30071.
2013-12-24 11:04:38 +00:00
Ralph Castain
38f46641ce Ensure the recv handler has been initialized
Refs trac:4026

This commit was SVN r30068.

The following Trac tickets were found above:
  Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 06:09:45 +00:00
Ralph Castain
bb80625a8a Add missing var initialization
cmr=v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r30063.
2013-12-24 00:02:22 +00:00
Ralph Castain
65228d3571 Don't use "size_t" for the nbytes field in the header - use uint32_t to ensure that ntohl/htonl correctly match it
Refs trac:4026

This commit was SVN r30062.

The following Trac tickets were found above:
  Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-23 21:39:49 +00:00
Ralph Castain
7d8c0459a4 Attempt to debug hang that is hitting some environments. Posting to 1.7.4 as a placeholder for the eventual solution
cmr=v1.7.4:reviewer=rhc

This commit was SVN r30060.
2013-12-23 19:57:05 +00:00
Nathan Hjelm
3be4536d9b Cleanup various leaks in ompi_info reported by valgrind.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30058.
2013-12-23 17:47:43 +00:00
George Bosilca
24879f9def Code cleanup while chasing valgrind complaints.
This commit was SVN r30048.
2013-12-21 23:28:14 +00:00
George Bosilca
38cbaeaa82 Try to impose a little bit of consistency on how we parse lists of
modules by enforcing the use of OPAL list accessors.

This commit was SVN r30045.
2013-12-21 23:23:33 +00:00
Ralph Castain
264150872b Add a bunch of debug output to the OOB connection completion code so we can track down a handshake problem. Available in optimized builds as well as debug ones by setting -mca oob_base_verbose 10
No review will be required as this is just debug code for those helping us debug the 1.7.4 release candidates

cmr-=v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r30043.
2013-12-21 16:09:26 +00:00
Ralph Castain
9c768df8b8 Resolve an unexpected behavior in hostfile allocations. Now that we filter allocations to determine what will be used for mapping, let the initial global pool be the union of nodes from all sources (default hostfile, hostfiles, and dash-hosts). Each app will filter down to only those specified for it using its own hostfile and dash-host options.
cmr=v1.7.4:reviewer=jsquyres:subject=Resolve an unexpected behavior in hostfile allocations

This commit was SVN r30040.
2013-12-21 01:38:27 +00:00
Adrian Reber
53a70fe87f Trying to get the C/R code to compile again. (send_*_nb)
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.

Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Changes from V2:
* just replace the blocking calls with the non-blocking calls
* all #ifdef's introduced in V1 are gone
* send_* returns error code or ORTE_SUCCESS (not the number of bytes)

This commit was SVN r30036.
2013-12-20 21:58:28 +00:00
Adrian Reber
a3813d37c7 Trying to get the C/R code to compile again. (recv_*_nb)
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.

Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED

Changes from V2:
* only #ifdef out the code where the behaviour is changed
  (used to be blocking; now non-blocking)

This commit was SVN r30035.
2013-12-20 21:05:40 +00:00
Ralph Castain
31248c0985 Correctly add support for the "env" MPI_Info key during comm_spawn, update the "map-by", "rank-by", and "bind-to" Info key behaviors to match the new mapping/ranking/binding system, and update all docs and comments to match.
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.

Refs trac:4003

This commit was SVN r30033.

The following Trac tickets were found above:
  Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003
2013-12-20 20:42:39 +00:00
Ralph Castain
71b52fe861 Ensure that comm_spawn'd procs get user-specified forwarded envars
Thanks to Tim Miller for reporting the regression from the 1.6 series

cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that comm_spawn'd procs get user-specified forwarded envars

This commit was SVN r30012.
2013-12-20 14:47:35 +00:00
Ralph Castain
d47d2569f3 We stripped the process info packing routine to minimize message size when sending the launch message, but tools still require all the info. So modify the tool-hnp handshake to explicitly add the missing info
Refs trac:3992

This commit was SVN r29989.

The following Trac tickets were found above:
  Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 20:42:20 +00:00
Ralph Castain
55cd65b149 Don't warn about binding (process and/or memory) if the node cannot do it or if we would overload, but it wasn't specifically requested by the user (i.e., it is the result of the default policy). Instead, just don't bind and quietly move along.
Reset topology usage for each node as we bind as multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff.

cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings

This commit was SVN r29978.
2013-12-19 16:31:45 +00:00
Ralph Castain
9b32dacb6c Ensure we don't abort if a tool cannot send a message - the orte/util/comm library used by tools to query mpirun knows how to handle this situation.
Refs trac:3992

This commit was SVN r29975.

The following Trac tickets were found above:
  Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 07:10:36 +00:00
Ralph Castain
6239e64f36 Further cleanup of orte-ps so it doesn't abort when hitting a stale HNP - only report that event once and just keep working.
Refs trac:3992

This commit was SVN r29974.

The following Trac tickets were found above:
  Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 03:28:05 +00:00
Ralph Castain
bf5e314f76 Tools require their own errmgr and state components so they can handle any errors that occur in, for example, communication .
Refs trac:3992

This commit was SVN r29972.

The following Trac tickets were found above:
  Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 01:49:33 +00:00
Ralph Castain
3aaca16faa Silence warnings that are no longer valid
Refs trac:3992

This commit was SVN r29970.

The following Trac tickets were found above:
  Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 00:40:36 +00:00
Ralph Castain
c5956e7b8c Convert debug output to opal_output_verbose
Thanks to Tetsuya Mishima for reporting it

cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r29969.
2013-12-19 00:36:15 +00:00