Commit graph

18588 commits

Author SHA1 Message Date
Ralph Castain
43d1cd92ac Ensure we activate the "daemons launched" state when only the HNP is left or else we will hang.
cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29094.
2013-08-29 22:50:51 +00:00
Dave Goodell
d17f104e7a oob: squash some valgrind warnings
These warnings were harmless, but they appeared even for simple programs
like single-process runs of `ring_c`.

This commit was SVN r29093.
2013-08-29 21:08:44 +00:00
Ralph Castain
5d1fa4fa0e Silence warnings:
osc_pt2pt_data_move.c: In function 'ompi_osc_pt2pt_sendreq_recv_accum_long_cb':
osc_pt2pt_data_move.c:643:9: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
osc_rdma_data_move.c: In function 'ompi_osc_rdma_control_send_cb':
osc_rdma_data_move.c:1312:37: warning: variable 'header' set but not used [-Wunused-but-set-variable]

This commit was SVN r29092.
2013-08-29 20:56:36 +00:00
Ralph Castain
12d4f45b5e Silence warning:
oob_tcp_connection.c: In function 'mca_oob_tcp_peer_accept':
oob_tcp_connection.c:725:9: warning: variable 'cmpval' set but not used [-Wunused-but-set-variable]

Refs trac:3696

This commit was SVN r29091.

The following Trac tickets were found above:
  Ticket 3696 --> https://svn.open-mpi.org/trac/ompi/ticket/3696
2013-08-29 20:56:05 +00:00
Ralph Castain
7a7cfdd519 A little cleanup - the base function to sort numa lists must return something or you get a warning about non-void function returning without value, so cleanup the return values. Ensure the mindist module actually checks for a return of "error" so it won't segfault, and have it emit a polite message when that happens.
cmr:v1.7.3:reviewer=jladd

This commit was SVN r29089.
2013-08-29 20:01:06 +00:00
Ralph Castain
3516348aad We don't need to report errors in pmi_setup as it is possible that PMI is available, but that we weren't launched under it (e.g., we launched via mpirun).
cmr:v1.7.3:reviewer=hjelmn:subject="Silence unnecessary PMI error msgs"

This commit was SVN r29086.
2013-08-29 16:35:20 +00:00
Ralph Castain
c71e760e6c The modex code was unfortunately written solely for PMI1 when updated to minimize calls to PMI_get - add the required PMI2 code
This commit was SVN r29084.
2013-08-28 23:52:32 +00:00
Ralph Castain
537e7380b1 As per the discussion on the devel telecon, do not compute ompi_comm_world_thread_level_mult if thread multiple is disabled. We aren't using the value anyway, but we will leave the current code in-place until we understand if it is needed or not.
This commit was SVN r29080.
2013-08-28 17:44:04 +00:00
Joshua Ladd
1802aabf1a Add support for autodetecting a MLNX HCA in the rmaps min distance feature. In this way, .ini files distributed with software stacks need not specify a particular HCA but instead may select the keyword auto, which will automatically select the discovered device. To use this feature, simply pass the keyword auto instead of a specific device name, --mca rmaps_base_dist_hca auto. If more than one card is installed, the mapper will inform the user of this, and the user will then need to specify which card via the normal route, e.g. --mca rmaps_base_dist_hca <dev_name>. This should be added to
cmr=v1.7.4:reviewer=rhc:subject=Autodetect logic for min dist mapping
This commit was SVN r29079.
2013-08-28 16:23:33 +00:00
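
The autodetect rule described in the commit above can be sketched as follows. This is a minimal illustration, not the actual rmaps code: the function name select_hca, the device names, and the handling of the "no device found" case are assumptions made for the example.

```python
def select_hca(requested, discovered):
    """Return the HCA name to use, or raise ValueError with a user-facing message.

    requested  -- value given for rmaps_base_dist_hca ("auto" or a device name)
    discovered -- list of device names found on the node (hypothetical input)
    """
    if requested != "auto":
        # An explicit device name always wins.
        return requested
    if len(discovered) == 1:
        # Exactly one card: "auto" resolves to it.
        return discovered[0]
    if not discovered:
        # Assumption: the commit message does not describe this case.
        raise ValueError("no HCA discovered; specify a device explicitly")
    # More than one card: the user must choose.
    raise ValueError(
        "multiple HCAs discovered (%s); specify one explicitly"
        % ", ".join(discovered)
    )
```

With a single discovered device, select_hca("auto", ["mlx4_0"]) returns that device; with two devices it raises, mirroring the "inform the user" behavior in the message.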
Nathan Hjelm
77a41e1ca9 ompi_info: mark the variables from disabled components as disabled in
the output of ompi_info.

A variable is disabled if its component will never be selected due to
a component selection parameter (e.g., -mca btl self). The old behavior
of ompi_info was to not print these parameters at all. Now we print the
parameters. After some discussion with George it was decided that there
needed to be some way to see what parameters will not be used. This was
the compromise.

This commit also fixes a bug and a typo in the pvar system. The enum_count
value in mca_base_pvar_dump was being used without being set. The full_name
in mca_base_pvar_t was not being used.

cmr=v1.7.3:ticket=trac:3734

This commit was SVN r29078.

The following Trac tickets were found above:
  Ticket 3734 --> https://svn.open-mpi.org/trac/ompi/ticket/3734
2013-08-28 16:03:23 +00:00
George Bosilca
305fa88d4b Remove two warnings from the SM BTL. The return code can be safely ignored
as the internals of the SM BTL will repost the fragment until the send operation
successfully completes.

This commit was SVN r29077.
2013-08-28 06:36:01 +00:00
George Bosilca
badd011ac3 Minor cleanup.
This commit was SVN r29076.
2013-08-28 05:48:58 +00:00
Dave Goodell
dd82bd3c19 usnic: fix invalid rfstart initialization
endpoint_rfstart was being initialized from a value which was not yet
set.  Also ensure that rfstart is a valid index in the range
0..WINDOW_SIZE-1, since it is used as the index into endpoint_rcvd_segs,
which has WINDOW_SIZE elements.

Without this change there is significant risk of memory corruption or
segfaults, resulting in hangs or crashes, if malloc ever returns us a
value >=WINDOW_SIZE (4096).  Right now we seem to be getting lucky that
the malloc is returning zero-pages to us when we are allocating endpoint
structures (possibly because the freelist performs a single large
allocation for all endpoints).

Fixes Cisco bug CSCui88781.

Reviewed-by: rfaucett@cisco.com
Reviewed-by: jsquyres@cisco.com

cmr=v1.7.3:reviewer=jsquyres

This commit was SVN r29075.
2013-08-27 22:43:20 +00:00
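
The index constraint in the usnic fix above can be illustrated with a short sketch. The commit message does not say how the index is brought into range; the modulo reduction below is an assumption, and window_index and rcvd_segs are illustrative names, not the actual usnic identifiers.

```python
WINDOW_SIZE = 4096  # value quoted in the commit message


def window_index(value):
    """Map an arbitrary non-negative value onto a slot in 0..WINDOW_SIZE-1."""
    return value % WINDOW_SIZE


# A WINDOW_SIZE-element array, standing in for endpoint_rcvd_segs.
rcvd_segs = [False] * WINDOW_SIZE

# Without the reduction, 70000 would index far past the end of the array.
rcvd_segs[window_index(70000)] = True
```

Any value at or above WINDOW_SIZE wraps around, so the array access stays inside its WINDOW_SIZE elements.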
Ralph Castain
7125143253 Replace missing opal_db open/select that was apparently lost on a prior merge. Thanks to Nathan for pointing it out
This commit was SVN r29072.
2013-08-27 19:42:31 +00:00
Nathan Hjelm
3744c5e0be Also check for /dev/mic/scif when deciding whether to enable the Linux
memory hooks.

The MIC has a /dev/scif device and the host has /dev/mic/scif. I do not
know if this device exists when no MIC is connected.

cmr=v1.7.4:ticket=trac:3733:reviewer=jsquyres

This commit was SVN r29071.

The following Trac tickets were found above:
  Ticket 3733 --> https://svn.open-mpi.org/trac/ompi/ticket/3733
2013-08-27 19:40:02 +00:00
Nathan Hjelm
c699ee7812 Update the ompi_info man page with information about variable levels
and improve the behavior of ompi_info.

This commit changes the default behavior of ompi_info --all when a
level is not specified. Instead of assuming level 1 in this case we
now assume level 9. This change is due to feedback from the community
after the introduction of the --level option.

I also added a new option: --selected-only. This option will limit the
displayed variables to components that can be selected (i.e., when a
selection parameter is set, such as btl self,sm).

cmr=v1.7.3:reviewer=jsquyres

This commit was SVN r29070.
2013-08-27 19:11:37 +00:00
Nathan Hjelm
6e1656279e Enable the use of the Linux memory hooks on Intel MIC.
cmr=v1.7.3:reviewer=jsquyres

This commit was SVN r29069.
2013-08-27 18:25:18 +00:00
Nathan Hjelm
2da64eb719 Fix compilation of the MPI tools information interface when profiling
is enabled and fix a bug in the handling of watermark performance
variables.

cmr=v1.7.3:ticket=trac:3725:reviewer=jsquyres

This commit was SVN r29068.

The following Trac tickets were found above:
  Ticket 3725 --> https://svn.open-mpi.org/trac/ompi/ticket/3725
2013-08-27 18:19:18 +00:00
George Bosilca
cf09fe7c99 It wasn't even compiling when heterogeneous support was on.
This commit was SVN r29067.
2013-08-27 16:53:33 +00:00
George Bosilca
65a362909d Can't see how it works ...
Thanks Thomas and Arm for the patch.

This commit was SVN r29066.
2013-08-27 16:52:24 +00:00
Nathan Hjelm
f5495ace48 coll/ml: update the coll_ml_enable_fragmentation variable to support the
option to autodetect whether fragmentation should be enabled

cmr=v1.7.3:ticket=trac:3717

This commit was SVN r29065.

The following Trac tickets were found above:
  Ticket 3717 --> https://svn.open-mpi.org/trac/ompi/ticket/3717
2013-08-27 16:36:54 +00:00
Ralph Castain
c9a25465da Don't need the number of nodes any more for PMI
Refs trac:3729

This commit was SVN r29064.

The following Trac tickets were found above:
  Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729
2013-08-23 18:36:51 +00:00
Ralph Castain
6d24b34940 Extend the dpm framework API to support persistent accept/connect operations:
* paccept - establish a persistent listening port for async connect requests

* pconnect - async connect to remote process that has posted a paccept port. Provides a timeout mechanism, and allows the underlying implementation to retry until timeout 

* pclose - shuts down a prior paccept posting

Includes example programs paccept.c and pconnect.c in orte/test/mpi. New MPI extension interfaces coming...

This commit was SVN r29063.
2013-08-23 18:02:50 +00:00
Rolf vandeVaart
96457df9bc Fix compile errors created from changeset 29058.
This commit was SVN r29061.
2013-08-22 18:25:23 +00:00
Jeff Squyres
63ac60864b Refs trac:3730
Turns out that AC_CHECK_DECLS is one of the "new style" Autoconf
macros that #defines the output to be 0 or 1 (vs. #define'ing or
#undef'ing it).  So don't check for "#if defined(..."; just check for
"#if ...".

This commit was SVN r29059.

The following Trac tickets were found above:
  Ticket 3730 --> https://svn.open-mpi.org/trac/ompi/ticket/3730
2013-08-22 17:44:20 +00:00
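
The distinction in the commit above is easy to miss, so here is a small model of it. AC_CHECK_DECLS always defines HAVE_DECL_<symbol>, to 1 when the declaration is found and to 0 when it is not, so a "defined" test is true in both cases. The sketch models the two preprocessor tests in Python; the macro name HAVE_DECL_FOO and the helper names are hypothetical.

```python
def if_defined(macros, name):
    """Model of `#if defined(NAME)`: true whenever the macro exists at all."""
    return name in macros


def if_value(macros, name):
    """Model of `#if NAME`: uses the macro's value; undefined names count as 0."""
    return macros.get(name, 0) != 0


# What an AC_CHECK_DECLS-style check leaves behind in each case.
decl_missing = {"HAVE_DECL_FOO": 0}
decl_present = {"HAVE_DECL_FOO": 1}
```

Here if_defined(decl_missing, "HAVE_DECL_FOO") is true even though the declaration is absent, which is the bug pattern the commit removes; if_value gives the intended answer in both cases.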
Ralph Castain
a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.
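
As a rough illustration of the wait loop described in the note above (a Python model, not the C macro itself), the pattern is to keep calling the progress function until the flag clears. The names wait_for_completion, flag, and fake_progress are assumptions for the sketch.

```python
def wait_for_completion(flag, progress):
    """Call progress() repeatedly until flag['active'] becomes false.

    Returns the number of progress cycles, purely for illustration.
    """
    cycles = 0
    while flag["active"]:
        progress()
        cycles += 1
    return cycles


flag = {"active": True}
pending = [3]  # events left before the hypothetical RTE operation completes


def fake_progress():
    """Stand-in for opal_progress: completes the operation after three calls."""
    pending[0] -= 1
    if pending[0] == 0:
        flag["active"] = False
```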

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in https://bitbucket.org/rhc/ompi-oob2

WHAT: Rewrite of ORTE OOB

WHY: Support asynchronous progress and a host of other features

WHEN: Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.
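
The subnet-based reachability test mentioned in the list above can be sketched with Python's standard ipaddress module. This is a minimal illustration under the assumption that a NIC is described by address/prefix strings and a peer by a list of addresses; the function name reachable and the sample addresses are made up for the example.

```python
import ipaddress


def reachable(nic_interfaces, peer_addresses):
    """Return True if any peer address lies in a subnet of the NIC.

    nic_interfaces -- address/mask pairs for one NIC, e.g. "192.168.1.10/24"
    peer_addresses -- addresses taken from the peer's contact info
    """
    networks = [ipaddress.ip_interface(i).network for i in nic_interfaces]
    for addr in peer_addresses:
        ip = ipaddress.ip_address(addr)
        # An IPv4 address is never "in" an IPv6 network, and vice versa.
        if any(ip in net for net in networks):
            return True
    return False
```

For example, a NIC with "192.168.1.10/24" reaches a peer that advertises 192.168.1.77 but not one that only advertises 10.0.0.5. As the commit notes, a shared-subnet check like this does not handle routing via gateways.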


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC
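
The last item above describes per-peer sequence numbers as work still to be done. One way such in-order delivery could look is sketched below; this is an assumption-laden illustration, not the RML design. The class name OrderedReceiver is hypothetical, sequence numbers are assumed to start at 0 for each peer, and duplicate or stale sequence numbers are not handled.

```python
class OrderedReceiver:
    """Hold out-of-order messages and release them in sequence, per peer."""

    def __init__(self):
        self.next_seq = {}  # peer -> next sequence number to deliver
        self.held = {}      # peer -> {seq: message} waiting for earlier ones

    def receive(self, peer, seq, message):
        """Accept one message; return the messages now deliverable, in order."""
        expected = self.next_seq.get(peer, 0)
        pending = self.held.setdefault(peer, {})
        pending[seq] = message
        delivered = []
        # Release the longest run of consecutive messages starting at expected.
        while expected in pending:
            delivered.append(pending.pop(expected))
            expected += 1
        self.next_seq[peer] = expected
        return delivered
```

If message 1 from a peer arrives before message 0 (for instance because it travelled over a different transport), it is held until message 0 arrives, and then both are delivered in order.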

This commit was SVN r29058.
2013-08-22 16:37:40 +00:00
Ralph Castain
63d10d2d0d Fix typo
Refs trac:3729

This commit was SVN r29057.

The following Trac tickets were found above:
  Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729
2013-08-22 16:05:58 +00:00
Ralph Castain
16c5b30a1f Since the calls to "PMI get" scale by number of procs (not nodes), it makes more sense to have the MCA param be the cutoff based on number of procs. Also, it occurred to me that this shouldn't impact the nidmap process as that is built and circulated when we launch via mpirun, not during direct launch.
So shift the cutoff param to the MPI layer, and have it solely determine whether or not we call modex_recv on the hostname. If comm_world is of size greater than the cutoff, then we don't automatically retrieve the hostname when we build the ompi_proc_t for a process - instead, we fill the hostname entry on first call to modex_recv for that process.

The param is now "ompi_hostname_cutoff=N", where N=number of procs for cutoff.

Refs trac:3729

This commit was SVN r29056.

The following Trac tickets were found above:
  Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729
2013-08-22 03:40:26 +00:00
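
The cutoff behavior described in the commit above amounts to lazy retrieval. The sketch below models it under simplifying assumptions: Proc stands in for ompi_proc_t, modex_recv is reduced to returning only the hostname, and fetch_hostname is a hypothetical callable representing the expensive lookup.

```python
class Proc:
    """Stand-in for ompi_proc_t; only the hostname field is modeled."""

    def __init__(self, rank):
        self.rank = rank
        self.hostname = None


def modex_recv(proc, fetch_hostname):
    """Simplified modex_recv: fills the hostname on first use for this proc."""
    if proc.hostname is None:
        proc.hostname = fetch_hostname(proc.rank)
    return proc.hostname


def build_procs(world_size, cutoff, fetch_hostname):
    """Create proc entries, fetching hostnames eagerly only up to the cutoff."""
    procs = [Proc(rank) for rank in range(world_size)]
    if world_size <= cutoff:
        for proc in procs:
            modex_recv(proc, fetch_hostname)
    return procs
```

With a world size above the cutoff, no lookups happen while the proc list is built; each hostname is fetched once, the first time modex_recv is called for that process.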
Rolf vandeVaart
504fa2cda9 Fix support in smcuda btl so it does not blow up when there is no CUDA IPC support between two GPUs. Also make it so CUDA IPC support is added dynamically.
Fixes ticket 3531.    

This commit was SVN r29055.
2013-08-21 21:00:09 +00:00
Rolf vandeVaart
96fdb060ea Fix compile errors and warnings from changeset 29052.
This commit was SVN r29054.
2013-08-21 19:01:54 +00:00
Steve Wise
67fe3f23ed Use the HAVE_DECL_IBV_LINK_LAYER_ETHERNET macro.
Commit r27211 added ifdef checks for #define
HAVE_IBV_LINK_LAYER_ETHERNET, which is incorrect.  The correct #define
is HAVE_DECL_IBV_LINK_LAYER_ETHERNET.  This broke OMPI over iWARP.

This fixes trac:3726 and should be added to cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29053.

The following SVN revision numbers were found above:
  r27211 --> open-mpi/ompi@b27862e5c7

The following Trac tickets were found above:
  Ticket 3726 --> https://svn.open-mpi.org/trac/ompi/ticket/3726
2013-08-20 20:00:46 +00:00
Ralph Castain
45e695928f As per the email discussion, revise the sparse handling of hostnames so that we avoid potential infinite loops while allowing large-scale users to improve their startup time:
* add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit.

* remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL"

* modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded

* removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base

* added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames

This commit was SVN r29052.
2013-08-20 18:59:36 +00:00
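
Two rules from the commit above are small enough to state as code: hostnames are included only for allocations smaller than the cutoff (which defaults to INT_MAX), and readers of proc_hostname must guard against it being unset. The function names and the "unknown" placeholder below are assumptions for illustration.

```python
INT_MAX = 2**31 - 1  # default cutoff: hostnames are always included


def include_hostnames(num_nodes, cutoff=INT_MAX):
    """True if hostnames should be included for an allocation of this size."""
    return num_nodes < cutoff


def hostname_or_placeholder(proc_hostname, placeholder="unknown"):
    """Model of the 'if NULL' guard around direct reads of proc_hostname."""
    return proc_hostname if proc_hostname is not None else placeholder
```

With the default cutoff every allocation keeps its hostnames; with a cutoff of 1024, an allocation of 4096 nodes does not.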
Ralph Castain
f49f879b2d Set ignore
This commit was SVN r29051.
2013-08-20 18:29:27 +00:00
Jeff Squyres
31283aaffd Revert r29049 because it is incorrectly overriding the results of an
AC config macro.

This commit was SVN r29050.

The following SVN revision numbers were found above:
  r29049 --> open-mpi/ompi@b82f89e78b
2013-08-20 01:21:41 +00:00
Steve Wise
b82f89e78b Define HAVE_IBV_LINK_LAYER_ETHERNET if it is supported in libibverbs.
Commit r27211 missed a config file change which broke ompi over
iwarp transports.  

This fixes trac:3726 and should be added to cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29049.

The following SVN revision numbers were found above:
  r27211 --> open-mpi/ompi@b27862e5c7

The following Trac tickets were found above:
  Ticket 3726 --> https://svn.open-mpi.org/trac/ompi/ticket/3726
2013-08-19 22:27:51 +00:00
Jeff Squyres
b30ad28276 Remove some unused variables and an unused goto label.
This commit was SVN r29044.
2013-08-19 16:18:35 +00:00
Ralph Castain
e0cfcf376f Okay, fix it so it works with both --disable-mpi-profile and --enable-mpi-profile. I'm not sure why mpit's library has to be treated differently, but it seems that it needs some special care to work in both scenarios
Refs trac:3725

This commit was SVN r29043.

The following Trac tickets were found above:
  Ticket 3725 --> https://svn.open-mpi.org/trac/ompi/ticket/3725
2013-08-19 14:48:23 +00:00
Ralph Castain
9aebd7e281 Ensure we register the nidmap verbosity in mpirun, and add some debug
This commit was SVN r29042.
2013-08-18 23:40:32 +00:00
Ralph Castain
b730c9540e Fix --disable-mpi-profile option so it can build
cmr:v1.7.3:reviewer=hjelmn

This commit was SVN r29041.
2013-08-18 18:22:34 +00:00
Ralph Castain
611d7f9f6b When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require.
This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RMs like Slurm and Alps, but we make things worse by calling "get" so many times.

Nathan (with a tad of advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:

* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally

* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test

* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).

Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it

Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.

This commit was SVN r29040.
2013-08-17 00:49:18 +00:00
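
The first change in the commit above (fetch and decode everything for a remote proc once, then answer later requests locally) is a standard caching pattern. The sketch below illustrates it with hypothetical names: CachingStore, fetch_all, and the sample keys are not the OPAL db API.

```python
class CachingStore:
    """Fetch all of a proc's data on first request, then serve from a local cache."""

    def __init__(self, fetch_all):
        self._fetch_all = fetch_all  # callable: proc -> dict of all decoded values
        self._cache = {}             # proc -> dict, filled on first access
        self.remote_fetches = 0      # counts the expensive remote operations

    def get(self, proc, key):
        """Return the value for key, contacting the remote side at most once per proc."""
        if proc not in self._cache:
            self._cache[proc] = dict(self._fetch_all(proc))
            self.remote_fetches += 1
        return self._cache[proc].get(key)
```

However many keys are requested for a given proc, only the first request triggers a remote fetch; the rest are fulfilled from the local dictionary.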
Ralph Castain
991e59a58a Update MCA param in platform file
This commit was SVN r29039.
2013-08-16 22:18:22 +00:00
Ralph Castain
11a3743b21 Cleanup uninitialized var warnings
This commit was SVN r29038.
2013-08-16 21:49:17 +00:00
Ralph Castain
90cfd139cf Cleanup error - need an "and" instead of an "or"
This commit was SVN r29037.
2013-08-16 21:41:59 +00:00
Ralph Castain
f8a72feb25 Silence uninitialized var warning
This commit was SVN r29036.
2013-08-16 21:39:28 +00:00
Ralph Castain
c5f395d36a Silence uninitialized var warnings
This commit was SVN r29035.
2013-08-16 21:37:35 +00:00
Ralph Castain
b2d86e1857 Silence uninitialized var warning
This commit was SVN r29034.
2013-08-16 21:35:51 +00:00
Ralph Castain
c74c54e18d Cleanup uninitialized warnings
This commit was SVN r29033.
2013-08-16 21:23:09 +00:00
Ralph Castain
b34bff8792 Cleanup warning
This commit was SVN r29032.
2013-08-16 21:14:35 +00:00
Ralph Castain
7947cec8fa Cleanup warning
This commit was SVN r29031.
2013-08-16 21:13:40 +00:00
Ralph Castain
33beab5918 Avoid segfault due to uninitialized variable
This commit was SVN r29030.
2013-08-16 21:10:38 +00:00