1
1
Граф коммитов

4626 Коммитов

Автор SHA1 Сообщение Дата
Dave Goodell
0ef8336502 new bookkeeping code should return value indicating whether packet is good or not.
Authored-by: Reese Faucette <rfaucett@cisco.com>

Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)

This commit was SVN r29136.

The following Trac tickets were found above:
  Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
2013-09-06 03:19:32 +00:00
Dave Goodell
122890c2fd usnic: "bookeeping" --> "bookkeeping"
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)

This commit was SVN r29135.

The following Trac tickets were found above:
  Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
2013-09-06 03:19:20 +00:00
Dave Goodell
0df6ed4acc usnic: squash warnings from perf improvements
Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)

This commit was SVN r29134.

The following Trac tickets were found above:
  Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
2013-09-06 03:19:08 +00:00
Dave Goodell
6dc54d372d usnic: Basket of performance changes including:
- round segment buffer allocation to cache-line
    - split some routines into an inline fast section and a called
      slower section
    - introduce receive fastpath in component_progress that:
        o returns immediately if there is a packet available on priority
          queue and fastpath is enabled
        o disables fastpath for 1 time after use to provide fairness to
          other processing
        o defers receive buffer posting
        o defers bookeeping for receive until next call
          to usnic_component_progress

Authored-by: Reese Faucette <rfaucett@cisco.com>

Should be included in usnic v1.7.3 roll-up CMR (refs trac:3760)

This commit was SVN r29133.

The following Trac tickets were found above:
  Ticket 3760 --> https://svn.open-mpi.org/trac/ompi/ticket/3760
2013-09-06 03:18:57 +00:00
Dave Goodell
9cab9777d9 usnic: properly destroy embedded small send frag
Without this, an `--enable-debug` build would hit an assertion in the
list code when run under valgrind with `--malloc-fill=0xff` or any other
case where malloc returned non-zeroed buffers.

Also allow the normal OBJ_ machinery to handle the constructor
invocation ordering for us instead of doing it by hand (which could have
led to future bugs).

Reviewed-by: jsquyres@cisco.com

cmr=v1.7.4

Depends on trunk functionality in r29095 and r29096.  Refs trac:3740,#3741.

This commit was SVN r29127.

The following SVN revision numbers were found above:
  r29095 --> open-mpi/ompi@d1b5940e97
  r29096 --> open-mpi/ompi@a552921171

The following Trac tickets were found above:
  Ticket 3740 --> https://svn.open-mpi.org/trac/ompi/ticket/3740
2013-09-04 20:59:12 +00:00
Jeff Squyres
f6619f8e9e Fix compile error in the heterogeneous case.
We've been forcing C99 compiler compliance for a while now, so use C99
syntax to keep the #if code tidy.

This commit was SVN r29101.
2013-08-31 12:56:08 +00:00
Brian Barrett
16a1166884 Remove the proc_pml and proc_bml fields from ompi_proc_t and replace with a
configure-time dynamic allocation of flags.  The net result for platforms
which only support BTL-based communication is a reduction of 8*nprocs bytes
per process.  Platforms which support both MTLs and BTLs will not see
a space reduction, but will now be able to safely run both the MTL and BTL
side-by-side, which will prove useful.

This commit was SVN r29100.
2013-08-30 16:54:55 +00:00
Rolf vandeVaart
18962d296b This has bothered me for a while. Change MCA_BTL_TAG_BTL to MCA_BTL_TAG_IB. They are the same
value so this does not change anything.  (MCA_BTL_TAG_IB = MCA_BTL_TAG_BTL + 0).  This just makes it more correct.

This commit was SVN r29099.
2013-08-30 14:53:59 +00:00
Dave Goodell
c5a7e8a079 usnic: stomp format specifier warnings
The usnic BTL now builds cleanly under `--enable-picky` when `MSGDEBUG1`
is set.

Reviewed-by: jsquyres

cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r29097.
2013-08-29 23:24:14 +00:00
Ralph Castain
5d1fa4fa0e Silence warnings:
osc_pt2pt_data_move.c: In function 'ompi_osc_pt2pt_sendreq_recv_accum_long_cb':
osc_pt2pt_data_move.c:643:9: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
osc_rdma_data_move.c: In function 'ompi_osc_rdma_control_send_cb':
osc_rdma_data_move.c:1312:37: warning: variable 'header' set but not used [-Wunused-but-set-variable]

This commit was SVN r29092.
2013-08-29 20:56:36 +00:00
George Bosilca
305fa88d4b Remove two warnings from the SM BTL. The return code can be safely ignored
as the internals of the SM BTL will repost the fragment until the send operation
succesfully complete.

This commit was SVN r29077.
2013-08-28 06:36:01 +00:00
Dave Goodell
dd82bd3c19 usnic: fix invalid rfstart initialization
endpoint_rfstart was being initialized from a value which was not yet
set.  Also ensure that rfstart is a valid index in the range
0..WINDOW_SIZE-1, since it is used as the index into endpoint_rcvd_segs,
which has WINDOW_SIZE elements.

Without this change there is significant risk of memory corruption or
segfaults, resulting in hangs or crashes, if malloc ever returns us a
value >=WINDOW_SIZE (4096).  Right now we seem to be getting lucky that
the malloc is returning zero-pages to us when we are allocating endpoint
structures (possibly because the freelist performs a single large
allocation for all endpoints).

Fixes Cisco bug CSCui88781.

Reviewed-by: rfaucett@cisco.com
Reviewed-by: jsquyres@cisco.com

cmr=v1.7.3:reviewer=jsquyres

This commit was SVN r29075.
2013-08-27 22:43:20 +00:00
Nathan Hjelm
f5495ace48 coll/ml: update the coll_ml_enable_fragmentation variable to support the
option to autodetect whether fragmentation should be enabled

cmr=v1.7.3:ticket=trac:3717

This commit was SVN r29065.

The following Trac tickets were found above:
  Ticket 3717 --> https://svn.open-mpi.org/trac/ompi/ticket/3717
2013-08-27 16:36:54 +00:00
Ralph Castain
6d24b34940 Extend the dpm framework API to support persistent accept/connect operations:
* paccept - establish a persistent listening port for async connect requests

* pconnect - async connect to remote process that has posted a paccept port. Provides a timeout mechanism, and allows the underlying implementation to retry until timeout 

* pclose - shuts down a prior paccept posting

Includes example programs paccept.c and pconnect.c in orte/test/mpi. New MPI extension interfaces coming...

This commit was SVN r29063.
2013-08-23 18:02:50 +00:00
Rolf vandeVaart
96457df9bc Fix compile errors created from changeset 29058.
This commit was SVN r29061.
2013-08-22 18:25:23 +00:00
Jeff Squyres
63ac60864b Refs trac:3730
Turns out that AC_CHECK_DECLS is one of the "new style" Autoconf
macros that #defines the output to be 0 or 1 (vs. #define'ing or
#undef'ing it).  So don't check for "#if defined(..."; just check for
"#if ...".

This commit was SVN r29059.

The following Trac tickets were found above:
  Ticket 3730 --> https://svn.open-mpi.org/trac/ompi/ticket/3730
2013-08-22 17:44:20 +00:00
Ralph Castain
a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in  https://bitbucket.org/rhc/ompi-oob2


WHAT:    Rewrite of ORTE OOB

WHY:       Support asynchronous progress and a host of other features

WHEN:    Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC

This commit was SVN r29058.
2013-08-22 16:37:40 +00:00
Ralph Castain
16c5b30a1f Since the calls to "PMI get" scale by number of procs (not nodes), it makes more sense to have the MCA param be the cutoff based on number of procs. Also, it occurred to me that this shouldn't impact the nidmap process as that is built and circulated when we launch via mpirun, not during direct launch.
So shift the cutoff param to the MPI layer, and have it solely determine whether or not we call modex_recv on the hostname. If comm_world is of size greater than the cutoff, then we don't automatically retrieve the hostname when we build the ompi_proc_t for a process - instead, we fill the hostname entry on first call to modex_recv for that process.

The param is now "ompi_hostname_cutoff=N", where N=number of procs for cutoff.

Refs trac:3729

This commit was SVN r29056.

The following Trac tickets were found above:
  Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729
2013-08-22 03:40:26 +00:00
Rolf vandeVaart
504fa2cda9 Fix support in smcuda btl so it does not blow up when there is no CUDA IPC support between two GPUs. Also make it so CUDA IPC support is added dynamically.
Fixes ticket 3531.    

This commit was SVN r29055.
2013-08-21 21:00:09 +00:00
Rolf vandeVaart
96fdb060ea Fix compile errors and warnings from changeset 29052.
This commit was SVN r29054.
2013-08-21 19:01:54 +00:00
Steve Wise
67fe3f23ed Use the HAVE_DECL_IBV_LINK_LAYER_ETHERNET macro.
Commit r27211 added ifdef checks for #define
HAVE_IBV_LINK_LAYER_ETHERNET, which is incorrect.  The correct #define
is HAVE_DECL_IBV_LINK_LAYER_ETHERNET.  This broke OMPI over iWARP.

This fixes trac:3726 and should be added to cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29053.

The following SVN revision numbers were found above:
  r27211 --> open-mpi/ompi@b27862e5c7

The following Trac tickets were found above:
  Ticket 3726 --> https://svn.open-mpi.org/trac/ompi/ticket/3726
2013-08-20 20:00:46 +00:00
Ralph Castain
45e695928f As per the email discussion, revise the sparse handling of hostnames so that we avoid potential infinite loops while allowing large-scale users to improve their startup time:
* add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit.

* remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL"

* modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded

* removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base

* added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames

This commit was SVN r29052.
2013-08-20 18:59:36 +00:00
Jeff Squyres
b30ad28276 Remove some unused variables and an unused goto label.
This commit was SVN r29044.
2013-08-19 16:18:35 +00:00
Ralph Castain
611d7f9f6b When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require.
This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times.

Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:

* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally

* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test

* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).

Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it

Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.

This commit was SVN r29040.
2013-08-17 00:49:18 +00:00
Ralph Castain
f8a72feb25 Silence unitialized var warning
This commit was SVN r29036.
2013-08-16 21:39:28 +00:00
Ralph Castain
c74c54e18d Cleanup uninitialized warnings
This commit was SVN r29033.
2013-08-16 21:23:09 +00:00
Ralph Castain
33beab5918 Avoid segfault due to uninitialized variable
This commit was SVN r29030.
2013-08-16 21:10:38 +00:00
Ralph Castain
bebe852057 Add new info key for publish that allows user to designate that the port is to be unique - i.e., to return an error if that service has already been published. Default is to overwrite
This commit was SVN r29028.
2013-08-14 04:21:17 +00:00
Ralph Castain
318467c04f If we only have global scope, then don't fall back to looking at local scope if the lookup target wasn't found else we will hang
This commit was SVN r29025.
2013-08-13 04:45:33 +00:00
Nathan Hjelm
6c75699068 coll/ml: fix typo in assert that could cause an abort in debug builds.
cmr=v1.7.3:reviewer=manjugv

This commit was SVN r29024.
2013-08-12 14:31:44 +00:00
Jeff Squyres
c09ec204ad Change usNIC BTL to always use small fragments when there is a
non-contiguous converter.  We can't "convert on the fly" because the #
of bytes requested may not divide evenly into the convertor data type.

This commit was SVN r29014.
2013-08-11 17:04:13 +00:00
Nathan Hjelm
47320713bb coll/ml: do not register variables in open and fix a bug in the coll/ml parser
cmr=v1.7.3:reviewer=pasha

This commit was SVN r29010.
2013-08-09 17:55:30 +00:00
Rolf vandeVaart
cd72024a3c Refactor some of the initialization code.
This commit was SVN r29009.
2013-08-09 14:54:17 +00:00
Edgar Gabriel
f7391eca23 Lazy open does not work for the addproc sharedfp component since it starts by
spawning a process using MPI_Comm_spawn. For this, the first operation has to
be collective which we can not guarantuee outside of the MPI_File_open
operation.

This commit was SVN r29008.
2013-08-06 20:48:20 +00:00
Edgar Gabriel
e348f5567f add unignore for me.
This commit was SVN r29007.
2013-08-06 20:47:08 +00:00
George Bosilca
837b3363fe Silence few warnings.
This commit was SVN r29004.
2013-08-06 09:38:30 +00:00
Brian Barrett
2cc947513b * Fix some compile errors
* Need to subtract 1 off the size so that we stay in the bit length requirements

This commit was SVN r28997.
2013-08-05 18:49:48 +00:00
Jeff Squyres
87910daf51 Fix a collection of bugs found by QA and Coverity, and make some minor
improvements:

* Fix minor memory leaks during component_init
* Ensure that an initialization loop does not underflow an unsigned int
* Improve mlock limit checking
* Fix set of BTL modules created during component_init when failing to
  get QP resources or otherwise excluding some (but not all) usnic
  verbs devices
* Fix/improve error messages to be consistent with other Cisco
  documentation
* Randomize the initial sliding window sequence number so that we
  silently drop incoming frames from previous jobs that still have
  existant processes in the middle of dying (and are still
  transmitting) 
* Ensure we don't break out of add_procs too soon and create an
  asymetrical view of what interfaces are available

This commit was SVN r28975.
2013-08-01 16:56:15 +00:00
Nathan Hjelm
8429485a39 mpool/grdma: use the rcache even if not using mpi_leave_pinned or
mpi_leave_pinned_pipeline

This change should improve performance is the non-pinned case where
the same memory region is involved in multiple simultaneous transfers.

cmr=v1.7.3:reviewer=brbarret

This commit was SVN r28973.
2013-07-31 23:50:41 +00:00
Nathan Hjelm
1382d4fb53 Fix typos in _Complex ops
This commit was SVN r28959.
2013-07-26 17:02:45 +00:00
Nathan Hjelm
e4f105ffb3 revert change that shouldn't have been part of r28952
This commit was SVN r28953.

The following SVN revision numbers were found above:
  r28952 --> open-mpi/ompi@cb90a4a7fc
2013-07-25 20:23:55 +00:00
Nathan Hjelm
cb90a4a7fc Add simple algorithms to support MPI_IN_PLACE for MPI_Alltoall, MPI_Alltoallv, and MPI_Alltoallw.
Working on faster algorithms for tuned that will come at a later time.

cmr=v1.7.3:ticket=trac:2965

This commit was SVN r28952.

The following Trac tickets were found above:
  Ticket 2965 --> https://svn.open-mpi.org/trac/ompi/ticket/2965
2013-07-25 19:19:41 +00:00
Nathan Hjelm
99adeb7f6e Fix support for complex datatypes when fortran is not available but _Complex is
This commit was SVN r28951.
2013-07-25 19:08:21 +00:00
Edgar Gabriel
012f99c3b6 ompi_ignore this component until we find a solution for the dependence on
libmpi for the external process that is being spawned by this component.

This commit was SVN r28945.
2013-07-24 20:02:47 +00:00
Jeff Squyres
f7337b8f77 Correct faulty max payload and MTU computations (and update some
debugging that helped us find those).

This commit was SVN r28942.
2013-07-24 16:06:28 +00:00
Ralph Castain
db214a2321 Refs trac:3697 - use the opal_pmi_error function instead of ompi_error as the returned error codes are from PMI
This commit was SVN r28941.

The following Trac tickets were found above:
  Ticket 3697 --> https://svn.open-mpi.org/trac/ompi/ticket/3697
2013-07-24 04:05:41 +00:00
Jeff Squyres
5323051047 Use sysfs to check MPI has enough VFs, QPs, and CQs
Use the new sysfs files to check that there are enough VFs, QPs, and
CQs for all the MPI processes on this server.
    
Move the checking code into its own subroutine to make it smaller and
easier to read/grok.

This commit was SVN r28937.
2013-07-24 00:38:32 +00:00
Ralph Castain
59a71765cf Hmmm...these error outputs will never occur, which is probably not what the author intended. So do the output and THEN jump to the error exit.
This commit was SVN r28918.
2013-07-22 22:58:03 +00:00
Edgar Gabriel
8ffc1aac89 update the _component.c files in ompio to use the explicit assignment of the
mca_register_component_params element of the structure.

This commit was SVN r28914.
2013-07-22 21:11:05 +00:00
Nathan Hjelm
b17cd13c09 sharedfp: ensure sharedfp components register their parameters in mca_register_component_params not mca_component_open
This commit was SVN r28910.
2013-07-22 17:53:58 +00:00
Jeff Squyres
b437041aeb Update one more comment.
This commit was SVN r28908.
2013-07-22 17:29:00 +00:00
Jeff Squyres
4b6006402d Use the RTE framework instead of calling ORTE directly.
Brian (rightfully) hit me on the head with the
don't-use-ORTE-use-the-rte-framework clue bat; the usnic BTL now
nicely plays with the RTE framework.

This commit was SVN r28907.
2013-07-22 17:28:23 +00:00
Jeff Squyres
ca9da8a554 Fix minor typo in the comments/docs.
This commit was SVN r28905.
2013-07-22 17:24:17 +00:00
Rolf vandeVaart
67badf384c Only search SONAME of library. Expand comments.
This commit was SVN r28904.
2013-07-22 15:54:45 +00:00
Brian Barrett
e1d72409cd add missing header
This commit was SVN r28897.
2013-07-21 19:40:31 +00:00
Brian Barrett
704f1ecc18 fix non-orte builds of PSM
This commit was SVN r28893.
2013-07-21 19:12:32 +00:00
Brian Barrett
05ab9cbaa6 Need to ship pmi_internal.h
This commit was SVN r28891.
2013-07-21 19:00:50 +00:00
Brian Barrett
495384d8b7 Update documentation in rte.h to match recent changes
This commit was SVN r28887.
2013-07-20 22:14:12 +00:00
Brian Barrett
414ba3dad8 Update PMI RTE to match error handling changes that were part of r28852.
Note that the PMI RTE still doesn't listen for asynchronous errors, so
the error handler still won't ever actually do anything :).

This commit was SVN r28886.

The following SVN revision numbers were found above:
  r28852 --> open-mpi/ompi@e4e678e234
2013-07-20 22:09:02 +00:00
Brian Barrett
5bfd980968 update PMI RTE component to adapt to ORTE changes
This commit was SVN r28885.
2013-07-20 22:06:47 +00:00
Brian Barrett
d984d25da3 Remove orte header file from sharedfp components (OMPI layer should not
include ORTE layer with the RTE framework).  Thankfully, nothing used
orte_show_help, so easy fix.

This commit was SVN r28884.
2013-07-20 22:03:44 +00:00
Jeff Squyres
194b285447 First commit of the Cisco usNIC BTL.
This BTL accesses the Cisco usNIC Linux device via the Linux verbs
API via Unreliable Datagram queue pairs.  A few noteworthy points:

 * This BTL does most of its own fragmentation; it tells the PML that
   it has a very high max_send_size (much higher than the network
   MTU).
 * Since UD fragments are, by definition, unreliable, the usnic BTL
   handles all of its own reliability via a sliding window approach
   using the opal_hotel construct and many tricks stolen from the
   corpus of knowledge surrounding efficient TCP.
 * There is a fun PML latency-metric based optimization for NUMA
   awareness of short messages.
 * Note that this is ''not'' a generic UD verbs BTL; it is specific to
   the Cisco usNIC device.

This commit was SVN r28879.
2013-07-19 22:13:58 +00:00
Jeff Squyres
3546163c48 Devices that do not support RC QP's are also intentionally skipped;
don't warn about skipping them.

This commit was SVN r28874.
2013-07-19 19:05:18 +00:00
Ralph Castain
e4e678e234 Per the RFC and discussion on the devel list, update the RTE-MPI error handling interface. There are a few differences in the code from the original RFC that came out of the discussion - I've captured those in the following writeup
George and I were talking about ORTE's error handling the other day in regards to the right way to deal with errors in the updated OOB. Specifically, it seemed a bad idea for a library such as ORTE to be aborting the job on its own prerogative. If we lose a connection or cannot send a message, then we really should just report it upwards and let the application and/or upper layers decide what to do about it.

The current code base only allows a single error callback to exist, which seemed unduly limiting. So, based on the conversation, I've modified the errmgr interface to provide a mechanism for registering any number of error handlers (this replaces the current "set_fault_callback" API). When an error occurs, these handlers will be called in order until one responds that the error has been "resolved" - i.e., no further action is required - by returning OMPI_SUCCESS. The default MPI layer error handler is specified to go "last" and calls mpi_abort, so the current "abort" behavior is preserved unless other error handlers are registered.

In the register_callback function, I provide an "order" param so you can specify "this callback must come first" or "this callback must come last". Seemed to me that we will probably have different code areas registering callbacks, and one might require it go first (the default "abort" will always require it go last). So you can append and prepend, or go first. Note that only one registration can declare itself "first" or "last", and since the default "abort" callback automatically takes "last", that one isn't available. :-)

The errhandler callback function passes an opal_pointer_array of structs, each of which contains the name of the proc involved (which can be yourself for internal errors) and the error code. This is a change from the current fault callback which returned an opal_pointer_array of just process names. Rationale is that you might need to see the cause of the error to decide what action to take. I realize that isn't a requirement for remote procs, but remember that we will use the SAME interface to report RTE errors internal to the proc itself. In those cases, you really do need to see the error code. It is legal to pass a NULL for the pointer array (e.g., when reporting an internal failure without error code), so handlers must be prepared for that possibility. If people find that too burdensome, we can remove it.

Should we ever decide to create a separate callback path for internal errors vs remote process failures, or if we decide to do something different based on experience, then we can adjust this API.

This commit was SVN r28852.
2013-07-19 01:08:53 +00:00
Ralph Castain
8a8b4896be Need to protect libgen.h as some systems might not have it
This commit was SVN r28845.
2013-07-18 20:21:37 +00:00
Edgar Gabriel
185e365dad make the sm sharedfp component compile on Mac.
This commit was SVN r28844.
2013-07-18 20:17:14 +00:00
Edgar Gabriel
93cef82873 remove the ylib component from the fcoll framework. It is not used, there are
no plans to use it. We can always recover it from svn if we would ever change
our minds.

This commit was SVN r28840.
2013-07-18 16:18:06 +00:00
Pavel Shamis
68969ba6e5 Removing bogus references in iboffload code.
cmr:v1.7:reviewer=hjelmn

This commit was SVN r28834.
2013-07-17 22:35:24 +00:00
Rolf vandeVaart
49663fb802 Move CUDA-aware configurary to its own file and other minor changes due to review.
This commit was SVN r28832.
2013-07-17 22:12:29 +00:00
Edgar Gabriel
6e8522fec5 infuse life into the shared file pointer framework. For this:
- extend the framework API
 - remove the dummy component, not require anymore
 - add four components to perform the actual job.

This commit was SVN r28828.
2013-07-17 21:55:24 +00:00
Edgar Gabriel
ac694b7056 in preparation for the new shared file pointer components to be committed
soon:
 - add a new abstraction layer to be used internally for some operations
 - add a new mca parameter to control lazy intialization of shared file
 pointer structures

This commit was SVN r28826.
2013-07-17 21:30:50 +00:00
Vishwanath Venkatesan
ce8f8f0829 Changing the MPI Datatype from MPI_LONG to OMPI_OFFSET_DATATYPE for send/recv offsets
This commit was SVN r28822.
2013-07-17 19:16:53 +00:00
Nathan Hjelm
d4c6029cf3 sbgp/ibnet: set mca_sbgp_ibnet_component.mtu to IBV_MTU_1024 before registering it. cmr:v1.7:reviewer=pasha
This commit was SVN r28821.
2013-07-17 19:16:31 +00:00
Rolf vandeVaart
7a45be8bde Fix variable initialization.
This commit was SVN r28819.
2013-07-17 17:37:35 +00:00
Nathan Hjelm
f0aeb36d80 Fix warnings in ob1 introduced by the pvar commit
This commit was SVN r28817.
2013-07-17 03:41:05 +00:00
Rolf vandeVaart
f95c95cf79 Additional cleanup of how libraries and paths are searched.
This commit was SVN r28815.
2013-07-16 18:40:55 +00:00
Nathan Hjelm
e6e9f2c6fd Add profiling function definitions for MPI_T and add a missing type into mpi.h
This commit was SVN r28803.
2013-07-16 16:03:33 +00:00
Nathan Hjelm
35673ea400 Add example performance variables to ob1: unexpected message queue length, posted receive length
This commit was SVN r28801.
2013-07-16 16:02:25 +00:00
Rolf vandeVaart
54b1fbdb4a Better error message code. Remove commented out code.
This commit was SVN r28793.
2013-07-15 22:27:34 +00:00
Rolf vandeVaart
4d2c2bcefe Better error message. Remove a tab.
This commit was SVN r28791.
2013-07-15 19:39:54 +00:00
Mike Dubman
5bd2e15cbb support for ConnectX3-Pro card. cmr:v1.7:reviewer=jsquyres cmr:v1.6:reviewer=jsquyres
This commit was SVN r28787.
2013-07-14 06:44:19 +00:00
Nathan Hjelm
dfca3d4804 fix typos in the ugni and vader btls
This commit was SVN r28772.
2013-07-12 17:55:33 +00:00
Nathan Hjelm
1119cd3e8a Merge branch 'vader_fix'
This commit was SVN r28764.
2013-07-11 23:30:20 +00:00
Brian Barrett
2f19fc52de use the same multi-md workaround the rest of the Portals code is using.
This commit was SVN r28761.
2013-07-11 21:00:11 +00:00
Nathan Hjelm
b5281778b0 btl/vader: improve small message performance
This commit improved the small message latency and bandwidth when using
the vader btl. These improvements should make performance competative
with other MPI implementations.

This commit was SVN r28760.
2013-07-11 20:54:12 +00:00
Brian Barrett
bea54eeeb1 First take at a BTL for Portals 4
This commit was SVN r28759.
2013-07-11 20:47:08 +00:00
Jeff Squyres
baa3182794 Per RFC
(http://www.open-mpi.org/community/lists/devel/2013/07/12534.php),
remove a bunch of dead code.

This commit was SVN r28756.
2013-07-11 17:34:28 +00:00
Rolf vandeVaart
858ef65142 Fix loop limit.
This commit was SVN r28755.
2013-07-11 17:15:43 +00:00
Rolf vandeVaart
5051cd53fd Use new API.
This commit was SVN r28754.
2013-07-11 17:06:14 +00:00
Joshua Ladd
16beaa3878 This fixes the nasty configure.m4 hack that was added long ago and not removed. My fault for not catching earlier. I've also removed the '.ompi_ignore' in coll/hcoll. Throwing this to Nathan for review. Upon successful review, this should be added to cmr:v1.7:reviewer=hjelmn
This commit was SVN r28753.
2013-07-11 09:55:46 +00:00
Jeff Squyres
28dac8010b The hcoll component configure.m4 commits multiple sins, and breaks
many builds.  I am temporarily .ompi_ignore'ing this component until
it can be fixed by its owner.

 * It calls AC_MSG_ERROR, which configure.m4 scripts are ''never''
   supposed to do.  If you don't want to build, then call $2.
 * All static and --disable-dlopen builds are broken; they fall afoul
   of whatever test configure.m4 is doing and therefore error out of
   configure entirely (vs. simply disabling the hcoll component).
 * There appear to be multiple shell scripting errors in the
   configure.m4.  Here's the output of "./configure --disable-dlopen":
{{{
--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... static
checking --with-hcoll value... simple ok (unspecified)
./configure: line 421: test: basic: integer expression expected
configure: error: Can not use coll/hcoll and coll/ml (static build)
   simultaneously. You have two options:
                1. Use static build & disable ml with:
   --enable-mpi-no-build=coll-ml
                2. Use dso build for ML & disable ml at runtime: -mca
   coll self
./configure: line 310: return: basic: numeric argument required
./configure: line 320: exit: basic: numeric argument required
}}}

Finally, all of these configure.m4 errors aside, I don't understand
why there is a ''compile-time'' exclusion between the hcoll and ml
components.  Why isn't this a ''run-time'' decision?  Having what
seems to be an unnecessary compile-time exclusion goes against the
general Open MPI philosophy.

Note: Open MPI 1.7 is also broken in all the same ways.  I suggest
that the RM's .ompi_ignore hcoll over there, too.

Mellanox: please fix.

This commit was SVN r28748.
2013-07-10 16:03:15 +00:00
Jeff Squyres
80145742a3 Fix typo in comment
This commit was SVN r28747.
2013-07-10 15:13:08 +00:00
Jeff Squyres
ea94936531 First cut at assigning some fine-grained "levels" to MCA parameters
for the SM and TCP BTLs, as well as the mca_btl_base_param_register()
function (which registers MCA params for all BTLs).

The guidelines in
https://svn.open-mpi.org/trac/ompi/wiki/MCAParamLevels were used to
pick these levels.

This commit was SVN r28746.
2013-07-10 00:47:52 +00:00
Aurelien Bouteiller
e1066143a4 rename ompi_free_list operations to _mt, as per discussions at last face to face meeting
This commit was SVN r28734.
2013-07-08 22:07:52 +00:00
Brian Barrett
ecbbf888d3 * Update Portals 4 MTL's multi-md code to be a bit cleaner (no if statements
in the path) and not create MDs due to boundary crossing
* Add the same logic to the Coll component

This commit was SVN r28733.
2013-07-08 21:27:37 +00:00
Brian Barrett
84aeb6a6a5 Update request alloc to use free list get instead of free list wait.
This commit was SVN r28729.
2013-07-05 20:24:43 +00:00
George Bosilca
dc9352faf6 Remove some unused variables.
This commit was SVN r28726.
2013-07-05 13:31:54 +00:00
George Bosilca
8b01c3da33 Slightly reorder the code.
This commit was SVN r28725.
2013-07-05 13:29:29 +00:00
George Bosilca
483ed8da8c Remove an unused variable resulting from the removal of the last parameter of
the OMPI_FREE_LIST_GET macro.

This commit was SVN r28723.
2013-07-04 09:19:00 +00:00
George Bosilca
c9e5ab9ed1 Our macros for the OMPI-level free list had one extra argument, a possible return
value to signal that the operation of retrieving the element from the free list
failed. However in this case the returned pointer was set to NULL as well, so the
error code was redundant. Moreover, this was a continuous source of warnings when
the picky mode is on.

The attached parch remove the rc argument from the OMPI_FREE_LIST_GET and
OMPI_FREE_LIST_WAIT macros, and change to check if the item is NULL instead of
using the return code.

This commit was SVN r28722.
2013-07-04 08:34:37 +00:00
Brian Barrett
d3b49535b5 Only allow communication from the same user, since we don't have job-level
protection.

This commit was SVN r28715.
2013-07-03 17:29:02 +00:00
Jeff Squyres
d1ce64f049 Fix some "malloc of 0 bytes" warnings
This commit was SVN r28713.
2013-07-03 12:05:33 +00:00
Brian Barrett
81efd0e3cf Properly shut down Portals collective component
This commit was SVN r28707.
2013-07-02 22:07:27 +00:00
Brian Barrett
133dafd3dc First take at Barrier and Ibarrier, both of which seem to work.
This commit was SVN r28706.
2013-07-02 21:42:10 +00:00
Brian Barrett
c4577723ed fix misuse of param api
This commit was SVN r28705.
2013-07-02 21:41:42 +00:00
Brian Barrett
c9a8217af6 Portals 4 doesn't have a BTL, need to default to MTL, rather than finding some stupid slow BTL. THis selection logic sucks.
This commit was SVN r28704.
2013-07-02 21:18:04 +00:00
Brian Barrett
e4698f5cd4 Shell of the Portals 4 collectives componetn
This commit was SVN r28703.
2013-07-02 15:23:55 +00:00
Joshua Ladd
5d2d5e958c Deleting garbage I accidentally committed. Thanks, Nathan\!
This commit was SVN r28698.
2013-07-01 22:50:54 +00:00
Joshua Ladd
d7a50343bf Per the details and schedule outlined in the attached RFC, Mellanox Technologies would like to CMR the new 'coll/hcoll' component. This component enables Mellanox Technologies' latest HPC middleware offering - 'Hcoll'. 'Hcoll' is a high-performance, standalone collectives library with support for truly asynchronous, non-blocking, hierarchical collectives via hardware offload on supporting Mellanox HCAs (ConnectX-3 and above.) To build the component, libhcoll must first be installed on your system, then you must configure OMPI with the configure flag: '--with-hcoll=/path/to/libhcoll'. Subsequent to installing, you may select the 'coll/hcoll' component at runtime as you would any other coll component, e.g. '-mca coll hcoll,tuned,libnbc'. This has been reviewed by Josh Ladd and should be added to cmr:v1.7:reviewer=jladd
This commit was SVN r28694.
2013-07-01 22:39:43 +00:00
George Bosilca
ae190246df Oops, thanks Jeff for noticing.
This commit was SVN r28693.
2013-07-01 17:51:52 +00:00
George Bosilca
e665cda6c2 Add the empty basic component where the function pointer from the
base will be copied over. Without such a decoy component the
entire framework will not function correctly.

This commit was SVN r28692.
2013-07-01 17:47:44 +00:00
George Bosilca
dc1e68c3c1 Remove the item from the list before releasing it.
This commit was SVN r28691.
2013-07-01 16:54:48 +00:00
George Bosilca
702e669636 Remove a [very] annoying warning.
This commit was SVN r28690.
2013-07-01 16:49:13 +00:00
George Bosilca
5fae72b9aa Add the MPI 2.2 MPI_Dist_graph functionality.
This patch reshape the way we deal with topologies completely. Where
our topologies were mainly storage components (they were not capable
of creating the new communicator), the new version is built around a
[possibly] common representation (in mca/topo/topo.h), but the functions
to attach and retrieve the topological information are specific to each
component. As a result the ompi_create_cart and ompi_create_graph functions
become useless and have been removed.

In addition to adding the internal infrastructure to manage the topology
information, it updates the MPI interface, and the debuggers support and
provides all Fortran interfaces.

This commit was SVN r28687.
2013-07-01 12:40:08 +00:00
George Bosilca
b82abf6bef Silence a compiler warning.
This commit was SVN r28686.
2013-07-01 11:40:42 +00:00
Rolf vandeVaart
adda653fc1 Fix two bugs from previous commit.
This commit was SVN r28684.
2013-06-28 16:32:51 +00:00
Rolf vandeVaart
850d325f32 Adjust how search is done for dynamic load of library. CUDA only.
This commit was SVN r28683.
2013-06-27 22:13:25 +00:00
Jeff Squyres
e3d0782788 Move the assignment after the bozo check.
This commit was SVN r28669.
2013-06-22 12:38:32 +00:00
Rolf vandeVaart
5ebb74bee3 Fix case where amount of data sent is less than expected. Otherwise, we will get hang when running the RGET protocol.
Reviewed by hjelm,bosilca.

This commit was SVN r28667.
2013-06-21 18:35:16 +00:00
Joshua Ladd
0b5c1f2ea8 Add 'generic' support for PMI2 (previously, we checked for PMI2 only on Cray systems.) If your resource manager (e.g. SLURM) has support for PMI2, then the --with-pmi configure flag will enable its usage. If you don't have PMI2, then you will fallback to regular old PMI1. This patch was submitted by Ralph Castain and reviewed and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd
This commit was SVN r28666.
2013-06-21 15:28:14 +00:00
Mike Dubman
d1c82994be fix: detect threading model to take appropriate flow in mxm
This commit was SVN r28648.
2013-06-16 08:40:06 +00:00
Jeff Squyres
a0b27f5b28 Better comment than what was submitted in r28614.
This commit was SVN r28631.

The following SVN revision numbers were found above:
  r28614 --> open-mpi/ompi@9556310bd0
2013-06-13 20:52:44 +00:00
Mike Dubman
9556310bd0 cosmetic: add comment with rationale for malloc.h include
This commit was SVN r28614.
2013-06-12 05:58:32 +00:00
Nathan Hjelm
9b1f32bf12 BTL: add flags for signaled BTL operations
As per discussion in the June 2013 developer meeting these
flags will be used by the PML in the future to request
asynchronous progress on an operation. The naming was chosen
to reflect that a BTL supports this mode (MCA_BTL_FLAG_SIGNALED)
and that a descriptor should "signal" the remote side to wake
up and progress the message (MCA_BTL_DES_FLAG_SIGNAL).

Future commits will update OB1 to take advantage of this
feature when performing the RDMA get or RDMA rendezvous
protocols.

This commit was SVN r28612.
2013-06-11 21:52:20 +00:00
Mike Dubman
d18b3ae1a7 fix malloc deprication error with gcc 4.6.3 on ubuntu/fedora
This commit was SVN r28605.
2013-06-09 18:13:16 +00:00
George Bosilca
d789423d34 Typo.
This commit was SVN r28603.
2013-06-08 10:44:02 +00:00
Vishwanath Venkatesan
0b727f84da Avoid malloc of zero bytes, add a check and avoid it.
This commit was SVN r28597.
2013-06-06 14:08:57 +00:00
Edgar Gabriel
2d4655a05a Logic has been revised compared to the previous implementation.
This commit was SVN r28594.
2013-06-05 23:47:42 +00:00
Edgar Gabriel
03c1db7a3a fix the calculation of the UNIFORM flag.
This commit was SVN r28593.
2013-06-05 23:18:50 +00:00
Vishwanath Venkatesan
7d6a05982a Removing the gather_array based on the flag UNIFORM FVIEW for read all operations (dynamic/static),
+ Disabling Timing data extraction by default in dynamic write all

This commit was SVN r28592.
2013-06-05 21:35:37 +00:00
Vishwanath Venkatesan
55878674d7 1. Removing the allgather_array based on the flag UNIFORM FVIEW. This is not really and optimization.
2. Fixing some of the debug printf's these are outdated.

This commit was SVN r28591.
2013-06-05 21:30:15 +00:00
Jeff Squyres
713e3aa3db Refs trac:3626: that ticket specifically refers to the v1.6 branch; this
commit is the trunk version of what is needed for #3626.

Add the "ignore_device" field to the INI file.  This allows us to
specifically list devices that should be ignored by the openib BTL
(such as the Intel Phi, at least as of May 2013 -- see #3626).  

Also add the Intel Phi to the ini file, and set its ignore_device=1.

Finally, add the concept of counting intentionally ignored verbs
devices.  Devices are ignored for one of two reasons:

 * If the number of allowed ports on that device is 0 (i.e., if
   if_include/if_exclude was set such that we're intentionally
   ignoring this device).
 * If the INI ignore_device field for this device is set to 1.

Once we have the count of devices that were intentionally ignored,
only show the "Hey, there's verbs devices that you're not using!"
show_help message if there are devices that were ''unintentionally''
ignored.

This commit was SVN r28589.

The following Trac tickets were found above:
  Ticket 3626 --> https://svn.open-mpi.org/trac/ompi/ticket/3626
2013-06-05 12:12:09 +00:00
Jeff Squyres
3019b7a3f8 Oops! Remove duplicate registration.
This commit was SVN r28588.
2013-06-05 11:55:19 +00:00
Jeff Squyres
1de00b17ad Properly check the return status from registering the MCA params.
This commit was SVN r28587.
2013-06-05 11:53:18 +00:00
Jeff Squyres
d692aba672 Remove the DR PML. It was abondoned long ago. It had a nice life,
a few papers, and now a decent demise with respect.  

This commit was SVN r28582.
2013-06-04 19:36:16 +00:00
Edgar Gabriel
87b3782b7f arghh, copy-and-paste error, status->_ucount has to be set to 0 not max_data for count=0.
This commit was SVN r28576.
2013-05-30 22:00:29 +00:00
Edgar Gabriel
9daec82f17 - make a fileview of 0 bytes work in ompio
- fixes the bug reported in ticket 3619 (which is already closed) also for ompio

This commit was SVN r28575.
2013-05-30 21:33:13 +00:00
Rolf vandeVaart
3d1d158a80 Do not abort in BTL. Rather, callback into PML error function. Thanks George for review.
This commit was SVN r28559.
2013-05-23 18:45:23 +00:00
Nathan Hjelm
721779d7ab Per RFC: remove old MCA parameter system.
This commit was SVN r28541.
2013-05-20 15:36:13 +00:00
Ralph Castain
889bf60c64 Fix bad merge
This commit was SVN r28540.
2013-05-18 01:29:55 +00:00
Jeff Squyres
089c632cce Remove a bunch of dead code: gcc 4.7 warns of set-but-unused
variables.  So get rid of them.

This commit was SVN r28538.
2013-05-17 21:45:49 +00:00
Edgar Gabriel
1b1051da6c fix a bug in the calculation of the explicit offset. Use the opportunity to
clean up the code a bit.

This commit was SVN r28537.
2013-05-17 20:22:00 +00:00
Ralph Castain
3e6e1046a3 fix a correctness issue by returning an error if waitall fails and invoking the mpi error handler
cmr:v1.7.2:reviewer=jsquyres

This commit was SVN r28533.
2013-05-16 15:04:37 +00:00
Rolf vandeVaart
91fdb423d7 Fix warning in CUDA-aware code.
This commit was SVN r28511.
2013-05-14 21:04:15 +00:00
Rolf vandeVaart
52ebb0b17f Change some opal_output to OPAL_OUTPUT per CMR review.
This commit was SVN r28510.
2013-05-14 20:49:42 +00:00
Nathan Hjelm
32a8ff5255 btl/openib: bump up udcm priority
This commit was SVN r28505.
2013-05-14 20:02:40 +00:00
Edgar Gabriel
d5cae9aced - fix the mca stripe size and stripe depth parameter logic in the pvfs2
component
- correctly recognize and handle the corresponding info objects.

This commit was SVN r28497.
2013-05-14 16:11:39 +00:00
Yossi Etigin
64d98e0438 Fix data corruption in MXM by registering to OPAL memory release hooks and removing any mappings created by mxm
This commit was SVN r28489.
2013-05-14 12:27:44 +00:00
Rolf vandeVaart
9d569f1487 Fix warning when compiling in CUDA aware code.
This commit was SVN r28476.
2013-05-10 21:29:08 +00:00
Nathan Hjelm
422331b4da btl/openib: fix unconnected datagram connection method (udcm)
The primary issue with udcm is that the immediate data in message
acks were often bogus. This caused the sender to keep trying even
though a message was received and acked. The fix is to use the
source LID and QP to determine which message is being acked. In
most cases this should work well since only one message will be
in flight to any peer.

This commit was SVN r28444.
2013-05-03 17:11:38 +00:00
Jeff Squyres
c8258c06e2 In coll_sm, we alloc a huge chunk of shared memory, divvy it into lots
of individual regions (each region is a multiple of page size in
length), and each process claims its own regions by binding it to its
local memory.  Each process would end up membining something like 16
individual regions in the overall shmem segment.

There were two errors in this code relating to the memory affinity
pinning.  Some combination of these two errors would lead to kernel
panics (!) on my RHEL 6.2 x86_64 machines when used with mmap'ed
shared memory (not posix or sysv shared memory, curiously enough):

1. The shared memory segment is initially divided into two regions:
control and data.  The control starts at the beginning of the shmem
segment, the data starts after that.  The data portion, unfortunately,
was ''not'' aligned to a page.  So all the multiple-of-page-size
regions that we divvy up were also not alined on page boundaries.  And
therefore all the regions we tried to membind were not on page
boundaries.

The solution was to ensure that the data portion started on a page
boundary.  Then all of the individual regions were on page boundaries,
too.

That being said, in my tests, Linux mbind() fails gracefully when the
address is not on a page boundary.  So I'm not sure how this worked at
all / led to a kernel panic...

2. There was some bad pointer math that resulted in membinding regions
larger than they should have been, resulting in region overlaps.
There were definitely overlaps between regions in the same process;
it's likely that there were overlaps between regions of multiple
processes, too -- I'm not sure (and don't care to figure out :-) ).

The solution was to fix the pointer math so that each region membinds
exactly only itself and no neighboring/overlapping regions.

cmr:v1.7.2:reviewer=samuel

This commit was SVN r28442.
2013-05-03 12:49:35 +00:00
Alex Mikheev
9e2fdc7d56 - correction of r28440
This commit was SVN r28441.

The following SVN revision numbers were found above:
  r28440 --> open-mpi/ompi@93ce233530
2013-05-02 12:52:58 +00:00
Alex Mikheev
93ce233530 - btl_openib: changed default SRQ settings:
- increase number of wqe to minimize number of RNRs
    - it is better to have high watermark and post relatively small number of wqes
    - increased TX queue size

This commit was SVN r28440.
2013-05-02 12:46:35 +00:00
Alex Mikheev
f76680fbd0 - btl_openib: fix total registered memory calculation for ConnectIB and Ofed 2.0
This commit was SVN r28432.
2013-05-01 13:39:29 +00:00
Jeff Squyres
d92a8e01f8 Use the _SAFE list traversal macro so that we can remove each item
from the list (just for good measure), and then free() it (without
using _SAFE, we were accessing memory that was just free()'d to get to
the next item).  Also be a little more thorough -- DESTRUCT the list
when we're all done.

This commit was SVN r28429.
2013-05-01 12:26:16 +00:00
George Bosilca
8b0335380a Fix the error messages to reference the correct function.
This commit was SVN r28425.
2013-04-30 23:26:03 +00:00
George Bosilca
6a75c84fa8 Remove useless define.
This commit was SVN r28424.
2013-04-30 23:24:59 +00:00
Ralph Castain
9de82aba55 Revert r28417 - given the non-standard way vprotocol is implemented, I see no way to use the framework verbosity here. Best to just leave it alone as those who use it know what they need to do to get debug output
This commit was SVN r28418.

The following SVN revision numbers were found above:
  r28417 --> open-mpi/ompi@b00de5be8b
2013-04-30 16:37:17 +00:00
Nathan Hjelm
b00de5be8b vprotocol: remove the old output and use the framework output
This commit was SVN r28417.
2013-04-30 15:21:42 +00:00
Ralph Castain
ceb4061214 Fix BTL_VERBOSE - when the MCA param change was committed, it left the base verbosity variable declared so things compiled. Sadly, the verbosity was now being set to a new variable, so debug never was output.
This commit was SVN r28414.
2013-04-30 01:15:52 +00:00
Nathan Hjelm
f384263de7 btl/openib: fix typo
This commit was SVN r28413.
2013-04-29 22:21:25 +00:00
Ralph Castain
5d7a93c032 Add the ability to use an external version of libevent. Clearly not recommended at this time. I've verified that it works in limited scenarios, but more thorough testing and performance impacts need to be assessed.
Interesting how many includes had to be fixed here and there to fill in missing dependencies :-)

This commit was SVN r28411.
2013-04-29 17:02:37 +00:00
Ralph Castain
8996ecb128 Add missing include
This commit was SVN r28405.
2013-04-27 00:09:36 +00:00
Jeff Squyres
f55cea1a5b If there are no BTLs, do ''not'' actually shut down the fd listener,
because a) it may still be needed to shut down the CPCs, and b) it
will be shut down during component_close().

This commit was SVN r28402.
2013-04-26 15:31:50 +00:00
Jeff Squyres
99b7a0f20d Remove unused variables.
This commit was SVN r28401.
2013-04-26 15:29:42 +00:00
Vishwanath Venkatesan
c902624b59 Using ompi_type_destroy to free ompi_datatype. This had to be updated in all the collective algorithms.
Hopefully this will fix all warnings.

This commit was SVN r28385.
2013-04-24 19:27:26 +00:00
Nathan Hjelm
2edff7f784 btl/openib: don't free string handle by MCA variable system
This commit was SVN r28383.
2013-04-24 18:59:18 +00:00
Alex Margolin
aebd794bf6 Fixed macro definition order in MXM component headers
This commit was SVN r28378.
2013-04-24 16:51:43 +00:00
Vishwanath Venkatesan
bba4a93f63 Got this wrong while replacing MPI function with OMPI functions. Fixed it now.
This commit was SVN r28350.
2013-04-22 19:58:25 +00:00
Rolf vandeVaart
5e1dde419c Fix some compile errors in CUDA-aware code that has crept in.
This commit was SVN r28346.
2013-04-18 15:34:16 +00:00
Vishwanath Venkatesan
53753622d4 Changing some of the MPI_ functions to ompi_ equivalents.
This commit was SVN r28342.
2013-04-17 21:06:36 +00:00
Alex Margolin
0ab7675019 Fix MXM connection establishment flow
This commit was SVN r28329.
2013-04-12 16:37:42 +00:00
Steve Wise
134baaf2fa Add Chelsio T5 device. This fixes trac:3552 and should be added to cmr:v1.6:reviewer=jsquyres and cmr:v1.7:reviewer=jsquyres
This commit was SVN r28327.

The following Trac tickets were found above:
  Ticket 3552 --> https://svn.open-mpi.org/trac/ompi/ticket/3552
2013-04-11 19:30:53 +00:00
George Bosilca
2d33c9ee39 Stop complaining about an overwritten default parameter.
This commit was SVN r28322.
2013-04-10 19:44:37 +00:00
Jeff Squyres
8405975bf6 Be a little more conservative about initializing devices and modules
(i.e., ensure that more data items get zeroed out/set to NULL) so that
if something goes wrong during initialization, we don't try to clean
up something that isn't there (and segv).

The chance of this happening on the trunk is very low (and will also
be low once the verbs improvements are brought over to v1.7).  But it
can actually happen in the v1.6 branch (e.g., if no CPC is available,
we'll try to get the length of the endpoints list, but the endpoints
list is NULL).  

Hence, even though the real goal is to get this functionality over to
v1.6, I figured I'd commit to the trunk/CMR to v1.7 just to try to
keep commonality in the openib between all three where possible.

This commit was SVN r28317.
2013-04-09 21:55:31 +00:00
Ralph Castain
45af6cf59e The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem.
This commit was SVN r28308.
2013-04-08 23:34:16 +00:00
Nathan Hjelm
4e95d691a7 pml/ob1: do not reset the convertor if one was not created (size = 0).
This macro is only used on the failure path so the additional if statement
should not have any affect on performance.

cmr:v1.7

This commit was SVN r28292.
2013-04-05 01:40:11 +00:00
Pavel Shamis
fed6e60131 Fixing OpenIB BTL compilation failure for a cases when
BTL_OPENIB_MALLOC_HOOKS_ENABLED is disabled.

This commit was SVN r28290.
2013-04-04 20:17:18 +00:00
Pavel Shamis
aa1f5697b4 In order to prevent name conflicts in XRC (MOFED) enabled mode
OFACM's ib_address_t was renamed to ofacm_ib_address_t

This commit was SVN r28289.
2013-04-04 20:02:17 +00:00
Nathan Hjelm
e8d9944456 sbgp/ibnet: fix param -> var update errors
This commit was SVN r28284.
2013-04-03 20:17:18 +00:00
Nathan Hjelm
75093155ab bcol/iboffload: fix still more errors from param -> var updates
This commit was SVN r28283.
2013-04-03 19:57:03 +00:00
Nathan Hjelm
47a1897710 bcol/iboffload: fix more errors from param -> var updates
This commit was SVN r28281.
2013-04-03 18:55:46 +00:00
Nathan Hjelm
31a498c2a1 bcol/iboffload: fix errors from param -> var updates
This commit was SVN r28280.
2013-04-03 18:33:19 +00:00
Ralph Castain
66f3a81488 Cleanup warnings found when building v1.7
cmr:v1.7

This commit was SVN r28279.
2013-04-03 17:37:02 +00:00
Vishwanath Venkatesan
74c418b860 Adding typecasting with intptr_t to remove warnings.
This commit was SVN r28278.
2013-04-03 17:07:43 +00:00
Vishwanath Venkatesan
784337aab1 typecasting with intptr_t to remove warnings
This commit was SVN r28276.
2013-04-03 17:06:02 +00:00
Jeff Squyres
64d39a4e97 Technically speaking, we're creating a QP with 1 send WQE and 1
receive WQE, so it's good form to have a CQ with 2 entries, not 1.

This commit was SVN r28256.
2013-03-28 13:11:31 +00:00
George Bosilca
9c6374b515 Swap the open and register.
This commit was SVN r28253.
2013-03-27 22:19:57 +00:00
Nathan Hjelm
f1fa290157 btl/vader: add missing return statement
This commit was SVN r28252.
2013-03-27 22:16:21 +00:00
Nathan Hjelm
113fadd749 btl/vader: do not use common/sm for shared memory fragments
This commit was SVN r28250.
2013-03-27 22:10:02 +00:00
Nathan Hjelm
9d4a26f47d Update OMPI frameworks to use the MCA framework system.
Notes:
  - This commit also eliminates the need for an available components list in use
    in several frameworks. None of the code in question was making use of the
    priority field of the priority component list item so these extra lists were
    removed.
  - Cleaned up selection code in several frameworks to sort lists using opal_list_sort.
  - Cleans up the ompi/orte-info functions. Expose the functions that construct the
    list of params so they can be used elsewhere.

patches for mtl/portals4 from brian

missed a few output variables in openib

This commit was SVN r28241.
2013-03-27 21:17:31 +00:00
Nathan Hjelm
c041156f60 Update ORTE frameworks to use the MCA framework system.
This commit was SVN r28240.
2013-03-27 21:14:43 +00:00
Nathan Hjelm
cf377db823 MCA/base: Add new MCA variable system
Features:
 - Support for an override parameter file (openmpi-mca-param-override.conf).
   Variable values in this file can not be overridden by any file or environment
   value.
 - Support for boolean, unsigned, and unsigned long long variables.
 - Support for true/false values.
 - Support for enumerations on integer variables.
 - Support for MPIT scope, verbosity, and binding.
 - Support for command line source.
 - Support for setting variable source via the environment using
   OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
 - Cleaner API.
 - Support for variable groups (equivalent to MPIT categories).

Notes:
 - Variables must be created with a backing store (char **, int *, or bool *)
   that must live at least as long as the variable.
 - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
   mca_base_var_set_value() to change the value.
 - String values are duplicated when the variable is registered. It is up to
   the caller to free the original value if necessary. The new value will be
   freed by the mca_base_var system and must not be freed by the user.
 - Variables with constant scope may not be settable.
 - Variable groups (and all associated variables) are deregistered when the
   component is closed or the component repository item is freed. This
   prevents a segmentation fault from accessing a variable after its component
   is unloaded.
 - After some discussion we decided we should remove the automatic registration
   of component priority variables. Few component actually made use of this
   feature.
 - The enumerator interface was updated to be general enough to handle
   future uses of the interface.
 - The code to generate ompi_info output has been moved into the MCA variable
   system. See mca_base_var_dump().

opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system

This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.

This commit was SVN r28236.
2013-03-27 21:09:41 +00:00
Ralph Castain
317915225c Finish the binding cleanup by removing the no-longer-used binding level scheme. This proved to be fallible as there is no guarantee that the hierarchy it used matched physical reality of the machine (e.g., is L3 "above" the socket or not). Still have to complete the ppr update, but get the rest of it correct.
This commit was SVN r28223.
2013-03-26 20:09:49 +00:00
Jeff Squyres
44e371a65d Remove (bogus) port number from the opal_output -- there's no port
number associated with creating a QP.

This commit was SVN r28222.
2013-03-26 19:48:50 +00:00
Vishwanath Venkatesan
e092cc34e0 Fixing the read all bugs discovered by Coverity
This commit was SVN r28189.
2013-03-20 20:27:09 +00:00
Samuel Gutierrez
8ce2041102 Cleanup in error path. Fixes CID 967211. Thanks, Jeff.
This commit was SVN r28183.
2013-03-19 20:00:08 +00:00
Jeff Squyres
2513122d31 Remove extraneous semicolon.
This commit was SVN r28180.
2013-03-18 23:58:11 +00:00
Jeff Squyres
7ac02fb9d4 Two fixes for the ROMIO io module:
* Don't call PMPI_* anything from our module code; that's terribly
   bad form (and disallowed!).  Instead, do the proper back-end stuff
   to reset the error handler on the file handle.
 * If we've already started to MPI_Finalize, then just give up and
   don't actually perform all the file closing actions (because
   ROMIO's file close calls MPI_Barrier, which will obviously fail if
   MPI_Finalize has already been invoked).  Bad user behavior should
   be punished (by leaking resources, not closing the file properly,
   etc.).

This commit was SVN r28177.
2013-03-18 20:11:20 +00:00
Vasily Filipov
7bda23dd84 SBGP, BCOL: add missing "show_help.h" includes.
This commit was SVN r28163.
2013-03-10 09:11:09 +00:00
Brian Barrett
65109de931 Fix leak of comm and datatype references for mprobe/improbe and fix a request leak in improbe
This commit was SVN r28157.
2013-03-07 21:55:22 +00:00
Brian Barrett
db858827df Fill in more of the process info structure when using PMI
This commit was SVN r28152.
2013-03-06 19:32:47 +00:00
Brian Barrett
a67d768ee4 quick hack to get things compiling again. Still need to fill in the fixme parts. sigh.
This commit was SVN r28150.
2013-03-06 18:33:25 +00:00
Nathan Hjelm
3c5cd95087 mtl/psm: add missing header for opal_show_help (one more)
This commit was SVN r28147.
2013-03-05 00:18:51 +00:00
Nathan Hjelm
25d0d97d6b mtl/psm: add missing header for opal_show_help
This commit was SVN r28146.
2013-03-05 00:17:48 +00:00
Nathan Hjelm
213cb79fab mtl/psm: add missing header for opal_show_help
This commit was SVN r28145.
2013-03-05 00:15:11 +00:00
Rolf vandeVaart
037729dcbb Add a search path. Refactor code.
This commit was SVN r28142.
2013-03-01 21:50:56 +00:00
Rolf vandeVaart
5c761d701d Remove tabs for spaces, fix some error messages.
This commit was SVN r28141.
2013-03-01 19:13:06 +00:00
Rolf vandeVaart
ebe63118ac Remove dependency on libcuda.so when building in CUDA-aware support. Dynamically load it if needed.
This commit was SVN r28140.
2013-03-01 13:21:52 +00:00
Ralph Castain
a4b6fb241f Remove all remaining vestiges of the Windows integration
This commit was SVN r28137.
2013-02-28 17:31:47 +00:00
Nathan Hjelm
b5a2cd1cce remove csum pml
This commit was SVN r28133.
2013-02-28 00:17:56 +00:00
Brian Barrett
1370d4569a workaround for case when MD can't span all of memory (sigh)
This commit was SVN r28132.
2013-02-27 17:02:45 +00:00
Vasily Filipov
f897c8a1e0 MTL MXM: STREAM supporting for isend and irecv.
This commit was SVN r28122.
2013-02-27 13:21:30 +00:00
Ralph Castain
8d2fa3693b First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package.
This commit was SVN r28116.
2013-02-26 20:44:56 +00:00
Ralph Castain
bd9265c560 Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data.
A few changes were required to support this move:

1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution).

2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time.

3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you seperately specify the "store" and "fetch" ordering.

4. the ORTE grpcomm and ess/pmi components, and the nidmap code,  were updated to work with the new db framework and to specify internal/global storage options.

No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework.

This commit was SVN r28112.
2013-02-26 17:50:04 +00:00
Ralph Castain
70a28c8a27 Now that we are using local ranks in OMPI, we need to define an ompi_local_rank_t and equate it to orte_local_rank_t. Change the sm btl to use the correct abstraction.
This commit was SVN r28098.
2013-02-22 17:48:53 +00:00
Samuel Gutierrez
af5ed9b25c OMPI_NODE_RANK_INVALID ==> OMPI_LOCAL_RANK_INVALID
This commit was SVN r28096.
2013-02-21 18:28:07 +00:00
Samuel Gutierrez
4bf0134901 Remove debug.
This commit was SVN r28095.
2013-02-21 18:21:22 +00:00
Samuel Gutierrez
b7791963f2 Fix sm BTL initialization for MPI_Comm_spawn and friends. Thanks to Jeff for
finding the issue.

This commit was SVN r28094.
2013-02-21 18:19:46 +00:00
Nathan Hjelm
55cf850eca Add comment about r28083
This commit was SVN r28084.

The following SVN revision numbers were found above:
  r28083 --> open-mpi/ompi@5411e28c00
2013-02-20 21:42:13 +00:00
Nathan Hjelm
5411e28c00 btl/openib: don't align fragments on 2 byte boundaries (changed to 8)
cmr:v1.6,v1.7

This commit was SVN r28083.
2013-02-20 21:27:01 +00:00
Rolf vandeVaart
da3e9ff906 Add show_help.h where needed.
This commit was SVN r28071.
2013-02-19 15:42:09 +00:00
Brian Barrett
3c83618799 fix a missing header file issue with IB
This commit was SVN r28070.
2013-02-18 18:29:14 +00:00
Vasily Filipov
52a9241859 MTL MXM: adapt to mxm 2.0 api changes - flags are only for send requests, and SYNC is part of the opcode.
This commit was SVN r28069.
2013-02-17 10:04:19 +00:00
Vasily Filipov
8270d8f52a MTL MXM: "#include "opal/util/show_help.h" adding.
This commit was SVN r28068.
2013-02-17 09:51:03 +00:00
Ralph Castain
ebad55b933 Apply patches from ORNL to fix compile issues - minor stuff. Thanks to Geoffroy Vallee for the patches.
This commit was SVN r28065.
2013-02-15 22:14:23 +00:00
Jeff Squyres
bbddd6ea03 Add header file for opal_show_help().
This commit was SVN r28056.
2013-02-13 16:31:59 +00:00
Brian Barrett
312f37706e In talking about this with Jeff and Ralph, we don't actually need
ompi_show_help, because opal_show_help is replaced with an 
aggregating version when using ORTE, so there's no reason to
directly call orte_show_help.

This commit was SVN r28051.
2013-02-12 21:10:11 +00:00
Joshua Ladd
70ad711337 Backing out the Open SHMEM project
This commit was SVN r28050.
2013-02-12 17:45:27 +00:00
Mike Dubman
ff384daab4 Added new project: oshmem.
This commit was SVN r28048.
2013-02-12 15:33:21 +00:00
Mike Dubman
55cb00f8a3 Remove references to unexisting files:
ompi/mca/common/netpatterns/
    ompi/mca/common/commpatterns/

This commit was SVN r28044.
2013-02-12 13:21:47 +00:00
Pavel Shamis
a31bc57849 Moving mca/common/netpatterns and commpaterns to ompi/patterns.
This commit was SVN r28035.
2013-02-05 21:52:55 +00:00
Brian Barrett
d80218996f Rather than setting up the direct call stuff in ompi_mca (which requires
modifying ompi_mca for every interface that is direct called), do it in
the framework's .m4 file.

This commit was SVN r28031.
2013-02-04 23:26:42 +00:00
Vasily Filipov
21b170b43b MTL MXM: push commit r27987 back, now with right user.
r27987 - MTL MXM: ver. 2.0 interface changes.

This commit was SVN r28026.

The following SVN revision numbers were found above:
  r27987 --> open-mpi/ompi@2735658d81
2013-02-04 06:59:24 +00:00
Vasily Filipov
aa5e436479 Revert revesion -r27986, the reason is - it was submitted with wrong user name.
This commit was SVN r28025.

The following SVN revision numbers were found above:
  r27986 --> open-mpi/ompi@729caaf0cd
2013-02-04 06:54:24 +00:00
Jeff Squyres
c8dc1905f0 Fixes trac:3494: If we get 0 bytes back for the ACK, it doesn't
necessarily mean an error -- it could (and usually does) mean that the
peer realized that we both initiated a connect at the same time, and
therefore it decided to hang up.

I also added a friendly show_help error message for other cases where
recv_blocking() fails (i.e., "Something went wrong. Kaboom! Your job
will abort...").

This commit was SVN r28023.

The following Trac tickets were found above:
  Ticket 3494 --> https://svn.open-mpi.org/trac/ompi/ticket/3494
2013-02-02 01:19:03 +00:00
Jeff Squyres
f05b7aa6d8 As the help message states, it's not an ''error'' if the specified
interface is not found.  It should just be skipped.

This commit was SVN r28016.
2013-02-01 20:17:43 +00:00
Ralph Castain
afb0db5b6f Okay, Jeff - just for you...flow the show help thru the orte functions so help messages will be aggregated
This commit was SVN r28007.
2013-02-01 00:35:48 +00:00
Ralph Castain
e6555408f4 When we say abort, we mean ABORT!! Actually implement the ompi_rte_abort and ompi_rte_show_help functions in the ORTE module.
This commit was SVN r28004.
2013-01-31 23:12:11 +00:00
Igor Usarov
8d80af6c10 Support FCA v3.0
This commit was SVN r27988.
2013-01-31 11:14:27 +00:00
Pavel Shamis
2735658d81 MTL MXM: ver. 2.0 interface changes.
This commit was SVN r27987.
2013-01-31 08:38:08 +00:00
Rolf vandeVaart
729caaf0cd Remove any dependency on libcuda.so in opal layer. All changes are within OMPI_CUDA_SUPPORT code.
This commit was SVN r27986.
2013-01-30 23:07:32 +00:00
Rolf vandeVaart
aa04de4f1e Add run-time parameter to enable and disable CUDA GPU support.
This commit was SVN r27970.
2013-01-29 20:24:04 +00:00
Rolf vandeVaart
de5b7f5c6a Add mpool_base_verbose parameter. All the other base components appear to have this and it can help with debug.
This commit was SVN r27968.
2013-01-29 17:52:18 +00:00
Brian Barrett
49b2b5bf4f Fix double-install issue when --with-devel-headers is used
This commit was SVN r27967.
2013-01-29 17:23:18 +00:00
Brian Barrett
b8442ba505 Revamp the handling of wrapper compiler flags. The user flags, main configure
flags, and mca flags are kept seperate until the very end.  The main configure
wrapper flags should now be modified by using the OPAL_WRAPPER_FLAGS_ADD
macro.  MCA components should either let <framework>_<component>_{LIBS,LDFLAGS}
be copied over OR set <framework>_<component>_WRAPPER_EXTRA_{LIBS,LDFLAGS}.
The situations in which WRAPPER CPPFLAGS can be set by MCA components was
made very small to match the one use case where it makes sense.

This commit was SVN r27950.
2013-01-29 00:00:43 +00:00
Rolf vandeVaart
b5672927f2 Fix build issue when building with --disable-dlopen.
This commit was SVN r27945.
2013-01-28 20:14:59 +00:00
Rolf vandeVaart
c6412f6dff Add new rte headers in files that need them.
This commit was SVN r27943.
2013-01-28 19:32:33 +00:00
Pavel Shamis
1f1e1efb7b Removing leftovers of old infrastructure.
cmr:v1.7

This commit was SVN r27942.
2013-01-28 19:11:42 +00:00
Vishwanath Venkatesan
5be992f445 The pointer to the structure was also never allocated before retrieving
the stripe size. Fixing that too.

This commit was SVN r27941.
2013-01-28 07:21:22 +00:00
Vishwanath Venkatesan
817f6cd868 To remove the warning due to uninitialized variable.
This commit was SVN r27940.
2013-01-28 06:55:46 +00:00
George Bosilca
4defdea9f2 The shortest lifespan for a BTL.
This commit was SVN r27939.
2013-01-28 03:43:23 +00:00
George Bosilca
1b7dff3f2f A copy for posterity of the Open MPI Sicortex BTL.
This commit was SVN r27938.
2013-01-28 03:42:52 +00:00
Brian Barrett
f42783ae1a Move the RTE framework change into the trunk. With this change, all non-CR
runtime code goes through one of the rte, dpm, or pubsub frameworks.

This commit was SVN r27934.
2013-01-27 23:25:10 +00:00
Brian Barrett
14f4aa1198 Fix memory leak in nbc init
This commit was SVN r27884.
2013-01-21 22:45:59 +00:00
Brian Barrett
407714a85a Fix a memory leak in the RDMA one-sided component. Thanks to Victor Vysotskiy
for letting us know about this one.

This commit was SVN r27883.
2013-01-21 22:45:37 +00:00
George Bosilca
42753b4690 Make the TCP BTL really fail-safe. It now trigger the error callback on
all pending fragments when the destination goes down. This allows the PML
to recalibrate its behavior, either find an alternate route or just give up.

This commit was SVN r27881.
2013-01-21 11:41:08 +00:00
George Bosilca
d2281cc672 Remove the CMA related warnings.
This commit was SVN r27872.
2013-01-19 14:26:43 +00:00
Rolf vandeVaart
f63c88701f Improve CUDA GPU transfers over openib BTL. Use aynchronous copies.
This is RFC that was submitted in July and December of 2012.

This commit was SVN r27862.
2013-01-17 22:34:43 +00:00
Rolf vandeVaart
a07a4bb3f7 Update smcuda to match recent changes in sm BTL.
This commit was SVN r27803.
2013-01-14 14:42:19 +00:00
Rolf vandeVaart
34d1f0a585 Add some comments to the #ifdefs for clarity. No functional changes.
This commit was SVN r27802.
2013-01-13 16:08:48 +00:00
Alex Mikheev
344d407ed4 fixed compilation warning
always send signalled when BTL_OPENIB_FAILOVER is defined

This commit was SVN r27801.
2013-01-13 10:11:03 +00:00
Jeff Squyres
b2d5d1e348 Along with the Automake 1.13.x changes in r27790, rename these third
party configure.in scripts to be configure.ac so that Automake stops
complaining about them.

This commit was SVN r27791.

The following SVN revision numbers were found above:
  r27790 --> open-mpi/ompi@675a2f5c48
2013-01-11 20:26:19 +00:00
Jeff Squyres
675a2f5c48 Updates for Automake 1.13.x. Without these changes, Automake 1.13.x
will error out, due to use of the
previously-deprecated-and-now-removed AM_CONFIG_HEADER macro.

This commit was SVN r27790.
2013-01-11 20:20:02 +00:00
Samuel Gutierrez
4c28c8cbd0 New sm BTL initialization take two. This approach is pretty simple. Instead of
using the modex or RML to share sm initialization information, have node rank 0
create a file containing initialization information in a well-known place. Then
during add_procs, the rest of the node processes requiring sm BTL initialization
will just read from that file to complete their initialization.

This commit was SVN r27789.
2013-01-11 16:24:56 +00:00
Brian Barrett
b817166072 Use a process name instead of a name list in bcol_basesmuma
This commit was SVN r27779.
2013-01-09 16:43:49 +00:00
Joshua Ladd
77df51c516 Fixes the definition of the first fragment and does not assume that first frag has offset_into_user_buff equal to zero. This fix should be added to cmr:v1.7.1:reviewer=pasha
This commit was SVN r27775.
2013-01-08 20:24:58 +00:00
Alex Mikheev
fe672f255f request signal when sending over SRQ and number of SRQ sd_credits is 0
This commit was SVN r27767.
2013-01-08 14:00:29 +00:00
Samuel Gutierrez
c4acd20eb9 Backout r27739.
This commit was SVN r27745.

The following SVN revision numbers were found above:
  r27739 --> open-mpi/ompi@a159bfaf25
2013-01-05 01:54:23 +00:00
Nathan Hjelm
84e34ee0d7 Fix a bug in the uGNI btl that could cause certain descriptor callbacks to be called twice.
There was a race condition in the eager get protocol where the RDMA complete message could be received before the local completion of the SMSG message that started the eager get protocol.

cmr:v1.7

This commit was SVN r27740.
2013-01-03 23:11:13 +00:00
Samuel Gutierrez
a159bfaf25 sm BTL initialization via modex, as discussed at last year's meeting.
This commit was SVN r27739.
2013-01-03 21:52:20 +00:00
Mike Dubman
889d46e966 support for FCA v3.0 and up
This commit was SVN r27731.
2012-12-31 05:49:22 +00:00
Mike Dubman
b6d50a5733 Performance optimizations by alexm:
* btl sendi(): if message can be send inline try to avoid signal
* signal is requested one per 64 or when
    there are no send wqes 
    when message can not be send inline 
    any other btl method then sendi()

This commit was SVN r27724.
2012-12-26 10:19:12 +00:00
George Bosilca
ed77868984 No need for event.h in the SM BTL.
This commit was SVN r27718.
2012-12-23 19:33:53 +00:00
Nathan Hjelm
ef49fcea25 Remove debug printfs.
cmr:v1.7

This commit was SVN r27680.
2012-12-17 16:34:07 +00:00
Nathan Hjelm
ba5b2b0540 btl/vader: fix bug in single copy code that could cause ob1 sends to not get marked complete.
cmr:v1.7

This commit was SVN r27671.
2012-12-13 23:18:53 +00:00
Mike Dubman
a454341e2b add support for mxm 2.0
This commit was SVN r27661.
2012-12-09 22:58:37 +00:00
Nathan Hjelm
3e1b13b13a Re-add support for old flex (2.5.4a and earlier) while still cleaning up properly in new flex.
This commit was SVN r27657.
2012-12-07 00:12:43 +00:00
Brian Barrett
702451111b Remove Portals 3.3 support
This commit was SVN r27656.
2012-12-06 20:11:27 +00:00
Jeff Squyres
c00e6a7abf Remove the OFUD BTL. It doesn't work, and isn't included in 1.7.
An upcoming BTL from Cisco used ofud as a starting point, and should
probably be used as a starting point for any future UD-based BTL.

And this OFUD BTL is obviously still in history if anyone ever wants
to resurrect it.

This commit was SVN r27655.
2012-12-06 17:43:28 +00:00
Steve Wise
176a5a9b3b Update the Chelsio T4 openib device params. This fixes trac:3414 and should be added to cmr:v1.6:reviewer=jsquyres and cmr:v1.7:reviewer=jsquyres
This commit was SVN r27648.

The following Trac tickets were found above:
  Ticket 3414 --> https://svn.open-mpi.org/trac/ompi/ticket/3414
2012-11-30 16:32:34 +00:00
Jeff Squyres
ad15fb5437 Developer enhancement: if a BTL component returns a NULL in its array
of modules, print a BTL_ERROR and exit(1) (previous behavior was to
segv).  This at least explicitly tells the developer that their BTL
component is behaving badly.

This commit was SVN r27634.
2012-11-26 21:19:02 +00:00
Vishwanath Venkatesan
9320beaacc Modified/improved implmentation of dynamic segmentation algorithm to avoid merging in
fbtl modules. This implmentation in alignment with all other collective modules tries to 
keep all the file-ops as contiguous as possible.

This commit was SVN r27611.
2012-11-15 00:59:10 +00:00
Vishwanath Venkatesan
ac1dfae007 Perror should say readv in preadv fbtl, currently says writev
This commit was SVN r27610.
2012-11-15 00:57:13 +00:00
Nathan Hjelm
312857ea86 remove unused allocator output declaration (missed by r27599)
This commit was SVN r27600.

The following SVN revision numbers were found above:
  r27599 --> open-mpi/ompi@87e5f97400
2012-11-13 07:21:10 +00:00
Nathan Hjelm
87e5f97400 add missing #include of opal/util/output.h
This commit was SVN r27599.
2012-11-13 07:14:41 +00:00
Pavel Shamis
cbe6d6548a Cleaning warnings in ml, sbgp, bcol.
cmr:v1.7

This commit was SVN r27598.
2012-11-12 22:30:32 +00:00
Vishwanath Venkatesan
f91340f648 Fixes for the 2gb limitation. This fixes problems both static and two-phase
algorithms. 

This commit was SVN r27596.
2012-11-12 17:39:58 +00:00
Nathan Hjelm
e0f5137e46 add prototypes for lex destroy functions
This commit was SVN r27580.
2012-11-09 22:00:27 +00:00
Mike Dubman
9392f1894e Fixes MPI_Allgather when sendbuf is MPI_IN_PLACE. fixes 3342
This commit was SVN r27579.
2012-11-09 18:05:10 +00:00
Nathan Hjelm
8658bbc902 instead of relying on yyterminate to clean up the lex context call the destroy functions directly (after closing the file)
This commit was SVN r27577.
2012-11-09 16:10:55 +00:00
Aleksey Senin
ae92f64842 Check that MXM runtime version match compiled.
Reviewed by Mike Dubman.

This commit was SVN r27575.
2012-11-07 14:44:33 +00:00
Nathan Hjelm
51bf75f8e9 fix typo
This commit was SVN r27573.
2012-11-06 21:25:19 +00:00
Nathan Hjelm
1400bab466 fix a couple of errors in r27569
This commit was SVN r27572.

The following SVN revision numbers were found above:
  r27569 --> open-mpi/ompi@f3ce12e71a
2012-11-06 20:06:54 +00:00
Nathan Hjelm
7fb5caea92 Remove the finish_parsing function from various .l files. The function is incomplete (doesn't clean up the lex state) and should be replaced by *_yylex_destroy which correctly cleans up the state.
Checked with the flex 2.5.35. Verified with valgrind that this fixes several "still reachable" leaks.

cmr:v1.7

This commit was SVN r27571.
2012-11-06 19:26:14 +00:00
Nathan Hjelm
bdedd8b0d3 Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1.
Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the symantics of mca_base_components_open/close as they are now symetric in their functionality.

This commit was SVN r27570.
2012-11-06 19:09:26 +00:00
Nathan Hjelm
f3ce12e71a Per RFC fix several leaks in opal and ompi. Details below.
pml/v:
  - If vprotocol is not being used vprotocol_include_list is leaked. Assume vprotocol never takes ownership (see below) and always free the string.

coll/ml:
  - (patch verified) calling mca_base_param_lookup_string after mca_base_param_reg_string is unnecessary. The call to mca_base_param_lookup_string causes the value returned by mca_base_param_reg_string to be leaked.
  - Need to free mca_coll_ml_component.config_file_name on component close.

btl/openib:
  - calling mca_base_param_lookup_string after mca_base_param_reg_string is unnecessary. The call to mca_base_param_lookup_string causes the value returned by mca_base_param_reg_string to be leaked.

vprotocol/base:
  - There was no way for pml/v to determine if vprotocol took ownership of vprotocol_include_list. Fix by always never ownership (use strdup).

mca/base:
  - param_lookup will result in storage->stringval to be a newly allocated string if the mca parameter has a string value. ensure this string is always freed.

cmr:v1.7

This commit was SVN r27569.
2012-11-06 18:57:46 +00:00
Jeff Squyres
d6e9a14b14 Fix minor issue: the argv_delete may change the top list pointer. So
be sure to save it.

This commit was SVN r27568.
2012-11-06 16:05:58 +00:00
Mike Dubman
d47d550dfc performance optimization: process completions in the batch manner
This commit was SVN r27559.
2012-11-05 14:02:37 +00:00
Mike Dubman
ca308974e0 Switched FCA collectives component from dlopen to compile-time linking to libfca
This commit was SVN r27557.
2012-11-02 17:30:00 +00:00
Jeff Squyres
4569f77645 Remove redundant common_verbs.h include.
This commit was SVN r27556.
2012-11-02 14:16:55 +00:00
Mike Dubman
5cdb3654d7 SRQ now supported in ConnectIB
This commit was SVN r27552.
2012-11-01 08:13:56 +00:00
Brian Barrett
25ccfe1bdd Add autogen.sh to list of files to be distributed, so autogen works off a
tarball

This commit was SVN r27548.
2012-11-01 02:26:28 +00:00
Vishwanath Venkatesan
0e6378bfc9 Modifying the static component accordingly for the modification of interfaces in io_ompio.c
This commit was SVN r27546.
2012-10-31 22:09:21 +00:00
Vishwanath Venkatesan
d1fc22883a Changing the dynamic component accordingly for the modified interfaces
This commit was SVN r27545.
2012-10-31 22:08:25 +00:00
Vishwanath Venkatesan
67463de96f Changing the two-phase component accordingly for the modified interfaces.
This commit was SVN r27544.
2012-10-31 22:07:02 +00:00
Vishwanath Venkatesan
2922fa28a6 Changes to the interface for extracting timing information,
to avoid accessing datastructures across frameworks.

This commit was SVN r27543.
2012-10-31 22:03:05 +00:00
Nathan Hjelm
2acd0f83de Revert "Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter".
It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occuring before the command line parser changes but it appears to be resolved now.

This commit was SVN r27527.

The following SVN revision numbers were found above:
  r27451 --> open-mpi/ompi@d59034e6ef
  r27456 --> open-mpi/ompi@ecdbf34937
2012-10-30 19:45:18 +00:00
Brian Barrett
c0f1775620 Fix warnings in nbc
This commit was SVN r27514.
2012-10-29 19:52:43 +00:00
Ralph Castain
6aac54b02e Revert r27510, r27509, and r27508.
Not sure what happened here, but the resulting trunk wouldn't even configure. After spending time fixing that problem, I found it wouldn't compile due to multiple syntax errors that had been introduced in both the OPAL and OMPI layer. This raised questions as to the completeness of the work.

Given that the author is departing, I pinged Jeff about it and we agreed to revert this for now. Hopefully, it can either be fixed by the author prior to actual departure, or someone else can pick it up (now that it is in the history) and fix it.

This commit was SVN r27511.

The following SVN revision numbers were found above:
  r27508 --> open-mpi/ompi@12c3c743de
  r27509 --> open-mpi/ompi@79e4a8ca38
  r27510 --> open-mpi/ompi@1ad5ff625a
2012-10-27 16:43:45 +00:00
Shiqing Fan
1ad5ff625a Don't know why these lines are modified, probably by subversion. Although it is harmless, change it back as what it was.
This commit was SVN r27510.
2012-10-27 03:04:45 +00:00
Shiqing Fan
12c3c743de Per the MemPin RFC, submit the component source files, and update the memchecker macros.
This commit was SVN r27508.
2012-10-27 02:48:20 +00:00
Ralph Castain
b1e119fe2c Add missing header files to tarball
This commit was SVN r27504.
2012-10-26 23:06:31 +00:00
Edgar Gabriel
3291e3abe3 make ROMIO operational with OpenMPI for PVFS2 file systems.
This commit was SVN r27501.
2012-10-26 18:43:59 +00:00
Brian Barrett
8b40c0de9b * Lock around tag management, so that it's thread safe
* Only register the progress function on first call to a non-blocking
  collective operation, to try to reduce overall performance impact
* Fix tag management in roll-over case

This commit was SVN r27498.
2012-10-26 15:36:09 +00:00
Brian Barrett
51a3ec2d7b remove now unused (after r27485) fake mpool component
This commit was SVN r27486.

The following SVN revision numbers were found above:
  r27485 --> open-mpi/ompi@a1a52c9e90
2012-10-25 21:52:17 +00:00
Brian Barrett
a1a52c9e90 Rather than use the fake mpool for handling callbacks into the MX library,
use the memory hooks interface (which does allow for multiple callbacks
to be registered) directly.

This commit was SVN r27485.
2012-10-25 21:50:07 +00:00
Ralph Castain
e6014bf2e1 Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter
This commit was SVN r27477.

The following SVN revision numbers were found above:
  r27451 --> open-mpi/ompi@d59034e6ef
  r27456 --> open-mpi/ompi@ecdbf34937
2012-10-24 18:38:44 +00:00
Yael Dayan
d6f7e4eb73 openib: modified Mellanox ConnectIB max_inline_data param
This commit was SVN r27457.
2012-10-18 15:59:18 +00:00
Nathan Hjelm
d59034e6ef MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions.
cmr:v1.7

This commit was SVN r27451.
2012-10-17 20:17:37 +00:00
Vishwanath Venkatesan
6e3c9754a9 Initialization issue + fixing error in determining minimum value.
This commit was SVN r27443.
2012-10-11 23:41:25 +00:00
Vishwanath Venkatesan
95d38fdaf5 # Extracting timing information for the static collective write/read algorithms.
# The processes register their information and continue.
# Actual printing of timing information happens at file close.
# Triggered by MCA parameter at runtime

This commit was SVN r27442.
2012-10-11 21:27:47 +00:00
Vishwanath Venkatesan
240d56feeb # Extracting timing information for the two-phase collective write/read algorithms.
# The processes register their information and continue.
# Actual printing of timing information happens at file close.
# Triggered by MCA parameter at runtime

This commit was SVN r27441.
2012-10-11 21:25:30 +00:00
Vishwanath Venkatesan
7bc35f7862 # Extracting timing information for the dynamic collective write/read algorithms.
# The processes register their information and continue.
# Actual printing of timing information happens at file close.
# Triggered by MCA parameter at runtime

This commit was SVN r27440.
2012-10-11 21:23:24 +00:00
Vishwanath Venkatesan
9eeb3b5d50 # Extracting timing information for individual components of collective algorithm using a generic queue.
# This is triggered based on a mca-paramater and can be used with all collective modules.
# Individual queues maintained for read and write.
# The additional communication to combine data is done at file-close so that the 
  actual timing of collective-operations will not get affected. 
# The queues are initialized in file-open

This commit was SVN r27439.
2012-10-11 21:14:07 +00:00
Jeff Squyres
2ab48b997e Give a better verbose message if we're able to make an RC QP and we
didn't want one.

This commit was SVN r27438.
2012-10-11 19:21:21 +00:00
George Bosilca
9984a7143f Reorder the loop index.
This commit was SVN r27423.
2012-10-08 21:34:26 +00:00
George Bosilca
b46167fc4a Fix some issues with the MPI_IN_PLACE support.
This commit was SVN r27422.
2012-10-08 21:34:04 +00:00
Vishwanath Venkatesan
c89c9e40be Code to extract neigbhouing offsets information from OMPIO into a file. Driven by an MCA parameter,
turned-off by default.

This commit was SVN r27407.
2012-10-04 21:53:26 +00:00
Vishwanath Venkatesan
c86e5f6263 set status->_ucount correctly for two-phase collective read and write operations in the module
This commit was SVN r27406.
2012-10-04 21:39:37 +00:00
Vishwanath Venkatesan
12363d50bd Setting the default stripe size to 1MB as 64KB is too small for a generic scenario.
This commit was SVN r27405.
2012-10-04 21:26:29 +00:00
Vishwanath Venkatesan
2ddbfbeb26 Setting the default stripe size for lustre to 64KB. This is an MCA parameter, can be overridden at runtime if needed.
This commit was SVN r27404.
2012-10-04 20:56:26 +00:00
Eugene Loh
25ad84b925 Ensure that MPI_Status objects have proper alignment:
- fix the Fortran layer to use new macros to convert Fortran-to-C status
- change the C internals to pull out old OMPI_SET_STATUS* macros

Also, change name of "status" argument in topo_test_f.c to "topo_type".

This commit was SVN r27403.
2012-10-04 14:39:51 +00:00
Yael Dayan
0122cf6cbb openib: added Mellanox ConnectIB device ID and params to the device parameters ini file
This commit was SVN r27402.
2012-10-04 13:20:47 +00:00
Pavel Shamis
e9df33c10b Fixing a coupe of issues in the iboffload code. It should fix the XRC compilation issue.
This commit was SVN r27400.
2012-10-03 15:44:51 +00:00
Jeff Squyres
8c369224bf More common/verbs improvements:
* Add OMPI_COMMON_VERBS_FLAGS_NOT_RC, which looks for a device	that
   does ''not'' support RC
 * Add ompi_common_verbs_find_max_inline(), and	remove that code from
   the openib BTL component

This commit was SVN r27393.
2012-10-03 00:57:39 +00:00
Nathan Hjelm
9200473a59 mtl/psm: do not let psm set processor affinity
This commit was SVN r27389.
2012-10-02 15:56:51 +00:00
Vishwanath Venkatesan
76dc8a7c55 Fix for two_phase_read_all for multiple cycles.
This commit was SVN r27388.
2012-10-01 19:14:14 +00:00
Ralph Castain
54db4c35eb Get the trunk to build again when --without-hwloc is specified. Move a couple of key type definitions and utilities out from under the HAVE_HWLOC test so they are always available as they don't really depend on hwloc's presence. Tell two compnents not to build if hwloc is disabled:
ompi/mca/sbgp/basesmsocket
orte/mca/rmaps/lama

Remove stale configure.params files from the sbgp framework as the OMPI build system no longer looks at those files.

This commit was SVN r27377.
2012-09-26 23:24:27 +00:00
George Bosilca
48f528f142 icc complains about the missing prototype.
This commit was SVN r27373.
2012-09-26 09:56:14 +00:00
George Bosilca
890dedf13f Cleanup.
This commit was SVN r27372.
2012-09-26 09:44:46 +00:00
Vishwanath Venkatesan
c6751eaf70 Fixing all two_phase read_all bugs,
1. Multiple aggregator with non-contiguous datatype,
2. Memory corruption bugs.

Cleaned version, with proper initialization and memory management.

This commit was SVN r27370.
2012-09-26 00:16:08 +00:00
Vishwanath Venkatesan
2e2a46b2be Fixing two-phase write all bug for non-contiguous file-type.
This commit was SVN r27369.
2012-09-26 00:10:40 +00:00
Jeff Squyres
30d9c36275 FreeBSD detection improvement. Thanks to Brooks Davis for the patch.
This commit was SVN r27334.
2012-09-13 13:25:04 +00:00
Jeff Squyres
3cc8b0461a More updates to common verbs infrastructure:
* Moved "check basics" sanity check from openib BTL to common/verbs
   (which also allows us to have openib ''not'' include
   <infiniband/driver.h>, which is a Very Good Thing)
 * Add new ompi_common_verbs_qp_test() function, which tests to see
   whether a device supports RC and/or UD QPs.  The openib BTL now
   uses this function to ensure that the device supports RC QPs.
 * Rename ompi_common_verbs_find_ibv_ports() to be
   ompi_common_verbs_find_ports() -- the "ibv" was redundant.
 * Re-work ompi_common_verbs_find_ports() to use
   ompi_common_verbs_qp_test() instead of testing for RC/UD QPs itself
 * Add bunches of opal_output_verbose() to the find_ports() routine
   (to help diagnosing connectivity problems -- imaging running with
   --mca btl_base_verbose 10; you'll see all the find_ports() test
   results)
 * Make ompi_common_verbs_qp_test() warn if devices/ports are supplied
   in the if_include/if_exclude strings that do not exists (quite
   similar to what the openib BTL does today).
 * Add ompi_common_verbs_mca_register() function, which registers
   common verbs MCA params.  It will also register MCA param synonyms
   for thse MCA params to upper-level components (e.g.,
   btl_<upper-level-component>_<the-mca-param>). 
   * common_verbs_warn_nonexistent_if: warn if
     if_include/if_exclude-specified devices or ports do not exist.  

This commit was SVN r27332.
2012-09-12 20:47:47 +00:00
Pavel Shamis
1e7b958c2a Cleaning warning in collectives code
This commit was SVN r27331.
2012-09-12 19:47:23 +00:00
Jeff Squyres
171f6efd70 Don't free heap objects!
This commit was SVN r27326.
2012-09-12 15:11:56 +00:00
Jeff Squyres
fb2e543a57 Refs trac:3275.
We ran into a case where the OMPI SVN trunk grew a new acceptable MCA
parameter value, but this new value was not accepted on the v1.6
branch (hwloc_base_mem_bind_failure_action -- on the trunk it accepts
the value "silent", but on the older v1.6 branch, it doesn't).  If you
set "hwloc_base_mem_bind_failure_action=silent" in the default MCA
params file and then accidentally ran with the v1.6 branch, every OMPI
executable (including ompi_info) just failed because hwloc_base_open()
would say "hey, 'silent' is not a valid value for
hwloc_base_mem_bind_failure_action!".  Kaboom.

The only problem is that it didn't give you any indication of where
this value was being set.  Quite maddening, from a user perspective.

So we changed the ompi_info handles this case.  If any framework open
function return OMPI_ERR_BAD_PARAM (either because its base MCA params
got a bad value or because one of its component register/open
functions return OMPI_ERR_BAD_PARAM), ompi_info will stop, print out
a warning that it received and error, and then dump out the parameters
that it has received so far in the framework that had a problem.

At a minimum, this will show the user the MCA param that had an error
(it's usually the last one), and ''where it was set from'' (so that
they can go fix it).  

We updated ompi_info to check for O???_ERR_BAD_PARAM from each from
the framework opens.  Also updated the doxygen docs in mca.h for this
O???_BAD_PARAM behavior.  And we noticed that mca.h had MCA_SUCCESS
and MCA_ERR_??? codes.  Why?  I think we used them in exactly one
place in the code base (mca_base_components_open.c).  So we deleted
those and just used the normal OPAL_* codes instead.

While we were doing this, we also cleaned up a little memory
management during ompi_info/orte-info/opal-info finalization.
Valgrind still reports a truckload of memory still in use at ompi_info
termination, but they mostly look to be components not freeing
memory/resources properly (and outside the scope of this fix).

This commit was SVN r27306.

The following Trac tickets were found above:
  Ticket 3275 --> https://svn.open-mpi.org/trac/ompi/ticket/3275
2012-09-11 20:47:24 +00:00
Shiqing Fan
ec4cf39925 Windows doesn't need to exclude any interface by default. This will avoid tcp warnings.
This commit was SVN r27291.
2012-09-11 15:39:37 +00:00
Shiqing Fan
0c4c2a5f5d Revert r27283. A better solution is found. Thanks to Ralph anyway.
This commit was SVN r27290.

The following SVN revision numbers were found above:
  r27283 --> open-mpi/ompi@38bcd86ae4
2012-09-11 15:37:22 +00:00