
173 Commits

Author SHA1 Message Date
Brian Barrett
f42783ae1a Move the RTE framework change into the trunk. With this change, all non-CR
runtime code goes through one of the rte, dpm, or pubsub frameworks.

This commit was SVN r27934.
2013-01-27 23:25:10 +00:00
Jeff Squyres
dd254cc202 OMPI_HAVE_IBV_LINK_LAYER does not exist. Instead, check defined(HAVE_IBV_LINK_LAYER_ETHERNET).
This commit was SVN r27251.
2012-09-06 18:25:36 +00:00
Jeff Squyres
341ce2f9a4 Per some discussions between LANL, Cisco, ORNL, and Mellanox, move
some new common OpenFabrics functionality to ompi/mca/common/verbs.
Also move everything that was in ompi/mca/common/ofautils under
ompi/mca/common/verbs.  

 * Move ofautils -> verbs
 * Add new functionality in ompi/mca/common/verbs (see doxygen
   comments in ompi/mca/common/verbs/common_verbs.h for details):
   * ompi_common_verbs_find_ibv_ports()
   * ompi_common_verbs_port_bw()
   * ompi_common_verbs_mtu()
   * '''If you're writing verbs-based code, you should be using this
     common functionality'''
 * Adapt openib BTL to use some trivial common functionality in
   common/verbs
 * Don't use "#ifdef OMPI_HAVE_RDMAOE"; use
   "#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)"
 * Update the following to include/link against common/verbs
   * bcol/iboffload
   * sbgp/ibnet
   * btl/openib

This commit was SVN r27212.
2012-09-01 01:42:37 +00:00
Jeff Squyres
c8cee23ee7 Priorities really shouldn't be less than 0.
This commit was SVN r27098.
2012-08-21 15:47:15 +00:00
Ralph Castain
dacb07000d Turn udcm and ud oob off by default, but allow them to build and be used if someone wants to test them
cmr:v1.7

This commit was SVN r27097.
2012-08-21 15:18:34 +00:00
Christopher Yeoh
cc091f4979 Adds synchronisation between main thread and service thread in
btl_openib_connect_udcm when notifying not to listen to an fd to ensure
that the main thread does not continue until the service thread has
processed the message

Adds ability to send message to openib async thread to tell it to
ignore the ERR state on a specific QP. Adds this call to udcm_module_finalize
so when we set the error state on the QP it doesn't cause the 
openib async thread to abort the mpi program prematurely

Fixes trac:3161

This commit was SVN r27064.

The following Trac tickets were found above:
  Ticket 3161 --> https://svn.open-mpi.org/trac/ompi/ticket/3161
2012-08-16 03:56:21 +00:00
Nathan Hjelm
8736953c7f btl/openib/connect improve the help message printed when a queue pair can not be created
This commit was SVN r26876.
2012-07-26 20:36:46 +00:00
Nathan Hjelm
771b427027 udcm: unmonitor the fd BEFORE tearing down the listen qp
This commit was SVN r26800.
2012-07-18 14:22:45 +00:00
Nathan Hjelm
fc1b295606 udcm: evict from the lru of the openib device's grdma mpool if a qp can not be created. Note: there doesn't appear to be a standard way to differentiate between ibv_create_qp failing because the node is out of registered memory and failing because no more qps are available
This commit was SVN r26797.
2012-07-14 01:58:29 +00:00
Nathan Hjelm
344fe61616 remove assertion in udcm
This commit was SVN r26790.
2012-07-13 15:14:48 +00:00
Nathan Hjelm
05c5c1f412 remove unused i_initiate function from udcm
This commit was SVN r26778.
2012-07-10 17:22:19 +00:00
Nathan Hjelm
9f3717959e remove sync step from udcm as it really isn't necessary
This commit was SVN r26724.
2012-07-02 22:54:44 +00:00
Josh Hursey
28681deffa Backout the ORCA commit. :(
There is a linking issue on Mac OSX that needs to be addressed before this is able to come back into the trunk.

This commit was SVN r26676.
2012-06-27 01:28:28 +00:00
Josh Hursey
542330e3a7 Commit of ORCA: Open MPI Runtime Collaborative Abstraction
This is a runtime interposition project that sits between the OMPI and ORTE layers in Open MPI.

The project is described on the wiki:
  https://svn.open-mpi.org/trac/ompi/wiki/Runtime_Interposition

And on this email thread:
  http://www.open-mpi.org/community/lists/devel/2012/06/11109.php

This commit was SVN r26670.
2012-06-26 21:42:16 +00:00
Nathan Hjelm
249066e06d Timeout! Per RFC update the BTL interface to hide segment keys. All BTLs (with the exception of wv), all relevant PMLs, and osc/rdma have been updated for the new interface.
This commit was SVN r26626.
2012-06-21 17:09:12 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Nathan Hjelm
3d9bc68435 reorder udcm init/finalize code. fixes trac:2973
This commit was SVN r25787.

The following Trac tickets were found above:
  Ticket 2973 --> https://svn.open-mpi.org/trac/ompi/ticket/2973
2012-01-26 16:28:55 +00:00
Brad Benton
96395c916e de-tab'd
This commit was SVN r25465.
2011-11-09 19:45:12 +00:00
Christopher Yeoh
fb57a74a40 Removes pointless memmove which because of a previous memcpy will always
have identical source and destination pointers. See #2871
Plugs a couple of minor memory leaks related to remote qp info

This commit was SVN r25431.
2011-11-04 00:15:08 +00:00
George Bosilca
9d8e84142f Survivor!!!
This commit was SVN r25371.
2011-10-26 00:58:55 +00:00
George Bosilca
80c02647c8 Each level (OPAL/ORTE/OMPI) should only return its own constants,
instead of the current mismatch.

This commit was SVN r25230.
2011-10-04 14:50:31 +00:00
Brad Benton
0f2475c554 Modified set_remote_info() to use memmove() instead of memcpy() when
copying rem_qp info.  This avoids potential errors when src & dest overlap.
This is a workaround for the issue in #2871

This commit was SVN r25180.
2011-09-26 20:07:36 +00:00
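
The memmove() versus memcpy() distinction in commit 0f2475c554 can be illustrated with a minimal sketch. It is not the set_remote_info() code: shift_right() and overlap_copy_ok() are hypothetical helpers invented here, and the buffer contents are made up. The only point is that memmove() is defined when source and destination overlap, while memcpy() is not.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the openib BTL): move the first `count`
 * bytes of `buf` to the right by `shift` bytes inside the same buffer,
 * so the source and destination regions overlap. */
static void shift_right(char *buf, size_t count, size_t shift)
{
    assert(buf != NULL);
    /* memmove() handles overlapping regions; memcpy() here would be
     * undefined behavior. */
    memmove(buf + shift, buf, count);
}

/* Returns 1 if the overlapping copy produced the expected bytes. */
static int overlap_copy_ok(void)
{
    char buf[8] = "abcdef";   /* 'a'..'f' followed by two NUL bytes */
    shift_right(buf, 5, 1);   /* copies "abcde" onto positions 1..5 */
    return strcmp(buf, "aabcde") == 0;
}
```

Calling overlap_copy_ok() returns 1 on a conforming C library. When the two pointers are known never to overlap, memcpy() remains the cheaper choice.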
Nathan Hjelm
98b56108c4 add unreliable datagram connection manager (udcm)
This commit was SVN r25160.
2011-09-19 21:24:58 +00:00
Nathan Hjelm
8cd550f49f fixed error in last commit
This commit was SVN r25159.
2011-09-19 21:13:59 +00:00
Nathan Hjelm
de950959ee print a more meaningful error message when ibv_create_qp fails
This commit was SVN r25158.
2011-09-19 21:12:22 +00:00
Steve Wise
e5bba38434 Increase the rdmacm cpc address resolution timeout to 30 seconds.
Global rdmacm_resolve_timeout defaults to 1000 (1000 ms), which is way
too small for even a 16 node x 12 core iwarp cluster in the presence
of drops.  Bump up the default to 30000ms.

This commit fixes trac:2860 and should be added to cmr:v1.4:reviewer=jsquyres
and cmr:v1.5:reviewer=jsquyres

This commit was SVN r25144.

The following Trac tickets were found above:
  Ticket 2860 --> https://svn.open-mpi.org/trac/ompi/ticket/2860
2011-09-14 21:52:58 +00:00
Nathan Hjelm
3048ce043d permanently disable ibcm
This commit was SVN r25137.
2011-09-13 15:10:51 +00:00
Rainer Keller
9d5afc58c6 - Fix breakage of the epoch changes with PGI:
Don't just include pre-processor macros between two strings ("s1" #if 0 ... "s2")...
   Rather print out the epoch as 0 always...

This commit was SVN r25110.
2011-08-31 08:40:31 +00:00
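
A sketch of the pattern behind commit 9d5afc58c6, under stated assumptions: putting #if/#endif between the arguments of a call is undefined behavior when the callee is implemented as a macro, which is a plausible reason one compiler rejected it. The names DEMO_ENABLE_EPOCH, DEMO_EPOCH and format_name() are hypothetical and are not the ORTE identifiers. The format string stays fixed and a macro resolved outside the call selects the value.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* DEMO_ENABLE_EPOCH is a hypothetical build switch, not the real one. */
#ifdef DEMO_ENABLE_EPOCH
#define DEMO_EPOCH(e) (e)
#else
#define DEMO_EPOCH(e) 0    /* feature disabled: always print 0 */
#endif

/* Formats a process name. No preprocessor directive appears inside the
 * snprintf() argument list; the conditional lives in DEMO_EPOCH. */
static int format_name(char *out, size_t len, int jobid, int vpid, int epoch)
{
    (void)epoch;  /* silences "unused" warnings when the feature is off */
    return snprintf(out, len, "[%d,%d] epoch %d",
                    jobid, vpid, DEMO_EPOCH(epoch));
}
```

Built without DEMO_ENABLE_EPOCH, format_name(buf, sizeof buf, 1, 2, 7) writes "[1,2] epoch 0" into buf.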
Wesley Bland
4e7ff0bd5e By popular demand the epoch code is now disabled by default.
To enable the epochs and the resilient orte code, use the configure flag:

--enable-resilient-orte

This will define both:

ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE

This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Yevgeny Kliteynik
7068dc64eb Dynamic SL rework:
- Added dynamic SL support to xoob
 - Fixed seg fault in finalization
 - All the code has been moved to separate files: connect/btl_openib_connect_sl.{c,h}
 - The new files compilation is conditionalized

This commit was SVN r24991.
2011-08-04 20:26:08 +00:00
Yevgeny Kliteynik
4fbe68dd86 Removing trailing white spaces in all the openib btl code.
This commit was SVN r24855.
2011-07-04 14:00:41 +00:00
Yevgeny Kliteynik
b05211148d Supporting dynamic SL (#2674)
- Added enable/disable configuration parameter for dynamic SL
 - All the dynamic SL code is conditionalized
 - Removed libibmad dependency
 - Using only one include - ib_types.h (part of opensm-devel package)
 - Removed all the macro and data types definitions, using the
   existing definitions from ib_types.h instead
 - general cleaning here and there

The async mode is not implemented yet - stay tuned...

This commit was SVN r24830.
2011-06-28 14:28:29 +00:00
Wesley Bland
e1ba09ad51 Add resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Jeff Squyres
82f9474fec Revert r24533 and r24507 until the compile errors can be fixed.
This commit was SVN r24541.

The following SVN revision numbers were found above:
  r24507 --> open-mpi/ompi@4ce1936fed
  r24533 --> open-mpi/ompi@3204af2d36
2011-03-18 13:33:02 +00:00
Mike Dubman
3204af2d36 * temporary fix for ib btl compilation with old ofed versions 1.3.x.
This commit was SVN r24533.
2011-03-16 17:53:51 +00:00
Doron Shoham
4ce1936fed Fix the following for dynamic SL patch:
* rename ib_path_rec_service_level -> ib_path_record_service_level
* use mad.h and ib_types.h
* free all resources
* move ibv_post_recv to be just before ibv_post_send
* cleanup and beautify code

This commit was SVN r24507.
2011-03-10 16:19:00 +00:00
George Bosilca
4184baa67a Remove the proc_guid from the BTL proc structure. Instead use directly
the one stored in the ompi_proc_t.

This commit was SVN r24461.
2011-02-25 00:36:08 +00:00
Jeff Squyres
b468c71b47 Use complete types
This commit was SVN r24434.
2011-02-22 22:34:44 +00:00
Shiqing Fan
90eeba252e Make openib compile again for Windows.
Update the CMake script for checking mca subdirs.
Add windows support for __attribute__ packed structures.
Define usleep and posix_memalign with equivalent windows functions.
And a few minor fixes, type casts.

This commit was SVN r24429.
2011-02-22 15:49:27 +00:00
Eugene Loh
cd5c2e794f Some minor changes to help the openib BTL build and run on Solaris:
- poll() can return POLLRDNORM even if not requested (Solaris bug)
- MIN macro not defined in btl_openib.c
  and while we're at it, we clean up the MIN definition in ad_bgl_pset.h
- btl_openib_connect_rdmacm.c was calling rdma_destroy_id() twice
  leading to undefined behavior (a hang on Solaris)

This commit was SVN r24356.
2011-02-03 23:53:21 +00:00
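
The first item of commit cd5c2e794f (poll() reporting POLLRDNORM even when it was not requested) suggests a defensive pattern: test individual bits of revents and never compare the whole field for equality. The sketch below shows that pattern in general; it is not the code changed in the openib BTL, and fd_readable() and pipe_check() are hypothetical names.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if `fd` has data to read, 0 if not, -1 if poll() failed.
 * Only the POLLIN bit is inspected, so extra bits such as POLLRDNORM
 * in revents do not change the result. */
static int fd_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0) {
        return -1;
    }
    if (rc == 0) {
        return 0;   /* timed out, nothing reported */
    }
    return (pfd.revents & POLLIN) ? 1 : 0;
}

/* Self-check on a pipe: empty means not readable, one byte written
 * means readable. Returns 1 when both observations hold. */
static int pipe_check(void)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return 0;
    }
    int before = fd_readable(fds[0], 0);
    ssize_t n = write(fds[1], "x", 1);
    int after = fd_readable(fds[0], 100);
    close(fds[0]);
    close(fds[1]);
    return n == 1 && before == 0 && after == 1;
}
```

A comparison such as `pfd.revents == POLLIN` would break as soon as the system sets any additional flag, which is the situation the commit message describes.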
Ethan Mallove
82054cb02c Include <stdlib.h> instead of <malloc.h>. This avoids a compiler error
on some systems caused by the definition of malloc in
opal_config_bottom.h getting expanded in the system malloc.h when
OPAL_ENABLE_MEM_DEBUG is set to 1.

This commit was SVN r24210.
2011-01-06 18:16:36 +00:00
Josh Hursey
bbfdf04a81 Fix a couple of 'unused variable' warnings, and one return value warning.
{{{
base/paffinity_base_service.c: In function ‘opal_paffinity_base_cset2mapstr’:
base/paffinity_base_service.c:623: warning: unused variable ‘range_last’
base/paffinity_base_service.c:623: warning: unused variable ‘range_first’
base/paffinity_base_service.c:622: warning: unused variable ‘count’
base/paffinity_base_service.c:622: warning: unused variable ‘m’
}}}

{{{
connect/btl_openib_connect_oob.c: In function ‘init_ud_qp’:
connect/btl_openib_connect_oob.c:1111: warning: control reaches end of non-void function
connect/btl_openib_connect_oob.c: In function ‘init_device’:
connect/btl_openib_connect_oob.c:1235: warning: unused variable ‘i’
connect/btl_openib_connect_oob.c: In function ‘get_pathrecord_sl’:
connect/btl_openib_connect_oob.c:1323: warning: unused variable ‘i’
}}}

This commit was SVN r24196.
2010-12-30 15:37:50 +00:00
Doron Shoham
834625cc51 Currently the service level is passed as a static parameter (ib_service_level), but the service level may be dynamic, in which case the only way to get a proper value is to ask the SA.
A new MCA parameter is added (ib_path_rec_service_level); a positive value means that we should get the SL from the SA.

This is useful for torus topologies, where a different SL value is used for different endpoints.

A cache is kept of ib queue pairs used to communicate with the SA for a particular device and port and path record SL values retrieved from that SA.

The interaction with the cache assumes that there are no recursive calls to these routines. This must be solved either by code flow, by using higher level locks, or by adding a locking mechanism to these routines along with some method for avoiding deadlock.

This code uses a UD queue pair to talk to the SA, so there is no need to chmod /dev/infiniband/umad* for use by normal users.

The request to the SA is a SubnAdmGet(), not a SubnAdmGetTable().
In the future we might add support for SubnAdmGetTable(), but that would require implementing RMPP (Reliable Multi-Packet Transaction Protocol) and I'm not sure we want to do that.

This patch is based on the work of David McMillen <davem@systemfabricworks.com>.

This commit was SVN r24195.
2010-12-30 08:20:24 +00:00
Ralph Castain
fceabb2498 Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.

Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.

Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.

I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:

1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)

2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.

There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.

This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
Rolf vandeVaart
e9a7fea42d Fix up some of the failover code in the openib BTL.
Need to use MCA_BTL_IB_FAILED state to signal failure,
not MCA_BTL_IB_CLOSED.

This commit was SVN r23883.
2010-10-11 17:38:27 +00:00
Donald Kerr
f79c89e0e9 help maintain order established, and defined, during mca_btl_openib_add_procs()
This commit was SVN r23425.
2010-07-16 13:13:37 +00:00
Rolf vandeVaart
b4af9c0efc Fix casts so trunk compiles
This commit was SVN r23381.
2010-07-13 01:52:22 +00:00
Shiqing Fan
cdc7e0bec9 Mainly type casts.
Get rid of pthread and other unnecessary stuff for Windows.

This commit was SVN r23376.
2010-07-12 16:17:56 +00:00
Josh Hursey
f57e73d4e5 add a few more missing SOS includes
This commit was SVN r23168.
2010-05-18 15:00:07 +00:00
Abhishek Kulkarni
afbe3e99c6 * Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with
(OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
 SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
 back the native error code.

* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
  (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
  decode 'ret' to get the native error code.

This commit was SVN r23162.
2010-05-17 23:08:56 +00:00
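
The two rules in commit afbe3e99c6 can be sketched with a toy encoding. Everything below is hypothetical: the demo_ names, the bit layout and the numeric codes are invented for illustration and are not the OPAL SOS implementation. The sketch only shows why an encoded return value must be decoded before an equality test against a specific error, and why a success check needs no decoding when success is preserved.

```c
#include <assert.h>

/* Hypothetical native codes; the real OPAL/OMPI values differ. */
enum { DEMO_SUCCESS = 0, DEMO_ERROR = 1, DEMO_ERR_OUT_OF_RESOURCE = 2 };

/* Toy encoding: extra context in the high bits, native code in the low
 * 8 bits. Success is passed through unchanged. */
static int demo_encode(int code, int context)
{
    assert(code >= 0 && code < 256);
    if (code == DEMO_SUCCESS) {
        return DEMO_SUCCESS;
    }
    return (context << 8) | code;
}

/* Toy counterpart of a "get native error code" accessor. */
static int demo_get_err_code(int ret)
{
    return ret & 0xff;
}

/* Rule 1: decode before comparing against a specific error. */
static int is_out_of_resource(int ret)
{
    return DEMO_ERR_OUT_OF_RESOURCE == demo_get_err_code(ret);
}

/* Rule 2: test for failure as "not success"; no decoding needed. */
static int failed(int ret)
{
    return DEMO_SUCCESS != ret;
}
```

With this toy layout, demo_encode(DEMO_ERR_OUT_OF_RESOURCE, 5) yields a value that no longer equals DEMO_ERR_OUT_OF_RESOURCE directly, yet is_out_of_resource() and failed() both still report it correctly.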