1
1

940 Коммитов

Автор SHA1 Сообщение Дата
Wesley Bland
4e7ff0bd5e By popular demand the epoch code is now disabled by default.
To enable the epochs and the resilient orte code, use the configure flag:

--enable-resilient-orte

This will define both:

ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE

This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Jeff Squyres
1cbfb53801 r24976 wasn't quite right -- you now actually get a warning if you
specify btl_tcp_if_include because btl_tcp_if_exclude is defaulted to
the loopback devices.

This commit does a few things:

 * Introduce a new OPAL MCA base function:
   mca_base_param_check_exclusive_string().  It checks to see that the
   ''user'' does not set two MCA parameters that are mutually
   exclusive by checking the source of those MCS param values.
 * Use the above function in many BTLs (and the OOB TCP) to ensure
   that <foo>_if_include and <foo>_if_exclude are not both specified
   ''by the user''.
 * Re-arrange many of these BTLs to move their MCA registration code
   into a separate component_register() function (vs. the
   component_open() function).

This code has been nominally reviewed and checked by Ralph, George,
Terry, and Shiqing.

This commit was SVN r25043.

The following SVN revision numbers were found above:
  r24976 --> open-mpi/ompi@8f4ac54336
2011-08-10 17:24:36 +00:00
Yevgeny Kliteynik
7068dc64eb Dynamic SL rework:
- Added dynamic SL support to xoob
 - Fixed seg fault in finalization
 - All the code has been moved to separate files: connect/btl_openib_connect_sl.{c,h}
 - The new files compilation is conditionalized

This commit was SVN r24991.
2011-08-04 20:26:08 +00:00
Rolf vandeVaart
3d3b3d4dad Add support for CUDA registering sm and openib buffers. Feature is disabled by default.
This commit was SVN r24987.
2011-08-04 10:15:45 +00:00
Yevgeny Kliteynik
c1ab24c687 openib: added Mellanox ConnectX3 device ID to the device parameters ini file
This commit was SVN r24947.
2011-07-26 12:06:43 +00:00
Yevgeny Kliteynik
4fbe68dd86 Removing trailing white spaces in all the openib btl code.
This commit was SVN r24855.
2011-07-04 14:00:41 +00:00
Yevgeny Kliteynik
5cae33503d Changing the weird non-ASCII sign with '*'
This commit was SVN r24854.
2011-07-04 13:39:38 +00:00
Rolf vandeVaart
e6295159ae Fix compilation of file due to some changes in btl structure.
This commit was SVN r24847.
2011-06-30 19:22:41 +00:00
Yevgeny Kliteynik
b05211148d Supporting dynamic SL (#2674)
- Added enable/disable configuration parameter for dynamic SL
 - All the dynamic SL code is conditionalized
 - Removed libibmad dependency
 - Using only one include - ib_types.h (part of opensm-devel package)
 - Removed all the macro and data types definitions, using the
   existing definitions from ib_types.h instead
 - general cleaning here and there

The async mode is not implemented yet - stay tuned...

This commit was SVN r24830.
2011-06-28 14:28:29 +00:00
Wesley Bland
e1ba09ad51 Add a resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Jeff Squyres
d1d2cd0a87 Make the description of mca_btl_openib_cq_size be more accurate of
what it really is/does.

cmr:v1.5.4:kliteyn cmr:v1.4.4:reviewer=kliteyn

This commit was SVN r24684.
2011-05-05 13:10:11 +00:00
Jeff Squyres
25a8944e09 Fixes trac:2776. Let the openib BTL auto-detect its bandwidth.
cmr:v1.5.4

This commit was SVN r24621.

The following Trac tickets were found above:
  Ticket 2776 --> https://svn.open-mpi.org/trac/ompi/ticket/2776
2011-04-19 16:31:36 +00:00
Eugene Loh
2770a12beb Continue clean up of thread options started in r22841, 22842, and 22849.
No need for any CMRs to 1.5... that was already done in CMR 2728.

This commit was SVN r24545.

The following SVN revision numbers were found above:
  r22841 --> open-mpi/ompi@b400b84162
2011-03-18 21:36:35 +00:00
Jeff Squyres
82f9474fec Revert r24533 and r24507 until the compile errors can be fixed.
This commit was SVN r24541.

The following SVN revision numbers were found above:
  r24507 --> open-mpi/ompi@4ce1936fed
  r24533 --> open-mpi/ompi@3204af2d36
2011-03-18 13:33:02 +00:00
Mike Dubman
3204af2d36 * temporary fix for ib btl compilation with old ofed versions 1.3.x.
This commit was SVN r24533.
2011-03-16 17:53:51 +00:00
Doron Shoham
4ce1936fed Fix the following for dynamic SL patch:
* rename ib_path_rec_service_level -> ib_path_record_service_level
* use mad.h and ib_types.h
* free all resources
* move ibv_post_recv to be just before ibv_post_send
* cleanup and beatify code

This commit was SVN r24507.
2011-03-10 16:19:00 +00:00
George Bosilca
4184baa67a Remove the proc_guid from the BTL proc structure. Instead use directly
the one stored in the ompi_proc_t.

This commit was SVN r24461.
2011-02-25 00:36:08 +00:00
Jeff Squyres
4cb8a42e7b Add btl_openib_gid_index MCA param to allow selecting which GID to use
from an openfabrics port's GID table.

This commit was SVN r24456.
2011-02-24 14:09:22 +00:00
Doron Shoham
47a0752856 max_hw_msg_size should be 0 (default) or greater
This commit was SVN r24455.
2011-02-24 09:17:18 +00:00
Doron Shoham
e41e15c8db cosmetic fixes in openib btl:
* replace tabs with ws
* remove unnecessary casting
* use proper escape codes for printf() like functions

This commit was SVN r24445.
2011-02-23 15:50:37 +00:00
Jeff Squyres
b468c71b47 Use complete types
This commit was SVN r24434.
2011-02-22 22:34:44 +00:00
Shiqing Fan
90eeba252e Make openib compile again for Windows.
Update the CMake script for checking mca subdirs.
Add windows support for __attribute__ packed structures.
Define usleep and posix_memalign with equivalent windows functions.
And a few minor fixes, type casts.

This commit was SVN r24429.
2011-02-22 15:49:27 +00:00
Doron Shoham
e5eef80364 fix type warning in openib btl
This commit was SVN r24419.
2011-02-21 15:13:30 +00:00
Donald Kerr
995d46344c simplify the way IBV_ACCESS_SO is discovered
This commit was SVN r24409.
2011-02-17 04:28:56 +00:00
Donald Kerr
2b60b165aa on Solaris, when IBV_ACCESS_SO is available, use strong ordered memory region for eager rdma connection
This commit was SVN r24395.
2011-02-16 05:37:22 +00:00
Ethan Mallove
c6fd141923 missing include
This commit was SVN r24368.
2011-02-09 17:59:55 +00:00
Eugene Loh
cd5c2e794f Some minor changes to help the openib BTL build and run on Solaris:
- poll() can return POLLRDNORM even if not requested (Solaris bug)
- MIN macro not defined in btl_openib.c
  and while we're at it, we clean up the MIN definition in ad_bgl_pset.h
- btl_openib_connect_rdmacm.c was calling rdma_destroy_id() twice
  leading to undefined behavior (a hang on Solaris)

This commit was SVN r24356.
2011-02-03 23:53:21 +00:00
Abhishek Kulkarni
3243b16bb3 Decode SOS error code before checking it with the native error code.
This commit was SVN r24281.
2011-01-20 23:21:38 +00:00
Rolf vandeVaart
acd38ff746 Final changes from jsquyres review. Moved configure
code from upper level into btl configure.m4.  Changed
prefix from "OMPI" to "BTL" in preprocessor macro.  Add
an mca param that shows it has been configured in.

This commit was SVN r24270.
2011-01-19 20:58:22 +00:00
Rolf vandeVaart
f22f76a6ff Add byte swapping macro for failover control message per jsquyres review.
This commit was SVN r24266.
2011-01-19 19:58:35 +00:00
Rolf vandeVaart
e75b86d3ab Fix some issues from jsquyres review.
1. Use asprintf instead of snprintf
2. Return remote_proc where possible.
3. Remove dead code.
4. Fix two comment typos.

This commit was SVN r24265.
2011-01-19 16:09:17 +00:00
Ethan Mallove
82054cb02c Include <stdlib.h> instead of <malloc.h>. This avoids a compiler error
on some systems caused by the definition of malloc in
opal_config_bottom.h getting expanded in the system malloc.h when
OPAL_ENABLE_MEM_DEBUG is set to 1.

This commit was SVN r24210.
2011-01-06 18:16:36 +00:00
Jeff Squyres
58445f3775 After being hit by "why is openib not working?" ''again'', add a
verbose statement that shows up when you --mca btl_base_verbose 100.
It clearly states that the openib BTL disqualifies itself when
MPI_THREAD_MULTIPLE is used.

This commit was SVN r24209.
2011-01-05 22:01:15 +00:00
Josh Hursey
bbfdf04a81 Fix a couple of 'unused variable' warnings, and one return value warning.
{{{
base/paffinity_base_service.c: In function ‘opal_paffinity_base_cset2mapstr’:
base/paffinity_base_service.c:623: warning: unused variable ‘range_last’
base/paffinity_base_service.c:623: warning: unused variable ‘range_first’
base/paffinity_base_service.c:622: warning: unused variable ‘count’
base/paffinity_base_service.c:622: warning: unused variable ‘m’
}}}

{{{
connect/btl_openib_connect_oob.c: In function ‘init_ud_qp’:
connect/btl_openib_connect_oob.c:1111: warning: control reaches end of non-void function
connect/btl_openib_connect_oob.c: In function ‘init_device’:
connect/btl_openib_connect_oob.c:1235: warning: unused variable ‘i’
connect/btl_openib_connect_oob.c: In function ‘get_pathrecord_sl’:
connect/btl_openib_connect_oob.c:1323: warning: unused variable ‘i’
}}}

This commit was SVN r24196.
2010-12-30 15:37:50 +00:00
Doron Shoham
834625cc51 Currently the service lever is passed as static parameter (ib_service_level), but the service level is possibly dynamic and if so the only way to get a proper value is to ask the SA.
New mca parameter is added (ib_path_rec_service_level) - positive value means that we should get the SL from the SA.

This is usable for torus topologies where different SL value is used for different endpoints.

A cache is kept of ib queue pairs used to communicate with the SA for a particular device and port and path record SL values retrieved from that SA.

The interaction with the cache assumes that there are no recursive calls to these routines. This must be solved either by code flow, by using higher level locks, or by adding a locking mechanism to these routines along with some method for avoiding deadlock.

This code use a UD queue pair to talk to the SA, and not need to chmod /dev/infiniband/umad* for use by normal users.  

The request to the SA is a SubnAdmGet(), not a SubnAdmGetTable().
In the future we might add a support of a SubnAdmGetTable(), but it will require implementing RMPP (Reliable Multi-Packet Transaction Protocol) and I'm not sure we want to do that.

This patched is based on the work of David McMillen <davem@systemfabricworks.com>.

This commit was SVN r24195.
2010-12-30 08:20:24 +00:00
Doron Shoham
bfe611d3bd This patch fixes bugs #2627 (1.5.2) and #2623 (1.4.2) - Sending large messages over RDMA fails.
The patch includes the following:
 *  Add new mca parameter - btl_openib_max_hw_msg_size - Maximum size (in bytes) of a single fragment of a long message when using the RDMA protocols (must be > 0 and <= hw capabilities).
 *  If btl_openib_max_hw_msg_size is larger than the maximum hw limitation print error message.
 *  Change the default openib flags to include only PUT and not GET.
 *  Print error message if user choose manually GET flag in openib btl.
 *  In prepare_dst: limit the message size to be the minimum of both endpoint's hw_limitation and the user limitation (if requested).

This commit was SVN r24191.
2010-12-23 11:48:43 +00:00
Shiqing Fan
f43862420c Convert the bad dos line endings to unix style for all windows related files.
This commit was SVN r24137.
2010-12-02 12:08:08 +00:00
Jeff Squyres
e4744b4ed5 Per http://www.open-mpi.org/community/lists/devel/2010/11/8671.php,
change a bunch of OMPI_<foo> names to OPAL_<foo>.

This commit was SVN r24046.
2010-11-12 23:22:11 +00:00
Rolf vandeVaart
c74df90729 Fix compile warnings for unused functions.
This commit was SVN r24033.
2010-11-10 21:04:32 +00:00
Rolf vandeVaart
da9c936ba0 Fix cut and paste error. Checking the wrong flags.
This commit was SVN r24029.
2010-11-10 19:09:47 +00:00
Rolf vandeVaart
50f8de7ab0 Fix up the debug functions.
This commit was SVN r24015.
2010-11-09 15:28:21 +00:00
Rolf vandeVaart
c23b26a66f Add a function to debug messages stuck in queues.
Change all tabs to spaces.

This commit was SVN r23974.
2010-11-01 14:23:34 +00:00
Ralph Castain
9ea2b196ce Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions.
Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component.

Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work.

This commit was SVN r23966.
2010-10-28 15:22:46 +00:00
Rolf vandeVaart
4cb414e0cd Fix a place where I did not error out frags.
This commit was SVN r23961.
2010-10-27 17:21:08 +00:00
Ralph Castain
86c7365e8e Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem.
Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution.

This commit was SVN r23943.
2010-10-26 02:41:42 +00:00
Ralph Castain
fceabb2498 Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.

Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.

Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.

I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:

1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)

2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.

There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.

This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
Rolf vandeVaart
364fcd8975 Should not overwrite des_context. Leftover from debugging.
This commit was SVN r23912.
2010-10-19 12:40:30 +00:00
Rolf vandeVaart
e9a7fea42d Fix up some of the failover code in the openib BTL.
Need to use MCA_BTL_IB_FAILED state to signel failure,
not MCA_BTL_IB_CLOSED.

This commit was SVN r23883.
2010-10-11 17:38:27 +00:00
Rolf vandeVaart
a91bd44463 Do not hand a function into this macro as the
function will get called twice.

This commit was SVN r23824.
2010-10-01 18:59:15 +00:00
Steve Wise
9862132836 Add T4 device IDs to openib btl params ini file.
This commit was SVN r23791.
2010-09-22 18:16:53 +00:00