Per RFC which expired two weeks ago:
We are planning to make a change to Open MPI to always set up the btls. This
means the btl init will be called even if add_procs is never called for that
btl. In the openib btl free lists fragments are currently allocated in btl_init.
To avoid wasting that memory this commit moves that final device setup to
the add_procs function. This included allocating free lists, and starting the
async event thread.
At this time this change is safe since we have a barrier after add_procs in
MPI_Init. If this changes we will need to re-think some of the initialization
since we might have the possibility of a connection request before add_procs
is called.
Tested with Mellanox ConnectX2 and QLogic HCAs.
Commit also cleans up tabs in btl_openib_async.c.
cmr=v1.7.5:reviewer=miked
This commit was SVN r30122.
http://www.open-mpi.org/community/lists/devel/2013/10/13072.php
Add support for pinning GPU Direct RDMA in openib BTL for better small message latency of GPU buffers.
Note that none of this is compiled in unless CUDA-aware support is requested.
This commit was SVN r29680.
Let imagine that we have two btls in btl_openib_component_init() both points to the same openib_btl->device and as a result have the same openib_btl->device->endpoints array.
Finalization phase calls twice mca_btl_openib_finalize()->mca_btl_openib_finalize_resources().
mca_btl_openib_finalize_resources() frees endpoint related btl. But the second call of mca_btl_openib_finalize_resources() checks endpoint that is released by previus call.
fixed by Igor, reviewed by miked/vasily
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r29563.
Turns out that AC_CHECK_DECLS is one of the "new style" Autoconf
macros that #defines the output to be 0 or 1 (vs. #define'ing or
#undef'ing it). So don't check for "#if defined(..."; just check for
"#if ...".
This commit was SVN r29059.
The following Trac tickets were found above:
Ticket 3730 --> https://svn.open-mpi.org/trac/ompi/ticket/3730
Commit r27211 added ifdef checks for #define
HAVE_IBV_LINK_LAYER_ETHERNET, which is incorrect. The correct #define
is HAVE_DECL_IBV_LINK_LAYER_ETHERNET. This broke OMPI over iWARP.
This fixes trac:3726 and should be added to cmr:v1.7.3:reviewer=jsquyres
This commit was SVN r29053.
The following SVN revision numbers were found above:
r27211 --> open-mpi/ompi@b27862e5c7
The following Trac tickets were found above:
Ticket 3726 --> https://svn.open-mpi.org/trac/ompi/ticket/3726
* add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit.
* remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL"
* modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded
* removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base
* added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames
This commit was SVN r29052.
This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times.
Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:
* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally
* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test
* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).
Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it
Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.
This commit was SVN r29040.
value to signal that the operation of retrieving the element from the free list
failed. However in this case the returned pointer was set to NULL as well, so the
error code was redundant. Moreover, this was a continuous source of warnings when
the picky mode is on.
The attached parch remove the rc argument from the OMPI_FREE_LIST_GET and
OMPI_FREE_LIST_WAIT macros, and change to check if the item is NULL instead of
using the return code.
This commit was SVN r28722.
(i.e., ensure that more data items get zeroed out/set to NULL) so that
if something goes wrong during initialization, we don't try to clean
up something that isn't there (and segv).
The chance of this happening on the trunk is very low (and will also
be low once the verbs improvements are brought over to v1.7). But it
can actually happen in the v1.6 branch (e.g., if no CPC is available,
we'll try to get the length of the endpoints list, but the endpoints
list is NULL).
Hence, even though the real goal is to get this functionality over to
v1.6, I figured I'd commit to the trunk/CMR to v1.7 just to try to
keep commonality in the openib between all three where possible.
This commit was SVN r28317.
ompi_show_help, because opal_show_help is replaced with an
aggregating version when using ORTE, so there's no reason to
directly call orte_show_help.
This commit was SVN r28051.
* btl sendi(): if message can be send inline try to avoid signal
* signal is requested one per 64 or when
there are no send wqes
when message can not be send inline
any other btl method then sendi()
This commit was SVN r27724.
some new common OpenFabrics functionality to ompi/mca/common/verbs.
Also move everything that was in ompi/mca/common/ofautils under
ompi/mca/common/verbs.
* Move ofautils -> verbs
* Add new functionality in ompi/mca/common/verbs (see doxygen
* comments in ompi/mca/common/verbs/common_verbs.h for details):
* ompi_common_verbs_find_ibv_ports()
* ompi_common_verbs_port_bw()
* ompi_common_verbs_mtu()
* '''If you're writing verbs-based code, you should be using this
common functionality'''
* Adapt openib BTL to use some trivial common functionality in
common/verbs
* Don't use "#ifdef OMPI_HAVE_RDMAOE",use
"#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)"
* Update the following to include/link against common/verbs
* bcol/iboffload
* sbgp/ibnet
* btl/openib
This commit was SVN r27212.
that causes MPI jobs to abort if there is not enough registered memory
available (vs. just warning).
This commit was SVN r27140.
The following Trac tickets were found above:
Ticket 3258 --> https://svn.open-mpi.org/trac/ompi/ticket/3258
1000. Refs trac:3154.
IB/iWarp vendors need to get together to figure out a real fix.
This commit was SVN r26777.
The following SVN revision numbers were found above:
r26730 --> open-mpi/ompi@5315c91baf
The following Trac tickets were found above:
Ticket 3154 --> https://svn.open-mpi.org/trac/ompi/ticket/3154
* If the MCA param btl_openib_cq_size is set to 0 (which is the
default), use the device CQ max size. Otherwise, use the MCA param
value (and never adjust it again).
* Remove the CQ size adjustment code. Since we default to max CQ
size, there really isn't much point in having it any more. I think
people setting an absolute CQ size is going to be rare, so let's
not do anything fancy with it.
* If the MCA param value is larger than what the device supports,
print a warning (only once per process) and default to using the
device max
* Add a BTL_VERBOSE displaying which CQ size we used
This commit was SVN r26730.
The following Trac tickets were found above:
Ticket 3152 --> https://svn.open-mpi.org/trac/ompi/ticket/3152
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.
- poll() can return POLLRDNORM even if not requested (Solaris bug)
- MIN macro not defined in btl_openib.c
and while we're at it, we clean up the MIN definition in ad_bgl_pset.h
- btl_openib_connect_rdmacm.c was calling rdma_destroy_id() twice
leading to undefined behavior (a hang on Solaris)
This commit was SVN r24356.