There were a couple of issues with the memory leak fixes and several more verbose
issues. This fixes those issues.
cmr=v1.8.1:ticket=trac:4473
This commit was SVN r31273.
The following Trac tickets were found above:
Ticket 4473 --> https://svn.open-mpi.org/trac/ompi/ticket/4473
This fixes more issues identified by armci. More issues still remain and fixes are
coming for those as well.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31272.
Also, since I put some of the macros for these silent/verbose rules up
in the top-level Makefile.man-page-rules file, I renamed it to
Makefile.ompi-rules.
I've had this sitting around for a while; now seems like as good a
time as any to commit it.
This commit was SVN r31271.
directory not the job's
This bug didn't affect the correctness of the vader results just the
cleanup. This commit removes an error message about removing a non-existent
file.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31265.
Thanks to ggouaillardet for finding and fixing these issues.
Closes trac:4460
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31264.
The following Trac tickets were found above:
Ticket 4460 --> https://svn.open-mpi.org/trac/ompi/ticket/4460
Also added some missing values and sentinels.
cmr=v1.8:ticket=trac:4470
This commit was SVN r31263.
The following SVN revision numbers were found above:
r31260 --> open-mpi/ompi@69036437b7
The following Trac tickets were found above:
Ticket 4470 --> https://svn.open-mpi.org/trac/ompi/ticket/4470
If ompi_modex_recv() fails with OPAL_ERR_DATA_VALUE_NOT_FOUND, it
simply means that the peer process did not put any usnic BTL modex
info -- it is not an error. So have the usnic BTL simply ignore that
peer (vs. disqualifying itself / treating this like a real error).
Refs trac:4442.
This commit was SVN r31258.
The following Trac tickets were found above:
Ticket 4442 --> https://svn.open-mpi.org/trac/ompi/ticket/4442
This commit should finish the work started for #869. Closing that ticket
with this commit.
Closes trac:869
cmr=v1.8.1:reviewer=jsquyres
This commit was SVN r31257.
The following Trac tickets were found above:
Ticket 869 --> https://svn.open-mpi.org/trac/ompi/ticket/869
Add a lot more information about the --level CLI option, and the nine
levels.
Also remove some now-erroneous examples regarding --version.
cmr=v1.8:reviewer=rhc
This commit was SVN r31246.
in ompi_mtl_mxm_add_procs, define the ep_index variable only
for an older version of mxm.
submitted by Alina, reviewed by Mike.
cmr=v1.8:reviewer=ompi-rm1.8
This commit was SVN r31245.
The error doesn't prevent the user from running so there is no reason
to display it unless the user requested it (through coll_ml_verbose).
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31242.
Fix a one line bug when dealing with non-contiguous sends in prepare_src. Bug was
identified by the intel test suite.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31232.
the case fix in ompi_osc_base_process_op in r31204.
There are two cases that needed to be handled:
- The target is a simple datatype (contiguous block of a primitive
type) but the origin is not. In this case we still need to pack
the origin data but we can not rely on the convertor to do the
unpack (see r31204).
- Both the origin and target datatypes are simple datatypes. In this
case we can use ompi_op_reduce to do the accumulation without having
to pack the origin data.
cmr=v1.8:ticket=trac:4449
This commit was SVN r31231.
The following SVN revision numbers were found above:
r31204 --> open-mpi/ompi@949abe45cd
The following Trac tickets were found above:
Ticket 4449 --> https://svn.open-mpi.org/trac/ompi/ticket/4449
Fixed two bugs:
- Use module->comm NOT comm to get the CID for the shared memory backing
file. This fixes the case where there are multiple shared memory windows
at the same time.
- Remember to unlink the shared memory backing file.
Refs trac:4438
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31227.
The following Trac tickets were found above:
Ticket 4438 --> https://svn.open-mpi.org/trac/ompi/ticket/4438
This commit fixes two bugs:
- We were not correctly setting the lock type in the outstanding lock
for lock_all. This caused undefined behavior.
- flush_all was incorrectly checking for comm size - 1 lock acks but
comm size flush acks. This is the reverse of what was intended.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31226.
In most cases, bad messages received by the connectivty checker are
just dropped. However, in one specific code path, a bad packet caused
an abort. Doh!
This commit does two things:
1. Improve verbose messages for all these cases
1. Simply drop incoming messages that cannot be identified as ACKs or PINGs
Submitted by Jeff Squyres, reviewed by Dave Goodell.
cmr=v1.8:reviewer=ompi-rm1.8
This commit was SVN r31225.
It is possible to get into a situation where a small accumulate operation
can not be completed because a large accumulate operation holds the lock.
In this case we may return from wait/flush/etc before the operation is
complete. To handle this case increment the expected incoming fragment
count when queuing an accumulate operation and increment the incoming
fragment count after processing the accumulate operation.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31224.
of the primitive datatype
In this case we can not use the convertor to run the accumulate operation
since the datatype is a more or less a primitive type.
cmr=v1.8:ticket=trac:4449
This commit was SVN r31222.
The following Trac tickets were found above:
Ticket 4449 --> https://svn.open-mpi.org/trac/ompi/ticket/4449
This commit fixes two issues:
- osc/rdma: The target side of an accumulate was using the target datatype
in the receive to the packed buffer. This was conflicting with the way
the reduction is done into the target buffer. Changed the receive to use
the primitive datatype.
- osc/base: The copy table was completely wrong. Fixed the table to match
the underlying datatypes (which are opal not ompi datatypes).
- osc/base: There is a problem using the optimized description. Fall back
on using the non-optimized description until we can understand what is
going wrong.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31204.
results
The code to handle completion messages did not correctly increment the
number of expected messages. This could cause wait to return before all
incoming messages are complete.
I also added a check to ensure that start returns an error if we are in
a passive access epoch.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31203.
ROMIO code assumes all processes will use the same ROMIO driver. we
were not reaching the "find a common file system" logic when NFS was
enabled, everyone stat-ed the file system without errors, but some
processees found a different file system (like if some processes are
writing to NFS and others to UFS)
See discussion beginning here:
http://lists.mpich.org/pipermail/discuss/2014-March/002403.html
Tested-by: Jeff Squyres <jsquyres@cisco.com>
Submitted by Rob Lathan, reviewed by Jeff Squyres
cmr=v1.8:reviewer=ompi-rm1.8
This commit was SVN r31201.
the ompi_common_verbs_find_ports function had a call to
ompi_ibv_get_device_list, but not to ompi_ibv_free_device_list.
fixed by Alina, reviewed by Vasily/Mike.
cmr=v1.8:reviewer=ompi-rm1.8
This commit was SVN r31200.
This commit adds large datatype description support to the osc/rdma
component. Support is provided by an additional send/recv of the datatype
description if the description does not fit in an eager buffer. The
code is designed to require minimal new code and not for speed. We
consider this code path to be a slow path.
Refs trac:1905
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31197.
The following Trac tickets were found above:
Ticket 1905 --> https://svn.open-mpi.org/trac/ompi/ticket/1905
When we are only using local ranks basesmuma needs to provide an allreduce
function for both large and small message or else the coll/ml selection
logic will fail. In the future this logic should probably be updated to
just disable allreduce in coll/ml instead of disabling coll/ml. For now
it should be correct to say the basesmuma allgather works for larger
messages.
cmr=v1.8:reviewer=manjugv
This commit was SVN r31190.
a hierarchy actually matches a bcol that is in use.
There was a bug in one of the paths to calculate the ml buffer size. I fixed
the bug and squashed all the paths together to avoid further issues (the
result was correct in another path that calculated the same value).
Additionally, the i_hier was being used as the bcol_index. This is not
correct in a couple of cases so I added a variable to keep track of the
real bcol_index.
cmr=v1.8:reviewer=pasha
This commit was SVN r31189.
bound.
This case is correctly handled by coll/ml so remove the check that diables
coll/ml in the not bound case.
cmr=v1.8:reviewer=manjugv
This commit was SVN r31188.
This patch fixes two leaks:
- Fix typo in fallback collective code that caused coll/ml to retain
the ibcast module twice but only release it once. One of those ibcast
saves was supposed to be bcast.
- Do not check for module initialization in the module destructor. It
is possible to destruct a module that is partially setup.
cmr=v1.8:reviewer=manjugv
This commit was SVN r31187.
Without this, an attribute copy function could return non-success, but
it would not be propagated upwards. This caused the intel
MPI_Keyval3_* tests to fail.
cmr=v1.8:reviewer=hjelmn
This commit was SVN r31147.
When you call MPI_Graph_create with a old_comm of size N, and pass
nnodes=(N=1), then the Nth proc is supposed to get MPI_COMM_NULL out.
The code in this base function didn't properly handle the proc(s) that
are supposed to get MPI_COMM_NULL out.
cmr=v1.7.5:reviewer=hjelmn
This commit was SVN r31145.
This isn't causing any errors that I know about but it does fix an
annoying valgrind warning. Simple fix, no review required.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31130.
There are situations where coll/ml does not initialize properly. These will
eventually need to be fixed but in the meantime it is better to not always
print an error message because the collective framework can still fall back
on another collective module. This commit reduces the verbose output.
cmr=v1.7.5:reviewer=manjugv
This commit was SVN r31129.
It is usually not a good idea to assert when something is not implemented
or something goes wrong. Replace asserts with debug output and return.
cmr=v1.7.5:reviewer=manjugv
This commit was SVN r31128.
The necessary information is stored in the proc object. There is no need
to allgather the local process data to determine if another rank is on
the same socket.
cmr=v1.7.5:reviewer=manjugv
This commit was SVN r31127.
After discussion with Manju we decided to update these the process count
limits of the shared memory collectives to an arbitrarily large number.
cmr=v1.7.5:ticket=trac:4405
This commit was SVN r31126.
The following SVN revision numbers were found above:
r31096 --> open-mpi/ompi@3f469d08e7
The following Trac tickets were found above:
Ticket 4405 --> https://svn.open-mpi.org/trac/ompi/ticket/4405
* Ensure that all endpoints[x] values are initialized to NULL
* If ibv_create_ah fails, remove each endpoint from the
module->all_endpoints list so that the endpoint can be destructed
properly.
Submitted by Jeff Squyres, reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31111.
This provides full locality - i.e., not just node-level, but all the way down to whatever common binding level exists between the procs.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31106.
Also fixed spelling: IS_NOT_RECHABLE -> IS_NOT_REACHABLE.
Also mark a few places where opal_show_help() should have been used;
Manju will take care of these.
This commit was SVN r31104.
In r31071 I modified the logic to not increment the hierarchy level if
no processes were selected by that sbgp. That fixed a problem seen on
systems where we don't support process binding. The problem is there
is a case where we actually did select processes yet the number of
selected processes is 0. We need to increment the hierarchy in this case
as well.
This should fix the segmentation fault found by recent MTT runs. Once
this is committed to 1.7.5 remove the .ompi_ignore's from coll/ml and
bcol/ptpcoll. Tested with ompi-tests/ibm.
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31081.
The following SVN revision numbers were found above:
r31071 --> open-mpi/ompi@1911d97044
This was causing JVMs to run out of stack space, and all manner of
badness ensued.
Instead, use the heap -- that's what it's there for.
cmr=v1.7.5:reviewer=rhc:subject=make coll/ml use the heap for large debug array
This commit was SVN r31073.
fails to select any processes on any nodes.
Also modified basesmsocket to only print debugging info to the framework
output.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31071.
These parameters should not be marked as INTENT(OUT) (they aren't in
the MPI-3 standard).
This commit was SVN r31056.
The following Trac tickets were found above:
Ticket 4372 --> https://svn.open-mpi.org/trac/ompi/ticket/4372
* Several parameters should not be marked as INTENT(OUT) (they aren't in
the MPI-3 standard).
* Added missing PMPI F08 OMPI interfaces
This commit was SVN r31049.
The following Trac tickets were found above:
Ticket 4372 --> https://svn.open-mpi.org/trac/ompi/ticket/4372
These parameters should not be marked as INTENT(OUT) (they aren't in
the MPI-3 standard).
This commit was SVN r31048.
The following Trac tickets were found above:
Ticket 4372 --> https://svn.open-mpi.org/trac/ompi/ticket/4372
It is not valid to call flush outside a passive target epoch nor is
it valid to call lock/lock_all when no_locks is set. In the former
we were just semantically incorrect and the later would crash and
burn.
cmr=v1.7.5:ticket=trac:4382
This commit was SVN r31046.
The following Trac tickets were found above:
Ticket 4382 --> https://svn.open-mpi.org/trac/ompi/ticket/4382
This fixes a bug in r31029 which removes the use of the pml base request
(also not a good way since cm doesn't use the base request). We now allocate
a data structure (ugh) to determine the needed information. Tested with
mtt/onesided.
cmr=v1.7.5:ticket=trac:4379
This commit was SVN r31044.
The following SVN revision numbers were found above:
r31029 --> open-mpi/ompi@29e00f9161
The following Trac tickets were found above:
Ticket 4379 --> https://svn.open-mpi.org/trac/ompi/ticket/4379
- Return an error if the caller specified both MPI_MODE_NOPRECEDE and
MPI_MODE_NOSUCCEED to MPI_Win_fence.
- Return an error if the caller attempts to enter an active target
epoch while already in a passive target epoch.
- End an active target epoch if MPI_Win_fence is called with
MPI_MODE_NOSUCCEED.
cmr=v1.7.5:ticket=trac:4382
This commit was SVN r31043.
The following Trac tickets were found above:
Ticket 4382 --> https://svn.open-mpi.org/trac/ompi/ticket/4382
need to enable the access epoch in MPI_Win_fence.
I missed this change when I fixed the semantics of MPI_Win_create. With
this commit our one-sided MTT runs are now running clean.
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r31041.
See https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/377
This ticket adds the following functions to the standard:
- MPI_T_cvar_get_index, MPI_T_pvar_get_index, and MPI_T_category_get_index
The ticket has passed and the functions are part of MPI-3.1 that will
be released sometime later this year. In Open MPI the functions expose
existing internal functionality so they are low-risk to add to 1.8.0. I
will leave it up to Ralph whether he wants to accept these into 1.8.
cmr=v1.8:reviewer=rhc
This commit was SVN r31037.
We no longer specify interfaces with choice buffers in the TKR "mpi"
module implementation -- MPI-3 prohibits it (see r30169 and r30170 for
more details).
cmr=v1.7.5:ticket=trac:4372
This commit was SVN r31033.
The following SVN revision numbers were found above:
r30169 --> open-mpi/ompi@759ee33fd4
r30170 --> open-mpi/ompi@776f6144af
The following Trac tickets were found above:
Ticket 4372 --> https://svn.open-mpi.org/trac/ompi/ticket/4372
It seems we can't release accumulate buffers in completion callbacks
because the btls don't release registration resources until after the
callback has fired. The fix is to keep track of the unused buffers and
free them later. This should resolve issues when running IMB-EXT and
IMB-RMA.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31029.
Issue found by Absoft MTT runs.
cmr=v1.7.5:ticket=trac:4372
This commit was SVN r31028.
The following Trac tickets were found above:
Ticket 4372 --> https://svn.open-mpi.org/trac/ompi/ticket/4372
Issue found by Absoft MTT runs.
cmr=v1.7.5:ticket=trac:4372
This commit was SVN r31027.
The following Trac tickets were found above:
Ticket 4372 --> https://svn.open-mpi.org/trac/ompi/ticket/4372
Dave Goodell correctly pointed out that it is unusual to return MPI
error classes from internal ompi functions. Correct this in the RMA
case by adding an internal error code to match MPI_ERR_RMA_SYNC.
This does change OMPI_ERR_MAX. I don't think this will cause any
problems with ABI.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31012.
I have only checked that these bindings compile without warnings. They
appear to work with both intel's compilers and gfortran.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31010.
This commit resolves a number of crashed discovered my the onesided
tests in MTT. The functions in question were operating on the assumption
the user was calling RMA functions correctly.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31008.
NOTE: I transferred the oshmem-disabled-by-default from the 1.7 branch to the trunk to minimize future disruption if/when we change that option.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31006.
- -check-shmem-params is OFF by default. It checks OSHMEM API params and will abort on bad input
- hcoll do not save fallback coll pointers for unsupported collectives.
fixed by Val, Roman, reviewed by Miked/Igor
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30995.
The datatype unpacking code assumes that the packed datatype buffer has the
same alignment as an OPAL_PTRDIFF_TYPE. This was not enforced by the rdma
one-sided component. I changed the ordering and sized of various osc/rdma
headers to ensure their sizes are a multiple of 8-bytes and modified the
fragment allocation call to ensure all headers are 8-byte aligned. While
not the cleanest way to handle this situation it should resolve the issue.
Fixes trac:4315
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30974.
The following Trac tickets were found above:
Ticket 4315 --> https://svn.open-mpi.org/trac/ompi/ticket/4315
assumes the send request is derived from mca_pml_base_send_request_t,
but this is not true for pml cm, so we end up freeing invalid pointer.
We cannot take the data pointer from the pml send request, so we pass
the allocated buffer pointer in req_complete_cb_data, and put the
osc_rdma_module pointer in that buffer as well.
Previously, osc_pt2pt was used with pml_cm which didn't have this
problem.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30967.
* clang warning stomp
* memory barrier for volatile variable use
These can go to 1.7.5 or can slip to v1.8 -- RM decision.
Submitted by Jeff Squyres, reviewed by Dave Goodell
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30944.
* Older versions of libusnic_verbs actually return 0 when querying for
an unknown port. So also check for a magic ID in the returned data
to *really* know if the usnic extensions are supported.
* Use a union (in the common_verbs area) and memcpy (in the btl) to
avoid undefined C type aliasing behavior.
* Ensure to memset the function table to 0 if the usnic extensions
are not supported.
Submitted by Jeff Squyres, reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30935.
When compiling --with-ft there are a few compiler warnings about
unused variables. This patch fixes those compiler warnings.
This commit was SVN r30927.
Realistically, the usnic BTL doesn't need to know anything about the
underlying transport except for its header length (so that it knows
where the payload begins in a received buffer). So remove the use of
the specific transport prefix union and just rely on the usnic verbs
extension to tell us what the header length is if we're using the
usNIC/UDP transport, or sizeof(struct ibv_grh) if we're using usNIC/L2
transport.
This commit was SVN r30914.
Check the IBV_TRANSPORT_* values. In the case of IBV_TRANSPORT_IWARP,
there's an ambiguity and we need to also check to see whether the
usnic verbs externsion probe exists.
This commit was SVN r30913.
If they exist, call the usnic verbs extensions to both enable UDP
support and get the UD receiver header length that should be used
(rather than assume 40/struct GRH).
This commit was SVN r30912.
- now add_procs can be called more than once (during MPI_INIT and Inter_Comm_Create)
- adjust MXM to this reality
fixed by Alina, reviewed by Yossi/Mike
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30907.
This allows compilers to know that the code path(s) where
ompi_rte_abort() is invoked won't return (and therefore won't warn in
certain cases).
cmr=v1.8:reviewer=rhc
This commit was SVN r30891.
We don't use this functionality any more; we use the transport_type
and device name to identify usnic devices. It's slightly easier
because we can transport_type+name from ibv_device_open() and don't
have to do an additional ibv_query_device() to get its attributes.
Reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30882.
Follow on to SVN trunk r30850: consolidate the ibv_create_ah() calls
into a single loop, MPI_WAITALL-style. That is, call the (effectively
non-blocking) ibv_create_ah() for each endpoint. If we get
NULL+EAGAIN, it means that the UDP ARP is still ongoing down in the
kernel, so just try again later. We put these all into a single loop
because it allows us to parallelize the ARP progress in the kernel.
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30879.
The following SVN revision numbers were found above:
r30850 --> open-mpi/ompi@3641500442
r30852 --> open-mpi/ompi@4e282a3295
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Basically: since usnic is a connectionless transport, we do not get
OS-provided services "for free" that connection-oriented transports
get, namely: "hey, I wasn't able to make a connection to peer X", and
"hey, your connection to peer X has died."
This connectivity-checker runs in a separate progress thread in the
usnic BTL in local rank 0 on each server. Upon first send in any
process, the connectivty-checker agent will send some UDP pings to the
peer to ensure that we can reach it. If we can't, we'll abort the job
with a nice show_help message.
There's a lengthy comment in btl_usnic_connectivity.h explains the
scheme and how it works.
Reviewed by Dave Goodell.
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30860.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
finalize.
Closes trac:4290
cmr=v1.7.5:reviewer=miked
This commit was SVN r30854.
The following Trac tickets were found above:
Ticket 4290 --> https://svn.open-mpi.org/trac/ompi/ticket/4290
- Fix several typos is osc/rdma.
- Fix a locking issue in osc/sm that was caused by an incorrect
assumption about the semantics of opal_atomic_add_32.
- Always unlock the accumulation lock in osc/sm.
- The base of a processes shared memory window should be NULL if
the size is zero. Fixed.
cmr=v1.7.5:ticket=trac:4304
This commit was SVN r30853.
The following Trac tickets were found above:
Ticket 4304 --> https://svn.open-mpi.org/trac/ompi/ticket/4304
Follow on to SVN trunk r30850: consolidate the ibv_create_ah() calls
into a single loop, MPI_WAITALL-style. That is, call the (effectively
non-blocking) ibv_create_ah() for each endpoint. If we get
NULL+EAGAIN, it means that the UDP ARP is still ongoing down in the
kernel, so just try again later. We put these all into a single loop
because it allows us to parallelize the ARP progress in the kernel.
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30852.
The following SVN revision numbers were found above:
r30850 --> open-mpi/ompi@3641500442
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
ibv_create_ah() may need to effect an ARP resolution, which may take
some time. Rather than blocking in ibv_create_ah(), the usNIC driver
may return NULL and set errno to EAGAIN indicating that we should try
again (i.e., the ARP resolution is proceeding under the covers).
So add a simple loop here to loop over ibv_create_ah() until it
returns non-(NULL+EAGAIN). A future commit will make this a bit more
efficient.
Authored-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Dave Goodell <dgoodell@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30850.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Prior to this commit we matched local interfaces to remote interfaces in
order to create endpoints in a simplistic way. If any remote interfaces
were on the same subnet as any of our local interfaces then only local
interfaces would be paired (IP-routed remote interfaces would be
ignored).
This commit introduces a more general scheme which attempts to make the
"best" pairing of local interfaces to remote interfaces. We now cast
the problem as a graph theory problem known as the "Assignment Problem",
or finding a maximum-cardinality, minimum-weight bipartite matching. We
solve this problem by reducing the bipartite graph of interface
connectivity to a flow network and then solving for a minimum cost flow.
This is then easily converted into back into a matching on the original
bipartite graph.
In the new scheme, interfaces on the same subnet are preferred over
interfaces requiring intermediate routing hops and higher bandwidth
links are preferred over lower bandwidth links.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30849.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Querying the OS routing table is important for making decisions about
which local and remote interfaces should be paired into reliable
communication channels.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30848.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
This code is intended to support usNIC interface matching functionality.
We currently view that problem as essentially the "Assignment Problem"
(http://en.wikipedia.org/wiki/Assignment_problem), for which there are
many possible solution approaches, including flow-network analysis. In
the future, we might transition to a more nuanced view of the problem
which would likely also be flow-network based.
To this end, the current code focuses on providing one major algorithm
to the core usnic BTL: `ompi_btl_usnic_solve_bipartite_assignment`. It
also exposes several typical and necessary functions for constructing,
manipulating, and querying weighted, directed graphs.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30847.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30846.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
This commit adds mechanisms for writing and running unit tests in the
usnic BTL. The short version of how to run the tests is:
1. Configure with `--enable-ompi-btl-usnic-unit-tests`. This will cause
the unit testing code and test runner utility to be built.
2. Run the tests by running `ompi_btl_usnic_run_tests`.
See `README.test` for full details.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30845.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
These includes only exist in the Cisco-internal usnic-v1.6 code base,
but they should not exist anywhere except btl_usnic_compat.h in order to
minimize source differences between usnic-v1.6 and v1.7/trunk.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30844.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253
Lower layer (hardware or software) bugs can result in a mismatch between
our BTL-layer payload size and the actual packet length. We now check
that in order to catch these cases, which otherwise can result in
MPI-layer message corruption.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
cmr=v1.7.5:ticket=trac:4253
This commit was SVN r30843.
The following Trac tickets were found above:
Ticket 4253 --> https://svn.open-mpi.org/trac/ompi/ticket/4253