This is closely related to Platform-MPI's old -prot feature.
The long-format of the tables it prints could look like this:
> Host 0 [myhost001] ranks 0 - 1
> Host 1 [myhost002] ranks 2 - 3
> Host 2 [myhost003] ranks 4
> Host 3 [myhost004] ranks 5
> Host 4 [myhost005] ranks 6
> Host 5 [myhost006] ranks 7
> Host 6 [myhost007] ranks 8
> Host 7 [myhost008] ranks 9
> Host 8 [myhost009] ranks 10
>
> host | 0 1 2 3 4 5 6 7 8
> ======|==============================================
> 0 : sm tcp tcp tcp tcp tcp tcp tcp tcp
> 1 : tcp sm tcp tcp tcp tcp tcp tcp tcp
> 2 : tcp tcp self tcp tcp tcp tcp tcp tcp
> 3 : tcp tcp tcp self tcp tcp tcp tcp tcp
> 4 : tcp tcp tcp tcp self tcp tcp tcp tcp
> 5 : tcp tcp tcp tcp tcp self tcp tcp tcp
> 6 : tcp tcp tcp tcp tcp tcp self tcp tcp
> 7 : tcp tcp tcp tcp tcp tcp tcp self tcp
> 8 : tcp tcp tcp tcp tcp tcp tcp tcp self
>
> Connection summary:
> on-host: all connections are sm or self
> off-host: all connections are tcp
In this example hosts 0 and 1 had multiple ranks, so "sm" was more
meaningful than "self" to identify how the ranks on the host are
talking to each other. Hosts 2..8 had one rank per host, so
"self" was more meaningful as their btl.
Above a certain number of hosts (12 by default) the above table gets too big,
so we switch to an abbreviated table that has the same data:
> host | 0 1 2 3 4 8
> ======|====================
> 0 : A C C C C C C C C
> 1 : C A C C C C C C C
> 2 : C C B C C C C C C
> 3 : C C C B C C C C C
> 4 : C C C C B C C C C
> 5 : C C C C C B C C C
> 6 : C C C C C C B C C
> 7 : C C C C C C C B C
> 8 : C C C C C C C C B
> key: A == sm
> key: B == self
> key: C == tcp
Then above 36 hosts we stop printing the 2d table entirely and just print the
summary:
> Connection summary:
> on-host: all connections are sm or self
> off-host: all connections are tcp
The options to control it are
-mca comm_method 1 : print the above table at the end of MPI_Init
-mca comm_method 2 : print the above table at the beginning of MPI_Finalize
-mca comm_method_max <n> : number of hosts <n> for which to print a full size 2d table
-mca comm_method_brief 1 : only print summary output, no 2d table
-mca comm_method_fakefile <filename> : for debugging only
* printing at init vs finalize:
The most important difference between these two is that when printing the table
during MPI_Init(), we send extra messages to make sure all hosts are connected to
each other. So the table ends up working against the idea of on-demand connections
(although it's only forcing the n^2 connections in the number of hosts, not the
total ranks). If printing at MPI_Finalize() we don't create any connections that
aren't already connected, so the table is more likely to have "n/a" entries if
some hosts never connected to each other.
* how many hosts <n> for which to print a full size 2d table
The option -mca comm_method_max <n> can be used to specify a number of hosts <n>
(default 12) that controls at what host-count the unabbreviated / abbreviated
2d tables get printed:
1 - n : full size 2d table
n+1 - 3n : shortened 2d table
3n+1 - inf : summary only, no 2d table
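The three ranges above can be sketched as a small selection function. This is
only an illustration of the thresholds; the enum and function names are
hypothetical and are not taken from hook_comm_method_fns.c.

```c
/* Hypothetical names; only the thresholds come from the text above. */
enum table_style { TABLE_FULL, TABLE_ABBREVIATED, TABLE_SUMMARY_ONLY };

/*
 * nhosts:   number of hosts in the job
 * max_full: the comm_method_max value <n> (12 by default)
 */
static enum table_style pick_table_style(int nhosts, int max_full)
{
    if (nhosts <= max_full) {
        return TABLE_FULL;            /* 1 .. n */
    }
    if (nhosts <= 3 * max_full) {
        return TABLE_ABBREVIATED;     /* n+1 .. 3n */
    }
    return TABLE_SUMMARY_ONLY;        /* 3n+1 and above */
}
```

With the default of 12 this gives the full table up to 12 hosts, the
abbreviated table up to 36 hosts, and the summary only above that.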
* brief
The option -mca comm_method_brief 1 can be used to skip the printing of the 2d
table and only show the short summary
* fakefile
This is a debugging option that allows easier testing of all the printout
routines by letting all the detected communication methods between the hosts
be overridden by fake data from a file.
The source of the information used in the table is the .mca_component_name field.
In the case of BTLs, the module always had a .btl_component linking back to the
component. The vars mca_pml_base_selected_component and ompi_mtl_base_selected_component
offer similar functionality for pml/mtl.
So with the ability to identify the component, we can then access
the component name with code like this
mca_pml_base_selected_component.pmlm_version.mca_component_name
See the three lookup_{pml,mtl,btl}_name() functions in hook_comm_method_fns.c,
and their use in comm_method() to parse the strings and produce an integer
to represent the connection type being used.
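One way to turn component name strings into small integers is a lookup table
that hands out a new id the first time a name is seen. The sketch below
assumes that approach; the struct, limits and function name are hypothetical
and this is not necessarily how comm_method() does it.

```c
#include <string.h>

#define MAX_METHODS  16
#define MAX_NAME_LEN 64

/* Hypothetical registry of the method names seen so far. */
struct method_table {
    char names[MAX_METHODS][MAX_NAME_LEN];
    int  count;
};

/*
 * Return the integer id for `name`, registering it if it has not been
 * seen before. Returns -1 if the name is invalid or the table is full.
 */
static int method_id(struct method_table *t, const char *name)
{
    if (NULL == name || strlen(name) >= MAX_NAME_LEN) {
        return -1;
    }
    for (int i = 0; i < t->count; ++i) {
        if (0 == strcmp(t->names[i], name)) {
            return i;                      /* already known */
        }
    }
    if (MAX_METHODS == t->count) {
        return -1;
    }
    strcpy(t->names[t->count], name);      /* length checked above */
    return t->count++;
}
```

Each cell of the 2d table can then store the integer, and the "key:" lines of
the abbreviated table map the integer back to the name.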
Signed-off-by: Mark Allen <markalle@us.ibm.com>
zero-size derived datatypes are now flagged as OPAL_DATATYPE_FLAG_CONTIGUOUS
so update mca_pml_ucx_init_datatype() to correctly handle them.
Since 'size' is a 'size_t', the assertion can simply be removed.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The rdma_frag attached to the send request was not correctly released
upon request completion, leaking until MPI_Finalize. A quick solution
would have been to add RDMA_FRAG_RETURN at different locations on the
send request completion, but it would have unnecessarily made the
sendreq completion path more complex. Instead, I added the length to
the RDMA fragment so that it can be completed during the remote ack.
Be more explicit on the comment.
The rdma_frag can only be freed once when the peer forced a protocol
change (from RDMA GET to send/recv). Otherwise the fragment will be
returned once all data pertaining to it has been transferred.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
When ob1 uses btl_put, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the handle sent is always the handle in the frag, but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request one.
I suggest checking whether the handle in the fragment is valid and, if not,
sending the handle from the request.
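A minimal sketch of the proposed check, using hypothetical stand-in structures
rather than the real ob1 request and fragment types:

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real ob1 structures have many more fields. */
struct reg_handle     { int id; };
struct sketch_request { struct reg_handle *local_handle; };
struct sketch_frag {
    struct reg_handle     *local_handle;  /* NULL when registered in the request */
    struct sketch_request *req;           /* owning send/recv request */
};

/* Pick the handle to put in the PUT control message. */
static struct reg_handle *handle_to_send(const struct sketch_frag *frag)
{
    if (NULL != frag->local_handle) {
        return frag->local_handle;        /* fragment has its own registration */
    }
    return frag->req->local_handle;       /* fall back to the request's handle */
}
```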
Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
If btl_get fails, ob1 first tries to fall back on btl_put, but
the return code was ignored, so the code fell back on both btl_put and
btl_send.
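The intended control flow can be sketched as follows. The callback type, the
return codes and the logging callbacks are hypothetical and only stand in for
the real ob1 put and send paths.

```c
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Hypothetical transfer callback: returns SKETCH_SUCCESS or SKETCH_ERROR. */
typedef int (*xfer_fn)(void *ctx);

/* Called after a get has failed: try put first, send only if put fails. */
static int fallback_after_get_failure(xfer_fn try_put, xfer_fn do_send, void *ctx)
{
    int rc = try_put(ctx);
    if (SKETCH_SUCCESS == rc) {
        return rc;            /* put took over; do not also send */
    }
    return do_send(ctx);
}

/* Example callbacks that record how often each path was taken. */
struct call_log { int put_calls; int send_calls; int put_rc; };

static int log_put(void *ctx)
{
    struct call_log *log = ctx;
    log->put_calls++;
    return log->put_rc;
}

static int log_send(void *ctx)
{
    struct call_log *log = ctx;
    log->send_calls++;
    return SKETCH_SUCCESS;
}
```

Ignoring the return code of the put path is equivalent to always calling
do_send() as well, which is the double fallback described above.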
Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
This does not fix any issue; it simply prevents a segfault if the
communicator creation has not happened as expected. This code path
should never be hit in a correct MPI application with valid
communicator creation support.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
- a set of UCX-related issues was reported that were caused
by conflicts between mmap API hooks. We added diagnostics for such
problems to simplify the bug-resolving pipeline
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
For peers on remote nodes, pack a smaller worker address that contains
network device addresses only. This reduces the amount of OOB traffic
during startup.
Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
We missed an assert to check whether ALLOW_OVERTAKE is set before
validating the sequence number, and this will cause a deadlock.
Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
Without this fix, an error handler invoked on a pml_ucx request would
segfault while trying to dereference requests[i]->req_mpi_object.comm.
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
The monitoring PML hides its existence from the OMPI infrastructure by
removing itself from the list of PML loaded components, remaining hidden
until MPI_Finalize.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
With this patch the best PML is selected earlier, before finalizing
the other PMLs. This provides a simpler mechanism to intercept and
hijack the PML (as done in the monitoring PML).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Several fixes to string handling:
1. strncpy() -> opal_string_copy() (because opal_string_copy()
guarantees to NULL-terminate, and strncpy() does not)
2. Simplify a few places, such as:
* Since opal_string_copy() guarantees to NULL terminate, eliminate
some memsets(), etc.
* Use opal_asprintf() to eliminate multi-step string creation
There's more work that could be done; e.g., this commit doesn't
attempt to clean up any strcpy() usage.
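To illustrate the guarantee in item 1, here is a minimal sketch of a bounded
copy that always terminates the destination. It is a hypothetical helper
written for this example, not the implementation of opal_string_copy().

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy at most dest_len - 1 characters from src and
 * always write a terminating '\0' (as long as dest_len is non-zero).
 * strncpy() leaves dest unterminated when src has dest_len or more chars.
 */
static void copy_terminated(char *dest, const char *src, size_t dest_len)
{
    size_t i;

    if (0 == dest_len) {
        return;                 /* no room even for the terminator */
    }
    for (i = 0; i + 1 < dest_len && '\0' != src[i]; ++i) {
        dest[i] = src[i];
    }
    dest[i] = '\0';
}
```

Because the terminator is always written, callers no longer need to memset()
the buffer or patch the last byte after the copy.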
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error. However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined. Rather than fix all the
usage in the code, we use the opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
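The BSD-like contract can be sketched with a portable wrapper built on
vsnprintf(). The function name is hypothetical and this is not the
opal_asprintf() implementation; it only shows the "pointer is NULL on
failure" guarantee.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical wrapper: returns the string length on success and -1 on
 * failure. *out is always set, and is NULL whenever -1 is returned.
 */
static int sketch_asprintf(char **out, const char *fmt, ...)
{
    va_list ap, ap2;
    int len;

    *out = NULL;                          /* defined value on every path */

    va_start(ap, fmt);
    va_copy(ap2, ap);
    len = vsnprintf(NULL, 0, fmt, ap);    /* first pass: measure only */
    va_end(ap);

    if (len >= 0) {
        char *buf = malloc((size_t)len + 1);
        if (NULL != buf && vsnprintf(buf, (size_t)len + 1, fmt, ap2) == len) {
            *out = buf;
        } else {
            free(buf);                    /* free(NULL) is a no-op */
        }
    }
    va_end(ap2);

    return (NULL != *out) ? len : -1;
}
```

Callers can then test either the return value or the pointer, and calling
free() on the pointer after a failed call is harmless.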
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.
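As a sketch of where this is heading, a dedicated atomic type can be a typedef
over an _Atomic-qualified integer when C11 atomics are available. The type and
function names below are hypothetical and are not the opal names.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical type name standing in for a dedicated atomic integer type. */
typedef _Atomic int32_t sketch_atomic_int32_t;

struct sketch_counter {
    sketch_atomic_int32_t value;   /* previously: volatile int32_t value; */
};

/* Atomically add delta and return the new value. */
static int32_t sketch_counter_add(struct sketch_counter *c, int32_t delta)
{
    return atomic_fetch_add(&c->value, delta) + delta;
}
```

Using a distinct type for atomic variables lets the same source compile
against either a volatile-based or an _Atomic-based definition of that type.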
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the new custom matching code in pml/ob1 so it can
now be enabled with a configure option. This commit also renames the
fuzzy-matching headers to avoid potential name conflicts and removes
the use of C reserved identifiers.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
- in some cases a persistent request was deleted during the completion
callback, which caused a double free of the linked UCX request (assert
in debug build or hang in release build)
- the UCX request is now freed prior to the completion callback
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>