default
And other streamlining of aborting behavior.
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
Remove OMPI_COMM_ERRORS and use NOHANDLE macros instead.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
route unbound errors to self error handler
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
Do not raise the error handler from within components
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
This commit updates the btl interface to change the parameters
passed to receive callbacks. The interface used to pass the tag,
a btl base descriptor, and the callback context. Most of the
values in the btl base descriptor were unused and only helped
simplify the callbacks from the self btl. All of the arguments
have now been replaced with a single receive callback descriptor.
This descriptor contains the incoming endpoint, data segment(s),
tag, and callback context. All btls have been updated to use
the new callback and the btl interface version has been bumped
to v3.2.0.
As part of this change the descriptor argument (and the segments
contained within it) have been marked as const. The were treated
as const before but this change could allow the compiler to make
better optimization decisions and will enforce that the callback
does not attempt to change the data in the descriptor.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
- added detection of new API into configuration
- added tag_send call implemented using new API
- added MPI_Send/MPI_Isend/MPI_Recv/MPI_Irecv implementations
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
For direct modex, all procs publish the selected pml module
and then at add_procs pml module for each proc is checked
against every other proc in the add_proc call.
For full modex, there is no change in functionality. Only Rank0
publishes its selected pml, all other procs in the add_proc call
check their selected pml against Rank0.
If pml's do not match, throw error and exit.
Signed-off-by: Dipti Kothari <dkothar@amazon.com>
Pass the correct ompi_proc_t and array length to
mca_pml_base_pml_check_selected() during dynamic modex.
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
Add a framework to support different types of threading models including
user space thread packages such as Qthreads and argobot:
https://github.com/pmodels/argobotshttps://github.com/Qthreads/qthreads
The default threading model is pthreads. Alternate thread models are
specificed at configure time using the --with-threads=X option.
The framework is static. The theading model to use is selected at
Open MPI configure/build time.
mca/threads: implement Argobots threading layer
config: fix thread configury
- Add double quotations
- Change Argobot to Argobots
config: implement Argobots check
If the poll time is too long, MPI hangs.
This quick fix just sets it to 0, but it is not good for the
Pthreads version. Need to find a good way to abstract it.
Note that even 1 (= 1 millisecond) causes disastrous performance
degradation.
rework threads MCA framework configury
It now works more like the ompi/mca/rte configury,
modulo some edge items that are special for threading package
linking, etc.
qthreads module
some argobots cleanup
Signed-off-by: Noah Evans <noah.evans@gmail.com>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
We currently save the hostname of a proc when we create the ompi_proc_t for it. This was originally done because the only method we had for discovering the host of a proc was to include that info in the modex, and we had to therefore store it somewhere proc-local. Obviously, this ccarried a memory penalty for storing all those strings, and so we added a "cutoff" parameter so that we wouldn't collect hostnames above a certain number of procs.
Unfortunately, this still results in an 8-byte/proc memory cost as we have a char* pointer in the opal_proc_t that is contained in the ompi_proc_t so that we can store the hostname of the other procs if we fall below the cutoff. At scale, this can consume a fair amount of memory.
With the switch to relying on PMIx, there is no longer a need to cache the proc hostnames. Using the "optional" feature of PMIx_Get, we restrict the retrieval to be purely proc-local - i.e., we retrieve the info either via shared memory or from within the proc-internal hash storage (depending upon the active PMIx components). Thus, the retrieval of a hostname is purely a local operation involving no communication.
All RM's are required to provide a complete hostname map of all procs at startup. Thus, we have full access to all hostnames without including them in a modex or having to cache them on each proc. This allows us to remove the char* pointer from the opal_proc_t, saving us 8-bytes/proc.
Unfortunately, PMIx_Get does not currently support the return of a static pointer to memory. Thus, even though PMIx has the hostname in its memory, it can only return a malloc'd version of it. I have therefore ensured that the return from opal_get_proc_hostname is consistently malloc'd and free'd wherever used. This shouldn't be a burden as the hostname is only used in one of two circumstances:
(a) in an error message
(b) in a verbose output for debugging purposes
Thus, there should be no performance penalty associated with the malloc/free requirement. PMIx will eventually be returning static pointers, and so we can eventually simplify this method and return a "const char*" - but as noted, this really isn't an issue even today.
Signed-off-by: Ralph Castain <rhc@pmix.org>
This commit fixes an issue with the include usage in some
ompi source files. These source files are using the <> form
of include when the "" form is correct (as these are internal,
**not** system headers).
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
- Port memchecker call from a1d502c.
- Remove unused memcheck macro variables.
- Some code readability improvements.
- Remove some stray +1's in dynamic comm cleanup.
- Re-add OPAL_ENABLE_DEBUG macro to osc header.
- Cleanup some printf's, and includes.
- Refactor cleanup of dpm_disconnect_objs.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Temporary solution for the PML inconsistency issue discussed in #7475.
This patch address 2 things: first it make the PMIx key optional so that
if we are not in a full modex mode we don't do a direct modex, and
second it get the PML info from the vpid 0 instead of from the local
rank.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.
remove orte: misc fixes
- UCX fixes
- VPATH issue
- oshmem fixes
- remove useless definition
- Add PRRTE submodule
- Get autogen.pl to traverse PRRTE submodule
- Remove stale orcm reference
- Configure embedded PRRTE
- Correctly pass the prefix to PRRTE
- Correctly set the OMPI_WANT_PRRTE am_conditional
- Move prrte configuration to the end of OMPI's configure.ac
- Make mpirun a symlink to prun, when available
- Fix makedist with --no-orte/--no-prrte option
- Add a `--no-prrte` option which is the same as the legacy
`--no-orte` option.
- Remove embedded PMIx tarball. Replace it with new submodule
pointing to OpenPMIx master repo's master branch
- Some cleanup in PRRTE integration and add config summary entry
- Correctly set the hostname
- Fix locality
- Fix singleton operations
- Fix support for "tune" and "am" options
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
The opal_gethostname() function provides a more robust mechanism
to retrieve the hostname than gethostname(), which can return
results that are not null-terminated, and which can vary in its
behavior from system to system.
opal_gethostname() just returns the value in opal_process_info.nodename;
this is populated in opal_init_gethostname() inside opal_init.c.
-Changed all gethostname calls in opal subtree to opal_gethostname
-Changed all gethostname calls in orte subtree to opal_gethostname
-Changed all gethostname calls in ompi subdir to opal_gethostname
-Changed all gethostname calls in oshmem subdir to opal_gethostname
-Changed opal_if.c in test subdir to use opal_gethostname
-Changed opal_init.c to include opal_init_gethostname. This function
returns an int and directly sets opal_process_info.nodename per
jsquyres' modifications.
Relates to open-mpi#6801
Signed-off-by: Charles Shereda <cpshereda@lanl.gov>
Add a comment about the stages in the vprotocols initialization and why
we cant use the threading provided during the PML initialization.
Instead, use the OMPI internal threading status to prevent the use of
message logging in multi-threaded applications.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This is closely related to Platform-MPI's old -prot feature.
The long-format of the tables it prints could look like this:
> Host 0 [myhost001] ranks 0 - 1
> Host 1 [myhost002] ranks 2 - 3
> Host 2 [myhost003] ranks 4
> Host 3 [myhost004] ranks 5
> Host 4 [myhost005] ranks 6
> Host 5 [myhost006] ranks 7
> Host 6 [myhost007] ranks 8
> Host 7 [myhost008] ranks 9
> Host 8 [myhost009] ranks 10
>
> host | 0 1 2 3 4 5 6 7 8
> ======|==============================================
> 0 : sm tcp tcp tcp tcp tcp tcp tcp tcp
> 1 : tcp sm tcp tcp tcp tcp tcp tcp tcp
> 2 : tcp tcp self tcp tcp tcp tcp tcp tcp
> 3 : tcp tcp tcp self tcp tcp tcp tcp tcp
> 4 : tcp tcp tcp tcp self tcp tcp tcp tcp
> 5 : tcp tcp tcp tcp tcp self tcp tcp tcp
> 6 : tcp tcp tcp tcp tcp tcp self tcp tcp
> 7 : tcp tcp tcp tcp tcp tcp tcp self tcp
> 8 : tcp tcp tcp tcp tcp tcp tcp tcp self
>
> Connection summary:
> on-host: all connections are sm or self
> off-host: all connections are tcp
In this example hosts 0 and 1 had multiple ranks so "sm" was more
meaningful than "self" to identify how the ranks on the host are
talking to each other. While host 2..8 were one rank per host so
"self" was more meaningful as their btl.
Above a certain number of hosts (12 by default) the above table gets too big
so we shrink to a more abbreviated looking table that has the same data:
> host | 0 1 2 3 4 8
> ======|====================
> 0 : A C C C C C C C C
> 1 : C A C C C C C C C
> 2 : C C B C C C C C C
> 3 : C C C B C C C C C
> 4 : C C C C B C C C C
> 5 : C C C C C B C C C
> 6 : C C C C C C B C C
> 7 : C C C C C C C B C
> 8 : C C C C C C C C B
> key: A == sm
> key: B == self
> key: C == tcp
Then above 36 hosts we stop printing the 2d table entirely and just print the
summary:
> Connection summary:
> on-host: all connections are sm or self
> off-host: all connections are tcp
The options to control it are
-mca comm_method 1 : print the above table at the end of MPI_Init
-mca comm_method 2 : print the above table at the beginning of MPI_Finalize
-mca comm_method_max <n> : number of hosts <n> for which to print a full size 2d
-mca comm_method_brief 1 : only print summary output, no 2d table
-mca comm_method_fakefile <filename> : for debugging only
* printing at init vs finalize:
The most important difference between these two is that when printing the table
during MPI_Init(), we send extra messages to make sure all hosts are connected to
each other. So the table ends up working against the idea of on-demand connections
(although it's only forcing the n^2 connections in the number of hosts, not the
total ranks). If printing at MPI_Finalize() we don't create any connections that
aren't already connected, so the table is more likely to have "n/a" entries if
some hosts never connected to each other.
* how many hosts <n> for which to print a full size 2d table
The option -mca comm_method_max <n> can be used to specify a number of hosts <n>
(default 12) that controls at what host-count the unabbreviated / abbreviated
2d tables get printed:
1 - n : full size 2d table
n+1 - 3n : shortened 2d table
3n+1 - inf : summary only, no 2d table
* brief
The option -mca comm_method_brief 1 can be used to skip the printing of the 2d
table and only show the short summary
* fakefile
This is a debugging option that allows easeir testing of all the printout
routines by letting all the detected communication methods between the hosts
be overridden by fake data from a file.
The source of the information used in the table is the .mca_component_name
In the case of BTLs, the module always had a .btl_component linking back to the
component. The vars mca_pml_base_selected_component and ompi_mtl_base_selected_component
offer similar functionality for pml/mtl.
So with the ability to identify the component, we can then access
the component name with code like this
mca_pml_base_selected_component.pmlm_version.mca_component_name
See the three lookup_{pml,mtl,btl}_name() functions in hook_comm_method_fns.c,
and their use in comm_method() to parse the strings and produce an integer
to represent the connection type being used.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
zero-size derived datatypes are now flagged as OPAL_DATATYPE_FLAG_CONTIGUOUS
so update mca_pml_ucx_init_datatype() to correctly handle them.
Since 'size' is a 'size_t', the assertion can simply be removed.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The rdma_frag attached to the send request was not correctly released
upon request completion, leaking until MPI_Finalize. A quick solution
would have been to add RDMA_FRAG_RETURN at different locations on the
send request completion, but it would have unnecessarily made the
sendreq completion path more complex. Instead, I added the length to
the RDMA fragment so that it can be completed during the remote ack.
Be more explicit on the comment.
The rdma_frag can only be freed once when the peer forced a protocol
change (from RDMA GET to send/recv). Otherwise the fragment will be
returned once all data pertaining to it has been trasnferred.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
In case of using a btl_put in ob1, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the sent handle is necessary the handle in the frag but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request one.
I suggest to check if the handle in the fragment is valid and if not to
send the handle from the request.
Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
In the case the btl_get fails Ob1 tries to fallback on btl_put first but
the return code was ignored. So the code fell back on both btl_put and
btl_send.
Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
This is not fixing any issue, it is simply preventing a sefault if the
communicator creation has not happened as expected. Thus, this code path
should never really be hit in a correct MPI application with a valid
communicator creation support.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
- there was a set of UCX related issues reported which caused
by mmap API hooks conflicts. We added diagnostic of such
problems to simplify bug-resolving pipeline
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>