This commit add support for scalable endpoint to enhance multithreaded
application performance. The BTL will detect the support from ofi
provider and will fallback to normal usage of scalable endpoint is not
supported.
NEW MCA parameters:
- mca_btl_ofi_disable_sep: force the btl to not use scalable endpoint.
- mca_btl_ofi_num_contexts_per_module: number of communication context
to create (should be the same as number of thread).
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This code is the implementation of Software-base Performance Counters as described in the paper 'Using Software-Base Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.
All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example.
Signed-off-by: David Eberius <deberius@vols.utk.edu>
FI_MR_UNSPEC is not supposed to be used beyond ofi version 1.5. This
commit replaces FI_MR_UNSPEC with the new FI_MR_BASIC mode bits
(FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR).
The btl functionality remains the same.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
OFI sockets provider will sometime return FI_EINTR. Apparently this flag
does not exist in OFI version 1.5. This commit added an ifdef check.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit adds patcher support of aarch64. The current
implementation uses mov, movk, and br to perform the jump. This uses 5
instructions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes problem where users have libfabric version < 1.5
installed and build failed because the new MR modebits does not exist.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using:
* autogen --no-orte
* with external libevent, external hwloc, and external PMIx master
* configuring PMIx master with the same libevent and hwloc
* execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun
Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)
This commit added new transport layer to be used with osc rdma module.
This BTL provides put, get, atomic and fetch atomic operations. It can
be used with multiple hardware vendors as long as they have their
provider under Libfabric and have the right capabilities.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
Don't do a recursive search (hence no need for *idx anymore).
Find the level depth, to hide cache-issues first.
Then iterate over that level to find the objects we want.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
df_search_min_bound() would need to be fixed for hwloc 2.0,
but it's only used in opal_hwloc_base_find_min_bound_target_under_obj()
which isn't used anymore. So just remove all of them.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
As @hjelmn and I discussed, this is a little hacky. However, it is the only solution that can be done solely from the OMPI side.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The mpool/memkind component was using a deprecated "partitions" API.
This commit refactors the memkind component to make use of the
supported public API.
The public API uses 3 parameters to specify a mpool "kind":
- a memkind type (which for now is just default or HBM)
- a memkind policy
- a memkind_bits (partly to specify pagesize)
The MCA parameters were changed to reflect these memkind
parameters.
Add a make check test for sanity checking of the memkind component.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
This commit fixes a segfault in btl-portals4 add_procs(). The segfault
occurs if add_procs() is called after a del_procs() call that reduces
the proc count to zero which would cause PT and NI resources to be
freed. This commit resolves the segfault by using a common
initiailization boolean and only freeing module resources in
finalize().
Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
Have Open MPI's PMIx component to set PMIx's "show_load_errors" to do
the same thing that Open MPI's "show_load_errors" does.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior). Roll
back these messages to be opal_output_verbose() kinds of messages.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The strtoul function returns the pointer to the first non-digit character, which is a '.' in this case. Calling strtoul at that point will always yield a zero - you have to move past it to get the remaining number
Thanks to Greg Lee for the detailed analysis of the problem.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Simple MCA vars for ext1, ext2, and pmix3 components to reflect what
the underlying PMIx library version is. For example:
```
$ ompi_info --param pmix pmix3x --parsable --level 9 | grep
library_version
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:value:PMIx library version 3.0.0 (embedded in Open MPI)
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:source:default
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:status:writeable
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:level:4
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:help:Version of the underlying PMIx library
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:deprecated:no
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:type:string
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:disabled:false
```
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long. Don't copy beyond the end of that source buffer.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes the case when local client asks for the key from the
process on the remote node. The local server don't have commit count for
remote ranks, it is maintained by another PMIx server, so commit count
should be ignored for remote requests.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
The current code path for PMIx_Resolve_peers and PMIx_Resolve_nodes executes a threadshift in the preg components themselves. This is done to ensure thread safety when called from the user level. However, it causes thread-stall when someone attempts to call the regex functions from _inside_ the PMIx code base should the call occur from within an event.
Accordingly, move the threadshift to the client-level functions and make the preg components just execute their algorithms. Create a new pnet/test component to verify that the prge code can be safely accessed - set that component to be selected only when the user directly specifies it. The new component will be used to validate various logical extensions during development, and can then be discarded.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 456ac7f7af3d9ba09888e3c899eb001daaa24aef)
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":
* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch
* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.
* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.
* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Important note :
According to the man page
"On success, process_vm_readv() returns the number of bytes read and
process_vm_writev() returns the number of bytes written. This return
value may be less than the total number of requested bytes, if a
partial read/write occurred. (Partial transfers apply at the
granularity of iovec elements. These system calls won't perform a
partial transfer that splits a single iovec element.)"
So since we use a single iovec element, the returned size should either
be 0 or size, and the do loop should not be needed here.
We tried on various Linux kernels with size > 2 GB, and surprisingly,
the returned value is always 0x7ffff000 (fwiw, it happens to be the size
of the larger number of pages that fits a signed 32 bits integer).
We do not know whether this is a bug from the kernel, the libc or even
the man page, but for the time being, we do as is process_vm_readv() could
return any value.
Thanks Heiko Bauke for the bug report.
Refs. open-mpi/ompi#4829
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Three issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all
connections for a given address being vectored to a single btl
* Multiple addresses in the same subnet caused connections to be
stalled, as the receiver would always use the same (first) address
found. Binding the outgoing connection solves this issue
* Lastly fix race condition created by connections being started at the exact same time
by accpeting connections not in the closed state, allowing endpoint_accept to resolve
dispute
Signed-off-by: Jordan Cherry <cherryj@amazon.com>
This commit replaces the current VMA tree implementation with one that
uses the new opal_interval_tree_t class. Since the VMA tree lock is no
longer used this commit also updates rcache/grdma and btl/vader to
take better care when searching for existing registrations.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Follow-on to 8097d09858: now that BTL_VERSION is defined in btl.h, be
a little smarter about whether we define it or not.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit adds a new optional function to the BTL module:
btl_flush. This function takes an optional BTL endpoint. When called
this function completes all outstanding RDMA and atomic operations
started prior to the call to btl_flush.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This header file was meant to be autogenerated, and for
some reasons, was never removed from the repository.
Update .gitignore as well
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Since open-mpi/ompi@47fd2313ab
the backing file is now in /dev/shm by default. As a consequence,
the backing file name has to include the jobid so more than one job
can run at a time.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Resolve a race condition between registering for a file to be removed upon termination and actual creation of that file by providing attributes that identify whether the path is a file or directory. This removes the need for PMIx to detect the difference.
Refs #4686
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This commit fixes an issue when a registration is created for a large
region and then invalidated while part of it is in use.
References #4509
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit moves the backing files to /dev/shm to avoid limitations
that may be set on /tmp. The files are registered with pmix to ensure
they are cleaned up after an erroneous exit.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 48101278160672317ade352365592f56ef3b8977)
If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so.
Final cleanup to track uid/gid, update the opal/pmix API to pass flags for ignore and leave top directory alone
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
It is possible to have parts of an in-use registered region be passed
to munmap or madvise. This does not necessarily mean the user has made
an error but does mean the entire region should be invalidated. This
commit checks that the munmap or madvise base matches the beginning of
the cached region. If it does and the region is in-use then we print
an error. There will certainly be false-negatives where a user
unmaps something that really is in-use but that is preferrable to a
false-positive.
References #4509
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.
References #4260
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
set the key of all mpool_tree_item objects, so they can be retrieved
in mpool_base_free and then returned back to the
mca_mpool_base_tree_item_free_list free list.
Refs. open-mpi/ompi#4567
Thanks Philip Blakely for the bug report.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit adds support for fetch-and-op atomics. This is needed
because and and or are irreversible operations so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is not atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
if PMIx (version > 1.x) is active since all diagnostic messages will instead flow thru
the PMIx connection. Unfortunately, PMIx v1 does not support this
feature, but we can remove the stddiag support once PMIx v1 slides out
of the support window
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
Store the pointer to the object handle and not the pointer to the
pointer.
We should not assert(0) in the code !
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit renames the atomic compare-and-swap functions to indicate
the return value. This is in preperation for adding support for a
compare-and-swap that returns the old value. At the same time the
return type has been changed to bool.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The problem is that the waiting thread is cycling using OMPI_LAZY_WAIT_FOR_COMPLETION so it can exercise opal_progress. This probably isn't as critical for the modex step, but definitely necessary for the barrier at the end of mpi_init. The problem this creates is that the lazy macro exits as soon as "active" becomes false, and then we destruct the lock.
However, wakeup_thread sets "active" to false - and then calls the condition broadcast to wakeup any waiting threads. So there is a race condition between that broadcast and the lock destruct.
Add OPAL_ACQUIRE_OBJECT and OPAL_POST_OBJECT memory barriers to help protect against thread race conditions on some platforms
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Search external libevent library in both DIR/lib64 and DIR/lib
when --with-libevent=DIR is specified but --with-libevent-libdir=DIR is not
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fix an apparent typo in external libevent configury
Require external libevent for install of separate libpmix
Signed-off-by: Ralph Castain <rhc@open-mpi.org>