This code is the implementation of Software-base Performance Counters as described in the paper 'Using Software-Base Performance Counters to Expose Low-Level Open MPI Performance Information' in EuroMPI/USA '17 (http://icl.cs.utk.edu/news_pub/submissions/software-performance-counters.pdf). More practical usage information can be found here: https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI.
All software events functions are put in macros that become no-ops when SOFTWARE_EVENTS_ENABLE is not defined. The internal timer units have been changed to cycles to avoid division operations which was a large source of overhead as discussed in the paper. Added a --with-spc configure option to enable SPCs in the Open MPI build. This defines SOFTWARE_EVENTS_ENABLE. Added an MCA parameter, mpi_spc_enable, for turning on specific counters. Added an MCA parameter, mpi_spc_dump_enabled, for turning on and off dumping SPC counters in MPI_Finalize. Added an SPC test and example.
Signed-off-by: David Eberius <deberius@vols.utk.edu>
FI_MR_UNSPEC is not supposed to be used beyond ofi version 1.5. This
commit replaces FI_MR_UNSPEC with the new FI_MR_BASIC mode bits
(FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR).
The btl functionality remains the same.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
OFI sockets provider will sometime return FI_EINTR. Apparently this flag
does not exist in OFI version 1.5. This commit added an ifdef check.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit adds patcher support of aarch64. The current
implementation uses mov, movk, and br to perform the jump. This uses 5
instructions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes problem where users have libfabric version < 1.5
installed and build failed because the new MR modebits does not exist.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
Get the OMPI rte/pmix component working. This was tested using PRRTE as the RM, configuring OMPI using:
* autogen --no-orte
* with external libevent, external hwloc, and external PMIx master
* configuring PMIx master with the same libevent and hwloc
* execute the application using PRRTE's "prun" launcher, which has the same cmd line as ORTE's mpirun
Note that PMIx master appears to have a bug in the event notification system that caches job termination events. Thus, the first execution runs fine, but subsequent executions cause an "abort" when the OMPI default error handler is invoked upon notification of the prior job's termination. Will work that separately.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 134cca9ac0de092d767999357573a31703f72292)
This checkin mainly concerns our internal info keys that are registering
for callbacks via opal_infosubscribe_subscribe(). Those keys need to have
an extra __IN_<key>/val stored to preserve their pre-callback value. So
that means our internal keys are limited to 5 chars shorter than the usual
key length limit.
The code previously would have been silently inactive if a large key happened
to come in, now it warns and also uses snprintf() to avoid compiler warnings.
I'm also making the top-level MPI_Info_set warn if the user uses our reserved
"__IN_" prefix. I had wanted the feature to be more invisible than that, but
it would require a more sophisticated approach to change that.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
This commit added new transport layer to be used with osc rdma module.
This BTL provides put, get, atomic and fetch atomic operations. It can
be used with multiple hardware vendors as long as they have their
provider under Libfabric and have the right capabilities.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit fixes a hang that occurs with debug builds of Open MPI on
aarch64 and power/powerpc systems. When the ll/sc atomics are inline
functions the compiler emits load/store instructions for the function
arguments with -O0. These extra load/store arguments can cause the ll
reservation to be cancelled causing live-lock.
Note that we did attempt to fix this with always_inline but the extra
instructions are stil emitted by the compiler (gcc). There may be
another fix but this has been tested and is working well.
References #3697. Close when applied to v3.0.x and v3.1.x.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Don't do a recursive search (hence no need for *idx anymore).
Find the level depth, to hide cache-issues first.
Then iterate over that level to find the objects we want.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
df_search_min_bound() would need to be fixed for hwloc 2.0,
but it's only used in opal_hwloc_base_find_min_bound_target_under_obj()
which isn't used anymore. So just remove all of them.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
As @hjelmn and I discussed, this is a little hacky. However, it is the only solution that can be done solely from the OMPI side.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The mpool/memkind component was using a deprecated "partitions" API.
This commit refactors the memkind component to make use of the
supported public API.
The public API uses 3 parameters to specify a mpool "kind":
- a memkind type (which for now is just default or HBM)
- a memkind policy
- a memkind_bits (partly to specify pagesize)
The MCA parameters were changed to reflect these memkind
parameters.
Add a make check test for sanity checking of the memkind component.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Fixes issue #5069, which relates a BigMPI bug with the use of
MPI_Type_vectpor to construct very large datatypes (>2GB).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit fixes a segfault in btl-portals4 add_procs(). The segfault
occurs if add_procs() is called after a del_procs() call that reduces
the proc count to zero which would cause PT and NI resources to be
freed. This commit resolves the segfault by using a common
initiailization boolean and only freeing module resources in
finalize().
Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
Have Open MPI's PMIx component to set PMIx's "show_load_errors" to do
the same thing that Open MPI's "show_load_errors" does.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior). Roll
back these messages to be opal_output_verbose() kinds of messages.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The strtoul function returns the pointer to the first non-digit character, which is a '.' in this case. Calling strtoul at that point will always yield a zero - you have to move past it to get the remaining number
Thanks to Greg Lee for the detailed analysis of the problem.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Simple MCA vars for ext1, ext2, and pmix3 components to reflect what
the underlying PMIx library version is. For example:
```
$ ompi_info --param pmix pmix3x --parsable --level 9 | grep
library_version
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:value:PMIx library version 3.0.0 (embedded in Open MPI)
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:source:default
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:status:writeable
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:level:4
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:help:Version of the underlying PMIx library
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:deprecated:no
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:type:string
mca:pmix:pmix3x:param:pmix_pmix3x_library_version:disabled:false
```
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long. Don't copy beyond the end of that source buffer.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes a multi-threading bug when using the thread-safe
free list functions. opal_free_list_wait_mt() was using the
conditional version of opal_lifo_pop() and not the thread-safe call.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Take multiple defensive steps to fix CID 1430413 and ensure that ret
is always initialized upon return.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Returns a string name (either a resolved name or IPv4/IPv6 name in a
string if unresolvable. The caller is responsible for freeing the
string.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes the case when local client asks for the key from the
process on the remote node. The local server don't have commit count for
remote ranks, it is maintained by another PMIx server, so commit count
should be ignored for remote requests.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
The current code path for PMIx_Resolve_peers and PMIx_Resolve_nodes executes a threadshift in the preg components themselves. This is done to ensure thread safety when called from the user level. However, it causes thread-stall when someone attempts to call the regex functions from _inside_ the PMIx code base should the call occur from within an event.
Accordingly, move the threadshift to the client-level functions and make the preg components just execute their algorithms. Create a new pnet/test component to verify that the prge code can be safely accessed - set that component to be selected only when the user directly specifies it. The new component will be used to validate various logical extensions during development, and can then be discarded.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 456ac7f7af3d9ba09888e3c899eb001daaa24aef)
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":
* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch
* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.
* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.
* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Important note :
According to the man page
"On success, process_vm_readv() returns the number of bytes read and
process_vm_writev() returns the number of bytes written. This return
value may be less than the total number of requested bytes, if a
partial read/write occurred. (Partial transfers apply at the
granularity of iovec elements. These system calls won't perform a
partial transfer that splits a single iovec element.)"
So since we use a single iovec element, the returned size should either
be 0 or size, and the do loop should not be needed here.
We tried on various Linux kernels with size > 2 GB, and surprisingly,
the returned value is always 0x7ffff000 (fwiw, it happens to be the size
of the larger number of pages that fits a signed 32 bits integer).
We do not know whether this is a bug from the kernel, the libc or even
the man page, but for the time being, we do as is process_vm_readv() could
return any value.
Thanks Heiko Bauke for the bug report.
Refs. open-mpi/ompi#4829
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Three issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all
connections for a given address being vectored to a single btl
* Multiple addresses in the same subnet caused connections to be
stalled, as the receiver would always use the same (first) address
found. Binding the outgoing connection solves this issue
* Lastly fix race condition created by connections being started at the exact same time
by accpeting connections not in the closed state, allowing endpoint_accept to resolve
dispute
Signed-off-by: Jordan Cherry <cherryj@amazon.com>
This commit replaces the current VMA tree implementation with one that
uses the new opal_interval_tree_t class. Since the VMA tree lock is no
longer used this commit also updates rcache/grdma and btl/vader to
take better care when searching for existing registrations.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new class to opal: opal_interval_tree_t. This is a
thread-safe impelementation of a 1-dimensional interval tree. The data
structure is intended to provide a faster implementation of the
registration cache VMA tree.
The thread safety is provided by a relativistic red-black tree
implementation. This structure provides support for multiple-reader,
and single writer. There is one caveat, an item may appear in the tree
twice while the tree is being updated. Care needs to be taken to avoid
issues associated with this "feature". I don't anticipate a problem
with the current VMA tree usage.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Follow-on to 8097d09858: now that BTL_VERSION is defined in btl.h, be
a little smarter about whether we define it or not.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit adds a new optional function to the BTL module:
btl_flush. This function takes an optional BTL endpoint. When called
this function completes all outstanding RDMA and atomic operations
started prior to the call to btl_flush.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This header file was meant to be autogenerated, and for
some reasons, was never removed from the repository.
Update .gitignore as well
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Since open-mpi/ompi@47fd2313ab
the backing file is now in /dev/shm by default. As a consequence,
the backing file name has to include the jobid so more than one job
can run at a time.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Reading the system clock on every call to opal_progress() is an
expensive operation on most architectures, and it can negatively affect
the performance, for example of message rate benchmarks.
We change opal_progress() to read the clock once per 8 calls, unless
there are active users of the event mechanism.
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
Resolve a race condition between registering for a file to be removed upon termination and actual creation of that file by providing attributes that identify whether the path is a file or directory. This removes the need for PMIx to detect the difference.
Refs #4686
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
return true if the datatype has non-negative displacements and
monotonically nondecreasing, and false otherwise.
Thanks George for the guidance.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit fixes an issue when a registration is created for a large
region and then invalidated while part of it is in use.
References #4509
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit moves the backing files to /dev/shm to avoid limitations
that may be set on /tmp. The files are registered with pmix to ensure
they are cleaned up after an erroneous exit.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 48101278160672317ade352365592f56ef3b8977)
If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so.
Final cleanup to track uid/gid, update the opal/pmix API to pass flags for ignore and leave top directory alone
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
It is possible to have parts of an in-use registered region be passed
to munmap or madvise. This does not necessarily mean the user has made
an error but does mean the entire region should be invalidated. This
commit checks that the munmap or madvise base matches the beginning of
the cached region. If it does and the region is in-use then we print
an error. There will certainly be false-negatives where a user
unmaps something that really is in-use but that is preferrable to a
false-positive.
References #4509
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.
References #4260
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
set the key of all mpool_tree_item objects, so they can be retrieved
in mpool_base_free and then returned back to the
mca_mpool_base_tree_item_free_list free list.
Refs. open-mpi/ompi#4567
Thanks Philip Blakely for the bug report.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit removes eax and edx from the clobber list. Older versions
of gcc handled these ok but gcc 7 does not. They are not required as
eax and edx are specified in output constraints.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for fetch-and-op atomics. This is needed
because and and or are irreversible operations so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is not atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new set of compare-and-exchange functions. These
functions have a signature similar to the functions found in C11. The
old cmpset functions are now deprecated and defined in terms of the
new compare-and-exchange functions. All asm backends have been
updated.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
if PMIx (version > 1.x) is active since all diagnostic messages will instead flow thru
the PMIx connection. Unfortunately, PMIx v1 does not support this
feature, but we can remove the stddiag support once PMIx v1 slides out
of the support window
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
* Fix typo in the `opal_atomic_wmb` declaration.
* Fix lingering `eieio` reference in the XL assembly to be `lwsync`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
Sometimes, the ethernet interfaces can get quite high kernel indices. struct
ifreq (see netdevice(7)) defines ifr_ifindex to be int's. The OOB component
used int16_t internally for matching (in case of -mca oob_tcp_if_[in|ex]clude)
which meant that any interface index > 32767 would never be matched because the
integer would be truncated to int16_t upon return from the function. OOB would
then refuse to work because it didn't find any usable interfaces and MPI job
would abort.
Signed-off-by: Wojtek Wasko <wwasko@nvidia.com>
so the bool type is defined when using old compilers that do not support gcc builtin atomics (such as gcc 4.1.x from CentOS 5)
Fixesopen-mpi/ompi#4478
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Store the pointer to the object handle and not the pointer to the
pointer.
We should not assert(0) in the code !
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit renames the atomic compare-and-swap functions to indicate
the return value. This is in preperation for adding support for a
compare-and-swap that returns the old value. At the same time the
return type has been changed to bool.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds additional atomics math operations that are needed
throughout the codebase. The semantics of the new operations are
consistent with the existing atomics (op then fetch).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The problem is that the waiting thread is cycling using OMPI_LAZY_WAIT_FOR_COMPLETION so it can exercise opal_progress. This probably isn't as critical for the modex step, but definitely necessary for the barrier at the end of mpi_init. The problem this creates is that the lazy macro exits as soon as "active" becomes false, and then we destruct the lock.
However, wakeup_thread sets "active" to false - and then calls the condition broadcast to wakeup any waiting threads. So there is a race condition between that broadcast and the lock destruct.
Add OPAL_ACQUIRE_OBJECT and OPAL_POST_OBJECT memory barriers to help protect against thread race conditions on some platforms
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Search external libevent library in both DIR/lib64 and DIR/lib
when --with-libevent=DIR is specified but --with-libevent-libdir=DIR is not
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fix an apparent typo in external libevent configury
Require external libevent for install of separate libpmix
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
gcc 7.[1,2] (at least) fails to correctly parse the OSX 10.13 sys/syslog.h
header. As a results we need to potect syslog support in OPAL, PMIX and
ORTE.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Turns out UCX PML calls opal_pmix.fence in its del procs
method without checking whether or not the fence method
for the pmix component was defined. Rather than patch
UCX PML, actually define a fence method for the cray pmix.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
We no longer officially support MIPS or ARM before v6. This commit
updates the configury to check for sync builtins on these
architectures and removes the MIPS and IA64 assembly from
opal/include/opal/sys.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Give packagers a configure CLI option to set the value of the MCA
variable mca_base_component_show_load_errors.
The --disable form of this option is intended for Open MPI packagers
who tend to enable support for many different types of networks and
systems in their packages. For example, consider a packager who
includes support for both the FOO and BAR networks in their Open MPI
package, both of which require support libraries (libFOO.so and
libBAR.so). If an end user only has BAR hardware, they likely only
have libBAR.so available on their systems -- not libFOO.so. Disabling
load errors by default will prevent the user from seeing potentially
confusing warnings about the FOO components failing to load because
libFOO.so is not available on their systems.
Conversely, system administrators tend to build an Open MPI that is
targeted at their specific environment, and contains few (if any)
components that are not needed. In such cases, they might want their
users to be warned that the FOO network components failed to load
(e.g., if libFOO.so was mistakenly unavailable), because Open MPI may
otherwise silently failover to a slower network path for MPI traffic.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Debounce "unreachable" notifications for tools when they disconnect
Enable the -x cmd line option for prun
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 0a5b36180a22959654461ac1303cec35313f8b4a)
Remove some build product. Tell PMIx that we don't need a new nspace generated when OMPI calls connect
Add missing Makefile
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Their is racing condition in TCP connection establishment
during simultaneous handshake. This PR handles the fix for
it.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>