Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given.
Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example:
$ mpirun -npernode 2 ./mpi_memprobe
Sampling memory usage after MPI_Init
Data for node rhc001
Daemon: 12.483398
Client: 6.514648
Data for node rhc002
Daemon: 11.865234
Client: 4.643555
Sampling memory usage after MPI_Barrier
Data for node rhc001
Daemon: 12.520508
Client: 6.576660
Data for node rhc002
Daemon: 11.879883
Client: 4.703125
Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.
Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).
Begin first cut at memory profiler
Some minor cleanups of memprobe
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This is news to me: I didn't know that some distros do not set $HOME.
So use "~" instead, and only try to grep ~/.rpmmacros if it exists.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
* Point to local libfabric v1.4 install
* Add MPI C++ bindings
* Remove PSM support (if someone can install PSM/PSM2 libraries on the
build server, let's re-enable this)
Also change from -j8 to -j4 (the new AWS build instance only has 1
core / 2 hyperthreads).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Nightly snapshots will now be named:
openmpi-${BRANCHNAME}-${YYYYMMDDHHMM}-${SHORTHASH}.tar.${COMPRESSION}.
Fixes#2337
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted.
While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress
* In open-mpi/ompi@f6f24a4f67 I missed
updating the library references for the wrapper compilers.
* Fixes the CXX wrapper compiler and CXX library is renamed as needed.
* Fixes the Java wrapper compiler and the Java library is renamed as needed.
Update the script to auto-generate the entire AUTHORS file from two
sources:
1. The existing AUTHORS file
2. The output from "git log --format=tformat:=tformat:'%aN <%aE>'"
Merge these two together (which will preserve organization
affiliations) and warn in two cases:
1. If a person has no organization affiliation
1. If the same email address appears for more than one person
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The Open MPI configure script has long-since only paid attention to
FCFLAGS. Indeed, it will warn if you set FFLAGS or F77FLAGS. So
remove them from the spec file.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Add the following to zsh shell completion:
* --get-stack-traces
* --report-state-upon-timeout
* --timeout
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit adds a framework to abstract runtime code patching.
Components in the new framework can provide functions for either
patching a named function or a function pointer. The later
functionality is not being used but may provide a way to allow memory
hooks when dlopen functionality is disabled.
This commit adds two different flavors of code patching. The first is
provided by the overwrite component. This component overwrites the
first several instructions of the target function with code to jump to
the provided hook function. The hook is expected to provide the full
functionality of the hooked function.
The linux patcher component is based on the memory hooks in ucx. It
only works on linux and operates by overwriting function pointers in
the symbol table. In this case the hook is free to call the original
function using the function pointer returned by dlsym.
Both components restore the original functions when the patcher
framework closes.
Changes had to be made to support Power/PowerPC with the Linux
dynamic loader patcher. Some of the changes:
- Move code necessary for powerpc/power support to the patcher
base. The code is needed by both the overwrite and linux
components.
- Move patch structure down to base and move the patch list to
mca_patcher_base_module_t. The structure has been modified to
include a function pointer to the function that will unapply the
patch. This allows the mixing of multiple different types of
patches in the patch_list.
- Update linux patching code to keep track of the matching between
got entry and original (unpatched) address. This allows us to
completely clean up the patch on finalize.
All patchers keep track of the changes they made so that they can be
reversed when the patcher framework is closed.
At this time there are bugs in the Linux dynamic loader patcher so
its priority is lower than the overwrite patcher.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
All BTL-only operations (basically all data movements
with the exception of the matching operation) can now
be handled for the TCP BTL by a progress thread.
This script is useful to measure times from launching ompi
application to different internal points. A user can easy add
it`s test basing on existing tests.
See readme information inside the script.
Fixes to lanl platform files to pick up lustre header
files, etc. for romio and ompi i/o.
Fixes#1033
Thanks to Jerome Vienne for spotting this.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Do not need to test with -f. If directories ever show up in git
ls-files (unlikely) their mime type is application/x-directory which
will fail the test ${file::4} == text check.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>