Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given.
Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example:
$ mpirun -npernode 2 ./mpi_memprobe
Sampling memory usage after MPI_Init
Data for node rhc001
Daemon: 12.483398
Client: 6.514648
Data for node rhc002
Daemon: 11.865234
Client: 4.643555
Sampling memory usage after MPI_Barrier
Data for node rhc001
Daemon: 12.520508
Client: 6.576660
Data for node rhc002
Daemon: 11.879883
Client: 4.703125
Note that the client value on node rhc001 is larger - this is where rank 0 is housed, and it apparently gets a larger footprint for some reason.
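As a usage note, the async-modex case mentioned above can be exercised by enabling the MCA param on the same command line (value shown is just the boolean enable):
$ mpirun --mca pmix_base_async_modex 1 -npernode 2 ./mpi_memprobe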
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.
Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).
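For example, to trim the hostname entries at a smaller job size (or keep them on a larger one), the cutoff can be set on the command line; the value and binary here are only placeholders:
$ mpirun --mca orte_hostname_cutoff 500 ./a.out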
Begin first cut at memory profiler
Some minor cleanups of memprobe
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This is news to me: I didn't know that some distros do not set $HOME.
So use "~" instead, and only try to grep ~/.rpmmacros if it exists.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
* Point to local libfabric v1.4 install (see the configure sketch below)
* Add MPI C++ bindings
* Remove PSM support (if someone can install PSM/PSM2 libraries on the
build server, let's re-enable this)
Also change from -j8 to -j4 (the new AWS build instance only has 1
core / 2 hyperthreads).
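For the libfabric and C++ bindings items above, the build now roughly amounts to the following configure invocation (the install prefix is just a placeholder):
$ ./configure --with-libfabric=/opt/libfabric-1.4 --enable-mpi-cxx
$ make -j4 all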
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Nightly snapshots will now be named:
openmpi-${BRANCHNAME}-${YYYYMMDDHHMM}-${SHORTHASH}.tar.${COMPRESSION}.
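As a rough illustration of how the name is assembled (branch name and compression suffix are made up for the example):
$ BRANCHNAME=master
$ YYYYMMDDHHMM=$(date -u +%Y%m%d%H%M)
$ SHORTHASH=$(git rev-parse --short HEAD)
$ echo "openmpi-${BRANCHNAME}-${YYYYMMDDHHMM}-${SHORTHASH}.tar.bz2"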
Fixes #2337
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Still not completely done, as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops its connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted.
While we are at it: remove the oob/usock component to eliminate the TMPDIR length problem, and get everything working again, including oob_stress.
* In open-mpi/ompi@f6f24a4f67 I missed
updating the library references for the wrapper compilers.
* Fixes the CXX wrapper compiler; the CXX library is renamed as needed.
* Fixes the Java wrapper compiler; the Java library is renamed as needed.
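One quick way to double-check that a wrapper now points at the renamed library is to dump its link line (output varies by install):
$ mpic++ --showme:link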
Update the script to auto-generate the entire AUTHORS file from two
sources:
1. The existing AUTHORS file
2. The output from "git log --format=tformat:'%aN <%aE>'"
Merge these two together (which will preserve organization
affiliations) and warn in two cases:
1. If a person has no organization affiliation
2. If the same email address appears for more than one person
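A rough sketch of gathering the second source and flagging case 2 (not the actual script, just the idea):
# Unique "Name <email>" lines from the full git history
$ git log --format=tformat:'%aN <%aE>' | sort -u > /tmp/git-authors.txt
# Any email printed here is associated with more than one name
$ awk -F'[<>]' '{print $2}' /tmp/git-authors.txt | sort | uniq -d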
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>