Since the usnic BTL is single-threaded in this area, there really is
no danger, but don't use one of the pointers hanging off the frag
after we return it to the freelist. Instead, save the endpoint
pointer before returning the frag.
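
As an illustration only, the ordering described above might look like the following minimal sketch. The types endpoint_t, frag_t, and freelist_t and both functions are hypothetical stand-ins, not the actual usnic BTL structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the BTL endpoint, fragment, and freelist. */
typedef struct endpoint { int sends_outstanding; } endpoint_t;

typedef struct frag {
    endpoint_t *endpoint;   /* pointer hanging off the frag */
    struct frag *next;      /* freelist linkage */
} frag_t;

typedef struct freelist { frag_t *head; } freelist_t;

/* Return a frag to the freelist. After this call the frag may be
 * reused, so its fields must be treated as invalid. */
static void freelist_return(freelist_t *fl, frag_t *frag)
{
    assert(NULL != fl && NULL != frag);
    frag->endpoint = NULL;
    frag->next = fl->head;
    fl->head = frag;
}

static void complete_send(freelist_t *fl, frag_t *frag)
{
    endpoint_t *endpoint = frag->endpoint;  /* save before returning */

    freelist_return(fl, frag);
    endpoint->sends_outstanding--;          /* saved copy, not frag->endpoint */
}
```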
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The types are technically typedef equivalent, but it's less confusing
to use the types that agree with the name of the constructor.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
On Linux, sending the header and then the message data in separate writes
severely degrades the performance of ptl/tcp: on the receiver, reading the
data often results in PMIX_ERR_RESOURCE_BUSY or PMIX_ERR_WOULD_BLOCK, which
ends up degrading performance.
This commit sends both the header and the message data at the same time via
writev(), which makes ptl/tcp virtually as efficient as ptl/usock.
A short writev() generally occurs when the kernel buffer is full, so there is
no point in retrying in this case.
fwiw, no such degradation was observed on OSX.
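
A minimal sketch of the single-writev() approach, under assumptions: msg_hdr_t and send_hdr_and_data() are hypothetical names, not the actual ptl/tcp code. A short count is simply reported to the caller instead of being retried right away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical message header; the real ptl/tcp header differs. */
typedef struct msg_hdr {
    uint32_t tag;
    uint32_t nbytes;
} msg_hdr_t;

/* Hand header and payload to the kernel in one writev() call.
 * Returns the number of bytes written (possibly short) or -1 with
 * errno set. */
static ssize_t send_hdr_and_data(int fd, const msg_hdr_t *hdr,
                                 const void *data, size_t len)
{
    struct iovec iov[2];

    assert(NULL != hdr);
    iov[0].iov_base = (void *) hdr;
    iov[0].iov_len  = sizeof(*hdr);
    iov[1].iov_base = (void *) data;
    iov[1].iov_len  = len;

    return writev(fd, iov, 2);
}
```

A caller that gets back fewer bytes than sizeof(msg_hdr_t) + len would record the offset and resume once the descriptor is writable again.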
Refs open-mpi/ompi#2657
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The MCA variable code calls the string from value function with a NULL
string to verify values. The verbosity enumerator was not correctly
checking for a non-NULL value before trying to set the string.
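
The kind of guard involved can be sketched as follows. The table and the function signature are simplified, hypothetical stand-ins rather than the real MCA enumerator interface.

```c
#include <stddef.h>

/* Hypothetical verbosity levels; the real enumerator has its own table. */
static const struct { int value; const char *name; } levels[] = {
    { 0, "none" }, { 1, "error" }, { 2, "warn" }, { 3, "info" },
};

/* Returns 0 if value is valid, -1 otherwise. string_out may be NULL
 * when the caller only wants to verify the value. */
static int verbosity_string_from_value(int value, const char **string_out)
{
    size_t i;

    for (i = 0; i < sizeof(levels) / sizeof(levels[0]); ++i) {
        if (levels[i].value == value) {
            if (NULL != string_out) {      /* the check that was missing */
                *string_out = levels[i].name;
            }
            return 0;
        }
    }
    return -1;
}
```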
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given.
Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example:
$ mpirun -npernode 2 ./mpi_memprobe
Sampling memory usage after MPI_Init
Data for node rhc001
Daemon: 12.483398
Client: 6.514648
Data for node rhc002
Daemon: 11.865234
Client: 4.643555
Sampling memory usage after MPI_Barrier
Data for node rhc001
Daemon: 12.520508
Client: 6.576660
Data for node rhc002
Daemon: 11.879883
Client: 4.703125
Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.
Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).
Begin first cut at memory profiler
Some minor cleanups of memprobe
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Plug a minor memory leak. Tell the PMIx server not to create a dstore memory region for the daemon job as there is nobody to share it with.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Protect users of hwloc membind functions
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Update PMIx to include NULL string protection
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Update to PMIx master to include key overwrite protection
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:
* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
instead of the topology itself, if available, thus avoiding instantiating the topology
* openib BTL. This uses the distance matrix. At present, I haven't developed a method
for replacing that reference. Thus, this component will instantiate the topology
* usnic BTL. Uses the distance matrix.
* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
the topology
* ess base functions. If a process is direct launched and not bound at launch, this
code attempts to bind it. Thus, procs in this scenario will instantiate the
topology
Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.
Fix pernode binding policy
Properly handle the unbound case
Correct pointer usage
Do not free static error messages!
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
MPI_T_pvar_get_index was returning an incorrect index. The index
was never set correctly while registering the performance variables.
Additionally fix a missing case in the mca_base_var_type_t to MPI
datatype conversion. This type is currently used for control variables
registered by mxm, fca and hcoll components.
Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
- Fix capitalization typos
- Make comment more correct / flow better
- Use AM_CPPFLAGS, not DEFAULT_INCLUDES
- Remove extra "hwloc/" from external hwloc.h specification
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
- simply #include "hwloc.h" to use the external hwloc header
- do use the external hwloc header instead of opal/mca/hwloc/hwloc.h
Thanks Orion Poplawski for the report
Fixes open-mpi/ompi#2616
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fine tuning of flux component
Fix a few minor issues with the initial cut:
* Job id could be obtained from the PMI kvsname like SLURM,
but simpler to getenv (FLUX_JOB_ID)
* Flux pmi-1 doesn't define PMI_BOOL, PMI_TRUE, PMI_FALSE
* Flux pmi-1 maps the deprecated PMI_Get_kvs_domain_id() to
PMI_KVS_Get_my_name() internally, so just call that instead.
* Drop residual slurm references.
Add wrappers for PMI functions so that if HAVE_FLUX_PMI_LIBRARY
is not defined, the component can dlopen libpmi.so at location
specified by the FLUX_PMI_LIBRARY_PATH env variable, which adds
flexibility. If HAVE_FLUX_PMI_LIBRARY is defined, link with
libpmi.so at build time in the usual way.
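
A rough sketch of what one such wrapper could look like. Only FLUX_PMI_LIBRARY_PATH, HAVE_FLUX_PMI_LIBRARY, and PMI_Init() come from the text and the PMI-1 API; pmi_dlopen(), wrap_PMI_Init(), and WRAP_ERR are hypothetical names used for illustration.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdlib.h>

#define WRAP_ERR (-1)   /* hypothetical error value for this sketch */

#ifdef HAVE_FLUX_PMI_LIBRARY
/* Linked at build time: call the library directly. */
extern int PMI_Init(int *spawned);

static int wrap_PMI_Init(int *spawned)
{
    return PMI_Init(spawned);
}
#else
static void *pmi_dso = NULL;

/* Lazily open the library named by FLUX_PMI_LIBRARY_PATH. */
static void *pmi_dlopen(void)
{
    if (NULL == pmi_dso) {
        const char *path = getenv("FLUX_PMI_LIBRARY_PATH");
        if (NULL != path) {
            pmi_dso = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
        }
    }
    return pmi_dso;
}

static int wrap_PMI_Init(int *spawned)
{
    int (*fn)(int *) = NULL;
    void *dso = pmi_dlopen();

    if (NULL == dso) {
        return WRAP_ERR;
    }
    /* POSIX idiom for converting the dlsym() result to a function pointer */
    *(void **) &fn = dlsym(dso, "PMI_Init");
    return (NULL != fn) ? fn(spawned) : WRAP_ERR;
}
#endif
```

On older glibc versions the dlopen path needs -ldl at link time.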
Update configury for flux component
Update m4 so the configure options work as follows:
--with-flux-pmi
Build Flux PMI support (default: yes)
--with-flux-pmi-library
Link Flux PMI support with PMI library at build
time. Otherwise the library is opened at runtime at
location specified by FLUX_PMI_LIBRARY_PATH environment
variable. Use this option to enable Flux support when
building statically or without dlopen support (default: no)
If the latter option is provided, the library/header is located at
build time using the pkg-config module 'flux-pmi'. Otherwise there
is no library/header dependency.
Handle the case where ompi is configured with --disable-dlopen
or --enable-static. In those cases, don't build the component
unless --with-flux-pmi-library is provided.
It is fatal if the user explicitly requests --with-flux-pmi but
it cannot be built (e.g. due to --disable-dlopen).
Add a schizo/flux component
Update schizo/flux component
Eliminate slurm-specific usage cases.
Since the module is only loaded if FLUX_JOB_ID is set, there are
only two cases to handle:
1) App was launched indirectly through mpirun. This is not yet
supported with Flux, but the hook remains in case this mode is supported
in the future.
2) App was launched directly by Flux, with Flux providing
CPU binding, if any.
Fix up white space in pmix/flux component
Drop non-blocking fence from pmix:flux component
The flux PMI-1 library is not thread safe, therefore
register a regular blocking fence callback instead of the
thread-shifting fencenb().
pmix/flux component avoids extra PMI_KVS_Gets
Keys stored into the base cache under the wildcard
rank are not intended to be part of the global key namespace.
These keys therefore should not trigger a PMI_KVS_Get() if they
are not found in the cache.
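
The lookup rule can be sketched like this. RANK_WILDCARD, cache_fetch(), and remote_get() are hypothetical stubs standing in for the wildcard rank, the base cache, and a PMI_KVS_Get() round trip.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RANK_WILDCARD (-1)   /* hypothetical stand-in for the wildcard rank */

static int remote_gets = 0;  /* counts simulated PMI_KVS_Get() calls */

/* Hypothetical one-entry cache: only ("local.key", wildcard) is present. */
static const char *cache_fetch(int rank, const char *key)
{
    if (RANK_WILDCARD == rank && 0 == strcmp(key, "local.key")) {
        return "cached";
    }
    return NULL;
}

/* Stub standing in for a PMI_KVS_Get() round trip. */
static const char *remote_get(int rank, const char *key)
{
    (void) rank;
    (void) key;
    ++remote_gets;
    return "remote";
}

static const char *kvs_lookup(int rank, const char *key)
{
    const char *val;

    assert(NULL != key);
    val = cache_fetch(rank, key);
    if (NULL != val) {
        return val;
    }
    if (RANK_WILDCARD == rank) {
        return NULL;         /* not in the global namespace: no remote get */
    }
    return remote_get(rank, key);
}
```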
Minor pmix/flux component cleanup
pmix/flux: drop code for fetching unused pmix_id
pmix/flux: err_exit must return error
Problem: in flux_init(), although 'ret' (variable holding
err_exit return code) is initialized to OPAL_ERROR, the
variable is reused as a temporary result code, so if there are
some successes followed by a failure that doesn't set 'ret',
flux_init() could return success with PMI not initialized.
Ensure that a "goto err_exit" returns OPAL_ERROR if 'ret'
is not set to some other error code.
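
A small sketch of the failure mode and one possible guard. The constants are local stand-ins and init_with_err_exit() is a hypothetical function, not the real flux_init().

```c
#define OPAL_SUCCESS 0     /* local stand-ins for the OPAL constants */
#define OPAL_ERROR   (-1)

static int step_ok(void) { return OPAL_SUCCESS; }

/* pmi_ok == 0 simulates a late failure that does not touch 'ret'. */
static int init_with_err_exit(int pmi_ok)
{
    int ret = OPAL_ERROR;

    ret = step_ok();            /* reuse as a temporary: ret is now success */
    if (OPAL_SUCCESS != ret) {
        goto err_exit;
    }
    if (!pmi_ok) {
        goto err_exit;          /* 'ret' still holds OPAL_SUCCESS here */
    }
    return OPAL_SUCCESS;

err_exit:
    if (OPAL_SUCCESS == ret) {  /* guard: never report success from err_exit */
        ret = OPAL_ERROR;
    }
    return ret;
}
```

Without the guard at the label, the second failure path would return OPAL_SUCCESS.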
pmix/flux: don't mix OPAL_ and PMI_ return codes
Problem: flux_init() can return both PMI_ and OPAL_ return
codes. Although OPAL_SUCCESS and PMI_SUCCESS are both defined
as 0, other codes are not compatible.
Ensure that flux_init() consistently uses 'rc' for PMI_
return codes and 'ret' for OPAL_ return codes.
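
One way to keep the two code spaces apart is an explicit translation step, sketched below. The constants are local stand-ins with illustrative values, and pmi_to_opal() and checked_init() are hypothetical helpers.

```c
/* Local stand-ins; values are illustrative only. */
#define PMI_SUCCESS   0
#define PMI_FAIL      (-1)
#define PMI_ERR_NOMEM 2

#define OPAL_SUCCESS               0
#define OPAL_ERROR                 (-1)
#define OPAL_ERR_OUT_OF_RESOURCE   (-2)

/* Translate a PMI_ return code ('rc') into an OPAL_ one ('ret'). */
static int pmi_to_opal(int rc)
{
    switch (rc) {
    case PMI_SUCCESS:   return OPAL_SUCCESS;
    case PMI_ERR_NOMEM: return OPAL_ERR_OUT_OF_RESOURCE;
    default:            return OPAL_ERROR;
    }
}

/* 'rc' only ever holds PMI_ codes, 'ret' only OPAL_ codes. */
static int checked_init(int (*pmi_call)(void))
{
    int rc  = pmi_call();
    int ret = pmi_to_opal(rc);

    return ret;
}
```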
pmix/flux: factor out repeated code for cache put
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Newer x86 processors have a core-invariant TSC. On these systems it is
safe to use the rdtsc instruction as a monotonic timer. This commit
adds a new function to the opal timer code to check if the timer
backend is monotonic. On x86 it checks the appropriate bit and on
other architectures it parrots back the OPAL_TIMER_MONOTONIC value.
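
A minimal sketch of such a check, not the actual opal timer code. The commit text does not name the bit; this assumes the invariant-TSC flag documented by Intel and AMD (CPUID leaf 0x80000007, EDX bit 8). timer_is_monotonic() and TIMER_MONOTONIC_FALLBACK are hypothetical names, the latter standing in for OPAL_TIMER_MONOTONIC.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/clang helper for the cpuid instruction */
#endif

/* Hypothetical stand-in for the per-architecture OPAL_TIMER_MONOTONIC value. */
#define TIMER_MONOTONIC_FALLBACK false

static bool timer_is_monotonic(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid() returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return (edx & (1u << 8)) != 0;   /* invariant TSC flag */
#else
    return TIMER_MONOTONIC_FALLBACK;
#endif
}
```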
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Update ORTE support for dynamic PMIx operations, e.g., PMIx_Spawn
Update to track master
Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates
Signed-off-by: Ralph Castain <rhc@open-mpi.org>