* The AC_LANG_PROGRAM macro adds the `main()` so it is erroneous
to add it to the test program.
* This was detected with the XL compilers which will fail to
build the program in this situation. The GNU compiler does not
error out or warn, but successfully compiles the program.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Due to the conversion from ssize_t to int we were losing bytes, and
ended up writing outside the receiver buffer. Similarly on the send,
due to the conversion to a lesser type, we could missinterpret the
end of the fragment.
The linux timer code was multiplying the result of the x86 time stamp
counter by 1000000 before dividing by the cpu frequency. This can
cause us to overflow 64 bits if the time stamp counter grows larger
than ~ 1.8e13 (about 8400 seconds after boot). To fix the issue the
units of opal_timer_linux_freq have been changed to MHz.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
the hwloc topology might not contain a NUMA object with hwloc < v2
if the node is not NUMA, so force the NUMA object count to one
in order to correctly allocate mca_btl_sm_component.sm_mpools.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit fixes a deadlock that can occur when the libc version
holds a lock when calling munmap. In this case we could end up calling
free() from vma_tree_delete which would in turn try to obtain the lock
in libc. To avoid the issue put any deleted vma's in a new list on the
vma module and release them on the next call to vma_tree_insert. This
should be safe as this function is not called from the memory hooks.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This should probably not go to the v2.x branch, since it changes the
output format of the usnic stats.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Double check the queue lengths that we get back from libfabric to
ensure that they are at least as long as we need. They *should* never
be shorter than we need, but let's just check to be sure.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Don't just blindly send ACKs; ensure that we have send credits before
doing so. If we don't have any send credits, just don't send the ACK
(it'll come again soon enough; it's not a tragedy if we don't send it
now).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The libfabric usnic provider may give you back TX/RX queues that are
longer than you asked for. So just use the TX/RQ/CQ lengths that we
asked for, regardless of what length comes back.
Additionally, keep the length of the priority channel CQ separate from
the length of the data CQ.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Since the usnic BTL is single-threaded in this area, there really is
no danger, but don't use one of the pointers hanging off the frag
after we return it to the freelist. Instead, save the endpoint
pointer before returning the frag.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The types are technically typedef equivalent, but it's less confusing
to use the types that agree with the name of the constructor.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
on Linux, sending the header and then the message data does severely
impact performances of ptl/tcp :
on the receiver, reading the data can often result in an PMIX_ERR_RESOURCE_BUSY
or PMIX_ERR_WOULD_BLOCK, which ends up degrading performances)
this commit send both header and message data at the same time via writev()
and makes ptl/tcp virtually as efficient as ptl/usock.
Short writev generally occur when the kernel buffer is full, so there is no
point for retrying in this case.
fwiw, no such degradation was observed on OSX.
Refs open-mpi/ompi#2657
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The MCA variable code calls the string from value function with a NULL
string to verify values. The verbosity enumerator was not correctly
checking for a non-NULL value before trying to set the string.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given.
Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example:
$ mpirun -npernode 2 ./mpi_memprobe
Sampling memory usage after MPI_Init
Data for node rhc001
Daemon: 12.483398
Client: 6.514648
Data for node rhc002
Daemon: 11.865234
Client: 4.643555
Sampling memory usage after MPI_Barrier
Data for node rhc001
Daemon: 12.520508
Client: 6.576660
Data for node rhc002
Daemon: 11.879883
Client: 4.703125
Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.
Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).
Begin first cut at memory profiler
Some minor cleanups of memprobe
Signed-off-by: Ralph Castain <rhc@open-mpi.org>