The eviction callback, for convenience (and to avoid code
duplication), use to call opal_hotel_checkout(). However,
opal_hotel_checkout() deletes the eviction event -- which is fine to
do when opal_hotel_checkout() is invoked by the application. But when
it's invoked by the same event that it's deleting, it can cause Bad
Things to happen.
For simplicity, instead of invoking opal_hotel_checkout() from the
eviction callback, just duplicate the checkout logic into the eviction
callback function (and skip the delete-the-evict-event part).
For good measure, put a comment in all three places where the checkout
logic occurs (because it's inlined): don't change this logic without
changing all 3 places.
Finally, also add a line in the docs for opal_hotel_init() warning
users from calling opal_hotel_checkout() from their eviction
callback.
`cm_message_event_active == 1` but main thread has already stopped
processing messages and thus we will have the situation where one
message was left unhandled leading to a hang.
setmntent() doesn't support root_fd, but manual parsing of
/proc/mounts is fragile, and actually buggy for very long mount lines
(see open-mpi/hwloc#142 (comment)).
Since we only openat("/proc/mounts") there, just manually concatenate
the fsroot_path and use setmntent().
Thanks to Nathan Hjelm for the report.
(Cherry-picked from open-mpi/hwloc@d2d07b9a22)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per open-mpi/ompi#1230, add a comment explaining why we write to a
temporary file and then rename(2) the file, just so that future code
maintainers don't wonder why we do this seemingly-useless step.
Update the configure logic for the new pmix120 component
ckpt
Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates
Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works
Cleanup the rename files to use the pretty macros
The problem was in mca_btl_openib_proc_create. This function may be called
from several places simultaneously:
* from the main thread when somebody wants to do `MPI_Send()` (for example) for
the first time;
* from udcm if the counterpart peer is trying to connect and `mca_btl_openib_get_ep()`
is called.
In this case one of the threads may add an uninitialized proc structure
to the `mca_btl_openib_component.ib_procs` and the other will read it and
treat as initialized.
This commit turns ib_proc initialization into a single atomic operation.
Move .so version numbers to their appropriate project in the top-level
VERSION file. Also add the project name to all .so version number
names. Remove no-longer-used .so names.
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.
Closes#1204 (replaces it)
Fixes#1064
It was decided some time ago that there is no benefit to using any
per-peer receive queues on infiniband. At the time we decided not to
change the default but that objection has been dropped. This commit
changes the 128 message queue to use SRQ instead of PP. This has no
impact on iWarp which sets the default in a different way.
Closesopen-mpi/ompi#1156
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
When PtlMDBind was removed, the result check was left in which
causes intermittent failures depending on the junk value found in
the 'ret' variable. The commit removes the result check.
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes
Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.
Update the PMIx code and continue attempting to debug direct modex
Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.
Update PMIx to v1.1.2
This commit removes a check that causes mca_base_group_register to
improperly create a new group instead of using an existing group
when the project and framework names are the same. This check was
originally intended to prevent forming groups with names like
ompi_ompi, opal_opal, etc but there is no reason why we shouldn't
allow that.
Fixesopen-mpi/ompi#1155
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Mofed 2.2 does not have the IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG attribute
flag. Add a check to fix compilation for mofed 2.2. This commit only
fixes complilation with the older mofed. It will not allow an Open MPI
compiled with mofed 2.3 or newer to work on a machine with mofed 2.2.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There was a bug with the way the cray pmix component
was setting the locality property for ranks on the
same node, etc.
Improve location/syntax of a comment block.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
If btl-portals4 is configured to use logical mapping of ranks to
physical nodes, then the endpoint must have the rank field set.
This commit fixes a bug that caused the endpoint to have the
nid/pid instead of the rank if the endpoint already exists.
some of the collective modules. Added a new function
opan_datatype_span, to compute the memory span of
count number of datatype, excluding the gaps in the
beginning and at the end. If a memory allocation is
made using the returned value, the gap (also returned)
should be removed from the allocated pointer.
This update adds an additional check (if supported) to see if 8-byte
atomics are supported by the hardware. If 8-byte atomics are not
supported the atomics support is disabled.
This commit also includes some cleanup.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit allows to control output during abnormal oshmem/ompi application
termination.
Fixed issue in backtrace output. HAVE_BACKTRACE was never set so user was limited
in control of this variable.
Two related mca variables are moved to opal layer. Corresponding aliases are
added for ompi and oshmem.
This commit adds support for fetch-and-add and compare-and-swap when
using the mlx5 driver. The support is only enabled if the expanded
verbs interface is detected. This is required because mlx5 HCAs return
the atomic result in network byte order. This support may need to be
tweaked if Mellanox commits their changes into upstream verbs.
Closesopen-mpi/ompi#1077
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
OPAL free lists can be initialized with a fragment size that differs
from the size of objects from a class. This allows the free list code
to support OPAL objects that have flexible array members.
Unfortunately the free list code will throw out the desired length in
some cases. The code in question was committed in
open-mpi/ompi@90fb58de. The side effects of this are varied and can
cause segmentation faults, assert failures, hangs, etc. This commit
adds a check to ensure the requested size is at least as large as the
class size and makes opal_free_list allocations always honor the
requested fragment size (as long as it is larger than the class
size).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
1. Fix: old v1.6-era code reset the stats-emitting event to fire twice
for each time period.
1. Add the usNIC device name to the output for differentiating the
output in multi-rail scenarios.
This commit revives the component retention functionality that was
removed as part of the component repository rewrite. The new
mca_base_component_repository_retain_component function works by
preventing the dlclosing of a dynamic component until a matching call
to mca_base_component_repository_release is made.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
to continue current default behavior.
Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_
modex is used.
Move the fake usnic IBV provider out of common/verbs and into a new
common/verbs_usnic component that is always statically linked into
libopen-pal. The fake provider is registered with libibverbs at run
time, but there is no *un*register IBV API. Hence, we can't let the
code containing this provider be dlclosed -- which means it needs to
be statically linked into libopen-pal.
Fixesopen-mpi/ompi#1060.
if CFLAGS and/or CPPFLAGS are passed to the ompi configure command line, pmix1xx
configure will not use the correct ones previously passed in the environment
see discussion started at http://www.open-mpi.org/community/lists/devel/2015/10/18159.php
Thanks Siegmar Gross for bringing this to our attention
This commit fixes the following bugs:
- On send failure release newly allocated message.
- In the destructor for udcm_message_sent_t always remove the send
timeout event from the event base. Failure to do this can lead to
memory corruption since the destructor may be called from an event
callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This was fixed on my btl 3.0 branch but the changeset got lost in a
rebase. Fixes issues with lock ups when using osc/rdma.
References open-mpi/ompi#1010
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The handling of RDMA get alignment in ugni BTL for Aries
(cray xc) was wrong, resulting in very poor bandwidth
for ugni BTL on aries.
Verified using osu_bw now gives sensible bandwidth on
Aries.
Fixes#1005
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
This commit removes the service and async event threads from the
openib btl. Both threads are replaced by opal progress thread
support. The run_in_main function is now supported by allocating an
event and adding it to the sync event base. This ensures that the
requested function is called as part of opal_progress.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a access_flags argument to the mpool registration
function. This flag indicates what kind of access is being requested:
local write, remote read, remote write, and remote atomic. The values
of the registration access flags in the btl are tied to the new flags
in the mpool. All mpools have been updated to include the new argument
but only the grdma and udreg mpools have been updated to make use of
the access flags. In both mpools existing registrations are checked
for sufficient access before being returned. If a registration does
not contain sufficient access it is marked as invalid and a new
registration is generated.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Otherwise libpciaccess sends a big error message to stderr:
Error opening /devices/pci@0,0:reg: Permission denied
(cherry picked from commit open-mpi/hwloc@d93c7c0960)
In the default mode of operation, the Portals4 components support
dynamic add_procs().
The Portals4 components have two alternate modes (flow control and
logical-to-physical) that require knowledge of all procs at startup.
In these modes, mtl-portals4 sets the MCA_MTL_BASE_FLAG_REQUIRE_WORLD
flag and btl-portals4 sets the MCA_BTL_FLAGS_SINGLE_ADD_PROCS flag
to tell the PML that we need all the procs in one add_procs() call.
After PMIx integration, the thrid parameter to OPAL_MODEX_RECV() is
opal_process_name_t instead of opal_proc_t. This commit replaces
proc with &proc->proc_name.
This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.
Also do a little cleanup to avoid bombarding the user with multiple error messages.
Thanks to Patrick Begou for reporting the problem
Fix CID 1312120: Uninitialized scalar variable
The response type will always be set unless a message of another type is passed to this function. To make sure that error is caught I am adding an assert.
Fix CID 1312116: Dereference after null check
This is a potential bug. If there is no endpoint data for an incoming connection a rejection should be sent. In this case we would just SEGV.
Fix CID 1312115: Dereference after null check
Clear error in the error message. Use the queue pair number that was passed in.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support to the pml, mtl, and btl frameworks for
components to indicate at runtime that they do not support the new
dynamic add_procs behavior. At the high end the lack of dynamic
add_procs support is signalled by the pml using the new pml_flags
member to the pml module structure. If the
MCA_PML_BASE_FLAG_REQUIRE_WORLD flag is set MPI_Init will generate the
ompi_proc_t array passed to add_proc from ompi_proc_world () instead
of ompi_proc_get_allocated ().
Both cm and ob1 have been updated to detect if the underlying mtl and
btl components support dynamic add_procs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
In the v1.10 opal_progress_thread emulation, ensure to create a
blocking libevent timer far in the future. Without that,
opal_event_loop() will return immediately (and therefore the progress
thread spins hard, stealing CPU cycles).
We try to keep the source code the same between master and v1.10. So
put the #if's back for OPAL_HAVE_HWLOC (and just hard-code it to 1 on
master) so that this code is also compilable in v1.10.
The OPAL_PROC_ON_* definitions have been changed from values to
flags. This should not cause any problems as these values were already
used as flags throughout the code base. Note, there will be a
difference between localities produced by the new code and the
old. For example, if a machine does not have a level-3 but two cores
share a level-1 or level-2 cache cache the level-3 bit will not be set
in the locality and OPAL_PROC_ON_LOCAL_L3CACHE will return 0. Before
this change it would have returned 1.
In addition the OPAL_PROC_ON_LOCAL_* macros have been simplified.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Looks like in ess_pmi_module.c u32 is being used
for retrieving OPAL_PMIX_LOCAL_SIZE, while s1/s2/cray
pmix components were storing as u16.
This commit fixes this problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
This commit makes two changes to the tcp btl:
- If a tcp proc does not exist when handling a new connection create
a new proc and use it. The current implementation uses the
opal_proc_by_name() function to get the opal_proc_t then calls
add_procs on all btl modules. It may be sufficient to just call
add_procs until an endpoint is created so this may change somewhat.
- In add_procs add a check for an existing endpoint before creating
one.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds an opal hash table to keep track of mapping between
process identifiers and ompi_proc_t's. This hash table is used by the
ompi_proc_by_name() function to lookup (in O(1) time) a given
process. This can be used by a BTL or other component to get a
ompi_proc_t when handling an incoming message from an as yet unknown
peer.
Additionally, this commit adds a new MCA variable to control the new
add_procs behavior: mpi_add_procs_cutoff. If the number of ranks in
the process falls below the threshold a ompi_proc_t is created for
every process. If the number of ranks is above the threshold then a
ompi_proc_t is only created for the local rank. The code needed to
generate additional ompi_proc_t's for a communicator is not yet
complete.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates each non-compliant btl to send the
MCA_BTL_FLAGS_SEND flag in the btl_flags field if send is
supported. This fixes a problem identified after the latest bml/r2
update which excplicitly checks for the send flag.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Add more stubs to reduce likelihood of future
mysterious segfaults if some of the newer pmix
funcs start to get used within ompi.
Add a get_version to return the version of the
Cray PMI library being used, since the Cray PMI
library actually has a function to get that info.
Be more accurate about which functions have a hope
of being implemented using Cray PMI and those which
never will.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
* update to configury to silence ident messages (thanks Gilles!)
* fix for warnings Jeff saw when get didn't find the requested data
* fix for Mac OSX operations
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
There is currently a path through the grdma mpool and vma rcache that
leads to deadlock. It happens during the rcache insert. Before the
insert the rcache mutex is locked. During the call a new vma item is
allocated and then inserted into the rcache tree. The allocation
currently goes through the malloc hooks which may (and does) call back
into the mpool if the ptmalloc heap needs to be reallocated. This
callback tries to lock the rcache mutex which leads to the
deadlock. This has been observed with multi-threaded tests and the
openib btl.
This change may lead to some minor slowdown in the rcache vma when
threading is enabled. This will only affect larger message paths in
some of the btls.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This new class is the same as the opal_mutex_t class but has a
different constructor. This constructor adds the recursive flag to the
mutex attributes for the lock. This class can be used where there may
be re-enty into the lock from within the same thread.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There were several issues preventing the openib btl from running in
thread multiple mode:
- Missing locks in UDCM when generating a loopback endpoint. Fixed in
open-mpi/ompi@8205d79819.
- Incorrect sequence numbers generated in debug mode. This did not
prevent the openib btl from running but instead produced incorrect
error messages in debug builds.
- Recursive locking of the rcache lock caused by the malloc
hooks. This is fixed by open-mpi/ompi#827
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
When using eager RDMA in debug builds the openib btl generates a
sequence number for each send. The code independently updated the head
index and the sequence number for the eager rdma transaction. If
multiple threads enter this code at the same time and run in the
following order:
thread 1: update sequence (0 -> 1)
thread 2: update sequence (1 -> 2)
thread 2: update head (0 -> 1)
thread 1: update head (1 -> 2)
the sequence number for head[0] gets 1 and the sequence number for
head[1] gets 0. The fix is to generate the sequence number from the
head index.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds implementations for opal_atomic_lifo_pop and
opal_atomic_lifo_push that make use of the load-linked and
store-conditional instruction. These instruction allow for a more
efficient implementation on supported platforms.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds implementations of opal_atomic_ll_32/64 and
opal_atomic_sc_32/64. These atomics can be used to implement more
efficient lifo/fifo operations on supported platforms. The only
supported platform with this commit is powerpc/power.
This commit also adds an implementation of opal_atomic_swap_32/64 for
powerpc.
Tested with Power8.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit removes alpha asm support. No current processor
manufacturer makes chips compatible with DEC alpha and no
participating organization has alpha processors. This makes it
difficult to support alpha via assembly.
This doesn't mean Open MPI will no longer build/work on alpha
processors. It should continue to work with gcc's builtin sync
atomics.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Up until this point we have had inconsistent usage for MCA verbosity
levels. This commit attempts to correct this by recommending
components use these standard levels: none (0), error (1), warn (10),
info (20), debug (40), and trace (60).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
this test seems broken :
- some false positive were reported
- it fails to detect some OFED version mismatch
this commit simply removes this test, which means the application
will likely fail if XRC is used ad OFED version is different
between compile time and runtime
This code really had no purpose; just assign FI_VERSION(1, 1). This
fixes CID 1315274.
Also clarify the commet about why we still retain libfabric v1.0.0
compatibility code, even though configure.m4 requires libfabric >= v1.1.0.