It turns that there is an incompatibility between the Cray PMI
library and the default configuration for building Open MPI (master).
To work around this, we now disable use of aprun for direct launch
of Open MPI jobs except under specific conditions.
The problem is that there are now (on master) packages getting
initialized that do not work properly across a fork operation.
As part of a constructor in the Cray PMI library, a fork operation
is done to simplify use of shared memory between the
processes in a job on the same node. This ends up thoroughly
messing up the Open MPI initialization process in the case
that dlopen support is enabled. The initialization process gets
about half-way through when the PMIX framework is opened and
components are loaded, which triggers the Cray PMI constructor
and hence the fork operation.
There are two workarounds for this:
1) configure Open MPI for Cray XE/XC systems using aprun with the
--disable-dlopen option
2) set the PMI_NO_FORK environment variable in the shell in which
the aprun command is run.
Without taking these measures, a Open MPI job will just hang at
job startup in the first attempt to "thread-shift" the PMIx
fence_nb operation. Additional hangs occur at shutdown if this
problem is worked around, again due to the insertion of a fork
operation halfway through the Open MPI initialization procedure.
This commit detects if the conditions that bring out the hang
situation are present, and if so, prints out a message and
aborts the job launch.
Note on systems using slurm, the PMI_NO_FORK environment variable
is set as part of the srun job launch, hence this issue is avoided
on those systems.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
opal_convertor_pack() might pack less bytes than requested,
so always set frag->segments[0].seg_len.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
PR open-mpi/ompi#2432 introduced a regression where configure
and build with --disable-dlopn caused build failure owing
to unresolved alps lli symbols in the libopal-pal shared library.
This commit fixes this problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Enhance the cray pmix component to set some OMPI internal
env. variables used to set some key/value pairs
on the MPI_INFO_ENV object. This allows more of the
ompi-tests ibm unit tests to pass when using aprun/srun
direct launch and Cray PMI.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
since pthreads are now mandatory, the MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD
is always true and hence can be safely removed
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Remove BTL_OPENIB_FAILOVER_ENABLED code in the openib btl source.
Remove the failover-specific files from the openib btl.
Update the openib/Makefile.am accordingly.
Remove the -enable-openib-failover config logic.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
This commit rewrites much of the btl/self component to fix a long
standing memory usage bug. Before this commit the prepare_src path
would always allocate a max send fragment (256kB). This caused the
rank to allocate 32 * 256k useless buffers from one send. This commit
makes the following changes:
- Add the MCA_BTL_FLAGS_GET flag by default. No reason not to set it.
- Reduce the eager limit, max send size, buffers per allocation, and
maximum buffer count per fragment size. These changes should have
no noticible affect on performance but should greatly reduce the
memory usage of the component.
- Implement the sendi function. This should reduce self send latency
somewhat.
- Rewrite prepare_src to never allocate a eager or max send fragment
for contiguous data.
- add_procs needs to return something in the peer array for the proc
self not just set the reachability bit. Now stores (void *) 1.
- Various cleanups. Removed and unused file.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The vader btl kept a per-peer registration cache to keep track of
attachments. This is not really a problem with small numbers of local
ranks but can be a problem with large SMP machines. To reduce the
footprint there is now one registration cache for all xpmem
attachments. This will probably increase the lookup time for large
transfers but is a worthwhile trade-off.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
libnl and libnl-3 are known to conflict with each other, so detect
and abort if these two libs are both used directly (e.g. Open MPI
uses libnl-3) or indirectly (e.g. libibverbs.so might depend on libnl)
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
OMPI header cannot be included in OPAL source code, hence removed it.
Fixes: (740b636db) btl/openib: Disqualify rdmacm CPC if
MPI_THREAD_MULTIPLE.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
- replace MAXHOSTNAMELEN with hardcoded 1024.
unlike Linux, Solaris #define MAXHOSTNAMELEN in <netdb.h>,
so use a hard coded value to keep the test simpl
- stdout cannot be assigned on Solaris, so use freopen instead
(back-ported from upstream commit pmix/master@a63f6e53f4)
this is a convenience macro similar to the PMIX_LIST_FOREACH macro,
that can be used to iterate on all the key/value pairs of a pmix_hash_table_t
(back-ported from upstream commit pmix/master@349971c68c)
We only support running with libfabric v1.3 or greater. So it's safe
to remove the legacy/adaptive cq_readerr() behavior.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
There are critical usnic libfabric AV insert bugs before v1.3, so
don't allow any version prior to v1.3 at run time (still allow
*compiling* with earlier versions, though, since the ABI guarantees
allow us to compile with an earlier libfabric and run with a later
libfabric).
Switch to using fi_version() to check the version (instead of calling
fi_getinfo()) as a potentially lighter-weight / simpler solution.
This allows us to only call fi_getinfo() once.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The (effective) "+42" computation was, in fact, the incorrect answer
in this case (gasp!).
We should just take the max_msg_size from the command (which came from
the libfabric endpoint max_msg_size attribute in the client) and
subtract off the max header size: 68 (which is explained in the
comment). This will result in a "large" message size which is likely
slightly smaller than the MTU, but still right up near the MTU, and
therefore good enough.
Note: the old computation (i.e., -(68-42)) worked fine when we asked
for Libfabric API v1.1 because the usnic provider would return a
max_msg_size that was already less than the MTU due to FI_PREFIX
behavior shenanigans. Once we started asking for Libfabric API v1.4,
the usnic Libfabric provider started returning (MTU + prefix_size),
and the -(68-42) computation started giving a value that was over the
MTU. This caused sendto() on the connectivity checker UDP socket
to fail.
This commit also removes an old/misleading comment.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
* Add a configure time option to rename libmpi(_FOO).*
- `--with-libmpi-name=STRING`
* This commit only impacts the installed libraries.
Internal, temporary libraries have not been renamed to limit the
scope of the patch to only what is needed.
For example:
```shell
shell$ ./configure --with-libmpi-name=wookie
...
shell$ find . -name "libmpi*"
shell$ find . -name "libwookie*"
./lib/libwookie.so.0.0.0
./lib/libwookie.so.0
./lib/libwookie.so
./lib/libwookie.la
./lib/libwookie_mpifh.so.0.0.0
./lib/libwookie_mpifh.so.0
./lib/libwookie_mpifh.so
./lib/libwookie_mpifh.la
./lib/libwookie_usempi.so.0.0.0
./lib/libwookie_usempi.so.0
./lib/libwookie_usempi.so
./lib/libwookie_usempi.la
shell$
```
theses three pmix components use the same class name,
declare it as static so Open MPI can be built with --disable-dlopen
Thanks Limin Gu for the report
The rdmacm CPC in the openib BTL is not thread safe. The rdmacm CPC
should disqualify itself (instead of failing in random ways) if
MPI_THREAD_MULTIPLE is the thread level.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
This patch is based on the "RFC: Reenabling the TCP BTL over local
interfaces (when specifically requested)". It removes the hardcoded
exception for the local devices that has been enforced by the
TCP BTL. Instead, we exclude the local interface only via the
exclude MCA (both IPv4 and IPv6 local addresses are already in the
default if_exclude), which is also the behavior currently described in
our README file.
pmix_progress_thread_finalize() invokes libevent event_base_free,
so all libevent stuff cannot be used after.
Hence, pmix_client_globals.myserver must be PMIX_DESTRUCT'ed
before invoking pmix_progress_thread_finalize()
pmix1_value_unload() was added a "key" argument which is unused,
and pmix1_value_unload() was sometimes invoked with two arguments instead of three.
since the "key" argument is unused, simply remove it from the
subroutine prototype and calls.
the LT_* macros do overwrite the enable_dlopen variable,
so it must be tested and saved before invoking LT_INIT.
delay the invokation of the LT_* macros and use the
PMIX_ENABLE_DLOPEN_SUPPORT variable to figure out whether
--disable-dlopen was invoked
This commit fixes an abort during finalize because pending events were
removed from the list twice.
References #2030
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
We commonly see messages on the users list where a peer has hung up
because it has crashed. Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
It looks like one help message was accidentally pasted in the middle
of another. Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit prevents the connection code from trying to connect an
endpoint if the directed datagram has been posted but not received.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The xlc compiler seems to behave in a different way that gcc when it
comes the inline asm. There were two problems with the code with xlc:
- The TOC read in mca_patcher_base_patch_hook used the syntax
register unsigned long toc asm("r2") to read $r2 (the TOC
pointer). With gcc this seems to behave as expected but with xlc
the result in toc is not the same as $r2. I updated the code to use
asm volatile ("std 2, %0" : "=m" (toc)) to load the TOC pointer.
- The OPAL_PATCHER_BEGIN macro is meant to be the first thing in a
hook. On PPC64 it loads the correct TOC pointer (thanks to
mca_patcher_base_patch_hook) and saves the old one. The
OPAL_PATCHER_END macro restores the TOC pointer. Because we *need*
the TOC to be correct before it is accessed in the hook the
OPAL_PATCHER_BEGIN macro MUST come first. We did this and all was
well with gcc. With xlc on the other hand there was a TOC access
before the assembly inserted by OPAL_PATCHER_BEGIN. To fix this
quickly I broke each hook into a pair of function with the
OPAL_PATCHER_* macros on the top level functions. This works around
the issue but is not a clean way to fix this. In the future we
should 1) either update overwrite to not need this, or 2) figure
out why xlc is not inserting the asm before the first TOC read.
This fixesopen-mpi/ompi#1854
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
With libfabric v1.4, the usnic provider changed the values of its
fabric and domain name strings (compared to libfabric <v1.4). Update
the Open MPI usNIC BTL to handle both pre-v1.4 and v1.4 fabric/domain
names.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
and fail with a user friendly message if no method is available:
"sec: native cannot validate_cred on this system"
(back-ported from upstream pmix/master@c474a1fc60)
This commit fixes a race that can occur when two threads are in the
ugni progress function at the same time. This race occurs when one
thread calls GNI_PostDataProbeById then goes to sleep then another
thread calls GNI_PostDataProbeById then GNI_EpPostDataWaitById before
the other thread wakes up. If this happens the first thread will print
a warning on GNI_EpPostDataWaitById about no matching post.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes two issues that can occur during a connection:
- Re-entry to connection progress from modex lookup. Added an
additional endpoint state that will keep the code from re-entering
the common endpoint create.
- Fixed a race between a process posting a directed datagram through
a send and a connection being progressed through opal_progress().
The progress code was not obtaining the endpoint lock before
attempting to update the endpoint. To limit the amount of code
changed for 2.0.1 this commit makes the endpoint lock recursive. In
a future update this may be changed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fixes PR https://github.com/open-mpi/ompi/pull/1687
The code that sets OPAL_HAVE_WORKING_EVENTOPS for internal libevent
was executed even if the external libevent component was configured.
As the result libevent progress wasn't called in opal_progress which
for example caused ring_c to hang when pml/ob1 was used.
Architecture is set by the ompi layer *after* job startup, so the key cannot
have the "pmix" prefix since optimizations in open-mpi/ompi@01a653d50a
otherwise architecture cannot be retrieved
This commits changed rand(3) and family in libevent to use internal
random function provided in opal to prevent pertubing user's random seed.
Fixesopen-mpi/ompi#1877
On Cray, PR #1846 introduced a double free
situation which led to all kinds of random memory
corruption problems.
This commit fixes this problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
The max inline send size on a queue pair is not available until after
the endpoint is connected. Before this commit the send flags
(including the inline flag) were set before this value was
initialized. This commit moves setting the send_flags down to
mca_btl_openib_put_internal which is only called after the endpoint is
connected. This fixes a bug when using osc/rdma.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug in the pmix2x client code where a loop
variable is not correctly incremented. This was leading to hangs and
crashes when creating intercommunicators. Also fixed two double
increments in other loops.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Blocking fence is used in yalla del proc. Native pmix exposes this functionality.
We need to expose it for SLURM's s1/s2 components as well.
Also this commit fixes uninitialized `rc` in fencenb's of both
components.
Thanks Jeff for the guidance
Fixesopen-mpi/ompi#1683
note:
in order to keep this commit easy to review, some AS_IF([...]) were replaced with
AS_IF([false], ...) or AS_IF_([true], ...)
these will be removed and re-idented in a subsequent commit
pmix cannot be built on alpine linux because of some missing includes.
uid_t and gid_t are defined in unistd.h or sys/types.h, and unistd.h
is not indirectly pulled under alpine linux, so do it manually.
Thanks N.L.K Nguyen for the report
(back-ported from upstream pmix/master@c8d55350a9)
This commit fixes a long standing bug in rdmacm. It is required that
the thread that calls mca_btl_openib_endpoint_cpc_complete holds the
endpoint lock. This was not the case for rdmacm. This causes debug
builds to abort. This change also required changing
mca_btl_openib_endpoint_send_cts to require the endpoint lock to be
held when calling.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit is an attempt to fix a hang in finalize of rdmacm. This fixes
a path where no rdmacm client is found for an endpoint.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a segmentation fault that occurs if a device can be
initialized but not used. In this case the devices_count is not equal
to the number of usable devices in the devices pointer array.
Thanks to @artpol84 for tracking this down.
Fixesopen-mpi/ompi#1823
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The way the gni btl is currently coded,
it will run completely out of gas on KNL at
123 processes/node. Since there are bound to be
those who try to run a MPI process/hyperthread
on KNL nodes, the fma sharing mode needs to be requested.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>