- replace MAXHOSTNAMELEN with a hard-coded 1024.
unlike Linux, Solaris defines MAXHOSTNAMELEN in <netdb.h>,
so use a hard-coded value to keep the test simple
- stdout cannot be assigned on Solaris, so use freopen instead
(back-ported from upstream commit pmix/master@a63f6e53f4)
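A minimal sketch of the two changes described above, not the actual test code. The macro name `HOSTNAME_BUF_LEN` and the helper `redirect_stdout()` are hypothetical; only the value 1024 and the use of `freopen()` come from the commit message.

```c
#include <assert.h>
#include <stdio.h>

/* Fixed buffer size used in place of MAXHOSTNAMELEN (hypothetical name). */
#define HOSTNAME_BUF_LEN 1024

/* Redirect stdout to `path` without assigning to `stdout`, which is not
 * an assignable lvalue on every platform.
 * Returns 0 on success, -1 on failure. */
int redirect_stdout(const char *path)
{
    assert(NULL != path);
    /* freopen() re-associates the existing stdout stream with `path`. */
    return (NULL == freopen(path, "w", stdout)) ? -1 : 0;
}
```

A caller would declare `char hostname[HOSTNAME_BUF_LEN];` and call `redirect_stdout()` once, before any output that should land in the file.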
this is a convenience macro, similar to the PMIX_LIST_FOREACH macro,
that can be used to iterate over all the key/value pairs of a pmix_hash_table_t
(back-ported from upstream commit pmix/master@349971c68c)
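To illustrate the general shape of such a foreach macro, here is a toy sketch. It does not reproduce the PMIx macro or the pmix_hash_table_t API: `toy_table_t`, `toy_entry_t`, and `TOY_TABLE_FOREACH` are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TABLE_SLOTS 8

/* Hypothetical, fixed-size stand-in for a hash table. */
typedef struct {
    const char *key;
    int value;
    int used;      /* non-zero if the slot holds a key/value pair */
} toy_entry_t;

typedef struct {
    toy_entry_t slots[TOY_TABLE_SLOTS];
} toy_table_t;

/* Visit every occupied slot, assigning its key and value to k and v. */
#define TOY_TABLE_FOREACH(k, v, table)                              \
    for (size_t toy_i_ = 0; toy_i_ < TOY_TABLE_SLOTS; ++toy_i_)     \
        if ((table)->slots[toy_i_].used &&                          \
            ((k) = (table)->slots[toy_i_].key,                      \
             (v) = (table)->slots[toy_i_].value, 1))

/* Example use: sum all values stored in the table. */
int toy_table_sum(const toy_table_t *table)
{
    const char *key;
    int value, sum = 0;

    assert(NULL != table);
    TOY_TABLE_FOREACH(key, value, table) {
        (void) key;   /* the key is available here but unused */
        sum += value;
    }
    return sum;
}
```

The point of the macro is that the caller writes a plain loop body and never touches the table's internal layout.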
We only support running with libfabric v1.3 or greater. So it's safe
to remove the legacy/adaptive cq_readerr() behavior.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
There are critical usnic libfabric AV insert bugs before v1.3, so
don't allow any version prior to v1.3 at run time (still allow
*compiling* with earlier versions, though, since the ABI guarantees
allow us to compile with an earlier libfabric and run with a later
libfabric).
Switch to using fi_version() to check the version (instead of calling
fi_getinfo()) as a potentially lighter-weight / simpler solution.
This allows us to only call fi_getinfo() once.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
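The run-time check described above can be sketched as follows. This is an illustration, not the usnic BTL code: the `TOY_*` macros are hypothetical stand-ins that mirror the (major << 16 | minor) packing used by libfabric's FI_VERSION(), and `runtime_version_ok()` is a hypothetical helper that would be given the value returned by fi_version().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring libfabric's version packing. */
#define TOY_VERSION(major, minor) \
    ((((uint32_t)(major)) << 16) | (uint32_t)(minor))
#define TOY_MAJOR(v) ((v) >> 16)
#define TOY_MINOR(v) ((v) & 0xFFFFu)

/* Return non-zero if the run-time library version is v1.3 or later.
 * Packed versions compare correctly as plain integers because the
 * major number occupies the high bits. */
int runtime_version_ok(uint32_t runtime_version)
{
    return runtime_version >= TOY_VERSION(1, 3);
}
```

Because the check needs only one integer, it can run before any call to fi_getinfo().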
The (effective) "+42" computation was, in fact, the incorrect answer
in this case (gasp!).
We should just take the max_msg_size from the command (which came from
the libfabric endpoint max_msg_size attribute in the client) and
subtract off the max header size: 68 (which is explained in the
comment). This will result in a "large" message size which is likely
slightly smaller than the MTU, but still right up near the MTU, and
therefore good enough.
Note: the old computation (i.e., -(68-42)) worked fine when we asked
for Libfabric API v1.1 because the usnic provider would return a
max_msg_size that was already less than the MTU due to FI_PREFIX
behavior shenanigans. Once we started asking for Libfabric API v1.4,
the usnic Libfabric provider started returning (MTU + prefix_size),
and the -(68-42) computation started giving a value that was over the
MTU. This caused sendto() on the connectivity checker UDP socket
to fail.
This commit also removes an old/misleading comment.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
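The arithmetic above can be checked with a small sketch. The function names are hypothetical, and the MTU and prefix values suggested in the usage note are assumptions for illustration only; the constants 68 and 42 come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HEADER_SIZE 68   /* max header size named in the commit */

/* New computation: subtract the full max header size. */
size_t large_msg_size_new(size_t max_msg_size)
{
    assert(max_msg_size > MAX_HEADER_SIZE);
    return max_msg_size - MAX_HEADER_SIZE;
}

/* Old computation: effectively -(68 - 42), i.e. subtract only 26. */
size_t large_msg_size_old(size_t max_msg_size)
{
    assert(max_msg_size > (MAX_HEADER_SIZE - 42));
    return max_msg_size - (MAX_HEADER_SIZE - 42);
}
```

For example, assuming an MTU of 1500 and a provider that reports max_msg_size = 1500 + 42 (the prefix size here is an assumption), the new computation gives 1474, which is below the MTU, while the old one gives 1516, which exceeds it.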
* Add a configure time option to rename libmpi(_FOO).*
- `--with-libmpi-name=STRING`
* This commit only impacts the installed libraries.
Internal, temporary libraries have not been renamed to limit the
scope of the patch to only what is needed.
For example:
```shell
shell$ ./configure --with-libmpi-name=wookie
...
shell$ find . -name "libmpi*"
shell$ find . -name "libwookie*"
./lib/libwookie.so.0.0.0
./lib/libwookie.so.0
./lib/libwookie.so
./lib/libwookie.la
./lib/libwookie_mpifh.so.0.0.0
./lib/libwookie_mpifh.so.0
./lib/libwookie_mpifh.so
./lib/libwookie_mpifh.la
./lib/libwookie_usempi.so.0.0.0
./lib/libwookie_usempi.so.0
./lib/libwookie_usempi.so
./lib/libwookie_usempi.la
shell$
```
These three pmix components use the same class name;
declare it as static so Open MPI can be built with --disable-dlopen
Thanks to Limin Gu for the report
The rdmacm CPC in the openib BTL is not thread safe. The rdmacm CPC
should disqualify itself (instead of failing in random ways) if
MPI_THREAD_MULTIPLE is the thread level.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
This patch is based on the "RFC: Reenabling the TCP BTL over local
interfaces (when specifically requested)". It removes the hardcoded
exception for the local devices that has been enforced by the
TCP BTL. Instead, we exclude the local interface only via the
exclude MCA (both IPv4 and IPv6 local addresses are already in the
default if_exclude), which is also the behavior currently described in
our README file.
pmix_progress_thread_finalize() invokes libevent's event_base_free(),
so nothing that depends on libevent can be used afterwards.
Hence, pmix_client_globals.myserver must be PMIX_DESTRUCT'ed
before invoking pmix_progress_thread_finalize()
Prevent a race condition between a thread checking the count and then
entering cond_wait, and another thread setting the count to 0 and
signaling the condition.
Thanks to Pascal Deveze for catching the bug and for
the initial patch.
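One common way to close this kind of race is to check the count and signal the condition under the same mutex. The sketch below shows that pattern with POSIX threads; it is not the actual patch, and `toy_sync_t` and its functions are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             count;   /* outstanding completions */
} toy_sync_t;

void toy_sync_init(toy_sync_t *s, int count)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->count = count;
}

/* Waiter: the count is tested while holding the lock, so a signal
 * cannot slip in between the test and pthread_cond_wait(). */
void toy_sync_wait(toy_sync_t *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count > 0) {
        pthread_cond_wait(&s->cond, &s->lock);
    }
    pthread_mutex_unlock(&s->lock);
}

/* Completer: the count is updated and the condition signaled
 * while holding the same lock. */
void toy_sync_complete(toy_sync_t *s)
{
    pthread_mutex_lock(&s->lock);
    assert(s->count > 0);
    if (0 == --s->count) {
        pthread_cond_signal(&s->cond);
    }
    pthread_mutex_unlock(&s->lock);
}
```

Without the lock around the test, the waiter could read a non-zero count, lose the CPU, miss the signal, and then block forever.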
A "key" argument was added to pmix1_value_unload(), but it is unused,
and pmix1_value_unload() was sometimes invoked with two arguments instead of three.
Since the "key" argument is unused, simply remove it from the
subroutine prototype and calls.
The add_64, sub_64, and cmpset_64 atomics used "+m" (*addr) to
indicate the asm also writes the memory location. This is better than
using a memory clobber. PGI 16.9 introduced a bug that causes a
compiler failure on the "+m" constraint (input/output). It seems to
work with "=m" (output) which matches the 32-bit atomics.
Fixes open-mpi/ompi#2086
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
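For readers unfamiliar with the constraint syntax, here is a minimal sketch of a 64-bit atomic add that uses the "+m" (input/output) memory constraint. It is an illustration with a hypothetical function name, not the Open MPI source, and it falls back to a compiler builtin on targets other than x86-64.

```c
#include <assert.h>
#include <stdint.h>

/* Atomically add `delta` to `*addr`. */
void toy_add_64(volatile int64_t *addr, int64_t delta)
{
#if defined(__x86_64__) && defined(__GNUC__)
    /* "+m" tells the compiler that the asm both reads and writes *addr,
     * so no blanket "memory" clobber is needed. */
    __asm__ __volatile__("lock; addq %1, %0"
                         : "+m" (*addr)
                         : "r" (delta)
                         : "cc");
#else
    /* Portable fallback for other targets (GCC/Clang builtin). */
    __atomic_fetch_add(addr, delta, __ATOMIC_SEQ_CST);
#endif
}
```

The workaround described in the commit changes only the constraint on the memory operand; the instruction itself stays the same.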
The LT_* macros overwrite the enable_dlopen variable,
so it must be tested and saved before invoking LT_INIT.
Delay the invocation of the LT_* macros and use the
PMIX_ENABLE_DLOPEN_SUPPORT variable to figure out whether
--disable-dlopen was specified
This commit fixes an abort during finalize because pending events were
removed from the list twice.
References #2030
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
WAIT_SYNC_INIT(sync,0); WAIT_SYNC_RELEASE(sync);
hung because sync->signaled was initialized to true, and
there is no reason to invoke WAIT_SYNC_SIGNALED(sync) before
WAIT_SYNC_RELEASE(sync).
This commit initializes sync->signaled to true unless the count is zero.
Thanks George for the review and guidance.
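The initialization rule can be sketched with a toy structure. This is not the Open MPI wait_sync code: the `toy_*` names are hypothetical, and the sketch assumes that release waits until the flag has been cleared by the signaling side.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int count;               /* completions still expected */
    volatile bool signaled;  /* field name taken from the commit message */
} toy_wait_sync_t;

/* With a zero count nobody will ever signal, so the flag must start
 * out false; otherwise a release would wait forever. */
void toy_wait_sync_init(toy_wait_sync_t *sync, int count)
{
    sync->count = count;
    sync->signaled = (0 != count);
}

/* Called by the signaling side once it is done with the object. */
void toy_wait_sync_signaled(toy_wait_sync_t *sync)
{
    sync->signaled = false;
}

/* Assumed release condition: safe to proceed once the flag is clear. */
bool toy_wait_sync_can_release(const toy_wait_sync_t *sync)
{
    return !sync->signaled;
}
```

With this rule, initializing with a count of zero and releasing immediately no longer blocks.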
We commonly see messages on the users list where a peer has hung up
because it has crashed. Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit adds selective use of a compiler-specific pragma to
silence the numerous warnings the Sun/Oracle/Studio compilers emit for
the GNU-style inline asm used in atomic.h.
Thanks Paul Hargrove for the initial patch and the guidance.
This commit contains the following changes:
- There is a bug in the PGI 16.x betas for ppc64 that causes them to
emit the incorrect instruction for loading 64-bit operands. If not
cast to void * the operands are loaded with lwz (load word and
zero) instead of ld. This does not affect optimized mode. The
workaround is to cast to void *, implemented similarly to a
workaround for an xlc bug.
- Actually implement 64-bit add/sub. These functions were missing and
fell back to the less efficient compare-and-swap implementations.
Thanks to @PHHargrove for helping to track this down. With this update
the GCC inline assembly works as expected with pgi and ppc64.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
It looks like one help message was accidentally pasted in the middle
of another. Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit prevents the connection code from trying to connect an
endpoint if the directed datagram has been posted but not received.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>