1
1
Граф коммитов

3215 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
6041467df0 Update to latest PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-01 14:47:44 -08:00
Gilles Gouaillardet
3a76a78bff btl/openib: plug a memory leak in btl_openib_register_mca_params()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-12-01 14:24:30 +09:00
Gilles Gouaillardet
c9aeccb84e opal/if: open the if framework once in opal_init_util
the if framework is no more open in opal_if*, which plugs
several memory leaks

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-12-01 14:24:30 +09:00
Gilles Gouaillardet
45732fd764 hwloc/base: fix a memory leak in buffer_cleanup()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-12-01 14:24:29 +09:00
Howard Pritchard
7ce3ca25ef Merge pull request #2458 from hppritcha/topic/pmix_cray_no_dlopen
pmix/cray: abort job if using aprun for general case
2016-11-25 14:03:20 -07:00
Howard Pritchard
eee9f7ae3a pmix/cray: abort job if using aprun for general case
It turns that there is an incompatibility between the Cray PMI
library and the default configuration for building Open MPI (master).
To work around this, we now disable use of aprun for direct launch
of Open MPI jobs except under specific conditions.

The problem is that there are now (on master) packages getting
initialized that do not work properly across a fork operation.
As part of a constructor in the Cray PMI library, a fork operation
is done to simplify use of shared memory between the
processes in a job on the same node.  This ends up thoroughly
messing up the Open MPI initialization process in the case
that dlopen support is enabled.  The initialization process gets
about half-way through when the PMIX framework is opened and
components are loaded, which triggers the Cray PMI constructor
and hence the fork operation.

There are two workarounds for this:
1) configure Open MPI for Cray XE/XC systems using aprun with the
   --disable-dlopen option
2) set the PMI_NO_FORK environment variable in the shell in which
   the aprun command is run.

Without taking these measures, a Open MPI job will just hang at
job startup in the first attempt to "thread-shift" the PMIx
fence_nb operation.  Additional hangs occur at shutdown if this
problem is worked around, again due to the insertion of a fork
operation halfway through the Open MPI initialization procedure.

This commit detects if the conditions that bring out the hang
situation are present, and if so, prints out a message and
aborts the job launch.

Note on systems using slurm, the PMI_NO_FORK environment variable
is set as part of the srun job launch, hence this issue is avoided
on those systems.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-25 06:28:19 -07:00
Gilles Gouaillardet
1a279c4ee9 btl/self: fix fragment segment length in mca_btl_self_prepare_src()
opal_convertor_pack() might pack less bytes than requested,
so always set frag->segments[0].seg_len.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-11-24 10:44:56 +09:00
Howard Pritchard
2bb4cffffd Merge pull request #2447 from hppritcha/topic/compiler_warning_swats
btl/ugni:vader swat some compiler warnings
2016-11-22 05:28:20 -07:00
Howard Pritchard
09f47fcf8e btl/ugni:vader swat some compiler warnings
Swat some compiler warnings.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-21 14:58:34 -06:00
Howard Pritchard
2cbc0e8472 pmix/cray: fix disable-dlopen problem
PR open-mpi/ompi#2432 introduced a regression where configure
and build with --disable-dlopn caused build failure owing
to unresolved alps lli symbols in the libopal-pal shared library.

This commit fixes this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-21 13:45:10 -06:00
Howard Pritchard
0bbb319246 Merge pull request #2444 from hppritcha/topic/cray_pmix_ws_cleanup
pmix/cray: whitespace cleanup
2016-11-21 06:03:56 -07:00
Howard Pritchard
08dce4f161 pmix/cray: whitespace cleanup
Get rid of tabs.  This is anti-ompi style.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-18 19:30:40 -07:00
Howard Pritchard
44f4663d0d Merge pull request #2432 from hppritcha/topic/fix_info_etc_w_alps
pmix/cray: set some envars for MPI_INFO_ENV object
2016-11-18 14:23:53 -07:00
Howard Pritchard
de3de131af pmix/cray: set some envars for MPI_INFO_ENV object
Enhance the cray pmix component to set some OMPI internal
env. variables used to set some key/value pairs
on the MPI_INFO_ENV object.  This allows more of the
ompi-tests ibm unit tests to pass when using aprun/srun
direct launch and Cray PMI.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-16 17:37:52 -06:00
Joshua Ladd
4907085c6f Add the ConnectX-5 device ID to openib BTL.
Signed-off-by: Joshua Ladd <jladd.mlnx@gmail.com>
2016-11-16 21:42:37 +02:00
Gilles Gouaillardet
8ef538adeb Merge pull request #2398 from bosilca/topic/tcp_endpoints_mutex
Protect the tcp_endpoints list from concurrent accesses.
2016-11-14 22:13:29 -07:00
Howard Pritchard
703b464c03 pmix: fix a typo in a help file
Fixes #2391

Thanks to @njoly for reporting

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-12 11:49:15 -07:00
George Bosilca
d0dddef53d
Protect the tcp_endpoints list from concurrent accesses.
Thanks Gilles for your help.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2016-11-11 00:06:03 -05:00
Gilles Gouaillardet
a49422fe84 btl/tcp: get rid of the MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD macro
since pthreads are now mandatory, the MCA_BTL_TCP_SUPPORT_PROGRESS_THREAD
is always true and hence can be safely removed

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-11-08 14:00:05 +09:00
Aboorva Devarajan
fb8e074583 powerpc: Add support for powerpcle in timer/pstat.
Signed-off-by: Aboorva Devarajan <abodevar@in.ibm.com>
2016-11-07 02:35:44 -05:00
Gilles Gouaillardet
7a2894f1e0 event/libevent2022: cleanup dependencies to the embedded libevent lib
configury force event/libevent2022 to be built as a static module,
so simplify Makefile.am

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-11-07 14:57:52 +09:00
Gilles Gouaillardet
6f7ed1f552 event/libevent2022: add missing dependencies to the embedded libevent lib
force the libevent2022 component rebuild if the embedded libevent is updated

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-11-04 11:13:44 +09:00
Jeff Squyres
a4ffa590c8 Merge pull request #2308 from hjelmn/vader_mem
btl/vader: reduce memory footprint when using xpmem
2016-11-02 10:28:26 -04:00
Steve Wise
7050969d47 openib btl: remove BTL_OPENIB_FAILOVER_ENABLED code
Remove BTL_OPENIB_FAILOVER_ENABLED code in the openib btl source.

Remove the failover-specific files from the openib btl.

Update the openib/Makefile.am accordingly.

Remove the -enable-openib-failover config logic.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
2016-11-01 14:45:36 -07:00
Ralph Castain
64873487b4 Remove the max_connections parameter from the radix component as it is confusing. Modify PMIx client init so that it simply returns the nspace/rank if called by a server - this allows the server to retrieve its assigned ID. Register the server's nspace so client-side operations can succeed
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-01 12:17:11 -07:00
Jeff Squyres
149b660666 btl/usnic: fix compiler warning
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-10-28 07:36:20 -07:00
rhc54
698dac108b Merge pull request #2309 from rhc54/topic/pmix
Update to latest PMIx master - mostly updates example codes, but incl…
2016-10-27 14:21:46 -07:00
Ralph Castain
f4a55118e6 Update to latest PMIx master - mostly updates example codes, but includes one critical cleanup during finalize.
NOTE: set dstore shared memory storage option to "on" by default

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-10-27 11:16:03 -07:00
Nathan Hjelm
9d92075e60 btl/self: rewrite to decrease memory usage (#2307)
This commit rewrites much of the btl/self component to fix a long
standing memory usage bug. Before this commit the prepare_src path
would always allocate a max send fragment (256kB). This caused the
rank to allocate 32 * 256k useless buffers from one send. This commit
makes the following changes:

 - Add the MCA_BTL_FLAGS_GET flag by default. No reason not to set it.

 - Reduce the eager limit, max send size, buffers per allocation, and
   maximum buffer count per fragment size. These changes should have
   no noticible affect on performance but should greatly reduce the
   memory usage of the component.

 - Implement the sendi function. This should reduce self send latency
   somewhat.

 - Rewrite prepare_src to never allocate a eager or max send fragment
   for contiguous data.

 - add_procs needs to return something in the peer array for the proc
   self not just set the reachability bit. Now stores (void *) 1.

 - Various cleanups. Removed and unused file.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-10-27 12:34:54 -04:00
Nathan Hjelm
a652a193ea btl/vader: reduce memory footprint when using xpmem
The vader btl kept a per-peer registration cache to keep track of
attachments. This is not really a problem with small numbers of local
ranks but can be a problem with large SMP machines. To reduce the
footprint there is now one registration cache for all xpmem
attachments. This will probably increase the lookup time for large
transfers but is a worthwhile trade-off.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-10-27 10:09:43 -06:00
Ralph Castain
f298f294e1 Update PMIx to latest master tarball. Ensure we set the HNP name for orted's so that PMIx_Lookup can find the server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-10-26 15:48:56 -07:00
Gilles Gouaillardet
f2a80dc09f configury: check libnl version and abort in case of conflict
libnl and libnl-3 are known to conflict with each other, so detect
and abort if these two libs are both used directly (e.g. Open MPI
uses libnl-3) or indirectly (e.g. libibverbs.so might depend on libnl)
2016-10-25 09:23:59 +09:00
Ralph Castain
649301a3a2 Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
rhc54
900ae15d49 Merge pull request #2221 from bharatpotnuri/master
btl/openib: remove unwanted ompi header inclusion in opal code.
2016-10-21 14:05:55 -05:00
Ralph Castain
9131eca9c6 Update to latest PMIx master 2016-10-20 21:13:40 -07:00
Ralph Castain
be3197fe27 Ensure that the libevent headers are installed for external libevent when --with-devel-headers is given. Correct the path for opal_config.h in the external hwloc header 2016-10-20 20:57:50 -07:00
Ralph Castain
2f966bf3bf Cleanup external PMIx v3 component for copy/paste errors - component and module require unique names 2016-10-20 09:11:46 -07:00
Ralph Castain
8113a8d1b0 Now that we are hiding symbols in the internal PMIx component, we cannot reuse that component for integration to the external PMIx master as the symbols don't match. So create a new "ext3x" component and copy the PMIx v3 integration over there.
Also, remove a couple of build-product files from the pmix3x component.
2016-10-18 13:15:32 -07:00
Ralph Castain
50c9f3de55 Ensure the PMIx progress thread is stopped prior to tearing anything down. Thanks to Gilles for spotting this error! 2016-10-18 00:27:52 -07:00
Gilles Gouaillardet
4e19cd51b1 hwloc/external: add a missing include file 2016-10-14 09:27:33 +09:00
Ralph Castain
6f65d0a173 Repair event notification support. Cleanup the long-suffering "epoll: warning" coming out of libevent whenever a process abnormally terminated.
Add changes to test program

Sync to PMIx master
2016-10-13 16:27:39 -07:00
Ralph Castain
6417f217e1 Turn PMIx dstore off by default as MTT was effectively broken 2016-10-13 08:14:51 -07:00
Potnuri Bharat Teja
29f1aa836f btl/openib: remove unwanted ompi header inclusion in opal code.
OMPI header cannot be included in OPAL source code, hence removed it.
Fixes: (740b636db) btl/openib: Disqualify rdmacm CPC if
MPI_THREAD_MULTIPLE.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
2016-10-13 16:21:36 +05:30
Ralph Castain
8f05beb1ec Sync pmix/master@cb53105 2016-10-11 20:54:59 -07:00
rhc54
ad156e3e91 Merge pull request #2207 from rhc54/topic/pmixupdate
Update PMIx support to latest PMIx master
2016-10-11 18:57:11 -05:00
Jeff Squyres
bcbf0bc4f9 usnic: s/OMPI/OPAL/
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-10-11 16:43:35 -07:00
Ralph Castain
6ce4b6d098 Eliminate -Wall from being hardcoded 2016-10-11 12:50:31 -07:00
Ralph Castain
1859b03416 Enable PMIx shared memory support by default 2016-10-11 12:18:01 -07:00
Ralph Castain
1d7d7c201b Update PMIx support to latest PMIx master 2016-10-11 10:17:23 -07:00
Ralph Castain
5b1484a836 Implement the backend support for process-generated event notification 2016-10-08 09:24:28 -07:00
Gilles Gouaillardet
b55dd2442a libevent2022: rename _event_strlcpy 2016-10-08 16:58:20 +09:00
Gilles Gouaillardet
c92e9a5406 use the new OPAL_HASH_TABLE_FOREACH convenience macro 2016-10-08 16:58:20 +09:00
Gilles Gouaillardet
f1f1fb15eb pmix3x: configury: output major, minor and release version after checking them
and hence fix the configure output

(back-ported from upstream commit pmix/master@7b7cdda2de)
2016-10-08 13:01:28 +09:00
Gilles Gouaillardet
f3af799608 pmix3x: misc fixes to get pmix build on Solaris
- replace MAXHOSTNAMELEN with hardcoded 1024.
  unlike Linux, Solaris #define MAXHOSTNAMELEN in <netdb.h>,
  so use a hard coded value to keep the test simpl
- stdout cannot be assigned on Solaris, so use freopen instead

(back-ported from upstream commit pmix/master@a63f6e53f4)
2016-10-08 13:01:28 +09:00
Gilles Gouaillardet
5cbfddb8f1 pmix3x: fix misc memory leaks
(back-ported from upstream commit pmix/master@1eff526929)
2016-10-08 13:01:28 +09:00
Gilles Gouaillardet
b4e4e4a5f1 pmix3x: enhance pmix_nspace_t destructor
PMIX_RELEASE all elements stored in the internal and modex hash tables

(back-ported from upstream commit pmix/master@b90674fc52)
2016-10-08 13:01:27 +09:00
Gilles Gouaillardet
f1dc033767 pmix3x: add the PMIX_HASH_TABLE_FOREACH macro
this is a convenience macro similar to the PMIX_LIST_FOREACH macro,
that can be used to iterate on all the key/value pairs of a pmix_hash_table_t

(back-ported from upstream commit pmix/master@349971c68c)
2016-10-08 13:01:27 +09:00
Jeff Squyres
67684be7c9 usnic: fix one last stray fabric_attr->name --> linux_device_name
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-10-04 18:17:38 -07:00
Jeff Squyres
8b77359cac usnic: remove some legacy libfabric 1.0/1.1 code
We only support running with libfabric v1.3 or greater.  So it's safe
to remove the legacy/adaptive cq_readerr() behavior.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-10-03 11:59:41 -07:00
Jeff Squyres
345c07a252 usnic: require libfabric >= v1.3 at run time
There are critical usnic libfabric AV insert bugs before v1.3, so
don't allow any version prior to v1.3 at run time (still allow
*compiling* with earlier versions, though, since the ABI guarantees
allow us to compile with an earlier libfabric and run with a later
libfabric).

Switch to using fi_version() to check the version (instead of calling
fi_getinfo()) as a potentially lighter-weight / simpler solution.
This allows us to only call fi_getinfo() once.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-10-03 11:59:41 -07:00
Jeff Squyres
b13813810f usnic: print a helpful message invoke PML error callback
The previous message was unhelpful / confusing.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-10-03 11:59:41 -07:00
Gilles Gouaillardet
7601e783cc pmix3x: sec/munge: add a missing include file
(cherry picked from upstream pmix/master@f7cfb11f6b)
2016-10-03 16:09:10 +09:00
Ralph Castain
e773c17cf3 Put show_help thru the PMIx "log" API. This pushes the show_help output from apps into the pmix thread, thus avoiding conflicts in the RML thread, which should help with thread lock situations. 2016-10-02 16:02:23 -07:00
Jeff Squyres
545d8f2e66 usnic cagent: correctly compute the "large" ping message size
The (effective) "+42" computation was, in fact, the incorrect answer
in this case (gasp!).

We should just take the max_msg_size from the command (which came from
the libfabric endpoint max_msg_size attribute in the client) and
subtract off the max header size: 68 (which is explained in the
comment).  This will result in a "large" message size which is likely
slightly smaller than the MTU, but still right up near the MTU, and
therefore good enough.

Note: the old computation (i.e., -(68-42)) worked fine when we asked
for Libfabric API v1.1 because the usnic provider would return a
max_msg_size that was already less than the MTU due to FI_PREFIX
behavior shenanigans.  Once we started asking for Libfabric API v1.4,
the usnic Libfabric provider started returning (MTU + prefix_size),
and the -(68-42) computation started giving a value that was over the
MTU.  This caused sendto() on the connectivity checker UDP socket
to fail.

This commit also removes an old/misleading comment.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-30 17:01:05 -07:00
Joshua Hursey
f6f24a4f67 build: Custom libmpi(_FOO) name option in configure
* Add a configure time option to rename libmpi(_FOO).*
   - `--with-libmpi-name=STRING`
 * This commit only impacts the installed libraries.
   Internal, temporary libraries have not been renamed to limit the
   scope of the patch to only what is needed.

For example:
```shell
shell$ ./configure --with-libmpi-name=wookie
...
shell$ find . -name "libmpi*"
shell$ find . -name "libwookie*"
./lib/libwookie.so.0.0.0
./lib/libwookie.so.0
./lib/libwookie.so
./lib/libwookie.la
./lib/libwookie_mpifh.so.0.0.0
./lib/libwookie_mpifh.so.0
./lib/libwookie_mpifh.so
./lib/libwookie_mpifh.la
./lib/libwookie_usempi.so.0.0.0
./lib/libwookie_usempi.so.0
./lib/libwookie_usempi.so
./lib/libwookie_usempi.la
shell$
```
2016-09-29 21:47:24 -05:00
Gilles Gouaillardet
871ade9231 pmix/{cray,s1,s2}: make pmi_opcaddy_t class static
theses three pmix components use the same class name,
declare it as static so Open MPI can be built with --disable-dlopen

Thanks Limin Gu for the report
2016-09-28 09:18:36 +09:00
Jeff Squyres
1a5a5fb400 Merge pull request #1861 from bharatpotnuri/master
btl/openib: Disqualify rdmacm CPC if MPI_THREAD_MULTIPLE
2016-09-27 13:03:35 -04:00
Potnuri Bharat Teja
740b636dbe btl/openib: Disqualify rdmacm CPC if MPI_THREAD_MULTIPLE
The rdmacm CPC in the openib BTL is not thread safe. The rdmacm CPC
should disqualify itself (instead of failing in random ways) if
MPI_THREAD_MULTIPLE is the thread level.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
2016-09-27 14:20:59 +05:30
Gilles Gouaillardet
1fbc9a5431 pmix3x: dstore/pmix: flock portability
Using the fcntl-locking instead of the flock

(back-ported from upstream pmix/master@3030a0cca1)
2016-09-27 13:21:03 +09:00
George Bosilca
066370202d Support non-monotonic assembly timers.
If monotonic support has been required by the runtime and the
assembly timers are unable to provide it, fall back to clock_gettime.
2016-09-23 21:51:34 -04:00
George Bosilca
45dcf1f5d7 Always use the best timer available
If we have better timer than clock_gettime use it, even if it an
assembly timer.
2016-09-23 19:32:58 -04:00
George Bosilca
93fa94f96f Re-enable support for local addresses.
This patch is based on the "RFC: Reenabling the TCP BTL over local
interfaces (when specifically requested)". It removes the hardcoded
exception for the local devices that has been enforced by the
TCP BTL. Instead, we exclude the local interface only via the
exclude MCA (both IPv4 and IPv6 local addresses are already in the
default if_exclude), which is also the behavior currently described in
our README file.
2016-09-23 13:04:33 -04:00
Gilles Gouaillardet
362a5886de pmix3x: client: fix PMIx_Finalize() sequence
pmix_progress_thread_finalize() invokes libevent event_base_free,
so all libevent stuff cannot be used after.
Hence, pmix_client_globals.myserver must be PMIX_DESTRUCT'ed
before invoking pmix_progress_thread_finalize()
2016-09-24 00:01:23 +09:00
Gilles Gouaillardet
5479c6cca7 pmix3x: add missing #include
and get Open MPI build on OpenBSD 6.0
2016-09-23 11:23:18 +09:00
Ralph Castain
a14ec3bdbc Mucho thanks to Gilles - his patch to reorder the CPPFLAGS solves the problem of inadvertently picking up hwloc and libevent headers from locations in CPPFLAGS while continuing to build the embedded versions. Also silence a minor warning about an uninitialized var. 2016-09-22 07:39:22 -07:00
Gilles Gouaillardet
fbf03299c3 Merge pull request #2079 from ggouaillardet/topic/pmix_configury_dlopen
pmix3x: configury: correctly handle --disable-dlopen
2016-09-21 10:59:33 +09:00
Gilles Gouaillardet
6c1e25b76e pmix/ext11: fix pmix1_value_unload() prototype and call
pmix1_value_unload() was added a "key" argument which is unused,
and pmix1_value_unload() was sometimes invoked with two arguments instead of three.

since the "key" argument is unused, simply remove it from the
subroutine prototype and calls.
2016-09-20 14:34:41 +09:00
Gilles Gouaillardet
041a431966 pmix3x: configury: correctly handle --disable-dlopen
the LT_* macros do overwrite the enable_dlopen variable,
so it must be tested and saved before invoking LT_INIT.
delay the invokation of the LT_* macros and use the
PMIX_ENABLE_DLOPEN_SUPPORT variable to figure out whether
--disable-dlopen was invoked
2016-09-15 13:26:20 +09:00
Nathan Hjelm
a681837ba8 btl/tcp: fix double list remove
This commit fixes an abort during finalize because pending events were
removed from the list twice.

References #2030

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-13 09:23:12 -06:00
Artem Polyakov
9eba1b0b75 Merge pull request #2042 from artpol84/pmix_sdirs
Several fixes related to session directories:
2016-09-07 14:15:47 +07:00
Gilles Gouaillardet
cd2b5a82ed hwloc: plug memory leak
as reported by Coverity with CID 1270441
2016-09-07 10:08:44 +09:00
Jeff Squyres
527efec4fb Merge pull request #2050 from jsquyres/pr/btl-tcp-help-messages
Add a show_help message to TCP BTL when peer unexpectedly disconnects
2016-09-06 09:40:31 -04:00
Jeff Squyres
1953e3406f btl/tcp: add show_help message when peer hangs up
We commonly see messages on the users list where a peer has hung up
because it has crashed.  Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-06 09:40:03 -04:00
Gilles Gouaillardet
4b208e4463 btl/tcp: make mca_btl_tcp_proc_insert re-entrant
otherwise bad things happen with
 --mca btl_tcp_progress_thread 1 (non default)
and
 --mca mpi_add_procs_cutoff 0 (default)
2016-09-05 15:57:34 +09:00
Artem Polyakov
dc0ab674de Add PMIx key to provide RM with ability to indicate that it will cleanup
session directories provided at through OPAL_PMIX_TMPDIR,
OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR
2016-09-05 07:48:44 +03:00
Jeff Squyres
95c6f6cfc0 btl/tcp: fix help message
It looks like one help message was accidentally pasted in the middle
of another.  Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-02 17:14:22 -04:00
Nathan Hjelm
f93c1f2106 btl/ugni: fix erroneous warning message
This commit prevents the connection code from trying to connect an
endpoint if the directed datagram has been posted but not received.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 09:17:44 -06:00
Ralph Castain
34f04a7924 Remove spurious Makefile.am line 2016-09-01 15:31:09 -07:00
Ralph Castain
0ea1cff733 Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it disable by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional 2016-09-01 13:10:10 -07:00
rhc54
39d086e000 Merge pull request #2035 from rhc54/topic/memprofile
Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint
2016-08-31 14:06:48 -05:00
Ralph Castain
39992d1ad7 Silence trivial Coverity warnings 2016-08-31 09:42:33 -07:00
Ralph Castain
c1050bc01e Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose. 2016-08-31 09:32:07 -07:00
Ralph Castain
cfa784c9a6 Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking 2016-08-29 20:22:24 -07:00
Nathan Hjelm
d33204b0dc Merge pull request #2021 from hjelmn/xlc_fix
opal/patcher: fix xlc support
2016-08-26 18:15:41 -06:00
rhc54
b90a64e734 Merge pull request #2022 from rhc54/topic/nnodes
Provide the number of nodes in the job
2016-08-26 18:15:24 -05:00
Ralph Castain
2f6e0fec90 Provide the number of nodes in the job 2016-08-26 14:50:41 -07:00
Jeff Squyres
09ad7e81eb Merge pull request #2007 from jsquyres/pr/usnic-show-local-udp-ports
usnic: show the local UDP ports
2016-08-26 17:03:16 -04:00
Nathan Hjelm
a9bc692d99 opal/patcher: fix xlc support
The xlc compiler seems to behave in a different way that gcc when it
comes the inline asm. There were two problems with the code with xlc:

 - The TOC read in mca_patcher_base_patch_hook used the syntax
   register unsigned long toc asm("r2") to read $r2 (the TOC
   pointer). With gcc this seems to behave as expected but with xlc
   the result in toc is not the same as $r2. I updated the code to use
   asm volatile ("std 2, %0" : "=m" (toc)) to load the TOC pointer.

 - The OPAL_PATCHER_BEGIN macro is meant to be the first thing in a
   hook. On PPC64 it loads the correct TOC pointer (thanks to
   mca_patcher_base_patch_hook) and saves the old one. The
   OPAL_PATCHER_END macro restores the TOC pointer. Because we *need*
   the TOC to be correct before it is accessed in the hook the
   OPAL_PATCHER_BEGIN macro MUST come first. We did this and all was
   well with gcc. With xlc on the other hand there was a TOC access
   before the assembly inserted by OPAL_PATCHER_BEGIN. To fix this
   quickly I broke each hook into a pair of function with the
   OPAL_PATCHER_* macros on the top level functions. This works around
   the issue but is not a clean way to fix this. In the future we
   should 1) either update overwrite to not need this, or 2) figure
   out why xlc is not inserting the asm before the first TOC read.

This fixes open-mpi/ompi#1854

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-26 14:43:03 -06:00
Jeff Squyres
87a5ccc060 usnic: show the local UDP ports
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-26 12:25:18 -07:00
Jeff Squyres
e03a40a0e9 pmix3x: remove generated file
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-26 10:30:47 -07:00
Jeff Squyres
9ae51a09f2 Merge pull request #1989 from jsquyres/pr/update-usnic-to-libfabric-v1.4
Update usnic BTL to libfabric v1.4
2016-08-26 09:53:07 -04:00
Gilles Gouaillardet
e4bf915e75 pmix3x: remove auto-generated file
remove opal/mca/pmix/pmix3x/pmix/src/include/pmix_config.h.in
.gitignore is correct, so it seems this file was added before .gitignore was updated
2016-08-26 15:00:18 +09:00
Ralph Castain
af67f16422 Update configury to support multiple PMIx versions, rename pmix2x component to pmix3x for support of PMIx master
Update support for external v1.1.x and v2.x libraries. Minor corrections to the v3.x component
2016-08-25 18:19:05 -07:00
Jeff Squyres
f56b16f079 usnic: remove unused variable
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:18 -07:00
Jeff Squyres
9717bcb7e6 btl/usnic: remove stale comment
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:18 -07:00
Jeff Squyres
6f5e377fe0 btl/usnic: update for libfabric v1.4
With libfabric v1.4, the usnic provider changed the values of its
fabric and domain name strings (compared to libfabric <v1.4).  Update
the Open MPI usNIC BTL to handle both pre-v1.4 and v1.4 fabric/domain
names.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:17 -07:00
George Bosilca
3adff9d323 Fixes #1793.
Reshape the tearing down process (connection close) to prevent race
conditions between the main thread and the progress thread.

Minor cleanups.
2016-08-24 22:45:19 -04:00
Nathan Hjelm
83062db7cb btl/ugni: actually make the endpoint lock recursive
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-24 10:36:08 -06:00
Gilles Gouaillardet
02847d9e7b pmix2x: dstore: add missing <fcntl.h> include file in pmix_esh.c
(back-ported from upstream pmix/master@5c66ffe0f0)
2016-08-24 11:18:46 +09:00
Gilles Gouaillardet
c11e8163f8 pmix2x: sec/native: fix the pmix_native module under solaris by using getpeerucred()
and fail with a user friendly message if no method is available:
"sec: native cannot validate_cred on this system"

(back-ported from upstream pmix/master@c474a1fc60)
2016-08-24 11:18:40 +09:00
Gilles Gouaillardet
e91292aa41 pmix2x: configury: add missing check for <netdb.h> header file
(back-ported from upstream pmix/master@e54ce6d423)
2016-08-24 11:18:32 +09:00
Potnuri Bharat Teja
9b7f9ece20 Add Chelsio T6 adapter device parameters.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
2016-08-23 10:38:13 +05:30
Ralph Castain
639dbdb7ea For maintainability, fold the external PMIx 2.x integration into the internal PMIx 2.x library component. This ensures that we always stay in sync with the two as that is becoming a problem. 2016-08-22 13:28:55 -07:00
George Bosilca
fd57f5bccd Remove some of the clang warnings. 2016-08-20 14:21:42 -04:00
Ralph Castain
61ffba668b Roll in the latest PMIx version - includes shared memory datastore and reduced memory footprint 2016-08-20 07:53:06 -07:00
Artem Polyakov
6ea8cccdab Merge pull request #1969 from artpol84/pmix_jobid_fix
Pmix jobid fix
2016-08-18 17:24:58 +07:00
Ralph Castain
7da9793fef Support the PMIX_TIMEOUT key at the PMIx server when timeout=0 - this indicates that the user doesn't want a lookup of any data from the host RM. 2016-08-17 16:26:58 -05:00
Gilles Gouaillardet
3126ff77e2 pmix2x: common syms: whitelist bison-generated common symbols
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-08-16 11:29:06 +09:00
Artem Polyakov
c5a91c5c9d opal/pmix: fix pmix jobid calculation if external PMIx server is used. 2016-08-15 21:13:51 +03:00
Ralph Castain
ecbedee8bb Fix typo 2016-08-15 07:32:00 -07:00
Artem Polyakov
f3c816b52e opal/pmix: fix indentation in some files. 2016-08-15 18:21:50 +07:00
Gilles Gouaillardet
483685eb6a update .gitignore
remove autogenerated opal/mca/pmix/pmix2x/pmix/src/include/pmix_config.h.in
2016-08-15 17:00:20 +09:00
Ralph Castain
be8424b691 Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts

dd
2016-08-13 12:13:04 -07:00
rhc54
ddde154d28 Merge pull request #1962 from rhc54/topic/notify
Ensure we properly convert pmix status to ORTE state before activatin…
2016-08-13 06:59:50 -07:00
Ralph Castain
48d35a9627 Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program 2016-08-12 21:14:29 -07:00
Ralph Castain
4a4c9703a9 Setup the job list in the PMIx integration so that static ports can run 2016-08-12 13:27:10 -07:00
Ralph Castain
1d44f0c0e2 Silence Coverity warnings 2016-08-11 21:22:01 -07:00
Ralph Castain
73544d2e00 Rename symbol 2016-08-11 13:06:46 -07:00
Ralph Castain
b0cc9b0bc8 Update to latest PMIx toolext branch
Fix indentations

Update the ext20 component to match latest PMIx master.

Cleanup name conflicts and uninit vars
2016-08-11 12:29:48 -07:00
rhc54
60f789dca1 Merge pull request #1948 from rhc54/topic/pmixtool
Update to include extended tool support, new datatypes
2016-08-09 16:17:28 -07:00
Nathan Hjelm
19be439998 Merge pull request #1949 from hjelmn/ugni_fix
btl/ugni: fix another connection race
2016-08-09 08:32:40 -06:00
Gilles Gouaillardet
6f6b3ac68a configury: standardize memory/patcher symbol detection and make it more robust
by default, Sun compilers optimize out the original test, and hence fail detecting a symbol is missing.
2016-08-09 09:35:52 +09:00
Nathan Hjelm
adb668209b btl/ugni: fix another connection race
This commit fixes a race that can occur when two threads are in the
ugni progress function at the same time. This race occurs when one
thread calls GNI_PostDataProbeById then goes to sleep then another
thread calls GNI_PostDataProbeById then GNI_EpPostDataWaitById before
the other thread wakes up. If this happens the first thread will print
a warning on GNI_EpPostDataWaitById about no matching post.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-08 15:38:11 -06:00
Ralph Castain
527b5c692a Update to include extended tool support, new datatypes 2016-08-08 13:39:46 -07:00
Todd Kordenbrock
b90da992c8 Merge pull request #1895 from PDeveze/Patchs-on-btl-portals4
btl/portals4: Take into account the limitation of portals4 (max_msg_s…
2016-08-08 15:12:50 -05:00
Nathan Hjelm
5ced037488 Merge pull request #1939 from hjelmn/ugni_fix
btl/ugni: protect against re-entry and races in connections
2016-08-08 08:55:30 -06:00
Artem Polyakov
b24ec3e3b9 pmix/s2: fix indentation (only) 2016-08-06 16:31:19 +06:00
Artem Polyakov
2cb923a413 pmix/s1: fix indentation (only) 2016-08-06 16:30:45 +06:00
Artem Polyakov
8aa3ef7799 pmix/s2: fix s2 component data placement
Use wildcard for the information related to the job-level data.
Fixes s2 component with regard to PR https://github.com/open-mpi/ompi/pull/1897.
2016-08-06 15:49:16 +06:00
Artem Polyakov
81063f1717 pmix/s1: fix s1 component data placement
Use wildcard for the information related to the job-level data.
Fixes s1 component with regard to PR https://github.com/open-mpi/ompi/pull/1897.
2016-08-06 15:45:46 +06:00
Nathan Hjelm
14b36d4503 btl/ugni: protect against re-entry and races in connections
This commit fixes two issues that can occur during a connection:

 - Re-entry to connection progress from modex lookup. Added an
   additional endpoint state that will keep the code from re-entering
   the common endpoint create.

 - Fixed a race between a process posting a directed datagram through
   a send and a connection being progressed through opal_progress().
   The progress code was not obtaining the endpoint lock before
   attempting to update the endpoint. To limit the amount of code
   changed for 2.0.1 this commit makes the endpoint lock recursive. In
   a future update this may be changed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-04 16:08:01 -06:00
Jeff Squyres
c42d8867e6 Merge pull request #1925 from jsquyres/pr/warnings-fixes
hwloc: fix Valgrind warning
2016-08-04 08:48:50 -07:00
Jeff Squyres
36555b7a1d Merge pull request #1933 from thananon/fix_random
Make libevent use internal random
2016-08-04 08:27:56 -07:00
Boris Karasev
9d6a4b3b2d configury/libevent: fix incorrect drop of OPAL_HAVE_WORKING_EVENTOPS
Fixes PR https://github.com/open-mpi/ompi/pull/1687
The code that sets OPAL_HAVE_WORKING_EVENTOPS for internal libevent
was executed even if the external libevent component was configured.

As the result libevent progress wasn't called in opal_progress which
for example caused ring_c to hang when pml/ob1 was used.
2016-08-04 16:37:37 +06:00
Gilles Gouaillardet
30f98cd9d0 pmix: redefine OPAL_PMIX_ARCH macro
Architecture is set by the ompi layer *after* job startup, so the key cannot
have the "pmix" prefix since optimizations in open-mpi/ompi@01a653d50a
otherwise architecture cannot be retrieved
2016-08-04 13:31:28 +09:00
Thananon Patinyasakdikul
b3e9dadff2 libevent: use opal_random() instead of rand(3)
This commits changed rand(3) and family in libevent to use internal
random function provided in opal to prevent pertubing user's random seed.

Fixes open-mpi/ompi#1877
2016-08-03 09:18:12 -07:00
Howard Pritchard
08266a1a56 mpool/hugepage mntent intro fallout
On Cray, PR #1846 introduced a double free
situation which led to all kinds of random memory
corruption problems.

This commit fixes this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-08-02 05:52:31 -05:00
Jeff Squyres
7bea563e02 hwloc: fix Valgrind warning
Cherry picked from open-mpi/hwloc@d4565c351e

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-01 18:50:40 -07:00
Gilles Gouaillardet
21e7f31dbe pmix2x: fix unpack sequence in PMIx_Get callback
first unpack the nspace (PMIX_STRING) before unpacking the various keys (PMIX_KVAL)
2016-08-01 14:21:22 +09:00
Howard Pritchard
477f6cb6a8 Merge pull request #1846 from ggouaillardet/topic/mntent
mpool/hugepage: set mntent API instead of manually parsing /proc/mounts
2016-07-31 20:17:37 -06:00
Ralph Castain
16fccd4964 Establish a way for ORTE to tell PMIx the base tmpdir to use, and update PMIx to understand such directives 2016-07-29 09:52:36 -07:00
Ralph Castain
cacb582ecd Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application 2016-07-28 14:09:06 -07:00
Nathan Hjelm
4658b761e4 rcache/udreg: make reference count thread safe
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-07-27 13:40:35 -06:00
Nathan Hjelm
1eb4ef438e Merge pull request #1903 from hjelmn/openib_fixes
btl/openib: set send flags only after endpoint is connected
2016-07-27 09:01:49 -06:00
Howard Pritchard
1dc7e9ed8f Merge pull request #1904 from hppritcha/topic/fix_cray_srun_native_launch
pmix/cray: switch to using wildcards for some
2016-07-27 07:12:02 -06:00
Howard Pritchard
b65bbe017f pmix/cray: switch to using wildcards for some
items so that at least srun native launch on
cray works again.

More issues to fix when using alps.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-07-26 17:07:58 -05:00
Nathan Hjelm
5e13e1ab7d btl/openib: set send flags only after endpoint is connected
The max inline send size on a queue pair is not available until after
the endpoint is connected. Before this commit the send flags
(including the inline flag) were set before this value was
initialized. This commit moves setting the send_flags down to
mca_btl_openib_put_internal which is only called after the endpoint is
connected. This fixes a bug when using osc/rdma.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-07-26 16:01:11 -06:00
Gilles Gouaillardet
91ccec342c btl/openib: remove some dead code
remove useless call to opal_mem_hooks_support_level() and the value local variable.
2016-07-22 09:26:33 +09:00
Gilles Gouaillardet
1b3be0ac8c configury + btl/openib: fix a typo
test for existence of struct ibv_exp_device_attr.exp_atomic_cap.
That was previously mistyped struct ibv_exp_device_attr.ext_atomic_cap
2016-07-22 09:26:33 +09:00
Ralph Castain
71de03fc67 Cleanup the new naming requirements to ensure that info is correctly retrieved
Cleanup permissions

Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
2b55ee8118 Cleanup Coverity warnings 2016-07-20 20:31:58 -07:00
Ralph Castain
01a653d50a Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
Remove stale file reference

Restore autogen pass thru pmix

Remove generated file
2016-07-20 00:58:19 -07:00
Pascal Deveze
6d6ec66705 btl/portals4: Take into account the limitation of portals4 (max_msg_size) 2016-07-19 15:19:29 +02:00
Nathan Hjelm
03bce91de8 pmix/pmix2x: add missing increment in loop
This commit fixes a bug in the pmix2x client code where a loop
variable is not correctly incremented. This was leading to hangs and
crashes when creating intercommunicators. Also fixed two double
increments in other loops.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-07-18 10:35:05 -06:00
Jeff Squyres
72f41d4490 pmix: replace all tabs with spaces
No code or logic changes

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-07-17 15:08:33 -04:00
Jeff Squyres
1c32742c66 pmix_ext20: fix syntax error
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-07-17 15:04:12 -04:00
Ralph Castain
99f7096031 Fix permissions 2016-07-16 21:03:55 -07:00
Ralph Castain
d4071fbd1c Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling 2016-07-16 13:20:41 -07:00
Ralph Castain
1ceb35ba5c Fix singletons - do not include the PMIx tool URI in the environment provided to child processes 2016-07-13 17:33:34 -07:00
Ralph Castain
20a91c2baf Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program.
Cleanup debug message
2016-07-13 15:28:33 -07:00
Artem Polyakov
72585a905f opal/pmix: add blocking Fence to SLURM components.
Blocking fence is used in yalla del proc. Native pmix exposes this functionality.
We need to expose it for SLURM's s1/s2 components as well.

Also this commit fixes uninitialized `rc` in fencenb's of both
components.
2016-07-11 09:43:15 +03:00
Artem Polyakov
8e16f47492 Merge pull request #1688 from artpol84/fix_base64
Fix base64 implementation in pmix framework.
2016-07-07 10:47:50 +06:00
Gilles Gouaillardet
1ba7e2b20b mpool/hugepage: set mntent API instead of manually parsing /proc/mounts
Refs open-mpi/ompi#1822
2016-07-06 15:00:19 +09:00
Gilles Gouaillardet
acda07472a configury: revamp and re-ident sub configure.m4 after open-mpi/ompi@846360fd4c 2016-07-06 11:59:51 +09:00
Gilles Gouaillardet
846360fd4c configury: correctly perform make distclean when {libevent,hwloc,pmix} are external components
Thanks Jeff for the guidance

Fixes open-mpi/ompi#1683

note:
in order to keep this commit easy to review, some AS_IF([...]) were replaced with
AS_IF([false], ...) or AS_IF_([true], ...)
these will be removed and re-idented in a subsequent commit
2016-07-06 11:57:24 +09:00
Ralph Castain
ee56d9dc1a Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field 2016-07-05 14:59:50 -07:00
Ralph Castain
7e0af3f4f0 Update pmix2x to track upstream changes 2016-07-05 11:54:22 -07:00
Gilles Gouaillardet
267821f0dd pmix2x/pmix: fix a typo in PMIx_tool_init()
and remove now useless local variable i
2016-07-05 13:47:50 +09:00
Gilles Gouaillardet
efce8cc734 pmix2x/pmix: add missing include files
pmix cannot be built on alpine linux because of some missing includes.
uid_t and gid_t are defined in unistd.h or sys/types.h, and unistd.h
is not indirectly pulled under alpine linux, so do it manually.

Thanks N.L.K Nguyen for the report

(back-ported from upstream pmix/master@c8d55350a9)
2016-07-05 09:03:14 +09:00
Ralph Castain
c9ada8e095 Silence Coverity warnings 2016-07-03 20:45:08 -07:00
Ralph Castain
673f82e2b6 Update the PMIx listener to avoid leaking sockets into children, and better handle race condition errors 2016-07-03 08:23:33 -07:00
Nathan Hjelm
01d6da31af btl/openib: fix rdmacm locking bug
This commit fixes a long standing bug in rdmacm. It is required that
the thread that calls mca_btl_openib_endpoint_cpc_complete holds the
endpoint lock. This was not the case for rdmacm. This causes debug
builds to abort. This change also required changing
mca_btl_openib_endpoint_send_cts to require the endpoint lock to be
held when calling.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-30 15:50:07 -06:00
Nathan Hjelm
cc2b3e0c3f Merge pull request #1830 from hjelmn/rdmacm_test
Test for rdmacm hang fix
2016-06-30 10:41:46 -06:00
Nathan Hjelm
960fcd292c btl/openib: fix rdma hang
This commit is an attempt to fix a hang in finalize of rdmacm. This fixes
a path where no rdmacm client is found for an endpoint.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-29 20:31:26 -06:00
Ralph Castain
6e434d6785 Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information
Update to match PMIx RFC

Fix configury to point to correct libevent and hwloc locations
2016-06-29 19:19:19 -07:00
Jeff Squyres
f18d6606da Merge pull request #1824 from hjelmn/rdmacm_fix
btl/openib: fix segmentation fault
2016-06-28 18:10:35 -04:00
Nathan Hjelm
8128c8eb29 btl/openib: fix segmentation fault
This commit fixes a segmentation fault that occurs if a device can be
initialized but not used. In this case the devices_count is not equal
to the number of usable devices in the devices pointer array.

Thanks to @artpol84 for tracking this down.

Fixes open-mpi/ompi#1823

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-28 10:31:32 -06:00
Nathan Hjelm
dac9201f3b Merge pull request #1770 from hjelmn/rdma_wth
btl/openib: fix rdmacm
2016-06-24 22:46:53 -06:00
Nathan Hjelm
2992d6d238 Merge pull request #1808 from abjoshi-brcm/timer_arm64
arm64: add timer support
2016-06-23 07:10:56 -06:00
Abhishek Joshi
f06f7eb3e6 arm64: add timer support
Signed-off-by: Sreenidhi Bharathkar Ramesh <sreenidhi-bharathkar.ramesh@broadcom.com>
2016-06-23 11:01:00 +00:00
Ralph Castain
08b1438f15 Add missing PMIx range value so OPAL and PMIx align again 2016-06-22 22:03:25 -07:00
Gilles Gouaillardet
bf133c401e pmix2x: fix a typo in dereg_event_hdlr()
This bug has been fixed when open-mpi/ompi@dde69e1be2 was backported into upstream pmix in pmix/master@5e5577778c
but it was not fixed in open-mpi/ompi
2016-06-22 13:45:29 +09:00
Jeff Squyres
af614afedf Merge pull request #1800 from thananon/common_sym_fix
Fixed common symbol error in btl/usnic.
2016-06-21 20:11:52 -04:00
Ralph Castain
441739b5a4 Cleanup a lagging message that generates an annoying (but seemingly harmless) warning 2016-06-20 12:23:27 -07:00
Thananon Patinyasakdikul
afe07cd5d5 Fixed common symbol in btl/usnic
- This commit fixes the accidental common symbol btl_usnic_lock
- It also moves the btl_usnic_lock declaration to btl_usnic.h
2016-06-20 10:05:44 -07:00
Howard Pritchard
1bed9fdb59 Merge pull request #1799 from hppritcha/topic/help_aries_with_knl
common/ugni: help out knl with aries
2016-06-20 08:09:24 -06:00
Ralph Castain
0ba02821e6 Add requested key and job-level info 2016-06-19 18:22:31 -07:00
Ralph Castain
0a29f5cb77 Sigh - missed two typos 2016-06-18 20:57:53 -07:00
Ralph Castain
dd38cf1fed Fix typo 2016-06-18 20:56:43 -07:00
Howard Pritchard
8b53487977 common/ugni: help out knl with aries
The way the gni btl is currently coded,
it will run completely out of gas on KNL at
123 processes/node.  Since there are bound to be
those who try to run a MPI process/hyperthread
on KNL nodes, the fma sharing mode needs to be requested.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-06-18 15:09:05 -05:00
Ralph Castain
dde69e1be2 Cleanup CIDs 1362763, 1362762, 1362760, 1362759, 1362758, 1362757, 1362756, 1362755, 1362754. Unsure how to resolve 1362761.
Fixes #1792
2016-06-18 12:28:46 -07:00
Jeff Squyres
7a8d7fb948 openib: fix compiler warnings
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-18 07:15:11 -07:00
Jeff Squyres
c332ee5884 Merge pull request #1784 from thananon/fix_usnic_thread
Fix btl/usnic deadlock when the connectivity check is turned off.
2016-06-17 11:15:14 -04:00
Nathan Hjelm
f59c2fce6b Merge pull request #1786 from hjelmn/32_fix
opal/progress: use 32-bit atomics for call counter
2016-06-17 08:54:41 -06:00
Ralph Castain
044c561cba Roll to latest PMIx master 2016-06-16 17:30:30 -07:00
rhc54
702a982271 Merge pull request #1767 from rhc54/topic/pmix2
Enable the PMIx event notification capability
2016-06-16 15:27:43 -07:00
Nathan Hjelm
7349ddc937 patcher/overwrite: use OPAL_ASSEMBLY_ARCH to determine architecture
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-16 10:00:00 -06:00
Thananon Patinyasakdikul
7bd18214a7 Fix btl/usnic deadlock when the connectivity check is turned off. 2016-06-15 07:42:55 -07:00
Jeff Squyres
b7e937fea5 Merge pull request #1778 from thananon/usnic_thread_safe
Added MPI_THREAD_MULTIPLE support for btl/usnic.
2016-06-14 18:43:04 -04:00
Ralph Castain
5d330d5220 Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.
Add PMIx 2.0

Remove PMIx 1.1.4

Cleanup copying of component

Add missing file

Touchup a typo in the Makefile.am

Update the pmix ext114 component

Minor cleanups and resync to master

Update to latest PMIx 2.x

Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00
Thananon Patinyasakdikul
ee85204c12 Added MPI_THREAD_MULTIPLE support for btl/usnic. 2016-06-13 13:47:06 -07:00
Ralph Castain
d58da99dbc Shift to memcpy to avoid Solaris issues 2016-06-09 12:07:17 -07:00
Ralph Castain
8fa935534b Abstract the strnlen function for environments that do not have it (e.g., Solaris 10) 2016-06-08 10:12:43 -07:00
Nathan Hjelm
17ae1aceeb btl/openib: fix rdmacm
The rdma_disconnect function specifies that both the server and client
should call rdma_disconnect. The code was not calling rdma_disconnect
on an endpoint if the event came before the endpoint finalization.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-07 17:53:58 -06:00
Nathan Hjelm
dd519c55b1 btl/openib: fix cq resize calculation
Before dynamic add_procs the openib_btl_size_queues was called exactly
once for non-dynamic jobs. Now the function is called on each new
connection so the calculation was wrong. Re-wrote the function to
correctly calculate the CQ size and only attempt to adjust the CQ if
the requested size has changed. This fixes a bug when using the openib
btl on psm2 hardware that is caused by the time needed to resize a
CQ. The overhead was causing udcm to timeout and fail.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-07 16:05:56 -06:00
Gilles Gouaillardet
b707d138fe pmix114/pmix1_client: fix misc memory leaks
Fixes CID 1325146-1325149
2016-06-06 09:33:35 +09:00
Nathan Hjelm
6169d03ea3 btl: adjust values of new atomic flags
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-02 19:21:34 -06:00
Nathan Hjelm
9f43b23725 Merge pull request #1710 from hjelmn/ugni_atomics
Additional ugni atomics
2016-06-02 18:25:49 -06:00
Ralph Castain
ecea1e3bb5 Update to 1.1.4rc3 2016-06-01 20:56:07 -07:00
rhc54
b85a5e62ab Merge pull request #1739 from rhc54/topic/pmix
Split the pmix external component into one for the 1.1.4 release, and…
2016-06-01 16:24:44 -07:00
Ralph Castain
12ecf972af Split the pmix external component into one for the 1.1.4 release, and another for the upcoming 2.0 release. Clean up the configury so the components look for a series-specific function instead of running a program.
NOTE: the changes for the 2.0 series are not yet in the PMIx master.
2016-06-01 14:15:24 -07:00
Nathan Hjelm
ceb2912838 Merge pull request #1736 from hjelmn/ugni_fixes
ugni BTL fixes
2016-06-01 14:59:55 -06:00
Jeff Squyres
d175fd692d README.ompi: track patches added to hwloc
Track post-v1.11.3-release patches applied to the hwloc copy embedded
in Open MPI.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-01 07:17:05 -07:00
Jeff Squyres
3867bd3640 hwloc.m4: only check for valgrind in non-embedded mode
This fixes https://github.com/open-mpi/ompi/issues/1732: i.e., the
case where the outer project has its own check for
<valgrind/valgrind.h>, but also supplements CPPFLAGS (to find
Valgrind's header files) before doing that check.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Ideally, we would tell OMPI to disable autoconf's caching of our
valgrind check result so that its check gets the right result after
adding CPPFLAGS. Not sure if we can do that.

For now, just disable our Valgrind code in embedded mode.
This will keep the x86 backend enabled under Valgrind but
it will auto-disable itself when finding identical APIC ids anyway
(because CPUID returns same outputs for all PUs).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>

Fixes open-mpi/ompi#1732

(cherry picked from commit open-mpi/hwloc@8b44fb1c81)
2016-06-01 06:58:53 -07:00
Gilles Gouaillardet
57978a75d0 Merge pull request #1717 from ggouaillardet/topic/lex_cleanup
configury: clean the flex generated .c files
2016-06-01 13:06:21 +09:00
Nathan Hjelm
5d4bcce042 Merge pull request #1700 from shamisp/topic/cma_config
CMA: Fixing logic for CMA system call detection
2016-05-31 20:33:48 -06:00
Nathan Hjelm
340152a635 Merge pull request #1720 from shamisp/topic/vader/max_addr
VADER: Adjusting VADER_MAX_ADDRESS for non x86 platforms.
2016-05-31 20:33:28 -06:00
Gilles Gouaillardet
5f565dfec3 configury: clean the flex generated .c files 2016-06-01 11:13:31 +09:00
Nathan Hjelm
bf10d79914 btl/ugni: remove erroneous unlock
The endpoint lock was being released twice in mca_btl_ugni_get_ep.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-31 16:52:53 -06:00
Nathan Hjelm
cc96097873 btl/ugni: fix bug when attempting unaligned get on aries
This commit fixes a programming error when using an aries nic. The
documentation of ugni shows that only the local alignment restriction
for get was lifted on aries. There is still a remote address alignment
restriction.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-31 16:52:09 -06:00
Jeff Squyres
5cfee95ea4 hwloc1113: add missing file to Makefile.am
Lack of this file causes a failure when you run autogen.pl on a
distribution tarball.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-31 09:57:50 -07:00
George Bosilca
d2abff583e Fix race condition during BTL TCP tear-down.
bot🏷️bug
bot:assign:@hjelmn
2016-05-30 10:47:14 -05:00
Jeff Squyres
e126d2cd18 Merge pull request #1584 from bgoglin/master
Update hwloc to v1.11.3
2016-05-28 11:01:54 -04:00
Ralph Castain
55923eacd3 Stealing some pieces of Josh Hursey's PR #1583 and modifying a bit, allow the opal/pmix external component to handle both PMIx 1.1.4 and PMIx 2.0 versions. Automatically detect the version of the target external library and adjust the only two APIs that changed (PMIx_Init and PMIx_Finalize)
Rename temp vars in .m4 to avoid conflict with Travis
2016-05-27 08:06:31 -07:00
Nathan Hjelm
28dfa36a3f btl/ugni: fix bug when attempting unaligned get on aries
This commit fixes a programming error when using an aries nic. The
documentation of ugni shows that only the local alignment restriction
for get was lifted on aries. There is still a remote address alignment
restriction.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
Nathan Hjelm
c19426ac1b btl/ugni: add support for additional atomic operations
This commit adds support for Cray Aries atomic operations. This
includes 32-bit and floating point support.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
Nathan Hjelm
23fe19a956 btl: add support for more atomics
This commit add support for more atomic operations and type. The
operations added are logical and, logical or, logical xor, swap, min,
and max. New types are 32-bit int by using the
MCA_BTL_ATOMIC_FLAG_32BIT flag, 64-bit float by using the
MCA_BTL_ATOMIC_FLAG_FLOAT flag, and 32-bit float by using both
flags. Floating point numbers are supported by packing the number in
as an int64_t or int32_t. We will update the btl interface in the
future to make this less confusing.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
Nathan Hjelm
d25b846c01 Merge pull request #1704 from hpcraink/pr/configure_framework
Fix configure for FreePGI on OSX
2016-05-26 17:01:08 -06:00
Nathan Hjelm
8c9292d5d1 Merge pull request #1721 from hjelmn/xrc_fix
btl/openib: fix XRC WQE calculation
2016-05-26 17:00:31 -06:00
Nathan Hjelm
56bdcd0888 btl/openib: fix XRC WQE calculation
Before dynamic add_procs support was committed to master we called
add_procs with every proc in the job. The XRC code in the openib btl
was taking advantage of this and setting the number of work queue
entries (WQE) based on all the procs on a remote node. Since that is
no longer the case we can not simply increment the sd_wqe field on the
queue pair. To fix the issue a new field has been added to the xrc
queue pair structure to keep track of how many wqes there are total on
the queue pair. If a new endpoint is added that increases the number
of wqes and the xrc queue pair is already connected the code will
attempt to modify the number of wqes on the queue pair. A failure is
ignored because all that will happen is the number of active send work
requests on an XRC queue pair will be more limited.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-26 15:58:31 -06:00
Aurelien Bouteiller
49bd28d0ac Merge pull request #1714 from hjelmn/scif_exclusivity
btl/scif: reduce default exclusivity
2016-05-26 17:53:11 -04:00
Pavel Shamis (Pasha)
60fd25f3fb VADER: Adjusting VADER_MAX_ADDRESS for non x86 platforms.
The original VADER_MAX_ADDRESS was tunned for x86_64 platforms only.
For non x86_64 platforms we can use XPMEM_MAXADDR_SIZE.

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>
2016-05-26 16:38:04 -05:00
Nathan Hjelm
99627319f0 btl/ugni: reduce overhead of progress function
This commit reduces the overhead of calling the ugni progress
function. It does the following:

 - Check for new connections once every eight calls.

 - Do not call remote smsg progress unless we are connected to at
   least one remote peer.

 - Do not call rdma progress unless at least one rdma fragment is
   outstanding.

 - Check endpoint wait list size before obtaining a lock.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-25 14:27:34 -06:00
Nathan Hjelm
5caf12cd9b btl/scif: reduce default exclusivity
This commit reduces the default exclusivity so that btl/scif is not
used for send/recv over other shared memory transports.

Fixes open-mpi/ompi#1712

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-25 14:25:07 -06:00
Rainer Keller
3727cba9bb Fix compilation for FreePGI on OSX
Our checks and the ones of libevent are somewhat flawed.
If adding multiple "-framework" to CXXFLAGS or CFLAGS, we strip
the keyword from the command-line, not good.
libevent however assumes plain gcc without testing properly
that the compiler supports -Wno-deprecated-declarations.
2016-05-25 09:12:39 +02:00
Nathan Hjelm
af52dad8f8 rcache/grdma: fix typo in cuda code
Fixes open-mpi/ompi#1702

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-24 15:56:39 -06:00
Pavel Shamis (Pasha)
d984b4b3f9 CMA: Fixing logic for CMA system call detection
The OPAL_CMA_NEED_SYSCALL_DEFS is always defined/set to 0 or 1.  Therefore
instead of checking if the macro is defined, we have to look at the value
itself.

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>
2016-05-24 14:53:25 -05:00
Nathan Hjelm
37e9e2c660 mca/base: fix typo in flag enumeration
This commit fixes a typo in flag enumeration that can cause the parser
to miss valid flags or crash.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-23 12:21:34 -06:00
Artem Polyakov
725eea2819 Fix base64 implementation in pmix framework.
In the commit 80f07b65f1 setting of '-' marker used
as the string termination sign was moved from base64 code:
from: 80f07b65f1 (diff-1b10896c267d2591dc2c08fd0542ab67L491)
to: 80f07b65f1 (diff-1b10896c267d2591dc2c08fd0542ab67R189)

However the decoding function wasn't fixed and still expects on extra
byte at the end of the encoded string which leads to data truncation
during extraction (was noticed on standalone code that was using base64
from OMPI).
2016-05-23 23:30:31 +06:00
Gilles Gouaillardet
d5a2ac6f2f btl/openib: fix #if vs #ifdef 2016-05-23 14:27:33 +09:00
Gilles Gouaillardet
5a8cbe5a8f btl/openib: remove obsolete reference to MEMORY_LINUX_MALLOC_ALIGN_ENABLED macro 2016-05-23 14:12:21 +09:00
Gilles Gouaillardet
8466a3daf3 pmix: update .gitignore
git ignore opal/mca/pmix/pmix114/pmix/include/pmix/autogen/config.h.in
git rm opal/mca/pmix/pmix114/pmix/include/pmix/autogen/config.h.in
git ignore opal/mca/pmix/pmix*/...
2016-05-23 11:58:07 +09:00
Ralph Castain
4e0749f03d Remove verbose error messages 2016-05-20 10:04:26 -07:00
Ralph Castain
42ecffb6d0 Move the registration of MCA params out of the init of the var system - put them in with the rest of the OPAL MCA param registrations
Take another shot at untangling the spaghetti

orterun: fix for command line parsing

orte-submit calls opal_init_util () before parsing out MCA command line
options (-mca, -am, etc). This prevents mpirun from setting opal MCA
variables for some frameworks as well as the MCA base. This is because
when a framework is opened all of its variables are set to read-only.
Eventually we want to lift this restriction on some MCA variables but
since -mca is affected we must parse out the MCA command line options
before opal_init_util(). This commit fixes the bug by adding a new
option to opal_cmd_line_parse (ignore unknown option) so orte-submit
can pre-parse the command line for MCA options.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>

Minor cleanups to avoid releasing/recreating the cmd line
2016-05-20 09:59:50 -07:00
Brice Goglin
ca621330a6 Update hwloc to v1.11.3
Remove contrib/windows/
Merge hwlocXYZ/hwloc/README-ompi.txt back into hwlocXYZ/README-ompi.txt instead of having both.
Add README.txt in new automake-required directory contrib/systemd/

Keep the following patches applied since they are not in 1.11.3
    linux: actually enable libudev based on the result of AC_CHECK_LIB
    (cherry picked from open-mpi/hwloc@9549fd59af)
    configure: check the actual may_alias syntax that we use
    (cherry picked from open-mpi/hwloc@0ab7af5e90)
2016-05-20 07:20:16 +02:00
Gilles Gouaillardet
cbbdce05b1 pmix/pmix114: silence a warning 2016-05-20 09:35:26 +09:00
Gilles Gouaillardet
ed3fd1775f rcache/grdma: silence a warning 2016-05-20 09:30:29 +09:00
Ralph Castain
6f743f81b6 Update PMIx 114 to current release candidate 2016-05-19 12:55:05 -07:00
Jeff Squyres
66f53ec29a Merge pull request #1628 from kmroz/wip-btl-tcp-ethtool-speed
btl/tcp: autodetect bandwidth and latency if unset by the user
2016-05-18 12:12:55 -04:00
Nathan Hjelm
9371a6a52d Merge pull request #1673 from hjelmn/fix_rcache_deadlock
rcache: fix deadlock in multi-threaded environments
2016-05-18 08:32:21 -07:00
Karol Mroz
ca6ddf3270 btl/tcp: autodetect bandwidth and latency if unset
Fixes open-mpi/ompi#120

Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-05-18 16:25:52 +02:00
Karol Mroz
b9c6c43c6b btl/tcp: add default defines for bandwidth and latency
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-05-18 16:25:52 +02:00
Nathan Hjelm
ab8ed177f5 rcache: fix deadlock in multi-threaded environments
This commit fixes several bugs in the registration cache code:

 - Fix a programming error in the grdma invalidation function that can
   cause an infinite loop if more than 100 registrations are
   associated with a munmapped region. This happens because the
   mca_rcache_base_vma_find_all function returns the same 100
   registrations on each call. This has been fixed by adding an
   iterate function to the vma tree interface.

 - Always obtain the vma lock when needed. This is required because
   there may be other threads in the system even if
   opal_using_threads() is false. Additionally, since it is safe to do
   so (the vma lock is recursive) the vma interface has been made
   thread safe.

 - Avoid calling free() while holding a lock. This avoids race
   conditions with locks held outside the Open MPI code.

Fixes open-mpi/ompi#1654.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-17 09:02:40 -06:00
rhc54
8b534e9897 Merge pull request #1668 from rhc54/topic/slurm
When direct launching applications, we must allow the MPI layer to pr…
2016-05-16 12:23:19 -07:00
Howard Pritchard
1a676e5b35 pmix/cray: fix some breakage
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-05-16 12:45:05 -05:00
Gilles Gouaillardet
4e21933a74 memory/patcher: declare __curbrk as extern in order not to generate an (unitialized) common symbol 2016-05-16 09:30:11 +09:00
Ralph Castain
01ba861f2a When direct launching applications, we must allow the MPI layer to progress during RTE-level barriers. Neither SLURM nor Cray provide non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread sping in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize cpu utilization.
Update external as well

Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
2016-05-14 16:37:00 -07:00
Gilles Gouaillardet
456b73da69 btl/openib: fix error path in init_one_device()
do not explicitly release ib verbs components since they will
be released in the object destructor

Thanks Durga for the report
2016-05-13 09:03:48 +09:00
Jeff Squyres
30f913f217 Merge pull request #1652 from jsquyres/pr/remove-aix-timer
timer/aix: remove stale code
2016-05-10 15:47:02 -04:00
Jeff Squyres
eccf0ff4cd hwloc/external: set WRAPPER_EXTRA_* vars in proper location
WRAPPER_EXTRA flags are checked *before* the POST_CONFIG macro is
invoked.  So set them in the main CONFIG macro.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-10 07:34:56 -07:00
Josh Hursey
44d95cb610 Merge pull request #1657 from bgoglin/hwloc-for-2.0
configure: check the actual may_alias syntax that we use
2016-05-09 13:37:08 -05:00
Ralph Castain
7767882346 Per user request, add some missing data and definitions:
OPAL_PMIX_UNIV_RANK - synonym for OPAL_PMIX_GLOBAL_RANK
OPAL_PMIX_APP_SIZE - #ranks in the application of this proc
2016-05-09 08:39:01 -07:00
Brice Goglin
6839d928c2 configure: check the actual may_alias syntax that we use
xlc 13.1.0 crashes because of our may_alias attributes in nolibxml.c
on Power7. libxml.c and nolibxml.c are the only may_alias users for now,
so change our configure check to match the actual code using it.

Thanks to Paul Hargrove for reporting and debugging the issue,
and providing the patch.

https://www.open-mpi.org/community/lists/devel/2016/05/18918.php

(cherry picked from open-mpi/hwloc@0ab7af5e90)
2016-05-08 22:22:30 +02:00
Ralph Castain
7594b95e4b Ensure the hwloc external header is include when --with-devel-headers is given 2016-05-08 10:18:14 -07:00
Jeff Squyres
acbd2c608d memory/patcher: check for <sys/syscall.h>
Thanks to Paul Hargrove for reporting.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-07 09:48:14 -07:00
Jeff Squyres
b4982d7725 timer/aix: remove stale code
Per discussion on the mailing list and with IBM, remove the AIX timer
code (since AIX is no longer supported).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-07 09:31:34 -07:00
Ralph Castain
7e5ef6a240 Fix the env_list support - the MCA param was being set way too early, so provide a "backdoor" way of providing the value 2016-05-06 15:38:39 -07:00
Ralph Castain
58dd41facf Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>.
Remove debug, let orterun send terminate cmd to DVM

Recover the DVM support
2016-05-06 13:14:03 -07:00
Josh Hursey
35ae7e33d7 Merge pull request #1639 from jjhursey/topic/dl-open-null-fname
dl/dlopen/libltdl: Allow opal_dl_open to take a NULL filename.
2016-05-05 22:15:46 -05:00
Ralph Castain
8ec1891d11 Silence warning 2016-05-05 20:04:10 -07:00
Ralph Castain
08022d7af1 Some minor cleanups of warnings from gcc 6.0.0. Update s1/s2 pmix to get max_procs as required. 2016-05-05 15:28:13 -07:00
Joshua Hursey
677178f206 dl/dlopen/libltdl: Allow opal_dl_open to take a NULL filename. 2016-05-05 17:07:26 -04:00
Nathan Hjelm
ff2a54bd37 patcher/linux: code cleanup
Update based on cleanup made to the upstream version on OpenUCX.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:53:45 -06:00
Nathan Hjelm
6c9a0e1c55 patcher/overwrite: disable ia64 support for now
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:53:24 -06:00
Nathan Hjelm
6ad68da407 patcher/linux: disable the linux patcher component
This commit disables the linux patcher component due to a limitation
in loader patching. While this component is effective in patching
calls made within Open MPI and by the application it fails to hook
calls made within glibc. This means the munmap call made by free is
not correctly hooked. Until this problem can be resolved this
component will remain disabled. If it can't be resolved this component
should probably be removed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:50:51 -06:00
Nathan Hjelm
71be36d380 patcher: fix ppc32 support
The table of contents (TOC) code only appears to only apply to
ppc64. The code was incorrectly assuming the existence of the TOC on
ppc32. This commit updates the necessary code to only apply to ppc64.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:50:32 -06:00
Nathan Hjelm
41f00b7465 memory/patcher: initialize patcher framework when needed
This commit moves the patcher framework initialization to the
memory/patcher component.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:46:42 -06:00
Nathan Hjelm
0f54a95408 Merge pull request #1626 from hjelmn/vader_32
btl/vader: fix compilation on 32-bit systems
2016-05-03 16:39:46 -06:00
Nathan Hjelm
4a740e9f27 Merge pull request #1619 from hjelmn/ext_verbs_fix
btl/openib: fix check for exp verbs struct members
2016-05-03 14:16:17 -06:00
Nathan Hjelm
e7ccbdee27 btl/vader: fix compilation on 32-bit systems
This commit fixes a compile/link issue caused by vader. The vader btl
was using OPAL_THREAD_ADD64 to increment a counter which may not be
available on 32-bit systems. Changed to use OPAL_THREAD_ADD_SIZE_T
which will be 64-bit or 32-bit depending on the system.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-03 10:14:44 -06:00
Nathan Hjelm
2d0e2b6233 patcher: do not clobber ebx
ebx can not be clobbered when using -fPIC so save and restore the
register instead of allowing it to be clobbered.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-05-03 08:24:33 -06:00
Brice Goglin
a2a721f961 linux: actually enable libudev based on the result of AC_CHECK_LIB
instead of doing AC_CHECK_HEADERS+AC_CHECK_LIB and only using the result of the former.

Thanks to Paul Hargrove for reporting the issue (OMPI build with -m32).

(cherry picked from open-mpi/hwloc@9549fd59af)
2016-05-03 10:00:40 +02:00
Nathan Hjelm
da695a6ce6 Merge pull request #1618 from hjelmn/new_hooks_update
More hook updates
2016-05-02 18:12:50 -06:00
Nathan Hjelm
a65af6d079 btl/openib: fix check for exp verbs struct members
This commit fixes a compilation issue with some versions of exp
verbs. In some cases struct ibv_exp_device_attr does not have either
the exp_atom or exp_atomic_cap fields. It is fine to drop one check
and fall back to the non-exp attribute check on the other.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 17:13:33 -06:00
Nathan Hjelm
1ff79656dd patcher: remove debug fprintf
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 17:11:00 -06:00
Nathan Hjelm
581e47c271 patcher: check for clflush
Add a feature check for clflush before trying to use the clflush
instruction. As far as I can tell there is no equivalent before the
SSE2 instruction set.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 17:10:42 -06:00
Nathan Hjelm
67fd6fa6eb Merge pull request #1615 from hjelmn/new_hooks_update
memory/patcher: add #if check for MREMAP_FIXED
2016-05-02 16:26:58 -06:00
Nathan Hjelm
eb14b34f04 memory/patcher: fix compilation on BSDs
The function signature of mremap on BSD (NetBSD, FreeBSD) differs from
the linux version. Added support for the BSD style of mremap.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 14:54:08 -06:00
Nathan Hjelm
52edb43bdc memory/patcher: check for linux/mman.h
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 14:29:46 -06:00
Nathan Hjelm
f8b3be6236 patcher/overwrite: fix ia64 compilation
Fixed a couple of typos in ia64 code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 14:10:34 -06:00
Nathan Hjelm
14c34ae9f0 memory/patcher: add #if check for MREMAP_FIXED
This commit fixes a compile error when the system has mremap but not
MREMAP_FIXED. In this case we do not care about the value of
new_address as the argument does not exist.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 13:58:51 -06:00
rhc54
648043597a Merge pull request #1612 from ggouaillardet/poc/pmix_external_configury
pmix/external: revamp external pmix package detection
2016-05-02 09:46:05 -07:00
Jeff Squyres
265e5b9795 Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1
ompi/opal/orte/oshmem/test: max hostname length cleanup
2016-05-02 09:44:18 -04:00
Gilles Gouaillardet
45f9a47d77 pmix/external: fix typo and silence a warning 2016-05-02 17:15:52 +09:00
Gilles Gouaillardet
08d91b9a03 pmix/external: revamp external pmix package detection 2016-05-02 16:23:31 +09:00
Ralph Castain
6ac7929bd0 Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
George Bosilca
3445577f4c Avoid race conditions during BTP TCP handshake.
In some rare cases when a process receives the connect ack while
locally updating the peer endpoint structure, we could drop the
incomming connect ack due to the fact that the send handler is
protected with a try lock (on the endpoint) and our initial send
event was not persistent. Making the send event persistent solves
all issues.
2016-05-01 14:19:29 -04:00
George Bosilca
702f80ad7e Remove "signed vs. unsigned" warnings. 2016-05-01 11:45:48 -04:00
Ralph Castain
42d9d861fc Fix minor typo in PMIx packing of pmix_app_t - thanks to Gilles for pointing it out 2016-04-29 08:55:46 -07:00
Howard Pritchard
f52dd511d4 Merge pull request #1600 from hppritcha/topic/pmix_fix_for_finalize
pmix/cray: set fence_nb to NULL
2016-04-28 13:50:15 -06:00
hppritcha
aa1d7b9c50 pmix/cray: set fence_nb to NULL
Rather than have a stub function for the pmix fence_nb
operation, just set to NULL.  Causes fewer problems.

Fixes #1597
Fixes #1527

Signed-off-by: hppritcha <howardp@lanl.gov>
2016-04-28 13:48:54 -05:00
Nysal Jan K.A
18cf65dc24 Remove a stray print statement 2016-04-28 18:00:52 +05:30
Nathan Hjelm
03f4a854cb btl/tcp: fix add_procs race condition
This commit fixes a race between a thread calling the tcp btl's
add_procs and a thread processing an incomming connection. The race
occured because the add_procs thread adds a newly created proc object
to the hash table *before* the object is fully initialized. The
connection thread then attempts to use the object before the endpoints
array on the object has beeen allocation. The fix is to only add the
proc to the hash table after it has been completely initialized.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-27 10:24:39 -06:00
Nathan Hjelm
8f93b15e90 Merge pull request #1580 from hjelmn/new_hooks_update
memory/patcher: cast away const in shmdt hook
2016-04-26 17:48:01 -06:00
Nathan Hjelm
df194087c7 Merge pull request #1591 from hjelmn/rcache_update
rcache: fix leave_pinned failure path
2016-04-26 16:50:06 -06:00
Nathan Hjelm
25a97af695 rcache: fix leave_pinned failure path
This commit fixes an error in the failure path of leave_pinned. When
the rcache tries to enable leave_pinned but leave_pinned was not
specifically requested (opal_leave_pinned == -1) the code was
erroneously printing an error and returning NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-26 14:39:23 -06:00
Ralph Castain
02876564d4 Silence warning of zero-byte malloc 2016-04-26 11:55:59 -07:00
Nathan Hjelm
5612998d21 memory/patcher: cast away const in shmdt hook
The opal_mem_hooks_release_hook does not have const on the pointer
(though it probably should). This commit eliminates a warning by
casting away the const until opal_mem_hooks_release_hook is updated to
use const.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-25 15:32:11 -06:00
Jeff Squyres
78b367eb0d memory patcher: add some clarifying comments
This is complicated stuff: add some comments so that future
maintainers have some rationale to understand the way things have been
done.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-04-25 13:12:02 -07:00
Geoffrey Paulsen
55a15fb1d0 Missed one IBM Copyright message for contributions in memory patcher component 2016-04-25 15:47:15 -04:00
Geoffrey Paulsen
ed6f508735 Updated IBM Copyright message for contributions in memory patcher component. 2016-04-25 15:13:38 -04:00
Karol Mroz
e1c64e6e59 opal: standardize on max hostname length
Define OPAL_MAXHOSTNAMELEN to be either:
  (MAXHOSTNAMELEN + 1) or
  (limits.h:HOST_NAME_MAX + 1) or
  (255 + 1)

For pmix code, define above using PMIX_MAXHOSTNAMELEN.

Fixup opal layer to use the new max.

Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-04-24 08:19:47 +02:00
Jeff Squyres
dc18c32437 usnic: fix resource check
The math for checking the number of QPs and CQs per usNIC/VF was
incorrect, allowing you to run MPI processes even when usNICs (i.e.,
VIC VFs) had fewer QPs and CQs than were necessary.  This led to a
confusing error later when fi_enable(3) failed (because we lazily
create QPs).  Fixing the math here ensure that we actually print a
helpful error message telling the user specifically what is wrong.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-04-22 15:58:27 -07:00
Jeff Squyres
68c1a5eb6c Merge pull request #1567 from jsquyres/pr/fix-ompi-to-opal-name-conversion
m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*
2016-04-20 13:10:06 -04:00
Jeff Squyres
6800ef9ec0 m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*
These macros should really be named OPAL_SUMMARY_*; they're used in
all projects, and therefore should be in the lowest later project (OPAL).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-04-20 08:40:00 -07:00
Nathan Hjelm
db854c368a memory/patcher: do not hook madvise if the syscall doesn't exist
Fixes open-mpi/ompi#1565

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-20 09:18:31 -06:00
Nathan Hjelm
fdd1ff7c29 Merge pull request #1562 from hjelmn/opal_coverity
mca/base: fix coverity issue
2016-04-19 14:29:05 -06:00
Nathan Hjelm
3f15d442de Merge pull request #1561 from hjelmn/mpool_rewrite
rcache/base: add missing file to tarball
2016-04-19 12:00:06 -06:00
Nathan Hjelm
16c28399cd rcache/base: add missing file to tarball
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 11:03:38 -06:00
Nathan Hjelm
d981a9fc7d patcher/overwrite: fix compile error on x86
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 10:16:39 -06:00
Nathan Hjelm
5bc9d9d1f8 patcher/linux: fix compiler warnings
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 10:16:20 -06:00
Nathan Hjelm
1147fb3dd1 patcher/linux: ensure component is only enabled on Linux
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-19 10:16:00 -06:00
Nathan Hjelm
a8e90e8796 memory/patcher: munmap hook could be called from within a malloc() implementation
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-18 15:00:07 -06:00
Nathan Hjelm
7f271171f6 memory/patcher: fix coverity warning
Fix CID 1358512:  Error handling issues  (NEGATIVE_RETURNS):

C libraries usually handle read (-1, ...) fine but it is safer to
avoid calling read with a negative handle. Added negative file
descriptor check.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-18 11:29:20 -06:00
Gilles Gouaillardet
d96919638f pmix: remote autogenerated file and update .gitignore
removed: opal/mca/pmix/pmix114/pmix/src/include/private/autogen/config.h.in
2016-04-18 12:57:41 +09:00
Ralph Castain
b009e58d25 Roll to PMIx 1.1.4rc2 - replaces some code that was incorrectly removed in prior update 2016-04-16 18:24:24 -07:00
Ralph Castain
8ff114e668 Update to official PMIx 1.1.4rc1 2016-04-15 21:47:46 -07:00
Ralph Castain
449ec41532 Roll to PMIx 1.1.4rc1 and remove the PMIx 1.2.0 directory as the community has decided to not do that release version. This incorporates a number of bug fixes that have been identified and repaired in the PMIx and OMPI code bases. Also includes several minor corrections to the PMIx code so it now supports run-thru without hanging on collectives involving a process that exits 2016-04-15 10:11:11 -07:00
Nathan Hjelm
4d9c047f04 Merge pull request #1546 from hjelmn/mpool_rewrite
rcache: add missing file
2016-04-14 10:22:09 -06:00
Nathan Hjelm
9046424be5 rcache: add missing file
Fixes #1545

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-14 09:21:09 -06:00
Ralph Castain
4b3995dd27 Trivial change to silence warning 2016-04-14 05:54:02 -07:00
Nathan Hjelm
1e6b4f2f55 Merge pull request #1495 from hjelmn/new_hooks
Add new patcher memory hooks
2016-04-13 18:19:23 -06:00
Nathan Hjelm
c2b6fbb124 opal/memory: move initialization to first rcache creation
Because of the removal of the linux memory component it is no longer
necessary to initialize the memory component in opal_init(). This
commit moves the initialization to the creation of the first rcache
component.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-13 17:21:46 -06:00
Nathan Hjelm
80ec79cfc8 memory/patcher: updates to memory hooks
This commit fixes bugs that can cause crashes and memory corruption
when the mremap hook is called. The problem occurs because of the
ellipses (...) in the mremap intercept function. The ellipses cover
the optional new_addr argument on Linux. This commit removes the
ellipses and adds an explicit 5th argument.

This commit also adds a hook for shmdt. The code only works on Linux
at the moment as it needs to read /proc/self/maps to determine the
size of the shared memory segment.

Additionally, this commit removes the mmap hook. There is no
apparent benefit for detecting mmap(..., PROT_NONE, ...) and it
seems to cause problems when threads are in use.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-13 17:20:24 -06:00
Nathan Hjelm
91bcab93cb opal/memory: remove ptmalloc2
This commit removes the ptmalloc2 memory hooks. This is necessary in
order to support lazy registration of memory hooks. A feature that is
not supported by the ptmalloc hooks but is supported by the new
patcher hooks.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-13 17:18:15 -06:00
Nathan Hjelm
27f8a4e806 opal: add code patcher framework
This commit adds a framework to abstract runtime code patching.
Components in the new framework can provide functions for either
patching a named function or a function pointer. The later
functionality is not being used but may provide a way to allow memory
hooks when dlopen functionality is disabled.

This commit adds two different flavors of code patching. The first is
provided by the overwrite component. This component overwrites the
first several instructions of the target function with code to jump to
the provided hook function. The hook is expected to provide the full
functionality of the hooked function.

The linux patcher component is based on the memory hooks in ucx. It
only works on linux and operates by overwriting function pointers in
the symbol table. In this case the hook is free to call the original
function using the function pointer returned by dlsym.

Both components restore the original functions when the patcher
framework closes.

Changes had to be made to support Power/PowerPC with the Linux
dynamic loader patcher. Some of the changes:

 - Move code necessary for powerpc/power support to the patcher
   base. The code is needed by both the overwrite and linux
   components.

 - Move patch structure down to base and move the patch list to
   mca_patcher_base_module_t. The structure has been modified to
   include a function pointer to the function that will unapply the
   patch. This allows the mixing of multiple different types of
   patches in the patch_list.

 - Update linux patching code to keep track of the matching between
   got entry and original (unpatched) address. This allows us to
   completely clean up the patch on finalize.

All patchers keep track of the changes they made so that they can be
reversed when the patcher framework is closed.

At this time there are bugs in the Linux dynamic loader patcher so
its priority is lower than the overwrite patcher.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-13 17:16:13 -06:00
Nathan Hjelm
4cac623aeb opal/patch: add call to check if binary patching is supported
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-13 17:16:12 -06:00
Nathan Hjelm
11e2d7886e opal/memory: update component structure
This commit makes it possible to set relative priorities for
components. Before the addition of the patched component there was
only one component that would run on any system but that is no longer
the case. When determining which component to open each component's
query function is called and the one that returns the highest priority
is opened. The default priority of the patcher component is set
slightly higher than the old ptmalloc2/ummunotify component.

This commit fixes a long-standing break in the abstration of the
memory components. ompi_mpi_init.c was referencing the linux malloc
hook initilize function to ensure the hooks are initialized for
libmpi.so. The abstraction break has been fixed by adding a memory
base function that calls the open memory component's malloc hook init
function if it has one. The code is not yet complete but is intended
to support ptmalloc in 2.0.0. In that case the base function will
always call the ptmalloc hook init if exists.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-13 17:14:51 -06:00
Nathan Hjelm
7aa03d66b3 opal/memory: add support for patch based memory hooks
This commit adds support for runtime binary patching. The support is
broken down into two parts: util/opal_patcher.[ch] which contains the
functionality for runtime patching of symbols, and mca/memory/patcher
which patches the various symbols needed to provide support for memory
hooks. This work is preliminary and is based off work donated by IBM.

The patcher code is disabled if dlopen is disabled.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-13 17:14:31 -06:00
Gilles Gouaillardet
c72688e8cf Merge pull request #1362 from ggouaillardet/topic/openib_warn_default_gid_prefix
btl/openib: correctly issue a warning when two btls or more are in th…
2016-04-11 13:22:48 +09:00
Gilles Gouaillardet
4ab6c8ad56 mpool/hugepage: use statvfs() instead of statfs() when needed.
Thanks Siegmar Gross for the report.
2016-04-11 11:13:29 +09:00
Ralph Castain
2432daf065 Some minor cleanups of a memory leak and error output 2016-04-08 07:46:18 -07:00
Rainer Keller
52080a5736 As per the pull request to pmix/master:
https://github.com/pmix/master/pull/71

Have OMPI's current version of pmix120 nicely fail in case of
too long sun_path (longer than 108 or in case of OSX 103 chars).
And have OMPI return proper error messages with hints how to
amend.
2016-04-07 22:12:53 +02:00
Thananon Patinyasakdikul
92290b94e0 Fixed Coverity reports 1358014-1358018 (DEADCODE and CHECK_RETURN) 2016-04-07 12:52:17 -04:00
Nathan Hjelm
9efd465539 Merge pull request #1517 from hjelmn/ugni_fixes
Gemini/Aries bug fixes
2016-04-05 07:23:18 -06:00
George Bosilca
26fc8533f8 Remove compiler warnings. 2016-04-04 16:34:23 -04:00
Gilles Gouaillardet
6f450630d8 pmix/external: fix misc missing conversion and type issues 2016-04-04 10:12:34 +09:00
Gilles Gouaillardet
2ede47c462 pmix: fix misc missing conversion and type issues 2016-04-04 10:12:34 +09:00
Nathan Hjelm
7e806791aa rcache/udreg: bug fixes
This commit fixes bugs that caused hangs or crashes when running out
of registration resources.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-02 12:24:20 -06:00
Nathan Hjelm
d7874920aa btl/ugni: set the frag reference count in the eager get path
This comit adds code that sets the fragment reg_cnt to 1 when sending
the completion message for an eager get. Without this the btl will
either hang or abort.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-02 12:10:22 -06:00
Ralph Castain
2bfd6a92b0 Add missing include 2016-04-02 08:52:13 -07:00
Nathan Hjelm
6df5a663b7 mca/base: fix coverity issue
Fixes CID 1357979: Resource leak (RESOURCE_LEAK):

flags array was being leaked. Fixed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-30 12:49:37 -06:00
Ralph Castain
7eca2f9650 Add missing include 2016-03-30 01:34:01 -07:00
Gilles Gouaillardet
e852d85cc1 btl/tcp: add missing mca_btl_tcp_dump() subroutine 2016-03-30 16:10:15 +09:00
Ralph Castain
f70c5c495b Tsk...tsk...replace references to ompi values with opal 2016-03-29 13:35:43 -07:00
George Bosilca
d0165818b3 Initialize all common symbols. 2016-03-29 16:08:27 -04:00
Jeff Squyres
91c54d7a07 Merge pull request #1491 from ICLDisco/progress_thread
BTL TCP async progress
2016-03-29 06:26:10 -04:00
George Bosilca
f69eba1bc4 Update the copyright and cleanup the code.
Per @jsquyres suggestion remove all trailing spaces.
Credit to `sed -i.bak 's/ *$//' */[ch]`.
2016-03-28 14:41:01 -04:00
Thananon Patinyasakdikul
92062492b9 Enable Threading in the BTL TCP
Added mca parameter to turn progress thread on/off
Add a flag to check if we have btl progress thread.
Added macro for ob1 matching lock.
Update the AUTHORS file.
2016-03-28 14:41:01 -04:00
George Bosilca
32277db6ab Add support for async progress in the BTL TCP.
All BTL-only operations (basically all data movements
with the exception of the matching operation) can now
be handled for the TCP BTL by a progress thread.
2016-03-28 14:40:50 -04:00
Jeff Squyres
4a3c986a80 usnic: remove need for hwloc verbs helper
Haven't needed this for a while, but it got left in by accident.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-03-28 09:10:12 -07:00
Jeff Squyres
05e2423756 usnic: specify the cache name
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-03-28 09:02:52 -07:00
Gilles Gouaillardet
4a76f23f40 btl/openib: do not issue an error message if modex cannot retrieve openib info 2016-03-28 10:42:16 +09:00
Gilles Gouaillardet
861564df94 btl/openib: correctly issue a warning when two btls or more are in the default subnet gid
Thanks Matias Cabral for reporting this issue.

Fixes #1352
2016-03-28 09:17:29 +09:00
Nysal Jan K.A
75233573d1 pmix: Increment the reference count in PMIx_Init
The reference counting was broken which led PMIx_Finalize
to release resources early. This fixes the "use after free" scenarios
that I encountered.

(based on commit pmix/master@abfaa4c)
2016-03-27 04:11:25 -04:00
Nathan Hjelm
d6e90f24b1 Merge pull request #1483 from hjelmn/flag_enum_2
RFC: Add support for flag enumerators for MCA variables
2016-03-26 11:43:33 -06:00
Jeff Squyres
017f242b1b opal: remove some unused variables / compiler warnings
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-03-26 03:50:57 -07:00
Josh Hursey
099170bb31 Merge pull request #1496 from jjhursey/topic/pmix120-obj-patch
pmix/pmix120: Fix OBJ_ to PMIX_ symbol name
2016-03-25 19:35:43 -05:00
Ralph Castain
0b4310b186 Remove an unnecessary header that forced exposure of the PMIx internal headers 2016-03-25 16:57:41 -07:00
Ralph Castain
e8246e079b Minor cleanup to match the changes in the PMIx master 2016-03-25 15:12:41 -07:00
Joshua Hursey
8ebeaa5861 pmix/pmix120: Fix OBJ_ to PMIX_ symbol name 2016-03-25 16:17:08 -05:00
rhc54
ba8c8700aa Merge pull request #1493 from rhc54/topic/sing
Update singularity support to track changes in upstream Singularity code
2016-03-24 15:16:38 -07:00
Ralph Castain
8c14df2328 Revert "Modify singularity support per patch from Greg Kurtzer"
This reverts commit open-mpi/ompi@f7257a8310.

Ensure that we properly cleanup the session directory tree. Prior code had issues with symlinks, especially if the file that the link points to was already removed as we traverse the tree. Also found that the dirent checks for directory type weren't fully portable, and so fall back to the stat-based approach which is known to be portable.

Fix singularity singletons by detecting we are in a container and properly setting the pmix selection to pick the isolated component. Remove a stale restriction blocking use of the sm btl
2016-03-24 11:27:18 -07:00
Joshua Ladd
c2813a48e6 Merge pull request #1488 from alinask/topic/openib_enhance_verbose
btl/openib: enhance the verbosity level when using rdmacm without a first PP QP
2016-03-24 10:17:21 -04:00
George Bosilca
ab8008e9fe Nathan missed one reference to mpool. 2016-03-24 00:52:59 -04:00
Nathan Hjelm
478feaf622 btl/scif: update for mpool/rcache rewrite
This commit brings the scif btl up to date with changes made on master
to rework the mpool and rcache frameworks.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-23 16:30:10 -06:00
Alina Sklarevich
2e5a1372dd btl/openib: enhance the verbosity level when using rdmacm without a first PP QP. 2016-03-23 18:49:12 +02:00
Nathan Hjelm
7572c8b74f btl: use flag enumerator for btl_*_flags and btl_*_atomic_flags
This commit uses the new flag "enumerator" to support comma-delimited
lists of flags for both the btl and btl atomic flags. After this
commit is is valid to specify something like -mca btl_foo_flags
self,put,get,in-place. All non-deprecated flags are supported by the
enumerator.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-21 15:23:37 -06:00
Nathan Hjelm
b15a45088c mca: add support for flag enumerators
This commit adds a new type of enumerator meant to support flag
values. The enumerator parses comma-delimited strings and matches
each string or value to a list of valid flags. Additionally, the
enumerator does some basic checks to see if 1) a flag is valid in the
enumerator, and 2) if any conflicting flags are specified.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-21 15:20:56 -06:00
Nathan Hjelm
645bd9d9dd btl/openib: update rdmacm for dynamic add_procs
This commit adds the data necessesary for supporting dynamic add_procs
to the rdma message (opal_process_name_t). The endpoint lookup
function has been updated to match the code in udcm.

Closes open-mpi/ompi#1468.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-21 10:00:41 -06:00
Nathan Hjelm
4d4fa28f75 opal: fix coverity issues
Fix CID 1345825 (1 of 1): Dereference before null check (REVERSE_INULL):

ib_proc should not be NULL in this case. Removed the check and added a
check for NULL after OBJ_NEW.

CID 1269821 (1 of 1): Dereference null return value (NULL_RETURNS):

I labeled this one as a false positive (which it is) but the code in
question could stand be be cleaned up.

Fix CID 1356424 (1 of 1): Argument cannot be negative (NEGATIVE_RETURNS):

While trying to silence another Coverity issue another was
flagged. Protect the close of fd with if (fd >= 0).

CID 70772 (1 of 1): Dereference null return value (NULL_RETURNS):
CID 70773 (1 of 1): Dereference null return value (NULL_RETURNS):
CID 70774 (1 of 1): Dereference null return value (NULL_RETURNS):

None of these are errors and are intentional but now that we have a
list release function use that to make these go away. The cleanup is
similar to CID 1269821.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-18 15:56:08 -06:00
Nathan Hjelm
676a33bfff rcache/grdma: do not OBJ_RELEASE vma tree too early
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-17 11:31:41 -06:00
Nathan Hjelm
852cc8cfbc opal: fix various coverity errors
Fix CID 1356358:  Null pointer dereferences  (REVERSE_INULL):

flist->fl_mpool can no longer be NULL. Removed the conditional.

Fix CID 1356357:  Resource leaks  (RESOURCE_LEAK):

Added the call to free the hints array.

Fix CID 1356356:  Resource leaks  (RESOURCE_LEAK):

This is a false error but it is safe to call close (-1) so just always
call close.

Fix CID 1356354:  Control flow issues  (MISSING_BREAK):
Fix CID 1356353:  Control flow issues  (MISSING_BREAK):

Add comments that indicate the fall-through is intentional.

Fix CID 1356351:  Null pointer dereferences  (FORWARD_NULL):

Fix potential SEGV if the page_size key is malformed.

Fix CID 1356350:  Error handling issues  (CHECKED_RETURN):

Add (void) to indicate that we do not care about the return code of
sscanf in this case.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-17 10:05:57 -06:00
Mike Dubman
8f1838df4b Merge pull request #1459 from alinask/topic/openib_diff_subnets
btl/openib: enable connecting processes from different subnets.
2016-03-17 08:40:08 +02:00
Nathan Hjelm
cbce085b12 rcache/grdma: fix typo
This typo was originally fixed on the mpool_rewrite branch but the change
was lost.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-16 18:30:44 -06:00
Jeff Squyres
d44804f0c9 usnic: use version 1 of the API, not the current version 2016-03-16 16:03:51 -07:00
Jeff Squyres
e7ef711455 usnic: allow mpool_hints to be empty
Follow on to open-mpi/ompi@eac0b11
2016-03-16 15:04:39 -07:00
Alina Sklarevich
bbcbe3cacd btl/openib: enable connecting processes from different subnets.
+ Added an mca parameter to allow connecting processes from different
subnets. Its current default value is 'false' - don't allow, to keep the
current flow the way it is now.

+ rmdacm: when calling ibv_query_gid, use the gid index from
btl_openib_gid_index.
2016-03-16 10:52:06 +02:00
Gilles Gouaillardet
99809162b0 rcache: initialize common symbol mca_rcache_base_used_mem_hooks 2016-03-16 09:27:33 +09:00
Ralph Castain
beecf1b6eb Add missing include, remove unused vairable 2016-03-15 13:45:27 -07:00