
3321 commits

Author SHA1 Message Date
Nathan Hjelm
47fd2313ab btl/vader: move backing files into /dev/shm on Linux
This commit moves the backing files to /dev/shm to avoid limitations
that may be set on /tmp. The files are registered with pmix to ensure
they are cleaned up after an erroneous exit.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 48101278160672317ade352365592f56ef3b8977)
2017-12-18 07:09:18 -08:00
Ralph Castain
07427c6d89 Update to PMIx v3.0 PR for cleanup registration
If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so.

Final cleanup to track uid/gid, update the opal/pmix API to pass flags for ignore and leave top directory alone

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-12-18 06:53:11 -08:00
Nathan Hjelm
d3fa1bbbb0 rcache/grdma: try to prevent erroneous free error messages
It is possible to have parts of an in-use registered region be passed
to munmap or madvise. This does not necessarily mean the user has made
an error but does mean the entire region should be invalidated. This
commit checks that the munmap or madvise base matches the beginning of
the cached region. If it does and the region is in-use then we print
an error. There will certainly be false-negatives where a user
unmaps something that really is in-use but that is preferable to a
false-positive.

References #4509

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-12-12 09:18:39 -07:00
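The check described in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `sketch_region_t`, `sketch_action_t`, and `sketch_check_release` are hypothetical names, not the rcache/grdma data structures, and the real code presumably walks a registration cache rather than testing a single region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached registration (not the grdma type). */
typedef struct {
    uintptr_t base;      /* start address of the registered region */
    size_t    len;       /* length of the region in bytes */
    int       ref_count; /* > 0 means the region is still in use */
} sketch_region_t;

typedef enum {
    SKETCH_NO_OVERLAP,            /* released range does not touch the region */
    SKETCH_INVALIDATE,            /* invalidate the region silently */
    SKETCH_INVALIDATE_WITH_ERROR  /* invalidate and report a likely user error */
} sketch_action_t;

/* Decide what to do when [base, base + len) is passed to munmap/madvise. */
static sketch_action_t sketch_check_release(const sketch_region_t *reg,
                                            uintptr_t base, size_t len)
{
    assert(NULL != reg);

    if (base >= reg->base + reg->len || base + len <= reg->base) {
        return SKETCH_NO_OVERLAP;
    }
    /* Only an exact match on the region start of an in-use region is
     * treated as an error; a partial overlap is assumed to be legitimate. */
    if (base == reg->base && reg->ref_count > 0) {
        return SKETCH_INVALIDATE_WITH_ERROR;
    }
    return SKETCH_INVALIDATE;
}
```

Under this policy, releasing the middle of an in-use region is never reported, which is the false-negative trade-off the message accepts.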
Nathan Hjelm
a82f761a4a btl/vader: change the way fast boxes are used
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.

References #4260

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-12-11 10:38:33 -07:00
Gilles Gouaillardet
11e5f86bf8 mpool/base: plug a memory leak
set the key of all mpool_tree_item objects, so they can be retrieved
in mpool_base_free and then returned back to the
mca_mpool_base_tree_item_free_list free list.

Refs. open-mpi/ompi#4567

Thanks Philip Blakely for the bug report.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-12-07 09:06:25 +09:00
Ben Menadue
90fa8af10b Use correct alignment request in mca_mpool_base_alloc.
Signed-off-by: Ben Menadue <ben.menadue@nci.org.au>
2017-12-06 07:02:17 +11:00
Ben Menadue
db3e25edad Update mca_mpool_base_alloc to use malloc instead of posix_memalign for alignment requests of <= sizeof(void *). This works around issue #4564.
Signed-off-by: Ben Menadue <ben.menadue@nci.org.au>
2017-12-05 09:51:31 +11:00
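As a rough illustration of the workaround above, a minimal sketch assuming a POSIX system: `sketch_aligned_alloc` is a hypothetical helper, not the actual `mca_mpool_base_alloc` code. `posix_memalign` requires the alignment to be a power of two and a multiple of `sizeof(void *)`, whereas `malloc` already returns memory aligned for any built-in type, which on common platforms covers requests up to `sizeof(void *)`.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate `size` bytes aligned to `align` bytes. */
static void *sketch_aligned_alloc(size_t size, size_t align)
{
    void *ptr = NULL;

    /* align must be zero (no requirement) or a power of two */
    assert(0 == (align & (align - 1)));

    /* Small alignment requests are already satisfied by malloc(), and
     * posix_memalign() would reject alignments below sizeof(void *). */
    if (align <= sizeof(void *)) {
        return malloc(size);
    }

    /* posix_memalign() returns 0 on success and an error number otherwise. */
    if (0 != posix_memalign(&ptr, align, size)) {
        return NULL;
    }
    return ptr;
}
```

Memory from either path can be released with `free`.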
Matias Cabral
2c86b8723d
Merge pull request #4510 from matcabral/mtl_psm2_shadow_vars
New flag for MCA parameters that allows a default value of "unset".
2017-12-04 12:25:37 -08:00
Nathan Hjelm
7893248c5a opal/asm: add fetch-and-op atomics
This commit adds support for fetch-and-op atomics. This is needed
because AND and OR are irreversible operations, so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is no atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:23 -07:00
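The relationship between the two flavors can be sketched with plain C11 atomics. The `sketch_*` names below are hypothetical and are not the opal function names; the sketch only shows how op-and-fetch (returns the new value) can be defined in terms of fetch-and-op (returns the old value), and why the reverse does not work for AND and OR.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* fetch-and-op: returns the value stored BEFORE the operation. */
static inline uint32_t sketch_fetch_and(_Atomic uint32_t *addr, uint32_t mask)
{
    assert(NULL != addr);
    return atomic_fetch_and(addr, mask);
}

/* op-and-fetch: returns the value stored AFTER the operation. It can be
 * derived from fetch-and-op by re-applying the operation to the old value. */
static inline uint32_t sketch_and_fetch(_Atomic uint32_t *addr, uint32_t mask)
{
    assert(NULL != addr);
    return atomic_fetch_and(addr, mask) & mask;
}

static inline uint32_t sketch_or_fetch(_Atomic uint32_t *addr, uint32_t bits)
{
    assert(NULL != addr);
    return atomic_fetch_or(addr, bits) | bits;
}

/* For addition the old value is recoverable from the new one (new - delta),
 * but for AND/OR the cleared or set bits are lost, so the old value has to
 * be fetched atomically. */
static inline uint32_t sketch_add_fetch(_Atomic uint32_t *addr, uint32_t delta)
{
    assert(NULL != addr);
    return atomic_fetch_add(addr, delta) + delta;
}
```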
Nathan Hjelm
1282e98a01 opal/asm: rename existing arithmetic atomic functions
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:22 -07:00
Nathan Hjelm
9d0b3fe9f4 opal/asm: remove opal_atomic_bool_cmpset functions
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-11-30 10:41:22 -07:00
Nathan Hjelm
647b40f3f2
Merge pull request #4442 from bosilca/topic/ob1_pvar
Topic/ob1 pvar
2017-11-29 09:31:07 -07:00
Gilles Gouaillardet
3b4b3bb6f9 pmix/ext3x: add a missing cnctcbfunc field to ext3x_opalcaddy_t
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-11-28 16:11:08 +09:00
Ralph Castain
3906aaf41a Silence warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-25 11:50:18 -08:00
Ralph Castain
30f23ac67a Save one more file descriptor per process by not opening one for stddiag
if PMIx (version > 1.x) is active since all diagnostic messages will instead flow thru
the PMIx connection. Unfortunately, PMIx v1 does not support this
feature, but we can remove the stddiag support once PMIx v1 slides out
of the support window

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-25 11:48:53 -08:00
Matias A Cabral
1fad59465f New flag for MCA parameters that allows a default value
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.

Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
2017-11-16 16:28:50 -08:00
Jeff Squyres
c19822dad4 pmix: pack pointer to object (vs. pointer to pointer)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-13 09:50:44 -08:00
Ralph Castain
6eb3c124e1
Merge pull request #4498 from rhc54/topic/pmixup
Some minor cleanups of the DVM
2017-11-12 19:01:15 -08:00
Ralph Castain
9c84e1485b Some minor cleanups of the DVM
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-12 16:27:37 -08:00
Ralph Castain
64e838c1ac
Merge pull request #4495 from rhc54/topic/pmixup
Sync to PMIx master
2017-11-11 18:25:45 -08:00
Ralph Castain
d75d0bc5f6 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-11 17:06:41 -08:00
Jeff Squyres
99662757e2 usnic: only output unknown frames in verbose mode
Per
https://www.mail-archive.com/users@lists.open-mpi.org/msg31758.html,
only output unknown frames when we're outputting verbose BTL messages.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-11-10 11:14:05 -08:00
Gilles Gouaillardet
cfdf042d89
Merge pull request #4461 from ggouaillardet/topic/cygwin_fixes
memory/patcher: #ifdef out some parts when SYS_munmap is not defined
2017-11-08 13:24:12 +09:00
Ralph Castain
d4b83cc951 Sync with PMIx master
Implement direct modex protection to turn off PMIx dstore when direct modex scenario is detected

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-07 18:10:56 -08:00
Gilles Gouaillardet
d19a8351c8 memory/patcher: #ifdef out some parts when SYS_munmap is not defined
so memory/patcher can work under cygwin

Thanks Marco Atzeri for bringing this to our attention

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-11-07 16:44:40 +09:00
George Bosilca
e57834aaaa
Point to the correct MPI object.
Store the pointer to the object handle and not the pointer to the
pointer.
We should not assert(0) in the code !

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-11-03 01:20:34 -04:00
bosilca
63e8a8c608
Merge pull request #4431 from hjelmn/asm_cleanup
opal: rename opal_atomic_cmpset* to opal_atomic_bool_cmpset*
2017-11-02 18:45:56 -04:00
Ralph Castain
b97caf8f05 Correct copy/paste error in example
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-02 10:33:28 -07:00
Nathan Hjelm
3ff34af355 opal: rename opal_atomic_cmpset* to opal_atomic_bool_cmpset*
This commit renames the atomic compare-and-swap functions to indicate
the return value. This is in preparation for adding support for a
compare-and-swap that returns the old value. At the same time the
return type has been changed to bool.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-10-31 12:47:23 -06:00
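The distinction this rename prepares for, a compare-and-swap that reports success versus one that returns the old value, can be illustrated with C11 atomics. The `sketch_*` functions are hypothetical and do not reproduce the opal signatures; both are built on `atomic_compare_exchange_strong`, which writes the observed value back into its `expected` argument on failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if *addr equalled oldval and was replaced by newval. */
static inline bool sketch_bool_cmpset(_Atomic int32_t *addr,
                                      int32_t oldval, int32_t newval)
{
    assert(NULL != addr);
    return atomic_compare_exchange_strong(addr, &oldval, newval);
}

/* Returns the value observed in *addr. The swap succeeded if and only if
 * the returned value equals oldval. */
static inline int32_t sketch_val_cmpset(_Atomic int32_t *addr,
                                        int32_t oldval, int32_t newval)
{
    assert(NULL != addr);
    /* On failure, oldval is overwritten with the current contents of *addr;
     * on success it is left unchanged, which is also the observed value. */
    atomic_compare_exchange_strong(addr, &oldval, newval);
    return oldval;
}
```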
Ralph Castain
27f3d417ca Revert the MPI_Init fence operations to use volatile bool instead of thread macros.
The problem is that the waiting thread is cycling using OMPI_LAZY_WAIT_FOR_COMPLETION so it can exercise opal_progress. This probably isn't as critical for the modex step, but definitely necessary for the barrier at the end of mpi_init. The problem this creates is that the lazy macro exits as soon as "active" becomes false, and then we destruct the lock.

However, wakeup_thread sets "active" to false - and then calls the condition broadcast to wakeup any waiting threads. So there is a race condition between that broadcast and the lock destruct.

Add OPAL_ACQUIRE_OBJECT and OPAL_POST_OBJECT memory barriers to help protect against thread race conditions on some platforms

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-31 08:09:02 -07:00
Ralph Castain
7839dc91a8 Sync to PMIx v3.0 (master)
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-30 13:06:41 -07:00
Ralph Castain
36d7e752b6 I think we have all concluded that there is no good answer to locating the external libevent library, so surrender to the situation and simply remove that requirement. Users wanting to utilize the embedded PMIx library can install it, but will have to use mpicc _and_ add an explicit -lpmix to their cmd line to compile their application.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-29 07:39:02 -07:00
Gilles Gouaillardet
5c61a4e3a5 configury: fix handling of external libevent library
Search external libevent library in both DIR/lib64 and DIR/lib
when --with-libevent=DIR is specified but --with-libevent-libdir=DIR is not

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-10-27 15:52:18 +09:00
Ralph Castain
ea3508b26b Sync to PMIx master (now v3.0)
Fix an apparent typo in external libevent configury
Require external libevent for install of separate libpmix

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-26 21:05:17 -07:00
Ralph Castain
01ed7548c4 Update to PMIx v3.0a
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-25 12:25:27 -07:00
Ralph Castain
8fbfe68754 Alter the PMIx embedded configuration so that we can build static with devel headers - if the builder requests that we install a separate libpmix, then don't prefix the PMIx variables.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-24 21:45:27 -07:00
Ralph Castain
292983261a We should never block when requesting dmodex data from the PMIx server as this will block it from being able to accept connections from local clients. Do not deregister standing dmodex requests when a fence completes unless we actually collected the data in the fence
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-24 07:51:10 -07:00
bosilca
ac348da13a Merge pull request #4374 from bosilca/topic/osx_syslog
Topic/osx syslog
2017-10-23 18:06:36 -04:00
Ralph Castain
6ea3c8a0bd Update the interlib example to show an alternative method for model declaration. Add a missing range value to the OPAL layer. Make it easier to see OMPI model callbacks
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-23 11:27:42 -07:00
George Bosilca
8f32b345de
Address syslog issues on OSX 10.13 with gcc 7.x
gcc 7.[1,2] (at least) fails to correctly parse the OSX 10.13 sys/syslog.h
header. As a result we need to protect syslog support in OPAL, PMIX and
ORTE.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-10-23 14:02:10 -04:00
Ralph Castain
a63904d47f Updates to support cross-version operations with OMPI v2.x
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-22 08:38:33 -07:00
Ralph Castain
f8ce31f13c Fix event registration so OpenMP/MPI coordination sides can both get notification of model declarations
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-19 18:06:38 -07:00
Howard Pritchard
e8bfd494e7 pmix/cray: define fence method for cray pmix
Turns out UCX PML calls opal_pmix.fence in its del procs
method without checking whether or not the fence method
for the pmix component was defined.  Rather than patch
UCX PML, actually define a fence method for the cray pmix.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-10-17 15:58:01 -06:00
Ralph Castain
60b338e857 Sync to PMIx v3. Ensure prun uses the ess/tool component.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-14 08:24:57 -07:00
Ralph Castain
8ae10c9e1a Ensure we exit with an appropriate error code when hitting a PMI2 error
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-13 19:30:28 -07:00
Jeff Squyres
5705192151 configure: add --en|disable-show-load-errors-by-default
Give packagers a configure CLI option to set the value of the MCA
variable mca_base_component_show_load_errors.

The --disable form of this option is intended for Open MPI packagers
who tend to enable support for many different types of networks and
systems in their packages.  For example, consider a packager who
includes support for both the FOO and BAR networks in their Open MPI
package, both of which require support libraries (libFOO.so and
libBAR.so).  If an end user only has BAR hardware, they likely only
have libBAR.so available on their systems -- not libFOO.so.  Disabling
load errors by default will prevent the user from seeing potentially
confusing warnings about the FOO components failing to load because
libFOO.so is not available on their systems.

Conversely, system administrators tend to build an Open MPI that is
targeted at their specific environment, and contains few (if any)
components that are not needed.  In such cases, they might want their
users to be warned that the FOO network components failed to load
(e.g., if libFOO.so was mistakenly unavailable), because Open MPI may
otherwise silently failover to a slower network path for MPI traffic.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-10-11 11:02:21 -07:00
Ralph Castain
388034c814 Add support for the -v (verbose) option to prun and silence the "executing" and "completed" output otherwise.
Debounce "unreachable" notifications for tools when they disconnect
Enable the -x cmd line option for prun

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 0a5b36180a22959654461ac1303cec35313f8b4a)
2017-10-10 12:54:49 -07:00
Ralph Castain
c696e04c5e Since PMIx is moving to release v3.0, embed the new release candidate in opal/pmix framework. Move the pmix2x code over to the ext2x component. Create a new ext3x component
Remove some build product. Tell PMIx that we don't need a new nspace generated when OMPI calls connect
Add missing Makefile

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-09 13:51:08 -07:00
Ralph Castain
51f3fbdb3e Fix cmd line passing of DVM URI
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-06 18:10:46 -07:00
Ralph Castain
c3b239cee8 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-06 12:40:23 -07:00