Commit graph

24659 commits

Author SHA1 Message Date
Ralph Castain
4a55fba414 Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration. 2016-03-02 15:01:01 -08:00
Nathan Hjelm
5a85a039fa Merge pull request #1421 from hjelmn/btl_vader_threads
btl/vader: various threading fixes
2016-03-02 16:53:57 -06:00
Nathan Hjelm
2a0b3a5700 btl/vader: various threading fixes
This commit fixes several threading bugs:

 - Add an additional lock to the btl_base_endpoint_t structure to lock
   the list of pending frags. This allows the progress function to
   attempt to send pending frags without needing to drop/reacquire the
   lock. This should provide a small improvement in performance and
   fixes a potential race between adding and removing items from the
   pending list.

 - Ensure fast boxes are only set up once by updating the send count
   using atomics when needed and do not set the fast box buffer
   pointer until the fast box is set up.

Closes open-mpi/ompi#1408

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-03-02 10:50:59 -07:00
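
The two fixes described in the commit above can be illustrated with a minimal Python sketch. This is not the vader BTL code: the `Endpoint` class, all of its fields and methods, and the choice to trigger setup on the first send are hypothetical, and a `threading.Lock` stands in for the C atomics the commit mentions.

```python
import threading
from collections import deque


class Endpoint:
    """Hypothetical stand-in for an endpoint; not the real btl_base_endpoint_t."""

    def __init__(self):
        self.lock = threading.Lock()          # protects general endpoint state
        self.pending_lock = threading.Lock()  # protects only the pending list
        self.pending = deque()
        self.send_count = 0
        self.fast_box = None                  # published only after setup
        self.setup_calls = 0                  # instrumentation for this example

    def queue_frag(self, frag):
        # Adding and removing both take pending_lock, so they cannot race.
        with self.pending_lock:
            self.pending.append(frag)

    def progress(self, send):
        # The progress path never takes self.lock, so there is nothing
        # to drop and reacquire while draining pending fragments.
        while True:
            with self.pending_lock:
                if not self.pending:
                    return
                frag = self.pending.popleft()
            send(frag)

    def note_send(self):
        # Assumption: a lock emulates an atomic increment of the send count.
        with self.lock:
            self.send_count += 1
            first = self.send_count == 1
        if first:
            box = self._setup_fast_box()
            self.fast_box = box  # assigned only once setup has finished

    def _setup_fast_box(self):
        # Only the thread that observed the first send reaches this point.
        self.setup_calls += 1
        return bytearray(64)
```

Because exactly one caller observes the count changing from 0 to 1, the setup routine runs once no matter how many threads send concurrently, and other threads see `fast_box` as either `None` or a fully initialized buffer.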
Ralph Castain
f0680008d1 Add test file for singularity 2016-03-02 05:40:41 -08:00
Ralph Castain
06e811c5a6 Properly use the OPAL_MCA_PREFIX in orte_submit 2016-03-01 18:16:40 -08:00
Ralph Castain
1b81d90eaa Minor cleanups required for orte-dvm operation 2016-03-01 18:12:53 -08:00
Aurélien Bouteiller
892e1ed57e Fix a potential race condition in which a progress matching thread could match a request while we are cancelling it. 2016-03-01 16:43:45 -05:00
Ralph Castain
c9f7bb6751 Add the include file to all the schizo components 2016-03-01 13:18:23 -08:00
Ralph Castain
625083fe18 Add include file 2016-03-01 13:04:20 -08:00
rhc54
9df73568f4 Merge pull request #1411 from rhc54/topic/pmixdefault
Fix a number of issues, some of which have lingered for a long time
2016-03-01 11:08:28 -08:00
Ralph Castain
011403c04a Fix a number of issues, some of which have lingered for a long time:
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.

* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default

* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons

* ensure orterun doesn't propagate any ess or pmix directives in its environment

* Cleanup a few valgrind issues and memory leaks

* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)

* Ensure the schizo/alps component detects launch by mpirun
2016-03-01 06:53:00 -08:00
Gilles Gouaillardet
67e45028df Merge pull request #1414 from jsquyres/pr/egrep-for-examples-makefile
examples: update ompi_info bindings checks
2016-03-01 11:55:49 +09:00
Gilles Gouaillardet
8aff67c399 topo/base: correctly support MPI_UNWEIGHTED in mca_topo_base_dist_graph_neighbors()
Thanks Jun Kudo for the bug report.
2016-03-01 10:28:28 +09:00
Gilles Gouaillardet
e5d6b97db4 opal: fix pragma for GCC 6 and later
GCC 6 and later should ignore -Wpedantic instead of -pedantic
2016-02-29 13:56:22 +09:00
Jeff Squyres
677a31bc9f examples: update ompi_info bindings checks
Use "-q" option to grep/egrep to suppress output (we only need the
exit status).  Also, use egrep for the "use mpi" check, because some
versions of ompi_info say 'bindings:use_mpi:yes' and others say
'bindings:use_mpi:"yes' (i.e., with the double quote).  This regexp
will work with both versions.
2016-02-28 17:19:54 -08:00
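
The check described in the commit above can be sketched in Python. The regexp actually used in examples/Makefile is not shown in the commit message, so the pattern below (an optional double quote before `yes`) and the helper name `has_use_mpi_bindings` are assumptions for illustration only.

```python
import re

# Assumed pattern: accept both 'bindings:use_mpi:yes' and 'bindings:use_mpi:"yes'.
USE_MPI_RE = re.compile(r'bindings:use_mpi:"?yes')


def has_use_mpi_bindings(ompi_info_output: str) -> bool:
    """Return True if any line reports the Fortran "use mpi" bindings as present.

    Like `grep -q`, this produces no output; only the yes/no result matters.
    """
    return any(USE_MPI_RE.search(line) for line in ompi_info_output.splitlines())
```

Making the quote optional keeps a single test working against both output styles, which is the point of the change from a plain string match to a regular expression.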
Jeff Squyres
20fade1345 examples: fix check for Fortran "use mpi" bindings
The output from "ompi_info --parsable" for the Fortran "use mpi"
bindings apparently has changed over time.  It is now:

   "yes (full: ignore TKR)"
or "yes (limited: overloading)"

(including the quotes)

So update the test in examples/Makefile to also look for the quote.
2016-02-28 16:30:09 -08:00
rhc54
78a1fd5d54 Merge pull request #1413 from rhc54/topic/iof
Fix a segfault that can occur when very short-lived, non-ORTE procs are run
2016-02-28 13:55:15 -08:00
Ralph Castain
263b0c95a8 Fix a segfault that can occur when very short-lived, non-ORTE procs are run 2016-02-28 12:30:20 -08:00
Jeff Squyres
89f225ea7f Merge pull request #1410 from jsquyres/pr/cxx11-has-jumped-the-shark
cxx: "rank" is now a function in C++11
2016-02-27 09:37:13 -05:00
Jeff Squyres
89d0a033b7 cxx: "rank" is now a function in C++11
Use "myrank" instead (I tried using ::rank, but had varied
success... so I just renamed the variable).
2016-02-25 15:56:08 -06:00
rhc54
6ae75f007a Merge pull request #1409 from rhc54/topic/singleton
Provide an option to allow isolated singletons
2016-02-25 15:36:43 -06:00
Ralph Castain
cdb494566d Provide an option to allow isolated singletons 2016-02-25 11:33:26 -06:00
George Bosilca
dbe93b0b19 Use mca_bml_base_get_endpoint
Correctly use mca_bml_base_get_endpoint instead of accessing the
endpoint directly.
2016-02-25 11:00:30 -06:00
Sylvain Jeaugey
5f32f49eb8 pml/ob1: Fix segmentation fault on CUDA path.
Fix segfault due to mca_pml_ob1_cuda_need_buffers not handling the case of the
endpoint not being there. Calling mca_bml_get_endpoint() seems to fix the problem.

Fixes open-mpi/ompi#1402
2016-02-24 21:32:25 -08:00
rhc54
026cb37c4e Merge pull request #1400 from rhc54/topic/config
Adjust the pmix external component configure error messages
2016-02-24 14:22:59 -06:00
Ralph Castain
d28d3ee901 Make the error message on external pmix library a little clearer by separating out the libevent from the libhwloc checks 2016-02-24 11:20:25 -06:00
Ralph Castain
e8d347d7bd Add missing includes 2016-02-24 08:56:02 -06:00
Jeff Squyres
be22971c3a Merge pull request #1398 from jsquyres/pr/test-lib-name-cleanups
tests: fix library name
2016-02-23 21:31:39 -06:00
Gilles Gouaillardet
477991b5aa btl/openib: fix abstraction violation and use opal_memory->memoryc_set_alignment 2016-02-24 09:50:13 +09:00
Gilles Gouaillardet
d8482ce6f4 opal/mca/memory: add a memoryc_set_alignment subroutine to the OPAL memory MCA
this commit also (partially) reverts:
 - open-mpi/ompi@7de01b347c
 - open-mpi/ompi@8b05f308f9
2016-02-24 09:50:12 +09:00
Jeff Squyres
1340d51ddd tests: fix library name
Use @OPAL_LIB_PREFIX@ as appropriate in the library that we link against.
2016-02-23 16:22:59 -08:00
Edgar Gabriel
2c4b93f72b Merge pull request #1395 from edgargabriel/pr/fcoll-static-large-ops-fix
fix the data size counter for large ops for the static fcoll component
2016-02-23 10:26:16 -06:00
Edgar Gabriel
45003ef78d fix the data size counter for large ops for the static fcoll component 2016-02-23 08:33:50 -06:00
George Bosilca
d6fb56af29 Use the correct printf conversion specifier. 2016-02-23 01:26:27 -06:00
Gilles Gouaillardet
308bbcbad1 ompi/dpm: retrieves OPAL_PMIX_ARCH in heterogeneous mode
also remove code duplication by using ompi_proc_complete_init_single()

Thanks Siegmar Gross for reporting this issue, and Ralph for the guidance.
2016-02-22 11:01:06 +09:00
Gilles Gouaillardet
a4aa4c9571 ompi_proc_complete_init_single: make the subroutine public
and accept a proc from a different job
2016-02-22 11:01:06 +09:00
rhc54
1df4457af2 Merge pull request #1392 from rhc54/topic/dvm
Tools don't create the orte_job_data table, so don't remove jobs from it
2016-02-21 17:42:48 -08:00
Ralph Castain
77f800b7e8 Tools don't create the orte_job_data table, so don't remove jobs from it 2016-02-21 16:29:00 -08:00
Ralph Castain
64b7728f33 Fix typo - do not look at daemon job when considering completion of launch 2016-02-21 14:44:51 -08:00
rhc54
b499d4ba2a Merge pull request #1391 from rhc54/topic/dvm
Convert the orte_job_data pointer array to a hash table so it doesn't…
2016-02-21 13:07:59 -08:00
Ralph Castain
d653cf2847 Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM. 2016-02-21 11:55:49 -08:00
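
Why an index-addressed array grows without bound in a long-running DVM, while a hash table does not, can be shown with a small Python sketch. Both classes are hypothetical and do not reproduce the ORTE data structures; the sketch also assumes that job ids increase monotonically and are never reused.

```python
class ArrayJobTable:
    """One slot per job id; removing a job leaves a hole, so the list never shrinks."""

    def __init__(self):
        self.slots = []

    def add(self, jobid, job):
        # Extend the array until the slot for this job id exists.
        while len(self.slots) <= jobid:
            self.slots.append(None)
        self.slots[jobid] = job

    def remove(self, jobid):
        self.slots[jobid] = None


class HashJobTable:
    """Keyed by job id; removing a job releases its entry."""

    def __init__(self):
        self.jobs = {}

    def add(self, jobid, job):
        self.jobs[jobid] = job

    def remove(self, jobid):
        self.jobs.pop(jobid, None)


def run_jobs(table, count):
    # Simulate a persistent DVM that launches and completes many jobs.
    for jobid in range(count):
        table.add(jobid, {"id": jobid})
        table.remove(jobid)
```

After `run_jobs(table, 1000)`, the array variant still holds 1000 empty slots, whereas the hash variant holds no entries.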
Ralph Castain
309e23ab3a Fix minor typo 2016-02-20 01:33:10 -08:00
yohann
59b6d041f8 mtl/ofi: Check allocated pointer. 2016-02-19 16:59:47 -08:00
yohann
bd47062764 mtl/ofi: Fix error handling. 2016-02-19 16:58:41 -08:00
yohann
404987e9b3 mtl/ofi: Fix mismatching types. 2016-02-19 16:57:26 -08:00
yohann
3ad59435ce mtl/ofi: Prevent possible memory leak. 2016-02-19 16:57:02 -08:00
Ralph Castain
8c92a179c0 Minor memory leak 2016-02-19 15:05:39 -08:00
rhc54
1f7e2d7d41 Merge pull request #1388 from rhc54/topic/iof
Cleanup the output-filename options so they work as expected.
2016-02-19 13:56:05 -08:00
Ralph Castain
0c72ba89b9 Cleanup the output-filename options so they work as expected. Have the remote nodes output locally to the files instead of sending it all back to the HNP.
Fix Solaris issues by renaming struct field
2016-02-19 12:41:46 -08:00
Edgar Gabriel
b33db517c1 Merge pull request #1387 from edgargabriel/dynamic_gen2-overlap
Updates to the dynamic_gen2 component
2016-02-19 12:00:29 -06:00