Commit graph

6091 commits

Author SHA1 Message Date
Nathan Hjelm
4079eec974 pml/ob1: be more selective when using rdma capable btls
This commit updates the btl selection logic for the RDMA and RDMA
pipeline protocols to use a btl iff: 1) the btl is also used for eager
messages (high exclusivity), or 2) no other RDMA btl is available on
an endpoint and the pml_ob1_use_all_rdma MCA variable is true. This
fixes a performance regression with shared memory when an RDMA capable
network is available.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-09 20:54:42 -06:00
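The selection rule described in the commit above can be read as a simple predicate. The following is only a sketch of that rule as stated in the message; the function name and its boolean parameters are hypothetical and do not come from the Open MPI sources.

```c
#include <stdbool.h>

/* Hypothetical predicate: should a BTL be used for the RDMA and
 * RDMA pipeline protocols?  All names here are illustrative. */
bool sketch_use_btl_for_rdma(bool used_for_eager,
                             bool other_rdma_btl_available,
                             bool use_all_rdma)
{
    /* Condition 1: the BTL is also used for eager messages. */
    if (used_for_eager) {
        return true;
    }
    /* Condition 2: no other RDMA BTL is available on the endpoint and
     * the pml_ob1_use_all_rdma MCA variable is true. */
    return !other_rdma_btl_available && use_all_rdma;
}
```

Under this reading, a BTL that is not used for eager messages is skipped whenever another RDMA BTL is available on the endpoint, regardless of the MCA variable.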
Nathan Hjelm
2788083b98 Merge pull request #1936 from hjelmn/osc_pt2pt_fix
osc/pt2pt: do not set rdma_frag after start
2016-08-08 14:17:40 -06:00
Nathan Hjelm
e4d7ea75a9 Merge pull request #1935 from hjelmn/persistent_fix
pml/ob1: reset req_bytes_packed on start
2016-08-08 14:17:13 -06:00
Todd Kordenbrock
3be6052523 Merge pull request #1896 from PDeveze/Patchs-on-coll-portals4
Patchs on coll portals4
2016-08-08 14:57:02 -05:00
Edgar Gabriel
fb9fa4fbc4 Merge pull request #1938 from edgargabriel/pr/barrier-on-close
io/ompio: Add barrier to file_close and to file_set_size
2016-08-08 09:22:08 -05:00
Edgar Gabriel
4709f4229b Merge pull request #1929 from edgargabriel/pr/ompio-code-reorg
io/ompio: next step in code-reorganization
2016-08-08 09:20:54 -05:00
Thananon Patinyasakdikul
23b27c510c romio: make romio use internal opal_random instead of rand(3).
This fixes issue #1877
2016-08-05 09:04:52 -07:00
Howard Pritchard
ff669e7b15 code cleanup: clang is now a happier panda
Clang 5.1 on my mac was a sad panda compiling a couple
of files,  complaining about uninitialized stack variables.

This commit makes clang a happier panda (or at least not so sad).

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-08-04 19:34:44 -06:00
Edgar Gabriel
9c3180160c io/ompio: Add barrier to file_close and to file_set_size
This fixes a bug reported on the mailing list for ompio.
https://www.open-mpi.org/community/lists/users/2016/05/29333.php
2016-08-04 11:20:31 -05:00
Gilles Gouaillardet
60e91e890a coll/base: give a boost to ompi_coll_base_sendrecv_nonzero_actual()
Based on current implementation it is faster to use a blocking
send than the non-blocking version. Switch the exchange function
used in the barrier to use the blocking version combined with
the non-blocking version of the receive.

This is similar to open-mpi/ompi@223d75595d
2016-08-04 13:31:07 +09:00
Nathan Hjelm
11c853d05e osc/pt2pt: do not set rdma_frag after start
It is possible for the start call to complete the requests. For this
reason the module rdma_frag field should be filled in before start is
called. If the request completes the completion callback will reset
the rdma_frag field to NULL. Fixes a bug discovered by @tkordenbrock.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-03 15:20:36 -06:00
Nathan Hjelm
889dd32806 pml/ob1: reset req_bytes_packed on start
On start we were not correctly resetting all request fields. This was
leading to a double-completion on persistent receives. This commit
updates the base start code to reset the receive req_bytes_packed and
the send request convertor.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-03 11:29:30 -06:00
Edgar Gabriel
aa7e852e44 common/ompio: files are only compiled in case MPI I/O is requested
fixes: open-mpi/ompi#1932
2016-08-02 15:01:38 -05:00
Edgar Gabriel
19fe5cac50 io/ompio: next step in code-reorganization
- move the sort_iovec operations to fcoll/base
 - move set_view_internal to common/ompio
 - move set_file_default to common/ompio
 - remove io_ompio_sort, not used anymore.
2016-08-02 09:18:29 -05:00
Gilles Gouaillardet
917d96ba50 coll/libnbc: cleanup handling of the second temporary buffer in ireduce 2016-08-02 16:32:15 +09:00
Gilles Gouaillardet
ed9139ca13 coll/libnbc: correctly handle datatype alignment when allocating two buffers at once 2016-08-02 15:44:12 +09:00
Edgar Gabriel
c0bd8728fd io/ompio: move aggregator selection code to a separate file
- move all functions related to aggregator selection to a single file
- perform code cleanup fixing many Coverity complaints along the way.
2016-08-01 14:04:27 -05:00
Edgar Gabriel
160d9a78c1 Merge pull request #1886 from edgargabriel/pr/ompio-reorg
io/ompio: move io/ompio functionality to common/ompio
2016-07-29 12:24:21 -05:00
Joshua Ladd
4a03a657c6 Merge pull request #1913 from vspetrov/hcoll_derived_datatypes
coll/hcoll mpi datatypes support
2016-07-29 10:08:23 -04:00
Nathan Hjelm
1da558407c Merge pull request #1911 from hjelmn/threads
opal/thread: clean up and add additional OPAL_THREAD macros
2016-07-29 06:44:11 -06:00
Valentin Petrov
3582bba6b7 coll/hcoll mpi datatypes support 2016-07-29 10:06:39 +03:00
Howard Pritchard
5ff6b81eee Merge pull request #1871 from hppritcha/topic/ofi_mtl_params
mtl/ofi: add some more mca parameters
2016-07-28 18:21:23 -06:00
Nathan Hjelm
aac611237b opal/thread: clean up and add additional OPAL_THREAD macros
This commit expands the OPAL_THREAD macros to include 32- and 64-bit
atomic swap. Additionally, macro declarations have been updated to
include both OPAL_THREAD_* and OPAL_ATOMIC_*. Before this commit the
former was used with add and the latter with cmpset.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-07-28 09:23:14 -06:00
Howard Pritchard
22c8743557 mtl/ofi: add some more mca parameters
allow for toggling of both control/data progress models.
allow for using FI_AV_TABLE or FI_AV_MAP for av type.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-07-28 02:35:09 -06:00
Gilles Gouaillardet
a0a999e63d coll/base: fix ompi_coll_base_allgatherv_intra_basic_default() with MPI_IN_PLACE 2016-07-28 13:57:18 +09:00
Gilles Gouaillardet
b8a1ffb87e coll/base: fix ompi_coll_base_allgatherv_intra_basic_default()
Fixes open-mpi/ompi#1907
2016-07-28 13:50:04 +09:00
Pascal Deveze
10763f5abc mtl/portals4: Take into account the limitation of portals4 (max_msg_size) and split messages if necessary 2016-07-26 08:44:07 +02:00
Pascal Deveze
724801b018 mtl-portals4: Introduce a "short_limit" for the short message size. "eager_limit" will only be used for
the limit of the eager part of the messages sent with the rndv protocol
2016-07-26 08:43:24 +02:00
Pascal Deveze
9e58b4842f mtl-portals4: Correct how the request_status._ucount is set 2016-07-26 08:42:48 +02:00
Pascal Deveze
3ca194f10a mtl-portals4: Store ptl_process_id (from PtlGetPhysId) and display it. 2016-07-26 08:42:08 +02:00
Pascal Deveze
bd3b1cf7be mtl-portals4: Check that flowctl_idx is equal to REQ_FLOWCTL_TABLE_ID
and use OPAL_ATOMIC_CMPSET_32 to test and set flowctl_active flag to true
2016-07-26 08:41:31 +02:00
Ralph Castain
9ab20cafe3 Pass the nodeid for each proc in the job. Fix a mistaken error output message 2016-07-25 15:41:15 -07:00
Edgar Gabriel
b0fa1fd2a1 move the internal file_open/close functions to common/ompio 2016-07-21 13:08:32 -05:00
Edgar Gabriel
ccf76b7791 moving the internal read/write functions to common/ompio
and update all fs/fcoll/sharedfp components to use these functions.
2016-07-21 13:08:32 -05:00
Edgar Gabriel
688710d408 make common/ompio compile 2016-07-21 13:08:32 -05:00
Edgar Gabriel
39ae93b87b modify the fcoll components to use the common/ompio print queues 2016-07-21 13:08:32 -05:00
Edgar Gabriel
fe17410943 next step in making the print_queue functionality move to common/ompio 2016-07-21 13:08:32 -05:00
Edgar Gabriel
af67c8f239 first cut on moving some ompio functionality to common/ompio 2016-07-21 13:08:32 -05:00
Edgar Gabriel
a899c0fb38 fcoll/static: fix Coverity warnings
fix Coverity warnings CID 72144, CID 710677, CID 1364164
2016-07-21 13:08:15 -05:00
Pascal Deveze
a7e3de6c4f coll-portals4: No more messages passed to Portals4 bigger than the limit given by PtlNIInit 2016-07-21 15:58:20 +02:00
Pascal Deveze
175e6aa385 coll-portals4: Before calling PtlCTWait, call PtlTriggeredInc twice to be sure all pending PtlTriggeredPut operations are triggered 2016-07-21 15:58:20 +02:00
Pascal Deveze
df59d6cdd4 coll-portals4: Correct and simplify how the data are cut in segment_nb segments (bcast) 2016-07-21 15:58:09 +02:00
Pascal Deveze
274f8d608c coll-portals4: Change output format and change variable names (minor changes). 2016-07-21 11:06:45 +02:00
Todd Kordenbrock
37ad6aa711 Merge pull request #1853 from PDeveze/Patchs-on-osc-portals4
Patchs on osc portals4
2016-07-20 09:22:19 -05:00
Todd Kordenbrock
210534adb3 Merge pull request #1850 from PDeveze/Patchs-on-mtl-portals4
Patchs on mtl portals4
2016-07-20 08:21:03 -05:00
Pascal Deveze
9cac32ba6a mtl/portals4: Modifications concerning the short message management 2016-07-19 11:21:50 +02:00
Pascal Deveze
49e9936914 mtl/portals4: Some little patches 2016-07-19 11:18:55 +02:00
Pascal Deveze
f19a2b961c osc/portals4: Correct an error in an if statement 2016-07-18 13:16:12 +02:00
Pascal Deveze
81823d7a63 osc/portals4: Store the no_locks parameter in osc_portals4_component.no_locks 2016-07-18 11:51:52 +02:00
Pascal Deveze
76b38651da osc/portals4: For the contiguous datatype, take into account the lower bound before calling portals4 2016-07-18 11:20:50 +02:00
Pascal Deveze
7aaf16e7fe osc/portals4: Put/Get splitting because Portals4 may restrict sizes 2016-07-18 10:49:28 +02:00
Pascal Deveze
025201b459 osc/portals4: set the initial value of req_status.MPI_ERROR to MPI_SUCCESS 2016-07-18 09:52:56 +02:00
Pascal Deveze
aa0d687a0a osc/portals4: Display an output message if ompi_osc_portals4_get_dt() or ompi_osc_portals4_get_op() returns an error 2016-07-18 09:52:56 +02:00
Pascal Deveze
c4181909a4 osc/portals4: Be sure that the ME are operational (wait for the PTL_EVENT_LINK) 2016-07-18 09:52:56 +02:00
Pascal Deveze
e99e7d08ed osc/portals4: For the ME, use the uid from PtlGetUid instead of PTL_UID_ANY 2016-07-18 09:52:56 +02:00
Pascal Deveze
56b36eeb7e osc/portals4: Format of "target_disp" is OPAL_PTRDIFF_TYPE and %lu is the appropriate format to display it. 2016-07-18 09:52:55 +02:00
Pascal Deveze
a76566c754 osc/portals4: To allocate a PT, use REQ_OSC_TABLE_ID and test that the right ID is allocated 2016-07-18 09:52:55 +02:00
Edgar Gabriel
195ec89732 fcoll/base: mv coll_array functions to fcoll base
the coll_array functions are truly only used by the fcoll modules, so move
them to fcoll/base. There is currently one exception to that rule (number of aggregators
logic), but that function will also be moved to fcoll/base in the long term.
2016-07-14 08:41:14 -05:00
Edgar Gabriel
1f1504ebbb remove some unused code 2016-07-14 08:41:14 -05:00
Joshua Ladd
06930a0423 Merge pull request #1840 from artpol84/yalla_perf_fix
pml/yalla: fix yalla performance regression
2016-07-14 10:55:30 +03:00
Pascal Deveze
b87ed1ad4a mtl/portals4: Display actual limits given by the portals4 PtlNIInit function 2016-07-12 15:07:31 +02:00
Pascal Deveze
f666b0d9aa mtl/portals4: Allocate a PT with the PTL_PT_FLOWCTRL flag only if OMPI_MTL_PORTALS4_FLOW_CONTROL is set 2016-07-12 15:07:31 +02:00
Pascal Deveze
bed572cd6c mtl/portals4: Unlink the ME first, then free the CT and at the end free the PT 2016-07-12 15:07:30 +02:00
Ralph Castain
0e433eaa78 Silence warning 2016-07-11 19:43:02 -07:00
Nathan Hjelm
b47208e909 osc/rdma: fix bug in CAS
This commit fixes a bug in the RDMA compare-and-swap implementation
that caused the origin value to always be written even if the compare
should have failed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-07-11 09:54:23 -06:00
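For reference, the intended compare-and-swap semantics (write the origin value only when the comparison succeeds) can be sketched with C11 atomics. This is a local, single-process illustration under that assumption, not the osc/rdma code path; the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: store `origin` into *target only if *target
 * currently equals `compare`.  The previous value of *target is
 * returned in either case. */
int64_t sketch_compare_and_swap(_Atomic int64_t *target,
                                int64_t compare, int64_t origin)
{
    int64_t expected = compare;
    /* On failure the target is left untouched and its current value
     * is copied into `expected`; on success `expected` already holds
     * the previous value. */
    atomic_compare_exchange_strong(target, &expected, origin);
    return expected;
}
```

The bug described above corresponds to the target being overwritten even when the comparison fails, which this sketch rules out by construction.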
Edgar Gabriel
c8b1c6cae1 Merge pull request #1856 from edgargabriel/pr/zero-size-iread-iwrite
io/ompio: fix the request in case of a zero size write/read operation
2016-07-11 08:19:02 -05:00
Gilles Gouaillardet
14624506df coll/libnbc: do not exchange data between roots in ompi_coll_libnbc_ireduce_scatter_inter()
this is now useless since the scatter is done via the local communicator
2016-07-11 17:18:30 +09:00
Edgar Gabriel
3dd81e9e09 io/ompio: fix the request in case of a zero size write/read operation 2016-07-08 14:11:22 -05:00
Gilles Gouaillardet
a55d57406b coll/base: fix non zero lower bound datatype handling in mca_coll_base_alltoallv_intra_basic_inplace() 2016-07-08 16:55:26 +09:00
Gilles Gouaillardet
7b8094aac1 coll/base: silence misc warning
as reported by Coverity with CIDs 1363349-1363362

Offset temporary buffer when a non zero lower bound datatype is used.

Thanks Hristo Iliev for the report

(cherry picked from commit 0e393195d9)
2016-07-08 13:06:26 +09:00
Gilles Gouaillardet
678d08647b coll/libnbc: various fixes
- correctly handle non commutative operators
 - correctly handle non zero lower bound ddt
 - correctly handle ddt with size > extent
 - revamp NBC_Sched_op so it takes two buffers and matches ompi_op_reduce semantic
 - various fix for inter communicators

Thanks Yuki Matsumoto for the report
2016-07-07 15:55:49 +09:00
Gilles Gouaillardet
3e559a14a9 coll/inter: fix non standard ddt handling
- correctly handle non zero lower bound ddt
 - correctly handle ddt with size > extent

Thanks Yuki Matsumoto for the report
2016-07-07 15:49:59 +09:00
Gilles Gouaillardet
488d037d51 coll/basic: fix non standard ddt handling
- correctly handle non zero lower bound ddt
 - correctly handle ddt with size > extent

Thanks Yuki Matsumoto for the report
2016-07-07 15:49:53 +09:00
Gilles Gouaillardet
c06fb04a9a coll/base: fix non zero lower bound ddt handling in ompi_coll_base_reduce_intra_basic_linear()
Thanks Yuki Matsumoto for the report
2016-07-07 15:49:48 +09:00
Ralph Castain
ee56d9dc1a Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field 2016-07-05 14:59:50 -07:00
George Bosilca
eac5b3c668 Various cleanups in the monitoring PML. 2016-07-05 18:31:25 +02:00
Artem Polyakov
a4ff9bef6d fix #2 2016-07-05 14:38:35 +03:00
Artem Polyakov
bc973cad30 fix 2016-07-05 14:33:31 +03:00
Artem Polyakov
7d96f12fec pml/yalla: fix yalla performance regression
It was introduced in PR https://github.com/open-mpi/ompi/pull/1228
 in particular in commit 041a6a9f53.

 Original solution was using "flexible array member" called "mxm_base"
 to "fall-through" to the "mxm" send/recv member that located in the
 outer structure.

 After changing the number of elements in "mxm_base" from 0 to 1 we are
 actually allocating 2 mxm_req_base_t elements, which leads to increased
 overall size and harms cache performance.

 It also breaks the "mca_pml_yalla_check_request_state" function.
2016-07-05 10:52:48 +03:00
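The size effect described in the commit above can be illustrated with two small structures. The types below are hypothetical stand-ins (they are not the yalla or MXM definitions), and the sketch uses the C99 `[]` spelling of a flexible array member rather than a zero-length array.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-request base structure. */
typedef struct { long id; long state; } sketch_req_base_t;

/* Header ending in a flexible array member: base[] adds no storage,
 * so accesses through it "fall through" to whatever follows. */
typedef struct { int flags; sketch_req_base_t base[]; } sketch_hdr_flex_t;

/* Same header with a one-element array: a full element is embedded. */
typedef struct { int flags; sketch_req_base_t base[1]; } sketch_hdr_one_t;

/* Extra bytes paid by every object once the array has one element. */
size_t sketch_extra_bytes(void)
{
    return sizeof(sketch_hdr_one_t) - sizeof(sketch_hdr_flex_t);
}
```

When such a header is followed by a separate request member in an outer structure, the one-element variant ends up holding two request-sized objects, which matches the growth the commit message reports.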
Joshua Hursey
0a09f8bc51 coll/hcoll: Protect module destruct when not fully initialized
* If hcoll is given a negative priority, but not enabled=0 then
   the module is constructed, but then destructed before calling
   its query(). So the previous pointers are not initialized.
   If we try to OBJ_RELEASE them in a debug build an assert will fire.
   This commit adds some protection against that and initializes
   the _module pointers to NULL.
2016-07-01 13:41:27 -05:00
Joshua Hursey
59f304b9e9 coll/base: neg. priority cleanup, verbose output improvements
* Print a verbose message if the component was disqualified because of
   a negative priority.
 * If a disqualified component provided a module, release it.
 * Display list of selected components in priority order
   - During the process of volunteering collective functions for a
     communicator, print the component name and priority. This will
     cause the verbose messages to be displayed in reverse priority
     order (lowest priority first, up to highest). This is helpful
     when determining which collective components are active in which
     order for a given communicator.
     To see the messages you need the following MCA parameter set to 9
     or higher: `-mca coll_base_verbose 9`
 * Adjust verbose for commonly needed verbose output from 10 to 9 to
   make it easier to access this information.
2016-07-01 13:41:27 -05:00
Nathan Hjelm
5f390b5f5a bml/r2: be more restrictive on rdma endpoints
This commit makes bml/r2 more restrictive on which endpoints end up in the rdma
endpoint list. Before this commit an endpoint was added if it supported either
put or get. This was done to ensure that endpoints are available for RMA.
Though it is possible to support put or get endpoints we only currently
support endpoints that have put, get, and amos. bml/r2 now reflects this
support.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-29 18:54:58 -06:00
Nathan Hjelm
2409024c17 osc/rdma: fix typo
Need to increment the total size after checking the local offset not
before. This typo causes large allocations with MPI_Win_allocate() to
fail.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-21 09:50:29 -06:00
George Bosilca
9c4f56be4b Fix the coll_base_sendrecv function. 2016-06-18 18:23:51 +02:00
Ralph Castain
5d330d5220 Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.
Add PMIx 2.0

Remove PMIx 1.1.4

Cleanup copying of component

Add missing file

Touchup a typo in the Makefile.am

Update the pmix ext114 component

Minor cleanups and resync to master

Update to latest PMIx 2.x

Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00
Edgar Gabriel
1ddfd6cdca io/ompio: fix the preallocate function
handle preallocating sizes less than the current file size correctly.
2016-06-14 10:50:32 -05:00
Gilles Gouaillardet
80e362de52 coll/base: fix memory free in ompi_coll_base_allreduce_intra_recursivedoubling err handler
Fix CID 1362630

Fixes open-mpi/ompi@0e393195d9
2016-06-09 13:12:25 +09:00
Gilles Gouaillardet
ead7efef3f coll/basic: silence CID 1362614 in mca_coll_basic_allreduce_inter() 2016-06-09 09:40:19 +09:00
Gilles Gouaillardet
ad2e1a5ae9 coll/base: silence CID 1362613 in ompi_coll_base_alltoall_intra_basic_linear() 2016-06-09 09:40:05 +09:00
Gilles Gouaillardet
80b267af1c coll/base: silence CID 1362601 in ompi_coll_base_sendrecv_zero() 2016-06-09 09:37:31 +09:00
Gilles Gouaillardet
0e393195d9 coll/base: fix [all]reduce with non zero lower bound datatypes
Offset temporary buffer when a non zero lower bound datatype is used.

Thanks Hristo Iliev for the report
2016-06-08 16:48:00 +09:00
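The commit above only states that the temporary buffer is offset. One plausible reading, assuming the usual MPI datatype layout (element i occupies true_extent bytes starting at i * extent + true_lb from the buffer pointer), is sketched below. The function names and the exact computation are assumptions made for illustration, not code taken from the commit.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed to hold `count` elements, and the
 * gap (true lower bound) by which the allocation must be shifted. */
ptrdiff_t sketch_span(size_t count, ptrdiff_t true_lb,
                      ptrdiff_t true_extent, ptrdiff_t extent,
                      ptrdiff_t *gap)
{
    if (count == 0) {
        *gap = 0;
        return 0;
    }
    *gap = true_lb;
    return true_extent + (ptrdiff_t)(count - 1) * extent;
}

/* Offset, relative to the start of the allocation, of the first data
 * byte of element i when the pointer handed to the collective is
 * (allocation - gap). */
ptrdiff_t sketch_element_offset(size_t i, ptrdiff_t true_lb,
                                ptrdiff_t extent, ptrdiff_t gap)
{
    return -gap + (ptrdiff_t)i * extent + true_lb;
}
```

With three elements, a true lower bound of 8, a true extent of 4 and an extent of 16, the span is 36 bytes and the last element starts at offset 32, so all data stays inside the allocation once the pointer is shifted by the gap.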
Nathan Hjelm
3ddf3ccbf3 Merge pull request #1758 from hjelmn/ob1_fixes
pml/ob1: bug fixes
2016-06-07 11:18:55 -06:00
Todd Kordenbrock
9671d6af47 Merge pull request #1689 from francois-wellenreiter/remove_trig_rdv_portals4
MTL portals4 : remove the triggered rendez-vous protocol
2016-06-06 21:55:01 -05:00
Nathan Hjelm
5d0b4679ea pml/ob1: bug fixes
This commit fixes two bugs in pml/ob1:

 - Do not call MCA_PML_OB1_PROGRESS_PENDING from
   mca_pml_ob1_send_request_start_copy as this may lead to a recursive
   call to mca_pml_ob1_send_request_process_pending.

 - In mca_pml_ob1_send_request_start_rdma return the rdma frag object
   if a btl fragment can not be allocated. This fixes a leak
   identified by @abouteiller and @bosilca.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-06 17:54:55 -06:00
Gilles Gouaillardet
c976559877 coll/basic: fix log basic bcast
The log basic bcast was completely broken. The rank 0 gets the
hibit set to -1, so it always returned an error.
2016-06-06 11:01:51 +09:00
Gilles Gouaillardet
99fedcb7a3 fs/base: silence a memory leak in mca_fs_base_get_fstype()
Fixes CID 1351211
2016-06-06 09:20:14 +09:00
George Bosilca
9376b0340b Fix the basic barrier.
The log basic barrier was completely broken. The rank 0 gets the
hibit set to 0, so it always returned an error.
2016-06-03 23:46:25 -04:00
Edgar Gabriel
d6af5444a6 fix the get_byte_offset code 2016-06-03 11:36:53 -05:00
Nathan Hjelm
e968ddfe64 start bug fixes (#1729)
* mpi/start: fix bugs in cm and ob1 start functions

There were several problems with the implementation of start in Open
MPI:

 - There are no checks whatsoever on the state of the request(s)
   provided to MPI_Start/MPI_Start_all. It is erroneous to provide an
   active request to either of these calls. Since we are already
   looping over the provided requests there is little overhead in
   verifying that the request can be started.

 - Both ob1 and cm were always throwing away the request on the
   initial call to start and start_all with a particular
   request. Subsequent calls would see that the request was
   pml_complete and reuse it. This introduced a leak as the initial
   request was never freed. Since the only pml request that can
   be mpi complete but not pml complete is a buffered send the
   code to reallocate the request has been moved. To detect that
   a request is indeed mpi complete but not pml complete isend_init
   in both cm and ob1 now marks the new request as pml complete.

 - If a new request was needed the callbacks on the original request
   were not copied over to the new request. This can cause osc/pt2pt
   to hang as the incoming message callback is never called.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

* osc/pt2pt: add request for gc after starting a new request

Starting a new receive may cause a recursive call into the pt2pt
frag receive function. If this happens and the prior request is
on the garbage collection list it could cause problems. This commit
moves the gc insert until after the new request has been posted.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-02 20:22:40 -04:00
Matias A Cabral
29ab28f4f6 Adding owner.txt file for PSM2 MTL. 2016-06-02 16:26:16 -07:00