a buffer defined by (buf, count, dt)
will have data starting at buf+offset and ending len bytes later with
len = opal_datatype_span(&dt.super, count, &offset);
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
origin_datatype and target_datatype might be different and hence have different extent,
so use either origin_extent or target_extent when appropriate.
Refs open-mpi/ompi#3569
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
The osc_rdma_get_remote_segment() has the 3rd and 4th args as
* target_disp
* length
which it uses to determine if the rdma falls within the bounds of
the window or not (actually it only checks the upper bound, but I'm
okay with that).
Anyway the caller previously was passing in the length argument as
target_datatype->super.size * target_count
which which doesn't really represent the number of bytes after target_disp
for which data exists. In particular I could create a datatype as
{ disp -4, len 4 } and use target_disp 4
and that would be bytes 0-3 of the window where the original code
would think it was bytes 4-7 and could abort at the range check.
Ive changed it to use the opal_datatype_span() function.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
See bug report
https://github.com/open-mpi/ompi/issues/3548
If a 1sided test is launched -host hostA:2,hostB:1 some of the ranks
call allocate_state_single() and others call allocate_state_shared().
These functions were producing different values for module->state_size
but that's used when they lookup peer info from each other in
ompi_osc_rdma_peer_setup() so they need to all have matching
module->state_offset values.
This change adds a few unused bytes in the memory allocate_state_single()
creates so it matches.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
The expected sequence of events for processing info during object creation
is that if there's an incoming info arg, it is opal_info_dup()ed into the obj
at obj->s_info first. Then interested components register callbacks for
keys they want to know about using opal_infosubscribe_infosubscribe().
Inside info_subscribe_subscribe() the specified callback() is called with
whatever matching k/v is in the object's info, or with the default. The
return string from the callback goes into the new k/v stored in info, and
the input k/v is saved as __IN_<key>/<val>. It's saved the same way
whether the input came from info or whether it was a default. A null return
from the callback indicates an ignored key/val, and no k/v is stored for
it, but an __IN_<key>/<val> is still kept so we still have access to the
original.
At MPI_*_set_info() time, opal_infosubscribe_change_info() is used. That
function calls the registered callbacks for each item in the provided info.
If the callback returns non-null, the info is updated with that k/v, or if
the callback returns null, that key is deleted from info. An __IN_<key>/<val>
is saved either way, and overwrites any previously saved value.
When MPI_*_get_info() is called, opal_info_dup_mpistandard() is used, which
allows relatively easy changes in interpretation of the standard, by looking
at both the <key>/<val> and __IN_<key>/<val> in info. Right now it does
1. includes system extras, eg k/v defaults not expliclty set by the user
2. omits ignored keys
3. shows input values, not callback modifications, eg not the internal values
Currently the callbacks are doing things like
return some_condition ? "true" : "false"
that is, returning static strings that are not to be freed. If the return
strings start becoming more dynamic in the future I don't see how unallocated
strings could support that, so I'd propose a change for the future that
the callback()s registered with info_subscribe_subscribe() do a strdup on
their return, and we change the callers of callback() to free the strings
it returns (there are only two callers).
Rough outline of the smaller changes spread over the less central files:
comm.c
initialize comm->super.s_info to NULL
copy into comm->super.s_info in comm creation calls that provide info
OBJ_RELEASE comm->super.s_info at free time
comm_init.c
initialize comm->super.s_info to NULL
file.c
copy into file->super.s_info if file creation provides info
OBJ_RELEASE file->super.s_info at free time
win.c
copy into win->super.s_info if win creation provides info
OBJ_RELEASE win->super.s_info at free time
comm_get_info.c
file_get_info.c
win_get_info.c
change_info() if there's no info attached (shouldn't happen if callbacks
are registered)
copy the info for the user
The other category of change is generally addressing compiler warnings where
ompi_info_t and opal_info_t were being used a little too interchangably. An
ompi_info_t* contains an opal_info_t*, at &(ompi_info->super)
Also this commit updates the copyrights.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
ompi_communicator_t, ompi_win_t, ompi_file_t all have a super class of type opal_infosubscriber_t instead of a base/super type of opal_object_t (in previous code comm used c_base, but file used super). It may be a bit bold to say that being a subscriber of MPI_Info is the foundational piece that ties these three things together, but if you object, then I would prefer to turn infosubscriber into a more general name that encompasses other common features rather than create a different super class. The key here is that we want to be able to pass comm, win and file objects as if they were opal_infosubscriber_t, so that one routine can heandle all 3 types of objects being passed to it.
MPI_INFO_NULL is still an ompi_predefined_info_t type since an MPI_Info is part of ompi but the internal details of the underlying information concept is part of opal.
An ompi_info_t type still exists for exposure to the user, but it is simply a wrapper for the opal object.
Routines such as ompi_info_dup, etc have all been moved to opal_info_dup and related to the opal directory.
Fortran to C translation tables are only used for MPI_Info that is exposed to the application and are therefore part of the ompi_info_t and not the opal_info_t
The data structure changes are primarily in the following files:
communicator/communicator.h
ompi/info/info.h
ompi/win/win.h
ompi/file/file.h
The following new files were created:
opal/util/info.h
opal/util/info.c
opal/util/info_subscriber.h
opal/util/info_subscriber.c
This infosubscriber concept is that communicators, files and windows can have subscribers that subscribe to any changes in the info associated with the comm/file/window. When xxx_set_info is called, the new info is presented to each subscriber who can modify the info in any way they want. The new value is presented to the next subscriber and so on until all subscribers have had a chance to modify the value. Therefore, the order of subscribers can make a difference but we hope that there is generally only one subscriber that cares or modifies any given key/value pair. The final info is then stored and returned by a call to xxx_get_info.
The new model can be seen in the following files:
ompi/mpi/c/comm_get_info.c
ompi/mpi/c/comm_set_info.c
ompi/mpi/c/file_get_info.c
ompi/mpi/c/file_set_info.c
ompi/mpi/c/win_get_info.c
ompi/mpi/c/win_set_info.c
The current subscribers where changed as follows:
mca/io/ompio/io_ompio_file_open.c
mca/io/ompio/io_ompio_module.c
mca/osc/rmda/osc_rdma_component.c (This one actually subscribes to "no_locks")
mca/osc/sm/osc_sm_component.c (This one actually subscribes to "blocking_fence" and "alloc_shared_contig")
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Conflicts:
AUTHORS
ompi/communicator/comm.c
ompi/debuggers/ompi_mpihandles_dll.c
ompi/file/file.c
ompi/file/file.h
ompi/info/info.c
ompi/mca/io/ompio/io_ompio.h
ompi/mca/io/ompio/io_ompio_file_open.c
ompi/mca/io/ompio/io_ompio_file_set_view.c
ompi/mca/osc/pt2pt/osc_pt2pt.h
ompi/mca/sharedfp/addproc/sharedfp_addproc.h
ompi/mca/sharedfp/addproc/sharedfp_addproc_file_open.c
ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
ompi/mpi/c/lookup_name.c
ompi/mpi/c/publish_name.c
ompi/mpi/c/unpublish_name.c
opal/mca/mpool/base/mpool_base_alloc.c
opal/util/Makefile.am
since Open MPI now requires a C99, and ptrdiff_t type is part of C99,
there is no more need for the abstract OPAL_PTRDIFF_TYPE type.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
One should use the correct module object when calling
c_coll.coll_allgather. Otherwise there will be a segfault in the
case, for example, when hcoll is used. In that case
c_coll.coll_allgather = mca_coll_hcoll_allgather while
c_coll.coll_gather_module = tuned.
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
As we changed the ABI (forcing a major release), we can limit
the size of the predefined communicators by moving the collective
structure outside the communicator. This might have a minimal,
but unnoticeable, impact on performance. This approach has been
discussed during the January 2017 devel meeting.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Under heavy load the locking code could fail if the underlying btl
module started to return OPAL_ERR_OUT_OF_RESOURCE on atomic
operations. This commit updates the code to gracefully handle btl
errors.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Instead of ompi_datatype_get_extent(), use ompi_datatype_get_true_extent()
to get the local and remote lower bound. For derived types like
subarray, true_lb is the correct offset for RDMA operations.
This commit fixes a typo in compare-and-swap when retrieving the
memory region associated with a displacement. It was erroneously 8
bytes instead of the datatype size. This can cause an incorrect RMA
range error when the compare-and-swap is less than 4 bytes from the
end of the region.
Fixedopen-mpi/ompi#2080
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for using network AMOs for MPI_Accumulate,
MPI_Fetch_and_op, and MPI_Compare_and_swap. This support is only
enabled if the ompi_single_intrinsic info key is specified or the
acc_single_interinsic MCA variable is set. This configuration
indicates to this implementation that no long accumulates will be
performed since these do not currently mix with the AMO
implementation.
This commit also cleans up the code somwhat. This includes removing
unnecessary struct keywords where the type is also typedef'd.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes an ordering bug in the code that keeps track of all
attached memory windows. The code is intended to keep the memory
regions sorted but was often inserting at the wrong index. Thanks to
Christoph Niethammer for reporting the issue. The reproducer will be
added to nightly MTT testing.
Fixesopen-mpi/ompi#2012
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug in the RDMA compare-and-swap implementation
that caused the origin value to always be written even if the compare
should have failed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Need to increment the total size after checking the local offset not
before. This typo causes large allocations with MPI_Win_allocate() to
fail.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).
* Fix the condition to release the request.
This commit fixes a bad synchronization detection bug that occurs when
mixing MPI_Win_fence() and MPI_Win_lock(). If no communication has
occurred in the fence epoch it is safe to just clear the all_sync
object (it was set up by fence).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug that occurs when ranks are either not mapped
evenly or by something other than core.
Fixesopen-mpi/ompi#1599
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1324726 (#1 of 1): Free of address-of expression (BAD_FREE):
Indeed, if a lock conflicts with the lock_all we will end up trying to
free an invalid pointer.
Fix CID 1328826 (#1 of 1): Dereference after null check (FORWARD_NULL):
This was intentional but it would be a good idea to check for
module->comm being non_NULL to be safe. Also cleaned out some checks
for NULL before free().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds code to handle large unaligned gets. There are two
possible code paths for these transactions:
1) The remote region and local region have the same alignment. In
this case the get will be broken down into at most three get
transactions: 1 transaction to get the unaligned start of the region
(buffered), 1 transaction to get the aligned portion of the region,
and 1 transaction to get the end of the region.
2) The remote and local regions do not have the same alignment. This
should be an uncommon case and is not optimized. In this case a
buffer is allocated and registered locally to hold the aligned data
from the remote region. There may be cases where this fails (low
memory, can't register memory). Those conditions are unlikely and
will be handled later.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If atomics are not globally visible (cpu and nic atomics do not mix)
then a btl endpoint must be used to access local ranks. To avoid
issues that are caused by having the same region registered with
multiple handles osc/rdma was updated to always use the handle for
rank 0. There was a bug in the update that caused osc/rdma to continue
using the local endpoint for accessing the state even though the
pointer/handle are not valid for that endpoint. This commit fixes the
bug.
Fixesopen-mpi/ompi#1241.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Optimizing put aggregation in the presence of threads will require a
redesign of the code. For now just ensure that put aggregation is
turned off when MPI_THREAD_MULTIPLE is enabled.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.
Closes#1204 (replaces it)
Fixes#1064
A previous commit updated the one-sided code to register the state
region only once. This created an issue when using the scratch lock
with fetching atomics. In this case on any rank that isn't local rank
0 the module->state_handle is NULL. This commit fixes the issue by
removing the scratch lock and using a fragment pointer instead.
Fixesopen-mpi/ompi#1290
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit changes the way ompi_proc_t's are retained/released by
ompi_group_t's. Before this change ompi_proc_t's were retained once
for the group and then once for each retain of a group. This method
adds unnecessary overhead (need to traverse the group list each time
the group is retained) and causes problems when using an async
add_procs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There were two bugs in osc/rdma when using threads:
- Deadlock is ompi_osc_rdma_start_atomic. This occurs because
ompi_osc_rdma_frag_alloc is called with the module lock. To fix the
issue the module lock is now recursive. In the future I will add a
new lock to protect just the current rdma fragment.
- Do not drop the lock in ompi_osc_rdma_frag_alloc when calling
ompi_osc_rdma_frag_complete. Not only is it not needed but dropping
the lock at this point can cause a competing thread to mess up the
state.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug that can occur on Cray Gemini networks. If
multiple registrations are used for the local state then we looks the
atomicity guarantees. To avoid issues like this use only a single
registration handle for all local state on a node.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes the following:
- CIDs 1328491, 1328492: Dead code caused by typos in a prior
commit.
- Fix the calculation of dynamic memory regions. This was causes
incorrect RMA range errors when accessing the last partial page of
an attachment.
- Fix a SEGV when using dynamic memory windows with local state (all
processes on the same node).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes several bugs in the osc/rdma component:
- Complete aggregated requests immediately. Completion of RMA
requests indicates local completion anyway. This fixes a hang in
the c_reqops test.
- Correctly mark Rget_accumulate requests.
- Set the local base flag correctly on the local peer.
- Clear or set the no locks flag on the window if the value is
changed by MPI_Win_set_info.
- Actually update the target when using MPI_OP_REPLACE.
Fixesopen-mpi/ompi#1010
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>