When we exceed the threshold number of contexts created, print appropriate help
text
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit 9cabcfdbba49f8b97f830f90e3d88be2f7685de4)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Provide the av_attr.count hint (number of addresses that will be
inserted into the address vector through the life of the process)
at initialization of the address vector. It's ok to be a bit
wrong, but some endpoints (RxR) can benefit by not going through
the slow growth realloc churn.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 44be7f139ac8e3130ffeb0afd4f43abdde31dd83)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
With MTLs, there's no "other transport" when the remote side
does not have an active NIC, so we should print a useful error
message when the modex failed (indicating lack of a NIC on
the remote side).
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit fe25097194145234c175ee0487b1fa9b13658d49)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Moving to a model where we have users actively _enable_ SEP feature for use
rather than opening SEP by default if provider supports it. This allows us to
not regress (either functionally or for performance reasons) any apps that were
working correctly on regular endpoints.
Also, providing MCA to specify number of OFI contexts to create and default
this value to 1 (Given btl/ofi also creates one by default, this reduces the
incidence of a scenario where we allocate all available contexts by default and
if btl/ofi asks for one more, then provider breaks as it doesn't support it).
While at it, spruce up README on SEP content.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit 37f9aff2a02e9d51628cc7075853ec4dba93d1e7)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
-> Added new targets in Makefile.am to call a new build script
generate-opt-funcs.pl to generate specialized functions for
each *.pm file.
-> Added new perl module *.pm files for send,isend,irecv,iprobe,improbe
which are loaded by generate-opt-funcs.pl to create new source files
that correspond to the name of the .pm file to be used as part of
MTL OFI.
-> Added mtl_ofi_opt.pm.template and updated README with details on the
specialization features and how to add additional specialization
support.
-> Added new opt_common/mtl_ofi_opt_common.pm containing common
functions for generating the specialized functions used by
all other *.pm modules.
-> Added new mtl_ofi.h which includes the definitions for the
function symbol table for storing the specialized functions along
with the definitions for the initialization functions for the
corresponding function pointers.
-> Based off the OFI provider capabilities the specialized function
pointers are assigned at mtl_ofi_component_init to the corresponding
MTL OFI function.
-> mca_mtl_ofi_module_t has been updated with the symbol table
struct which is assigned at component init.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit bef5f50a42f53f4c6c30610aba236861f652f30e)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
For cases when the number of local processes is greater than the number of
available contexts, the SEP initialization phase would calculate the number of
contexts to provision for each rank to be 0 and would eventually crash.
Fix the issue here by using regular endpoints in the event the number of local
processes is more than available contexts. This fixes issue #6182.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit e5e19dfcf7618185bdc89f1e91506a9bf15355b1)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Commit 109d0569ffd introduced a crash when an error occurred
before ofi_ctxt was allocated, including when no providers
passed the selection logic. Properly check that the pointer
is not NULL in the error cleanup code before dereferencing
the pointer.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 6e15128d960aaa40dd7e905ae3e9e53d9cdaac2a)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
OFI MTL supports OFI Scalable Endpoints feature as means to improve
multi-threaded application throughput and message rate. Currently the feature
is designed to utilize multiple TX/RX contexts exposed by the OFI provider in
conjunction with a multi-communicator MPI application model. For more
information, refer to README under mtl/ofi.
Reviewed-by: Matias Cabral <matias.a.cabral@intel.com>
Reviewed-by: Neil Spruit <neil.r.spruit@intel.com>
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit 109d0569ffdc29f40518d02ad7a4d5bca3adc3d1)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
When an application is not using multiple threads to call into MPI, we can
safely ask for FI_THREAD_DOMAIN setting from the provider as it should
translate to the least amount of locking in provider.
Conversely, for applications using THREAD_MULTIPLE, explicitly ask for
FI_THREAD_SAFE to prevent race conditions.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit 5cbcae79d8619b99e4768578c8d11d3149e5c7c8)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
OFI providers may reserve some of the upper bits of the tag for
internal usage and expose it using mem_tag_format. Check for that
and adjust communicator bits as needed.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit d996f529c0377d794dea261c801e504dfbb33170)
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This fix ensures that all progress functions return the number of
completed events.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 72501f8f9c37b8db8dd08b274abd9774687a60cb)
Change the provider include and exclude list name comparison check to
ignore case. The UDP provider's name is uppercase and was being selected
despite being in the exclude list.
Signed-off-by: Robert Wespetal <wesper@amazon.com>
(cherry picked from commit 9b72e9465da3f2891ac13ed0443db44136506a1a)
When the provider does not support FI_REMOTE_CQ_DATA, the OFI tag does not have
sizeof(int) bits for the rank. Therefore, unexpected behavior will occur when
this limit is crossed.
Check the max allowed number of ranks during add_procs() and return if there is
danger of exceeding this threshold.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit 5cf43de44538b818b014bdad0490439e2d212395)
This commit fixes a segfault in mtl-portals4 finalize(). The segfault
occurs if finalize() is called without any calls to add_procs(). This
commit resolves the segfault by skipping the flow control fini() call if
Portals4 was not initialized.
Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
(cherry picked from commit e7b867c044f8b776b75f3c6d917745c06237743e)
INTERNAL: STL-59403
The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.
This patch adds this missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.
(cherry-picked from commit 3aca4af548a3d781b6b52f89f4d6c7e66d379609)
Change-Id: Ie50445e5edfb0f30916de0836db0edc64ecf7c60
Signed-off-by: Michael Heinz <michael.william.heinz@intel.com>
Reviewed-by: Adam Goldman <adam.goldman@intel.com>
Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>
Update the OPAL_CHECK_OFI configury macro:
- Make it safe to call the macro multiple times:
- The checks only execute the first time it is invoked
- Subsequent invocations, it just emits a friendly "checking..."
message so that configure output is sensible/logical
- With the goal of ultimately removing opal/mca/common/ofi, rename the
output variables from OPAL_CHECK_OFI to be
opal_ofi_{happy|CPPFLAGS|LDFLAGS|LIBS}.
- Update btl/usnic and mtl/ofi for these new conventions.
- Also, don't use AC_REQUIRE to invoke OPAL_CHECK_OFI because that
causes the macro to be invoked at a fairly random time, which makes
configure stdout confusing / hard to grok.
- Remove a little left-over kruft in OPAL_CHECK_OFI, too (which
resulted in an indenting change, making the change to
opal_check_ofi.m4 look larger than it really is).
Thanks Alastair McKinstry for the report and initial fix.
Thanks Rashika Kheria for the reminder.
Updated from master cherry pick: the OFI BTL does not exist on the
v4.0.x branch. Therefore, did not include the OFI BTL changes on
master in this cherry pick.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f5e1a672ccd5db127e85e1e8f6bcfeb8a8b04527)
The intention of lowering the priority when all processes are local
was to favor Vader BTL. However, in builds including the OFI MTL it
gets selected instead.
Reviewed-by: Spruit, Neil R <neil.r.spruit@intel.com>
Reviewed-by: Gopalakrishnan, Aravind <aravind.gopalakrishnan@intel.com>
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
(cherry picked from commit fc8582c5606b7a3d1b711f8f7b6144808290a48f)
Change default provider selection logic for the OFI MTL. The
old logic was whitelist-only, so any new HPC NIC provider would
have to ask users to do extra work or wait for an OMPI release
to be whitelisted. The reason for the logic was to avoid
selecting a "generic" provider like sockets or shm that would
frequently have worse performance than the optimized BTL options
Open MPI supports.
With the change, we blacklist the (small, relatively static) list
of providers that duplicate internal capabilities. Users can use
one of thse blacklisted providers in two ways: first, they can
explicitly request the provider in the include list (which will
override the default exclude list) and second, the can set a new
empty exclude list.
Since most HPC networks require special libraries and therefore
an explicit build of libfabric, it is highly unlikely that this
change will cause users to use libfabric when they didn't want to
do so. It does, however, solve the whitelisting problem.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit c5eaa38491c7197f7dbc74c299ade18e09bf5f64)
As agreed on #4574, where removed in past release branches
to avoid perfomance impacts in the default values for
some paramters.
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
Since progress entries array is globally allocated, it is susceptible
to race conditions when using multi-threaded applications. Allocating it
on the stack resolves any potential races as it is thread local by default.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
(cherry picked from commit ed2343034d09b33eb44a0a727bef97a108edc8aa)
-Updated blocking send to directly call functionality and
set completion events expected to 0 initally. This allows for optimization for
providers that support fi_tinject up to larger sizes. This also reduces
latency on running the OFI mtl with smaller sizes without requiring
calls to progress given fi_tinject is required to complete the messaging
before returning and will not create any events in the Completion Queue.
-Updated non-blocking send to directly call fi_tsend and avoid calling
fi_tinject as the functionality should not wait on completions. This
resolves a bug where applications calling MPI_Isend can overrun the
TX buffer with small (inject) messages causing a deadlock. In addition
this improves performance in message rates by preventing
waiting on any size message to complete in non-blocking send messages.
-Created common ompi_mtl_ofi_ssend_recv function to post the ssend recv
which is common between isend and send code paths.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 7dc8c8ba3fa630df8c5c7ab36fcf25249a82bfe7)
- If a message for a recv that is being cancelled gets completed after
the call to fi_cancel, then the OFI mtl will enter a deadlock state
waiting for ofi_req->super.ompi_req->req_status._cancelled which will
never happen since the recv was successfully finished.
- To resolve this issue, the OFI mtl now checks ofi_req->req_started
to see if the request has been started within the loop waiting for the
event to be cancelled. If the request is being completed, then the loop
is broken and fi_cancel exits setting
ofi_req->super.ompi_req->req_status._cancelled = false;
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit 767135c580f75d3dde9cb9c88601dd18afda949a)
- Added support in MTL_OFI_RETRY_UNTIL_DONE to handle -FI_EAGAIN
from the provider and correctly attempt to progress the OFI Completion
queue by calling ompi_mtl_ofi_progress.
- If events were pending that blocked OFI operations from being enqueued
they will be completed and the OFI operation will be retried once
ompi_mtl_ofi_progress has successfully completed.
- Updated MTL_OFI_RETRY_UNTIL_DONE to take a RETURN variable instead of
requiring the existance of a "ret" variable to pass back the return
value from completing the OFI operation.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
(cherry picked from commit d4f408a7f867b2f7bab84b9c966e1eba59f59e0e)
-Updated the design for sync send MPI calls to use 2 protocol bits for
denoting "sync_send" or "sync_send_ack".
-"Sync_send" is added to the send tag only and is masked out in receives
such that it can be read by the original Recv posted in the send/recv
operation.
-"Sync_send_ack" is sent from the recv callback to the send side. This 0
byte send does not generate a completion entry and instead sends the
message and immediately completes the opal completion in the recv.
-Tag formats ofi_tag_1 and ofi_tag_2 have been updated to include 2
more tag bits per format type due to the reduced protocal bits required
by OMPI.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
Extend number of supported ranks with providers that support
FI_REMOTE_CQ_DATA. Add README file to OFI MTL
Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
Remove the MXM MTL, which has been deprecated in preference for
the Yalla PML. This was discussed at the last developers meeting
and somehow I ended up with the action item to do the removal.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit fixes a segfault in mtl-portals4 finalize(). The segfault
occurs if finalize() is called without any calls to add_procs(). This
commit resolves the segfault by skipping the progress() loop in
finalize() if the Portals was not initialized.
Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
-Updated ompi_mtl_ofi_progress to use an array to read CQ events up to a
threshold that can be set by the Open MPI User.
-Users can adjust the number of events that can be handled in the
ompi_mtl_ofi_progress by setting "--mca mtl_ofi_progress_event_cnt #".
-The default value for the the number of CQ events that can be read in a
single call to ofi progress is 100 which is an average
based off workload usecase anaylsis showing 70-128 as the range of
multiple events returned during ofi progress.
Signed-off-by: Spruit, Neil R <neil.r.spruit@intel.com>
This fixes a regression in sockets provider which could return -EINTR value
from fi_cq_read() due to a syscall being interrupted. The error value is
currently interpreted as fatal condition. Relax the rule so that we can retry
fi_cq_read() operation.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
At least with some providers (sockets and GNI), the mprobe/mrecv
ofi mtl methods were incorrect. For these two providers at least
one must supply the original tag and mask bits used with the
prior FI_PEEK | FI_CLAIM request that had been used to probe for
the message.
These providers take a strict interpretation of the following sentence
from the libfabric fi_tagged man page:
```
Claimed messages can only be retrieved using a subsequent, paired receive operation with the FI_CLAIM flag set.
```
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
of "unset".
mtl/psm2: Update some shadow mca parameters to use the default "unset".
mtl/psm2: Add new shadow parameter to allow specifying the service level.
Signed-off-by: Matias A Cabral <matias.a.cabral@intel.com>
gcc 5.2 complains:
```
mtl_ofi_component.c: In function ‘ompi_mtl_ofi_finalize’:
mtl_ofi_component.c:613:5: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
if (ret = fi_close((fid_t)ompi_mtl_ofi.fabric)) {
^
```
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Before this commit, the presence of usNIC devices -- which will
(currently) return no data when fi_getinfo() is queried for tagged
matching providers -- would cause an error message to be displayed.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
The value of ret is negative (e.g., -61), but it is displayed in the
help message as `%zd`, which renders as unsigned (i.e., a giant
positive value). So make sure to negate the negative value before
rendering it (e.g., so we display "61", not "4294967235").
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>