openmpi/opal/mca/btl/ofi/README

========================================
Design notes on BTL/OFI
========================================

This is an RDMA-only BTL based on OFI Libfabric. The goal is to enable RDMA on
hardware from multiple vendors through one interface. Most operations are
managed by the upper layer (osc/rdma); this BTL mostly does the low-level work.

Tested providers: sockets, psm2, ugni

========================================

Component

This BTL requests the libfabric 1.5 API and does not support older versions.

The required capabilities are FI_ATOMIC and FI_RMA, with the endpoint type
FI_EP_RDM only. This BTL does NOT support libfabric providers that require local
memory registration (FI_MR_LOCAL).

BTL/OFI initializes a module with ONLY the first compatible info returned from OFI.
This means it relies on the OFI provider to do load balancing. Support for multiple
devices might be added later.
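
The selection rule above can be sketched as follows. This is a minimal model, not the
BTL's code: the struct, the DEMO_* flags, and the function names are hypothetical
stand-ins. Real code would take FI_RMA, FI_ATOMIC, and FI_EP_RDM from the libfabric
headers and walk the list returned by the provider query.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in capability flags for illustration only; these are NOT
 * libfabric's values. */
#define DEMO_CAP_RMA     (UINT64_C(1) << 0)
#define DEMO_CAP_ATOMIC  (UINT64_C(1) << 1)
#define DEMO_CAP_MSG     (UINT64_C(1) << 2)

enum demo_ep_type { DEMO_EP_MSG, DEMO_EP_DGRAM, DEMO_EP_RDM };

/* Hypothetical, trimmed-down view of one info entry. */
struct demo_info {
    uint64_t caps;
    enum demo_ep_type ep_type;
    bool requires_local_mr;   /* models a provider that needs FI_MR_LOCAL */
};

/* An entry is usable only with RMA + ATOMIC, an RDM endpoint,
 * and no local memory registration requirement. */
static bool demo_info_is_compatible(const struct demo_info *info)
{
    const uint64_t required = DEMO_CAP_RMA | DEMO_CAP_ATOMIC;

    return (info->caps & required) == required
        && info->ep_type == DEMO_EP_RDM
        && !info->requires_local_mr;
}

/* Return the index of the first compatible entry, or -1 if there is none.
 * Later compatible entries are ignored, matching "ONLY the first". */
static int demo_select_first(const struct demo_info *list, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (demo_info_is_compatible(&list[i])) {
            return (int) i;
        }
    }
    return -1;
}
```

Extra capabilities on an entry do not disqualify it; only missing required ones do.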

The BTL creates only one endpoint and one CQ.

========================================

Memory Registration

Open MPI has a system in place to exchange remote addresses and always uses the remote
virtual address to refer to a piece of memory. However, some libfabric providers might
not support virtual addresses and instead use zero-based offset addressing.

FI_MR_VIRT_ADDR is the flag that determines this behavior. mca_btl_ofi_reg_mem() handles
this by storing the base address in the registration handle when the provider does not
support FI_MR_VIRT_ADDR. This base address is used later to calculate the offset in
RDMA/atomic operations.
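
The address translation described above might look like the sketch below. The handle
layout and the function name are hypothetical and chosen for illustration; only the idea
(virtual address minus stored base address when the provider is offset-based) comes from
the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical registration handle; field names are illustrative. */
struct demo_reg_handle {
    uint64_t base_addr;  /* start of the registered region */
    uint64_t rkey;       /* remote key for the region */
};

/* Translate the remote virtual address that Open MPI exchanged into the
 * value an RDMA/atomic call should pass as the remote address. */
static uint64_t demo_remote_target(const struct demo_reg_handle *handle,
                                   uint64_t remote_vaddr,
                                   bool provider_uses_virt_addr)
{
    if (provider_uses_virt_addr) {
        /* Provider addresses memory by virtual address: pass it through. */
        return remote_vaddr;
    }
    /* Offset-based provider: address is relative to the region start. */
    return remote_vaddr - handle->base_addr;
}
```

For a region registered at 0x1000, a target at 0x1010 is passed as 0x1010 to a
virtual-address provider and as offset 0x10 to an offset-based one.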

The BTL will try to use the address of the registration handle as the key. However, if the
provider supports FI_MR_PROV_KEY, it will use the provider-provided key. The BTL works either way.

The BTL does not register the local operand or compare buffers. This is why this BTL does
not support FI_MR_LOCAL. It allocates every buffer before registering it, which means
FI_MR_ALLOCATED is supported. To be explicit:

Supported MR mode bits (will work with or without):
    enum:
    - FI_MR_BASIC
    - FI_MR_SCALABLE

    mode bits:
    - FI_MR_VIRT_ADDR
    - FI_MR_ALLOCATED
    - FI_MR_PROV_KEY

The BTL does NOT support (will not work with):
    - FI_MR_LOCAL
    - FI_MR_MMU_NOTIFY
    - FI_MR_RMA_EVENT
    - FI_MR_ENDPOINT

Just a reminder, in libfabric API 1.5...
FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)
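
As a rough illustration of the lists above, a mode check could reject any provider that
sets one of the unsupported bits and accept everything else. The DEMO_* bit positions
below are stand-ins and are NOT libfabric's values; real code would use the FI_MR_*
definitions from the libfabric headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit positions for illustration only. */
#define DEMO_MR_LOCAL       (1u << 0)
#define DEMO_MR_VIRT_ADDR   (1u << 1)
#define DEMO_MR_ALLOCATED   (1u << 2)
#define DEMO_MR_PROV_KEY    (1u << 3)
#define DEMO_MR_MMU_NOTIFY  (1u << 4)
#define DEMO_MR_RMA_EVENT   (1u << 5)
#define DEMO_MR_ENDPOINT    (1u << 6)

/* Bits the BTL works with or without. */
#define DEMO_MR_OPTIONAL \
    (DEMO_MR_VIRT_ADDR | DEMO_MR_ALLOCATED | DEMO_MR_PROV_KEY)

/* Bits the BTL will not work with. */
#define DEMO_MR_UNSUPPORTED \
    (DEMO_MR_LOCAL | DEMO_MR_MMU_NOTIFY | DEMO_MR_RMA_EVENT | DEMO_MR_ENDPOINT)

/* A provider's MR mode is acceptable when none of the unsupported bits is
 * set; the optional bits may be present or absent. */
static bool demo_mr_mode_ok(uint32_t mr_mode)
{
    return (mr_mode & DEMO_MR_UNSUPPORTED) == 0;
}
```

In this model, the FI_MR_BASIC equivalent is just DEMO_MR_OPTIONAL with all three bits
set, so it passes the check.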

========================================

Completions

Every operation in this BTL is asynchronous. Completion handling occurs in
mca_btl_ofi_component_progress(), where we read the CQ with the completion context and
execute the callback functions. The completions are local. No remote completion event is
generated, as local completion already guarantees global completion.

The BTL keeps track of the number of outstanding operations and provides a flush interface.
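
The progress and flush pattern can be modeled with a toy queue, as sketched below. All
names here are hypothetical, and the array-backed queue only stands in for the provider's
CQ; in this toy an operation lands on the queue as soon as it is posted, whereas a real
provider completes it asynchronously.

```c
#include <stddef.h>

typedef void (*demo_cb_fn)(void *cbdata);

/* Hypothetical completion context attached to each posted operation. */
struct demo_completion {
    demo_cb_fn cb;
    void *cbdata;
};

#define DEMO_CQ_DEPTH 8

/* Toy completion queue standing in for the provider's CQ. */
struct demo_cq {
    struct demo_completion *entries[DEMO_CQ_DEPTH];
    size_t head, tail;   /* tail - head = entries waiting to be read */
    int outstanding;     /* operations posted but not yet completed */
};

/* Post an operation and count it as outstanding. Returns -1 if full. */
static int demo_post(struct demo_cq *cq, struct demo_completion *comp)
{
    if (cq->tail - cq->head == DEMO_CQ_DEPTH) {
        return -1;
    }
    cq->entries[cq->tail % DEMO_CQ_DEPTH] = comp;
    cq->tail++;
    cq->outstanding++;
    return 0;
}

/* Read up to max_events completions and run their callbacks.
 * Returns the number of completions handled. */
static int demo_progress(struct demo_cq *cq, int max_events)
{
    int events = 0;

    while (events < max_events && cq->head != cq->tail) {
        struct demo_completion *comp = cq->entries[cq->head % DEMO_CQ_DEPTH];
        cq->head++;
        cq->outstanding--;
        if (comp->cb != NULL) {
            comp->cb(comp->cbdata);
        }
        events++;
    }
    return events;
}

/* Flush: keep progressing until no operation is outstanding. */
static void demo_flush(struct demo_cq *cq)
{
    while (cq->outstanding > 0) {
        demo_progress(cq, DEMO_CQ_DEPTH);
    }
}

/* Example callback: counts how many operations have completed. */
static void demo_count_cb(void *cbdata)
{
    (*(int *) cbdata)++;
}
```

The outstanding counter is what makes flush possible: the caller does not need to know
which operations are pending, only that the count has reached zero.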

========================================

Sockets Provider

The sockets provider is the proof-of-concept provider for libfabric. It is supposed to
support the entire OFI API through emulation. This provider is considered very slow and is
bound to raise problems that we might not see with other, faster providers.

Known Problems:
    - The sockets provider uses a progress thread and can cause a segfault in finalize, as
      we free the resources while the progress thread is still using them. sleep(1) was put
      in mca_btl_ofi_component_close() for this reason.
    - The sockets provider deadlocks in two-sided mode. Might be something about buffered
      recv. (August 2018).

========================================

Scalable Endpoint

This BTL will try to use a scalable endpoint to create communication contexts. This will
increase multithreaded performance for some applications. The default number of contexts
created is 1 and can be tuned via the MCA parameter "btl_ofi_num_contexts_per_module". It is
advised that the number of contexts equal the number of physical cores for optimal performance.

Users can disable the scalable endpoint with the MCA parameter "btl_ofi_disable_sep".
With the scalable endpoint disabled, the BTL will alias the OFI endpoint to both the tx and
rx contexts.
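
To show why the context count matters for multithreaded runs, the sketch below spreads
threads over contexts with a simple modulo. This policy is an assumption made for
illustration; the notes above do not say how the BTL assigns threads to contexts. The
struct and function names are hypothetical, and treating a disabled scalable endpoint as a
single context is also an assumption.

```c
#include <stdbool.h>

/* Hypothetical module state; field names are illustrative. */
struct demo_module {
    int num_contexts;    /* models btl_ofi_num_contexts_per_module */
    bool sep_disabled;   /* models btl_ofi_disable_sep */
};

/* Number of contexts actually in use. Assumption: with the scalable
 * endpoint disabled, the single aliased endpoint acts as one context. */
static int demo_effective_contexts(const struct demo_module *module)
{
    if (module->sep_disabled || module->num_contexts < 1) {
        return 1;
    }
    return module->num_contexts;
}

/* Assumed policy: pick a context for a thread by modulo, so that threads
 * spread evenly and share a context only when they outnumber contexts. */
static int demo_context_for_thread(const struct demo_module *module,
                                   unsigned thread_id)
{
    return (int) (thread_id % (unsigned) demo_effective_contexts(module));
}
```

With one context per physical core, each thread in this model gets its own context and
no two threads contend for the same one.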

========================================

Two sided communication

Two-sided communication was added later to BTL/OFI so that providers without tag matching
can be used in Open MPI through this BTL. However, the support is only "functional"
and has not been optimized for performance at this point. (August 2018)