This BTL accesses the Cisco usNIC Linux device through the Linux verbs
API, using Unreliable Datagram (UD) queue pairs. A few noteworthy points:
* This BTL does most of its own fragmentation; it tells the PML that
it has a very high max_send_size (much higher than the network
MTU).
* Since UD fragments are, by definition, unreliable, the usnic BTL
handles all of its own reliability via a sliding window approach
using the opal_hotel construct and many tricks stolen from the
corpus of knowledge surrounding efficient TCP.
* There is a fun PML latency-metric based optimization for NUMA
awareness of short messages.
* Note that this is ''not'' a generic UD verbs BTL; it is specific to
the Cisco usNIC device.
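As a rough illustration of the sliding-window idea mentioned above, a sender can cap the number of unacknowledged fragments in flight and advance the window as ACKs arrive. This is only a sketch under stated assumptions, not the usnic BTL's code: every name below (send_window_t, window_send, window_ack, WINDOW_SIZE) is hypothetical, and the retransmission timing that the BTL builds on opal_hotel is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SIZE 4  /* hypothetical: max fragments in flight */

/* Hypothetical sender-side state; not the usnic BTL's real structures. */
typedef struct {
    uint32_t base;                /* oldest unacknowledged sequence number */
    uint32_t next;                /* next sequence number to assign */
    bool     acked[WINDOW_SIZE];  /* ACK flags, indexed by seq % WINDOW_SIZE */
} send_window_t;

void window_init(send_window_t *w)
{
    assert(w != NULL);
    w->base = 0;
    w->next = 0;
    for (int i = 0; i < WINDOW_SIZE; i++) {
        w->acked[i] = false;
    }
}

/* Assign the next sequence number if the window has room.
 * Returns false when the sender must wait for ACKs first. */
bool window_send(send_window_t *w, uint32_t *seq_out)
{
    assert(w != NULL && seq_out != NULL);
    if (w->next - w->base >= WINDOW_SIZE) {
        return false;
    }
    w->acked[w->next % WINDOW_SIZE] = false;
    *seq_out = w->next++;
    return true;
}

/* Record an ACK, then slide the window past every contiguous
 * acknowledged fragment. ACKs outside the window are ignored. */
void window_ack(send_window_t *w, uint32_t seq)
{
    assert(w != NULL);
    if (seq - w->base >= w->next - w->base) {
        return;  /* stale or bogus ACK */
    }
    w->acked[seq % WINDOW_SIZE] = true;
    while (w->base != w->next && w->acked[w->base % WINDOW_SIZE]) {
        w->acked[w->base % WINDOW_SIZE] = false;
        w->base++;
    }
}
```

In this sketch an out-of-order ACK is remembered but does not move the window; the window only advances once the oldest outstanding fragment is acknowledged.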
This commit was SVN r28879.
George and I were talking about ORTE's error handling the other day, specifically the right way to deal with errors in the updated OOB. It seemed a bad idea for a library such as ORTE to abort the job on its own prerogative. If we lose a connection or cannot send a message, then we really should just report it upwards and let the application and/or upper layers decide what to do about it.
The current code base only allows a single error callback to exist, which seemed unduly limiting. So, based on the conversation, I've modified the errmgr interface to provide a mechanism for registering any number of error handlers (this replaces the current "set_fault_callback" API). When an error occurs, these handlers will be called in order until one responds that the error has been "resolved" - i.e., no further action is required - by returning OMPI_SUCCESS. The default MPI layer error handler is specified to go "last" and calls mpi_abort, so the current "abort" behavior is preserved unless other error handlers are registered.
In the register_callback function, I provide an "order" param so you can specify "this callback must come first" or "this callback must come last". Seemed to me that we will probably have different code areas registering callbacks, and one might require it go first (the default "abort" will always require it go last). So you can append and prepend, or go first. Note that only one registration can declare itself "first" or "last", and since the default "abort" callback automatically takes "last", that one isn't available. :-)
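To make the ordering rules concrete, here is a minimal sketch of such a handler chain. It is not the errmgr code: the names (handler_chain_t, chain_register, chain_dispatch, the ORDER_* values, ERR_TIMEOUT) are hypothetical, handlers take only an error code for brevity, HANDLER_RESOLVED stands in for OMPI_SUCCESS, and the default handler sets a flag where the real one would call mpi_abort.

```c
#include <assert.h>
#include <stddef.h>

#define HANDLER_RESOLVED   0   /* stands in for OMPI_SUCCESS */
#define HANDLER_UNRESOLVED 1
#define REG_OK             0
#define REG_ERR          (-1)
#define MAX_HANDLERS      16
#define ERR_TIMEOUT        7   /* hypothetical error code */

/* Hypothetical ordering values; the real errmgr names may differ. */
typedef enum { ORDER_FIRST, ORDER_PREPEND, ORDER_APPEND, ORDER_LAST } handler_order_t;

typedef int (*err_handler_fn)(int error_code);

typedef struct {
    err_handler_fn first;                 /* at most one "first" handler */
    err_handler_fn middle[MAX_HANDLERS];  /* prepended/appended handlers */
    size_t         count;
    err_handler_fn last;                  /* at most one "last" handler */
} handler_chain_t;

int chain_register(handler_chain_t *c, err_handler_fn fn, handler_order_t order)
{
    assert(c != NULL && fn != NULL);
    switch (order) {
    case ORDER_FIRST:
        if (c->first != NULL) return REG_ERR;   /* slot already claimed */
        c->first = fn;
        return REG_OK;
    case ORDER_LAST:
        if (c->last != NULL) return REG_ERR;    /* slot already claimed */
        c->last = fn;
        return REG_OK;
    case ORDER_PREPEND:
        if (c->count == MAX_HANDLERS) return REG_ERR;
        for (size_t i = c->count; i > 0; i--) c->middle[i] = c->middle[i - 1];
        c->middle[0] = fn;
        c->count++;
        return REG_OK;
    case ORDER_APPEND:
        if (c->count == MAX_HANDLERS) return REG_ERR;
        c->middle[c->count++] = fn;
        return REG_OK;
    }
    return REG_ERR;
}

/* Call handlers in order until one reports the error as resolved. */
int chain_dispatch(const handler_chain_t *c, int error_code)
{
    assert(c != NULL);
    if (c->first != NULL && c->first(error_code) == HANDLER_RESOLVED) {
        return HANDLER_RESOLVED;
    }
    for (size_t i = 0; i < c->count; i++) {
        if (c->middle[i](error_code) == HANDLER_RESOLVED) return HANDLER_RESOLVED;
    }
    return (c->last != NULL) ? c->last(error_code) : HANDLER_UNRESOLVED;
}

/* Example handlers. The real default would call mpi_abort and not return. */
int abort_requested = 0;

int default_abort_handler(int error_code)
{
    (void)error_code;
    abort_requested = 1;
    return HANDLER_RESOLVED;
}

int timeout_handler(int error_code)
{
    return (error_code == ERR_TIMEOUT) ? HANDLER_RESOLVED : HANDLER_UNRESOLVED;
}
```

With the default handler registered as "last", a second ORDER_LAST registration fails, while an appended handler that resolves ERR_TIMEOUT keeps the default abort from ever being reached for that error.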
The errhandler callback function passes an opal_pointer_array of structs, each of which contains the name of the proc involved (which can be yourself for internal errors) and the error code. This is a change from the current fault callback which returned an opal_pointer_array of just process names. Rationale is that you might need to see the cause of the error to decide what action to take. I realize that isn't a requirement for remote procs, but remember that we will use the SAME interface to report RTE errors internal to the proc itself. In those cases, you really do need to see the error code. It is legal to pass a NULL for the pointer array (e.g., when reporting an internal failure without error code), so handlers must be prepared for that possibility. If people find that too burdensome, we can remove it.
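A handler written against this kind of payload might look roughly like the following. This is a sketch only: proc_name_t, error_info_t, ERR_UNREACH and the handler itself are hypothetical stand-ins, and a plain C array plus a count is used in place of opal_pointer_array so the example stays self-contained. The point is the NULL check up front and the per-entry error code.

```c
#include <assert.h>
#include <stddef.h>

#define HANDLER_RESOLVED   0   /* stands in for OMPI_SUCCESS */
#define HANDLER_UNRESOLVED 1
#define ERR_UNREACH       12   /* hypothetical error code */

/* Hypothetical stand-ins for the process name and the per-error struct. */
typedef struct { unsigned jobid; unsigned vpid; } proc_name_t;
typedef struct { proc_name_t proc; int errcode; } error_info_t;

/* Resolves the report only if it names at least one process and every
 * reported error is ERR_UNREACH. A NULL array is legal and means "no
 * details available", so this handler declines to resolve it. */
int unreachable_peer_handler(const error_info_t *const *errors, size_t count)
{
    size_t seen = 0;

    if (errors == NULL) {
        return HANDLER_UNRESOLVED;
    }
    for (size_t i = 0; i < count; i++) {
        if (errors[i] == NULL) {
            continue;  /* tolerate empty slots in the array */
        }
        if (errors[i]->errcode != ERR_UNREACH) {
            return HANDLER_UNRESOLVED;
        }
        seen++;
    }
    return (seen > 0) ? HANDLER_RESOLVED : HANDLER_UNRESOLVED;
}
```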
Should we ever decide to create a separate callback path for internal errors vs remote process failures, or if we decide to do something different based on experience, then we can adjust this API.
This commit was SVN r28852.
- extend the framework API
- remove the dummy component, which is no longer required
- add four components to perform the actual job.
This commit was SVN r28828.
soon:
- add a new abstraction layer to be used internally for some operations
- add a new mca parameter to control lazy initialization of shared file
pointer structures
This commit was SVN r28826.
This commit adds an API for registering and querying performance
variables (mca_base_pvar) in the MCA base. The existing MCA variable
system API has been updated to reflect the new API: MCA variable
groups have performance variables, and new types have been added (double,
unsigned long long) to reflect what is required by the MPI_T
interface. Additionally, the MCA variable group code has been split
into its own set of files: mca_base_var_group.[ch].
Details of the new API can be found in doxygen comments in the header:
mca_base_pvar.h.
Other changes to the variable system:
- Use an opal_hash_table to speed up variable/group lookup.
- Clean up code associated with MCA variable types.
- Registered performance variables are printed by ompi_info -a. In the
future an option should be added to control this behavior.
Changes to OMPI:
- Added full support for the MPI_T performance variable interface.
This commit was SVN r28800.
identified by Takahiro Kawashima. The packed length was reported as a
max bound and not provided on the unpacking side, so the unpacking
buffer could become out of sync with the content stored after the
packed representation.
The fix forces the packing operation itself before reporting the length,
so we now always report the real number of bytes in the packed
representation.
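The general pattern can be sketched as follows. The surviving text does not name the code that was changed, so this is a generic illustration, not the actual fix: the function names are hypothetical, and a varint encoding is used only because it makes the maximum bound differ from the real packed length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Upper bound on the packed size: at most 5 bytes per 32-bit value. */
size_t packed_size_bound(size_t count)
{
    return 5 * count;
}

/* Pack values as 7-bit varints. Returns the number of bytes actually
 * written; out must have room for packed_size_bound(count) bytes. */
size_t pack_values(const uint32_t *vals, size_t count, uint8_t *out)
{
    size_t n = 0;
    assert(vals != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        uint32_t v = vals[i];
        while (v >= 0x80) {
            out[n++] = (uint8_t)(v | 0x80);  /* low 7 bits + continuation */
            v >>= 7;
        }
        out[n++] = (uint8_t)v;
    }
    return n;
}

/* Unpack count values. Returns the number of bytes consumed. */
size_t unpack_values(const uint8_t *in, size_t count, uint32_t *vals)
{
    size_t n = 0;
    assert(in != NULL && vals != NULL);
    for (size_t i = 0; i < count; i++) {
        uint32_t v = 0;
        int shift = 0;
        uint8_t byte;
        do {
            byte = in[n++];
            v |= (uint32_t)(byte & 0x7F) << shift;
            shift += 7;
        } while (byte & 0x80);
        vals[i] = v;
    }
    return n;
}

/* Pack first, then report the real length, and store the trailing
 * content immediately after the packed bytes (not after the bound).
 * out needs packed_size_bound(count) + 1 bytes. */
size_t pack_with_trailer(const uint32_t *vals, size_t count,
                         uint8_t trailer, uint8_t *out)
{
    size_t len = pack_values(vals, count, out);
    out[len] = trailer;
    return len;
}
```

Because the reported length is the real one, the unpacking side consumes exactly that many bytes and finds the trailing content where it expects it.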
cmr:v1.7.3:reviewer=jsquyres
This commit was SVN r28790.
This commit improves small-message latency and bandwidth when using
the vader btl. These improvements should make performance competitive
with other MPI implementations.
This commit was SVN r28760.
many builds. I am temporarily .ompi_ignore'ing this component until
it can be fixed by its owner.
* It calls AC_MSG_ERROR, which configure.m4 scripts are ''never''
supposed to do. If you don't want to build, then call $2.
* All static and --disable-dlopen builds are broken; they fall afoul
of whatever test configure.m4 is doing and therefore error out of
configure entirely (vs. simply disabling the hcoll component).
* There appear to be multiple shell scripting errors in the
configure.m4. Here's the output of "./configure --disable-dlopen":
{{{
--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... static
checking --with-hcoll value... simple ok (unspecified)
./configure: line 421: test: basic: integer expression expected
configure: error: Can not use coll/hcoll and coll/ml (static build)
simultaneously. You have two options:
1. Use static build & disable ml with:
--enable-mpi-no-build=coll-ml
2. Use dso build for ML & disable ml at runtime: -mca
coll self
./configure: line 310: return: basic: numeric argument required
./configure: line 320: exit: basic: numeric argument required
}}}
Finally, all of these configure.m4 errors aside, I don't understand
why there is a ''compile-time'' exclusion between the hcoll and ml
components. Why isn't this a ''run-time'' decision? Having what
seems to be an unnecessary compile-time exclusion goes against the
general Open MPI philosophy.
Note: Open MPI 1.7 is also broken in all the same ways. I suggest
that the RMs .ompi_ignore hcoll over there, too.
Mellanox: please fix.
This commit was SVN r28748.
for the SM and TCP BTLs, as well as the mca_btl_base_param_register()
function (which registers MCA params for all BTLs).
The guidelines in
https://svn.open-mpi.org/trac/ompi/wiki/MCAParamLevels were used to
pick these levels.
This commit was SVN r28746.
value to signal that the operation of retrieving the element from the free list
failed. However in this case the returned pointer was set to NULL as well, so the
error code was redundant. Moreover, this was a continuous source of warnings when
the picky mode is on.
The attached patch removes the rc argument from the OMPI_FREE_LIST_GET and
OMPI_FREE_LIST_WAIT macros, and changes callers to check whether the item is
NULL instead of using the return code.
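The shape of the change can be illustrated with a toy free list. This is a sketch, not the OMPI macros: item_t, free_list_t, LIST_GET_OLD, LIST_GET and list_put are hypothetical names, and the list here is a plain singly linked stack with no locking or growth.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal free list; not ompi_free_list_t. */
typedef struct item { struct item *next; } item_t;
typedef struct { item_t *head; } free_list_t;

/* Old style: rc only duplicates what (item == NULL) already says. */
#define LIST_GET_OLD(list, item, rc)          \
    do {                                      \
        (item) = (list)->head;                \
        if ((item) != NULL) {                 \
            (list)->head = (item)->next;      \
            (rc) = 0;                         \
        } else {                              \
            (rc) = -1;                        \
        }                                     \
    } while (0)

/* New style: no rc argument; the caller checks the item for NULL. */
#define LIST_GET(list, item)                  \
    do {                                      \
        (item) = (list)->head;                \
        if ((item) != NULL) {                 \
            (list)->head = (item)->next;      \
        }                                     \
    } while (0)

/* Return an item to the list. */
void list_put(free_list_t *list, item_t *item)
{
    assert(list != NULL && item != NULL);
    item->next = list->head;
    list->head = item;
}
```

A caller of the new form simply writes `LIST_GET(&fl, it); if (it == NULL) { /* handle exhaustion */ }`, with no extra return-code variable to declare.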
This commit was SVN r28722.