(i.e., ensure that more data items get zeroed out/set to NULL) so that
if something goes wrong during initialization, we don't try to clean
up something that isn't there (and segv).
The chance of this happening on the trunk is very low (and will also
be low once the verbs improvements are brought over to v1.7). But it
can actually happen in the v1.6 branch (e.g., if no CPC is available,
we'll try to get the length of the endpoints list, but the endpoints
list is NULL).
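
The defensive pattern described above can be sketched roughly as follows. This is a minimal illustration, not the openib BTL code: the struct and every demo_* name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical module state; the field names are illustrative only. */
struct demo_module {
    void  **endpoints;      /* NULL until the endpoint list is created */
    size_t  num_endpoints;
    char   *device_name;
};

/* Zero everything up front so cleanup can tell "never set" from "set". */
static void demo_module_init(struct demo_module *m)
{
    assert(m != NULL);
    *m = (struct demo_module){0};
}

/* Length query that tolerates a list that was never created. */
static size_t demo_module_endpoint_count(const struct demo_module *m)
{
    return (m->endpoints != NULL) ? m->num_endpoints : 0;
}

/* Safe to call at any point after demo_module_init(), even if
 * initialization failed halfway through, and safe to call twice. */
static void demo_module_cleanup(struct demo_module *m)
{
    free(m->endpoints);     /* free(NULL) is a no-op */
    free(m->device_name);
    *m = (struct demo_module){0};
}
```

With this shape, an early failure (such as no CPC being available) can jump straight to cleanup without first checking which fields were populated.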
Hence, even though the real goal is to get this functionality over to
v1.6, I figured I'd commit to the trunk/CMR to v1.7 just to try to
keep commonality in the openib between all three where possible.
This commit was SVN r28317.
Notes:
- This commit also eliminates the need for an available components list in use
in several frameworks. None of the code in question was making use of the
priority field of the priority component list item so these extra lists were
removed.
- Cleaned up selection code in several frameworks to sort lists using opal_list_sort.
- Cleans up the ompi/orte-info functions. Expose the functions that construct the
list of params so they can be used elsewhere.
patches for mtl/portals4 from brian
missed a few output variables in openib
This commit was SVN r28241.
Features:
- Support for an override parameter file (openmpi-mca-param-override.conf).
Variable values in this file cannot be overridden by any file or environment
value.
- Support for boolean, unsigned, and unsigned long long variables.
- Support for true/false values.
- Support for enumerations on integer variables.
- Support for MPIT scope, verbosity, and binding.
- Support for command line source.
- Support for setting variable source via the environment using
OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
- Cleaner API.
- Support for variable groups (equivalent to MPIT categories).
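
As an illustration of the OMPI_MCA_SOURCE_<var name> value format listed above ("command" or "file:filename"), here is a small parsing sketch. The enum and function names are hypothetical and are not the parser used by the MCA variable system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical source kinds for this sketch (not the real enum). */
enum demo_source {
    DEMO_SOURCE_INVALID,
    DEMO_SOURCE_COMMAND,
    DEMO_SOURCE_FILE
};

/* Parse "command" or "file:<filename>".  On DEMO_SOURCE_FILE, *filename
 * points into spec just past the "file:" prefix; otherwise it is NULL. */
static enum demo_source demo_parse_source(const char *spec,
                                          const char **filename)
{
    static const char prefix[] = "file:";
    const size_t prefix_len = sizeof(prefix) - 1;

    assert(filename != NULL);
    *filename = NULL;
    if (spec == NULL) {
        return DEMO_SOURCE_INVALID;
    }
    if (strcmp(spec, "command") == 0) {
        return DEMO_SOURCE_COMMAND;
    }
    /* Require a non-empty filename after the prefix. */
    if (strncmp(spec, prefix, prefix_len) == 0 && spec[prefix_len] != '\0') {
        *filename = spec + prefix_len;
        return DEMO_SOURCE_FILE;
    }
    return DEMO_SOURCE_INVALID;
}
```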
Notes:
- Variables must be created with a backing store (char **, int *, or bool *)
that must live at least as long as the variable.
- Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE flag enables the use of
mca_base_var_set_value() to change the value.
- String values are duplicated when the variable is registered. It is up to
the caller to free the original value if necessary. The new value will be
freed by the mca_base_var system and must not be freed by the user.
- Variables with constant scope may not be settable.
- Variable groups (and all associated variables) are deregistered when the
component is closed or the component repository item is freed. This
prevents a segmentation fault from accessing a variable after its component
is unloaded.
- After some discussion we decided we should remove the automatic registration
of component priority variables. Few components actually made use of this
feature.
- The enumerator interface was updated to be general enough to handle
future uses of the interface.
- The code to generate ompi_info output has been moved into the MCA variable
system. See mca_base_var_dump().
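
The backing-store and string-ownership rules in the notes above can be modeled with a toy example. This is only a sketch of the ownership contract: the var_* names are hypothetical stand-ins and do not reproduce the mca_base_var API or its signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy string variable; a stand-in for the real variable system. */
struct var_demo {
    char **storage;   /* caller-owned backing store; must outlive the variable */
    bool   settable;  /* models the effect of MCA_BASE_VAR_FLAG_SETTABLE */
};

/* Portable duplicate helper (strdup is POSIX, not ISO C before C23). */
static char *var_dup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL) {
        memcpy(copy, s, len);
    }
    return copy;
}

/* Register: duplicate the initial value into the backing store.  The
 * caller still owns, and may free, the string it originally supplied. */
static int var_register(struct var_demo *v, char **storage, bool settable)
{
    assert(v != NULL && storage != NULL);
    v->storage = storage;
    v->settable = settable;
    if (*storage != NULL) {
        char *copy = var_dup(*storage);
        if (copy == NULL) {
            return -1;
        }
        *storage = copy;
    }
    return 0;
}

/* Set: only allowed for settable variables; the old copy is released. */
static int var_set(struct var_demo *v, const char *value)
{
    char *copy;

    assert(v != NULL && value != NULL);
    if (!v->settable) {
        return -1;
    }
    copy = var_dup(value);
    if (copy == NULL) {
        return -1;
    }
    free(*v->storage);
    *v->storage = copy;
    return 0;
}

/* Deregister: the variable system frees its copy, never the user. */
static void var_deregister(struct var_demo *v)
{
    free(*v->storage);
    *v->storage = NULL;
    v->storage = NULL;
}
```

The point of the model is the split in responsibility: the caller frees what it allocated, and the variable system frees what it duplicated.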
opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system
This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.
This commit was SVN r28236.
ompi_show_help, because opal_show_help is replaced with an
aggregating version when using ORTE, so there's no reason to
directly call orte_show_help.
This commit was SVN r28051.
necessarily mean an error -- it could (and usually does) mean that the
peer realized that we both initiated a connect at the same time, and
therefore it decided to hang up.
I also added a friendly show_help error message for other cases where
recv_blocking() fails (i.e., "Something went wrong. Kaboom! Your job
will abort...").
This commit was SVN r28023.
The following Trac tickets were found above:
Ticket 3494 --> https://svn.open-mpi.org/trac/ompi/ticket/3494
flags, and mca flags are kept separate until the very end. The main configure
wrapper flags should now be modified by using the OPAL_WRAPPER_FLAGS_ADD
macro. MCA components should either let <framework>_<component>_{LIBS,LDFLAGS}
be copied over OR set <framework>_<component>_WRAPPER_EXTRA_{LIBS,LDFLAGS}.
The situations in which WRAPPER CPPFLAGS can be set by MCA components were
narrowed to match the one use case where it makes sense.
This commit was SVN r27950.
all pending fragments when the destination goes down. This allows the PML
to recalibrate its behavior, either find an alternate route or just give up.
This commit was SVN r27881.
using the modex or RML to share sm initialization information, have node rank 0
create a file containing initialization information in a well-known place. Then
during add_procs, the rest of the node processes requiring sm BTL initialization
will just read from that file to complete their initialization.
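
A rough sketch of the file-based handoff described above, using plain stdio. The record layout, the sm_* names, and the path handling are all assumptions made for illustration; the real sm BTL code is not shown here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical initialization record shared between node-local ranks. */
struct sm_info_demo {
    unsigned long segment_size;
    int           creator_pid;
};

/* Called by node rank 0: publish the record at a well-known path. */
static int sm_publish(const char *path, const struct sm_info_demo *info)
{
    FILE *fp;
    int ok;

    assert(path != NULL && info != NULL);
    fp = fopen(path, "wb");
    if (fp == NULL) {
        return -1;
    }
    ok = (fwrite(info, sizeof(*info), 1, fp) == 1);
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Called by the other node-local ranks, e.g. during add_procs. */
static int sm_read(const char *path, struct sm_info_demo *info)
{
    FILE *fp;
    int ok;

    assert(path != NULL && info != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;      /* not published yet, or wrong path */
    }
    ok = (fread(info, sizeof(*info), 1, fp) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

The sketch leaves out the synchronization a real implementation needs so that readers never see a partially written file, and writing a raw struct is only reasonable because writer and readers are the same binary on the same node.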
This commit was SVN r27789.
There was a race condition in the eager get protocol where the RDMA complete message could be received before the local completion of the SMSG message that started the eager get protocol.
cmr:v1.7
This commit was SVN r27740.
* btl sendi(): if the message can be sent inline, try to avoid a signal
* a signal is requested once per 64 or when
there are no send wqes
when the message cannot be sent inline
any other btl method than sendi()
This commit was SVN r27724.
An upcoming BTL from Cisco used ofud as a starting point, and should
probably be used as a starting point for any future UD-based BTL.
And this OFUD BTL is obviously still in history if anyone ever wants
to resurrect it.
This commit was SVN r27655.
of modules, print a BTL_ERROR and exit(1) (previous behavior was to
segv). This at least explicitly tells the developer that their BTL
component is behaving badly.
This commit was SVN r27634.
Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream, so it is a little unexpected that mca_base_components_close closes one. To add to this, several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function, and others closed their output a second time. This change improves the semantics of mca_base_components_open/close, as they are now symmetric in their functionality.
This commit was SVN r27570.
pml/v:
- If vprotocol is not being used vprotocol_include_list is leaked. Assume vprotocol never takes ownership (see below) and always free the string.
coll/ml:
- (patch verified) calling mca_base_param_lookup_string after mca_base_param_reg_string is unnecessary. The call to mca_base_param_lookup_string causes the value returned by mca_base_param_reg_string to be leaked.
- Need to free mca_coll_ml_component.config_file_name on component close.
btl/openib:
- calling mca_base_param_lookup_string after mca_base_param_reg_string is unnecessary. The call to mca_base_param_lookup_string causes the value returned by mca_base_param_reg_string to be leaked.
vprotocol/base:
- There was no way for pml/v to determine if vprotocol took ownership of vprotocol_include_list. Fix by never taking ownership (always use strdup).
mca/base:
- param_lookup will result in storage->stringval being a newly allocated string if the MCA parameter has a string value. Ensure this string is always freed.
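
The "never take ownership" rule used for vprotocol_include_list can be sketched as below. The vp_* names are hypothetical, and malloc plus memcpy stands in for strdup only to keep the example within ISO C.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical consumer that keeps its own copy of a caller's string. */
struct vp_consumer {
    char *include_list;
};

/* Copy the list; the caller keeps ownership of its own buffer and is
 * always responsible for freeing it. */
static int vp_set_include_list(struct vp_consumer *c, const char *list)
{
    size_t len;
    char *copy;

    assert(c != NULL && list != NULL);
    len = strlen(list) + 1;
    copy = malloc(len);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, list, len);
    free(c->include_list);      /* drop any previous copy */
    c->include_list = copy;
    return 0;
}

/* The consumer frees only what it allocated. */
static void vp_consumer_fini(struct vp_consumer *c)
{
    free(c->include_list);
    c->include_list = NULL;
}
```

Because the consumer always copies, the caller can free its string unconditionally, which removes the ambiguity that caused the leak.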
cmr:v1.7
This commit was SVN r27569.
It appears the problem was not with the command line parser but with the rsh plm. I don't know why this problem was not occurring before the command line parser changes, but it appears to be resolved now.
This commit was SVN r27527.
The following SVN revision numbers were found above:
r27451 --> open-mpi/ompi@d59034e6ef
r27456 --> open-mpi/ompi@ecdbf34937
* Add OMPI_COMMON_VERBS_FLAGS_NOT_RC, which looks for a device that
does ''not'' support RC
* Add ompi_common_verbs_find_max_inline(), and remove that code from
the openib BTL component
This commit was SVN r27393.
* Moved "check basics" sanity check from openib BTL to common/verbs
(which also allows us to have openib ''not'' include
<infiniband/driver.h>, which is a Very Good Thing)
* Add new ompi_common_verbs_qp_test() function, which tests to see
whether a device supports RC and/or UD QPs. The openib BTL now
uses this function to ensure that the device supports RC QPs.
* Rename ompi_common_verbs_find_ibv_ports() to be
ompi_common_verbs_find_ports() -- the "ibv" was redundant.
* Re-work ompi_common_verbs_find_ports() to use
ompi_common_verbs_qp_test() instead of testing for RC/UD QPs itself
* Add bunches of opal_output_verbose() to the find_ports() routine
(to help diagnose connectivity problems -- imagine running with
--mca btl_base_verbose 10; you'll see all the find_ports() test
results)
* Make ompi_common_verbs_qp_test() warn if devices/ports are supplied
in the if_include/if_exclude strings that do not exist (quite
similar to what the openib BTL does today).
* Add ompi_common_verbs_mca_register() function, which registers
common verbs MCA params. It will also register MCA param synonyms
for these MCA params to upper-level components (e.g.,
btl_<upper-level-component>_<the-mca-param>).
* common_verbs_warn_nonexistent_if: warn if
if_include/if_exclude-specified devices or ports do not exist.
This commit was SVN r27332.
We ran into a case where the OMPI SVN trunk grew a new acceptable MCA
parameter value, but this new value was not accepted on the v1.6
branch (hwloc_base_mem_bind_failure_action -- on the trunk it accepts
the value "silent", but on the older v1.6 branch, it doesn't). If you
set "hwloc_base_mem_bind_failure_action=silent" in the default MCA
params file and then accidentally ran with the v1.6 branch, every OMPI
executable (including ompi_info) just failed because hwloc_base_open()
would say "hey, 'silent' is not a valid value for
hwloc_base_mem_bind_failure_action!". Kaboom.
The only problem is that it didn't give you any indication of where
this value was being set. Quite maddening, from a user perspective.
So we changed how ompi_info handles this case. If any framework open
function returns OMPI_ERR_BAD_PARAM (either because its base MCA params
got a bad value or because one of its component register/open
functions returned OMPI_ERR_BAD_PARAM), ompi_info will stop, print out
a warning that it received an error, and then dump out the parameters
that it has received so far in the framework that had a problem.
At a minimum, this will show the user the MCA param that had an error
(it's usually the last one), and ''where it was set from'' (so that
they can go fix it).
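
The control flow described above looks roughly like the following toy model. All info_* names, the error value, and the open_rc field (which simulates the result of a framework open function) are hypothetical; this is not the ompi_info source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { INFO_OK = 0, INFO_ERR_BAD_PARAM = -5 };  /* illustrative values */

struct info_param {
    const char *name;
    const char *value;
    const char *source;   /* where the value was set from */
};

struct info_framework {
    const char              *name;
    int                      open_rc;  /* simulated result of the open function */
    const struct info_param *params;   /* params registered before open returned */
    size_t                   nparams;
};

/* Walk the frameworks in order.  On INFO_ERR_BAD_PARAM, warn, dump what
 * the failing framework registered so far, and stop.  Returns the index
 * of the failing framework, or -1 if every framework opened. */
static int info_open_all(const struct info_framework *fw, size_t n, FILE *out)
{
    size_t i, j;

    assert(fw != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (fw[i].open_rc != INFO_ERR_BAD_PARAM) {
            continue;
        }
        fprintf(out, "WARNING: framework \"%s\" reported a bad parameter\n",
                fw[i].name);
        for (j = 0; j < fw[i].nparams; j++) {
            fprintf(out, "  %s = \"%s\" (source: %s)\n",
                    fw[i].params[j].name, fw[i].params[j].value,
                    fw[i].params[j].source);
        }
        return (int) i;
    }
    return -1;
}
```

Printing the source next to each value is what lets the user find the file or environment variable that needs fixing.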
We updated ompi_info to check for O???_ERR_BAD_PARAM from each of
the framework opens. Also updated the doxygen docs in mca.h for this
O???_ERR_BAD_PARAM behavior. And we noticed that mca.h had MCA_SUCCESS
and MCA_ERR_??? codes. Why? I think we used them in exactly one
place in the code base (mca_base_components_open.c). So we deleted
those and just used the normal OPAL_* codes instead.
While we were doing this, we also cleaned up a little memory
management during ompi_info/orte-info/opal-info finalization.
Valgrind still reports a truckload of memory in use at ompi_info
termination, but it mostly looks to be components not freeing
memory/resources properly (and outside the scope of this fix).
This commit was SVN r27306.
The following Trac tickets were found above:
Ticket 3275 --> https://svn.open-mpi.org/trac/ompi/ticket/3275
some new common OpenFabrics functionality to ompi/mca/common/verbs.
Also move everything that was in ompi/mca/common/ofautils under
ompi/mca/common/verbs.
* Move ofautils -> verbs
* Add new functionality in ompi/mca/common/verbs (see doxygen
comments in ompi/mca/common/verbs/common_verbs.h for details):
* ompi_common_verbs_find_ibv_ports()
* ompi_common_verbs_port_bw()
* ompi_common_verbs_mtu()
* '''If you're writing verbs-based code, you should be using this
common functionality'''
* Adapt openib BTL to use some trivial common functionality in
common/verbs
* Don't use "#ifdef OMPI_HAVE_RDMAOE", use
"#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)"
* Update the following to include/link against common/verbs
* bcol/iboffload
* sbgp/ibnet
* btl/openib
This commit was SVN r27212.
that causes MPI jobs to abort if there is not enough registered memory
available (vs. just warning).
This commit was SVN r27140.
The following Trac tickets were found above:
Ticket 3258 --> https://svn.open-mpi.org/trac/ompi/ticket/3258
- OMPI_SUCCESS
- OMPI_ERROR
- OMPI_ERR_RESOURCE_BUSY
If an "OMPI_ERR_OUT_OF_RESOURCE" occurs, the request is added to the pending list, and will be handled later. An error message
should not be printed to the user in this case. This is not an error, but rather a notification of a possible valid condition.
Only in the case of "OMPI_ERROR" should it be printed to the user.
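
The handling rule above amounts to a small dispatch on the return code. The sketch below uses made-up REQ_* codes and a hypothetical stats struct to show the intent: queue quietly on resource exhaustion, and print only for a genuine error.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative return codes for this sketch; the values are made up. */
enum { REQ_SUCCESS = 0, REQ_ERROR = -1, REQ_ERR_OUT_OF_RESOURCE = -2 };

struct req_stats {
    int pending;    /* requests queued for a later retry */
    int reported;   /* errors actually shown to the user */
};

static void req_handle_rc(struct req_stats *s, int rc)
{
    assert(s != NULL);
    switch (rc) {
    case REQ_ERR_OUT_OF_RESOURCE:
        s->pending++;   /* valid condition: retry later, stay quiet */
        break;
    case REQ_ERROR:
        fprintf(stderr, "request failed (rc=%d)\n", rc);
        s->reported++;
        break;
    default:
        break;          /* success and other codes print nothing here */
    }
}
```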
This commit was SVN r27065.
btl_openib_connect_udcm when notifying not to listen to an fd to ensure
that the main thread does not continue until the service thread has
processed the message
Adds ability to send message to openib async thread to tell it to
ignore the ERR state on a specific QP. Adds this call to udcm_module_finalize
so when we set the error state on the QP it doesn't cause the
openib async thread to abort the mpi program prematurely
Fixes trac:3161
This commit was SVN r27064.
The following Trac tickets were found above:
Ticket 3161 --> https://svn.open-mpi.org/trac/ompi/ticket/3161
receive queues value) so that we don't break the use of RDMA CM, and
therefore break RoCE.
This commit was SVN r27017.
The following SVN revision numbers were found above:
r26869 --> open-mpi/ompi@fe0e7f81df
- For now we'll use 8192 as a base value
- We leave the adjust_cq() as is
- For the long term we can work on an appropriate setting to expose through the INI file.
8K CQEs are 512K per process, which is 8MB for ppn=16
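
The memory figures above can be checked with a few lines of arithmetic. The 64-byte CQE size is an assumption inferred from the numbers (512K divided by 8192 entries), not a value stated in the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total CQ memory on a node: entries * bytes per entry * processes.
 * The asserts guard against size_t overflow in the multiplications. */
static size_t cq_bytes_per_node(size_t cqes, size_t cqe_size, size_t ppn)
{
    size_t per_proc;

    assert(cqe_size == 0 || cqes <= SIZE_MAX / cqe_size);
    per_proc = cqes * cqe_size;
    assert(ppn == 0 || per_proc <= SIZE_MAX / ppn);
    return per_proc * ppn;
}
```

With 8192 entries of an assumed 64 bytes each, one process uses 524288 bytes (512K), and 16 processes per node use 8388608 bytes (8MB).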
This commit was SVN r26877.
ibv_get_device_list_compat() and not finding it, I finally realized
that it was a function in OMPI. So let's name it with a proper ompi_
prefix, not an ibv_ prefix.
This commit was SVN r26867.