This commit fixes two nasty races:
- One can occur if the connection request message and connection completion
message arrive out of order. This can happen normally when adaptive routing
is used and also in a timeout situation where a UD message is lost.
- One occurs when handling an ack at the same time as we are handling the
message timeout. In this case we cannot free the message, or the timeout
handler will operate on invalid data. This fix is a band-aid until I can come
up with a better approach. Instead of freeing the message, it is marked
as inactive and the event callback is triggered immediately (this has no
effect if the callback is already active). The callback then frees the
message if it is inactive.
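For illustration, a rough sketch of the band-aid (names, types, and the
event call are illustrative, not the actual udcm code):

    /* Illustrative sketch only -- not the real udcm symbols. */
    static void handle_ack (udcm_sent_message_t *msg)
    {
        /* Freeing msg here races with the timeout handler.  Mark it
         * inactive and let the event callback do the free instead. */
        msg->active = false;
        event_active (&msg->event, EV_TIMEOUT, 1); /* no-op if already pending */
    }

    static void message_callback (int fd, short flags, void *ctx)
    {
        udcm_sent_message_t *msg = (udcm_sent_message_t *) ctx;
        if (!msg->active) {
            free (msg); /* safe: the timeout no longer references it */
        }
    }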
cmr=v1.8.1:reviewer=pasha
This commit was SVN r31305.
finalize.
Closes trac:4290
cmr=v1.7.5:reviewer=miked
This commit was SVN r30854.
The following Trac tickets were found above:
Ticket 4290 --> https://svn.open-mpi.org/trac/ompi/ticket/4290
- Fix several typos in osc/rdma.
- Fix a locking issue in osc/sm that was caused by an incorrect
assumption about the semantics of opal_atomic_add_32.
- Always unlock the accumulation lock in osc/sm.
- The base of a process's shared memory window should be NULL if
its size is zero. Fixed.
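For context on the atomics issue: an atomic add primitive can return either
the value before the addition (fetch-then-add) or the value after it
(add-then-fetch), and a lock built on the wrong assumption is off by one.
A minimal sketch in C11 atomics (not the osc/sm code itself):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic int32_t next_ticket;

    static int32_t take_ticket (void)
    {
        /* atomic_fetch_add returns the value *before* the addition;
         * an add-then-fetch primitive would return that value + 1. */
        return atomic_fetch_add (&next_ticket, 1);
    }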
cmr=v1.7.5:ticket=trac:4304
This commit was SVN r30853.
The following Trac tickets were found above:
Ticket 4304 --> https://svn.open-mpi.org/trac/ompi/ticket/4304
1. Changed rng_buff_t --> opal_rng_buff_t
2. All global variables obey the prefix rule
3. Old code has been removed
4. Found a couple of unnecessary includes
Refs trac:4298
This commit was SVN r30807.
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
The error was caused by leaving the pipe to the async thread uninitialized, then writing to it regardless of this.
The fix is to check the existence of the async thread and the pipe to it.
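Roughly, the guard looks like this (field names are illustrative, from
memory, and may not match the openib component exactly):

    /* Only write the command if the async thread was started and its
     * pipe was actually opened. */
    if (component.async_thread_started && component.async_pipe[1] >= 0) {
        if (write (component.async_pipe[1], &cmd, sizeof (cmd)) !=
            sizeof (cmd)) {
            BTL_ERROR (("failed to write to async thread pipe"));
        }
    }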
reviewed by miked
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30644.
flexible members.
UDCM is ready to go for 1.7.4 with this patch.
cmr=v1.7.4:ticket=3940
This commit was SVN r29861.
The following Trac tickets were found above:
Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940
Originally udcm acks used the immediate data to indicate which message was
being acknowledged. This data was (mysteriously) junk when using QLogic HCAs, so I
updated udcm to use the source info (slid, qp, etc.) to determine which message was being
acked. This works as long as we don't have two messages simultaneously in flight
to a particular peer and then lose the first of the two messages. The chances of this
happening are tiny. To fix this case I updated the udcm message header to include
a pointer to the in-flight message. This pointer is then sent back to the sending
process to ack receipt.
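A sketch of the idea (the real udcm header differs in detail):

    /* Hypothetical header layout for illustration. */
    typedef struct udcm_msg_hdr {
        uint8_t   type;        /* request / complete / ack */
        /* ... connection data ... */
        uintptr_t sent_msg;    /* sender's pointer to its in-flight message */
    } udcm_msg_hdr_t;

    /* The receiver echoes hdr->sent_msg back in its ack, so the sender
     * can ack exactly that message even with two messages in flight. */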
cmr=v1.7.4:ticket=trac:3940
This commit was SVN r29775.
The following Trac tickets were found above:
Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940
This commit updates the udcm cpc to support xrc. The steps followed by udcm
mimic those in the removed xoob cpc. This update has been tested with both XRC
and RC.
Mellanox, this is intended to go into 1.7.4. Please review carefully and let
me know if there are any issues.
cmr=v1.7.4:reviewer=miked
This commit was SVN r29767.
r29479.
This fixes some issues reported a while ago in the openib btl. There
are a couple more unchecked mallocs but they are a bit more difficult
to fix since they are in void functions (btl_openib_endpoint.c).
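For reference, the checked-malloc pattern in a void function can only
report and bail out early; a rough sketch (BUF_SIZE is illustrative):

    #include <stdlib.h>

    static void example_void_fn (void)
    {
        char *buf = malloc (BUF_SIZE);
        if (NULL == buf) {
            BTL_ERROR (("malloc failed")); /* report ... */
            return;                        /* ... and bail; no status to return */
        }
        /* ... use buf ... */
        free (buf);
    }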
Refs trac:2401.
cmr=v1.7.4:reviewer=miked
This commit was SVN r29543.
The following SVN revision numbers were found above:
r29479 --> open-mpi/ompi@d6ead2a3a5
The following Trac tickets were found above:
Ticket 2401 --> https://svn.open-mpi.org/trac/ompi/ticket/2401
Turns out that AC_CHECK_DECLS is one of the "new style" Autoconf
macros that #defines the output to be 0 or 1 (vs. #define'ing or
#undef'ing it). So don't check for "#if defined(..."; just check for
"#if ...".
This commit was SVN r29059.
The following Trac tickets were found above:
Ticket 3730 --> https://svn.open-mpi.org/trac/ompi/ticket/3730
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***
Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.
***************************************************************************************
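The macro is roughly this shape (see ompi/mca/rte/rte.h for the
authoritative definition):

    #define OMPI_WAIT_FOR_COMPLETION(flag)  \
        do {                                \
            while ((flag)) {                \
                opal_progress();            \
            }                               \
        } while (0)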
I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.
The code is in https://bitbucket.org/rhc/ompi-oob2
WHAT: Rewrite of ORTE OOB
WHY: Support asynchronous progress and a host of other features
WHEN: Wed, August 21
SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:
* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)
* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients
* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort
* only one transport (i.e., component) can be "active"
The revised OOB resolves these problems:
* async progress is used for all application processes, with the progress thread blocking in the event library
* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")
* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC (see the sketch after this list).
* a message that arrives on one TCP NIC is automatically shifted to whatever NIC is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.
* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object
* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions
* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel
* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport
* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active
* all blocking send/recv APIs have been removed. Everything operates asynchronously.
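The sketch referenced above: reachability by address/mask comparison,
reduced to IPv4 for illustration (not the actual oob/tcp code):

    #include <stdbool.h>
    #include <stdint.h>

    /* A peer address is reachable through a local interface if both
     * fall within the same subnet. */
    static bool same_subnet (uint32_t local_addr, uint32_t peer_addr,
                             uint32_t netmask)
    {
        return (local_addr & netmask) == (peer_addr & netmask);
    }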
KNOWN LIMITATIONS:
* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline
* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker
* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways
* obviously, not every error path has been tested nor necessarily covered
* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior, only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways
* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC
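The per-peer sequencing could look roughly like this (types are
illustrative, not the actual RML code):

    /* One of these per peer on each side. */
    typedef struct {
        uint32_t    next_to_send;    /* stamped on each outgoing message */
        uint32_t    next_to_deliver; /* next number the receiver will deliver */
        opal_list_t held;            /* out-of-order arrivals parked here */
    } rml_peer_seq_t;

    /* Receiver: deliver only if msg->seq == next_to_deliver; otherwise
     * park it in 'held' and drain that list as the gap fills. */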
This commit was SVN r29058.
Commit r27211 added ifdef checks for #define
HAVE_IBV_LINK_LAYER_ETHERNET, which is incorrect. The correct #define
is HAVE_DECL_IBV_LINK_LAYER_ETHERNET. This broke OMPI over iWARP.
This fixes trac:3726 and should be added to cmr:v1.7.3:reviewer=jsquyres
This commit was SVN r29053.
The following SVN revision numbers were found above:
r27211 --> open-mpi/ompi@b27862e5c7
The following Trac tickets were found above:
Ticket 3726 --> https://svn.open-mpi.org/trac/ompi/ticket/3726
* add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit.
* remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL"
* modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded
* removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base
* added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames
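The lookup is along these lines (the environment variable names shown are
illustrative of the approach):

    #include <stdlib.h>

    static int lookup_num_nodes (void)
    {
        char *val;
        if (NULL != (val = getenv ("SLURM_NNODES"))) {   /* SLURM */
            return atoi (val);
        }
        if (NULL != (val = getenv ("PBS_NUM_NODES"))) {  /* PBS/Torque */
            return atoi (val);
        }
        return -1;  /* unknown RM: assume max daemons, retain hostnames */
    }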
This commit was SVN r29052.
This creates really bad scaling behavior. Users have found a nearly 20% launch-time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RMs like Slurm and Alps, but we make things worse by calling "get" so many times.
Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:
* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally (see the sketch below)
* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test
* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).
Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it
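The caching change amounts to something like this (helper names are made
up; this is the shape of the logic, not the actual db/pmi code):

    static int db_pmi_fetch (int peer, const char *key, void **data)
    {
        if (!peer_is_cached (peer)) {
            char *blob = pmi_get_blob (peer);    /* one expensive PMI get */
            decode_and_store_all (peer, blob);   /* push into the db hash */
            mark_cached (peer);
        }
        /* every subsequent retrieval is fulfilled locally */
        return db_hash_fetch (peer, key, data);
    }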
Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.
This commit was SVN r29040.
The primary issue with udcm is that the immediate data in message
acks was often bogus. This caused the sender to keep trying even
though a message was received and acked. The fix is to use the
source LID and QP to determine which message is being acked. In
most cases this should work well since only one message will be
in flight to any peer.
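Illustratively (helper names are made up):

    /* With at most one message in flight per peer, the ack's source
     * (slid, qpn) identifies the message being acknowledged. */
    static udcm_sent_message_t *match_ack (uint16_t slid, uint32_t qpn)
    {
        udcm_sent_message_t *msg;
        for (msg = in_flight_first (); NULL != msg;
             msg = in_flight_next (msg)) {
            if (msg->peer_lid == slid && msg->peer_qpn == qpn) {
                return msg;
            }
        }
        return NULL;  /* stale or duplicate ack */
    }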
This commit was SVN r28444.
Notes:
- This commit also eliminates the need for an available components list in use
in several frameworks. None of the code in question was making use of the
priority field of the priority component list item, so these extra lists were
removed.
- Cleaned up selection code in several frameworks to sort lists using
opal_list_sort (see the sketch below).
- Cleaned up the ompi/orte-info functions. Exposed the functions that construct the
list of params so they can be used elsewhere.
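The sketch mentioned above: sorting a component list by priority (the item
type is illustrative; the comparator shape is from opal/class/opal_list.h
as I recall it):

    static int compare_priority (opal_list_item_t **a, opal_list_item_t **b)
    {
        component_item_t *ca = (component_item_t *) *a;
        component_item_t *cb = (component_item_t *) *b;
        return cb->priority - ca->priority;  /* descending: highest first */
    }

    /* ... */
    opal_list_sort (&components, compare_priority);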
patches for mtl/portals4 from brian
missed a few output variables in openib
This commit was SVN r28241.
Features:
- Support for an override parameter file (openmpi-mca-param-override.conf).
Variable values in this file cannot be overridden by any file or environment
value.
- Support for boolean, unsigned, and unsigned long long variables.
- Support for true/false values.
- Support for enumerations on integer variables.
- Support for MPIT scope, verbosity, and binding.
- Support for command line source.
- Support for setting variable source via the environment using
OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
- Cleaner API.
- Support for variable groups (equivalent to MPIT categories).
Notes:
- Variables must be created with a backing store (char **, int *, or bool *)
that must live at least as long as the variable (see the sketch after these
notes).
- Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
mca_base_var_set_value() to change the value.
- String values are duplicated when the variable is registered. It is up to
the caller to free the original value if necessary. The new value will be
freed by the mca_base_var system and must not be freed by the user.
- Variables with constant scope may not be settable.
- Variable groups (and all associated variables) are deregistered when the
component is closed or the component repository item is freed. This
prevents a segmentation fault from accessing a variable after its component
is unloaded.
- After some discussion we decided we should remove the automatic registration
of component priority variables. Few components actually made use of this
feature.
- The enumerator interface was updated to be general enough to handle
future uses of the interface.
- The code to generate ompi_info output has been moved into the MCA variable
system. See mca_base_var_dump().
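The sketch mentioned in the notes: registering an integer variable with a
backing store (argument order is from memory; see mca_base_var.h for the
authoritative signature):

    /* The backing store must live at least as long as the variable. */
    static int example_count = 10;

    (void) mca_base_var_register ("ompi", "btl", "example", "count",
                                  "How many examples to run",
                                  MCA_BASE_VAR_TYPE_INT, NULL, 0,
                                  MCA_BASE_VAR_FLAG_SETTABLE,
                                  OPAL_INFO_LVL_5,
                                  MCA_BASE_VAR_SCOPE_LOCAL,
                                  &example_count);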
opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system
This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.
This commit was SVN r28236.
ompi_show_help, because opal_show_help is replaced with an
aggregating version when using ORTE, so there's no reason to
directly call orte_show_help.
This commit was SVN r28051.
some new common OpenFabrics functionality to ompi/mca/common/verbs.
Also move everything that was in ompi/mca/common/ofautils under
ompi/mca/common/verbs.
* Move ofautils -> verbs
* Add new functionality in ompi/mca/common/verbs (see doxygen
  comments in ompi/mca/common/verbs/common_verbs.h for details):
  * ompi_common_verbs_find_ibv_ports()
  * ompi_common_verbs_port_bw()
  * ompi_common_verbs_mtu()
* '''If you're writing verbs-based code, you should be using this
  common functionality'''
* Adapt openib BTL to use some trivial common functionality in
common/verbs
* Don't use "#ifdef OMPI_HAVE_RDMAOE",use
"#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)"
* Update the following to include/link against common/verbs
* bcol/iboffload
* sbgp/ibnet
* btl/openib
This commit was SVN r27212.
btl_openib_connect_udcm when notifying not to listen to an fd to ensure
that the main thread does not continue until the service thread has
processed the message
Adds the ability to send a message to the openib async thread to tell it to
ignore the ERR state on a specific QP. Adds this call to udcm_module_finalize
so when we set the error state on the QP it doesn't cause the
openib async thread to abort the MPI program prematurely.
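The command to the async thread is roughly of this shape (struct and
constant names are illustrative, not the actual openib code):

    /* Written down the async thread's pipe by udcm_module_finalize so the
     * thread drops the QP from its error watch before we move it to ERR. */
    typedef struct {
        int            cmd;  /* e.g. ASYNC_CMD_IGNORE_QP_ERR */
        struct ibv_qp *qp;   /* the QP whose ERR state should be ignored */
    } async_command_t;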
Fixes trac:3161
This commit was SVN r27064.
The following Trac tickets were found above:
Ticket 3161 --> https://svn.open-mpi.org/trac/ompi/ticket/3161