openmpi

Author	SHA1	Message	Date
George Bosilca	e361bcb64c	Send optimizations. 1. The send path gets shorter. The BTL is allowed to return > 0 to specify that the descriptor was pushed to the networks, and that the memory attached to it is available again for the upper layer. The MCA_BTL_DES_SEND_ALWAYS_CALLBACK flag can be used by the PML to force the BTL to always trigger the callback. Unmodified BTLs will continue to work as expected, as they will return OMPI_SUCCESS, which forces the PML to have exactly the same behavior as before. Some BTLs have been modified: self, sm, tcp, mx. 2. Add send immediate interface to BTL. The idea is to have a mechanism of allowing the BTL to take advantage of send optimizations such as the ability to deliver data "inline". Some network APIs such as Portals allow data to be sent using a "thin" event without packing data into a memory descriptor. This interface change allows the BTL to use such capabilities and allows for other optimizations in the future. All existing BTLs except for Portals and sm have this interface set to NULL. This commit was SVN r18551.	2008-05-30 03:58:39 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not their opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active.
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Jeff Squyres	ba5615a18f	Merge in /tmp-public/cpc3 branch to trunk. oob/xoob still remains the default CPC. This commit was SVN r18356.	2008-05-02 11:52:33 +00:00
Ralph Castain	fa082cafa9	Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex. Note: this is an early preliminary step in the movement of portions of the datatype engine to the opal layer. This commit was SVN r18198.	2008-04-17 20:43:56 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Pavel Shamis	54ad8d7446	The issue was reported/fixed by Jon Mason one month ago but the fix was not committed. So I'm committing it now. This commit was SVN r17835.	2008-03-17 11:13:06 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer. This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Gleb Natapov	60c151608c	Set flags inside fragment allocation function. This commit was SVN r17508.	2008-02-19 12:26:45 +00:00
George Bosilca	fa31ec81d0	Add the ownership flags to the PML/BTL interface. The layer owning the descriptor is responsible for releasing it once the descriptor is not in use anymore. This commit was SVN r17497.	2008-02-18 17:39:30 +00:00
Gleb Natapov	c9a1b06771	Remove trailing whitespaces. No code changes in this commit. This commit was SVN r17167.	2008-01-21 12:11:18 +00:00
Pavel Shamis	add4d9df8a	XRC fixes for MPI2 dynamics. This commit was SVN r17144.	2008-01-15 21:14:48 +00:00
George Bosilca	6310ce955c	The first patch related to the Active Message stuff. So far, here is what we have: - the registration array is now global instead of one per BTL. - each framework has to declare the entries in the registration array reserved. Then it has to define the internal way of sharing (or not) these entries between all components. As an example, the PML will not share as there is only one active PML at any moment, while the BTLs will have to. The tag is 8 bits long, the first 3 are reserved for the framework while the remaining 5 are used internally by each framework. - The registration function is optional. If a BTL does not provide such a function, nothing happens. However, in the case where such a function is provided in the BTL structure, it will be called by the BML when a tag is registered. Now, it's time for the second step... Converting OB1 from a switch based PML to an active message one. This commit was SVN r17140.	2008-01-15 05:32:53 +00:00
Jon Mason	a0d4122606	The new cpc selection framework is now in place. The patch below allows for dynamic selection of cpc methods based on what is available. It also allows for inclusion/exclusions of methods. It even further allows for modifying the priorities of certain cpc methods to better determine the optimal cpc method. This patch also contains XRC compile time disablement (per Jeff's patch). At a high level, the cpc selection works by walking through each cpc and allowing it to test to see if it is permissible to run on this mpirun. It returns a priority if it is permissible or a -1 if not. All of the cpc names and priorities are rolled into a string. This string is then encapsulated in a message and passed around all the ompi processes. Once received and unpacked, the list received is compared to a local copy of the list. The connection method is chosen by comparing the lists passed around to all nodes via modex with the list generated locally. Any non-negative number is a potentially valid connection method. The method below of determining the optimal connection method is to take the cross-section of the two lists. The highest single value (and the other side being non-negative) is selected as the cpc method. svn merge -r 16948:17128 https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/ . This commit was SVN r17138.	2008-01-14 23:22:03 +00:00
Jon Mason	626e0814a2	Style clean-up. This commit was SVN r17126.	2008-01-12 18:47:17 +00:00
Pavel Shamis	99f51482e3	Fixing openib finalization flow. This commit was SVN r17085.	2008-01-09 12:36:30 +00:00
Gleb Natapov	621fa223c5	Create free lists of fragments per HCA, not per BTL. Saves memory in case of multiple LMCs. This commit was SVN r17082.	2008-01-09 10:26:21 +00:00
Gleb Natapov	5ce3213158	Rearrange functions order so that functions are defined before they are used. No code changes here. This commit was SVN r17081.	2008-01-09 10:05:41 +00:00
Gleb Natapov	c3bbf69356	Set send_flags correctly in btl_openib_put. Otherwise we may reuse flags from previous use of the buffer and they may be incorrect. This commit was SVN r17058.	2008-01-07 10:19:07 +00:00
Gleb Natapov	2fb6947f88	Destroy endpoints that use eager rdma communication before destroying SRQ. Don't skip async event thread destruction if SRQ was not destroyed, or it will segfault on module removal. This commit was SVN r17025.	2007-12-23 13:58:31 +00:00
Gleb Natapov	b06d92bdab	OpenIB BTL has three channels through which data can be received (eager rdma, high prio QPs and low prio QPs), and because not all of them are polled each time progress() is called (to save on latency), starvation is possible. The commit fixes this. Now each channel is polled, but higher priority channels are polled more often. Three new parameters are introduced that control polling ratios between different channels. This commit was SVN r17024.	2007-12-23 12:29:34 +00:00
George Bosilca	906e8bf1d1	Replace the ompi_pointer_array with opal_pointer_array. In the next step (sometime after the merge with the ORTE branch), the opal_pointer_array will become the only pointer_array implementation (the orte_pointer_array will be removed). This commit was SVN r17007.	2007-12-21 06:02:00 +00:00
Gleb Natapov	2a59b2a68f	1. Set segments length in prepare_src() after packing because actual size may be smaller than allocated size. 2. If reserve is zero don't allocate coalesced frag since it will be RDMAed, not sent. The logic was the other way around. This commit was SVN r16928.	2007-12-11 13:10:52 +00:00
Gleb Natapov	17611dafbe	Fix pointer casting on 32bit machines. This commit was SVN r16907.	2007-12-09 14:15:35 +00:00
Gleb Natapov	2f9c5b46cf	Return OMPI_ERR_RESOURCE_BUSY from openib_btl_send() if fragment is not on wire. This commit was SVN r16906.	2007-12-09 14:14:11 +00:00
Gleb Natapov	493951e09d	Add heterogeneous support to message coalescing. This commit was SVN r16903.	2007-12-09 14:10:25 +00:00
Gleb Natapov	b4698dc6df	Use flags provided during allocation to coalesce to correct priority queue. This commit was SVN r16902.	2007-12-09 14:08:55 +00:00
Gleb Natapov	e2e211f23b	Add flags parameter to btl_alloc() and btl_prepare_src() functions. If BTL knows at the time of allocation priority of a descriptor it may do some optimizations. This commit was SVN r16901.	2007-12-09 14:08:01 +00:00
Gleb Natapov	5313a2baa7	Message coalescing for openib BTL. If fragment is waiting to be transmitted in a pending queue pack another message into it if there is enough space there. This commit was SVN r16900.	2007-12-09 14:05:13 +00:00
Gleb Natapov	7302cd24eb	Call btl_alloc() from btl_prepare_src() to have one point of frag allocation. This commit was SVN r16899.	2007-12-09 14:02:32 +00:00
Gleb Natapov	7364b7cf47	Add endpoint parameter to btl_alloc() function. Enables various optimizations inside BTL. This commit was SVN r16898.	2007-12-09 14:00:42 +00:00
Pavel Shamis	57728986f8	Fixing XRC multiport/multisubnet support. This commit was SVN r16819.	2007-12-03 09:49:53 +00:00
Gleb Natapov	a774cd98f8	Put send completions to low prio CQ. Receive is more important. This commit was SVN r16817.	2007-12-02 14:46:37 +00:00
Pavel Shamis	8aca6eb31b	OFED 1.3 doesn't implement ibv_resize_cq for connectX. On error exit from ibv_resize_cq we should check if the function is implemented. This commit was SVN r16799.	2007-11-28 15:23:19 +00:00
Pavel Shamis	3e2e4f6d2a	Removing unused lid. This commit was SVN r16794.	2007-11-28 10:06:57 +00:00
Pavel Shamis	aa79bdabc8	Removing port_touse - we don't really need it. This commit was SVN r16793.	2007-11-28 09:57:48 +00:00
Pavel Shamis	2ffbe8776a	Fixing compilation problems in openib. This commit was SVN r16792.	2007-11-28 09:38:49 +00:00
Gleb Natapov	218adb2a96	Account for eager rdma credit fragments when creating send queue. Create XRC receive QP with zero receive and send queue length. We are not going to use this QP for sends, and receives are posted to SRQs. This commit was SVN r16791.	2007-11-28 07:22:01 +00:00
Gleb Natapov	601952a952	Don't share endpoint->qps array, only pointer to actual QP. Calculate send queue size for shared QP based on all endpoints that want to use it. This commit was SVN r16790.	2007-11-28 07:21:07 +00:00
Gleb Natapov	b46c9cc7bc	Make xrc use srq_qp unions instead of the xrc_qp which is exactly like srq_qp. This commit was SVN r16789.	2007-11-28 07:20:26 +00:00
Gleb Natapov	bd47da4699	Initial XRC support by Mellanox. This commit was SVN r16787.	2007-11-28 07:18:59 +00:00
Gleb Natapov	923666b75c	Process pending put/get frags on endpoint connection establishment. This commit was SVN r16785.	2007-11-28 07:16:52 +00:00
Gleb Natapov	5a4e953aaa	Allow sharing the same qp for different buffer sizes. Needed for XRC support. This commit was SVN r16783.	2007-11-28 07:15:20 +00:00
Gleb Natapov	b123696d57	Fix async thread creation and destruction. Create async thread only when it is needed instead of creating it and then canceling if it is not needed. Change error handling during finalize so that it will not skip async thread destruction. Otherwise async thread may segfault during openib module unloading. This commit was SVN r16782.	2007-11-28 07:14:34 +00:00
Gleb Natapov	a9f864d15c	If there is an eager rdma credit, but there is no WQE to send a packet we add it to a pending queue of eager rdma QP instead of correct pending list. This patch fixes this by getting rid of "eager rdma qp" notion. Packet is always sent over its order QP. The patch also adds two pending queues for high and low prio packets. Only high prio packets are sent over eager RDMA channel. This commit was SVN r16780.	2007-11-28 07:12:44 +00:00
Gleb Natapov	6a2d210b7d	Use OMPI object system to make fragment hierarchy more object oriented. The main idea (apart from cleanup) is to save on initialisation of unneeded fields and to use C type checking system to catch obvious errors. This commit was SVN r16779.	2007-11-28 07:11:14 +00:00
Gleb Natapov	267cd2342a	Cleanup. Remove unused functions. This commit was SVN r16778.	2007-11-28 07:08:56 +00:00
Gleb Natapov	3a63eb6c17	Cleanup macro definitions. This commit was SVN r16554.	2007-10-23 13:33:19 +00:00
Jeff Squyres	94b1e9cff9	Update to use BTL_VERBOSE and BTL_ERROR instead of opal_output'ing to the mca_btl_base_output stream directly (and relying on it to be -1 if we didn't want any output). This commit was SVN r16449.	2007-10-15 17:53:02 +00:00
Gleb Natapov	60af46d541	We have QP description in component structure, module structure and endpoint. Each one of them has a field to store QP type, but this is redundant. Store qp type only in one structure (the component one). This commit was SVN r16272.	2007-09-30 16:14:17 +00:00
Gleb Natapov	c7105eadc7	Update Voltaire copyright. This commit was SVN r16189.	2007-09-24 10:11:52 +00:00


146 Commits