openmpi

Author	SHA1	Message	Date
Nathan Hjelm	bc55276844	osc/rdma: fix bug in the active message code that could cause erroneous results. The code to handle completion messages did not correctly increment the number of expected messages. This could cause wait to return before all incoming messages are complete. I also added a check to ensure that start returns an error if we are in a passive access epoch. cmr=v1.8:reviewer=jsquyres This commit was SVN r31203.	2014-03-25 15:28:36 +00:00
Nathan Hjelm	0ed44f2fdb	osc/rdma: add support for datatypes with large descriptions. This commit adds large datatype description support to the osc/rdma component. Support is provided by an additional send/recv of the datatype description if the description does not fit in an eager buffer. The code is designed to require minimal new code and not for speed. We consider this code path to be a slow path. Refs trac:1905 cmr=v1.8:reviewer=jsquyres This commit was SVN r31197. The following Trac tickets were found above: Ticket 1905 --> https://svn.open-mpi.org/trac/ompi/ticket/1905	2014-03-24 18:57:29 +00:00
Nathan Hjelm	e70809e169	osc/rdma: fix the spelling of incoming cmr=v1.7.5:ticket=trac:4379 This commit was SVN r31050. The following Trac tickets were found above: Ticket 4379 --> https://svn.open-mpi.org/trac/ompi/ticket/4379	2014-03-12 21:43:23 +00:00
Nathan Hjelm	1fc9a55d08	osc/rdma: do not use MPI_SOURCE to determine the peer in a send operation. This fixes a bug in r31029 which removes the use of the pml base request (also not a good way since cm doesn't use the base request). We now allocate a data structure (ugh) to determine the needed information. Tested with mtt/onesided. cmr=v1.7.5:ticket=trac:4379 This commit was SVN r31044. The following SVN revision numbers were found above: r31029 --> open-mpi/ompi@29e00f9161 The following Trac tickets were found above: Ticket 4379 --> https://svn.open-mpi.org/trac/ompi/ticket/4379	2014-03-12 17:14:11 +00:00
Nathan Hjelm	29e00f9161	osc/rdma: fix issues with mpi_leave_pinned when using rdma capable btls. It seems we can't release accumulate buffers in completion callbacks because the btls don't release registration resources until after the callback has fired. The fix is to keep track of the unused buffers and free them later. This should resolve issues when running IMB-EXT and IMB-RMA. cmr=v1.7.5:reviewer=jsquyres This commit was SVN r31029.	2014-03-12 14:39:03 +00:00
Nathan Hjelm	cbb531ed13	osc/rdma: use OPAL_ALIGN macro cmr=v1.7.5:ticket=trac:4357 This commit was SVN r30975. The following Trac tickets were found above: Ticket 4357 --> https://svn.open-mpi.org/trac/ompi/ticket/4357	2014-03-10 18:57:20 +00:00
Nathan Hjelm	5df8cd75a9	osc/rdma: ensure fragment headers and the packed datatype are 8-byte aligned. The datatype unpacking code assumes that the packed datatype buffer has the same alignment as an OPAL_PTRDIFF_TYPE. This was not enforced by the rdma one-sided component. I changed the ordering and sizes of various osc/rdma headers to ensure their sizes are a multiple of 8 bytes and modified the fragment allocation call to ensure all headers are 8-byte aligned. While not the cleanest way to handle this situation, it should resolve the issue. Fixes trac:4315 cmr=v1.7.5:reviewer=jsquyres This commit was SVN r30974. The following Trac tickets were found above: Ticket 4315 --> https://svn.open-mpi.org/trac/ompi/ticket/4315	2014-03-10 18:11:22 +00:00
Nathan Hjelm	85515f2587	osc/rdma: silence warning cmr=v1.7.5:ticket=trac:4355 This commit was SVN r30970. The following Trac tickets were found above: Ticket 4355 --> https://svn.open-mpi.org/trac/ompi/ticket/4355	2014-03-10 16:11:25 +00:00
Yossi Etigin	b04a2339c5	Fix segmentation fault when osc_rdma is used with pml_cm: osc_rdma assumes the send request is derived from mca_pml_base_send_request_t, but this is not true for pml cm, so we end up freeing an invalid pointer. We cannot take the data pointer from the pml send request, so we pass the allocated buffer pointer in req_complete_cb_data, and put the osc_rdma_module pointer in that buffer as well. Previously, osc_pt2pt was used with pml_cm, which didn't have this problem. cmr=v1.7.5:reviewer=ompi-rm1.7 This commit was SVN r30967.	2014-03-10 15:21:37 +00:00
Nathan Hjelm	5a4037df4f	osc/rdma: fix typo in rdma osc component. cmr=v1.7.5:reviewer=jsquyres This commit was SVN r30931.	2014-03-04 16:57:56 +00:00
Nathan Hjelm	acbd6032f9	Helps to include the correct header. cmr=v1.7.5:ticket=trac:4304 This commit was SVN r30821. The following Trac tickets were found above: Ticket 4304 --> https://svn.open-mpi.org/trac/ompi/ticket/4304	2014-02-25 19:14:48 +00:00
Nathan Hjelm	5edacac301	osc/rdma: add missing include cmr=v1.7.5:ticket=trac:4304 This commit was SVN r30820. The following Trac tickets were found above: Ticket 4304 --> https://svn.open-mpi.org/trac/ompi/ticket/4304	2014-02-25 19:11:19 +00:00
Ralph Castain	49d938de29	Merge one-sided updates to the trunk - written by Brian Barrett and Nathan Hjelm cmr=v1.7.5:reviewer=hjelmn:subject=Update one-sided to MPI-3 This commit was SVN r30816.	2014-02-25 17:36:43 +00:00
Brian Barrett	16a1166884	Remove the proc_pml and proc_bml fields from ompi_proc_t and replace with a configure-time dynamic allocation of flags. The net result for platforms which only support BTL-based communication is a reduction of 8*nprocs bytes per process. Platforms which support both MTLs and BTLs will not see a space reduction, but will now be able to safely run both the MTL and BTL side-by-side, which will prove useful. This commit was SVN r29100.	2013-08-30 16:54:55 +00:00
Ralph Castain	5d1fa4fa0e	Silence warnings: osc_pt2pt_data_move.c: In function 'ompi_osc_pt2pt_sendreq_recv_accum_long_cb': osc_pt2pt_data_move.c:643:9: warning: variable 'ret' set but not used [-Wunused-but-set-variable] osc_rdma_data_move.c: In function 'ompi_osc_rdma_control_send_cb': osc_rdma_data_move.c:1312:37: warning: variable 'header' set but not used [-Wunused-but-set-variable] This commit was SVN r29092.	2013-08-29 20:56:36 +00:00
George Bosilca	dc9352faf6	Remove some unused variables. This commit was SVN r28726.	2013-07-05 13:31:54 +00:00
Nathan Hjelm	9d4a26f47d	Update OMPI frameworks to use the MCA framework system. Notes: - This commit also eliminates the need for an available components list in use in several frameworks. None of the code in question was making use of the priority field of the priority component list item so these extra lists were removed. - Cleaned up selection code in several frameworks to sort lists using opal_list_sort. - Cleans up the ompi/orte-info functions. Expose the functions that construct the list of params so they can be used elsewhere. patches for mtl/portals4 from brian missed a few output variables in openib This commit was SVN r28241.	2013-03-27 21:17:31 +00:00
Nathan Hjelm	249066e06d	Timeout! Per RFC update the BTL interface to hide segment keys. All BTLs (with the exception of wv), all relevant PMLs, and osc/rdma have been updated for the new interface. This commit was SVN r26626.	2012-06-21 17:09:12 +00:00
Nathan Hjelm	8962ce25b0	fixed some compiler errors caused by seg_key changes. osc/rdma may need to be updated to use btls that use 128 bit segment keys This commit was SVN r25448.	2011-11-06 20:19:14 +00:00
Terry Dontje	fbda6aaf89	Fixes trac:2532 issues with 32-bit binaries This commit was SVN r24891. The following Trac tickets were found above: Ticket 2532 --> https://svn.open-mpi.org/trac/ompi/ticket/2532	2011-07-13 16:38:03 +00:00
Shiqing Fan	1ed0f40d35	Fix a few type casts on Windows. This commit was SVN r24857.	2011-07-06 08:08:53 +00:00
Brian Barrett	a4b2bd903b	* Implement long-ago discussed RFC to add a callback data pointer in the request completion callback * Use the completion callback pointer to remove all need for opal_progress calls in the one-sided layer This commit was SVN r24848.	2011-06-30 20:05:16 +00:00
Eugene Loh	2770a12beb	Continue clean up of thread options started in r22841, 22842, and 22849. No need for any CMRs to 1.5... that was already done in CMR 2728. This commit was SVN r24545. The following SVN revision numbers were found above: r22841 --> open-mpi/ompi@b400b84162	2011-03-18 21:36:35 +00:00
Rolf vandeVaart	91c1ee86d7	Fix for fix of fix for handling misalignment when sending onesided multifrag. This fixes trac:2532. This commit was SVN r23760. The following Trac tickets were found above: Ticket 2532 --> https://svn.open-mpi.org/trac/ompi/ticket/2532	2010-09-16 18:58:11 +00:00
Rolf vandeVaart	47940f2aa0	Fix the fix (r23649) for ticket 2532. We were neglecting to update the remain_len field for the buffer. This really fixes ticket #2532. This commit was SVN r23706. The following SVN revision numbers were found above: r23649 --> open-mpi/ompi@f42c2a737f	2010-09-01 14:12:08 +00:00
Ethan Mallove	f42c2a737f	Fixes trac:2532 - "MPI_Put can result in SIGBUS on SPARC" Reviewed by Rolf V and Brian B This commit was SVN r23649. The following Trac tickets were found above: Ticket 2532 --> https://svn.open-mpi.org/trac/ompi/ticket/2532	2010-08-24 18:10:43 +00:00
Rainer Keller	6c5532072a	- Split the datatype engine into two parts: an MPI specific part in OMPI and a language agnostic part in OPAL. The convertor is completely moved into OPAL. This offers several benefits as described in RFC http://www.open-mpi.org/community/lists/devel/2009/07/6387.php namely: - Fewer basic types (int* and float* types, boolean and wchar) - Fixing naming scheme to ompi-nomenclature. - Usability outside of the ompi-layer. - Due to the fixed nature of simple opal types, their information is completely known at compile time and therefore constified - With fewer datatypes (22), the actual sizes of bit-field types may be reduced from 64 to 32 bits, allowing reorganizing the opal_datatype structure, eliminating holes and keeping data required in convertor (upon send/recv) in one cacheline... This has implications for the convertor-datastructure and other parts of the code. - Several performance tests have been run; the netpipe latency does not change with this patch on Linux/x86-64 on the smoky cluster. - Extensive tests have been done to verify correctness (no new regressions) using: 1. mpi_test_suite on linux/x86-64 using clean ompi-trunk and ompi-ddt: a. running both trunk and ompi-ddt resulted in no differences (except that MPI_SHORT_INT and MPI_TYPE_MIX_LB_UB now run correctly). b. with --enable-memchecker and running under valgrind (one buglet when run with static found in test-suite, committed) 2. ibm testsuite on linux/x86-64 using clean ompi-trunk and ompi-ddt: all passed (except for the dynamic/ tests, which failed!! as on trunk/MTT) 3. compilation and usage of HDF5 tests on Jaguar using PGI and PathScale compilers. 4. compilation and usage on SiCortex. - Please note that for the heterogeneous case (-m32 compiled binaries/ompi), neither ompi-trunk nor the ompi-ddt branch would successfully launch. This commit was SVN r21641.	2009-07-13 04:56:31 +00:00
Greg Koenig	60485ff95f	This is a very large change to rename several #define values from OMPI_* to OPAL_*. This allows the opal layer to be used more independently from the whole of ompi. NOTE: 9 "svn mv" operations immediately follow this commit. This commit was SVN r21180.	2009-05-06 20:11:28 +00:00
Brian Barrett	7f898d4e2b	* Make rdma the default. Somehow, the code didn't match what was supposed to happen * Properly error out (rather than cause buffer overflow) in case where the datatype packed description is larger than our control fragments. This still isn't standards conforming, but at least we know what happened. * Expose win_set_name to external libraries (like the osc modules) * Set default window name to the CID of the communicator it's using for communication Refs trac:1905 This commit was SVN r21134. The following Trac tickets were found above: Ticket 1905 --> https://svn.open-mpi.org/trac/ompi/ticket/1905	2009-04-30 22:36:09 +00:00
Rainer Keller	9dea63d63a	- Last of intrusive commits (promised)... err for now. Anyway, this is blocking the move: do not include pml.h if not really needed, aka none of the following used: mca_pml MCA_PML_CALL OMPI_ANY_TAG OMPI_ANY_SOURCE OMPI_PROC_NULL - Notable exceptions (deleting in one header->adding): - ompi/mca/mtl/psm/ - ompi/mca/osc/rdma/ - ompi/mca/btl/openib/btl_openib_endpoint.c depended on pml_base_sendreq.h - Tested on Linux/x86-64, this time including make check (thanks Jeff and Ralph) This commit was SVN r20725.	2009-03-04 17:06:51 +00:00
Terry Dontje	0178b6c45f	Added padding to predefined handle structures to maintain library version to version compatibility. This commit was SVN r20627.	2009-02-24 17:17:33 +00:00
Rainer Keller	d81443cc5a	- On the way to get the BTLs split out and lessen dependency on orte: Often, orte/util/show_help.h is included, although no functionality is required -- instead, most often only opal_output.h or orte/mca/rml/rml_types.h is needed. Please see orte_show_help_replacement.sh committed next. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration actually showed two missing #include "orte/util/show_help.h" in orte/mca/odls/base/odls_base_default_fns.c and in orte/tools/orte-top/orte-top.c. Manually added these. Let MTT have the last word. This commit was SVN r20557.	2009-02-14 02:26:12 +00:00
Brian Barrett	cfc400eb57	* Enable eager sending for Accumulate * If the accumulate is local, make it short-circuit the request path. Accumulate requires local ops due to its window rules, so this is likely to help a bunch (on the codes I'm messing with at least) * Do a better job at flushing everything that can go out on the wire in a resource constrained problem * Move some debugging values around to make large problems somewhat easier to deal with This commit was SVN r20277.	2009-01-14 20:15:15 +00:00
Brian Barrett	e1f40c6a71	Fixes to make the rdma osc component work again: * Don't overwrite the des_flags field, removing the all important always callback field * Fix up return status of bml_base_send, since the rest of the code expects OMPI_SUCCESS or an error code This commit was SVN r20178.	2009-01-01 23:48:29 +00:00
George Bosilca	00d24bf8ab	Scalability patch, or slim-fast effect #1. All BML structures just got a whole lot smaller, decreasing the memory footprint of the running application. How much is a good question. Here is a breakdown: - in mca_bml_base_endpoint_t: 3 * size_t + 1 * uint32_t - in mca_bml_base_btl_t: 1 * int + 1 * double - 1 * float + 6 * size_t + 9 * (void*) The decrease in mca_bml_base_endpoint_t is for each peer and the decrease in mca_bml_base_btl_t is for each BTL for each peer. So, if we consider the most convenient case where there is only one network between all peers, this decreases the memory footprint per peer by 9 * size_t + 9 * (void*) + 2 * int32_t + 1 * double - 1 * float. On a 64-bit machine this will be 156 bytes per peer. Now we access all these fields directly from the underlying BTL structure, and as this structure is common to multiple BML endpoints, we are a lot more cache friendly. Even if this does not improve the latency, it makes the SM performance graph a lot smoother. This commit was SVN r19659.	2008-09-30 21:02:37 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
George Bosilca	e361bcb64c	Send optimizations. 1. The send path gets shorter. The BTL is allowed to return > 0 to specify that the descriptor was pushed to the network, and that the memory attached to it is available again for the upper layer. The MCA_BTL_DES_SEND_ALWAYS_CALLBACK flag can be used by the PML to force the BTL to always trigger the callback. Unmodified BTLs will continue to work as expected, as they will return OMPI_SUCCESS, which forces the PML to have exactly the same behavior as before. Some BTLs have been modified: self, sm, tcp, mx. 2. Add send immediate interface to BTL. The idea is to have a mechanism of allowing the BTL to take advantage of send optimizations such as the ability to deliver data "inline". Some network APIs such as Portals allow data to be sent using a "thin" event without packing data into a memory descriptor. This interface change allows the BTL to use such capabilities and allows for other optimizations in the future. All existing BTLs except for Portals and sm have this interface set to NULL. This commit was SVN r18551.	2008-05-30 03:58:39 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not the opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). 
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do something similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. 
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done in configure.m4 to search for an appropriate XML library, link it in, and use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas on how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Ralph Castain	fa082cafa9	Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex. Note: this is an early preliminary step in the movement of portions of the datatype engine to the opal layer. This commit was SVN r18198.	2008-04-17 20:43:56 +00:00
Shiqing Fan	1c4c7e0f2f	Add memchecker support for osc rdma communication. This commit was SVN r18173.	2008-04-16 13:29:55 +00:00
Tim Prins	b88a3f7a94	Update onesided components to fix the case (on 64 bit machines) where the total offset is greater than 2^31-1 bytes. See: http://www.open-mpi.org/community/lists/users/2008/01/4880.php This commit was SVN r17400.	2008-02-07 18:45:35 +00:00
Gleb Natapov	e2e211f23b	Add flags parameter to btl_alloc() and btl_prepare_src() functions. If the BTL knows the priority of a descriptor at allocation time, it may perform some optimizations. This commit was SVN r16901.	2007-12-09 14:08:01 +00:00
Gleb Natapov	04578ffdd6	Change calls to bml_btl->btl_alloc() to mca_bml_base_alloc(). This commit was SVN r16596.	2007-10-28 16:04:17 +00:00
Brian Barrett	7a9a8c7e17	Support reduction operations other than MPI_REPLACE for user-defined datatypes with MPI_ACCUMULATE This commit was SVN r15418.	2007-07-13 20:46:12 +00:00
Brian Barrett	739fed9dc9	Don't poke at internal structure fields of communicators or groups, but instead use accessor functions. This commit was SVN r15366.	2007-07-11 17:16:06 +00:00
Brian Barrett	25e52238ab	add ability to buffer put/accumulate messages during an epoch This commit was SVN r15295.	2007-07-05 21:40:06 +00:00
Brian Barrett	74008aac53	Support real RDMA operations for networks that support it This commit was SVN r15288.	2007-07-05 03:32:32 +00:00
Brian Barrett	8031f6561e	Make a bunch of debugging calls use the macro version This commit was SVN r15170.	2007-06-21 22:24:40 +00:00
Brian Barrett	7e57bbb0ef	React slightly better when datatype creation from a buffer fails This commit was SVN r14806.	2007-05-30 20:32:02 +00:00
Galen Shipman	3401bd2b07	Add optional ordering to the BTL interface. This is required to tighten up the BTL semantics. Ordering is not guaranteed, but if the BTL returns an order tag in a descriptor (other than MCA_BTL_NO_ORDER), then we may request another descriptor that will obey ordering w.r.t. the other descriptor. This will allow sane behavior for RDMA networks, where local completion of an RDMA operation on the active side does not imply remote completion on the passive side. If we send a FIN message after local completion and the FIN is not ordered w.r.t. the RDMA operation, then badness may occur, as the passive side may now try to deregister the memory while the RDMA operation is still pending on the passive side. Note that this has no impact on networks that don't suffer from this limitation, as the ORDER tag can simply always be specified as MCA_BTL_NO_ORDER. This commit was SVN r14768.	2007-05-24 19:51:26 +00:00


62 commits