openmpi

Автор	SHA1	Сообщение	Дата
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file can not be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char *, int , or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few component actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump(). opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
George Bosilca	f09e3ce5a4	Spring cleanup. Nothing important. This commit was SVN r26247.	2012-04-06 15:48:07 +00:00
George Bosilca	a78a7bd8e8	The tuned collectives can now deal with more than 2Gb of data. This commit was SVN r26103.	2012-03-05 22:23:44 +00:00
George Bosilca	23e8ce91ba	Rework the selection logic for the tuned collectives. All supported collectives now are able to use the dynamic rules. Moreover, these rules are loaded only once, and stored at the component level. All communicators are able to use these rules (not only MPI_COMM_WORLD as until now). A lot of minor corrections, memory management issues and reduction in the amount of memory used by the tuned collectives. This commit was SVN r21825.	2009-08-14 21:06:23 +00:00
Rainer Keller	6c5532072a	- Split the datatype engine into two parts: an MPI specific part in OMPI and a language agnostic part in OPAL. The convertor is completely moved into OPAL. This offers several benefits as described in RFC http://www.open-mpi.org/community/lists/devel/2009/07/6387.php namely: - Fewer basic types (int* and float* types, boolean and wchar - Fixing naming scheme to ompi-nomenclature. - Usability outside of the ompi-layer. - Due to the fixed nature of simple opal types, their information is completely known at compile time and therefore constified - With fewer datatypes (22), the actual sizes of bit-field types may be reduced from 64 to 32 bits, allowing reorganizing the opal_datatype structure, eliminating holes and keeping data required in convertor (upon send/recv) in one cacheline... This has implications to the convertor-datastructure and other parts of the code. - Several performance tests have been run, the netpipe latency does not change with this patch on Linux/x86-64 on the smoky cluster. - Extensive tests have been done to verify correctness (no new regressions) using: 1. mpi_test_suite on linux/x86-64 using clean ompi-trunk and ompi-ddt: a. running both trunk and ompi-ddt resulted in no differences (except for MPI_SHORT_INT and MPI_TYPE_MIX_LB_UB do now run correctly). b. with --enable-memchecker and running under valgrind (one buglet when run with static found in test-suite, commited) 2. ibm testsuite on linux/x86-64 using clean ompi-trunk and ompi-ddt: all passed (except for the dynamic/ tests failed!! as trunk/MTT) 3. compilation and usage of HDF5 tests on Jaguar using PGI and PathScale compilers. 4. compilation and usage on Scicortex. - Please note, that for the heterogeneous case, (-m32 compiled binaries/ompi), neither ompi-trunk, nor ompi-ddt branch would successfully launch. This commit was SVN r21641.	2009-07-13 04:56:31 +00:00
Jeff Squyres	11b375f8b5	CIDs 1080-1090: assert() checks were not sufficient to check for NEGATIVE_RETURNS from _reg_int() because those are not always checked. So replace them with real if() checks. This commit was SVN r20195.	2009-01-03 15:56:25 +00:00
Rainer Keller	ee1fe9015a	- Make sure, that the *param_index are > 0 (here, we don't pass errors up...). Coverity CID 1080 - 1090 - Really make sure, the user does not specify stupid negative values. This commit was SVN r19233.	2008-08-11 11:21:04 +00:00
Jeff Squyres	0af7ac53f2	Fixes trac:1392, #1400 * add "register" function to mca_base_component_t * converted coll:basic and paffinity:linux and paffinity:solaris to use this function * we'll convert the rest over time (I'll file a ticket once all this is committed) * add 32 bytes of "reserved" space to the end of mca_base_component_t and mca_base_component_data_2_0_0_t to make future upgrades [slightly] easier * new mca_base_component_t size: 196 bytes * new mca_base_component_data_2_0_0_t size: 36 bytes * MCA base version bumped to v2.0 * '''We now refuse to load components that are not MCA v2.0.x''' * all MCA frameworks versions bumped to v2.0 * be a little more explicit about version numbers in the MCA base * add big comment in mca.h about versioning philosophy This commit was SVN r19073. The following Trac tickets were found above: Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392	2008-07-28 22:40:57 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already make the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not thei opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). * Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so you if explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent view the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing to messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. = New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Shiqing Fan	a0660f4deb	- Just some type casts. This commit was SVN r16100.	2007-09-12 15:29:58 +00:00
Brian Barrett	af4e86c25f	Update collectives selection logic to allow for multiple components to be used at nce (up to one unique collective module per collective function). Matches r15795:15921 of the tmp/bwb-coll-select branch This commit was SVN r15924. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r15795 r15921	2007-08-19 03:37:49 +00:00
Jelena Pjesivac-Grbovic	1b66a52c50	Modifying type of binomial tree used for binomial reduce: switching: 0 0 / \ \ / \ \ 1 \ \ --> 4 \ \ / \ \ / \ \ 3 2 \ 3 2 \ 4 1 (duh). The first form is the bmtree suitable for bcast, but the latter is better for reduce. Updating default decision function accordingly. This commit was SVN r15422.	2007-07-13 21:07:51 +00:00
Jelena Pjesivac-Grbovic	625c6739ab	Removing warning about unsed variable This commit was SVN r14579.	2007-05-03 20:26:41 +00:00
Jelena Pjesivac-Grbovic	9eff74ad4d	Modifying generalized reduce "synchronized" behavior: - Removing "small" message size limit because it really does not relate to the eager size accross the board. Now, the leaf nodes in generalized reduce will use blocking send (DEFAULT/ORIGINAL BEHAVIOR) either when the maximum number of outstanding requests is 0 or when the total number of segments is less than the maximum number of outstanding requests. Otherwise, it will send messages using non-blocking synchronized send operation. This commit was SVN r14572.	2007-05-02 21:42:45 +00:00
George Bosilca	69642a9cd4	Remove 2 warnings about ptrdiff_t to unsigned long implicit conversion. This commit was SVN r14565.	2007-05-01 19:47:33 +00:00
Jelena Pjesivac-Grbovic	3eac49aa59	Adding flow control for leaf nodes in generalized reduce structure. This "feature" is disabled by default and it should not affect the current performance. In case when the message size is large and segment size is smaller than eager size for particular interface, the leaf nodes in generalized reduce function can overflood parent nodes by sending all segments without any synchronization. This can cause the parent to have HIGH number of unexpected messages (think 16MB message with 1KB segments for example). In case of binomial algorithm root node always has at least one child which is leaf, so this can potentially affect the root's performance significantly [Especially in large communicators where root may have quite a few children (binomial tree for example)]. When the segment size is bigger than the eager size, rendezvous protocol ensures that this does not happen so it is not necessary. Originally, the problem was exposed in "infinite" bucket allocator clean up time for "small" segment sizes (which may explain some "deadlocks" on Thunderbird tests). To prevent this, we allow user to specify mca parameter "--mca coll_tuned_reduce_algorithm_max_requests NUM" this limits number of outstanding messages from a leaf node in generalized reduce to the parent to NUM. Messages are sent as non-blocking synchrnous messages, so syncronization happens at "wait" time. The synchronization actually improved performance of pipeline and binomial algorithm for large message sizes with 1KB segments over MX, but I need to test it some more to make sure it is consistent. Since there is no easy way to find out what is "the eager" size for particular btl, I set the limit to 4000B. If message/individual segment size is greater than 4000B - we will not use this feature. This variable may or may not be exposed as mca parameter later... I did not have any problems running it and both "default" and "synchronous" tests passed Intel Reduce* tests up to 80 processes (over MX). This commit was SVN r14518.	2007-04-25 20:39:53 +00:00
Jelena Pjesivac-Grbovic	9780a000ba	Cleanup of generic reduce function and possible (low probability) bug fix. - fixing line lengths and some of the comments - possible bug fix (but I do not think we exposed it in any tests so far) temporary buffers were allocated as multiples of extent instead of true_extent + (count -1) * extent. Everything is still passing Intel tests over tcp and btl mx up to 64 nodes. This commit was SVN r13956.	2007-03-08 00:54:52 +00:00
Jelena Pjesivac-Grbovic	e532b928af	Adding segmented binary reduce algorithm which works with non-commutative operations. Implementation passed intel: MPI_Reduce_c , MPI_Reduce_loc_c, and MPI_Reduce_user_c tests over TCP, BTL MX, and MTL MX, as well as, mpi_test_suite Reduce tests (up to 64 nodes). The algorithm is still not activated by decision function (will be in the near future). This commit was SVN r13657.	2007-02-14 22:38:38 +00:00
Jelena Pjesivac-Grbovic	6efca498ec	Fixes trac:692 in trunk: receive buffer in MPI_Reduce operation is no longer overwritten on non-root nodes. This commit was SVN r13538. The following Trac tickets were found above: Ticket 692 --> https://svn.open-mpi.org/trac/ompi/ticket/692	2007-02-07 18:57:03 +00:00
George Bosilca	c2c6a1b37e	Correctly compute the number of elements in a segment. For broadcast send the correct size for all intermediary nodes. This commit was SVN r12552.	2006-11-10 23:04:50 +00:00
George Bosilca	7102147b9f	Correctly detect when the specified algorithm is out of range. In this case we reset it to zero. This commit was SVN r12551.	2006-11-10 21:47:07 +00:00
George Bosilca	af68171253	Use the macro to compute the number of elements in a segment in both bcast and reduce and update the default values for the variables as required by the comment in the coll_tuned.h file. This commit was SVN r12546.	2006-11-10 20:04:08 +00:00
George Bosilca	476b922074	Updates & upgrades: - consistent arguments checking (not allowing to select an algorithm which is not available) - consistent way of computing the segcount (number of datatypes by segment). - small cleanups. - more informative debugging messages. This commit was SVN r12545.	2006-11-10 19:54:09 +00:00
George Bosilca	a82ce427e4	Update the number of reduce algorithms available. This commit was SVN r12503.	2006-11-08 22:20:34 +00:00
George Bosilca	8529238d93	Add 2 more algorithms to the dynamic list. This commit was SVN r12415.	2006-11-02 19:19:08 +00:00
George Bosilca	393657ee26	Initialize the sndbuf in all cases. Do not forget to initialize the tree used in each of the broadcast functions. This commit was SVN r12332.	2006-10-27 00:13:33 +00:00
George Bosilca	ba3c247f2a	Big collective commit. I lightly test it, but I think it should be quite stable. Anyway, the default decision functions (for broadcast, reduce and barrier) are based on a high performance network (not TCP). It should give good performance (really good) for any network having the following caracteristics: small latency (5 microseconds) and good bandwidth (more than 1Gb/s). + Cleanup of the reduce algorithms, plus 2 new algorithms (binary and binomial). Now most of the reduce algorithms use a generic tree based function for completing the reduce. + Added macros for computing the trees (they are used for bcast and reduce right now). + Allow the usage of all 5 topologies. + Jelena's implementation of a binary tree that can be used for non commutative operations. Right now only the tree building function is there, it will get activated soon. + Some others minor cleanups. This commit was SVN r12326.	2006-10-26 22:53:05 +00:00
George Bosilca	99631ccf66	Cleanups. This commit was SVN r12272.	2006-10-23 22:29:17 +00:00
George Bosilca	39cd8d3d17	One to rule them all. We only need one topology information: a tree. How we build it it's hat make the difference. This commit was SVN r12268.	2006-10-23 21:46:30 +00:00
George Bosilca	9cf3040e5f	Allocate enough memory for the reduce operation when MPI_IN_PLACE is specified. This commit was SVN r12260.	2006-10-23 17:51:36 +00:00
George Bosilca	a7b6078b73	No more segfault. Still some wrong data around ... This commit was SVN r12238.	2006-10-20 20:17:34 +00:00
George Bosilca	02759cf515	Update the reduce chain collective. This commit was SVN r12237.	2006-10-20 19:47:52 +00:00
George Bosilca	06563b5dec	Last set of explicit conversions. We are now close to the zero warnings on all platforms. The only exceptions (and I will not deal with them anytime soon) are on Windows: - the write functions which require the length to be an int when it's a size_t on all UNIX variants. - all iovec manipulation functions where the iov_len is again an int when it's a size_t on most of the UNIXes. As these only happens on Windows, so I think we're set for now :) This commit was SVN r12215.	2006-10-20 03:57:44 +00:00
George Bosilca	caefd6d0ee	Do not leak memory. Allocate the intermediary buffer only when we really need it (not leafs) and release on the same way. This commit was SVN r12200.	2006-10-19 22:20:33 +00:00
George Bosilca	c9da782804	Keep only one function to get the size of a datatype. This commit was SVN r12170.	2006-10-18 17:33:01 +00:00
George Bosilca	be27ee6fa0	Correct the bcast problem where we always did a bcast with segzise of 0. Activate the reduce decision function. Others small updates (mostly TAB to spaces). This commit was SVN r12161.	2006-10-18 02:00:46 +00:00
George Bosilca	8852c00c36	Look like a big commit but in fact it address only one issue. The way we're working with size and diplacement of data-type. After this patch all data can contain size_t bytes and the displacements are defined as ptrdiff_t. All of the files I was able to compile have been modified to match this requirement. This commit was SVN r12146.	2006-10-17 20:20:58 +00:00
George Bosilca	3f0a7cad9e	The last patch for Windows support. Mostly casting and conversion to C++ friendly headers. This commit was SVN r11400.	2006-08-24 16:38:08 +00:00
Graham Fagg	c31a5ad4b3	A few small changes that just expanded in the name of neatness... (1) As pointed out by Torsten after Jeff comment that there are 15 collectives yesterday.. nope.. I have 16 but miss counted them in my ifdefs (I had two #11s). Replaces with enum... (2) Added a readonly MCA param for how many backend algorithms are available per collective (used by benchmarker/STS) This allowed me to remove the tuned query internal functions and replace them with ompi_coll_tuned_forced_max_algorithms[COLL]. (3) I was reading the user forced MCA params for the collectives on each comm create (module init) but I then put the values into a global set of variables (like ompi_coll_tuned_reduce_forced_algorithm). To fix this and make the code neater: (a) The component looks up the MCA param indices on Open if dynamic_rules is set via the ompi_coll_tuned_COLLECTIVE_intra_check_forced_init () call. (b) Got rid of the ompi_coll_ompi_coll_tuned_COLLECTIVE_forced_algorithm/segmentsize/etc globals with a struct that is now cached on the module data hung off the communicator. i.e. done right. (c) On module init if dynamic rules enabled we call a general getvalues routine (in coll_tuned_forced.c) to get the CURRENT values using the MCA param indices and then put them on the modules data segment. A shorter version of getvalues exists for barrier which only needs the algorithm choice This commit was SVN r9663.	2006-04-19 23:42:06 +00:00
George Bosilca	39252b764f	Correctly compute the size of the datatype. This commit was SVN r9127.	2006-02-23 04:30:52 +00:00
George Bosilca	805c45de29	Don't let a division by zero happens ... This commit was SVN r9109.	2006-02-22 06:34:05 +00:00
Brian Barrett	566a050c23	Next step in the project split, mainly source code re-arranging - move files out of toplevel include/ and etc/, moving it into the sub-projects - rather than including config headers with <project>/include, have them as <project> - require all headers to be included with a project prefix, with the exception of the config headers ({opal,orte,ompi}_config.h mpi.h, and mpif.h) This commit was SVN r8985.	2006-02-12 01:33:29 +00:00
Graham Fagg	232bb9534a	Start moving stuff out of modules that should be in the component. This commit was SVN r8874.	2006-02-01 20:50:14 +00:00
Graham Fagg	5f2d82347f	a couple of changes to make barrier synchronous.. means last communication to any possible peer must be locally completing. for now using synchronous calls until the new functionality is available. then will change the code to use the new PML send flags. This commit was SVN r8867.	2006-01-31 23:21:46 +00:00
Jeff Squyres	54c4bd3ce2	Update to have public symbols be consistent; use new prefix rule (apparently we've been doing this in opal and orte, but not in ompi yet). All public symbols begin with "ompi_coll_tuned_" (not mca_coll_tuned_) except the component struct. Now this component passes the illegal symbol report with no hits. This commit was SVN r8589.	2005-12-22 13:49:33 +00:00
Graham Fagg	8651658816	minor compile warnings fix This commit was SVN r8497.	2005-12-14 19:09:46 +00:00
George Bosilca	b7353c707d	Remove unprotected header files. This commit was SVN r8432.	2005-12-10 17:04:46 +00:00
George Bosilca	1aa6d27ffe	Remove all the compilation warnings I found including unused variables and functions. This commit was SVN r8226.	2005-11-22 03:42:15 +00:00
Graham Fagg	877f7bbe6a	File based dynamic up and tested... Lots of misc fixes: printfs->opal_output, handles fanin/out correctly for forced ops unused vars, correct calculations on meaning of 'msgsize' for decision functions (varies depending on algorithm), etc This commit was SVN r8113.	2005-11-11 04:49:29 +00:00

1 2

61 Коммитов