openmpi

Author	SHA1	Message	Date
Nathan Hjelm	5f1254d710	Update code base to use the new opal_free_list_t Use of the old ompi_free_list_t and ompi_free_list_item_t is deprecated. These classes will be removed in a future commit. This commit updates the entire code base to use opal_free_list_t and opal_free_list_item_t. Notes: OMPI_FREE_LIST_*_MT -> opal_free_list_* (uses opal_using_threads ()) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-02-24 10:05:45 -07:00
Gilles Gouaillardet	0f983d5a4f	add a disable function for coll module	2014-10-14 14:46:36 +09:00
Howard Pritchard	0f74467264	switch to ompi_mpi_thread_provided for ts check Use ompi_mpi_thread_provided rather than opal_using_threads macro to check whether MPI_THREAD_MULTIPLE is being used. This commit was SVN r32815.	2014-09-29 22:20:35 +00:00
Howard Pritchard	7069f2361a	disqualify coll ml for MPI_THREAD_MULTIPLE This commit was SVN r32814.	2014-09-29 21:02:15 +00:00
Ralph Castain	552c9ca5a0	George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-) WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL. All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic. This commit was SVN r32317.	2014-07-26 00:47:28 +00:00
Nathan Hjelm	56ad231b7c	coll/ml: temporarily disable binding check This commit was SVN r32178.	2014-07-09 14:39:49 +00:00
Nathan Hjelm	484a3f6147	coll/ml: fix issues identified by the clang static analyser and fix a segmentation fault in the reduce cleanup Some of the changes address false warnings produced by scan-build. I added asserts and changed some malloc calls to calloc to silence these warnings. There was one issue in cleanup for reduce since the component_functions member is changed by the allreduce call. There may be other issues with how this code works but releasing the allocated component_functions after setting up the static functions addresses the primary issue (SIGSEGV). cmr=v1.8.1:reviewer=manjugv This commit was SVN r31417.	2014-04-16 22:43:35 +00:00
Nathan Hjelm	459431622b	Revert "coll/ml: there is no reason not to enable coll/ml when a process in not" Discussed this with Manju and we decided to back this one out until a later time. This reverts commit r31188 and closes trac:4435 This commit was SVN r31282. The following SVN revision numbers were found above: r31188 --> open-mpi/ompi@f1dd589092 The following Trac tickets were found above: Ticket 4435 --> https://svn.open-mpi.org/trac/ompi/ticket/4435	2014-03-28 21:16:34 +00:00
Manjunath Gorentla Venkata	8c849ee991	coll/ml : Replace longer error message with opal_show_help; thanks Jeff for identifying those This commit was SVN r31279.	2014-03-28 19:25:54 +00:00
Nathan Hjelm	a9fb4976d5	coll/ml: more fixes There were a couple of issues with the memory leak fixes and several more verbose issues. This fixes those issues. cmr=v1.8.1:ticket=trac:4473 This commit was SVN r31273. The following Trac tickets were found above: Ticket 4473 --> https://svn.open-mpi.org/trac/ompi/ticket/4473	2014-03-28 18:31:28 +00:00
Nathan Hjelm	bd3b550c6d	coll/ml: fix leaks Thanks to ggouaillardet for finding and fixing these issues. Closes trac:4460 cmr=v1.8.1:reviewer=manjugv This commit was SVN r31264. The following Trac tickets were found above: Ticket 4460 --> https://svn.open-mpi.org/trac/ompi/ticket/4460	2014-03-27 23:25:31 +00:00
Nathan Hjelm	15a8c9d7b8	coll/ml: addendum to r31189. increment the bcol_index cmr=v1.8:ticket=trac:4436 This commit was SVN r31193. The following SVN revision numbers were found above: r31189 --> open-mpi/ompi@c7d830f4b9 The following Trac tickets were found above: Ticket 4436 --> https://svn.open-mpi.org/trac/ompi/ticket/4436	2014-03-21 22:03:56 +00:00
Nathan Hjelm	c7d830f4b9	coll/ml: improve the buffer size calculation and ensure the bcol_index in a hierarchy actually matches a bcol that is in use. There was a bug in one of the paths to calculate the ml buffer size. I fixed the bug and squashed all the paths together to avoid further issues (the result was correct in another path that calculated the same value). Additionally, the i_hier was being used as the bcol_index. This is not correct in a couple of cases so I added a variable to keep track of the real bcol_index. cmr=v1.8:reviewer=pasha This commit was SVN r31189.	2014-03-21 21:54:28 +00:00
Nathan Hjelm	f1dd589092	coll/ml: there is no reason not to enable coll/ml when a process is not bound. This case is correctly handled by coll/ml so remove the check that disables coll/ml in the not bound case. cmr=v1.8:reviewer=manjugv This commit was SVN r31188.	2014-03-21 21:54:21 +00:00
Nathan Hjelm	08bbdcbf61	coll/ml: fix leaks in coll/ml resources This patch fixes two leaks: - Fix typo in fallback collective code that caused coll/ml to retain the ibcast module twice but only release it once. One of those ibcast saves was supposed to be bcast. - Do not check for module initialization in the module destructor. It is possible to destruct a module that is partially setup. cmr=v1.8:reviewer=manjugv This commit was SVN r31187.	2014-03-21 21:54:14 +00:00
Nathan Hjelm	e030443d45	coll/ml: further improve the hierarchy discovery to handle the case where a sbgp module fails to group any processes on any nodes. cmr=v1.7.5:reviewer=manjugv This commit was SVN r31131.	2014-03-18 21:26:24 +00:00
Nathan Hjelm	8b2d723fd4	coll/ml: fix valgrind warning about reading uninitialized value This isn't causing any errors that I know about but it does fix an annoying valgrind warning. Simple fix, no review required. cmr=v1.7.5:reviewer=ompi-rm1.7 This commit was SVN r31130.	2014-03-18 21:26:17 +00:00
Nathan Hjelm	d9c8bf3785	coll/ml: move error messages to verbose output There are situations where coll/ml does not initialize properly. These will eventually need to be fixed but in the meantime it is better to not always print an error message because the collective framework can still fall back on another collective module. This commit reduces the verbose output. cmr=v1.7.5:reviewer=manjugv This commit was SVN r31129.	2014-03-18 21:26:10 +00:00
Jeff Squyres	5efd961149	Remove unnecessary \n's in ML_VERBOSE and ML_ERROR. Also fixed spelling: IS_NOT_RECHABLE -> IS_NOT_REACHABLE. Also mark a few places where opal_show_help() should have been used; Manju will take care of these. This commit was SVN r31104.	2014-03-18 12:24:32 +00:00
Nathan Hjelm	f92579dce5	coll/ml: fix a case not correctly handled by r31071 In r31071 I modified the logic to not increment the hierarchy level if no processes were selected by that sbgp. That fixed a problem seen on systems where we don't support process binding. The problem is there is a case where we actually did select processes yet the number of selected processes is 0. We need to increment the hierarchy in this case as well. This should fix the segmentation fault found by recent MTT runs. Once this is committed to 1.7.5 remove the .ompi_ignore's from coll/ml and bcol/ptpcoll. Tested with ompi-tests/ibm. cmr=v1.7.5:reviewer=rhc This commit was SVN r31081. The following SVN revision numbers were found above: r31071 --> open-mpi/ompi@1911d97044	2014-03-15 22:37:28 +00:00
Jeff Squyres	34d92315ae	Remove extraneous "while(0)". Oops. cmr=v1.7.5:ticket=trac:4395 This commit was SVN r31075. The following Trac tickets were found above: Ticket 4395 --> https://svn.open-mpi.org/trac/ompi/ticket/4395	2014-03-14 20:41:54 +00:00
Jeff Squyres	036db91f3d	For the love of all that is holy, do not put 1MB arrays on the stack. This was causing JVMs to run out of stack space, and all manner of badness ensued. Instead, use the heap -- that's what it's there for. cmr=v1.7.5:reviewer=rhc:subject=make coll/ml use the heap for large debug array This commit was SVN r31073.	2014-03-14 20:39:39 +00:00
Nathan Hjelm	1911d97044	coll/ml: fix assertion failure that occurs when level 0 of the hierarchy fails to select any processes on any nodes. Also modified basesmsocket to only print debugging info to the framework output. cmr=v1.7.5:reviewer=jsquyres This commit was SVN r31071.	2014-03-14 19:39:00 +00:00
Jeff Squyres	da87b506bd	Remove warnings identified by clang 3.4 * Remove unused static functions * Remove unused static variables cmr=v1.8:reviewer=hjelmn This commit was SVN r31023.	2014-03-12 13:17:54 +00:00
Nathan Hjelm	0af741810c	coll/ml: do not access group proc pointers directly. use ompi_comm_peer_lookup instead. Resolves an issue seen with --enable-sparse-groups. cmr=v1.7.5:reviewer=manjugv This commit was SVN r30945.	2014-03-05 22:57:21 +00:00
Pavel Shamis	3a683419c5	Fixing broken dependency between ML/BCOLS This is hot-fix patch for the issue reported by Ralph. In future we plan to restructure ml data structure layout. Tested by Nathan. cmr=v1.7.5:ticket=trac:4158 This commit was SVN r30619. The following Trac tickets were found above: Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158	2014-02-07 19:15:45 +00:00
Nathan Hjelm	c2b061cc84	basesmuma: clean up code Several changes are contained in this commit: - Clean up tabs and trailing whitespaces - Use consistent indentation in changed files - Remove unused code. None of the removed code will ever have been used in a trunk build. - Clean up the smcm code quite a bit - Do not fflush stderr and use opal_output instead of fprintf. These changes have been tested on Cray XE-6 and PSM systems. cmr=v1.7.5:ticket=trac:4158 This commit was SVN r30533. The following Trac tickets were found above: Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158	2014-02-03 17:01:46 +00:00
Nathan Hjelm	afae924e29	coll/ml: fix some warnings and the spelling of indices This commit fixes one warning that should have caused coll/ml to segfault on reduce. The fix should be correct but we will continue to investigate. cmr=v1.7.5:ticket=trac:4158 This commit was SVN r30477. The following Trac tickets were found above: Ticket 4158 --> https://svn.open-mpi.org/trac/ompi/ticket/4158	2014-01-29 18:44:21 +00:00
Nathan Hjelm	1a021b8f2d	coll/ml: add support for blocking and non-blocking allreduce, reduce, and allgather. The new collectives provide a significant performance increase over tuned for small and medium messages. We are initially setting the priority lower than tuned until this has had some time to soak in the trunk. Please set coll_ml_priority to 90 for MTT runs. Credit for this work goes to Manjunath Gorentla Venkata (ORNL), Pavel Shamis (ORNL), and Nathan Hjelm (LANL). Commit details (for reference): Import ORNL's collectives for MPI_Allreduce, MPI_Reduce, and MPI_Allgather. We need to take the basesmuma header into account when calculating the ptpcoll small message thresholds. Add a define to bcol.h indicating the maximum header size so we can take the header into account while not making ptpcoll dependent on information from basesmuma. This resolves an issue with allreduce where ptpcoll overwrites the header of the next buffer in the basesmuma bank. Fix reduce and make a sequential collective launcher in coll_ml_inlines.h The root calculation for reduce was wrong for any root != 0. There are four possibilities for the root: - The root is not the current process but is in the current hierarchy. In this case the root is the index of the global root as specified in the root vector. - The root is not the current process and is not in the next level of the hierarchy. In this case 0 must be the local root since this process will never communicate with the real root. - The root is not the current process but will be in the next level of the hierarchy. In this case the current process must be the root. - I am the root. The root is my index. Tested with IMB which rotates the root on every call to MPI_Reduce. Consider IMB the reproducer for the issue this commit solves. Make the bcast algorithm decision an enumerated variable Resolve various assert failures when destructing coll ml requests. 
Two issues: - Always reset the request to be invalid before returning it to the free list. This will avoid an assert in ompi_request_t's destructor. OMPI_REQUEST_FINI does this (and also releases the fortran handle index). - Never explicitly construct or destruct the superclass of an opal object. This screws up the class function tables and will cause either an assert failure or a segmentation fault when destructing coll ml requests. Cleanup allgather. I removed the duplicate non-blocking and blocking functions and modeled the cleanup after what I found in allreduce. Also cleaned up the code somewhat. Don't bother copying from the send to the receive buffer in bcol_basesmuma_allreduce_intra_fanin_fanout if the pointers are the same. This eliminates a warning about memcpy and aliasing and avoids an unnecessary call to memcpy. Always call CHECK_AND_RELEASE on memsync collectives. There was a call to OBJ_RELEASE on the collective communicator but because CHECK_AND_RECYLCE was never called there was no matching call to OBJ_RELEASE. This caused coll ml to leak communicators. Make allreduce use the sequential collective launcher in coll_ml_inlines.h Just launch the next collective in the component progress. I am a little unsure about this patch. There appears to be some sort of race between collectives that causes buffer exhaustion in some cases (IMB Allreduce is a reproducer). Changing progress to only launch the next bcol seems to resolve the issue but might not be the best fix. Note that I see little to no performance penalty for this change. Fix allreduce when there are extra sources. There was an issue with the buffer offset calculation when there are extra sources. In the case of extra sources == 1 the offset was set to buffer_size (just past the header of the next buffer). I adjusted the buffer size to take into account the maximum header size (see the earlier commit that added this) and simplified the offset calculation. Make reduce/allreduce non-blocking. 
This is required for MPI_Comm_idup to work correctly. This has been tested with various layouts using the ibm testsuite and imb and appears to have the same performance as the old blocking version. Fix allgather for non-contiguous layouts and simplify parsing the topology. Some things in this patch: - There were several comments to the effect that level 0 of the hierarchy MUST contain all of the ranks. At least one function made this assumption but it was not true. I changed the sbgp components and the coll ml initialization code to enforce this requirement. - Ensure that hierarchy level 0 has the ranks in the correct scatter gather order. This removes the need for a separate sort list and fixes the offset calculation for allgather. - There were several passes over the hierarchy to determine properties of the hierarchy. I eliminated these extra passes and the memory allocation associated with them and calculate the tree properties on the fly. The same DFS recursion also handles the re-order of level 0. All these changes have been verified with MPI_Allreduce, MPI_Reduce, and MPI_Allgather. All functions now pass all IBM/Open MPI, and IMB tests. coll/ml: correct pointer usage for MPI_BOTTOM Since contiguous datatypes are copied via memcpy (bypassing the convertor) we need to adjust for the lb of the datatype. This corrects problems found testing code that uses MPI_BOTTOM (NULL) as the send pointer. Add fallback collectives for allreduce and reduce. cmr=v1.7.5:reviewer=pasha This commit was SVN r30363.	2014-01-22 15:39:19 +00:00
Ralph Castain	c7a94a57d7	Per Marco, rename ERROR tags to exit_ERROR to avoid cygwin reserved name issues. Refs trac:4085 This commit was SVN r30239. The following Trac tickets were found above: Ticket 4085 --> https://svn.open-mpi.org/trac/ompi/ticket/4085	2014-01-10 18:00:49 +00:00
Nathan Hjelm	f5495ace48	coll/ml: update the coll_ml_enable_fragmentation variable to support the option to autodetect whether fragmentation should be enabled cmr=v1.7.3:ticket=trac:3717 This commit was SVN r29065. The following Trac tickets were found above: Ticket 3717 --> https://svn.open-mpi.org/trac/ompi/ticket/3717	2013-08-27 16:36:54 +00:00
Jeff Squyres	baa3182794	Per RFC (http://www.open-mpi.org/community/lists/devel/2013/07/12534.php), remove a bunch of dead code. This commit was SVN r28756.	2013-07-11 17:34:28 +00:00
Aurelien Bouteiller	e1066143a4	rename ompi_free_list operations to _mt, as per discussions at last face to face meeting This commit was SVN r28734.	2013-07-08 22:07:52 +00:00
Pavel Shamis	a31bc57849	Moving mca/common/netpatterns and commpaterns to ompi/patterns. This commit was SVN r28035.	2013-02-05 21:52:55 +00:00
Brian Barrett	f42783ae1a	Move the RTE framework change into the trunk. With this change, all non-CR runtime code goes through one of the rte, dpm, or pubsub frameworks. This commit was SVN r27934.	2013-01-27 23:25:10 +00:00
Pavel Shamis	1e7b958c2a	Cleaning warning in collectives code This commit was SVN r27331.	2012-09-12 19:47:23 +00:00
Pavel Shamis	8cf3c95494	Fixing ML COLL compilation issues on some SUN platforms. For more detail see following mail thread: http://www.open-mpi.org/community/lists/devel/2012/08/11448.php A lot of thanks to Paul Hargrove for the issue analysis and patch testing. Refs trac:3243 This commit was SVN r27178. The following Trac tickets were found above: Ticket 3243 --> https://svn.open-mpi.org/trac/ompi/ticket/3243	2012-08-29 14:10:42 +00:00
Ralph Castain	eda4cd5aa7	Cleanup warnings for improper use of C++ comment style, set ignores This commit was SVN r27079.	2012-08-16 21:52:14 +00:00
Pavel Shamis	b89f8fabc9	Adding Hierarchical Collectives project to the Open MPI trunk. The project includes following components and frameworks: - ML Collective component - NETPATTERNS and COMMPATTERNS common components - BCOL framework - SBGP framework Note: By default the ML collective component is disabled. In order to enable new collectives user should bump up the priority of ml component (coll_ml_priority) ============================================= Primary Contributors (in alphabetical order): Ishai Rabinovich (Mellanox) Joshua S. Ladd (ORNL / Mellanox) Manjunath Gorentla Venkata (ORNL) Mike Dubman (Mellanox) Noam Bloch (Mellanox) Pavel (Pasha) Shamis (ORNL / Mellanox) Richard Graham (ORNL / Mellanox) Vasily Filipov (Mellanox) This commit was SVN r27078.	2012-08-16 19:11:35 +00:00

39 Commits