openmpi

Автор	SHA1	Сообщение	Дата
Brian Barrett	8b4825ff37	Updates to make trunk run on Catamount again: * Don't build the pstat component if all defines needed aren't there. * Update platform file to work better * Work around two places that depended on modex being operational This commit was SVN r22536.	2010-02-03 05:07:40 +00:00
Jeff Squyres	6e46fbdd7c	Remove some unused variables / silence some compiler warnings This commit was SVN r22419.	2010-01-15 03:15:18 +00:00
Shiqing Fan	6dc506c9de	Make the MS compiler happy when building static libraries. This commit was SVN r22416.	2010-01-14 22:01:26 +00:00
Edgar Gabriel	5c6384e771	clean up the comm_cid code by removing everything related to the block_cid algorithm. This makes it much easier to read again. This commit was SVN r22379.	2010-01-07 16:26:30 +00:00
Edgar Gabriel	9abeaad6e2	so here is what happens: in the v1.2 series the cid's could never go above the max. allowed for a particular pml. Because of that, pml_add_comm never checked for the cid, and in fact pml_add_comm was called in comm_set, which is before we knew the cid. in the v1.3 series (and trunk) we check now the cid to detect overflow, and because of that pml_add_comm has been moved after the cid allocation routine, namely into the comm_activate routine. in the v1.2 series, the comm_activate contained a synchronization step of the old communicator in order to prevent incoming fragments on the new communicator, with the main problem being that the allreduce in the communicator allocation finished at different times on different processes, and thus, this scenario could and did really occur. in the v1.3 series, the comm_activate does not contain the synchronization step anymore, since we introduced the new queue for fragments with unknown cid. The problem is however, that whether a fragment is known or not is decided by using ompi_comm_lookup(), which will return something useful as soon as the cid allocation finished, even before pml_add_comm has been called. So there is a small time gap where we will not post a message into queue for unknown cid's, but we can also not look up the process structure belonging to the rank in that comm ( that is in pml_ob1_match_recv_frag or something like that). The current fix reintroduces the synchronization step in comm_activate, and ensures that no fragment can be received for a new communicator before the synchronization occurs , and thus comm_nextcid() and pml_add_comm has been called. It seems to be the safest and easiest way for now. Welcome back, v1.2. This commit was SVN r21970.	2009-09-17 14:37:02 +00:00
Lenny Verkhovsky	4a84f29fa6	__func__ changed to hardcoded name, after a long thread of emails :) This commit was SVN r21965.	2009-09-10 08:11:38 +00:00
Lenny Verkhovsky	a4ae241769	replaced __FUNCTION__ with __func__ This commit was SVN r21956.	2009-09-09 12:02:45 +00:00
Lenny Verkhovsky	130d15384f	fixed error message. thanks to Arthur Huillet This commit was SVN r21952.	2009-09-08 15:36:37 +00:00
Brian Barrett	468bb42f83	Per discussion in ticket #2009 , temporarily disable the block CID allocation algorithms until they properly reuse CIDs. This commit was SVN r21879.	2009-08-25 15:13:31 +00:00
Rainer Keller	5a4c6434e7	- Single request, so use single wait Technically we should have used MPI_STATUSES_IGNORE, anyhow. (however both MPI_STATUS_IGNORE are ((MPI_Status)0) ;-) This commit was SVN r21757.	2009-08-04 14:51:15 +00:00
Ralph Castain	3561880546	Silence compiler warning about comparing signed and unsigned values This commit was SVN r21637.	2009-07-11 18:36:43 +00:00
Edgar Gabriel	b6f292f794	add a uint8_t to the startup modex which allows us to recognize whether different processes have requested different levels of thread support. This verification is restricted to MPI_COMM_WORLD. In case one ore more processes have requested support for MPI_THREAD_MULTIPLE, the cid selection algorithm will fall back to the original, thread safe approach. Else, it uses the block-algorithm. For dynamic communicators, we always fall back now to the original algorithm. This has been tested for homogeneous and heterogeneous settings for MCW. However, I could not test yet the dynamic comm scenario for technical reasons, and that's why I don't close yet ticket 1949. This commit was SVN r21613.	2009-07-07 18:32:14 +00:00
Rainer Keller	b572dc3591	- As discussed revert r21330, Fortran-configure info should not end up in OPAL - Will post an updated patch for the OMPI_ALIGNMENT_ parts (within C). This commit was SVN r21342. The following SVN revision numbers were found above: r21330 --> open-mpi/ompi@95596d1814	2009-06-01 19:02:34 +00:00
Rainer Keller	95596d1814	- Move alignment and size output generated by configure-tests into the OPAL namespace, eliminating cases like opal/util/arch.c testing for ompi_fortran_logical_t. As this is processor- and compiler-related information (e.g. does the compiler/architecture support REAL*16) this should have been on the OPAL layer. - Unifies f77 code using MPI_Flogical instead of opal_fortran_logical_t - Tested locally (Linux/x86-64) with mpich and intel testsuite but would like to get this week-ends MTT output - PLEASE NOTE: configure-internal macro-names and ompi_cv_ variables have not been changed, so that external platform (not in contrib/) files still work. This commit was SVN r21330.	2009-05-30 15:54:29 +00:00
Edgar Gabriel	d93def71ea	second part of the 'running out of cids problem', this time focusing on what happens when hierarch is used. . Two major items: - modify the comm_activate step to take an additional argument, indicating whether the new communicatio has to go through the collective selection step. This is not required sometimes (e.g. when a process calls MPI_COMM_SPLIT with color=MPI_UNDEFINED), and contributed significantly to the exhaustion of cids. - when freeing a communicator, check whether we can reuse the block of cids assigned to that comm. This only works if the current front of the cid assignment (cid_block_start) is right ater the block of cids assigned to this comm. Fixes trac:1904 Fixes trac:1926 This commit was SVN r21296. The following Trac tickets were found above: Ticket 1904 --> https://svn.open-mpi.org/trac/ompi/ticket/1904 Ticket 1926 --> https://svn.open-mpi.org/trac/ompi/ticket/1926	2009-05-27 15:21:07 +00:00
Greg Koenig	60485ff95f	This is a very large change to rename several #define values from OMPI_* to OPAL_*. This allows opal layer to be used more independent from the whole of ompi. NOTE: 9 "svn mv" operations immediately follow this commit. This commit was SVN r21180.	2009-05-06 20:11:28 +00:00
Shiqing Fan	387ee0ad29	fix a type cast This commit was SVN r21161.	2009-05-05 13:51:02 +00:00
Edgar Gabriel	338b136c28	adding a feature which tries to reuse a block of cids assigned to a communicator. This works, if all processes agree that all communicators utilizing the cids in the block have been freed. If they don't, they assign a new block of cid's. This fixes the application scenario reported in the week, in fact the test succefully creates 100,000 communicators without exceeding a cid of 20. The fix also keeps the main property of the algorithm (namely using a single Allreduce operation to get a new block) and did not modify the communicator structure. This commit was SVN r21142.	2009-05-02 18:03:57 +00:00
Brian Barrett	77cf736f48	Make max_contextid field match same type as cid in communicator. Refs trac:1904 This commit was SVN r21141. The following Trac tickets were found above: Ticket 1904 --> https://svn.open-mpi.org/trac/ompi/ticket/1904	2009-05-01 21:11:59 +00:00
Rainer Keller	221fb9dbca	... Delayed due to notifier commits earlier this day ... - Delete unnecessary header files using contrib/check_unnecessary_headers.sh after applying patches, that include headers, being "lost" due to inclusion in one of the now deleted headers... In total 817 files are touched. In ompi/mpi/c/ header files are moved up into the actual c-file, where necessary (these are the only additional #include), otherwise it is only deletions of #include (apart from the above additions required due to notifier...) - To get different MCAs (OpenIB, TM, ALPS), an earlier version was successfully compiled (yesterday) on: Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled This commit was SVN r21096.	2009-04-29 01:32:14 +00:00
George Bosilca	b7c1ae4f76	Nothing important, just an identation. This commit was SVN r20919.	2009-04-01 15:27:16 +00:00
Rainer Keller	6f808d9b05	Preparation work for another commit (after RFC): - This patch solely _adds_ required headers and is rather localized The next patch (after RFC) heavily removes headers (based on script) - ompi/communicator/communicator.h: For sources that use ompi_mpi_comm_world, don't require them to include "mpi.h" - ompi/debuggers/ompi_common_dll.c: mca_topo_base_comm_1_0_0_t needs #include "ompi/mca/topo/topo.h" - ompi/errhandler/errhandler_predefined.h: ompi/communicator/communicator.h depends on this header file! To prevent recursion just have fwd declarations. #include "ompi/types.h" for fwd declarations of the main structs. - ompi/mca/btl/btl.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/mpool/base/mpool_base_tree.c: We use ompi_free_list_t and ompi_rb_tree_t, so have the proper classes - ompi/mca/op/op.h: Op is pretty self-contained: Nobody up to now has done #include "opal/class/opal_object.h" - ompi/mca/osc/pt2pt/osc_pt2pt_replyreq.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/pml/base/base.h: We use opal_lists - ompi/mca/pml/dr/pml_dr_vfrag.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/pml/ob1/pml_ob1_hdr.h: #include "ompi/mca/btl/btl.h" for mca_btl_base_segment_t - opal/dss/dss_unpack.c: #include "opal/types.h" - opal/mca/base/base.h: #include "opal/util/cmd_line.h" for opal_cmd_line_t - orte/mca/oob/tcp/oob_tcp.c: #include "opal/types.h" for opal_socklen_t - orte/mca/oob/tcp/oob_tcp.h: #include "opal/threads/threads.h" for opal_thread_t - orte/mca/oob/tcp/oob_tcp_msg.c: #include "opal/types.h" - orte/mca/oob/tcp/oob_tcp_peer.c: #include "opal/types.h" for opal_socklen_t - orte/mca/oob/tcp/oob_tcp_send.c: #include "opal/types.h" - orte/mca/plm/base/plm_base_proxy.c: #include "orte/util/name_fns.h" for ORTE_NAME_PRINT - orte/mca/rml/base/rml_base_receive.c: #include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE - orte/mca/rml/oob/rml_oob_recv.c: #include "opal/types.h" for ompi_iov_base_ptr_t - orte/mca/rml/oob/rml_oob_send.c: #include "opal/types.h" for ompi_iov_base_ptr_t - orte/runtime/orte_data_server.c #include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE - orte/runtime/orte_globals.h: #include "orte/util/name_fns.h" for ORTE_NAME_PRINT Tested on Linux/x86-64 This commit was SVN r20817.	2009-03-17 21:34:30 +00:00
Terry Dontje	9215100ac4	Increase communicator padding to accomodate ppc larger lock structures. This commit was SVN r20728.	2009-03-04 19:54:58 +00:00
Rainer Keller	fd28b392bf	- An intrusive commit yet again (sorry): with the separation we get bitten by header depending on having already included the corresponding [opal\|orte\|ompi]_config.h header. When separating, things like [OPAL\|ORTE\|OMPI]_DECLSPEC are missed. Script to add the corresponding header in front of all following (taking care of possible #ifdef HAVE_...) - Including some minor cleanups to - ompi/group/group.h -- include _after_ #ifndef OMPI_GROUP_H - ompi/mca/btl/btl.h -- nclude _after_ #ifndef MCA_BTL_H - ompi/mca/crcp/bkmrk/crcp_bkmrk_btl.c -- still no need for orte/util/output.h - ompi/mca/pml/dr/pml_dr_recvreq.c -- no need for mpool.h - ompi/mca/btl/btl.h -- reorder to fit - ompi/mca/bml/bml.h -- reorder to fit - ompi/runtime/ompi_mpi_finalize.c -- reorder to fit - ompi/request/request.h -- additionally need ompi/constants.h - Tested on linux/x86-64 This commit was SVN r20720.	2009-03-04 15:35:54 +00:00
Terry Dontje	0178b6c45f	Added padding to predefined handle structures to maintain library version to version compatibility. This commit was SVN r20627.	2009-02-24 17:17:33 +00:00
Rainer Keller	d81443cc5a	- On the way to get the BTLs split out and lessen dependency on orte: Often, orte/util/show_help.h is included, although no functionality is required -- instead, most often opal_output.h, or orte/mca/rml/rml_types.h Please see orte_show_help_replacement.sh commited next. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration actually showed two missing #include "orte/util/show_help.h" in orte/mca/odls/base/odls_base_default_fns.c and in orte/tools/orte-top/orte-top.c Manually added these. Let's have MTT the last word. This commit was SVN r20557.	2009-02-14 02:26:12 +00:00
Jeff Squyres	d1c6f3f89a	* Fix a truckload of Cisco copyrights to be the same as the rest of the code base. * Fix a few misspellings in other copyrights. This commit was SVN r20241.	2009-01-11 02:30:00 +00:00
Edgar Gabriel	b05393b363	now that we have the unexpected message queue for unknown communicators, there is no need to have this additional synchronization operation for multi-threaded communicator creations. This commit was SVN r20046.	2008-12-02 16:30:15 +00:00
George Bosilca	82d1d5d785	The patch for "Unexpected message queue for unknown CID's required" ticket #1460 . I'm unable to split it in two parts, my patch and Edgar's one. So I just update copyright information for both of us. What this patch do: - it use the unexpected queue create by commit r19562 to dispatch the unexpected message to the right communicator (once this communicator is created and initialized). - delay the PML comm_add until we have the context_id for the new communicator. - only do the PML comm_add on processes that really belong to the new communicator. Please read the lengthy comment in the source code for the reason behind this. This commit was SVN r19929. The following SVN revision numbers were found above: r19562 --> open-mpi/ompi@acd3406aa7	2008-11-04 21:58:06 +00:00
Jeff Squyres	0ae2c27d3b	Ensure that the mutex is properly constructed/destructed. This commit was SVN r19527.	2008-09-09 12:57:45 +00:00
George Bosilca	bf25b3339d	Minor cleanup. This commit was SVN r19464.	2008-08-31 21:03:39 +00:00
Jeff Squyres	008fa8c5cc	Fixes trac:1236, #1237 . * Various changes to enable 0-dimensional cartesian communicators: * Set various mtc_* members to NULL when there are 0 dimensions (and don't bother trying to memcpy these arrays when duplicating the communicator -- because they're NULL) * adjust topo_base_cart_sub to correctly handle 0 dimensions (simplified it a bit) * adjust a few error codes to return ERR_OUT_OF_RESOURCE * adjust error checking of CART_CREATE, CART_RANK * Allow MPI_GRAPH_CREATE to accept 0 == nnodes. * Bump reported MPI version in mpi.h to 2.1 This commit was SVN r19461. The following Trac tickets were found above: Ticket 1236 --> https://svn.open-mpi.org/trac/ompi/ticket/1236	2008-08-31 19:31:10 +00:00
Rainer Keller	9cc83d7414	- Fix the freeing of already allocated buffers, if one fails. Fixes Coverity CID 291 & CID 292 - Adjust the rc for other functions as well. This commit was SVN r19232.	2008-08-11 09:43:01 +00:00
Jeff Squyres	765749209f	Fix CID 413: possible uninitialized variable This commit was SVN r19176.	2008-08-06 12:25:56 +00:00
Edgar Gabriel	1adb3a6cda	Fixes trac:1408 The optimization that was introduced a year ago for saving a collective synchronization step for certain communicator creation functions has to be disabled for now. The bug has been exposed by the hierarch module, but could appear as well for inter-communicator creations. The problem is, that within a communicator creation step we invoke a comm_dup (for intercomm_create) or other collective operations (in case of hierarch) before all processes have been synchronized. This lead to the "Dropped message for non-existant communicators" error. This commit disables the optimization without removing it from the code base. In theory, it can be enabled again as soon as we have the unexpected message queues for unknown cid's, which were required if I remember right anyway for the multi-threaded scenarios and potentially for fault tolerance. Before moving the patch to 1.3 I would like to let it soak for a couple of days on trunk. Please note, taht my 2nd comment on ticket #1408 was semi-correct, since the order of activation of the communicator and quering the collective module have already been changed earlier. This commit was SVN r19139. The following Trac tickets were found above: Ticket 1408 --> https://svn.open-mpi.org/trac/ompi/ticket/1408	2008-08-04 14:55:09 +00:00
Jeff Squyres	0af7ac53f2	Fixes trac:1392, #1400 * add "register" function to mca_base_component_t * converted coll:basic and paffinity:linux and paffinity:solaris to use this function * we'll convert the rest over time (I'll file a ticket once all this is committed) * add 32 bytes of "reserved" space to the end of mca_base_component_t and mca_base_component_data_2_0_0_t to make future upgrades [slightly] easier * new mca_base_component_t size: 196 bytes * new mca_base_component_data_2_0_0_t size: 36 bytes * MCA base version bumped to v2.0 * '''We now refuse to load components that are not MCA v2.0.x''' * all MCA frameworks versions bumped to v2.0 * be a little more explicit about version numbers in the MCA base * add big comment in mca.h about versioning philosophy This commit was SVN r19073. The following Trac tickets were found above: Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392	2008-07-28 22:40:57 +00:00
Jeff Squyres	90576a435b	Fixes trac:1345 The issue is that the field mca_topo_base_comm_t->mtc_periods_or_edges has a different length, depending on whether the communicator is a graph or a cart. One of the comm dup functions always assumed that it was the length required by graph comms, which could lead to badness in some cases. This commit makes the legnth of that field on a comm dup be the proper length and copies the data over appropriately. I also changed the syntax of the ompi_comm_copy_topo() function to use shorter pointer notation; it made the code much easier to read and fix. This commit was SVN r18752. The following Trac tickets were found above: Ticket 1345 --> https://svn.open-mpi.org/trac/ompi/ticket/1345	2008-06-26 16:59:31 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already make the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not thei opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). * Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so you if explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent view the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing to messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. = New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
George Bosilca	a00ca20446	More cleanups. This commit was SVN r18069.	2008-04-02 06:38:33 +00:00
Jeff Squyres	597266fdec	Present state of MPI debugger work: * New/improved bootstrapping technique for DLLs * First cut of the MPI handle debugging interface. It is still evolving, but the interface is getting more stable. * Some minor bugs were fixed in the unity topo component (brought to light because of the new MPI handle debugging stuff). Fixes trac:1209. This commit was SVN r17730. The following Trac tickets were found above: Ticket 1209 --> https://svn.open-mpi.org/trac/ompi/ticket/1209	2008-03-05 12:22:34 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Shiqing Fan	54c7b71cfd	Use the correct way of including memchecker.h, which will work with '--with-devel-headers'. This commit was SVN r17435.	2008-02-12 18:01:17 +00:00
Shiqing Fan	f5792bbda5	merging the memchecker into trunk. This commit was SVN r17424.	2008-02-12 08:46:27 +00:00
George Bosilca	906e8bf1d1	Replace the ompi_pointer_array with opal_pointer_array. The next step (sometimes after the merge with the ORTE branch), the opal_pointer_array will became the only pointer_array implementation (the orte_pointer_array will be removed). This commit was SVN r17007.	2007-12-21 06:02:00 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
Tim Prins	34966edaf1	remove unneeded and never-initialized lock. The orte_ns.assign_tag function does all the locking we need for us. This commit was SVN r16299.	2007-10-02 14:22:29 +00:00
Tim Prins	1d1d0f6d4c	Fix segfault when user provides a working directory for comm_spawn. Thanks to Murat Knecht for reporting this and suggesting a fix. This commit was SVN r16266.	2007-09-27 23:30:40 +00:00
Tim Prins	4033a40e4e	Coding standards... This commit was SVN r16118.	2007-09-13 14:00:59 +00:00

1 2 3

149 Коммитов