openmpi

Author	SHA1	Message	Date
Jeff Squyres	ea5c0cb4a2	Now that the nightly tarball has safely been made, let's try this commit again. Remove the svn:ignore from problematic directories and try a merge from /tmp-public/plpa-merge-area2. This commit was SVN r17718.	2008-03-05 02:45:15 +00:00
Jeff Squyres	8189fcc7d5	Back out r17702; it went very badly. This commit was SVN r17704. The following SVN revision numbers were found above: r17702 --> open-mpi/ompi@3df754ebd7	2008-03-05 00:42:39 +00:00
Jeff Squyres	3df754ebd7	Bring over PLPA v1.1 from /tmp-public/plpa-v1.1 branch. This commit was SVN r17702.	2008-03-05 00:16:49 +00:00
Aurelien Bouteiller	284115208c	Try to blindly solve warning about size_t printf format, as I can't reproduce the warning on my machines. This commit was SVN r17701.	2008-03-04 22:30:35 +00:00
Tim Prins	824c298abf	Move the carto finalize from the util finalize to the main finalize where it belongs. Otherwise, the modules are unloaded by the mca before we try to do carto_finalize, and bad things happen. This commit was SVN r17665.	2008-02-29 12:49:04 +00:00
Tim Prins	84b2099fe8	Remove the now-unused orte_value_array. As this is the last 'class' split between orte and ompi, remove the big comment about the split in ompi_bitmap. Also, update some properties (source files should not be executable...), and remove a couple of unneeded inclusions of orte_proc_table.h. This commit was SVN r17655.	2008-02-28 21:39:42 +00:00
Tim Prins	2e1bda6d23	Remove the now-unused arithmetic interface to the dss. This commit was SVN r17654.	2008-02-28 21:36:51 +00:00
Ralph Castain	8d819cf3d3	Move carto open/close/finalize to opal layer so that ORTE can get access to topo info. This will be used to support a topo grpcomm that optimizes communications in non-uniform topologies like RR. This commit was SVN r17652.	2008-02-28 21:04:30 +00:00
Ralph Castain	5e6928d710	Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message. Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occur outside of a recv. Created two new macros to assist: ORTE_MESSAGE_EVENT: creates the zero-time event, passing info in a new orte_message_event_t object ORTE_PROGRESSED_WAIT: while waiting for specified conditions, just calls progress so messages can be recv'd. Also fixed the failed_launch function as we no longer block in the orted callback function. Updated the error messages to reflect revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output. This has been tested on Mac, TM, and SLURM. This commit was SVN r17647.	2008-02-28 19:58:32 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array with opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
George Bosilca	f256dd6010	Don't free node2_name; it is not yet set at this point. This commit was SVN r17634.	2008-02-28 05:17:20 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Aurelien Bouteiller	6ea23283a8	Added a PRIsize_t constant to help print size_t values without having to cast them to long long explicitly everywhere. This commit was SVN r17626.	2008-02-27 19:38:14 +00:00
Josh Hursey	5e0d17ec99	Forgot a case in which we should check if the checkpoint is ready during the threaded CR builds. MTT caught this by running the IU FT CR test 'inflight', which under certain timing scenarios will trigger this. This commit was SVN r17538.	2008-02-21 13:34:27 +00:00
Josh Hursey	a169575ab2	A quick fix for opal only apps (really this time) This commit was SVN r17537.	2008-02-20 22:33:42 +00:00
Josh Hursey	ad9fbf2a92	a fix for opal only apps This commit was SVN r17536.	2008-02-20 21:17:08 +00:00
Josh Hursey	99144db970	Improve checkpoint/restart support by allowing a checkpoint to progress when the process is not in the MPI library. This involves creating a separate thread for polling for a checkpoint request. This thread is active when the MPI process is not in the MPI library, and paused when the MPI process is in the library. Some MPI C interface files saw some spacing changes to conform to the coding standards of Open MPI. Changed MPI C interface files to use {{{OPAL_CR_ENTER_LIBRARY()}}} and {{{OPAL_CR_EXIT_LIBRARY()}}} instead of just {{{OPAL_CR_TEST_CHECKPOINT_READY()}}}. This will allow the checkpoint/restart system more flexibility in how it is to behave. Fixed the configure check for {{{--enable-ft-thread}}} so it has a known dependence on {{{--enable-mpi-thread}}} (and/or {{{--enable-progress-thread}}}). Added a line for Checkpoint/Restart support to {{{ompi_info}}}. Added some options to choose at runtime whether or not to use the checkpoint polling thread. By default, if the user asked for it to be compiled in, then it is used. But some users will want the ability to toggle its use at runtime. There are still some places for improvement, but the feature works correctly. As always with Checkpoint/Restart, it is compiled out unless explicitly asked for at configure time. Further, if it was configured in, then it is not used unless explicitly asked for by the user at runtime. This commit was SVN r17516.	2008-02-19 22:15:52 +00:00
Rainer Keller	b22e8e7567	- Need stdbool.h if included in userland This commit was SVN r17504.	2008-02-19 00:39:48 +00:00
Rainer Keller	d53131f261	- Need stdbool.h if included in userland; additionally protect stdbool / stdarg.h This commit was SVN r17488.	2008-02-18 08:11:57 +00:00
Aurelien Bouteiller	e7aaf6aa67	Patch to introduce PRI printf constants on architectures that do not provide C99 inttypes.h. Mainly useful on Windows, but might also prove helpful to deal with all the size_t and other size-changing datatypes that used to be cast to long long in printf/opal_output to avoid warnings. This commit was SVN r17451.	2008-02-14 03:31:49 +00:00
Josh Hursey	95c31388e1	It was observed that the component constraint logic is currently only used by the checkpoint/restart feature. Other constraints could be enforced here, but at the moment it is only the checkpointable constraint. So this commit just removes this logic from non-c/r builds. If someone wanted to add a new constraint in the future then there is a comment in the code that directs them a bit. This commit was SVN r17447.	2008-02-13 19:26:25 +00:00
Sharon Melamed	5b2dab2439	Reverted commit # r17443 This commit was SVN r17446. The following SVN revision numbers were found above: r17443 --> open-mpi/ompi@88ce5a2b73	2008-02-13 14:07:12 +00:00
Sharon Melamed	88ce5a2b73	Replaced PLPA with the latest PLPA (plpa-1.1a3r123). This commit was SVN r17443.	2008-02-13 13:09:11 +00:00
Rainer Keller	9cd2c6f48b	- Instead of calling RUNNING_ON_VALGRIND, implement specific function, thereby removing bogus requirement on valgrind/valgrind.h dough... - Call specific function runindebugger() before doing expensive checks on each component of struct. - Get rid of void* warnings.. This commit was SVN r17438.	2008-02-12 20:37:51 +00:00
Rainer Keller	7621800477	- Fix and add comments -- output full name for pd - Protect argument in macro... This commit was SVN r17434.	2008-02-12 16:59:59 +00:00
Rainer Keller	b20f434306	- really minor fix in comment. This commit was SVN r17433.	2008-02-12 16:54:27 +00:00
Shiqing Fan	f5792bbda5	merging the memchecker into trunk. This commit was SVN r17424.	2008-02-12 08:46:27 +00:00
Sharon Melamed	51f8308c68	Added Bi-Directional connection in the carto file. This commit was SVN r17393.	2008-02-07 09:51:19 +00:00
Sharon Melamed	c9f80caf7c	fixed a printing bug in case the carto file is not found. This commit was SVN r17392.	2008-02-07 09:02:23 +00:00
Sharon Melamed	98e8de264d	Wrapped the carto API in carto_base_wrapers.c. This commit was SVN r17380.	2008-02-05 19:29:16 +00:00
Sharon Melamed	9ef46de2f5	Added proper wrapping to the new paffinity APIs. This commit was SVN r17379.	2008-02-05 17:37:17 +00:00
Pak Lui	6900fe36c2	Restore the solaris paffinity with an older but working implementation with processor_bind() instead of the pset_*() implementation that is commented out. There's also a fix for allowing some Sun platforms which have non-contiguous CPU IDs to do processor binding. This commit was SVN r17309.	2008-01-29 16:09:56 +00:00
Ralph Castain	71378305ed	The static-components.h file should never be under svn control - it is dynamically generated during build. Update properties to ignore that file. Update properties to ignore the carto_file_lex.c file since that is also dynamically generated. Update the build-hgignore.pl to properly disregard DS_Store files This commit was SVN r17301.	2008-01-29 14:18:00 +00:00
Sharon Melamed	3374d56739	This file was added to the carto tree by mistake. This file is supposed to be generated by lex. This commit was SVN r17257.	2008-01-27 09:09:55 +00:00
George Bosilca	fc4bb9c87e	Update the generated file. This one was generated using a very recent version of flex (2.5.33). This commit was SVN r17253.	2008-01-26 20:22:57 +00:00
George Bosilca	7dddbe5e29	Protect the system headers. This commit was SVN r17252.	2008-01-26 18:54:27 +00:00
Jeff Squyres	3f94d6a494	Properly qualify the filename. #$%@#%#@!!! This commit was SVN r17229.	2008-01-25 12:04:35 +00:00
George Bosilca	ddcfc78f52	Add the missing header to the header list. This commit was SVN r17222.	2008-01-25 02:28:16 +00:00
George Bosilca	f7e8fda58b	Remove the dependencies on the libopen-pal. Add the visibility attributes. This commit was SVN r17220.	2008-01-25 00:33:55 +00:00
George Bosilca	7b1132b623	Remove some warnings about uninitialized variables (the code was correct but the compilers are not yet that smart). Add the dependency to output.h in order to be able to use opal_output. This commit was SVN r17195.	2008-01-24 00:39:24 +00:00
Sharon Melamed	025b68becf	Move the carto framework to the trunk. This commit was SVN r17177.	2008-01-23 09:20:34 +00:00
Sharon Melamed	526a12620d	Expanded the paffinity interface. Added: map_to_processor_id, map_to_socket_core, max_processor_id, max_socket, max_core. On operating systems other than Linux, those functions will return OPAL_ERR_NOT_SUPPORTED. --This Line, and those below, will be ignored-- M paffinity/linux/paffinity_linux_module.c M paffinity/paffinity.h M paffinity/base/base.h M paffinity/base/paffinity_base_wrappers.c M paffinity/windows/paffinity_windows_module.c M paffinity/solaris/paffinity_solaris_module.c This commit was SVN r17173.	2008-01-22 07:22:24 +00:00
Adrian Knoth	601fb4389d	Cosmetics for r17150. Closes trac:1201 This commit was SVN r17151. The following SVN revision numbers were found above: r17150 --> open-mpi/ompi@4b50f02126 The following Trac tickets were found above: Ticket 1201 --> https://svn.open-mpi.org/trac/ompi/ticket/1201	2008-01-17 12:29:12 +00:00
Adrian Knoth	4b50f02126	Only free res iff it's been allocated before. Re #1201 This patch fixes the segfault, so closing the ticket might be possible. It's a very conservative patch. Perhaps the freeaddrinfo spec says that it will never allocate res in case of errors, but for now, I neither have the spec nor the will to rely on it. This commit was SVN r17150.	2008-01-17 10:01:52 +00:00
Jeff Squyres	cc3805d861	Because opal_list is used in the C++ bindings, where not having "const" in the argument creates [correct] warnings (because __FILE__ is a (const char*)). Plus, opal_object.cls_init_file_name is already (const char*). This commit was SVN r17145.	2008-01-15 23:50:30 +00:00
George Bosilca	7b0e295057	Fix a small memory leak. This commit was SVN r17095.	2008-01-09 20:37:02 +00:00
Gleb Natapov	09de1da7ee	Undefine MORECORE_CANNOT_TRIM. We don't call free() from the callback any more. This commit was SVN r17065.	2008-01-08 10:08:35 +00:00
George Bosilca	3d387bdab9	Add defines for the INT16 min and max value. This commit was SVN r17052.	2008-01-04 23:09:31 +00:00
Jeff Squyres	95fa693273	In r17007, the logic in ompi_pointer_array.c:ompi_pointer_array_set_item() was slightly changed such that the "find the next open slot when the requested index was already open" logic was no longer right -- since the new lowest_free value is not set until ''after'' we look for the next open slot, we need to start searching for the new lowest_free slot at the (index+1) position (not the index position). This commit was SVN r17021. The following SVN revision numbers were found above: r17007 --> open-mpi/ompi@906e8bf1d1	2007-12-21 20:19:55 +00:00
Ralph Castain	401dc49686	Cleanup compiler warnings about comparing signed and unsigned values This commit was SVN r17011.	2007-12-21 14:22:27 +00:00
George Bosilca	906e8bf1d1	Replace the ompi_pointer_array with opal_pointer_array. In the next step (sometime after the merge with the ORTE branch), the opal_pointer_array will become the only pointer_array implementation (the orte_pointer_array will be removed). This commit was SVN r17007.	2007-12-21 06:02:00 +00:00
Jeff Squyres	a1b0914037	Fix prototypes for platforms that fall back to the inline C versions of opal_atomic_[add\|sub]_[32\|64]. This commit was SVN r17005.	2007-12-20 22:13:25 +00:00
Ethan Mallove	2b48f42637	Mark XLC atomics as non-inline. This commit was SVN r16989.	2007-12-18 16:18:49 +00:00
Jeff Squyres	213b5d5c6e	Per long threads on the mailing list and much confusion and discussion about linkers, have all OPAL, ORTE, and OMPI components '''not''' link against the OPAL, ORTE, or OMPI libraries. See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a better-formatted version of the same info). This commit was SVN r16968.	2007-12-15 13:32:02 +00:00
Ethan Mallove	a20a1a806a	Rework of r16807. For opal atomics: * Conditionalize around `static inline` using `OPAL_HAVE_INLINE_ATOMIC` macros * Remove redundant `opal_atomic*` prototypes (they belong in the top-level `sys/atomic.h`) This commit was SVN r16957. The following SVN revision numbers were found above: r16807 --> open-mpi/ompi@b7c885247a	2007-12-14 15:11:35 +00:00
Jon Mason	d77c2430c0	Fix minor spelling error This commit was SVN r16936.	2007-12-11 20:11:03 +00:00
Terry Dontje	351117a254	This commit fixes trac:747 This commit was SVN r16892. The following Trac tickets were found above: Ticket 747 --> https://svn.open-mpi.org/trac/ompi/ticket/747	2007-12-07 15:56:07 +00:00
Jeff Squyres	00131df353	Fix typo in incorrect variable name; only noticed now because someone actually compiled on a system without syslog support (Brian B.). :-) This commit was SVN r16863.	2007-12-06 11:36:44 +00:00
Ethan Mallove	58bcf14f8b	Back r16807 out of sys/atomic.h. This commit was SVN r16825. The following SVN revision numbers were found above: r16807 --> open-mpi/ompi@b7c885247a	2007-12-03 19:32:43 +00:00
Josh Hursey	27c9016b93	sleep -> usleep so we can be a bit more eager when waiting for events to finish. Still working on solutions that do not involve sleeping, but this will do for now. This commit was SVN r16824.	2007-12-03 19:27:32 +00:00
Ethan Mallove	b7c885247a	* Typo: change `__volatile` to `__volatile__`. Some compilers (e.g., gcc) are indifferent about this, while others are more particular (e.g., Sun Studio 12). * Typo: `asms.s` to `asm.s` * Eliminate "foo is multiply-defined" linker errors on Solaris by making the declarations in `opal/sys/atomic.h` agree with their corresponding definitions (use `static inline` in both places). This commit was SVN r16807.	2007-11-30 17:59:12 +00:00
Josh Hursey	bbef304f04	Convert the runtime version checks to be configure time checks (As they should have been from the start). This should fix the nightly build. This commit was SVN r16706.	2007-11-09 06:13:40 +00:00
Josh Hursey	287ca882d3	Only process a checkpoint request from BLCR if this process was the one requesting it. This commit adds a bit of error checking to keep us from participating in a checkpoint that we did not initiate and therefore are not ready for. Thanks to Paul Hargrove and Eric Roman for their help with this. This commit was SVN r16694.	2007-11-08 14:37:11 +00:00
Jeff Squyres	714b409595	Fix an uninitialized variable in the error case. Thanks to Ake Sandgren for pointing out the mistake. This commit was SVN r16682.	2007-11-07 01:52:23 +00:00
Rainer Keller	37c1b6a67e	- As with rev16656, value is not modified. Get rid of compiler warning from g++ - trunk This commit was SVN r16670.	2007-11-06 10:56:06 +00:00
Rainer Keller	9045c5a6f1	- Value pointed to is not modified (file-name / FILE-macro), getting rid of compiler-warning when compiled with trunk of g++: when doing --enable-debug: ../../../../orte/class/orte_pointer_array.h:128: warning: deprecated conversion from string constant to 'char*' This commit was SVN r16656.	2007-11-05 13:03:35 +00:00
Ethan Mallove	005652c9d4	* Embed ident strings into the Open MPI libraries using one of the following methods (in order of precedence): 1. #pragma ident <ident string> (e.g., Intel and Sun) 1. #ident <ident string> (e.g., GCC) 1. static const char ident[] = <ident string> (all others) By default, the ident string used is the standard Open MPI version string. Only the following libraries will get the embedded version strings (e.g., DSOs will not): * libmpi.so * libmpi_cxx.so * libmpi_f77.so * libopen-pal.so * libopen-rte.so * Added two new configure options: * `--with-package-name="STRING"` (defaults to "Open MPI username@hostname Distribution"). `STRING` is displayed by `ompi_info` next to the "Package" heading. * `--with-ident-string="STRING"` (defaults to the standard Open MPI version string - e.g., X.Y.Zr######). `%VERSION%` will expand to the Open MPI version string if it is supplied to this configure option. This commit was SVN r16644.	2007-11-03 02:40:22 +00:00
Jeff Squyres	dd27622814	Fix fd leak noted by Paul Hargrove. http://www.open-mpi.org/community/lists/devel/2007/10/2493.php This commit was SVN r16564.	2007-10-25 16:03:21 +00:00
Josh Hursey	0bf61a1b84	Move in some accumulated small features and minor bug fixes for C/R support. {{{ svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs . }}} This commit was SVN r16478.	2007-10-17 13:47:36 +00:00
Tim Prins	12d3ad4c5c	remove unused and outdated opal message buffer code This commit was SVN r16436.	2007-10-11 22:09:01 +00:00
Josh Hursey	06a30e7f3a	Add a quick check to make sure the BLCR being used has a working cr_request. If it doesn't (version < 0.6.0), then fall back to fork/exec of the cr_checkpoint command. This commit was SVN r16400.	2007-10-09 13:51:28 +00:00
Josh Hursey	7437f37e96	This commit contains the following: * Fix some missing includes in a few places. * Add the cr_request() functionality to the BLCR CRS component. We are now dependent upon the 0.6.* series of BLCR. * Made the CR notification mechanism a registered function. This way we can have an OPAL-only version and it can be replaced at runtime with the ORTE version. * Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only CR functionality when the user wants it. Default: Disabled. * Fix the placement of a checkpoint request check in MPI_Init * Pull the OPAL notification mechanism into the SnapC framework. * We no longer fork/exec the 'opal-checkpoint' command for local checkpointing, the Local coordinator in the orted does this directly. * The Local and Application coordinator talk together bypassing the OPAL notification mechanism. * Optimized the Local <-> App Coordinator communication. * Improved the structure used to track vpid_snapshots in the local coord. * Fix a race condition in which an application under heavy communication load may produce an inconsistent global checkpoint. This commit was SVN r16389.	2007-10-08 20:53:02 +00:00
Torsten Hoefler	e985812e1f	fixing a comment to be more detailed about opal_output_open functionality ... This commit was SVN r16370.	2007-10-06 17:33:57 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. 
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. 
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k * 50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
Josh Hursey	e10f476c87	Bring over the jjh-filem branch which contains a non-blocking FileM interface and implementation. This has shown a drastic performance benefit when transferring many files at roughly the same time. I tested this for many different filem operations and everything was working fine. Let me know if you have any problems with this functionality. Some Notes: - opal-checkpoint now has a 'quiet' flag to keep it from being too verbose. - FileM RSH component is fully non-blocking. - FileM RSH component has incoming connection throttling since by default ssh only allows 10 concurrent scp connections to any single host. This default can be adjusted via an MCA parameter. {{{-mca filem_rsh_max_incomming 10}}} - There is an MCA parameter for max outgoing connections, but it is currently not implemented. If someone needs it then it should not be hard to implement. {{{-mca filem_rsh_max_outgoing 10}}} - Changed the FileM request structure so that it is a bit more explicit and flexible. - Moved the 'preload-binary' and 'preload-files' functionality into odls/base allowing for code reuse in the 'process' and 'default' ODLS components. - Fixed a bug in the process name resolution which broke the 'preload-*' functionality due to GPR table structure changes. - The FileM RSH component might be able to see even more speedup from using a thread pool to operate on the work_pool structures, but that is for future work. - Added a 'opal-show-help' file to ODLS Base This commit was SVN r16252.	2007-09-27 13:13:29 +00:00
Tim Prins	e25bb7f187	Some platforms (such as FreeBSD) need libutil.h included for openpty. Thanks to Karol Mroz for pointing this out. This commit was SVN r16163.	2007-09-19 21:59:22 +00:00
George Bosilca	d1364c53de	Don't allocate the temporary buffer on the stack. It takes way too much space. This commit was SVN r16127.	2007-09-14 02:09:38 +00:00
George Bosilca	2c8c75ef94	Coverity blame list: - Remove memory leaks - uninitialized return This commit was SVN r16126.	2007-09-14 02:08:37 +00:00
George Bosilca	921d79c2b8	Remove a few memory leaks. Close the files when we're done with them. This commit was SVN r16125.	2007-09-14 02:06:26 +00:00
George Bosilca	41ed50f901	Use the secure versions of strncpy and strncat. Release the temporary resources on error. This commit was SVN r16124.	2007-09-14 02:04:34 +00:00
George Bosilca	61989cc4d4	Don't hardcode the length, there is an argument for that. Don't do the NULL check as we already know that tmp cannot be NULL. This commit was SVN r16123.	2007-09-14 02:02:03 +00:00
Josh Hursey	b4735c9719	Remove an old workaround in which we had to 'mv' the checkpoint file after it was taken from the $CWD to the storage directory. Now we just store directly to the storage directory which can reduce NFS traffic if working in that mode. A slight performance boost, but at the point you are using NFS you are paying a penalty anyway. Now you just don't have to pay it twice :) This commit was SVN r16099.	2007-09-12 15:03:21 +00:00
Gleb Natapov	140dce7614	Fix ABA problem in atomic_lifo code. This is a temporary solution for now. We are looking for a better one. This commit was SVN r16091.	2007-09-11 15:40:30 +00:00
Shiqing Fan	a389e61330	- Add some type casts, required by MS compiler. This commit was SVN r16085.	2007-09-11 09:32:11 +00:00
Gleb Natapov	febdade113	Make non threaded OPAL_ATOMIC_CMPSET macros work correctly. This commit was SVN r16071.	2007-09-09 08:00:16 +00:00
Jeff Squyres	3653bfcbe7	This function returns void. This commit was SVN r15934.	2007-08-20 13:12:38 +00:00
Brian Barrett	2b8af283de	Add ability to completely turn off MPI one-sided support, so that users can experiment with using ROMIO directly. This commit was SVN r15922.	2007-08-18 21:35:51 +00:00
Josh Hursey	729c63cf9d	Fix invalid MCA 'base' names so they appear in ompi_info. A subset of this patch needs to be applied to v1.2 Refs trac:928 This commit was SVN r15918. The following Trac tickets were found above: Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928	2007-08-18 03:05:45 +00:00
Brian Barrett	2d4918b09d	Support versions of the Libtool 2.1a snapshots after the lt_dladvise code was brought in. This supersedes the GLOBL patch that we had been using with Libtool 2.1a versions prior to the lt_dladvise code. Autogen tries to figure out which version you're on, so either will now work with the trunk. This commit was SVN r15903.	2007-08-17 04:08:23 +00:00
Brian Barrett	20fe0952f7	compare should compare the framework names as well. Fixes a potential bug in the modex component compare code (thanks to Tim P. for finding the problem) This commit was SVN r15885.	2007-08-16 16:51:41 +00:00
Adrian Knoth	3115816733	Poor off-by-one line error. This now really builds on kFreeBSD. Re #1105 This commit was SVN r15842.	2007-08-13 19:00:18 +00:00
Tim Prins	188771901d	Fix typo. This commit was SVN r15802.	2007-08-08 14:37:50 +00:00
Sven Stork	f22ab47f84	- one more required symbol This commit was SVN r15801.	2007-08-08 13:02:10 +00:00
Sven Stork	3c753a4cf7	- export required symbol This commit was SVN r15800.	2007-08-08 12:57:53 +00:00
Brian Barrett	a48f07b1d9	If we don't have event ops, we don't have a current_base, so don't dereference the pointer (fixes a segfault Josh was seeing). This commit was SVN r15796.	2007-08-07 17:09:54 +00:00
Sven Stork	3a640603a4	- remove wrong va_end This commit was SVN r15789.	2007-08-07 13:32:05 +00:00
Sven Stork	5e257fadbd	- add missing va_end This commit was SVN r15788.	2007-08-07 12:25:20 +00:00
George Bosilca	31dfa5592e	Few clean-ups, few indentations. Nothing really important. This commit was SVN r15767.	2007-08-04 00:44:23 +00:00
George Bosilca	629bacbb07	Don't include the atomic header file, if we're building a non threaded version. This commit was SVN r15766.	2007-08-04 00:43:15 +00:00
George Bosilca	e2f6d69669	Only use one va_list, as it seems that only one is allowed. This commit was SVN r15765.	2007-08-04 00:41:26 +00:00
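
As a rough check of the messaging estimate quoted in the r16364 entry above ("~50 bytes/process" of OOB contact info, a "32k proc job"), the arithmetic can be sketched as follows. This is only an illustration, not part of the repository: the function names are hypothetical, and it assumes "32k" is read as 32,000 with decimal MB/GB, since that reading reproduces the figures in the message.

```python
# Assumption: "32k" means 32,000 and MB/GB are decimal units; this reading
# reproduces the "1.6MBytes" and "51GBytes" figures in the commit message.
PROCS = 32_000
CONTACT_BYTES_PER_PROC = 50  # "~50 bytes/process" quoted in the message


def startup_message_bytes(procs: int, per_proc: int) -> int:
    """Contact info carried in one startup message: one entry per process."""
    return procs * per_proc


def total_traffic_bytes(procs: int, per_proc: int) -> int:
    """Every process receives one such startup message."""
    return startup_message_bytes(procs, per_proc) * procs


if __name__ == "__main__":
    msg = startup_message_bytes(PROCS, CONTACT_BYTES_PER_PROC)
    total = total_traffic_bytes(PROCS, CONTACT_BYTES_PER_PROC)
    print(f"contact info per startup message: {msg / 1e6:.1f} MB")
    print(f"total across all processes:       {total / 1e9:.1f} GB")
```

The total grows quadratically with the process count, which is why removing a small per-process field from the startup message matters at this scale.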


941 commits