openmpi

Author	SHA1	Message	Date
George Bosilca	7dddbe5e29	Protect the system headers. This commit was SVN r17252.	2008-01-26 18:54:27 +00:00
Jeff Squyres	3f94d6a494	Properly qualify the filename. #$%@#%#@!!! This commit was SVN r17229.	2008-01-25 12:04:35 +00:00
George Bosilca	ddcfc78f52	Add the missing header to the header list. This commit was SVN r17222.	2008-01-25 02:28:16 +00:00
George Bosilca	f7e8fda58b	Remove the dependencies on the libopen-pal. Add the visibility attributes. This commit was SVN r17220.	2008-01-25 00:33:55 +00:00
George Bosilca	7b1132b623	Remove some warnings about uninitialized variables (the code was correct but the compilers are not yet that smart). Add the dependency to output.h in order to be able to use opal_output. This commit was SVN r17195.	2008-01-24 00:39:24 +00:00
Sharon Melamed	025b68becf	Move the carto framework to the trunk. This commit was SVN r17177.	2008-01-23 09:20:34 +00:00
Sharon Melamed	526a12620d	Expanded the paffinity interface. Added: map_to_processor_id, map_to_socket_core, max_processor_id, max_socket, max_core. On operating systems other than Linux, those functions will return OPAL_ERR_NOT_SUPPORTED. --This Line, and those below, will be ignored-- M paffinity/linux/paffinity_linux_module.c M paffinity/paffinity.h M paffinity/base/base.h M paffinity/base/paffinity_base_wrappers.c M paffinity/windows/paffinity_windows_module.c M paffinity/solaris/paffinity_solaris_module.c This commit was SVN r17173.	2008-01-22 07:22:24 +00:00
Adrian Knoth	601fb4389d	Cosmetics for r17150. Closes trac:1201 This commit was SVN r17151. The following SVN revision numbers were found above: r17150 --> open-mpi/ompi@4b50f02126 The following Trac tickets were found above: Ticket 1201 --> https://svn.open-mpi.org/trac/ompi/ticket/1201	2008-01-17 12:29:12 +00:00
Adrian Knoth	4b50f02126	Only free res iff it's been allocated before. Re #1201 This patch fixes the segfault, so closing the ticket might be possible. It's a very conservative patch. Perhaps the freeaddrinfo spec says that it will never allocate res in case of errors, but for now, I neither have the spec nor the will to rely on it. This commit was SVN r17150.	2008-01-17 10:01:52 +00:00
Jeff Squyres	cc3805d861	Because opal_list is used in the C++ bindings, where not having "const" in the argument creates [correct] warnings (because __FILE__ is a (const char)). Plus, opal_object.cls_init_file_name is already (const char). This commit was SVN r17145.	2008-01-15 23:50:30 +00:00
George Bosilca	7b0e295057	Fix a small memory leak. This commit was SVN r17095.	2008-01-09 20:37:02 +00:00
Gleb Natapov	09de1da7ee	Undefine MORECORE_CANNOT_TRIM. We don't call free() from the callback any more. This commit was SVN r17065.	2008-01-08 10:08:35 +00:00
George Bosilca	3d387bdab9	Add defines for the INT16 min and max value. This commit was SVN r17052.	2008-01-04 23:09:31 +00:00
Jeff Squyres	95fa693273	In r17007, the logic in ompi_pointer_array.c:ompi_pointer_array_set_item() was slightly changed such that the "find the next open slot when the requested index was already open" logic was no longer right -- since the new lowest_free value is not set until ''after'' we look for the next open slot, we need to start searching for the new lowest_free slot at the (index+1) position (not the index position). This commit was SVN r17021. The following SVN revision numbers were found above: r17007 --> open-mpi/ompi@906e8bf1d1	2007-12-21 20:19:55 +00:00
Ralph Castain	401dc49686	Cleanup compiler warnings about comparing signed and unsigned values. This commit was SVN r17011.	2007-12-21 14:22:27 +00:00
George Bosilca	906e8bf1d1	Replace the ompi_pointer_array with opal_pointer_array. In the next step (sometime after the merge with the ORTE branch), opal_pointer_array will become the only pointer_array implementation (the orte_pointer_array will be removed). This commit was SVN r17007.	2007-12-21 06:02:00 +00:00
Jeff Squyres	a1b0914037	Fix prototypes for platforms that fall back to the inline C versions of opal_atomic_[add|sub]_[32|64]. This commit was SVN r17005.	2007-12-20 22:13:25 +00:00
Ethan Mallove	2b48f42637	Mark XLC atomics as non-inline. This commit was SVN r16989.	2007-12-18 16:18:49 +00:00
Jeff Squyres	213b5d5c6e	Per long threads on the mailing list and much confusion and discussion about linkers, have all OPAL, ORTE, and OMPI components '''not''' link against the OPAL, ORTE, or OMPI libraries. See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a better-formatted version of the same info). This commit was SVN r16968.	2007-12-15 13:32:02 +00:00
Ethan Mallove	a20a1a806a	Rework of r16807. For opal atomics: * Conditionalize around `static inline` using `OPAL_HAVE_INLINE_ATOMIC` macros * Remove redundant `opal_atomic*` prototypes (they belong in the top-level `sys/atomic.h`). This commit was SVN r16957. The following SVN revision numbers were found above: r16807 --> open-mpi/ompi@b7c885247a	2007-12-14 15:11:35 +00:00
Jon Mason	d77c2430c0	Fix minor spelling error. This commit was SVN r16936.	2007-12-11 20:11:03 +00:00
Terry Dontje	351117a254	This commit fixes trac:747. This commit was SVN r16892. The following Trac tickets were found above: Ticket 747 --> https://svn.open-mpi.org/trac/ompi/ticket/747	2007-12-07 15:56:07 +00:00
Jeff Squyres	00131df353	Fix typo in incorrect variable name; only noticed now because someone actually compiled on a system without syslog support (Brian B.). :-) This commit was SVN r16863.	2007-12-06 11:36:44 +00:00
Ethan Mallove	58bcf14f8b	Back r16807 out of sys/atomic.h. This commit was SVN r16825. The following SVN revision numbers were found above: r16807 --> open-mpi/ompi@b7c885247a	2007-12-03 19:32:43 +00:00
Josh Hursey	27c9016b93	sleep -> usleep so we can be a bit more eager when waiting for events to finish. Still working on solutions that do not involve sleeping, but this will do for now. This commit was SVN r16824.	2007-12-03 19:27:32 +00:00
Ethan Mallove	b7c885247a	* Typo: change `__volatile` to `__volatile__`. Some compilers (e.g., gcc) are indifferent about this, while others are more particular (e.g., Sun Studio 12). * Typo: `asms.s` to `asm.s` * Eliminate "foo is multiply-defined" linker errors on Solaris by making the declarations in `opal/sys/atomic.h` agree with their corresponding definitions (use `static inline` in both places). This commit was SVN r16807.	2007-11-30 17:59:12 +00:00
Josh Hursey	bbef304f04	Convert the runtime version checks to be configure time checks (as they should have been from the start). This should fix the nightly build. This commit was SVN r16706.	2007-11-09 06:13:40 +00:00
Josh Hursey	287ca882d3	Only process a checkpoint request from BLCR if this process was the one requesting it. This commit adds a bit of error checking to keep us from participating in a checkpoint that we did not initiate and therefore are not ready for. Thanks to Paul Hargrove and Eric Roman for their help with this. This commit was SVN r16694.	2007-11-08 14:37:11 +00:00
Jeff Squyres	714b409595	Fix an uninitialized variable in the error case. Thanks to Ake Sandgren for pointing out the mistake. This commit was SVN r16682.	2007-11-07 01:52:23 +00:00
Rainer Keller	37c1b6a67e	- As with rev16656, value is not modified. Get rid of compiler warning from g++ - trunk This commit was SVN r16670.	2007-11-06 10:56:06 +00:00
Rainer Keller	9045c5a6f1	- Value pointed to is not modified (file-name / FILE-macro), getting rid of compiler-warning when compiled with trunk of g++: when doing --enable-debug: ../../../../orte/class/orte_pointer_array.h:128: warning: deprecated conversion from string constant to 'char*' This commit was SVN r16656.	2007-11-05 13:03:35 +00:00
Ethan Mallove	005652c9d4	* Embed ident strings into the Open MPI libraries using one of the following methods (in order of precedence): 1. #pragma ident <ident string> (e.g., Intel and Sun) 2. #ident <ident string> (e.g., GCC) 3. static const char ident[] = <ident string> (all others) By default, the ident string used is the standard Open MPI version string. Only the following libraries will get the embedded version strings (e.g., DSOs will not): * libmpi.so * libmpi_cxx.so * libmpi_f77.so * libopen-pal.so * libopen-rte.so * Added two new configure options: * `--with-package-name="STRING"` (defaults to "Open MPI username@hostname Distribution"). `STRING` is displayed by `ompi_info` next to the "Package" heading. * `--with-ident-string="STRING"` (defaults to the standard Open MPI version string - e.g., X.Y.Zr######). `%VERSION%` will expand to the Open MPI version string if it is supplied to this configure option. This commit was SVN r16644.	2007-11-03 02:40:22 +00:00
Jeff Squyres	dd27622814	Fix fd leak noted by Paul Hargrove. http://www.open-mpi.org/community/lists/devel/2007/10/2493.php This commit was SVN r16564.	2007-10-25 16:03:21 +00:00
Josh Hursey	0bf61a1b84	Move in some accumulated small features and minor bug fixes for C/R support. {{{ svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs . }}} This commit was SVN r16478.	2007-10-17 13:47:36 +00:00
Tim Prins	12d3ad4c5c	Remove unused and outdated opal message buffer code. This commit was SVN r16436.	2007-10-11 22:09:01 +00:00
Josh Hursey	06a30e7f3a	Add a quick check to make sure the BLCR being used has a working cr_request. If it doesn't (version < 0.6.0), then fall back to fork/exec of the cr_checkpoint command. This commit was SVN r16400.	2007-10-09 13:51:28 +00:00
Josh Hursey	7437f37e96	This commit contains the following: * Fix some missing includes in a few places. * Add the cr_request() functionality to the BLCR CRS component. We are now dependent upon the 0.6.* series of BLCR. * Made the CR notification mechanism a registered function. This way we can have an OPAL-only version and it can be replaced at runtime with the ORTE version. * Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only CR functionality when the user wants it. Default: Disabled. * Fix the placement of a checkpoint request check in MPI_Init * Pull the OPAL notification mechanism into the SnapC framework. * We no longer fork/exec the 'opal-checkpoint' command for local checkpointing, the Local coordinator in the orted does this directly. * The Local and Application coordinator talk together bypassing the OPAL notification mechanism. * Optimized the Local <-> App Coordinator communication. * Improved the structure used to track vpid_snapshots in the local coord. * Fix a race condition in which an application under heavy communication load may produce an inconsistent global checkpoint. This commit was SVN r16389.	2007-10-08 20:53:02 +00:00
Torsten Hoefler	e985812e1f	fixing a comment to be more detailed about opal_output_open functionality ... This commit was SVN r16370.	2007-10-06 17:33:57 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point.
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted.
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
Josh Hursey	e10f476c87	Bring over the jjh-filem branch which contains a non-blocking FileM interface and implementation. This has shown drastic performance benefit when transferring many files at roughly the same time. I tested this for many different filem operations and everything was working fine. Let me know if you have any problems with this functionality. Some Notes: - opal-checkpoint now has a 'quiet' flag to keep it from being too verbose. - FileM RSH component is fully non-blocking. - FileM RSH component has incoming connection throttling since by default ssh only allows 10 concurrent scp connections to any single host. This default can be adjusted via an MCA parameter. {{{-mca filem_rsh_max_incomming 10}}} - There is an MCA parameter for max outgoing connections, but it is currently not implemented. If someone needs it then it should not be hard to implement. {{{-mca filem_rsh_max_outgoing 10}}} - Changed the FileM request structure so that it is a bit more explicit and flexible. - Moved the 'preload-binary' and 'preload-files' functionality into odls/base allowing for code reuse in the 'process' and 'default' ODLS components. - Fixed a bug in the process name resolution which broke the 'preload-*' functionality due to GPR table structure changes. - The FileM RSH component might be able to see even more speedup from using a thread pool to operate on the work_pool structures, but that is for future work. - Added a 'opal-show-help' file to ODLS Base This commit was SVN r16252.	2007-09-27 13:13:29 +00:00
Tim Prins	e25bb7f187	Some platforms (such as FreeBSD) need libutil.h included for openpty. Thanks to Karol Mroz for pointing this out. This commit was SVN r16163.	2007-09-19 21:59:22 +00:00
George Bosilca	d1364c53de	Don't allocate the temporary buffer on the stack. It takes way too much space. This commit was SVN r16127.	2007-09-14 02:09:38 +00:00
George Bosilca	2c8c75ef94	Coverity blame list: - Remove memory leaks - uninitialized return This commit was SVN r16126.	2007-09-14 02:08:37 +00:00
George Bosilca	921d79c2b8	Remove a few memory leaks. Close the files when we're done with them. This commit was SVN r16125.	2007-09-14 02:06:26 +00:00
George Bosilca	41ed50f901	Use secure version of strncpy and strncat. Release the temporary resources on error. This commit was SVN r16124.	2007-09-14 02:04:34 +00:00
George Bosilca	61989cc4d4	Don't hardcode the length, there is an argument for that. Don't do the NULL check as we already know that tmp cannot be NULL. This commit was SVN r16123.	2007-09-14 02:02:03 +00:00
Josh Hursey	b4735c9719	Remove an old workaround in which we had to 'mv' the checkpoint file after it was taken from the $CWD to the storage directory. Now we just store directly to the storage directory which can reduce NFS traffic if working in that mode. A slight performance boost, but at the point you are using NFS you are paying a penalty anyway. Now you just don't have to pay it twice :) This commit was SVN r16099.	2007-09-12 15:03:21 +00:00
Gleb Natapov	140dce7614	Fix ABA problem in atomic_lifo code. This is a temporary solution for now. We are looking for a better one. This commit was SVN r16091.	2007-09-11 15:40:30 +00:00
Shiqing Fan	a389e61330	- Add some type casts, required by MS compiler. This commit was SVN r16085.	2007-09-11 09:32:11 +00:00
Gleb Natapov	febdade113	Make non threaded OPAL_ATOMIC_CMPSET macros work correctly. This commit was SVN r16071.	2007-09-09 08:00:16 +00:00
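
The routing saving quoted in the r16364 message (commit 54b2cf747e) can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes "32k" means 32,000 processes (the reading that reproduces the quoted totals), and the function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the routing figures quoted in r16364.
# Assumption: "32k" is read as 32,000 processes.
PROCS = 32_000
CONTACT_BYTES_PER_PROC = 50  # "~50 bytes/process" from the commit message


def startup_message_bytes(procs: int, bytes_per_proc: int) -> int:
    """Contact info carried by one startup message (one entry per proc)."""
    return procs * bytes_per_proc


def total_bytes(procs: int, bytes_per_proc: int) -> int:
    """Total traffic when every proc receives one such startup message."""
    return startup_message_bytes(procs, bytes_per_proc) * procs


per_message = startup_message_bytes(PROCS, CONTACT_BYTES_PER_PROC)
total = total_bytes(PROCS, CONTACT_BYTES_PER_PROC)
print(f"{per_message / 1e6:.1f} MB per message, {total / 1e9:.1f} GB in total")
```

Under that assumption the result is 1.6 MB per message and 51.2 GB overall, in line with the "1.6MBytes" and "51GBytes" stated in the message.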
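
The off-by-one described in the r17021 message (commit 95fa693273) can be illustrated with a toy model. This is not the Open MPI code: the class and method names are hypothetical, and it assumes the slot is written only after the search for the next free index, which is how the message is read here.

```python
class PointerArray:
    """Toy slot table that tracks the lowest free index (hypothetical model)."""

    def __init__(self, size: int) -> None:
        self.slots = [None] * size
        self.lowest_free = 0

    def _next_free(self, start: int) -> int:
        """Return the first free index at or after start, or len(slots) if full."""
        for i in range(start, len(self.slots)):
            if self.slots[i] is None:
                return i
        return len(self.slots)

    def set_item(self, index: int, value) -> None:
        if value is not None and index == self.lowest_free:
            # slots[index] has not been written yet, so it still looks free.
            # Searching from index would return index itself; start at index + 1.
            self.lowest_free = self._next_free(index + 1)
        elif value is None and index < self.lowest_free:
            # Clearing a slot below the current mark makes it the new lowest.
            self.lowest_free = index
        self.slots[index] = value


table = PointerArray(4)
table.set_item(0, "a")   # lowest_free moves past slot 0
table.set_item(2, "c")   # slot 1 is still the lowest free slot
table.set_item(1, "b")   # search starts at 2, skips it, lands on 3
print(table.lowest_free)
```

Starting the search at index instead of index + 1 in this model would leave lowest_free pointing at the slot that is about to be filled.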

856 Commits