openmpi

Author	SHA1	Message	Date
Lenny Verkhovsky	f4811d6c4d	NUMA Awareness support. Gleb's patch This commit was SVN r18658.	2008-06-15 13:43:28 +00:00
George Bosilca	170b9c344e	Mea culpa. I forgot to initialize the max_data before the call to the convertor. This commit was SVN r18651.	2008-06-12 17:24:39 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
George Bosilca	e361bcb64c	Send optimizations. 1. The send path gets shorter. The BTL is allowed to return > 0 to specify that the descriptor was pushed to the networks, and that the memory attached to it is available again for the upper layer. The MCA_BTL_DES_SEND_ALWAYS_CALLBACK flag can be used by the PML to force the BTL to always trigger the callback. Unmodified BTLs will continue to work as expected, as they will return OMPI_SUCCESS, which forces the PML to have exactly the same behavior as before. Some BTLs have been modified: self, sm, tcp, mx. 2. Add send immediate interface to BTL. The idea is to have a mechanism of allowing the BTL to take advantage of send optimizations such as the ability to deliver data "inline". Some network APIs such as Portals allow data to be sent using a "thin" event without packing data into a memory descriptor. This interface change allows the BTL to use such capabilities and allows for other optimizations in the future. All existing BTLs except for Portals and sm have this interface set to NULL. This commit was SVN r18551.	2008-05-30 03:58:39 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 2. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not the opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active.
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Gleb Natapov	6844ff32ba	Return OMPI_ERR_RESOURCE_BUSY from sm->btl_send() function if there is no place in cb. This will prevent OB1 from doing early completion of small sends. This commit was SVN r18424.	2008-05-12 07:15:29 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Gleb Natapov	9b6db25182	Fix compilation warning. This commit was SVN r17839.	2008-03-17 13:37:57 +00:00
Gleb Natapov	f488b94899	More SM BTL initialization cleanups. This commit was SVN r17833.	2008-03-16 10:01:56 +00:00
Gleb Natapov	772772b944	Remove unneeded include. This commit was SVN r17813.	2008-03-12 10:01:20 +00:00
Gleb Natapov	90c70e37b9	Clean up SM btl startup code. Remove no-longer-needed code left over from the time when there were two SM BTLs. Remove old and no longer correct comment. This commit was SVN r17805.	2008-03-11 14:39:10 +00:00
Gleb Natapov	ffa09c44fd	Pass correct pointer to mpool_base function. This commit was SVN r17795.	2008-03-09 13:22:12 +00:00
Gleb Natapov	b0b21c68b4	Remove trailing spaces from SM BTL. This commit was SVN r17794.	2008-03-09 13:17:13 +00:00
George Bosilca	fa31ec81d0	Add the ownership flags to the PML/BTL interface. The layer owning the descriptor is responsible for releasing it once the descriptor is not in use anymore. This commit was SVN r17497.	2008-02-18 17:39:30 +00:00
George Bosilca	6310ce955c	The first patch related to the Active Message stuff. So far, here is what we have: - the registration array is now global instead of one per BTL. - each framework has to declare the entries in the registration array reserved. Then it has to define the internal way of sharing (or not) these entries between all components. As an example, the PML will not share as there is only one active PML at any moment, while the BTLs will have to. The tag is 8 bits long, the first 3 are reserved for the framework while the remaining 5 are used internally by each framework. - The registration function is optional. If a BTL does not provide such a function, nothing happens. However, in the case where such a function is provided in the BTL structure, it will be called by the BML when a tag is registered. Now, it's time for the second step... Converting OB1 from a switch based PML to an active message one. This commit was SVN r17140.	2008-01-15 05:32:53 +00:00
Gleb Natapov	8b511b969d	Introduce a new BTL parameter btl_rndv_eager_limit which determines size of a first fragment of rendezvous protocol. Remove no longer used btl_min_send_size parameter. This commit was SVN r16969.	2007-12-16 08:35:17 +00:00
Gleb Natapov	e2e211f23b	Add flags parameter to btl_alloc() and btl_prepare_src() functions. If BTL knows at the time of allocation priority of a descriptor it may do some optimizations. This commit was SVN r16901.	2007-12-09 14:08:01 +00:00
Gleb Natapov	7364b7cf47	Add endpoint parameter to btl_alloc() function. Enables various optimizations inside BTL. This commit was SVN r16898.	2007-12-09 14:00:42 +00:00
Rich Graham	27a748e7eb	change all instances of ompi_free_list_init to ompi_free_list_init_new. Header and payload data are specified separately at this stage. This commit was SVN r16633.	2007-11-01 23:38:50 +00:00
Gleb Natapov	63dde87076	If SM BTL cannot send fragment because the cyclic buffer is full put the fragment on the pending list and send it later instead of spinning on opal_progress(). This commit was SVN r16537.	2007-10-22 12:07:22 +00:00
Gleb Natapov	435e7d80e9	Remove rc parameter from MCA_BTL_SM_FIFO_WRITE() macro. It cannot fail in current implementation. This commit was SVN r16015.	2007-08-30 13:21:52 +00:00
Rich Graham	bc97d22182	remove tabs. Remove old code that was commented out. This commit was SVN r15975.	2007-08-28 03:08:36 +00:00
Rich Graham	4d58f9aed7	Add comments. Move temporary receive object from a free list object to a stack object. This commit was SVN r15971.	2007-08-27 21:41:04 +00:00
Sven Stork	21f12f29f8	- fix an sm bug that causes segfaults in the case of threaded builds. The problem is that in the case of threaded builds, for every fifo a head and tail lock will be allocated inside the shared memory segment and the ptr is stored inside the fifo. In the case that the sm backend file is mapped in all processes at the same address (mostly the case for non-threaded builds) this is fine, but in the cases when the processes map the file at different addresses these addresses cause big trouble in processes other than the one that allocated the locks. Therefore the send lock addresses have to be recalculated to match the local mapping of the processes that use them. This commit was SVN r15291.	2007-07-05 14:26:32 +00:00
Gleb Natapov	b88b7dedfe	Rename btl_rdma_offset to btl_pipeline_send_length. This commit was SVN r15153.	2007-06-21 07:12:40 +00:00
Galen Shipman	3401bd2b07	Add optional ordering to the BTL interface. This is required to tighten up the BTL semantics. Ordering is not guaranteed, but, if the BTL returns an order tag in a descriptor (other than MCA_BTL_NO_ORDER) then we may request another descriptor that will obey ordering w.r.t. the other descriptor. This will allow sane behavior for RDMA networks, where local completion of an RDMA operation on the active side does not imply remote completion on the passive side. If we send a FIN message after local completion and the FIN is not ordered w.r.t. the RDMA operation then badness may occur as the passive side may now try to deregister the memory and the RDMA operation may still be pending on the passive side. Note that this has no impact on networks that don't suffer from this limitation as the ORDER tag can simply always be specified as MCA_BTL_NO_ORDER. This commit was SVN r14768.	2007-05-24 19:51:26 +00:00
Gleb Natapov	3ebaff8dfe	Implement new BTL parameters: We eagerly send data up to btl_*_eager_limit with the match. Upon ACK of the MATCH we start using send/receives of size btl_*_max_send_size up to the btl_*_rdma_pipeline_offset. After the btl_*_rdma_pipeline_offset we begin using RDMA writes of size btl_*_rdma_pipeline_frag_size. Now, on a per message basis we only use the above protocol if the message is larger than btl_*_min_rdma_pipeline_size. btl_*_eager_limit -> same; btl_*_max_send_size -> same; btl_*_rdma_pipeline_offset -> btl_*_min_rdma_size; btl_*_rdma_pipeline_frag_size -> btl_*_max_rdma_size; btl_*_min_rdma_pipeline_size is new. This patch also moves all BTL common parameters initialisation into btl_base_mca.c file. This commit was SVN r14681.	2007-05-17 07:54:27 +00:00
Li-Ta Lo	ec8a859a44	fixed typo This commit was SVN r14207.	2007-04-03 17:21:54 +00:00
Gleb Natapov	e5450613b5	Add new SM BTL parameter btl_sm_cb_max_num. If set to a value greater than zero it limits the number of circular buffers allocated between each pair of peers. This allows for tighter memory usage control. This commit was SVN r14120.	2007-03-22 12:21:42 +00:00
Gleb Natapov	efe0323d35	Initialize fifos at SM BTL init time instead of waiting for first send. This wastes slightly more memory, but prevents problems when a fifo cannot be allocated later during a job run because memory is exhausted. This commit was SVN r14119.	2007-03-22 12:18:44 +00:00
Gleb Natapov	c389c47d79	Fix SM connectivity calculations. This commit was SVN r14109.	2007-03-21 13:29:19 +00:00
Gleb Natapov	a1a14aa4c3	Add memory barriers during SM btl initialization. This commit was SVN r14099.	2007-03-21 10:25:10 +00:00
Gleb Natapov	e551c5f1a3	Get rid of separate sm BTL for different shared memory base addresses. Now that we precalculate most of the addresses there is no point in having a separate BTL for this. The sm_progress() code becomes much simpler as a result. This commit was SVN r14071.	2007-03-20 08:15:58 +00:00
Josh Hursey	dadca7da88	Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). This merge adds Checkpoint/Restart support to Open MPI. The initial frameworks and components support a LAM/MPI-like implementation. This commit follows the risk assessment presented to the Open MPI core development group on Feb. 22, 2007. This commit closes trac:158 More details to follow. This commit was SVN r14051. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r13912 The following Trac tickets were found above: Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158	2007-03-16 23:11:45 +00:00
Gleb Natapov	be018944d2	Clean up circular buffer implementation. Get rid of _same_base_address() functions by pre-calculating everything in advance. This commit was SVN r13923.	2007-03-05 14:27:26 +00:00
Gleb Natapov	8078ae5977	Optimize sm communication. Pass message type (MCA_BTL_SM_FRAG_ACK/ MCA_BTL_SM_FRAG_SEND) and status success/fail in low bits of pointers we are passing through circular buffer. The rank that receives ACK doesn't need to look into data it received and this is a big win since this data is not in the cache of the rank's CPU. (Note that we can use low bits of pointers because free_list always return pointers aligned at least to cache line size). This commit was SVN r13922.	2007-03-05 14:24:09 +00:00
Gleb Natapov	90fb58de4f	When frags are allocated from mpool by free_list the frag structure is also allocated from mpool memory (which is registered memory for RDMA transports). This is not a problem for small jobs, but for a large number of ranks the amount of wasted memory is big. This commit was SVN r13921.	2007-03-05 14:17:50 +00:00
Gleb Natapov	4d4b0a022a	Add error callback to sm BTL. Call it when allocation of the initial circular buffer fails. If cb is already allocated, but it is full and allocation of additional cb fails, we spin waiting for receiver to free space in existing cb. This commit was SVN r13635.	2007-02-13 12:01:36 +00:00
George Bosilca	b611e6d7dc	Less warnings. This commit was SVN r13419.	2007-02-01 17:51:43 +00:00
Brian Barrett	58b325b03f	Two changes to improve the sm situation with spawn: * have the mpool size be based on MCW, not num procs in other jobs we know about. Solves the problem of the spawned job having a much bigger than needed sm file * Can't assume that "me" is in the list of procs passed to addprocs, so need to use slightly different logic and not go through all of add procs unless there's a proc in my job that isn't me. This seems to greatly improve the situation, although there still seems to be more of a slowdown through MPI_INIT for the children (if there are more than one child) than MPI_INIT for the parent if there are 'n' children compared to 'n' parents. Hopefully that made sense ;) This commit was SVN r13417.	2007-02-01 17:18:35 +00:00
George Bosilca	126a68dc9a	Big datatype commit. Remove all unused features of the datatype engine. As the memory allocation logic is completely done outside the data-type engine (in the PML) there is no need for any special case inside the data-type engine. There are fewer arguments for ompi_convertor_pack and ompi_convertor_unpack as well (the last field free_after is not required anymore as there is no memory allocated in the engine itself). This change affects all components using datatypes. I tested most of them, but it might happen that I missed some ... If that's the case please let me know (don't shoot the pianist!!). This commit was SVN r12331.	2006-10-26 23:11:26 +00:00
George Bosilca	d7268557a8	Complete the SM BTL changes. Now all displacements are ptrdiff_t and there are no warnings about any issue with signed/unsigned. This commit was SVN r12234.	2006-10-20 19:28:12 +00:00
George Bosilca	c86214f420	Fix the SM BTL issues. The problem seems to come from the fact that the maximum number of nodes on the SM file should be signed, as we use the -1 to unlimit it. This commit was SVN r12227.	2006-10-20 17:25:53 +00:00
George Bosilca	06563b5dec	Last set of explicit conversions. We are now close to zero warnings on all platforms. The only exceptions (and I will not deal with them anytime soon) are on Windows: - the write functions which require the length to be an int when it's a size_t on all UNIX variants. - all iovec manipulation functions where the iov_len is again an int when it's a size_t on most of the UNIXes. As these only happen on Windows, I think we're set for now :) This commit was SVN r12215.	2006-10-20 03:57:44 +00:00
Brian Barrett	51b2a0fd3f	A couple of changes to improve shared memory behavior when resources get constrained: * Make sure we always have a number of eager fragments available that scales with the number of processes communicating with a given proc over shared memory * Use FREE_LIST_GET instead of FREE_LIST_WAIT to return an error to the PML when resource exhaustion occurs * Don't dereference the frag during alloc unless we're sure it's not NULL Reviewed by: Galen Refs trac:413 This commit was SVN r12053. The following Trac tickets were found above: Ticket 413 --> https://svn.open-mpi.org/trac/ompi/ticket/413	2006-10-06 21:13:49 +00:00
George Bosilca	3f0a7cad9e	The last patch for Windows support. Mostly casting and conversion to C++ friendly headers. This commit was SVN r11400.	2006-08-24 16:38:08 +00:00
Galen Shipman	e5c594c211	More updates for the async error handler for BTLs. In order to provide backwards compatibility, the framework versions are bumped and the handler registration function is at the end of the btl struct. Testing done on sm, openib, and gm. This commit was SVN r11256.	2006-08-17 22:02:01 +00:00
Galen Shipman	3b49953ce2	Add error callback to the btl interface; this allows errors to be delivered to the upper layer asynchronously, although there are some issues with this, such as there being multiple consumers of the BTLs.. who gets the This commit was SVN r11232.	2006-08-16 20:21:38 +00:00
George Bosilca	fd39203262	As the self proc is marked as local, there will always be at least one local proc. Don't create the SM file until we really know there is someone else on the same node. This commit was SVN r10740.	2006-07-11 17:05:13 +00:00
Galen Shipman	c79efc9efb	track which list a fragment came from, allows returning based on list, not on size. This commit was SVN r10142.	2006-05-31 14:24:32 +00:00
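Commit 8078ae5977 in the log above passes the message type and a success/fail status in the low bits of fragment pointers, relying on the free list returning pointers aligned to at least a cache line. A minimal Python sketch of that idea follows, using plain integers in place of pointers. The 64-byte alignment, the bit assignments, the numeric values of `FRAG_SEND`/`FRAG_ACK`, and the helper names `pack`/`unpack` are all assumptions made for illustration; they are not taken from the sm BTL source.

```python
# Sketch of carrying flags in the low bits of an aligned address.
# Assumption: addresses are aligned to 64 bytes, so the 6 low bits are zero.
CACHE_LINE = 64
LOW_MASK = CACHE_LINE - 1

# Hypothetical encodings (the real MCA_BTL_SM_FRAG_* values may differ).
FRAG_SEND = 0x0
FRAG_ACK = 0x1
STATUS_FAIL_BIT = 0x2


def pack(addr, frag_type, ok=True):
    """Combine an aligned address with a type bit and a status bit."""
    if addr & LOW_MASK:
        raise ValueError("address is not cache-line aligned")
    return addr | frag_type | (0 if ok else STATUS_FAIL_BIT)


def unpack(word):
    """Recover (address, frag_type, ok) from a packed word."""
    addr = word & ~LOW_MASK          # clear the low bits to get the address back
    frag_type = word & 0x1           # bit 0: SEND or ACK
    ok = not (word & STATUS_FAIL_BIT)  # bit 1: set on failure
    return addr, frag_type, ok


if __name__ == "__main__":
    word = pack(0x1000, FRAG_ACK, ok=False)
    print(hex(word), unpack(word))
```

The receiver of an ACK only needs `unpack` on the word it pulled from the circular buffer; it never has to touch the fragment memory itself, which is the cache benefit the commit message describes.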


76 commits
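Commit 6310ce955c in the log above describes an 8-bit active-message tag in which 3 bits are reserved for the framework and the remaining 5 are used internally by that framework. The sketch below shows one way such a tag could be composed and split. It assumes the framework bits are the high-order ones (the log only says "the first 3"), and the names `make_tag` and `split_tag` are hypothetical.

```python
# Sketch of an 8-bit tag: 3 framework bits + 5 framework-internal bits.
# Assumption: the framework id occupies the high-order 3 bits.
FRAMEWORK_BITS = 3
LOCAL_BITS = 5


def make_tag(framework_id, local_id):
    """Build an 8-bit tag from a framework id (0..7) and a local id (0..31)."""
    if not 0 <= framework_id < (1 << FRAMEWORK_BITS):
        raise ValueError("framework_id out of range")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id out of range")
    return (framework_id << LOCAL_BITS) | local_id


def split_tag(tag):
    """Return (framework_id, local_id) for an 8-bit tag."""
    return tag >> LOCAL_BITS, tag & ((1 << LOCAL_BITS) - 1)


if __name__ == "__main__":
    tag = make_tag(2, 5)
    print(hex(tag), split_tag(tag))
```

With this layout each of up to 8 frameworks gets 32 tags to share among its components, which matches the commit's note that the PML need not share its entries while the BTLs must.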