openmpi

Author	SHA1	Message	Date
George Bosilca	7a63f9b730	I somehow messed up my last commit. Sorry. This commit was SVN r16543.	2007-10-22 15:08:17 +00:00
George Bosilca	b93f72bdfd	Remove 2 warnings about uninitialized i and quit_flags. This commit was SVN r16542.	2007-10-22 15:01:15 +00:00
Jeff Squyres	5637c7a5a0	In addition to r16513, this commit fixes trac:1170. If we cannot resolve the route to the peer that we're trying to send to, don't queue up the message in the TCP OOB -- instead, return it to the upper layer (e.g., the RML) and let it decide what to do. In the case of the routed RML, the tree component will queue it up for later transmission. Hence, we don't want the message queued up both here in the TCP OOB and the tree routed. Also see some more discussion / explanation in #1171. This commit was SVN r16540. The following SVN revision numbers were found above: r16513 --> open-mpi/ompi@7ae9589d70 The following Trac tickets were found above: Ticket 1170 --> https://svn.open-mpi.org/trac/ompi/ticket/1170	2007-10-22 13:46:57 +00:00
Gleb Natapov	63dde87076	If SM BTL cannot send fragment because the cyclic buffer is full put the fragment on the pending list and send it later instead of spinning on opal_progress(). This commit was SVN r16537.	2007-10-22 12:07:22 +00:00
Rainer Keller	42d6cf27c3	- In ompi_request_init(): Change order of initialization as in the declaration Add missing initialization of req_persistent and req_mpi_object to ompi_request_empty and ompi_request_null. This commit was SVN r16536.	2007-10-22 11:28:49 +00:00
Rich Graham	0de9bd9fa0	when attaching an md for posted receive, generate a start event, so that PtlMDUpdate will pick up all incoming events. This commit was SVN r16517.	2007-10-19 19:09:40 +00:00
Jeff Squyres	7ae9589d70	The header is at the address of the buffer pointed to by the iov, not the address of the iov. This commit was SVN r16513.	2007-10-19 12:40:14 +00:00
Jeff Squyres	abf1b728b9	Minor code maintenance fix -- put the THREAD_UNLOCK outside the if statement so that you only have to have it once. This commit was SVN r16512.	2007-10-19 12:36:26 +00:00
Ralph Castain	73eeb7f0d2	Fix a bug in the way we handled buffer releases and the conditioned wait that held us in the xcast until completed. This commit was SVN r16504.	2007-10-19 01:17:01 +00:00
Gleb Natapov	52c6160252	MCA_PML_BASE_REQUEST_MPI_COMPLETE() macro does nothing except call to ompi_request_complete(). Remove the macro and call the function directly. This commit was SVN r16498.	2007-10-18 14:20:24 +00:00
George Bosilca	aa20a94b6f	Remove warning about an unused variable. This commit was SVN r16497.	2007-10-18 13:48:56 +00:00
Gleb Natapov	4f865e22e8	We have two different versions of ompi_request_complete: one as a function, another as a macro. Make it one inline function. This commit was SVN r16495.	2007-10-18 13:02:27 +00:00
Gleb Natapov	e0a3a7e53e	Move duplicated code all over the code to a single function ompi_request_wait_completion(). This commit was SVN r16494.	2007-10-18 12:33:21 +00:00
Gleb Natapov	807f49ed7f	If there is more than one BTL present we may divide payload between them in such a way that converter will not be able to pack some of it. This commit adds handling of such cases. If converter can't pack any data for a BTL the data is sent over another BTL that has data to send. This commit was SVN r16493.	2007-10-18 12:07:37 +00:00
George Bosilca	df80d21e04	Get rid of the recv_context field. Instead we can rely on the unique_id, which is shared between the DLL and the parallel debugger. This commit was SVN r16492.	2007-10-17 22:07:38 +00:00
George Bosilca	938be44f07	Complete the removal of the mvapi BTL. This commit was SVN r16491.	2007-10-17 22:02:52 +00:00
Jeff Squyres	b7eeae0a74	Remove the mvapi BTL. Woo hoo! This commit was SVN r16483.	2007-10-17 14:08:03 +00:00
Josh Hursey	0bf61a1b84	Move in some accumulated small features and minor bug fixes for C/R support. {{{ svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs . }}} This commit was SVN r16478.	2007-10-17 13:47:36 +00:00
Jeff Squyres	c143858998	Various updates to the not-yet-released stuff. This commit was SVN r16463.	2007-10-16 16:57:30 +00:00
Ralph Castain	6f498c0964	Since we no longer set the APP_NUM attribute in the case of a singleton, protect the code so we don't crash in that case. This commit was SVN r16461.	2007-10-16 16:17:48 +00:00
Ralph Castain	ec5fe78876	When in the unity message routing mode, we have to update the RML contact info in the parent procs so that they know how to talk to the children. Ideally, this would be done in the MPI layer since that layer knows which procs are actively involved in the comm_spawn. However, it isn't being done there, which causes comm_spawn to fail, so do it explicitly in the RTE. Note that this means ALL procs in the parent job are updated, even though they may not be participating in the comm_spawn. This doesn't really hurt anything - just unnecessary. Comm_spawn still has a problem when a child process shares a node with a parent, so this doesn't fix everything. It only fixes the bug of ensuring all procs know how to talk to each other. This commit was SVN r16460.	2007-10-16 16:09:41 +00:00
Ralph Castain	713b6e13a5	Improve diagnostic output messages when errors are hit This commit was SVN r16457.	2007-10-16 14:51:52 +00:00
Josh Hursey	ea0652d20f	If we are going to pretend to do filem, then we should always pretend. No one should be using this feature except for me. :) This commit was SVN r16454.	2007-10-15 20:04:35 +00:00
Ralph Castain	b6196e8a39	When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed. In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean. I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it. This commit was SVN r16451.	2007-10-15 18:00:30 +00:00
Jeff Squyres	94b1e9cff9	Update to use BTL_VERBOSE and BTL_ERROR instead of opal_output'ing to the mca_btl_base_output stream directly (and relying on it to be -1 if we didn't want any output). This commit was SVN r16449.	2007-10-15 17:53:02 +00:00
Jeff Squyres	423f23eb6a	Fixes trac:1160. There is still some other problem in the OOB, but we wanted to commit this to get wider testing. This commit was SVN r16445. The following Trac tickets were found above: Ticket 1160 --> https://svn.open-mpi.org/trac/ompi/ticket/1160	2007-10-15 15:41:36 +00:00
Josh Hursey	f16a42947a	Change some default MCA parameters: - Global snapshot directory = $HOME - FileM 'rsh' = 'ssh' - FileM 'rcp' = 'scp' This commit was SVN r16444.	2007-10-15 15:21:17 +00:00
Rolf vandeVaart	3dd5196338	Remove the --mca btl_base_debug flag and clean up the use of the --mca btl_base_verbose flag. The btl framework now matches all the other frameworks. Slightly modify error messages for clarity. This commit was SVN r16443.	2007-10-15 13:10:20 +00:00
Gleb Natapov	1330974e5e	eager_limit is no longer needed in OB1 PML. Remove it. This commit was SVN r16442.	2007-10-15 09:26:42 +00:00
George Bosilca	436b0f2a5b	Way too many numbers in this uint32_t. This commit was SVN r16437.	2007-10-12 13:11:55 +00:00
Tim Prins	12d3ad4c5c	remove unused and outdated opal message buffer code This commit was SVN r16436.	2007-10-11 22:09:01 +00:00
George Bosilca	e9aa15f9d5	On behalf of Ralf Wildenhues: config/ompi_check_visibility.m4 (OMPI_CHECK_VISIBILITY): Rename ompi_vc_cc_fvisibility to ompi_cv_cc_fvisibility, so that it will be cached. This commit was SVN r16435.	2007-10-11 22:06:39 +00:00
Josh Hursey	520c27ac94	If the HNP is acting as the orted for local launch then the gpr_replica variable is not defined. Make sure to set it to something reasonable so that file preloading still works (instead of seg faulting :) Thanks to Hiep Bui Hoang for reporting this bug. This commit was SVN r16433.	2007-10-11 19:47:04 +00:00
Josh Hursey	e483c36cea	Remove a bit of debug in filem/rsh that should have never been committed. A gesture towards overlapping file removal with metadata update. This commit was SVN r16432.	2007-10-11 19:37:33 +00:00
Josh Hursey	31e9369e8b	Fix orterun so it does not get influenced by an application's argv set. For example, if I have an application that, internal to the application, takes the argument '-mca foo bar' we do not want orterun to pick up this argument and pass it through the system. So the following {{{ shell$ mpirun -np 2 -mca btl tcp,self ./myapp -mca foo bar }}} orterun should pick up {{{-mca btl tcp,self}}} but not {{{-mca foo bar}}} which it was previous to this commit. I tested command line runs and runs with app files to confirm this patch works. This commit was SVN r16431.	2007-10-11 18:33:40 +00:00
George Bosilca	1299ed433e	Don't release the ODLS twice. This commit was SVN r16430.	2007-10-11 17:30:03 +00:00
Jeff Squyres	3500376d9e	Remove a warning about an unused label. This commit was SVN r16429.	2007-10-11 16:38:37 +00:00
Ralph Castain	3dbd4d9be7	Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by: 1. taking advantage of the fact that we no longer create the launch message via a GPR trigger. In earlier times, we had the GPR create the launch message based on a subscription. In that mode of operation, we could not guarantee the order in which the data was stored in the message - hence, we had no choice but to parse the message in a loop that checked each value against a list of possible "keys" until the corresponding value was found. Now, however, we construct the message "by hand", so we know precisely what data is in each location in the message. Thus, we no longer need to send the character string "keys" for each data value any more. This represents a rather large savings in the message size - to give you an example, we typically would use a 30-char "key" for a 2-byte data value. As you can see, the overhead can become very large. 2. sending node-specific data only once. Again, because we used to construct the message via subscriptions that were done on a per-proc basis, the data for each node (e.g., the daemon's name, whether or not the node was oversubscribed) would be included in the data for each proc. Thus, the node-specific data was repeated for every proc. Now that we construct the message "by hand", there is no reason to do this any more. Instead, we can insert the data for a specific node only once, and then provide the per-proc data for that node. We therefore not only save all that extra data in the message, but we also only need to parse the per-node data once. The savings become significant at scale. 
Here is a comparison between the revised trunk and the trunk prior to this commit (all data was taken on odin, using openib, 64 nodes, unity message routing, tested with application consisting of mpi_init/mpi_barrier/mpi_finalize, all execution times given in seconds, all launch message sizes in bytes):

Per-node scaling, taken at 1ppn:

#nodes   original trunk     revised trunk
         time    size       time    size
1        0.10    819        0.09    564
2        0.14    1070       0.14    677
3        0.15    1321       0.14    790
4        0.15    1572       0.15    903
8        0.17    2576       0.20    1355
16       0.25    4584       0.21    2259
32       0.28    8600       0.27    4067
64       0.50    16632      0.39    7683

Per-proc scaling, taken at 64 nodes:

ppn      original trunk     revised trunk
         time    size       time    size
1        0.50    16669      0.40    7720
2        0.55    32733      0.54    11048
3        0.87    48797      0.81    14376
4        1.0     64861      0.85    17704

Condensing those numbers, it appears we gained:

per-node message size: 251 bytes/node -> 113 bytes/node
per-proc message size: 251 bytes/proc -> 52 bytes/proc
per-job message size: 568 bytes/job -> 399 bytes/job (job-specific data such as jobid, override oversubscribe flag, total #procs in job, total slots allocated)

The fact that the two pre-commit trunk numbers are the same confirms the fact that each proc was containing the node data as well. It isn't quite the 10x message reduction I had hoped to get, but it is significant and gives much better scaling. Note that the timing info was, as usual, pretty chaotic - the numbers cited here were typical across several runs taken after the initial one to avoid NFS file positioning influences. Also note that this commit removes the orte_process_info.vpid_start field and the handful of places that passed that useless value. By definition, all jobs start at vpid=0, so all we were doing is passing "0" around. In fact, many places simply hardwired it to "0" anyway rather than deal with it. This commit was SVN r16428.	2007-10-11 15:57:26 +00:00
Rolf vandeVaart	25c95c9ee9	Fix build on solaris. Need to include sys/wait.h. This commit was SVN r16426.	2007-10-11 15:04:30 +00:00
Jeff Squyres	e2df42eea3	Move the <sys/wait.h> below "orte_config.h" This commit was SVN r16424.	2007-10-11 11:31:09 +00:00
George Bosilca	7cc9f588a8	Decorate the base functions with ORTE_DECLSPEC. This commit was SVN r16423.	2007-10-11 00:02:49 +00:00
Ralph Castain	53af94fd87	Modify the configure system so that gridengine support is only built in specific conditions: 1. --with-sge, always builds 2. --without-sge, never builds 3. if neither is specified, build if and only if either SGE_ROOT is set or "qrsh" is found in the path This commit was SVN r16422.	2007-10-10 21:39:16 +00:00
Josh Hursey	6e5341c659	Forgot to move a header in the code movement. This commit was SVN r16420.	2007-10-10 15:39:40 +00:00
Ralph Castain	82a8e2d10d	Reorganize the odls framework to place common functionality in the base, thus making maintenance easier. We still need this to be a framework as some environments (e.g., bproc) require significantly different functionality. However, there is quite a bit of commonality across the components, so this ensures that fixes in one get propagated across the others. This patch also fixes a minor bug discovered along the way: we had "lost" the passing of the oversubscribed condition flag from the mapper to the orteds. Thus, we were not setting sched_yield correctly when in oversubscribed conditions (except when a hostfile was specified - different logic there because we treat the number of slots allocated on the node as "uncertain") I did not modify the process component in this patch - I will send a proposed patch to the maintainers of that component so they can review it first. This commit was SVN r16418.	2007-10-10 15:02:10 +00:00
Josh Hursey	7f833a9cb2	silence a warning that is triggered on restart This commit was SVN r16417.	2007-10-10 14:25:49 +00:00
Ethan Mallove	d0b61db65c	Add in a missing #include for Solaris builds. This commit was SVN r16416.	2007-10-10 12:49:15 +00:00
George Bosilca	e3105a85be	Don't require a progress function from the PML. If there is one then the PML base will take care of the registration with the event library. Otherwise (and this applies to the CM case), the MTLs are in charge of registering their own progress function. This commit was SVN r16415.	2007-10-09 23:28:53 +00:00
Josh Hursey	aa8391f888	Local and global coordinators should be the only ones involved in the movement of checkpoint files. This reduces the overhead on the application. This commit was SVN r16412.	2007-10-09 19:52:47 +00:00
Galen Shipman	6a25a635de	that shouldn't have slipped through.. This commit was SVN r16411.	2007-10-09 19:07:23 +00:00
Galen Shipman	fda1306807	revert my stupidity.. This commit was SVN r16410.	2007-10-09 19:01:20 +00:00
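
Note on commit 3dbd4d9be7 (SVN r16428): the condensed bytes/node, bytes/proc, and bytes/job figures in that message can be re-derived from the endpoints of its two size tables. The sketch below is not part of the commit. The variable and function names are illustrative, and the per-job decomposition (first-row size minus the per-node and per-proc shares) is an assumed reading that happens to match the quoted numbers.

```python
# Launch-message sizes in bytes, copied from the commit message tables.
# Each value is (original trunk, revised trunk).
per_node_sizes = {1: (819, 564), 64: (16632, 7683)}     # keyed by #nodes, at 1 ppn
per_proc_sizes = {1: (16669, 7720), 4: (64861, 17704)}  # keyed by ppn, at 64 nodes
NODES = 64  # node count used for the per-proc runs


def growth(table, lo, hi, col):
    """Average size increase per unit between the rows keyed lo and hi."""
    return (table[hi][col] - table[lo][col]) / (hi - lo)


# Bytes added per extra node (each node runs one proc in this table).
node_orig = growth(per_node_sizes, 1, 64, 0)
node_new = growth(per_node_sizes, 1, 64, 1)

# Bytes added per extra proc: growth per ppn step covers 64 procs at once.
proc_orig = growth(per_proc_sizes, 1, 4, 0) / NODES
proc_new = growth(per_proc_sizes, 1, 4, 1) / NODES

# Assumed decomposition of the 1-node, 1-proc message:
# original: job data + one combined node/proc record
# revised:  job data + one node record + one proc record
job_orig = per_node_sizes[1][0] - node_orig
job_new = per_node_sizes[1][1] - node_new - proc_new

print(f"per-node: {node_orig:.0f} -> {node_new:.0f} bytes")
print(f"per-proc: {proc_orig:.0f} -> {proc_new:.0f} bytes")
print(f"per-job:  {job_orig:.0f} -> {job_new:.0f} bytes")
```

With the table values as given, this yields 251 -> 113 bytes/node, 251 -> 52 bytes/proc, and 568 -> 399 bytes/job, in agreement with the commit message.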


10550 Commits