Commit graph

2486 commits

Author SHA1 Message Date
Jeff Squyres
ba5615a18f Merge in /tmp-public/cpc3 branch to trunk. oob/xoob still remains the
default CPC.

This commit was SVN r18356.
2008-05-02 11:52:33 +00:00
Donald Kerr
843a35094f adding local work queue accounting
This commit was SVN r18352.
2008-05-01 21:01:51 +00:00
George Bosilca
a69ac964df Allow any order in the list of Elan vpid.
This commit was SVN r18350.
2008-05-01 20:32:03 +00:00
Josh Hursey
dcd21d7d07 Some checkpoint/restart fixes in response to r18338 (changes in modex).
Things should be working now.

This commit was SVN r18348.

The following SVN revision numbers were found above:
  r18338 --> open-mpi/ompi@3e55fe6f6d
2008-05-01 17:48:13 +00:00
Ralph Castain
3e55fe6f6d Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit.
Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs.

Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node.

This commit was SVN r18338.
2008-04-30 19:49:53 +00:00
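The node-name encoding mentioned in the r18338 message can be illustrated with a minimal sketch. This is not the Open MPI implementation: the `(prefix, width, start, count)` tuple and the `encode_nodenames` / `decode_nodenames` names are hypothetical, chosen only to show how a contiguous numbering scheme lets a whole node list travel as a few values instead of one string per node.

```python
import re


def encode_nodenames(names):
    """Return (prefix, width, start, count) if names are contiguous, else None.

    Hypothetical format for illustration; it only handles names of the form
    <prefix><number> where the numbers increase by one.
    """
    if not names:
        return None
    match = re.fullmatch(r"(.*?)(\d+)", names[0])
    if match is None:
        return None
    prefix, digits = match.group(1), match.group(2)
    width, start = len(digits), int(digits)
    # Verify that every name follows the same prefix and zero-padded numbering.
    for offset, name in enumerate(names):
        if name != f"{prefix}{start + offset:0{width}d}":
            return None
    return (prefix, width, start, len(names))


def decode_nodenames(encoded):
    """Rebuild the full list of node names from the compact tuple."""
    prefix, width, start, count = encoded
    return [f"{prefix}{start + i:0{width}d}" for i in range(count)]
```

A list such as `node001` through `node128` collapses to a single four-field tuple, and lists that do not fit the scheme are reported as `None` so a caller could fall back to sending the names verbatim.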
Pavel Shamis
61cc8843bf r17940 broke the XRC code.
The endpoint may be appended to a list during XOOB connection bring-up.

This commit was SVN r18328.

The following SVN revision numbers were found above:
  r17940 --> open-mpi/ompi@ebfdd133f5
2008-04-29 13:22:40 +00:00
Galen Shipman
ced88a338b include portals modex fun in the distro
This commit was SVN r18325.
2008-04-28 18:51:54 +00:00
Brad Penoff
c699236be2 updating SCTP BTL to configure properly with FreeBSD 7
This commit was SVN r18324.
2008-04-28 04:19:10 +00:00
George Bosilca
6e6c370917 Roll back r18274, as it's legal to have a sequence number smaller than the
expected one. It doesn't necessarily mean the message is duplicated;
it can simply signify that the message is out of sequence and the counter
overflowed.

This commit was SVN r18323.

The following SVN revision numbers were found above:
  r18274 --> open-mpi/ompi@73c9de3af9
2008-04-27 18:35:54 +00:00
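The point made in the r18323 message, that a numerically smaller sequence number need not be a duplicate, can be sketched with wraparound-aware comparison. This is an illustration only, not the code touched by the commit: the 16-bit counter width and the `seq_distance` / `classify` names are assumptions.

```python
SEQ_BITS = 16            # assumed counter width, for illustration only
SEQ_MOD = 1 << SEQ_BITS  # number of distinct sequence values
SEQ_HALF = SEQ_MOD >> 1  # distances at or beyond this count as "behind"


def seq_distance(seq, expected):
    """Signed distance from expected to seq, taken modulo the counter width."""
    diff = (seq - expected) % SEQ_MOD
    return diff - SEQ_MOD if diff >= SEQ_HALF else diff


def classify(seq, expected):
    """Label a received sequence number relative to the expected one."""
    distance = seq_distance(seq, expected)
    if distance == 0:
        return "in-order"
    return "ahead" if distance > 0 else "behind"
```

With these assumptions, `classify(3, 65530)` reports the message as ahead of the expected one even though 3 is numerically smaller, because the counter wrapped in between; a plain `seq < expected` test would have flagged it wrongly.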
Aurelien Bouteiller
611d52fa95 Fix a bug that prevented using the same port (as returned by Open_port) for several Comm_accept calls.
This commit was SVN r18303.
2008-04-25 20:41:44 +00:00
Aurelien Bouteiller
c20b020ea6 Fix ticket #1275. The pml v can now be correctly deactivated on the configure command line. Also fix a dist target under some unusual circumstances.
This commit was SVN r18291.
2008-04-24 21:42:54 +00:00
Josh Hursey
2c736873bb Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors.
The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge.

The fix is to have the OMPI level shut down its tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.

Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.

 * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level.
 * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
 * Update ft_event functions in PML and BML to handle the new restart state.
 * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.

This commit was SVN r18276.
2008-04-24 17:54:22 +00:00
George Bosilca
3ccac4f803 Oops ...
This commit was SVN r18275.
2008-04-24 15:54:52 +00:00
George Bosilca
73c9de3af9 Bark if we got a wrong sequence number. Here wrong means that the
seq number is smaller than what we expect.

This commit was SVN r18274.
2008-04-24 15:48:43 +00:00
Rich Graham
4d1ae7b05f accidentally made a change in the wrong place.
This commit was SVN r18262.
2008-04-23 17:32:05 +00:00
Rich Graham
293dd6ad4e add myself to list of people building this module.
This commit was SVN r18261.
2008-04-23 17:25:36 +00:00
Rich Graham
7658cc79e4 Pass in the correct module to the reduction call.
This commit was SVN r18260.
2008-04-23 17:23:30 +00:00
Adrian Knoth
c53d3c3c22 reverted r18169,r18170 due to connection reset by peer on odin/sif
This commit was SVN r18255.

The following SVN revision numbers were found above:
  r18169 --> open-mpi/ompi@20473bfda2
  r18170 --> open-mpi/ompi@d34dfbe12c
2008-04-23 15:26:15 +00:00
Josh Hursey
cc83d41ad9 Merge in tmp/jjh-scratch
{{{
 svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch .
}}}

Contains:
 * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart.
 * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff
 * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P.
 * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry
 * Some other sundry cleanup items all dealing with C/R functionality in the trunk.

This commit was SVN r18241.
2008-04-23 00:17:12 +00:00
Tim Mattox
0215474cb8 Fix two bugs in coll_sm_module.c from bit-rot:
Fixed a selection bug, and removed a bogus "free(proc)" call
which ultimately caused MPI_Finalize to crash.

This commit was SVN r18235.
2008-04-22 18:41:21 +00:00
Jeff Squyres
c40740947f Fix minor spelling error.
This commit was SVN r18229.
2008-04-22 13:11:50 +00:00
Galen Shipman
27c425b304 make portals level ack's optional (require ACK by default)
This commit was SVN r18228.
2008-04-21 22:22:18 +00:00
Rich Graham
df35223603 add selection logic for barrier and reduce.
This commit was SVN r18215.
2008-04-19 22:40:04 +00:00
Rich Graham
bee8b42f29 remove debug code that would not let people run.
Add infrastructure for blocking-barrier.

This commit was SVN r18214.
2008-04-19 01:34:04 +00:00
Galen Shipman
92e3b8671f nasty memory bug...
This commit was SVN r18207.
2008-04-18 03:01:53 +00:00
Ralph Castain
fa082cafa9 Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex.
Note: this is an early preliminary step in the movement of portions of the datatype engine to the opal layer.

This commit was SVN r18198.
2008-04-17 20:43:56 +00:00
Tim Prins
eb94fa48ce the port name is only relevant at the root, so only look at it there.
This commit was SVN r18188.
2008-04-17 12:37:10 +00:00
Tim Prins
3582e11200 cleanup some warnings on 32 bit systems
This commit was SVN r18187.
2008-04-17 12:25:05 +00:00
Rich Graham
6c77fa4921 add a blocking shared memory algorithm.
This commit was SVN r18185.
2008-04-16 22:10:23 +00:00
Ralph Castain
7b91f8baff Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory.
Fix the ompi-server -h cmd line option so it actually tells you something!

Add two new testing codes to the orte/test/mpi area: accept and connect.

This commit was SVN r18176.
2008-04-16 14:27:42 +00:00
Shiqing Fan
1c4c7e0f2f Add memchecker support for osc rdma communication.
This commit was SVN r18173.
2008-04-16 13:29:55 +00:00
Shiqing Fan
79da2fdd2c Use the new memchecker convertor function.
Remove some unnecessary memchecker calls.

This commit was SVN r18172.
2008-04-16 13:24:35 +00:00
Adrian Knoth
d34dfbe12c fixed misleading comment.
This commit was SVN r18170.
2008-04-16 11:26:15 +00:00
Adrian Knoth
20473bfda2 on incoming connections, compare with every possible source address.
Rationale (taken from the code):

    /* This is PITA. We never know which source address an 
    * incoming/outgoing packet will have, so even with 
    * btl_tcp_if_include/exclude on the remote end, we 
    * might get a different source address. 
    * 
    * If this address isn't included in btl_proc->proc_addrs, 
    * we would erroneously drop the connection 
    */ 

merge -r18165:18167 to the trunk.

This commit was SVN r18169.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r18165
  r18167
2008-04-16 11:24:09 +00:00
Adrian Knoth
e981a259bb btl_tcp_disable_family=4 and btl_tcp_disable_family=6 are mutually
exclusive, so this should result in "unreachable" when set differently
between peers.

This commit was SVN r18168.
2008-04-16 10:14:58 +00:00
Adrian Knoth
75c54616c7 renamed opal_sockaddr2str to opal_net_get_hostname for WANT_PEER_DUMP=1
This commit was SVN r18154.
2008-04-15 19:23:47 +00:00
Jeff Squyres
72af302360 Remove unused variable.
This commit was SVN r18151.
2008-04-15 14:58:32 +00:00
Aurelien Bouteiller
0f311ed824 Make sure the function returns NULL when no elan adapter is available instead of a random value.
This commit was SVN r18136.
2008-04-11 21:03:01 +00:00
Aurelien Bouteiller
20592cbcbf Fixes a warning about mallocing 0 bytes when no elan adapter is available.
This commit was SVN r18135.
2008-04-11 20:59:12 +00:00
Rich Graham
249445d61f added reduce-scatter followed by gather to root.
This commit was SVN r18133.
2008-04-11 13:49:08 +00:00
Rich Graham
a6bdbfab97 implement allreduce as reduce-scatter, followed by an allgather.
This commit was SVN r18132.
2008-04-11 04:06:29 +00:00
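The decomposition named in the r18132 message (allreduce as a reduce-scatter followed by an allgather) can be sketched as a single-process simulation. This is not the shared-memory collective from the commit: the function names and the representation of each rank's buffer as a Python list are assumptions made for illustration.

```python
import operator


def reduce_scatter(vectors, op):
    """Rank r ends up owning chunk r of the element-wise reduction."""
    nranks = len(vectors)
    length = len(vectors[0])
    # Split the index range into nranks contiguous, possibly uneven, chunks.
    bounds = [(r * length) // nranks for r in range(nranks + 1)]
    chunks = []
    for r in range(nranks):
        lo, hi = bounds[r], bounds[r + 1]
        reduced = list(vectors[0][lo:hi])
        for vec in vectors[1:]:
            reduced = [op(a, b) for a, b in zip(reduced, vec[lo:hi])]
        chunks.append(reduced)
    return chunks


def allgather(chunks):
    """Every rank receives the concatenation of all ranks' chunks."""
    full = [value for chunk in chunks for value in chunk]
    return [list(full) for _ in chunks]


def allreduce(vectors, op=operator.add):
    """Allreduce built from reduce-scatter followed by allgather."""
    return allgather(reduce_scatter(vectors, op))
```

Each simulated rank reduces only its own slice of the data and then shares the result, which is the reason this decomposition spreads the reduction work across ranks instead of concentrating it at a root.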
Jon Mason
08ead87604 Potential double free of locks
mca_btl_openib_endpoint_post_rr_nolock is freeing the endpoint lock in
the error case, but most/all of the functions calling it free the lock
regardless of the error case, resulting in a double free of the
lock.

This commit was SVN r18131.
2008-04-10 21:15:01 +00:00
Rich Graham
70f3aab5f2 remove some code that is not needed.
This commit was SVN r18128.
2008-04-10 17:32:04 +00:00
Rich Graham
5c7db1e315 remove 2 race conditions in the buffer recycling logic.
This commit was SVN r18127.
2008-04-10 17:20:52 +00:00
Edgar Gabriel
4964434205 reverting commit 18122, since the commit was executed accidentally in the
wrong directory. The UH copyrights do belong in this file (i.e. because of
the fix which is in the 1.2 branch, the UH copyright notes are in the header
there already), but I want to have the proper log for that.

This commit was SVN r18124.
2008-04-10 15:09:31 +00:00
Edgar Gabriel
f87830767a the verification of recvcount==0 and rank = root was breaking
inter-communicator scatter, since the root (root==MPI_ROOT) might very well
have recvcount=0. The same fix has been applied to gather.c just the other way
round. 
 
Fixes the bug reported on the mailing list by Martin Audet. If there is a
1.2.7, this fix might be worth porting over.

Please note, that while the test works now for basic and for inter, we get a
0byte malloc warning from the inter module, which we still have to fix in a
separate patch.

This commit was SVN r18122.
2008-04-10 14:58:51 +00:00
Ralph Castain
3a0d09300b Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations.
Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study.

This commit was SVN r18115.
2008-04-09 22:10:53 +00:00
Rich Graham
c6783549ef getting old
This commit was SVN r18110.
2008-04-09 16:55:16 +00:00
Rich Graham
1a20c3ce51 more debug.
This commit was SVN r18109.
2008-04-09 16:19:52 +00:00
Rich Graham
e7e18303f6 more debug.
This commit was SVN r18108.
2008-04-09 15:10:58 +00:00
Rich Graham
b14c6b17d5 adding debug output.
This commit was SVN r18107.
2008-04-09 13:32:01 +00:00
Rich Graham
10434fb2f1 add barrier synchronization at the end of the module init, to
avoid initializing shared memory variables in use.

This commit was SVN r18105.
2008-04-09 03:44:40 +00:00
Rich Graham
19bb1a2e86 fix initialization bug.
This commit was SVN r18104.
2008-04-08 23:34:06 +00:00
Donald Kerr
38e298cc9a report error message in all libs, not just debug
This commit was SVN r18103.
2008-04-08 22:58:28 +00:00
Rich Graham
a69a8d9626 initialize the flags.
This commit was SVN r18102.
2008-04-08 22:16:39 +00:00
Rich Graham
8765a2bbdd more debug code.
This commit was SVN r18101.
2008-04-08 20:38:20 +00:00
Rich Graham
08becf33b5 add more debugging.
This commit was SVN r18100.
2008-04-08 18:44:50 +00:00
Rich Graham
aa1b7dd406 more debug
This commit was SVN r18099.
2008-04-08 03:56:47 +00:00
Rich Graham
0c18bdeff7 more debug code.
This commit was SVN r18098.
2008-04-08 03:04:20 +00:00
Rich Graham
9d5a7238df Add some debugging code.
This commit was SVN r18097.
2008-04-07 23:20:15 +00:00
Rich Graham
fa696734d5 add some debug code.
This commit was SVN r18096.
2008-04-07 21:03:23 +00:00
Shiqing Fan
28746bbcdb Remove the memchecker macro in pml base request, used in req_wait.c, which actually is in the wrong place. Instead, one simple call from send_request_free and recv_request_free(already done) will do all the work, fast and clean.
This commit was SVN r18095.
2008-04-07 17:46:50 +00:00
Shiqing Fan
a1e5df1cc9 Use the new memchecker function call which is based on convertor.
Remove one unnecessary call.

This commit was SVN r18085.
2008-04-07 07:52:04 +00:00
Gleb Natapov
713a27dc71 Counter of created RDMA channels should be incremented immediately after channel
creation (not in control message completion) otherwise more than max_eager_rdma
channel may be created.

This commit was SVN r18082.
2008-04-06 13:48:45 +00:00
Rich Graham
1b54e8b76e fix buffer management for nb-barrier.
This commit was SVN r18081.
2008-04-05 21:59:04 +00:00
Tim Prins
313edd8955 - Fix a problem reported on the users list where we would segfault in finalize after calling spawn if the user did not call MPI_Comm_disconnect
- Fix the app context constructor so it initializes all the fields.

This commit was SVN r18079.
2008-04-04 15:07:39 +00:00
Jeff Squyres
7072a32703 * Properly protect XRC stuff
* A few minor style fixes

This commit was SVN r18076.
2008-04-02 19:52:03 +00:00
Rich Graham
94f8fd365c a few reduction optimizations. Add bcast.
This commit was SVN r18075.
2008-04-02 19:02:33 +00:00
George Bosilca
a00ca20446 More cleanups.
This commit was SVN r18069.
2008-04-02 06:38:33 +00:00
George Bosilca
944453c4c1 Cleanups.
This commit was SVN r18068.
2008-04-02 06:37:42 +00:00
Rich Graham
eb5d6096f1 add reduction routine - fix buffer recycling logic which was totally
broken.

This commit was SVN r18065.
2008-04-01 22:56:18 +00:00
Jeff Squyres
d944d5ec52 Just in case something goes drastically wrong, don't segv.
This commit was SVN r18049.
2008-03-31 21:55:07 +00:00
George Bosilca
b4f828f389 We need a newline at the end of the file, or some compilers bark.
This commit was SVN r18023.
2008-03-30 19:05:56 +00:00
Gleb Natapov
b42234461a Cleanup shared file creation on unix/linux.
This commit was SVN r18021.
2008-03-30 13:41:47 +00:00
Jeff Squyres
d0f12f3df0 Make a better error message.
This commit was SVN r18014.
2008-03-29 12:54:24 +00:00
Rich Graham
90e53ca9ee debug the pipeline algorithm.
This commit was SVN r18008.
2008-03-28 15:10:07 +00:00
Aurelien Bouteiller
77653ac787 Missing .h file in makefile broke nightly tarball distcheck...
This commit was SVN r18006.
2008-03-28 14:36:56 +00:00
Aurelien Bouteiller
c16339944a Fix a coverity warning about using unsafe sprintf.
This commit was SVN r17999.
2008-03-27 21:24:27 +00:00
Aurelien Bouteiller
e11237aadb Introduction of the "progress" sender_based method to replace the slow isend-self method.
This commit was SVN r17998.
2008-03-27 21:19:45 +00:00
Aurelien Bouteiller
93db01871e This is part of the previous patch.
This commit was SVN r17997.
2008-03-27 21:06:14 +00:00
Aurelien Bouteiller
f8bf6f2c6a Code cleanup.
sender_based.h is now split in two files, to solve cyclic .h files inclusion. 
Most macros are now inline functions.
Variable names have been changed in various places.
Various other small things... 

This commit was SVN r17996.
2008-03-27 21:05:44 +00:00
George Bosilca
be4b153f0d Another patch for thread safety in the TCP BTL (thanks to Pierre).
This commit was SVN r17993.
2008-03-27 18:36:08 +00:00
Gleb Natapov
cf40674369 Decide if sends should be throttled at the receiver and pass this to the sender
in an ACK message. The decision can't be done reliably at the sender.

This commit was SVN r17987.
2008-03-27 08:56:43 +00:00
Rich Graham
e2ad9c4be2 adjust to change in orte_process_info.
This commit was SVN r17986.
2008-03-27 01:25:28 +00:00
Rich Graham
441fb9fb9e checkpoint.
This commit was SVN r17985.
2008-03-27 01:16:32 +00:00
Ralph Castain
90107f3c14 Fix an issue with comm_spawn over who sent/recv first in the modex. The modex assumes that the first name on the list is the "root" that will serve as the allgather collector/distributor. The dpm was putting that entity last, which forced us to pre-inform the parent procs of the child proc's contact info since the parent was trying to send to the child.
Clarify the setting of send_first in the mpi bindings (trivial, i know, but helpful)

Remove the extra xcast of child contact info to the parent job.

This commit was SVN r17952.
2008-03-25 14:57:34 +00:00
Ralph Castain
cca449e379 Move an OMPI RML tag to the OMPI layer
This commit was SVN r17950.
2008-03-25 13:30:48 +00:00
Jeff Squyres
5320c91ab3 Oops -- fix the constructor to also use opal_object_t instead of
opal_list_item_t.

This commit was SVN r17945.
2008-03-25 11:59:50 +00:00
Galen Shipman
0116041133 BTL shouldn't own the passive side's descriptor in the PML get protocol. The BTL
doesn't know when to free it on the passive side. 

This commit was SVN r17943.
2008-03-25 01:43:41 +00:00
Jeff Squyres
ebfdd133f5 AFAICT, we never put endpoints on a list.
This commit was SVN r17940.
2008-03-24 18:32:55 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Rich Graham
a7c836a2b0 fix location of the restrict key word.
Make the tag in the fan-in/fan-out algorithm be fragment based.

This commit was SVN r17903.
2008-03-21 01:40:36 +00:00
Rich Graham
2c66d396b7 take care of some bit-rot with the fanin-fanout method.
This commit was SVN r17902.
2008-03-21 01:08:49 +00:00
Rich Graham
b9520e61dc get the sm optimized allreduce working for all but user defined
operations.  Added to the reduction operations a set of reduction
functions that take 2 input buffers and one output buffer to avoid
some extra memory copies.  These can't be used with user defined
operations.  The intel c collective suite passes both original, and
new (new, not the user defined operations).

This commit was SVN r17901.
2008-03-20 23:51:16 +00:00
Galen Shipman
dcac824f59 Fix problem in releasing fragments during GET_END event (didn't check that
portals btl has ownership and therefore didn't free the frag as it should); this
causes leakage and hangs in MPI_Finalize.

Also added a bit more debugging. 

This commit was SVN r17900.
2008-03-20 22:46:32 +00:00
George Bosilca
efa89bfa3f Revert r17857. The context should be set in one case ... when we call prepare_{src|dst}
without calling a get or put. So, just keep it here until a better solution is
found.

This commit was SVN r17872.

The following SVN revision numbers were found above:
  r17857 --> open-mpi/ompi@d460ccfbf9
2008-03-18 19:01:27 +00:00
George Bosilca
8943ae0b4e Cleanup plus some typos.
This commit was SVN r17858.
2008-03-18 03:03:33 +00:00
George Bosilca
d460ccfbf9 No need to check for NULL there. The bml_btl is set correctly
on the upper level.

This commit was SVN r17857.
2008-03-18 03:02:31 +00:00
George Bosilca
39353ebb44 Cleanup.
This commit was SVN r17855.
2008-03-18 02:56:50 +00:00
George Bosilca
76deec135e The .h file is not used anymore (it contained the descriptor cache). Update the
Makefile.am file as well.

This commit was SVN r17854.
2008-03-18 02:50:24 +00:00