Tim Prins
eb94fa48ce
the port name is only relevant at the root, so only look at it there.
...
This commit was SVN r18188.
2008-04-17 12:37:10 +00:00
Tim Prins
3582e11200
cleanup some warnings on 32 bit systems
...
This commit was SVN r18187.
2008-04-17 12:25:05 +00:00
Tim Prins
b2acb51d04
make comm_join work again. Allocate memory to the correct pointer.
...
This commit was SVN r18186.
2008-04-17 11:56:53 +00:00
Rich Graham
6c77fa4921
add a blocking shared memory algorithm.
...
This commit was SVN r18185.
2008-04-16 22:10:23 +00:00
Ralph Castain
eb27e4f23d
Move the reissuing of the daemon recv to occur after the message actually gets processed. This ensures that we don't get multiple messages trying to be processed at the same time.
...
Add one more debug output to see where messages are heading
This commit was SVN r18183.
2008-04-16 20:41:00 +00:00
Ralph Castain
66e532669a
Remove some dead code
...
This commit was SVN r18182.
2008-04-16 20:33:53 +00:00
Ralph Castain
3413191e52
Fix singleton and singleton comm_spawn
...
This commit was SVN r18177.
2008-04-16 14:38:10 +00:00
Ralph Castain
7b91f8baff
Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory.
...
Fix the ompi-server -h cmd line option so it actually tells you something!
Add two new testing codes to the orte/test/mpi area: accept and connect.
This commit was SVN r18176.
2008-04-16 14:27:42 +00:00
Shiqing Fan
aa616b9530
Check whether the debugger is running and whether the convertor is valid.
...
Add a loop to skip the DT_LOOP element.
This commit was SVN r18175.
2008-04-16 13:58:58 +00:00
Shiqing Fan
49fbc4e795
These functions should always have a return value.
...
This commit was SVN r18174.
2008-04-16 13:54:15 +00:00
Shiqing Fan
1c4c7e0f2f
Add memchecker support for osc rdma communication.
...
This commit was SVN r18173.
2008-04-16 13:29:55 +00:00
Shiqing Fan
79da2fdd2c
Use the new memchecker convertor function.
...
Remove some unnecessary memchecker calls.
This commit was SVN r18172.
2008-04-16 13:24:35 +00:00
Adrian Knoth
d34dfbe12c
fixed misleading comment.
...
This commit was SVN r18170.
2008-04-16 11:26:15 +00:00
Adrian Knoth
20473bfda2
on incoming connections, compare with every possible source address.
...
Rationale (taken from the code):
/* This is PITA. We never know which source address an
* incoming/outgoing packet will have, so even with
* btl_tcp_if_include/exclude on the remote end, we
* might get a different source address.
*
* If this address isn't included in btl_proc->proc_addrs,
* we would erroneously drop the connection
*/
merge -r18165:18167 to the trunk.
This commit was SVN r18169.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r18165
r18167
2008-04-16 11:24:09 +00:00
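The check described in r18169 (accept an incoming connection if its source address matches any address known for the peer, not just one) can be sketched as follows. This is a minimal illustration, not the btl_tcp code: the function name `accept_connection` and the list-of-strings representation of the peer's addresses are assumptions made for the example.

```python
import ipaddress


def accept_connection(source_addr, peer_addrs):
    """Return True if source_addr equals any address published by the peer.

    Hypothetical helper: peer_addrs stands in for the peer's address list
    (btl_proc->proc_addrs in the commit message).
    """
    src = ipaddress.ip_address(source_addr)
    # Compare against every known address instead of a single expected one,
    # because the kernel may pick any local interface as the source.
    return any(src == ipaddress.ip_address(addr) for addr in peer_addrs)
```

Parsing with `ipaddress` rather than comparing strings means different spellings of the same IPv6 address still match.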
Adrian Knoth
e981a259bb
btl_tcp_disable_family=4 and btl_tcp_disable_family=6 are mutually
...
exclusive, so this should result in "unreachable" when set differently
between peers.
This commit was SVN r18168.
2008-04-16 10:14:58 +00:00
Adrian Knoth
84e4013530
Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set.
...
This commit was SVN r18164.
2008-04-16 09:31:15 +00:00
Adrian Knoth
0ddfff4ffe
Added new oob-tcp parameter oob_tcp_disable_family.
...
Like btl_tcp_disable_family, this parameter more or less disables
a whole address family. Though the sockets are still created, the
corresponding information isn't added to the connection strings.
Likewise, we don't try to connect to addresses matching the disabled
address family.
This is particularly important for multidomain clusters, where IPv4 is
often filtered (firewalled), sometimes by simply dropping the packets
instead of rejecting them (thus causing a connection timeout instead of
a quick "no route to host").
This commit was SVN r18163.
2008-04-16 09:22:00 +00:00
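The effect of oob_tcp_disable_family in r18163 (addresses of the disabled family are left out of the connection strings and are never dialed) can be sketched like this. The values 4 and 6 come from the log; the function names, and the use of 0 to mean "nothing disabled", are assumptions for the example and not the actual parameter semantics.

```python
import ipaddress


def usable_addresses(addrs, disable_family=0):
    """Drop every address whose IP version equals disable_family (4 or 6).

    Hypothetical helper; 0 is assumed here to mean "no family disabled".
    """
    return [a for a in addrs if ipaddress.ip_address(a).version != disable_family]


def reachable(local_addrs, remote_addrs, local_disable=0, remote_disable=0):
    """Two peers can connect only if some address family survives on both sides."""
    local = {ipaddress.ip_address(a).version
             for a in usable_addresses(local_addrs, local_disable)}
    remote = {ipaddress.ip_address(a).version
              for a in usable_addresses(remote_addrs, remote_disable)}
    return bool(local & remote)
```

With one peer disabling family 4 and the other disabling family 6, no common family remains, which matches the "unreachable" outcome described in r18168.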
Adrian Knoth
75c54616c7
renamed opal_sockaddr2str to opal_net_get_hostname for WANT_PEER_DUMP=1
...
This commit was SVN r18154.
2008-04-15 19:23:47 +00:00
Tim Mattox
55b2546026
Update the NEWS file for a 1.2.7 change.
...
This commit was SVN r18153.
2008-04-15 17:31:57 +00:00
Jeff Squyres
72af302360
Remove unused variable.
...
This commit was SVN r18151.
2008-04-15 14:58:32 +00:00
Ralph Castain
a4ea756a76
Ensure the node loop cntr gets incremented if the daemon already exists
...
This commit was SVN r18150.
2008-04-15 14:20:03 +00:00
Ralph Castain
73e4cfe58a
Add platform files for LANL's RRZ cluster. Update LANL platform files to not build libnbc, vt to save time
...
This commit was SVN r18146.
2008-04-15 02:19:54 +00:00
Ralph Castain
35c260a14f
Fix the plm modules to accommodate the new remote_spawn entry - set that entry to NULL for all but rsh as only that module supports it at this time
...
This commit was SVN r18145.
2008-04-14 19:36:13 +00:00
Ralph Castain
84156c422f
Egad! Typo snuck in there...nasty vi!
...
This commit was SVN r18144.
2008-04-14 18:29:11 +00:00
Ralph Castain
7c7304466c
Add a binomial tree-based launch to ssh, turned "on" only when the plm_rsh_tree_spawned mca param is set to a non-zero value. This probably isn't a very optimized capability, but it does execute a tree-based launch that may scale better than linear at high node counts.
...
Add the daemon map capability to the ODLS to create and save a map of daemon vpid vs nodename from the launch message.
Cleanup a few places in the base plm launch support where we didn't adequately protect rml recv's from potentially executing sends.
This commit was SVN r18143.
2008-04-14 18:26:08 +00:00
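For readers unfamiliar with the binomial tree used by the launch in r18143, one common way to compute it is sketched below. This is an illustration under assumptions, not the plm_rsh code: it roots the tree at rank 0 and derives a rank's parent by clearing its highest set bit, and the function names are invented for the example.

```python
def binomial_parent(rank):
    """Parent of rank in a binomial tree rooted at 0 (None for the root)."""
    if rank == 0:
        return None
    # Clear the highest set bit of the rank.
    return rank - (1 << (rank.bit_length() - 1))


def binomial_children(rank, num_procs):
    """Children of rank: rank + 2^k for every power of two larger than rank."""
    bit = 1
    while bit <= rank:
        bit <<= 1
    children = []
    while rank + bit < num_procs:
        children.append(rank + bit)
        bit <<= 1
    return children
```

With 8 ranks, rank 0 launches 1, 2 and 4, rank 1 launches 3 and 5, and so on, so the depth grows with log2 of the node count instead of linearly.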
Aurelien Bouteiller
0f311ed824
Make sure the function returns NULL when no elan adapter is available instead of a random value.
...
This commit was SVN r18136.
2008-04-11 21:03:01 +00:00
Aurelien Bouteiller
20592cbcbf
Fixes a warning about mallocing 0 bytes when no elan adapter is available.
...
This commit was SVN r18135.
2008-04-11 20:59:12 +00:00
Aurelien Bouteiller
921a6ce3d4
Processes with different jobids can now connect/accept to each other.
...
This commit was SVN r18134.
2008-04-11 15:40:59 +00:00
Rich Graham
249445d61f
added reduce-scatter followed by gather to root.
...
This commit was SVN r18133.
2008-04-11 13:49:08 +00:00
Rich Graham
a6bdbfab97
implement allreduce as reduce-scatter, followed by an allgather.
...
This commit was SVN r18132.
2008-04-11 04:06:29 +00:00
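The decomposition named in r18132 (allreduce as a reduce-scatter followed by an allgather) can be simulated in a few lines. This sketch models each rank's buffer as a Python list, uses summation as the reduction, and assumes the buffer length divides evenly by the number of ranks; it says nothing about how the shared memory component actually moves data.

```python
def reduce_scatter(buffers):
    """Rank r ends up with the element-wise sum of chunk r across all ranks."""
    nprocs = len(buffers)
    chunk = len(buffers[0]) // nprocs  # assumes an even split
    return [
        [sum(buf[i] for buf in buffers) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(nprocs)
    ]


def allgather(chunks):
    """Every rank receives the concatenation of all ranks' chunks."""
    full = [value for chunk in chunks for value in chunk]
    return [list(full) for _ in chunks]


def allreduce(buffers):
    """Allreduce built from the two steps above."""
    return allgather(reduce_scatter(buffers))
```

Replacing `allgather` with a gather to a single rank gives the reduce variant described in r18133.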
Jon Mason
08ead87604
Potential double free of locks
...
mca_btl_openib_endpoint_post_rr_nolock is freeing the endpoint lock on
the error case, but most/all of the functions calling this free the lock
regardless of its error case, resulting in a double free of the
lock.
This commit was SVN r18131.
2008-04-10 21:15:01 +00:00
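The convention behind the r18131 fix, reading "free" as "release", is that a "_nolock" helper never touches the lock and the caller releases it exactly once on every path. The sketch below shows that pattern with a Python `threading.Lock`; the function names and the `fail` flag are hypothetical and do not mirror the openib endpoint code.

```python
import threading


def post_receive_nolock(queue, item, fail=False):
    """Hypothetical helper: the caller holds the lock; never release it here."""
    if fail:
        # Error path: report the failure and leave the lock alone.
        return False
    queue.append(item)
    return True


def post_receive(lock, queue, item, fail=False):
    """Caller side: acquire once, release once, whatever the helper returns."""
    lock.acquire()
    try:
        return post_receive_nolock(queue, item, fail)
    finally:
        lock.release()
```

If the helper also released the lock on its error path, the caller's release would be the second one, which in Python raises RuntimeError on an unlocked `Lock`.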
Ralph Castain
e050f37578
Cleanup a few warnings about initializing variables.
...
Remove an obsolete data value.
This commit was SVN r18129.
2008-04-10 19:15:16 +00:00
Rich Graham
70f3aab5f2
remove some code that is not needed.
...
This commit was SVN r18128.
2008-04-10 17:32:04 +00:00
Rich Graham
5c7db1e315
remove 2 race conditions in the buffer recycling logic.
...
This commit was SVN r18127.
2008-04-10 17:20:52 +00:00
Ralph Castain
851279fc9f
Consolidate the daemon wireup message into the launch message. The daemons don't need their contact info prior to the launch message anyway. This not only eliminates a job-wide communication from the startup procedure, but it also resolves a race condition reported when operating across highly distributed (i.e., cross-country) networks. In such scenarios, it proved possible for a daemon to receive its launch message -before- it had received the contact info message, even though the latter had been sent first!
...
This eliminates that problem...
This commit was SVN r18126.
2008-04-10 15:35:11 +00:00
Ralph Castain
4b798cf29a
Massage these platform files a little...
...
This commit was SVN r18125.
2008-04-10 15:32:41 +00:00
Edgar Gabriel
4964434205
reverting commit 18122, since the commit was executed accidentally in the
...
wrong directory. The UH copyrights do belong in this file (i.e. because of
the fix which is in the 1.2 branch, the UH copyright notes are in the header
there already), but I want to have the proper log for that.
This commit was SVN r18124.
2008-04-10 15:09:31 +00:00
Edgar Gabriel
5989fa570c
Sorry, previous commit was in the wrong directory. This is the real fix (have
...
to undo 18122).
The verification of recvcount==0 and rank==root was breaking
inter-communicator scatter, since the root (root==MPI_ROOT) might very well
have recvcount=0. The same fix has been applied to gather.c, just the other way
round.
Fixes the bug reported on the mailing list by Martin Audet. If there is a
1.2.7, it might be worthwhile porting this fix over.
Please note that while the test works now for basic and for inter, we get a
0-byte malloc warning from the inter module, which we still have to fix in a
separate patch.
This commit was SVN r18123.
2008-04-10 15:03:14 +00:00
Edgar Gabriel
f87830767a
the verification of recvcount==0 and rank==root was breaking
...
inter-communicator scatter, since the root (root==MPI_ROOT) might very well
have recvcount=0. The same fix has been applied to gather.c, just the other way
round.
Fixes the bug reported on the mailing list by Martin Audet. If there is a
1.2.7, it might be worthwhile porting this fix over.
Please note that while the test works now for basic and for inter, we get a
0-byte malloc warning from the inter module, which we still have to fix in a
separate patch.
This commit was SVN r18122.
2008-04-10 14:58:51 +00:00
Ralph Castain
57e3e86cda
Use the proper exit code for mpirun to indicate an error when something goes wrong during launch (in scenarios where the procs don't report the problem directly themselves)
...
This commit was SVN r18121.
2008-04-10 09:15:08 +00:00
Ralph Castain
e7d0dae89d
Ensure we update the daemon collective trees if num_procs changes, but only if it changes
...
This commit was SVN r18120.
2008-04-10 03:44:18 +00:00
Ralph Castain
22343e6e0b
Given total lack of interest/support from the folks behind these environments, and the fact that we can now scale so well with our own daemons, it seems unlikely that we will be able to pursue direct and/or standalone launch in these environments. If that situation ever changes, it is easy enough to revive the effort since little had really been done to-date.
...
Meantime, no reason to continue dragging these around.
This commit was SVN r18119.
2008-04-10 02:54:13 +00:00
Ralph Castain
dc2f88b9f0
Now that we have the daemon collectives, the unity routed module no longer needs the "hack" we inserted a week ago to tell the daemons how to talk directly to all the application procs. The modex and barrier messages flow cleanly across the daemons and are "dropped" into the procs where required.
...
Add some insurance to make certain that the daemons' number of procs only gets updated when it absolutely is intended.
This commit was SVN r18118.
2008-04-10 02:45:42 +00:00
Ralph Castain
0b3122ee2f
Update the cnos module - should (hopefully) compile and work...
...
This commit was SVN r18117.
2008-04-09 22:33:00 +00:00
Ralph Castain
86b4ae5970
Remove a generated file from the repository - shouldn't have been there
...
This commit was SVN r18116.
2008-04-09 22:13:51 +00:00
Ralph Castain
3a0d09300b
Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations.
...
Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study.
This commit was SVN r18115.
2008-04-09 22:10:53 +00:00
Ralph Castain
7cb1e72f76
Okay, let's really get those libz references out of there!
...
This commit was SVN r18114.
2008-04-09 22:05:09 +00:00
Ralph Castain
95d7e177c6
Not really a test, but a useful tool for testing computation of binomial trees
...
This commit was SVN r18113.
2008-04-09 21:58:42 +00:00
Ralph Castain
3120428f0f
Update several platform files to remove the libz dependency, add a couple for the Mac
...
This commit was SVN r18112.
2008-04-09 21:57:59 +00:00
Rich Graham
c6783549ef
getting old
...
This commit was SVN r18110.
2008-04-09 16:55:16 +00:00