Commit graph

11413 commits

Author SHA1 Message Date
Ralph Castain
09b6758f8c Pass the prefix dir to the remote orted when doing tree-based spawns
This commit was SVN r18280.
2008-04-24 18:38:24 +00:00
Josh Hursey
2c736873bb Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors.
The problem was caused by a bad ordering between the restart of the ORTE-level TCP connections (in the OOB, the out-of-band communication layer) and the Open MPI-level TCP connections (BTLs). Before this commit, ORTE would shut down and restart the OOB completely before the OMPI level restarted its TCP connections. As a result, a socket descriptor used by the OMPI level at checkpoint time could be assigned to the ORTE level on restart. The OMPI level had no knowledge that the socket descriptor it was previously using had been recycled, so it closed it on restart. This broke the ORTE level, because its newly created socket descriptor was closed without its knowledge.

The fix is to have the OMPI level shut down its TCP connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.

Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.

 * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level.
 * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
 * Update ft_event functions in PML and BML to handle the new restart state.
 * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.
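
The ordering problem described in this commit can be illustrated with a toy model. The sketch below is not Open MPI code: `FdTable` and `restart` are hypothetical names, the descriptor numbers are made up, and the only real-world behavior assumed is that a newly opened descriptor gets the lowest free number, as on POSIX systems.

```python
class FdTable:
    """Toy descriptor table: open() returns the lowest free number."""

    def __init__(self):
        self.owner = {}  # fd -> name of the layer that currently owns it

    def open(self, layer):
        fd = 0
        while fd in self.owner:
            fd += 1
        self.owner[fd] = layer
        return fd

    def close(self, fd):
        # Closing a number nobody owns is harmless in this model.
        self.owner.pop(fd, None)


def restart(ompi_closes_first):
    """Return True if ORTE still owns all of its descriptors after restart."""
    table = FdTable()   # after restart, none of the old sockets is open
    stale_ompi_fd = 1   # number the OMPI level remembers from the checkpoint

    if ompi_closes_first:
        # Fixed order: OMPI drops its stale descriptor before ORTE restarts.
        table.close(stale_ompi_fd)

    # ORTE restarts its OOB sockets and is handed the lowest free numbers.
    orte_fds = [table.open("orte"), table.open("orte")]

    if not ompi_closes_first:
        # Old order: OMPI closes "its" descriptor, which ORTE now owns.
        table.close(stale_ompi_fd)

    table.open("ompi")  # OMPI reopens its own connection
    return all(table.owner.get(fd) == "orte" for fd in orte_fds)
```

In this model `restart(ompi_closes_first=False)` leaves ORTE with a descriptor it no longer owns, while `restart(ompi_closes_first=True)` keeps both ORTE descriptors intact.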

This commit was SVN r18276.
2008-04-24 17:54:22 +00:00
George Bosilca
3ccac4f803 Oops ...
This commit was SVN r18275.
2008-04-24 15:54:52 +00:00
George Bosilca
73c9de3af9 Bark if we got a wrong sequence number. Here wrong means that the
seq number is smaller than what we expect.

This commit was SVN r18274.
2008-04-24 15:48:43 +00:00
Tim Mattox
46c6aa4ed4 Resync the trunk NEWS file with the 1.2.7 changes.
This commit was SVN r18268.
2008-04-23 18:32:19 +00:00
Ralph Castain
eece9f88f0 Fix a bug in the way we computed local_rank. This needs to be the local_rank -among my job peers- on a node.
We were mistakenly computing the local_rank across -all- jobs with procs on that node. While the two definitions are equivalent for an initial launch, comm_spawn'd procs would get the wrong local_rank. In particular, there would not be a local_rank=0 proc in the comm_spawn'd job on any node that was shared with the initial job.
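
The difference between the two definitions can be shown with a small sketch. This is an illustration only, not the ORTE implementation: `local_ranks`, the `(job, rank, node)` tuples, and the job and node names are hypothetical.

```python
from collections import defaultdict


def local_ranks(procs, per_job=True):
    """Assign a local rank to each process.

    procs is a list of (job, rank, node) tuples in launch order.
    With per_job=True the counter is kept per (job, node), which is the
    corrected definition; with per_job=False it is kept per node across
    all jobs, which is the mistaken one.
    """
    counters = defaultdict(int)
    result = {}
    for job, rank, node in procs:
        key = (job, node) if per_job else node
        result[(job, rank)] = counters[key]
        counters[key] += 1
    return result


# An initial job with two procs on nodeA, then a comm_spawn'd job that
# shares nodeA and also uses nodeB.
procs = [
    ("job1", 0, "nodeA"),
    ("job1", 1, "nodeA"),
    ("job2", 0, "nodeA"),
    ("job2", 1, "nodeB"),
]
```

With the per-job definition, rank 0 of `job2` gets local rank 0 on `nodeA`; counting across all jobs it would get local rank 2, so the spawned job would have no local_rank=0 proc on the shared node.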

This commit was SVN r18263.
2008-04-23 17:42:59 +00:00
Rich Graham
4d1ae7b05f accidentally made a change in the wrong place.
This commit was SVN r18262.
2008-04-23 17:32:05 +00:00
Rich Graham
293dd6ad4e add myself to list of people building this module.
This commit was SVN r18261.
2008-04-23 17:25:36 +00:00
Rich Graham
7658cc79e4 Pass in the correct module to the reduction call.
This commit was SVN r18260.
2008-04-23 17:23:30 +00:00
Ralph Castain
f56f06a7ff Do not trust the RM's names - apparently, RR has trained it to lie! Default to using the name we got from gethostname as it is the only one we can trust.
This commit was SVN r18259.
2008-04-23 17:00:35 +00:00
Ralph Castain
8001e4e99c See if this will fix a race condition showing up in comm_spawn MTT testing
This commit was SVN r18257.
2008-04-23 15:43:44 +00:00
Adrian Knoth
c53d3c3c22 reverted r18169,r18170 due to connection reset by peer on odin/sif
This commit was SVN r18255.

The following SVN revision numbers were found above:
  r18169 --> open-mpi/ompi@20473bfda2
  r18170 --> open-mpi/ompi@d34dfbe12c
2008-04-23 15:26:15 +00:00
Ralph Castain
5311b13b60 Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list
Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system.

This commit was SVN r18252.
2008-04-23 14:52:09 +00:00
Lenny Verkhovsky
456ce6c4da A few cleanups in the Rank_File component + fixed opal_paffinity_slot_list without rankfile
This commit was SVN r18249.
2008-04-23 13:34:05 +00:00
Shiqing Fan
eb5f5d77cc If it's not the HNP, release the cluster object first and return.
This commit was SVN r18247.
2008-04-23 13:21:32 +00:00
Josh Hursey
750ce0152c After a bit of testing this morning it seems that the tree component is able to work correctly with the checkpoint/restart functionality. So enable this component when C/R is enabled.
This commit was SVN r18246.
2008-04-23 13:01:23 +00:00
Shiqing Fan
4a9787979e When valgrind is not available or it is deselected (--without-valgrind, --with-valgrind=no), don't compile this component; continue without aborting.
This commit was SVN r18243.
2008-04-23 11:50:42 +00:00
Josh Hursey
cc83d41ad9 Merge in tmp/jjh-scratch
{{{
 svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch .
}}}

Contains:
 * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart.
 * Clean up some small memory leaks from using mca_base_param_env_var() - Thanks Jeff
 * Clean up ORTE FT tool compilation in non-FT builds - Thanks Tim P.
 * Clean up mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry
 * Some other sundry cleanup items all dealing with C/R functionality in the trunk.

This commit was SVN r18241.
2008-04-23 00:17:12 +00:00
Tim Mattox
0215474cb8 Fix two bugs in coll_sm_module.c from bit-rot:
Fixed a selection bug, and removed a bogus "free(proc)" call
which ultimately caused MPI_Finalize to crash.

This commit was SVN r18235.
2008-04-22 18:41:21 +00:00
Jeff Squyres
c40740947f Fix minor spelling error.
This commit was SVN r18229.
2008-04-22 13:11:50 +00:00
Galen Shipman
27c425b304 make portals level ACKs optional (require ACK by default)
This commit was SVN r18228.
2008-04-21 22:22:18 +00:00
Ralph Castain
c3ddf66445 Move the display-allocation code to where it is always seen
This commit was SVN r18227.
2008-04-21 20:28:59 +00:00
Ralph Castain
16c9100633 Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation.
This commit was SVN r18216.
2008-04-20 02:25:45 +00:00
Rich Graham
df35223603 add selection logic for barrier and reduce.
This commit was SVN r18215.
2008-04-19 22:40:04 +00:00
Rich Graham
bee8b42f29 remove debug code that would not let people run.
Add infrastructure for blocking-barrier.

This commit was SVN r18214.
2008-04-19 01:34:04 +00:00
Josh Hursey
56a61bfacf switch the name of orterun to mpirun to make things more clear.
This commit was SVN r18208.
2008-04-18 12:59:23 +00:00
Galen Shipman
92e3b8671f nasty memory bug...
This commit was SVN r18207.
2008-04-18 03:01:53 +00:00
Jeff Squyres
db2695ccab Make the symbols be visible.
This commit was SVN r18201.
2008-04-18 00:26:17 +00:00
Jeff Squyres
a198971fa2 Temporarily disable Solaris ports support in libevent. Refs trac:1273
This commit was SVN r18199.

The following Trac tickets were found above:
  Ticket 1273 --> https://svn.open-mpi.org/trac/ompi/ticket/1273
2008-04-17 23:14:43 +00:00
Ralph Castain
fa082cafa9 Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex.
Note: this is an early preliminary step in the movement of portions of the datatype engine to the opal layer.

This commit was SVN r18198.
2008-04-17 20:43:56 +00:00
George Bosilca
01148b77dc Generate the help message for the available event ops. Now the list only
contains the ones that are compiled on the current ompi.

This commit was SVN r18196.
2008-04-17 18:16:54 +00:00
Ralph Castain
07f0a71faa Cleanup the show_help entries on the seq mapper
This commit was SVN r18191.
2008-04-17 14:43:15 +00:00
Ralph Castain
e7487ad533 Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile.
Restore the "do-not-launch" functionality so users can test a mapping without launching it.

Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests.

Add a function to hostfile to generate an ordered list of host names from a hostfile
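
A minimal sketch of the sequential mapping idea, assuming a simplified hostfile in which each non-comment line starts with a host name. The helper names `ordered_hosts` and `seq_map` are hypothetical and do not correspond to the functions added by this commit.

```python
def ordered_hosts(hostfile_text):
    """Return host names in file order, keeping repeats.

    Assumes '#' starts a comment and the host name is the first token on
    the line; anything after it (e.g. slot counts) is ignored here.
    """
    hosts = []
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])
    return hosts


def seq_map(hostfile_text):
    """Map rank i to the i-th host listed in the hostfile."""
    return dict(enumerate(ordered_hosts(hostfile_text)))


example = "node1\nnode2 slots=2\n# spare nodes below\nnode1\n"
mapping = seq_map(example)
```

Because the mapping is computed without contacting any host, a sketch like this also shows why a "do-not-launch" style dry run is useful for checking a hostfile.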

This commit was SVN r18190.
2008-04-17 13:50:59 +00:00
Tim Prins
eb94fa48ce the port name is only relevant at the root, so only look at it there.
This commit was SVN r18188.
2008-04-17 12:37:10 +00:00
Tim Prins
3582e11200 cleanup some warnings on 32 bit systems
This commit was SVN r18187.
2008-04-17 12:25:05 +00:00
Tim Prins
b2acb51d04 make comm_join work again. Allocate memory to the correct pointer.
This commit was SVN r18186.
2008-04-17 11:56:53 +00:00
Rich Graham
6c77fa4921 add a blocking shared memory algorithm.
This commit was SVN r18185.
2008-04-16 22:10:23 +00:00
Ralph Castain
eb27e4f23d Move the reissuing of the daemon recv to occur after the message actually gets processed. This ensures that we don't get multiple messages trying to be processed at the same time.
Add one more debug output to see where messages are heading

This commit was SVN r18183.
2008-04-16 20:41:00 +00:00
Ralph Castain
66e532669a Remove some dead code
This commit was SVN r18182.
2008-04-16 20:33:53 +00:00
Ralph Castain
3413191e52 Fix singleton and singleton comm_spawn
This commit was SVN r18177.
2008-04-16 14:38:10 +00:00
Ralph Castain
7b91f8baff Clean up and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory.
Fix the ompi-server -h cmd line option so it actually tells you something!

Add two new testing codes to the orte/test/mpi area: accept and connect.

This commit was SVN r18176.
2008-04-16 14:27:42 +00:00
Shiqing Fan
aa616b9530 Check whether the debugger is running and whether the convertor is valid.
Add a loop to skip the DT_LOOP element. 

This commit was SVN r18175.
2008-04-16 13:58:58 +00:00
Shiqing Fan
49fbc4e795 These functions should always have a return value.
This commit was SVN r18174.
2008-04-16 13:54:15 +00:00
Shiqing Fan
1c4c7e0f2f Add memchecker support for osc rdma communication.
This commit was SVN r18173.
2008-04-16 13:29:55 +00:00
Shiqing Fan
79da2fdd2c Use the new memchecker convertor function.
Remove some unnecessary memchecker calls.

This commit was SVN r18172.
2008-04-16 13:24:35 +00:00
Adrian Knoth
d34dfbe12c fixed misleading comment.
This commit was SVN r18170.
2008-04-16 11:26:15 +00:00
Adrian Knoth
20473bfda2 on incoming connections, compare with every possible source address.
Rationale (taken from the code):

    /* This is PITA. We never know which source address an 
    * incoming/outgoing packet will have, so even with 
    * btl_tcp_if_include/exclude on the remote end, we 
    * might get a different source address. 
    * 
    * If this address isn't included in btl_proc->proc_addrs, 
    * we would erroneously drop the connection 
    */ 
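
The check described in the comment above can be sketched as follows. This is an illustration in Python rather than the BTL's C code; `accept_peer` is a hypothetical name, `proc_addrs` stands in for the list of addresses the peer has published, and the addresses are example values.

```python
import ipaddress


def accept_peer(source, proc_addrs):
    """Accept the connection if the source matches any published address.

    Comparing against every address, not just the one we expected, avoids
    dropping a valid peer that happens to send from another interface.
    """
    src = ipaddress.ip_address(source)
    return any(src == ipaddress.ip_address(addr) for addr in proc_addrs)


published = ["192.168.1.5", "10.0.0.2", "2001:db8::1"]
```

Parsing the strings with `ipaddress` means that different spellings of the same IPv6 address compare equal, which a plain string comparison would miss.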

merge -r18165:18167 to the trunk.

This commit was SVN r18169.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r18165
  r18167
2008-04-16 11:24:09 +00:00
Adrian Knoth
e981a259bb btl_tcp_disable_family=4 and btl_tcp_disable_family=6 are mutually
exclusive, so this should result in "unreachable" when set differently
between peers.
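
Why differing settings lead to "unreachable" can be seen from a small set computation. The sketch assumes that a value of 4 or 6 disables that IP version; `common_families` is a hypothetical helper, and `None` is used here only as a stand-in for "nothing disabled".

```python
def common_families(disable_a, disable_b):
    """Return the IP versions that both peers can still use."""
    return {4, 6} - {disable_a} - {disable_b}


def reachable(disable_a, disable_b):
    """Two peers can talk only if at least one family is left on both sides."""
    return bool(common_families(disable_a, disable_b))
```

If one peer disables IPv4 and the other disables IPv6, the intersection is empty and no connection can be established; if both disable the same family, the other family remains usable.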

This commit was SVN r18168.
2008-04-16 10:14:58 +00:00
Adrian Knoth
84e4013530 Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set.
This commit was SVN r18164.
2008-04-16 09:31:15 +00:00
Adrian Knoth
0ddfff4ffe Added new oob-tcp parameter oob_tcp_disable_family.
Like btl_tcp_disable_family, this parameter more or less disables
a whole address family. Though the sockets are still created, the
corresponding information isn't added to the connection strings.

Likewise, we don't try to connect to addresses matching the disabled
address family.

This is particularly important for multidomain clusters, where IPv4 is
often filtered (firewalled), sometimes by simply dropping the packets
instead of rejecting them (thus causing a connection timeout instead of
a quick "no route to host").
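
The effect on the advertised contact information can be sketched like this. It is a simplified model, not the OOB code: `advertised_addresses` is a hypothetical helper, the addresses are example values, and the assumption is that a setting of 4 or 6 names the IP version to leave out.

```python
import ipaddress


def advertised_addresses(addresses, disabled_family=None):
    """Drop addresses of the disabled family from the contact list.

    The sockets themselves are not modeled here; only the information
    that would be handed to peers is filtered.
    """
    return [
        addr
        for addr in addresses
        if ipaddress.ip_address(addr).version != disabled_family
    ]


local = ["192.168.0.1", "2001:db8::1"]
```

With IPv4 disabled, only the IPv6 address is advertised, so peers never attempt the IPv4 path and cannot run into a silent firewall timeout on it.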

This commit was SVN r18163.
2008-04-16 09:22:00 +00:00