
1624 Commits

Author SHA1 Message Date
Ralph Castain
4c2c6c9bd8 Ensure the pack/unpacks match for tree-spawn
This commit was SVN r18282.
2008-04-24 18:53:08 +00:00
Ralph Castain
09b6758f8c Pass the prefix dir to the remote orted when doing tree-based spawns
This commit was SVN r18280.
2008-04-24 18:38:24 +00:00
Josh Hursey
2c736873bb Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors.
The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shut down and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using had been recycled, so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge.

The fix is to have the OMPI level shut down its tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.

Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.

 * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like), which is useful when you want to close, but not reopen, a component set for fear of interfering with a lower level.
 * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
 * Update ft_event functions in PML and BML to handle the new restart state.
 * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.
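
The descriptor recycling described above can be reproduced outside Open MPI. The following is a minimal sketch, not the actual OMPI/ORTE code: it uses pipes rather than TCP sockets, the function name `stale_close_demo` is hypothetical, and it relies on the POSIX rule that a new descriptor receives the lowest free number.

```python
import os


def stale_close_demo():
    """Show how closing a cached, recycled descriptor breaks its new owner."""
    # The "upper" layer opens a channel and caches the descriptor number.
    r, w = os.pipe()
    stale_fd = r

    # On restart the channel is torn down, freeing the numbers.
    os.close(r)
    os.close(w)

    # The "lower" layer reconnects first and is handed the lowest free
    # numbers, which on POSIX systems are the ones just released.
    new_r, new_w = os.pipe()
    recycled = new_r == stale_fd

    error = None
    if recycled:
        # The upper layer now closes what it believes is its old descriptor,
        # but this actually closes the lower layer's read end.
        os.close(stale_fd)
        try:
            os.write(new_w, b"ping")  # no reader is left on this pipe
        except OSError as exc:
            error = type(exc).__name__
    else:
        os.close(new_r)
    os.close(new_w)
    return recycled, error
```

When the number is recycled, the write fails with a broken-pipe error, which mirrors the SIGPIPE seen on restart. Closing the upper layer's descriptors before the lower layer reopens its own avoids the collision.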

This commit was SVN r18276.
2008-04-24 17:54:22 +00:00
Ralph Castain
eece9f88f0 Fix a bug in the way we computed local_rank. This needs to be the local_rank -among my job peers- on a node.
We were mistakenly computing the local_rank across -all- jobs with procs on that node. While the two definitions are equivalent for an initial launch, comm_spawn'd procs would get the wrong local_rank. In particular, there would not be a local_rank=0 proc in the comm_spawn'd job on any node that was shared with the initial job.
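
A minimal sketch of the corrected definition, assuming each process is described by a hypothetical `(job, vpid, node)` tuple; this is an illustration, not the ORTE data structure or code:

```python
from collections import defaultdict


def compute_local_ranks(procs):
    """Return {(job, vpid): local_rank}, counting only same-job peers per node."""
    next_rank = defaultdict(int)  # keyed by (job, node), not by node alone
    local_ranks = {}
    for job, vpid, node in sorted(procs):
        local_ranks[(job, vpid)] = next_rank[(job, node)]
        next_rank[(job, node)] += 1
    return local_ranks
```

Keying the counter by `(job, node)` means a comm_spawn'd job that shares a node with the initial job still gets a process with local_rank=0 on that node.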

This commit was SVN r18263.
2008-04-23 17:42:59 +00:00
Ralph Castain
f56f06a7ff Do not trust the RM's names - apparently, RR has trained it to lie! Default to using the name we got from gethostname as it is the only one we can trust.
This commit was SVN r18259.
2008-04-23 17:00:35 +00:00
Ralph Castain
8001e4e99c See if this will fix a race condition showing up in comm_spawn MTT testing
This commit was SVN r18257.
2008-04-23 15:43:44 +00:00
Ralph Castain
5311b13b60 Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list
Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system.

This commit was SVN r18252.
2008-04-23 14:52:09 +00:00
Lenny Verkhovsky
456ce6c4da Few cleanups in Rank_File component + fixed opal_paffinity_slot_list without rankfile
This commit was SVN r18249.
2008-04-23 13:34:05 +00:00
Shiqing Fan
eb5f5d77cc If it's not the HNP, release the cluster object first and return.
This commit was SVN r18247.
2008-04-23 13:21:32 +00:00
Josh Hursey
750ce0152c After a bit of testing this morning it seems that the tree component is able to work correctly with the checkpoint/restart functionality. So enable this component when C/R is enabled.
This commit was SVN r18246.
2008-04-23 13:01:23 +00:00
Josh Hursey
cc83d41ad9 Merge in tmp/jjh-scratch
{{{
 svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch .
}}}

Contains:
 * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart.
 * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff
 * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P.
 * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry
 * Some other sundry cleanup items all dealing with C/R functionality in the trunk.

This commit was SVN r18241.
2008-04-23 00:17:12 +00:00
Ralph Castain
c3ddf66445 Move the display-allocation code to where it is always seen
This commit was SVN r18227.
2008-04-21 20:28:59 +00:00
Ralph Castain
16c9100633 Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation.
This commit was SVN r18216.
2008-04-20 02:25:45 +00:00
Josh Hursey
56a61bfacf switch the name of orterun to mpirun to make things more clear.
This commit was SVN r18208.
2008-04-18 12:59:23 +00:00
Ralph Castain
07f0a71faa Cleanup the show_help entries on the seq mapper
This commit was SVN r18191.
2008-04-17 14:43:15 +00:00
Ralph Castain
e7487ad533 Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile.
Restore the "do-not-launch" functionality so users can test a mapping without launching it.

Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests.

Add a function to hostfile to generate an ordered list of host names from a hostfile
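
The idea of a sequential mapping can be sketched as follows. This is an illustration only, not the rmaps seq module: the function names are hypothetical, and the hostfile format is assumed to be one host per line with optional `#` comments.

```python
def read_ordered_hosts(hostfile_text):
    """Return host names in the order they appear, keeping repeats."""
    hosts = []
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            hosts.append(line.split()[0])  # first token is the host name
    return hosts


def map_sequentially(hosts):
    """Map rank i to the i-th host entry."""
    return {rank: host for rank, host in enumerate(hosts)}
```

A host listed twice receives two ranks, so the mapping follows the file order exactly and can be inspected without launching anything.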

This commit was SVN r18190.
2008-04-17 13:50:59 +00:00
Ralph Castain
eb27e4f23d Move the reissuing of the daemon recv to occur after the message actually gets processed. This ensures that we don't get multiple messages trying to be processed at the same time.
Add one more debug output to see where messages are heading

This commit was SVN r18183.
2008-04-16 20:41:00 +00:00
Ralph Castain
66e532669a Remove some dead code
This commit was SVN r18182.
2008-04-16 20:33:53 +00:00
Ralph Castain
3413191e52 Fix singleton and singleton comm_spawn
This commit was SVN r18177.
2008-04-16 14:38:10 +00:00
Ralph Castain
7b91f8baff Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory.
Fix the ompi-server -h cmd line option so it actually tells you something!

Add two new testing codes to the orte/test/mpi area: accept and connect.

This commit was SVN r18176.
2008-04-16 14:27:42 +00:00
Adrian Knoth
84e4013530 Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set.
This commit was SVN r18164.
2008-04-16 09:31:15 +00:00
Adrian Knoth
0ddfff4ffe Added new oob-tcp parameter oob_tcp_disable_family.
Like btl_tcp_disable_family, this parameter more or less disables
a whole address family. Though the sockets are still created, the
corresponding information isn't added to the connection strings.

Likewise, we don't try to connect to addresses matching the disabled
address family.

This is particularly important for multidomain clusters, where IPv4 is
often filtered (firewalled), sometimes by simply dropping the packets
instead of rejecting them (thus causing a connection timeout instead of
a quick "no route to host").
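
The filtering idea can be sketched like this. It is an illustration under assumptions, not the OOB/TCP code: the function name is hypothetical, and the values 4 and 6 used to name a family are chosen for the example rather than taken from the actual parameter.

```python
import ipaddress


def filter_contact_addresses(addresses, disabled_family=None):
    """Drop addresses of the disabled family (4 or 6) from a contact list."""
    if disabled_family is None:
        return list(addresses)
    return [
        addr
        for addr in addresses
        if ipaddress.ip_address(addr).version != disabled_family
    ]
```

Applying the same filter when publishing addresses and when choosing peers to connect to means no connection attempt is made on the disabled family, so silently dropped packets cannot cause a timeout.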

This commit was SVN r18163.
2008-04-16 09:22:00 +00:00
Ralph Castain
a4ea756a76 Ensure the node loop cntr gets incremented if the daemon already exists
This commit was SVN r18150.
2008-04-15 14:20:03 +00:00
Ralph Castain
35c260a14f Fix the plm modules to accommodate the new remote_spawn entry - set that entry to NULL for all but rsh as only that module supports it at this time
This commit was SVN r18145.
2008-04-14 19:36:13 +00:00
Ralph Castain
84156c422f Egad! Typo snuck in there...nasty vi!
This commit was SVN r18144.
2008-04-14 18:29:11 +00:00
Ralph Castain
7c7304466c Add a binomial tree-based launch to ssh, turned "on" only when the plm_rsh_tree_spawned mca param is set to a non-zero value. This probably isn't a very optimized capability, but it does execute a tree-based launch that may scale better than linear at high node counts.
Add the daemon map capability to the ODLS to create and save a map of daemon vpid vs nodename from the launch message.

Cleanup a few places in the base plm launch support where we didn't adequately protect rml recv's from potentially executing sends.
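
For readers unfamiliar with binomial trees, the following minimal sketch computes one common layout. It is not necessarily the layout the rsh launcher uses, and the function names are hypothetical.

```python
def binomial_parent(rank):
    """Parent of a rank: clear its highest set bit (the root has no parent)."""
    if rank == 0:
        return None
    return rank & ~(1 << (rank.bit_length() - 1))


def binomial_children(rank, size):
    """Children of a rank in a binomial tree over ranks 0..size-1."""
    mask = 1
    while mask <= rank:  # skip past the highest set bit of rank
        mask <<= 1
    children = []
    while rank + mask < size:
        children.append(rank + mask)
        mask <<= 1
    return children
```

With 8 daemons, `binomial_children(0, 8)` returns `[1, 2, 4]` and `binomial_children(1, 8)` returns `[3, 5]`, so the launch depth grows logarithmically with the number of nodes instead of linearly.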

This commit was SVN r18143.
2008-04-14 18:26:08 +00:00
Ralph Castain
e050f37578 Cleanup a few warnings about initializing variables.
Remove an obsolete data value.

This commit was SVN r18129.
2008-04-10 19:15:16 +00:00
Ralph Castain
851279fc9f Consolidate the daemon wireup message into the launch message. The daemons don't need their contact info prior to the launch message anyway. This not only eliminates a job-wide communication from the startup procedure, but it also resolves a race condition reported when operating across highly distributed (i.e., cross-country) networks. In such scenarios, it proved possible for a daemon to receive its launch message -before- it had received the contact info message, even though the latter had been sent first!
This eliminates that problem...

This commit was SVN r18126.
2008-04-10 15:35:11 +00:00
Ralph Castain
57e3e86cda Use the proper exit code for mpirun to indicate an error when something goes wrong during launch (in scenarios where the procs don't report the problem directly themselves)
This commit was SVN r18121.
2008-04-10 09:15:08 +00:00
Ralph Castain
e7d0dae89d Ensure we update the daemon collective trees if num_procs changes, but only if it changes
This commit was SVN r18120.
2008-04-10 03:44:18 +00:00
Ralph Castain
22343e6e0b Given total lack of interest/support from the folks behind these environments, and the fact that we can now scale so well with our own daemons, it seems unlikely that we will be able to pursue direct and/or standalone launch in these environments. If that situation ever changes, it is easy enough to revive the effort since little had really been done to date.
Meantime, no reason to continue dragging these around.

This commit was SVN r18119.
2008-04-10 02:54:13 +00:00
Ralph Castain
dc2f88b9f0 Now that we have the daemon collectives, the unity routed module no longer needs the "hack" we inserted a week ago to tell the daemons how to talk directly to all the application procs. The modex and barrier messages flow cleanly across the daemons and are "dropped" into the procs where required.
Add some insurance to make certain that the daemons' number of procs only gets updated when it absolutely is intended.

This commit was SVN r18118.
2008-04-10 02:45:42 +00:00
Ralph Castain
0b3122ee2f Update the cnos module - should (hopefully) compile and work...
This commit was SVN r18117.
2008-04-09 22:33:00 +00:00
Ralph Castain
86b4ae5970 Remove a generated file from the repository - shouldn't have been there
This commit was SVN r18116.
2008-04-09 22:13:51 +00:00
Ralph Castain
3a0d09300b Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations.
Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study.

This commit was SVN r18115.
2008-04-09 22:10:53 +00:00
Ralph Castain
95d7e177c6 Not really a test, but a useful tool for testing computation of binomial trees
This commit was SVN r18113.
2008-04-09 21:58:42 +00:00
Ralph Castain
11c6773c83 Commit a patch from Brian that fixes potential segfaults in systems where IPv6 include files are found, but the kernel doesn't actually support IPv6.
This commit was SVN r18106.
2008-04-09 12:53:24 +00:00
Ralph Castain
5e6dc24e62 Fix ompi-server so it works with unity routed module - still not working with tree routing.
Cleanup debug flag so it activates debugging on the data server code itself

This commit was SVN r18080.
2008-04-04 19:17:28 +00:00
Tim Prins
313edd8955 - Fix a problem reported on the users list where we would segfault in finalize after calling spawn if the user did not call MPI_Comm_disconnect
- Fix the app context constructor so it initializes all the fields.

This commit was SVN r18079.
2008-04-04 15:07:39 +00:00
Ralph Castain
537395b924 Make two important MCA params "visible" to ompi_info
This commit was SVN r18074.
2008-04-02 14:54:57 +00:00
Lenny Verkhovsky
2be4e32c79 1. Fixing Possible strdup of NULL
2. Fixing num_alloc when combining mapping policies (rankfile & byslot or bynode)

This commit was SVN r18073.
2008-04-02 14:12:38 +00:00
Ralph Castain
f115b4aed2 Checkpoint the revised gather algorithm
This commit was SVN r18072.
2008-04-02 13:35:06 +00:00
Adrian Knoth
a56b9b1df1 Fix broken build with --disable-ipv6.
This commit was SVN r18071.
2008-04-02 10:53:48 +00:00
Ralph Castain
50433bf833 Turn off the new fqdn behavior pending resolution of hostfile issue
This commit was SVN r18064.
2008-04-01 20:52:22 +00:00
Ralph Castain
8dca132604 Cleanup some ignores
Add missing variables!

This commit was SVN r18063.
2008-04-01 20:32:17 +00:00
Ralph Castain
51533c9340 Add a new mapper component that sequentially maps ranks-to-hosts according to the ordering in the hostfile.
Not functional yet - still under development. Just placeholding for now to clear a backlog

This commit was SVN r18062.
2008-04-01 20:03:49 +00:00
Ralph Castain
ee5b96269e The RML is comfortable with zero-byte payloads, so don't pack something we don't need
This commit was SVN r18061.
2008-04-01 19:24:46 +00:00
Ralph Castain
3a4c10efd6 Delete obsolete file, cleanup obsolete cruft in another file
This commit was SVN r18060.
2008-04-01 18:36:23 +00:00
Ralph Castain
39c2680e9a Silence warning
This commit was SVN r18057.
2008-04-01 13:42:16 +00:00
Ralph Castain
524ed5d515 Don't have singletons wireup the iof. Instead, we let the fork'd orted handle io forwarding. This prevents an issue with the event library and pty's on singletons
This commit was SVN r18056.
2008-04-01 12:40:00 +00:00