Commit graph

1356 commits

Author SHA1 Message Date
Ralph Castain
ad894b050b Set the bookmark so the first process of a comm_spawn'd job will be mapped to the same node as the spawning proc, assuming it has space. If not, then the mapper will automatically move to the next node.
This commit was SVN r18346.
2008-05-01 15:24:03 +00:00
Ralph Castain
1766442591 Fix a double-free when tree-spawning
Fix the round-robin mapper so it doesn't move to the next node just because it completed mapping an app_context

This commit was SVN r18344.
2008-05-01 14:49:56 +00:00
Ralph Castain
3e55fe6f6d Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit.
Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs.

Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node.

This commit was SVN r18338.
2008-04-30 19:49:53 +00:00
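The log for r18338 does not spell out the node-name encoding. A minimal sketch of the general idea, assuming every name is a shared prefix followed by a zero-padded, contiguous number. The function names and the tuple layout are hypothetical and are not the ORTE wire format:

```python
import re

def encode_nodenames(names):
    """Collapse names like node001..node128 into (prefix, width, start, count).

    Returns None when the list is not one contiguous numbered run; a caller
    would then fall back to sending the full names.
    """
    parsed = []
    for name in names:
        # Lazy prefix, then the longest run of digits at the end of the name.
        m = re.fullmatch(r"(.*?)(\d+)", name)
        if m is None:
            return None
        parsed.append((m.group(1), m.group(2)))
    if not parsed:
        return None
    prefix, first_digits = parsed[0]
    width, start = len(first_digits), int(first_digits)
    for offset, (p, digits) in enumerate(parsed):
        if p != prefix or len(digits) != width or int(digits) != start + offset:
            return None
    return (prefix, width, start, len(parsed))

def decode_nodenames(encoded):
    """Expand (prefix, width, start, count) back into the full name list."""
    prefix, width, start, count = encoded
    return [f"{prefix}{start + i:0{width}d}" for i in range(count)]
```

With this layout the encoded size is independent of the node count, which is where the "few bytes instead of tens-of-bytes/node" saving would come from.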
Ralph Castain
4c2c6c9bd8 Ensure the pack/unpacks match for tree-spawn
This commit was SVN r18282.
2008-04-24 18:53:08 +00:00
Ralph Castain
09b6758f8c Pass the prefix dir to the remote orted when doing tree-based spawns
This commit was SVN r18280.
2008-04-24 18:38:24 +00:00
Josh Hursey
2c736873bb Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors.
The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge.

The fix is to have the OMPI level shut down tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.

Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.

 * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level.
 * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
 * Update ft_event functions in PML and BML to handle the new restart state.
 * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.

This commit was SVN r18276.
2008-04-24 17:54:22 +00:00
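The descriptor recycling behind r18276 can be reproduced in miniature. This is only an illustration of the failure mode, assuming a POSIX system (where a new descriptor takes the lowest free number) and a single-threaded process. It uses a pipe instead of TCP sockets, and the `upper_fd`/`lower_fd` names are hypothetical stand-ins for the OMPI and ORTE layers:

```python
import os

def demo_fd_reuse():
    """Show a stale descriptor number closing another layer's new descriptor."""
    r, w = os.pipe()
    upper_fd = os.dup(w)   # descriptor the upper layer held at checkpoint time
    os.close(upper_fd)     # torn down underneath it; the layer keeps the number
    lower_fd = os.dup(w)   # lower layer restarts first and opens a descriptor
    reused = lower_fd == upper_fd

    os.close(upper_fd)     # upper layer closes "its" old descriptor on restart
    try:
        os.write(lower_fd, b"ping")
        lower_broken = False
    except OSError:
        # The lower layer's descriptor was closed without its knowledge.
        lower_broken = True

    os.close(r)
    os.close(w)
    return reused, lower_broken
```

Closing the upper layer's descriptors before the lower layer reopens anything, as the commit does, removes the window in which the number can be recycled.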
Ralph Castain
eece9f88f0 Fix a bug in the way we computed local_rank. This needs to be the local_rank -among my job peers- on a node.
We were mistakenly computing the local_rank across -all- jobs with procs on that node. While the two definitions are equivalent for an initial launch, comm_spawn'd procs would get the wrong local_rank. In particular, there would not be a local_rank=0 proc in the comm_spawn'd job on any node that was shared with the initial job.

This commit was SVN r18263.
2008-04-23 17:42:59 +00:00
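A small sketch of the distinction r18263 draws, assuming each proc is described by a hypothetical `(jobid, vpid, node)` tuple. Keying the counter by node and job, rather than by node alone, is what gives every job a local_rank=0 proc on each node it occupies:

```python
from collections import defaultdict

def compute_local_ranks(procs):
    """procs: iterable of (jobid, vpid, node). Returns {(jobid, vpid): local_rank}.

    The counter is keyed by (node, jobid), so each job starts again at 0 on
    every node, even when another job already has procs there.
    """
    next_rank = defaultdict(int)
    local_ranks = {}
    for jobid, vpid, node in sorted(procs):
        local_ranks[(jobid, vpid)] = next_rank[(node, jobid)]
        next_rank[(node, jobid)] += 1
    return local_ranks
```

Counting by node alone would reproduce the bug: a comm_spawn'd job sharing a node with its parent would start above 0.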
Ralph Castain
f56f06a7ff Do not trust the RM's names - apparently, RR has trained it to lie! Default to using the name we got from gethostname as it is the only one we can trust.
This commit was SVN r18259.
2008-04-23 17:00:35 +00:00
Ralph Castain
8001e4e99c See if this will fix a race condition showing up in comm_spawn MTT testing
This commit was SVN r18257.
2008-04-23 15:43:44 +00:00
Ralph Castain
5311b13b60 Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list
Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system.

This commit was SVN r18252.
2008-04-23 14:52:09 +00:00
Lenny Verkhovsky
456ce6c4da A few cleanups in Rank_File component + fixed opal_paffinity_slot_list without rankfile
This commit was SVN r18249.
2008-04-23 13:34:05 +00:00
Shiqing Fan
eb5f5d77cc If it's not the HNP, release the cluster object first and return.
This commit was SVN r18247.
2008-04-23 13:21:32 +00:00
Josh Hursey
750ce0152c After a bit of testing this morning it seems that the tree component is able to work correctly with the checkpoint/restart functionality. So enable this component when C/R is enabled.
This commit was SVN r18246.
2008-04-23 13:01:23 +00:00
Josh Hursey
cc83d41ad9 Merge in tmp/jjh-scratch
{{{
 svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch .
}}}

Contains:
 * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart.
 * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff
 * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P.
 * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry
 * Some other sundry cleanup items all dealing with C/R functionality in the trunk.

This commit was SVN r18241.
2008-04-23 00:17:12 +00:00
Ralph Castain
c3ddf66445 Move the display-allocation code to where it is always seen
This commit was SVN r18227.
2008-04-21 20:28:59 +00:00
Ralph Castain
16c9100633 Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation.
This commit was SVN r18216.
2008-04-20 02:25:45 +00:00
Ralph Castain
07f0a71faa Cleanup the show_help entries on the seq mapper
This commit was SVN r18191.
2008-04-17 14:43:15 +00:00
Ralph Castain
e7487ad533 Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile.
Restore the "do-not-launch" functionality so users can test a mapping without launching it.

Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests.

Add a function to hostfile to generate an ordered list of host names from a hostfile

This commit was SVN r18190.
2008-04-17 13:50:59 +00:00
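A rough sketch of what sequential mapping in r18190 amounts to, assuming one host per hostfile line with `#` comments and any trailing fields ignored. The function names are hypothetical, and rejecting a proc count larger than the host list is an assumption, not documented behavior:

```python
def ordered_hosts(hostfile_text):
    """Return host names in file order, keeping duplicates."""
    hosts = []
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            hosts.append(line.split()[0])     # first field is the host name
    return hosts

def seq_map(hostfile_text, num_procs=None):
    """Map rank i to the i-th host listed in the hostfile."""
    hosts = ordered_hosts(hostfile_text)
    if num_procs is None:
        num_procs = len(hosts)
    if num_procs > len(hosts):
        raise ValueError("more procs requested than hosts listed")
    return {rank: hosts[rank] for rank in range(num_procs)}
```

Because nothing is launched, this is also the kind of result a "do-not-launch" run lets a user inspect.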
Ralph Castain
66e532669a Remove some dead code
This commit was SVN r18182.
2008-04-16 20:33:53 +00:00
Ralph Castain
3413191e52 Fix singleton and singleton comm_spawn
This commit was SVN r18177.
2008-04-16 14:38:10 +00:00
Ralph Castain
7b91f8baff Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory.
Fix the ompi-server -h cmd line option so it actually tells you something!

Add two new testing codes to the orte/test/mpi area: accept and connect.

This commit was SVN r18176.
2008-04-16 14:27:42 +00:00
Adrian Knoth
84e4013530 Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set.
This commit was SVN r18164.
2008-04-16 09:31:15 +00:00
Adrian Knoth
0ddfff4ffe Added new oob-tcp parameter oob_tcp_disable_family.
Like btl_tcp_disable_family, this parameter more or less disables
a whole address family. Though the sockets are still created, the
corresponding information isn't added to the connection strings.

Likewise, we don't try to connect to addresses matching the disabled
address family.

This is particularly important for multidomain clusters, where IPv4 is
often filtered (firewalled), sometimes by simply dropping the packets
instead of rejecting them (thus causing a connection timeout instead of
a quick "no route to host").

This commit was SVN r18163.
2008-04-16 09:22:00 +00:00
Ralph Castain
a4ea756a76 Ensure the node loop cntr gets incremented if the daemon already exists
This commit was SVN r18150.
2008-04-15 14:20:03 +00:00
Ralph Castain
35c260a14f Fix the plm modules to accommodate the new remote_spawn entry - set that entry to NULL for all but rsh as only that module supports it at this time
This commit was SVN r18145.
2008-04-14 19:36:13 +00:00
Ralph Castain
84156c422f Egad! Typo snuck in there...nasty vi!
This commit was SVN r18144.
2008-04-14 18:29:11 +00:00
Ralph Castain
7c7304466c Add a binomial tree-based launch to ssh, turned "on" only when the plm_rsh_tree_spawned mca param is set to a non-zero value. This probably isn't a very optimized capability, but it does execute a tree-based launch that may scale better than linear at high node counts.
Add the daemon map capability to the ODLS to create and save a map of daemon vpid vs nodename from the launch message.

Cleanup a few places in the base plm launch support where we didn't adequately protect rml recv's from potentially executing sends.

This commit was SVN r18143.
2008-04-14 18:26:08 +00:00
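For readers unfamiliar with binomial launch trees, here is one common way to derive each daemon's children and parent from its rank. It is a generic textbook construction with daemon 0 as the root. The exact numbering used by the rsh module in r18143 is not stated in the log, so treat the layout as an assumption:

```python
def binomial_children(rank, num_daemons):
    """Children of `rank` in a binomial tree rooted at 0."""
    mask = 1
    while mask <= rank:          # smallest power of two above rank
        mask <<= 1
    children = []
    while rank + mask < num_daemons:
        children.append(rank + mask)
        mask <<= 1
    return children

def binomial_parent(rank):
    """Parent of `rank`: clear its highest set bit. The root has no parent."""
    if rank == 0:
        return None
    return rank - (1 << (rank.bit_length() - 1))
```

In this layout each daemon starts at most about log2(N) others, instead of a single process starting all N, which is the reason a tree launch may scale better than a linear one.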
Ralph Castain
e050f37578 Cleanup a few warnings about initializing variables.
Remove an obsolete data value.

This commit was SVN r18129.
2008-04-10 19:15:16 +00:00
Ralph Castain
851279fc9f Consolidate the daemon wireup message into the launch message. The daemons don't need their contact info prior to the launch message anyway. This not only eliminates a job-wide communication from the startup procedure, but it also resolves a race condition reported when operating across highly distributed (i.e., cross-country) networks. In such scenarios, it proved possible for a daemon to receive its launch message -before- it had received the contact info message, even though the latter had been sent first!
This eliminates that problem...

This commit was SVN r18126.
2008-04-10 15:35:11 +00:00
Ralph Castain
57e3e86cda Use the proper exit code for mpirun to indicate an error when something goes wrong during launch (in scenarios where the procs don't report the problem directly themselves)
This commit was SVN r18121.
2008-04-10 09:15:08 +00:00
Ralph Castain
e7d0dae89d Ensure we update the daemon collective trees if num_procs changes, but only if it changes
This commit was SVN r18120.
2008-04-10 03:44:18 +00:00
Ralph Castain
22343e6e0b Given total lack of interest/support from the folks behind these environments, and the fact that we can now scale so well with our own daemons, it seems unlikely that we will be able to pursue direct and/or standalone launch in these environments. If that situation ever changes, it is easy enough to revive the effort since little had really been done to-date.
Meantime, no reason to continue dragging these around.

This commit was SVN r18119.
2008-04-10 02:54:13 +00:00
Ralph Castain
dc2f88b9f0 Now that we have the daemon collectives, the unity routed module no longer needs the "hack" we inserted a week ago to tell the daemons how to talk directly to all the application procs. The modex and barrier messages flow cleanly across the daemons and are "dropped" into the procs where required.
Add some insurance to make certain that the daemons' number of procs only gets updated when it absolutely is intended.

This commit was SVN r18118.
2008-04-10 02:45:42 +00:00
Ralph Castain
0b3122ee2f Update the cnos module - should (hopefully) compile and work...
This commit was SVN r18117.
2008-04-09 22:33:00 +00:00
Ralph Castain
3a0d09300b Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations.
Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study.

This commit was SVN r18115.
2008-04-09 22:10:53 +00:00
Ralph Castain
11c6773c83 Commit a patch from Brian that fixes potential segfaults in systems where IPv6 include files are found, but the kernel doesn't actually support IPv6.
This commit was SVN r18106.
2008-04-09 12:53:24 +00:00
Lenny Verkhovsky
2be4e32c79 1. Fix possible strdup of NULL
2. Fix num_alloc when mapping policies are combined (rankfile & byslot or bynode)

This commit was SVN r18073.
2008-04-02 14:12:38 +00:00
Ralph Castain
f115b4aed2 Checkpoint the revised gather algorithm
This commit was SVN r18072.
2008-04-02 13:35:06 +00:00
Adrian Knoth
a56b9b1df1 Fix broken build with --disable-ipv6.
This commit was SVN r18071.
2008-04-02 10:53:48 +00:00
Ralph Castain
50433bf833 Turn off the new fqdn behavior pending resolution of hostfile issue
This commit was SVN r18064.
2008-04-01 20:52:22 +00:00
Ralph Castain
51533c9340 Add a new mapper component that sequentially maps ranks-to-hosts according to the ordering in the hostfile.
Not functional yet - still under development. Just placeholding for now to clear a backlog

This commit was SVN r18062.
2008-04-01 20:03:49 +00:00
Ralph Castain
ee5b96269e The RML is comfortable with zero-byte payloads, so don't pack something we don't need
This commit was SVN r18061.
2008-04-01 19:24:46 +00:00
Ralph Castain
3a4c10efd6 Delete obsolete file, cleanup obsolete cruft in another file
This commit was SVN r18060.
2008-04-01 18:36:23 +00:00
Ralph Castain
39c2680e9a Silence warning
This commit was SVN r18057.
2008-04-01 13:42:16 +00:00
Ralph Castain
524ed5d515 Don't have singletons wireup the iof. Instead, we let the fork'd orted handle io forwarding. This prevents an issue with the event library and pty's on singletons
This commit was SVN r18056.
2008-04-01 12:40:00 +00:00
Ralph Castain
3e8846d685 Some code cleanups from Brian to clarify port selection and opening logic
This commit was SVN r18055.
2008-04-01 12:39:02 +00:00
Ralph Castain
fe88956080 Fix singleton modex - ensure singletons know that a daemon is now in the system
This commit was SVN r18047.
2008-03-31 20:36:27 +00:00
Ralph Castain
f3936ff9bc Record the daemon's state so that we don't attempt to send "die" messages to a daemon that is known to have failed to start.
This commit was SVN r18044.
2008-03-31 18:15:24 +00:00
George Bosilca
ee784b601e For consistency reasons always use opal_home_directory and
opal_tmp_directory.

This commit was SVN r18043.
2008-03-31 18:13:41 +00:00
Ralph Castain
d8eb0eeec3 Correct the debug output
This commit was SVN r18042.
2008-03-31 18:09:37 +00:00
Ralph Castain
2b399a3563 Suppress a warning message - relegate it to only show up when verbosity is set as it is okay for this condition to be true
This commit was SVN r18041.
2008-03-31 17:48:07 +00:00
Ralph Castain
f327ebce31 Get the jobid correct - doh!
This commit was SVN r18040.
2008-03-31 17:42:50 +00:00
Ralph Castain
e396b9ee9a Fix unity routed component by adding xcast of proc data to the daemons. This enables daemons to complete the revised modex procedure by forwarding their collected modex info to the rank=0 proc.
This commit was SVN r18039.
2008-03-31 17:35:29 +00:00
George Bosilca
493677426d Use the OPAL function to retrieve the HOME and TMP environment values.
This commit was SVN r18037.
2008-03-31 17:10:08 +00:00
Ralph Castain
379b8a3e2f Fix singleton operations that have no data in the modex.
Note: this also allows -any- modex operation to have zero data in it, not just singletons.

This commit was SVN r18034.
2008-03-31 13:53:23 +00:00
Ralph Castain
1889bbd119 Quiet some warnings about uninitialized variables
This commit was SVN r18032.
2008-03-31 13:52:10 +00:00
Ralph Castain
8506be755d Clean-up the mess. Repair static builds. Remove unused and empty C-decl braces. Add missing prototype for function.
This commit was SVN r18031.
2008-03-31 13:02:33 +00:00
Ralph Castain
81a83dabc6 Setup sandbox for testing new orte collectives
This commit was SVN r18026.
2008-03-31 04:21:37 +00:00
George Bosilca
594884b613 The return is an int not a pointer.
This commit was SVN r18024.
2008-03-30 19:06:25 +00:00
George Bosilca
a6d5c15249 There is no need to force opal_progress down there. It will get called a few
steps further up.

This commit was SVN r18022.
2008-03-30 19:05:09 +00:00
Lenny Verkhovsky
7e45d7e134 A few updates due to RMAPS rank_file component changes
1. applied prefix rule to functions and variables of RMAPS rank_file component
2. cleaned ompi_mpi_init.c from paffinity code
3. paffinity code moved to new opal/mca/paffinity/base/paffinity_base_service.c file
4. added opal_paffinity_slot_list mca parameter

This commit was SVN r18019.
2008-03-30 11:52:11 +00:00
Lenny Verkhovsky
cb83a1287d Really deleted old files now
This commit was SVN r18018.
2008-03-30 11:50:19 +00:00
Lenny Verkhovsky
f734ba51a4 Added files with names according to prefix rule
This commit was SVN r18017.
2008-03-30 11:42:09 +00:00
Lenny Verkhovsky
b43f4a2dc9 Deleted and added files after prefix rule changes
This commit was SVN r18016.
2008-03-30 11:41:01 +00:00
Ralph Castain
9f1001a6f8 Ensure that the procs know how many daemons will be participating in collective operations.
This commit was SVN r17992.
2008-03-27 17:31:54 +00:00
Ralph Castain
6166278e18 Improve the scalability of the modex operation and fix a bug reported by Tim P
The bug was a race condition in the barrier operation that caused the barrier in MPI_Finalize to fail on very short programs.

Scalability was improved by using the daemons to aggregate modex and barrier messages before sending them to the rank=0 proc. Improvement is proportional to ppn, of course, but there really wasn't a scaling problem at low ppn anyway. This modification also paves the way for better allgather operations since now all the data for each node is sitting at the daemon level, and the daemons are now aware that a collective operation on the OOB is underway (so they -can- participate in a collective of their own to support it).

Also added better diagnostics to map out the timing associated with MPI_Init - turned on by -mca orte_timing 1.

This commit was SVN r17988.
2008-03-27 15:17:53 +00:00
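A toy model of the aggregation in r17988, to make the "proportional to ppn" remark concrete. It assumes each contribution is a hypothetical `(node, rank, payload)` tuple and that one daemon per node bundles its local procs' data into a single message for the collector:

```python
from collections import defaultdict

def aggregate_by_daemon(contributions):
    """Group per-proc contributions into one bundle per node.

    contributions: list of (node, rank, payload).
    Returns a list of (node, [(rank, payload), ...]), one entry per daemon.
    """
    per_node = defaultdict(list)
    for node, rank, payload in contributions:
        per_node[node].append((rank, payload))
    return [(node, sorted(entries)) for node, entries in sorted(per_node.items())]
```

With 4 nodes at 8 procs each, the collector receives 4 bundles instead of 32 separate messages, a reduction by the ppn factor.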
Ralph Castain
8e6da2ee76 Maintain the mapping bookmark across multiple comm_spawns
This commit was SVN r17984.
2008-03-27 00:19:13 +00:00
Ralph Castain
abfb3577c1 Ensure that the bookmark of the parent job is applied to the child in a comm_spawn so we start mapping from the right place
This commit was SVN r17982.
2008-03-26 21:18:16 +00:00
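The bookmark in r17982 and r17984 can be pictured as the node index at which the next mapping resumes. A simplified by-slot sketch follows. The `nodes`, `used`, and `bookmark` names are hypothetical, and oversubscription is simply rejected here, which is an assumption rather than the mapper's real policy:

```python
def map_byslot(nodes, used, num_procs, bookmark=0):
    """Place procs slot by slot, starting at the bookmarked node.

    nodes: list of (name, slots); used: dict of slots already taken per node
    (updated in place). Returns (placements, new_bookmark).
    """
    free = sum(slots - used.get(name, 0) for name, slots in nodes)
    if num_procs > free:
        raise ValueError("not enough free slots")
    placements = []
    idx = bookmark
    while len(placements) < num_procs:
        name, slots = nodes[idx]
        if used.get(name, 0) < slots:
            used[name] = used.get(name, 0) + 1
            placements.append(name)
        else:
            idx = (idx + 1) % len(nodes)  # move on only when the node is full
    return placements, idx
```

Passing the returned bookmark into the next call lets a child job start where the parent's mapping stopped, and the mapper moves to the next node only when the current one has no space left.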
Ralph Castain
7ad6db207c Cover some timing-related output
This commit was SVN r17977.
2008-03-26 12:54:50 +00:00
Rainer Keller
ce8154eb3e - Coverity issues CID 945:
Event uninit_use: Using uninitialized value "rc"
   Instead of initializing rc in the beginning, rather use return value
   of opal_hash_table_set_value_uint32.

This commit was SVN r17976.
2008-03-26 11:39:25 +00:00
Brad Benton
0b84dfd2a6 POE is not currently working or supported, so removing from the trunk.
This commit was SVN r17970.
2008-03-26 02:06:40 +00:00
Ralph Castain
60d931217f Modify the routed framework to allow greater control/flexibility over response to lost routes and initial wireup of jobs as required by several soon-to-come new modules.
Specifically, add two new APIs:

1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job.

Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections.

2. get_wireup_info: returns (in a provided buffer) info required to wireup connections for the specified job. Some routed modules do not need to return any info as they can wireup via alternative means, while some need to xchg data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job.

The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module.

This commit was SVN r17969.
2008-03-26 01:00:24 +00:00
George Bosilca
2ed6ed37bd Don't forget to cleanup once we're done.
This commit was SVN r17965.
2008-03-25 22:42:24 +00:00
George Bosilca
ac6121bd1c Remove unused variable.
This commit was SVN r17964.
2008-03-25 22:41:50 +00:00
Jeff Squyres
183fcdf51b Remove duplicate free(), fixing CID 973.
This commit was SVN r17959.
2008-03-25 20:30:56 +00:00
Ralph Castain
90107f3c14 Fix an issue with comm_spawn over who sent/recv first in the modex. The modex assumes that the first name on the list is the "root" that will serve as the allgather collector/distributor. The dpm was putting that entity last, which forced us to pre-inform the parent procs of the child proc's contact info since the parent was trying to send to the child.
Clarify the setting of send_first in the mpi bindings (trivial, i know, but helpful)

Remove the extra xcast of child contact info to the parent job.

This commit was SVN r17952.
2008-03-25 14:57:34 +00:00
Ralph Castain
cca449e379 Move an OMPI RML tag to the OMPI layer
This commit was SVN r17950.
2008-03-25 13:30:48 +00:00
Ralph Castain
4efddc7b0a Fix the allgather and allgather_list functions to avoid deadlocks at large node/proc counts. Violated the RML rules here - we received the allgather buffer and then did an xcast, which causes a send to go out, and is then subsequently received by the sender. This fix breaks that pattern by forcing the recv to complete outside of the function itself - thus, the allgather and allgather_list always complete their recvs before returning or sending.
Reorganize the grpcomm code a little to provide support for soon-to-come new grpcomm components. The revised organization puts what will be common code elements in the base to avoid duplication, while allowing components that don't need those functions to ignore them.

This commit was SVN r17941.
2008-03-24 20:50:31 +00:00
Ralph Castain
58d51f2689 Revert that! Need to complete the rest of the change so the orted knows the correct nodeid...
Sorry

This commit was SVN r17939.
2008-03-24 18:17:26 +00:00
Ralph Castain
dae4518878 Use the correct nodeid!
This commit was SVN r17938.
2008-03-24 18:15:08 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Ralph Castain
f8642e9390 Add debug to tell us when we opened a socket and to whom
This commit was SVN r17911.
2008-03-21 15:47:47 +00:00
Ralph Castain
19ffdfef42 Add some debugging output to tell us what interfaces were considered and used by OOB
This commit was SVN r17909.
2008-03-21 15:35:40 +00:00
Ralph Castain
c2fd5dd416 Clarify method used to translate application proc termination codes to exit status codes
This commit was SVN r17899.
2008-03-20 18:50:05 +00:00
Brian Barrett
2bf4784893 Set a meaningful orte_system_info.nodeid on Catamount
This commit was SVN r17898.
2008-03-20 16:55:57 +00:00
Ralph Castain
f8a10dfb93 Complete the fix of the orted vs mpirun race condition for finalizing. The darned mpirun is just too fast! Rather than try to slow it down, we set the orte_finalizing flag -prior- to telling mpirun the orted is leaving. This ensures we don't mistakenly declare the lifeline lost when mpirun leaves in a hurry.
This commit was SVN r17897.
2008-03-20 16:55:24 +00:00
Ralph Castain
6bb139e4f2 One more correction to mpirun exit codes - cleanup the application proc's exit codes in the orted so that non-zero exit codes generated by mpirun itself don't get "munged".
Modify the multi_abort function so they all return different exit codes - allows us to tell which one was being reported.

This commit was SVN r17895.
2008-03-20 13:54:11 +00:00
Ralph Castain
27a73ad9ee Fix a race condition between the orteds and HNP that can cause the orteds to output the "lost lifeline" message.
This has been a long-time problem. I tried to reduce the problem by having the orteds tell the HNP they were finalizing, and having the HNP wait until all orteds had reported or we timed out.

What was observed was that all the orteds were correctly reporting that they are leaving, but the HNP is able to exit before the orteds, thus closing the orteds lifeline socket and generating the error output. This is caused by the fact that the orteds have to whack all remaining session directories, which includes that blasted monster shared memory file! Cleaning up the SM file can take quite a while.

The HNP doesn't have that problem as there is no SM file there! So it gets out first.

What we had done in the past to resolve that problem was put a little test in the OOB that checks to see if we are finalizing. If we are, then we ignore the lifeline connection being lost. That check was still in the code - however, we had lost the line in orte_finalize that set the flag!!

This commit was SVN r17893.
2008-03-20 13:30:51 +00:00
Ralph Castain
8ee26a55ca Just turn these off for now - will revisit later
This commit was SVN r17891.
2008-03-20 13:25:35 +00:00
Ralph Castain
67a2cc8a8e Fix a bug noted by Tim P where we would report the incorrect app_context as "not found". If you gave us the command line:
mpirun -n 1 hostname : -n 1 bogus

we would erroneously report that hostname had not been found instead of bogus.

This commit was SVN r17886.
2008-03-19 21:13:13 +00:00
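To illustrate the r17886 report, here is a sketch that splits a command line into app_contexts at `:` and reports the first context whose executable is missing. The helper names are hypothetical, only `-n`/`-np` are recognized as options, and the `exists` callback stands in for a real path lookup:

```python
def split_app_contexts(argv):
    """Split mpirun-style arguments into one token list per app_context."""
    apps, current = [], []
    for tok in argv:
        if tok == ":":
            apps.append(current)
            current = []
        else:
            current.append(tok)
    apps.append(current)
    return apps

def executable_of(app):
    """Skip leading "-n <count>" pairs and return the executable name."""
    i = 0
    while i < len(app) and app[i] in ("-n", "-np"):
        i += 2
    if i >= len(app):
        raise ValueError("app_context has no executable")
    return app[i]

def first_missing(argv, exists):
    """Return (index, executable) of the first app_context that is not found."""
    for idx, app in enumerate(split_app_contexts(argv)):
        exe = executable_of(app)
        if not exists(exe):
            return idx, exe
    return None
```

For `-n 1 hostname : -n 1 bogus`, the report should name the second context's `bogus`, not `hostname`.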
Ralph Castain
ec64bf3da8 Clarify the error output so we can understand if it was a daemon or process that lost its lifeline
This commit was SVN r17880.
2008-03-19 19:06:52 +00:00
Ralph Castain
2ed0e60321 Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize.
Jeff is preparing an (undoubtedly lengthy) explanation/matrix of how these codes are determined for the OMPI FAQ.

This commit was SVN r17879.
2008-03-19 19:00:51 +00:00
Galen Shipman
80ac7c87cd don't forget command file..
This commit was SVN r17878.
2008-03-19 16:24:29 +00:00
Galen Shipman
77c8532cc9 do things in a less hacky way..
This commit was SVN r17877.
2008-03-19 16:23:56 +00:00
Jeff Squyres
ac2e329353 Oops! That should not have been removed...
This commit was SVN r17865.
2008-03-18 14:42:30 +00:00
Jeff Squyres
bd92720d41 More fixes to make it compile and play nice on OS X. Still more fixes
are required; sending mail to devel shortly...

This commit was SVN r17864.
2008-03-18 14:38:52 +00:00
Ralph Castain
8f31a62600 Fix compilation errors so this will compile, remove unused variables
This commit was SVN r17862.
2008-03-18 13:01:26 +00:00
Lenny Verkhovsky
647bce6d3e Support for new RMAPS rank mapping component
This commit was SVN r17860.
2008-03-18 09:39:07 +00:00
Lenny Verkhovsky
14c32f87d5 Added new RMAPS component for rank mapping
This commit was SVN r17859.
2008-03-18 09:33:49 +00:00
Ralph Castain
8cd6142e6d Add some debugging to the grpcomm module. Setting grpcomm_base_verbose = 1 will now give you a trace through the functions as they are called. Setting it to 2 or more will give you details on what each function is doing as it works through its procedure.
This commit was SVN r17848.
2008-03-17 19:34:36 +00:00
Ralph Castain
629b95a2fe Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation.
Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway.

So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked.

Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion.

This commit was SVN r17843.
2008-03-17 17:58:59 +00:00
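The "atomic locks" in r17843 guard against several exit paths all reaching finalize. A minimal analogue of that pattern, using a non-blocking lock acquire as the one-shot test. The class and function names are hypothetical and this is not the ORTE implementation:

```python
import threading

class OneShot:
    """Lets exactly one caller through; the lock is never released."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_enter(self):
        # A non-blocking acquire succeeds for the first caller only.
        return self._lock.acquire(blocking=False)

finalize_guard = OneShot()
calls = []

def finalize():
    """Run the teardown once, no matter how many paths lead here."""
    if not finalize_guard.try_enter():
        return False
    calls.append("finalized")
    return True
```

Every later caller returns immediately, so a framework that cannot tolerate a second finalize never sees one.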
Josh Hursey
aaff245271 A couple verbose additions. Poll the event engine while waiting for the
named pipe.

This commit was SVN r17787.
2008-03-07 21:10:14 +00:00
Galen Shipman
0fb6cf0916 make output use verbose macro..
This commit was SVN r17778.
2008-03-07 03:06:17 +00:00
Shiqing Fan
eb1dfaf4d5 Select the windows CCP component at runtime by testing if we are on Windows cluster.
This commit was SVN r17776.
2008-03-07 01:31:53 +00:00
Ralph Castain
b110a247be Fix comm_spawn (maybe).
Comm_spawn was sticking during spawn_multiple because of a problem in the dpm - the modex there is asking processes to talk to each other in an allgather_list operation, but the procs don't have the required contact info to do so. The solution here was to ensure that all parent procs have full contact info for procs in the child job.

Admittedly, this isn't the long-term answer. We would like to have the contact info given to only the parent procs that were involved in the comm_spawn. There is a way to do that, but this will suffice to keep things working until that can be implemented and tested.

This commit was SVN r17772.
2008-03-06 21:56:00 +00:00
Ralph Castain
64d43cc44b Fix the unity routed component and direct xcast mode.
Ensure that direct xcast handles all its use-cases correctly.

Unity routed component needs to use the base recv function to properly operate.

This commit was SVN r17764.
2008-03-06 18:13:05 +00:00
Ralph Castain
ff99aa054f In order to prevent orphaned processes when using non-unity routing methods, the procs need to realize that their local daemon is a critical connection - if that connection unexpectedly closes, they need to terminate.
This commit adds definition for a "lifeline" connection. For an HNP, there is no lifeline, so the lifeline proc is NULL. For a daemon, the lifeline is the HNP - the daemon should abort if it loses that connection.

For a proc using unity routed, the lifeline is the HNP since it connects directly to the HNP.

For a proc using tree routed, the lifeline is the local daemon.

Adjusted OOB to call abort if the lifeline (as opposed to HNP) connection is lost.

This commit was SVN r17761.
2008-03-06 15:30:44 +00:00
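The lifeline rules listed in r17761 reduce to a small lookup. The sketch below uses hypothetical string labels for process types and peers. The real code works with process names, so this mirrors only the decision logic:

```python
def lifeline_for(proc_type, routed="tree"):
    """Return the peer whose loss should abort this process, or None."""
    if proc_type == "hnp":
        return None                # the HNP has no lifeline
    if proc_type == "daemon":
        return "hnp"               # daemons depend on the HNP
    if proc_type == "app":
        if routed == "unity":
            return "hnp"           # unity procs connect directly to the HNP
        if routed == "tree":
            return "local_daemon"  # tree procs route through their daemon
        raise ValueError(f"unknown routed module: {routed}")
    raise ValueError(f"unknown process type: {proc_type}")

def on_connection_lost(proc_type, peer, routed="tree"):
    """Abort only when the lost peer is this process's lifeline."""
    return "abort" if peer == lifeline_for(proc_type, routed) else "ignore"
```

An application proc under tree routing aborts when its local daemon disappears, which is what prevents orphaned processes.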
Josh Hursey
0b4d9a12ce a bit more verbosity for the fun of it
This commit was SVN r17758.
2008-03-06 14:04:25 +00:00
Tim Prins
f61c2333c0 Remove unneeded field, and the two uses of it.
This commit was SVN r17757.
2008-03-06 12:46:36 +00:00
Tim Prins
d56f19c77d Fix logic error, and remove unneeded checks for invalid results.
This commit was SVN r17756.
2008-03-06 04:38:13 +00:00
Ralph Castain
6d94e7b232 Fix the debug output so it correctly reports launch state
This commit was SVN r17755.
2008-03-06 03:11:01 +00:00
Tim Prins
5de3e1965e Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte.
Everything should work; however, I am unable to compile and test the sctp BTL.

This commit was SVN r17751.
2008-03-05 22:44:35 +00:00
Tim Prins
f9916811ae Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124
The change also:
 - cleans up and simplifies the command line processing code
 - adds an error output if more than one hostfile is passed for a single app context
 - gets rid of the superfluous orte_app_context_map_t type and instead uses a simple argv of -host options

This commit was SVN r17750.

The following Trac tickets were found above:
  Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124
2008-03-05 22:12:27 +00:00
Rolf vandeVaart
03fdd57d5a Fix the use of --path and -x PATH so that things work properly.
Note that --path specifies extra directories where the executable
is searched for, but does not affect the PATH settings.

This commit fixes trac:1221.

This commit was SVN r17748.

The following Trac tickets were found above:
  Ticket 1221 --> https://svn.open-mpi.org/trac/ompi/ticket/1221
2008-03-05 21:07:43 +00:00
Ralph Castain
4dbc352828 Per request, change name of new enviro var to OMPI_COMM_WORLD_LOCAL_SIZE
This commit was SVN r17736.
2008-03-05 14:45:26 +00:00
Ralph Castain
06d3145fe4 First cut at direct launch for TM. Able to launch non-ORTE procs and detect their completion for a clean shutdown.
This commit was SVN r17732.
2008-03-05 13:51:32 +00:00
George Bosilca
c71f225a28 These functions should only be compiled when OPAL_ENABLE_FT == 1.
This commit was SVN r17727.
2008-03-05 05:57:13 +00:00
Josh Hursey
3b4073e32c This commit fixes the checkpoint/restart functionality on the trunk. Included in this commit are:
 * Extension to the ESS framework to support C/R
 * Fixed support for {{{snapc_base_establish_global_snapshot_dir}}}
 * Fixed FileM support
 * Misc. minor code modifications

There are some outstanding visibility issues that I want to fix next.

This commit was SVN r17725.
2008-03-05 04:57:23 +00:00
Ralph Castain
edb8e32a7a Add default hostfile parameter plus --default-hostfile command line option.
Fix error message when job setup failed

This commit was SVN r17724.
2008-03-05 04:54:57 +00:00
Ralph Castain
022fc1f382 Add another MPI-related enviro variable OMPI_COMM_WORLD_NUM_LOCAL_PROCS
This commit was SVN r17723.
2008-03-05 04:53:32 +00:00
Ralph Castain
e745c16ff1 Modify the enviro variable names to be OMPI_...
Add two new ones: OMPI_COMM_WORLD_LOCAL_RANK and OMPI_UNIVERSE_SIZE

This commit was SVN r17694.
2008-03-04 20:16:05 +00:00
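A launched process could pick up the two variables named in commit e745c16ff1 as sketched below. Only the variable names `OMPI_COMM_WORLD_LOCAL_RANK` and `OMPI_UNIVERSE_SIZE` come from the log; the function name `local_layout`, the dictionary keys, and the use of Python are illustrative assumptions.

```python
import os
from typing import Dict, Mapping, Optional


def local_layout(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[int]]:
    """Read the launcher-provided variables; None means the variable is unset."""
    env = os.environ if environ is None else environ

    def read_int(name: str) -> Optional[int]:
        value = env.get(name)
        return int(value) if value is not None else None

    return {
        "local_rank": read_int("OMPI_COMM_WORLD_LOCAL_RANK"),
        "universe_size": read_int("OMPI_UNIVERSE_SIZE"),
    }
```

Passing a mapping explicitly makes the helper easy to exercise outside of a real launch; when the process was not started by the launcher, both values simply come back as None.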
Shiqing Fan
ebf9c0441d Set the windows components invisible.
This commit was SVN r17687.
2008-03-04 17:37:17 +00:00
Shiqing Fan
ae41b5418b Update the RAS and PLM components for Windows.
These changes affect only Windows, not other platforms.

This commit was SVN r17686.
2008-03-04 17:13:01 +00:00
Ralph Castain
ffa232687a Fix xcast so it works in multi-node situations where the user specifies a particular mode to use (e.g., direct).
This commit was SVN r17682.
2008-03-03 20:07:02 +00:00
Ralph Castain
841d0e5208 Cleanup an attribute warning - not sure which one to set or where it should go, so I'll leave that to someone more familiar with "attributes".
Ensure some debugging is only enabled when have_debug is set.

This commit was SVN r17681.
2008-03-03 16:06:47 +00:00
Rich Graham
d37db14901 get the shared memory collectives working again with the new
version of orte.

This commit was SVN r17672.
2008-02-29 22:28:57 +00:00
Ralph Castain
6450962d59 Add some debugging to the message event object.
Cleanup some no-longer-used values

This commit was SVN r17671.
2008-02-29 20:10:31 +00:00
Ralph Castain
a585923de1 Silence some minor compiler warnings
This commit was SVN r17662.
2008-02-29 02:39:39 +00:00
Tim Prins
84b2099fe8 Remove the now-unused orte_value_array. As this is the last 'class' split between orte and ompi, remove the big comment about the split in ompi_bitmap.
Also, update some properties (source files should not be executable...), and remove a couple of unneeded inclusions of orte_proc_table.h

This commit was SVN r17655.
2008-02-28 21:39:42 +00:00
Ralph Castain
5e6928d710 Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message.
Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occurs outside of a recv.

Created two new macros to assist:

ORTE_MESSAGE_EVENT: creates the zero-time event, passing info in a new orte_message_event_t object

ORTE_PROGRESSED_WAIT: while waiting for specified conditions, just calls progress so messages can be recv'd.

Also fixed the failed_launch function as we no longer block in the orted callback function. Updated the error messages to reflect revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output.

This has been tested on Mac, TM, and SLURM.

This commit was SVN r17647.
2008-02-28 19:58:32 +00:00
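The deferral technique from commit 5e6928d710 can be modeled with a small queue-based loop. This is a simplified Python sketch under stated assumptions, not the ORTE code: `EventLoop`, `defer`, `progress`, `progressed_wait`, and `run_demo` are hypothetical names, and a plain FIFO queue stands in for the event library's zero-time timer events.

```python
from collections import deque


class EventLoop:
    """Toy stand-in for an event library with zero-time events."""

    def __init__(self):
        self._pending = deque()

    def defer(self, handler, message):
        # Rough analogue of ORTE_MESSAGE_EVENT: queue the work instead of
        # running it inside the receive callback.
        self._pending.append((handler, message))

    def progress(self):
        # Run one queued handler; report whether anything ran.
        if not self._pending:
            return False
        handler, message = self._pending.popleft()
        handler(message)
        return True

    def progressed_wait(self, condition):
        # Rough analogue of ORTE_PROGRESSED_WAIT: keep calling progress until
        # the condition holds. A real loop would block; this toy raises.
        while not condition():
            if not self.progress():
                raise RuntimeError("nothing left to progress")


def run_demo():
    loop = EventLoop()
    trace = []
    depth = {"recv": 0}  # how many recv callbacks are currently active

    def process(msg):
        trace.append(("process", msg, depth["recv"]))
        if msg == "launch":
            recv("ack")  # processing causes receipt of another message

    def recv(msg):
        depth["recv"] += 1
        trace.append(("recv", msg))
        loop.defer(process, msg)  # do not process inside the recv
        depth["recv"] -= 1

    recv("launch")
    loop.progressed_wait(
        lambda: sum(1 for entry in trace if entry[0] == "process") == 2
    )
    return trace
```

In the trace returned by `run_demo`, every `process` entry records a recv depth of 0, which is the point of the scheme: message handling always runs after the recv that delivered it has returned.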
Ralph Castain
5dc64cea6a Correct logic - only issue recv and cancel it if we are an HNP
This commit was SVN r17641.
2008-02-28 15:27:16 +00:00
George Bosilca
9d421bea2a Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the
implementation of orte_pointer_array.

This commit was SVN r17636.
2008-02-28 05:32:23 +00:00
Ralph Castain
d70e2e8c2b Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately.
Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer

This commit was SVN r17632.
2008-02-28 01:57:57 +00:00
Gleb Natapov
da3e69101d Add missing include.
This commit was SVN r17493.
2008-02-18 14:55:02 +00:00
Galen Shipman
18d1d3b408 Add ORTE ALPS support (Cray XT CNL)
This commit was SVN r17482.
2008-02-17 19:29:06 +00:00
George Bosilca
fcab6cc0bb Fix typo.
This commit was SVN r17255.
2008-01-26 21:36:04 +00:00
Rainer Keller
9d4852cdc1 - Get rid of Wshadow warnings.
This commit was SVN r17231.
2008-01-25 14:07:38 +00:00
Pak Lui
413bcca4c0 Support the qrsh or qsub "-notify" option by catching the SIGUSR1/2
signals and not letting user processes exit on those signals.

This commit was SVN r17174.
2008-01-22 17:32:29 +00:00
Josh Hursey
158dda5458 Fix some overlapping code.
This commit was SVN r17067.
2008-01-08 15:40:21 +00:00
George Bosilca
eb71a634c6 Don't forget to initialize the msg_origin field.
This commit was SVN r17055.
2008-01-04 23:24:49 +00:00
George Bosilca
48f5a26e8c Cast to keep VC happy (quiet).
This commit was SVN r17054.
2008-01-04 23:13:32 +00:00
Adrian Knoth
42d5fe62f9 Fixed misplaced #endif
This commit was SVN r17028.
2008-01-01 11:02:38 +00:00
Jeff Squyres
213b5d5c6e Per long threads on the mailing list and much confusion and discussion
about linkers, have all OPAL, ORTE, and OMPI components '''not''' link
against the OPAL, ORTE, or OMPI libraries.

See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for
details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a
better-formatted version of the same info).

This commit was SVN r16968.
2007-12-15 13:32:02 +00:00
Josh Hursey
f7812baf5b forgot a bit of error checking in the last commit
This commit was SVN r16953.
2007-12-13 14:41:18 +00:00
Josh Hursey
a287c9cb65 This commit distinguishes the file transfer stage from the finish stage.
This commit also cleans up the checkpoint and terminate case making it more
precise than before. Previously the application could make a small amount of
progress between checkpoint completion and application termination. Now the
application will make no progress at all in this time span.

Additional minor change:
 - Start using OPAL_INT_TO_BOOL instead of if/else logic

This commit was SVN r16952.
2007-12-13 14:37:17 +00:00
Rolf vandeVaart
3ea89b69ae Remove a few tabs. Allow the output stream to be
passed to the close command for verbose output.  This
matches all the other frameworks.

This commit was SVN r16938.
2007-12-11 20:44:56 +00:00
Josh Hursey
27c9016b93 sleep -> usleep so we can be a bit more eager when waiting for events to finish.
Still working on solutions that do not involve sleeping, but this will do for
now.

This commit was SVN r16824.
2007-12-03 19:27:32 +00:00
Jeff Squyres
c20350b943 Patch submitted by Brian Barrett, inspired by this thread:
http://www.open-mpi.org/community/lists/users/2007/11/4547.php.

- Better handling of ECONNABORTED from connect on Linux.
- Reduce extraneous output from OOB when TCP connections must
  be retried.

This commit was SVN r16808.
2007-11-30 21:42:15 +00:00
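Commit c20350b943 does not spell out how ECONNABORTED is handled, so the following is only one plausible reading: treat it as a transient error, retry the connect, and stay quiet while doing so. The function `connect_with_retry`, its parameters, and the retry limit are hypothetical and are not taken from the OOB code.

```python
import errno


def connect_with_retry(connect, max_attempts=3, log=None):
    """Call connect() and retry only when it fails with ECONNABORTED.

    `connect` is any zero-argument callable that returns a connection or
    raises OSError. Retries are silent unless a `log` callable is supplied.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError as exc:
            # Anything other than ECONNABORTED, or the final attempt, is fatal.
            if exc.errno != errno.ECONNABORTED or attempt == max_attempts:
                raise
            if log is not None:
                log("connect aborted, retrying (attempt %d)" % attempt)
```

Keeping the log callback optional mirrors the second bullet of the commit message: retries happen, but they produce no output by default.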
Ron Brightwell
edb9d8e354 Added Catamount to the conditional compilation since Catamount
doesn't support fork() or pipe() either.  This removes a
linker warning message when building for Cray XT with Catamount.

This commit was SVN r16772.
2007-11-21 21:37:58 +00:00
George Bosilca
d67c0eefb4 Remove a compilation warning about using uninitialized variables.
This commit was SVN r16589.
2007-10-26 20:15:28 +00:00