Ralph Castain
e050f37578
Clean up a few warnings about initializing variables.
Remove an obsolete data value.
This commit was SVN r18129.
2008-04-10 19:15:16 +00:00
Ralph Castain
851279fc9f
Consolidate the daemon wireup message into the launch message. The daemons don't need their contact info prior to the launch message anyway. This not only eliminates a job-wide communication from the startup procedure, but it also resolves a race condition reported when operating across highly distributed (i.e., cross-country) networks. In such scenarios, it proved possible for a daemon to receive its launch message -before- it had received the contact info message, even though the latter had been sent first!
This eliminates that problem...
This commit was SVN r18126.
2008-04-10 15:35:11 +00:00
Ralph Castain
57e3e86cda
Use the proper exit code for mpirun to indicate an error when something goes wrong during launch (in scenarios where the procs don't report the problem directly themselves)
This commit was SVN r18121.
2008-04-10 09:15:08 +00:00
Ralph Castain
e7d0dae89d
Ensure we update the daemon collective trees if num_procs changes, but only if it changes
This commit was SVN r18120.
2008-04-10 03:44:18 +00:00
Ralph Castain
22343e6e0b
Given the total lack of interest/support from the folks behind these environments, and the fact that we can now scale so well with our own daemons, it seems unlikely that we will be able to pursue direct and/or standalone launch in these environments. If that situation ever changes, it is easy enough to revive the effort since little had really been done to date.
Meantime, no reason to continue dragging these around.
This commit was SVN r18119.
2008-04-10 02:54:13 +00:00
Ralph Castain
dc2f88b9f0
Now that we have the daemon collectives, the unity routed module no longer needs the "hack" we inserted a week ago to tell the daemons how to talk directly to all the application procs. The modex and barrier messages flow cleanly across the daemons and are "dropped" into the procs where required.
Add some insurance to make certain that the daemons' number of procs only gets updated when it absolutely is intended.
This commit was SVN r18118.
2008-04-10 02:45:42 +00:00
Ralph Castain
0b3122ee2f
Update the cnos module - should (hopefully) compile and work...
This commit was SVN r18117.
2008-04-09 22:33:00 +00:00
Ralph Castain
3a0d09300b
Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations.
Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study.
This commit was SVN r18115.
2008-04-09 22:10:53 +00:00
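For illustration, the parent/child relationships of a binomial tree like the one named in r18115 can be sketched as below. This is a minimal sketch, assuming daemons are numbered 0..n-1 with daemon 0 as the root. The names binomial_parent and binomial_children are hypothetical, and the numbering shown is one common binomial-tree layout, not necessarily the one ORTE uses. In an inbound gather of this shape, each daemon would wait for one contribution per child and then pass the combined data to its parent.

```c
/* Sketch only: hypothetical helpers, not ORTE code. */

/* Parent of `rank` in a binomial tree rooted at 0.
 * Clearing the lowest set bit of the rank yields the parent.
 * Returns -1 for the root, which has no parent. */
int binomial_parent(int rank)
{
    return rank == 0 ? -1 : (rank & (rank - 1));
}

/* Store the children of `rank` in an n-node tree into `children`
 * (at most `max` entries) and return how many were stored.
 * A node's children are rank + 2^k for every 2^k below the node's
 * lowest set bit (every 2^k for the root), as long as the result
 * is still a valid rank (< n). */
int binomial_children(int rank, int n, int *children, int max)
{
    int count = 0;
    for (int bit = 1; rank + bit < n; bit <<= 1) {
        if (rank & bit) {
            break;              /* reached the lowest set bit of rank */
        }
        if (count < max) {
            children[count++] = rank + bit;
        }
    }
    return count;
}
```

With 8 daemons, daemon 0 would collect from daemons 1, 2, and 4, while daemon 4 would collect from daemons 5 and 6 before reporting to daemon 0.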
Ralph Castain
11c6773c83
Commit a patch from Brian that fixes potential segfaults in systems where IPv6 include files are found, but the kernel doesn't actually support IPv6.
This commit was SVN r18106.
2008-04-09 12:53:24 +00:00
Lenny Verkhovsky
2be4e32c79
1. Fixing possible strdup of NULL
2. Fixing num_alloc when mapping policies are combined (rankfile & byslot or bynode)
This commit was SVN r18073.
2008-04-02 14:12:38 +00:00
Ralph Castain
f115b4aed2
Checkpoint the revised gather algorithm
This commit was SVN r18072.
2008-04-02 13:35:06 +00:00
Adrian Knoth
a56b9b1df1
Fix broken build with --disable-ipv6.
This commit was SVN r18071.
2008-04-02 10:53:48 +00:00
Ralph Castain
50433bf833
Turn off the new fqdn behavior pending resolution of hostfile issue
This commit was SVN r18064.
2008-04-01 20:52:22 +00:00
Ralph Castain
51533c9340
Add a new mapper component that sequentially maps ranks-to-hosts according to the ordering in the hostfile.
Not functional yet - still under development. Just a placeholder for now to clear a backlog.
This commit was SVN r18062.
2008-04-01 20:03:49 +00:00
Ralph Castain
ee5b96269e
The RML is comfortable with zero-byte payloads, so don't pack something we don't need
This commit was SVN r18061.
2008-04-01 19:24:46 +00:00
Ralph Castain
3a4c10efd6
Delete an obsolete file, clean up obsolete cruft in another file
This commit was SVN r18060.
2008-04-01 18:36:23 +00:00
Ralph Castain
39c2680e9a
Silence warning
This commit was SVN r18057.
2008-04-01 13:42:16 +00:00
Ralph Castain
524ed5d515
Don't have singletons wire up the iof. Instead, let the fork'd orted handle I/O forwarding. This prevents an issue with the event library and ptys on singletons.
This commit was SVN r18056.
2008-04-01 12:40:00 +00:00
Ralph Castain
3e8846d685
Some code cleanups from Brian to clarify port selection and opening logic
This commit was SVN r18055.
2008-04-01 12:39:02 +00:00
Ralph Castain
fe88956080
Fix singleton modex - ensure singletons know that a daemon is now in the system
This commit was SVN r18047.
2008-03-31 20:36:27 +00:00
Ralph Castain
f3936ff9bc
Record the daemon's state so that we don't attempt to send "die" messages to a daemon that is known to have failed to start.
This commit was SVN r18044.
2008-03-31 18:15:24 +00:00
George Bosilca
ee784b601e
For consistency, always use opal_home_directory and opal_tmp_directory.
This commit was SVN r18043.
2008-03-31 18:13:41 +00:00
Ralph Castain
d8eb0eeec3
Correct the debug output
This commit was SVN r18042.
2008-03-31 18:09:37 +00:00
Ralph Castain
2b399a3563
Suppress a warning message: show it only when verbosity is set, since it is okay for this condition to be true.
This commit was SVN r18041.
2008-03-31 17:48:07 +00:00
Ralph Castain
f327ebce31
Get the jobid correct - doh!
This commit was SVN r18040.
2008-03-31 17:42:50 +00:00
Ralph Castain
e396b9ee9a
Fix unity routed component by adding xcast of proc data to the daemons. This enables daemons to complete the revised modex procedure by forwarding their collected modex info to the rank=0 proc.
This commit was SVN r18039.
2008-03-31 17:35:29 +00:00
George Bosilca
493677426d
Use the OPAL function to retrieve the HOME and TMP environment values.
This commit was SVN r18037.
2008-03-31 17:10:08 +00:00
Ralph Castain
379b8a3e2f
Fix singleton operations that have no data in the modex.
Note: this also allows -any- modex operation to have zero data in it, not just singletons.
This commit was SVN r18034.
2008-03-31 13:53:23 +00:00
Ralph Castain
1889bbd119
Quiet some warnings about uninitialized variables
This commit was SVN r18032.
2008-03-31 13:52:10 +00:00
Ralph Castain
8506be755d
Clean up the mess. Repair static builds. Remove unused and empty C-decl braces. Add a missing function prototype.
This commit was SVN r18031.
2008-03-31 13:02:33 +00:00
Ralph Castain
81a83dabc6
Set up a sandbox for testing new orte collectives
This commit was SVN r18026.
2008-03-31 04:21:37 +00:00
George Bosilca
594884b613
The return value is an int, not a pointer.
This commit was SVN r18024.
2008-03-30 19:06:25 +00:00
George Bosilca
a6d5c15249
There is no need to force opal_progress down there. It will get called a few steps higher up.
This commit was SVN r18022.
2008-03-30 19:05:09 +00:00
Lenny Verkhovsky
7e45d7e134
A few updates due to RMAPS rank_file component changes
1. applied prefix rule to functions and variables of RMAPS rank_file component
2. cleaned ompi_mpi_init.c from paffinity code
3. paffinity code moved to new opal/mca/paffinity/base/paffinity_base_service.c file
4. added opal_paffinity_slot_list mca parameter
This commit was SVN r18019.
2008-03-30 11:52:11 +00:00
Lenny Verkhovsky
cb83a1287d
Really deleted the old files now
This commit was SVN r18018.
2008-03-30 11:50:19 +00:00
Lenny Verkhovsky
f734ba51a4
Added files with names according to prefix rule
This commit was SVN r18017.
2008-03-30 11:42:09 +00:00
Lenny Verkhovsky
b43f4a2dc9
Deleted and added files after prefix rule changes
This commit was SVN r18016.
2008-03-30 11:41:01 +00:00
Ralph Castain
9f1001a6f8
Ensure that the procs know how many daemons will be participating in collective operations.
This commit was SVN r17992.
2008-03-27 17:31:54 +00:00
Ralph Castain
6166278e18
Improve the scalability of the modex operation and fix a bug reported by Tim P
The bug was a race condition in the barrier operation that caused the barrier in MPI_Finalize to fail on very short programs.
Scalability was improved by using the daemons to aggregate modex and barrier messages before sending them to the rank=0 proc. Improvement is proportional to ppn, of course, but there really wasn't a scaling problem at low ppn anyway. This modification also paves the way for better allgather operations since now all the data for each node is sitting at the daemon level, and the daemons are now aware that a collective operation on the OOB is underway (so they -can- participate in a collective of their own to support it).
Also added better diagnostics to map out the timing associated with MPI_Init - turned on by -mca orte_timing 1.
This commit was SVN r17988.
2008-03-27 15:17:53 +00:00
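The "proportional to ppn" remark in r17988 can be made concrete with a small calculation. This is a rough sketch under stated assumptions: procs are packed ppn per node (the last node may be partial), each daemon forwards exactly one combined message, and rank 0's own contribution is counted like any other. The function names are hypothetical.

```c
/* Sketch only: hypothetical helpers for a back-of-envelope estimate. */

/* Messages arriving at the rank=0 collector when every proc
 * sends its own modex or barrier message directly. */
int msgs_without_aggregation(int nprocs)
{
    return nprocs;
}

/* Messages arriving at the collector when each daemon first
 * combines the messages of its local procs into one.
 * Equals the number of nodes, i.e. ceil(nprocs / ppn). */
int msgs_with_aggregation(int nprocs, int ppn)
{
    return (nprocs + ppn - 1) / ppn;
}
```

Under these assumptions, 1024 procs at 8 ppn would produce 128 inbound messages instead of 1024, while at 1 ppn the two counts are equal, which matches the note that low ppn saw no real scaling problem.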
Ralph Castain
8e6da2ee76
Maintain the mapping bookmark across multiple comm_spawns
This commit was SVN r17984.
2008-03-27 00:19:13 +00:00
Ralph Castain
abfb3577c1
Ensure that the bookmark of the parent job is applied to the child in a comm_spawn so we start mapping from the right place
This commit was SVN r17982.
2008-03-26 21:18:16 +00:00
Ralph Castain
7ad6db207c
Cover some timing-related output
This commit was SVN r17977.
2008-03-26 12:54:50 +00:00
Rainer Keller
ce8154eb3e
- Coverity issue CID 945:
Event uninit_use: Using uninitialized value "rc"
Instead of initializing rc at the beginning, use the return value of opal_hash_table_set_value_uint32.
This commit was SVN r17976.
2008-03-26 11:39:25 +00:00
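The pattern behind the CID 945 fix in r17976 can be illustrated as follows. This is a minimal sketch, not the patched code: set_value_stub is a hypothetical stand-in for opal_hash_table_set_value_uint32 (whose real signature is not reproduced here), and the SKETCH_ constants are placeholders for the real status codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: placeholder status codes, not the OPAL ones. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Hypothetical stand-in for a table "set value" call that
 * reports success or failure through its return value. */
int set_value_stub(uint32_t *slot, uint32_t value)
{
    if (slot == NULL) {
        return SKETCH_ERROR;
    }
    *slot = value;
    return SKETCH_SUCCESS;
}

/* rc is assigned directly from the call, so it can never be
 * read uninitialized and it reflects the real outcome, instead
 * of being pre-set to a default that may mask a failure. */
int store_entry(uint32_t *slot, uint32_t value)
{
    int rc = set_value_stub(slot, value);
    return rc;
}
```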
Brad Benton
0b84dfd2a6
POE is not currently working or supported, so removing from the trunk.
This commit was SVN r17970.
2008-03-26 02:06:40 +00:00
Ralph Castain
60d931217f
Modify the routed framework to allow greater control/flexibility over response to lost routes and initial wireup of jobs as required by several soon-to-come new modules.
Specifically, add two new APIs:
1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job.
Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections.
2. get_wireup_info: returns (in a provided buffer) info required to wireup connections for the specified job. Some routed modules do not need to return any info as they can wireup via alternative means, while some need to exchange data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job.
The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module.
This commit was SVN r17969.
2008-03-26 01:00:24 +00:00
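The shape of the two entry points described in r17969 can be sketched as a table of function pointers. This is an illustrative sketch only: every type, name, and signature below is hypothetical and is not taken from the ORTE headers. The example module keeps its own single "lifeline" peer, asks for a job abort when that peer is lost, ignores other lost connections, and contributes no wireup info.

```c
#include <stddef.h>

/* Sketch only: hypothetical types, not the ORTE definitions. */
typedef int peer_id_t;
typedef enum { ROUTE_IGNORE = 0, ROUTE_ABORT_JOB = 1 } route_action_t;

typedef struct {
    /* Called when a connection to `peer` has failed. */
    route_action_t (*lost_route)(peer_id_t peer);
    /* Writes wireup info for `jobid` into `buf`; returns bytes written. */
    size_t (*get_wireup_info)(int jobid, char *buf, size_t buflen);
} routed_module_sketch_t;

/* The lifeline is private to the module rather than global state. */
peer_id_t sketch_lifeline = 0;

route_action_t sketch_lost_route(peer_id_t peer)
{
    return peer == sketch_lifeline ? ROUTE_ABORT_JOB : ROUTE_IGNORE;
}

size_t sketch_get_wireup_info(int jobid, char *buf, size_t buflen)
{
    (void)jobid; (void)buf; (void)buflen;
    return 0;   /* this module wires up by other means */
}

const routed_module_sketch_t sketch_module = {
    sketch_lost_route,
    sketch_get_wireup_info
};

/* Launcher-side check: broadcast only if the module supplied info. */
int needs_xcast(const routed_module_sketch_t *mod, int jobid)
{
    char buf[256];
    return mod->get_wireup_info(jobid, buf, sizeof buf) > 0;
}
```

Keeping the lifeline inside the module is what lets a different module define it as several connections, or choose to re-route instead of aborting, without touching the caller.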
George Bosilca
2ed6ed37bd
Don't forget to cleanup once we're done.
This commit was SVN r17965.
2008-03-25 22:42:24 +00:00
George Bosilca
ac6121bd1c
Remove unused variable.
This commit was SVN r17964.
2008-03-25 22:41:50 +00:00
Jeff Squyres
183fcdf51b
Remove duplicate free(), fixing CID 973.
This commit was SVN r17959.
2008-03-25 20:30:56 +00:00
Ralph Castain
90107f3c14
Fix an issue with comm_spawn over who sent/recv first in the modex. The modex assumes that the first name on the list is the "root" that will serve as the allgather collector/distributor. The dpm was putting that entity last, which forced us to pre-inform the parent procs of the child proc's contact info since the parent was trying to send to the child.
Clarify the setting of send_first in the MPI bindings (trivial, I know, but helpful).
Remove the extra xcast of child contact info to the parent job.
This commit was SVN r17952.
2008-03-25 14:57:34 +00:00
Ralph Castain
cca449e379
Move an OMPI RML tag to the OMPI layer
This commit was SVN r17950.
2008-03-25 13:30:48 +00:00