Commit graph

1649 commits

Author SHA1 Message Date
Shiqing Fan
7ff440f628 Add quotation marks for windows path.
This commit was SVN r18420.
2008-05-09 14:12:09 +00:00
Josh Hursey
da2f1c58e2 Some checkpoint/restart cleanup.
 * Remove the opal_only option. This was suffering from bit rot, and no one uses it. It can be added back fairly easily if wanted.
 * Cleanup metadata interactions at the local level.
 * Touch up some of the INC functionality (fix typos and a minor ordering issue)

This commit was SVN r18416.
2008-05-08 18:47:47 +00:00
Ralph Castain
64ef4102c4 Add the topo mapper module - requires some work in carto for completion.
Little cleanup in round-robin mapper.

This commit was SVN r18412.
2008-05-08 05:09:13 +00:00
Ralph Castain
ac5263613c Fix stupid singletons yet again
This commit was SVN r18408.
2008-05-07 20:26:31 +00:00
George Bosilca
dbea3e070e Correct some copy/paste errors.
This commit was SVN r18396.
2008-05-07 04:04:42 +00:00
Ralph Castain
ff70636024 Allgather_list needs its own tag to avoid conflicting with the allgather modex operation.
All spawned procs must decode the port of the spawning process so they can communicate in direct routed mode.

This fixes comm_spawn for all routing modes.

This commit was SVN r18395.
2008-05-07 03:03:56 +00:00
Josh Hursey
bc67f40936 whoops typo
This commit was SVN r18390.
2008-05-06 22:00:24 +00:00
Josh Hursey
50c909a23d Fix a bit of selection logic. Filem should not fail select if the user decided not to build with any filem components. This matches the logic before the mca_base_select() change.
This commit was SVN r18389.
2008-05-06 21:57:45 +00:00
Pak Lui
108921c020 typo
This commit was SVN r18387.
2008-05-06 21:37:35 +00:00
Pak Lui
0302c098be minor typo
This commit was SVN r18386.
2008-05-06 21:26:17 +00:00
Ralph Castain
d97a4f880d Shift the daemon collective operation to the ODLS framework. Ensure we track the collectives per job to avoid race conditions. Take advantage of the new capabilities of the routed framework to define aggregating trees for the daemon collective, and to track which daemons are participating to handle the case of sparse participation.
Make it all work with comm_spawn in the case of all procs on previously occupied nodes, some new procs on new nodes, and mixtures of the two.

Note: comm_spawn now works with both binomial and linear routed modules. There remains a problem of spawned procs not properly getting updated contact info for the parent proc when run in the direct routed mode...but that's for another day.

This commit was SVN r18385.
2008-05-06 20:16:17 +00:00
Josh Hursey
c47406810e Fix AMCA orted command line.
If no AMCA parameters are passed then do not send across the path information. Only place it on the command line if the AMCA parameter is set.

This commit was SVN r18382.
2008-05-06 18:27:31 +00:00
Josh Hursey
9971bc9d95 Merge in the mca_base_select changes per RFC:
http://www.open-mpi.org/community/lists/devel/2008/04/3779.php

{{{
svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play .
}}}

Any components not in the trunk, but in one of the affected frameworks *must* be
updated. Contact the list, look at the RFC, or look at the diff for how to do this.

Sorry for the early commit of this, but I wanted to get it in today (per RFC) and
didn't know if I would have a chance later today.

This commit was SVN r18381.
2008-05-06 18:08:45 +00:00
Ralph Castain
40904dd152 Add a binomial routed module - for now, still completely wires up the daemons, but that will be changed later.
Modify grpcomm xcast so it now uses the selected routed module - eliminates cross-wiring of xcast and routing paths. Suboptimal at the moment, but better implementation is on its way.

Cleanup ignore properties on the new routed components.

This commit was SVN r18377.
2008-05-05 22:32:25 +00:00
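For readers unfamiliar with the term used in 40904dd152, the sketch below shows the textbook binomial tree over ranks `0..size-1`. It is an illustration only: the function names are hypothetical, and the actual routed module may lay out its tree differently.

```python
def binomial_parent(rank):
    """Parent is the rank with its highest set bit cleared; rank 0 is the root."""
    if rank == 0:
        return None
    return rank & ~(1 << (rank.bit_length() - 1))


def binomial_children(rank, size):
    """Children are formed by setting one bit above the rank's highest set bit."""
    children = []
    bit = 1 << rank.bit_length()  # first bit position above rank's highest set bit
    while rank | bit < size:
        children.append(rank | bit)
        bit <<= 1
    return children


if __name__ == "__main__":
    for r in range(8):
        print(r, binomial_parent(r), binomial_children(r, 8))
```

In this construction a message from the root reaches all `size` ranks in about log2(size) hops, which is the usual reason for preferring it over a linear chain.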
Aurelien Bouteiller
5ba62469a0 Add a route_is_defined implementation for the linear oob routing.
This commit was SVN r18375.
2008-05-05 19:12:41 +00:00
Aurelien Bouteiller
2ae30fe126 Implementation of the route_is_defined stub for direct oob routing.
This commit was SVN r18373.
2008-05-05 18:23:26 +00:00
Ralph Castain
b8bb990acf Rename the routed modules to more accurately reflect what they do and the role they will play in soon-to-come updates.
Add two new API's to the routed framework - stub them out so that collaborators can work on them in various components without conflicts.

Remove a "finalize" from the select function that could cause problems as the component had not had its initialize called yet.

This commit was SVN r18369.
2008-05-05 02:59:09 +00:00
Ralph Castain
519c15f8af Fix direct and linear xcast modes
This commit was SVN r18359.
2008-05-02 14:30:07 +00:00
Ralph Castain
8e846bf7f2 Separate the gathering of collective data by jobid
This commit was SVN r18357.
2008-05-02 12:00:08 +00:00
Ralph Castain
432d441b3e Cleanup a bug found by Josh that caused multiple app_contexts to keep mapping onto the first node in an allocation
Continue work on loadbalancing

Cleanup code organization in rmaps_base

This commit was SVN r18353.
2008-05-01 21:07:49 +00:00
Ralph Castain
b2c73f6e11 Fix tree-spawn to work within the new modex system
This commit was SVN r18349.
2008-05-01 19:19:34 +00:00
Josh Hursey
dcd21d7d07 Some checkpoint/restart fixes in response to r18338 (changes in modex).
Things should be working now.

This commit was SVN r18348.

The following SVN revision numbers were found above:
  r18338 --> open-mpi/ompi@3e55fe6f6d
2008-05-01 17:48:13 +00:00
Ralph Castain
ad894b050b Set the bookmark so the first process of a comm_spawn'd job will be mapped to the same node as the spawning proc, assuming it has space. If not, then the mapper will automatically move to the next node.
This commit was SVN r18346.
2008-05-01 15:24:03 +00:00
Ralph Castain
1766442591 Fix a double-free when tree-spawning
Fix the round-robin mapper so it doesn't move to the next node just because it completed mapping an app_context

This commit was SVN r18344.
2008-05-01 14:49:56 +00:00
Ralph Castain
3e55fe6f6d Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit.
Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs.

Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node.

This commit was SVN r18338.
2008-04-30 19:49:53 +00:00
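The nodename encoding mentioned in 3e55fe6f6d can be pictured with a minimal sketch. It assumes node names share one prefix and a fixed-width, contiguous numeric suffix; the tuple layout and function names are hypothetical and do not reflect the wire format Open MPI uses.

```python
import re


def encode_nodes(names):
    """Compress names such as node001..node128 into (prefix, width, first, count).

    Returns None when the names do not form a single contiguous run.
    """
    parsed = []
    for name in names:
        m = re.fullmatch(r"(.*?)(\d+)", name)
        if m is None:
            return None
        parsed.append((m.group(1), len(m.group(2)), int(m.group(2))))
    if not parsed:
        return None
    prefix, width, first = parsed[0]
    for offset, entry in enumerate(parsed):
        if entry != (prefix, width, first + offset):
            return None
    return prefix, width, first, len(parsed)


def decode_nodes(encoded):
    """Expand the compact form back into the full list of names."""
    prefix, width, first, count = encoded
    return [f"{prefix}{n:0{width}d}" for n in range(first, first + count)]
```

The compact form has a constant size regardless of how many nodes it covers, which is the saving the commit message describes.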
Ralph Castain
4c2c6c9bd8 Ensure the pack/unpacks match for tree-spawn
This commit was SVN r18282.
2008-04-24 18:53:08 +00:00
Ralph Castain
09b6758f8c Pass the prefix dir to the remote orted when doing tree-based spawns
This commit was SVN r18280.
2008-04-24 18:38:24 +00:00
Josh Hursey
2c736873bb Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors.
The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge.

The fix is to have the OMPI level shut down its tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.

Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.

 * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level.
 * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
 * Update ft_event functions in PML and BML to handle the new restart state.
 * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.

This commit was SVN r18276.
2008-04-24 17:54:22 +00:00
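The descriptor-recycling problem described in 2c736873bb can be reproduced in a toy model. The sketch assumes only that a new descriptor receives the lowest free number, as POSIX specifies; the `FdTable` class and the `restart` function are hypothetical and stand in for the real OOB and BTL code.

```python
class FdTable:
    """Toy descriptor table: open() hands out the lowest free number."""

    def __init__(self):
        self.owner = {}

    def open(self, owner):
        fd = 0
        while fd in self.owner:
            fd += 1
        self.owner[fd] = owner
        return fd

    def close(self, fd):
        # Closing an unknown descriptor is a harmless no-op in this model.
        return self.owner.pop(fd, None)


def restart(fixed_ordering):
    """Return who owns ORTE's descriptor once both layers have restarted."""
    table = FdTable()
    stale_ompi_fd = 0  # number the OMPI layer still remembers from before the checkpoint
    if fixed_ordering:
        table.close(stale_ompi_fd)      # OMPI drops its stale descriptor first
        orte_fd = table.open("orte")
        table.open("ompi")
    else:
        orte_fd = table.open("orte")    # recycles the number OMPI still remembers
        table.close(stale_ompi_fd)      # OMPI closes "its" descriptor, which is now ORTE's
        table.open("ompi")
    return table.owner.get(orte_fd)
```

With the old ordering, `restart(False)` reports that the number ORTE holds no longer belongs to ORTE; with the ordering introduced by the fix, `restart(True)` leaves ORTE's descriptor intact.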
Ralph Castain
eece9f88f0 Fix a bug in the way we computed local_rank. This needs to be the local_rank -among my job peers- on a node.
We were mistakenly computing the local_rank across -all- jobs with procs on that node. While the two definitions are equivalent for an initial launch, comm_spawn'd procs would get the wrong local_rank. In particular, there would not be a local_rank=0 proc in the comm_spawn'd job on any node that was shared with the initial job.

This commit was SVN r18263.
2008-04-23 17:42:59 +00:00
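The distinction drawn in eece9f88f0 can be shown with a short sketch that counts processes per (job, node) pair and not per node alone. The tuple layout and the function name are assumptions made for illustration, not the ORTE data structures.

```python
from collections import defaultdict


def local_ranks(procs):
    """procs: iterable of (jobid, vpid, node). Returns {(jobid, vpid): local_rank}."""
    counters = defaultdict(int)  # keyed by (jobid, node), so each job counts separately
    ranks = {}
    for jobid, vpid, node in sorted(procs):
        ranks[(jobid, vpid)] = counters[(jobid, node)]
        counters[(jobid, node)] += 1
    return ranks


# Job 1 is the initial launch; job 2 is comm_spawn'd onto the same nodes.
procs = [(1, 0, "n0"), (1, 1, "n0"), (1, 2, "n1"), (2, 0, "n0"), (2, 1, "n1")]
ranks = local_ranks(procs)
```

Keying the counter on the node alone would give the first spawned process on `n0` a nonzero local rank, which is the failure the commit message reports.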
Ralph Castain
f56f06a7ff Do not trust the RM's names - apparently, RR has trained it to lie! Default to using the name we got from gethostname as it is the only one we can trust.
This commit was SVN r18259.
2008-04-23 17:00:35 +00:00
Ralph Castain
8001e4e99c See if this will fix a race condition showing up in comm_spawn MTT testing
This commit was SVN r18257.
2008-04-23 15:43:44 +00:00
Ralph Castain
5311b13b60 Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list
Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system.

This commit was SVN r18252.
2008-04-23 14:52:09 +00:00
Lenny Verkhovsky
456ce6c4da Few cleanups in Rank_File component + fixed opal_paffinity_slot_list without rankfile
This commit was SVN r18249.
2008-04-23 13:34:05 +00:00
Shiqing Fan
eb5f5d77cc If it's not the HNP, release the cluster object first and return.
This commit was SVN r18247.
2008-04-23 13:21:32 +00:00
Josh Hursey
750ce0152c After a bit of testing this morning it seems that the tree component is able to work correctly with the checkpoint/restart functionality. So enable this component when C/R is enabled.
This commit was SVN r18246.
2008-04-23 13:01:23 +00:00
Josh Hursey
cc83d41ad9 Merge in tmp/jjh-scratch
{{{
 svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch .
}}}

Contains:
 * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart.
 * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff
 * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P.
 * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry
 * Some other sundry cleanup items all dealing with C/R functionality in the trunk.

This commit was SVN r18241.
2008-04-23 00:17:12 +00:00
Ralph Castain
c3ddf66445 Move the display-allocation code to where it is always seen
This commit was SVN r18227.
2008-04-21 20:28:59 +00:00
Ralph Castain
16c9100633 Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation.
This commit was SVN r18216.
2008-04-20 02:25:45 +00:00
Josh Hursey
56a61bfacf switch the name of orterun to mpirun to make things more clear.
This commit was SVN r18208.
2008-04-18 12:59:23 +00:00
Ralph Castain
07f0a71faa Cleanup the show_help entries on the seq mapper
This commit was SVN r18191.
2008-04-17 14:43:15 +00:00
Ralph Castain
e7487ad533 Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile.
Restore the "do-not-launch" functionality so users can test a mapping without launching it.

Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests.

Add a function to hostfile to generate an ordered list of host names from a hostfile

This commit was SVN r18190.
2008-04-17 13:50:59 +00:00
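As a rough picture of the sequential mapping described in e7487ad533, the sketch below assigns rank i to the i-th host line of a hostfile. The simplified hostfile syntax (one host per line, `#` comments, trailing fields ignored) and both function names are assumptions for illustration; the real hostfile parser accepts more than this.

```python
def parse_hostfile(text):
    """Return host names in file order, keeping duplicates."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if line:
            hosts.append(line.split()[0])     # first field is taken as the host name
    return hosts


def seq_map(hosts):
    """Map rank i to the i-th host in the ordered list."""
    return {rank: host for rank, host in enumerate(hosts)}


mapping = seq_map(parse_hostfile("n0\nn1  # second node\n\nn0\n"))
```

Because duplicates are preserved, listing a host twice places two ranks on it, in the order the lines appear.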
Ralph Castain
eb27e4f23d Move the reissuing of the daemon recv to occur after the message actually gets processed. This ensures that we don't get multiple messages trying to be processed at the same time.
Add one more debug output to see where messages are heading

This commit was SVN r18183.
2008-04-16 20:41:00 +00:00
Ralph Castain
66e532669a Remove some dead code
This commit was SVN r18182.
2008-04-16 20:33:53 +00:00
Ralph Castain
3413191e52 Fix singleton and singleton comm_spawn
This commit was SVN r18177.
2008-04-16 14:38:10 +00:00
Ralph Castain
7b91f8baff Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory.
Fix the ompi-server -h cmd line option so it actually tells you something!

Add two new testing codes to the orte/test/mpi area: accept and connect.

This commit was SVN r18176.
2008-04-16 14:27:42 +00:00
Adrian Knoth
84e4013530 Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set.
This commit was SVN r18164.
2008-04-16 09:31:15 +00:00
Adrian Knoth
0ddfff4ffe Added new oob-tcp parameter oob_tcp_disable_family.
Like btl_tcp_disable_family, this parameter more or less disables
a whole address family. Though the sockets are still created, the
corresponding information isn't added to the connection strings.

Likewise, we don't try to connect to addresses matching the disabled
address family.

This is particularly important for multidomain clusters, where IPv4 is
often filtered (firewalled), sometimes by simply dropping the packets
instead of rejecting them (thus causing a connection timeout instead of
a quick "no route to host").

This commit was SVN r18163.
2008-04-16 09:22:00 +00:00
Ralph Castain
a4ea756a76 Ensure the node loop cntr gets incremented if the daemon already exists
This commit was SVN r18150.
2008-04-15 14:20:03 +00:00
Ralph Castain
35c260a14f Fix the plm modules to accommodate the new remote_spawn entry - set that entry to NULL for all but rsh as only that module supports it at this time
This commit was SVN r18145.
2008-04-14 19:36:13 +00:00
Ralph Castain
84156c422f Egad! Typo snuck in there...nasty vi!
This commit was SVN r18144.
2008-04-14 18:29:11 +00:00