Commit graph

3079 commits

Author SHA1 Message Date
Ralph Castain
30fb002524 Take the first small step towards rationalizing rsh support. Create a new "rshbase" component that contains a simple rsh module - no tree spawn, uses all the base functions for launch support. Extend the base rsh support functions to include those functions in common across all rsh modules.
Only a minor change was made to the current rsh module to avoid a naming conflict. Otherwise, left it alone to avoid creating conflicts with other external work. The current rsh module remains the default for rsh/ssh support, and continues to contain the support for SGE and LoadLeveler.

This commit was SVN r24593.
2011-03-30 01:15:07 +00:00
Nysal Jan
866ae8b43a Close the file descriptor
This commit was SVN r24580.
2011-03-29 08:42:49 +00:00
Nysal Jan
c8c6b0edab Improve LoadLeveler integration with Open MPI. Add support for LL native rsh agent - llspawn
This commit was SVN r24579.
2011-03-29 07:46:59 +00:00
Nathan Hjelm
8634b6394f fixed plm/tm component
This commit was SVN r24577.
2011-03-25 22:20:15 +00:00
Ralph Castain
d7e029cb40 Convert heartbeat to multicast basis
This commit was SVN r24570.
2011-03-24 19:05:39 +00:00
Ralph Castain
90698a2c02 Ensure that blocking recvs wait until the data is actually recvd
This commit was SVN r24558.
2011-03-22 18:45:54 +00:00
Ralph Castain
888472f671 Do not release recv as the calling function needs that data and will release it later
This commit was SVN r24557.
2011-03-22 18:44:56 +00:00
Ralph Castain
30981de200 Minor cleanups courtesy of Nysal - thanks!
This commit was SVN r24552.
2011-03-22 13:48:58 +00:00
Ralph Castain
c1396b278c Resolve the rsh confusion by splitting the initial search for a launch agent from the actual setup of the launch agent values in the plm base globals. Have each aspiring rsh-clone call lookup to see if their desired launch agent is available - if not, then reject that plm component.
If so, then setup the actual launch agent values only when the module init function is called.

This resolves the current conflict between the rsh and rshd components. Hopefully, it may avoid future problems in this area -provided- any new uses of rsh-like launchers abide by the lookup-and-then-setup rule.

This commit was SVN r24550.
2011-03-22 02:23:09 +00:00
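The lookup-and-then-setup rule described in c1396b278c can be sketched roughly as follows. This is only an illustration, not the actual ORTE code: `launch_globals`, `rsh_like_component`, `agent_lookup`, `component_query`, and `module_init` are hypothetical names, and the list of available agents stands in for whatever search the real lookup performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the launch agent values kept in the plm base globals. */
struct launch_globals {
    const char *agent;          /* stays NULL until a module's init runs */
};

/* Hypothetical descriptor for an rsh-like component. */
struct rsh_like_component {
    const char *name;
    const char *wanted_agent;   /* the launch agent this component needs */
};

/* Step 1 (lookup): report whether the agent is available. No side effects. */
bool agent_lookup(const char *const *available, size_t n, const char *agent)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(available[i], agent) == 0) {
            return true;
        }
    }
    return false;
}

/* Component query: accept or reject the component without touching the globals. */
bool component_query(const struct rsh_like_component *c,
                     const char *const *available, size_t n)
{
    return agent_lookup(available, n, c->wanted_agent);
}

/* Step 2 (setup): only the selected module writes the launch agent values. */
void module_init(const struct rsh_like_component *c, struct launch_globals *g)
{
    assert(g->agent == NULL);   /* nobody set the agent during the query phase */
    g->agent = c->wanted_agent;
}
```

Because the query phase never writes to the shared state, two rsh-like components can both be queried without one overwriting the other's agent settings.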
Ralph Castain
d17b50e1ff Add the appropriate hooks to tell Totalview to display the user's main program upon startup. Apparently, this hook got lost somewhere after the 1.2 series :-(
Thanks to David Turner and the TV folks for passing this along.

This commit was SVN r24549.
2011-03-21 17:40:58 +00:00
Ralph Castain
795ca2cff2 Complete implementation of the multicast-based grpcomm module
This commit was SVN r24548.
2011-03-20 01:18:06 +00:00
Ralph Castain
fa40f5d7c3 Fix bad formatting
This commit was SVN r24547.
2011-03-20 01:17:29 +00:00
Ralph Castain
281116ddc5 A max_restarts value of -1 is now valid and indicates infinite restarts, so correct the validity check
This commit was SVN r24546.
2011-03-20 01:17:00 +00:00
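A minimal sketch of the corrected validity check from 281116ddc5, assuming the restart limit is held in a plain `int`. The names `RESTARTS_UNLIMITED`, `max_restarts_is_valid`, and `may_restart` are hypothetical and do not come from the ORTE source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical name for the sentinel: -1 means "restart without limit". */
#define RESTARTS_UNLIMITED (-1)

/* A max_restarts value is valid when it is -1 (unlimited) or non-negative. */
bool max_restarts_is_valid(int max_restarts)
{
    return max_restarts >= RESTARTS_UNLIMITED;
}

/* Decide whether a process that already restarted num_restarts times may restart again. */
bool may_restart(int max_restarts, int num_restarts)
{
    assert(num_restarts >= 0);
    if (!max_restarts_is_valid(max_restarts)) {
        return false;
    }
    if (max_restarts == RESTARTS_UNLIMITED) {
        return true;
    }
    return num_restarts < max_restarts;
}
```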
Eugene Loh
2770a12beb Continue clean up of thread options started in r22841, 22842, and 22849.
No need for any CMRs to 1.5... that was already done in CMR 2728.

This commit was SVN r24545.

The following SVN revision numbers were found above:
  r22841 --> open-mpi/ompi@b400b84162
2011-03-18 21:36:35 +00:00
Ralph Castain
ee68cd102c Fix the hier grpcomm module so modex results in correct data. The prior implementation stored the modex data as node-based attributes. This worked fine for BTL's such as openib where the interfaces were associated with the node. However, BTL's such as TCP have interfaces associated with a specific process, not a node. Thus, store the data in the modex database so it is correctly indexed.
This commit was SVN r24536.
2011-03-17 02:22:23 +00:00
Ralph Castain
d5dfe05521 Remove stale code associated with OPAL_THREADS_HAVE_DIFFERENT_PIDS. In the past, we have supported the case of really, really old Linux kernels where threads have different pids. However, when we updated the event library, we didn't also update that support code. In addition, when we dropped progress thread support, we didn't remove areas of the code that could no longer be compiled (i.e., were protected by "if progress thread && if have different pids").
There was no compelling reason to support such old kernels. Accordingly, convert the test to print a nice error message indicating we no longer support old kernels (but indicate that earlier OMPI versions do) and error out. Remove all code that was protected by "if have different pids" since it can no longer be compiled.

This commit was SVN r24531.
2011-03-15 21:05:03 +00:00
Ralph Castain
de092af8ef Add a little more debug
This commit was SVN r24526.
2011-03-14 18:43:49 +00:00
Ralph Castain
ebabe9c83a Forgot that Terry wanted to control the vm launch with an mca param - set one up for that purpose
This commit was SVN r24525.
2011-03-13 00:46:42 +00:00
Ralph Castain
dc6f616599 Enable VM launch.
For some time, ORTE has had the ability to launch daemons on all nodes prior to launching an application. It has largely been used outside of the OMPI community, and so was never explicitly turned "on" inside OMPI releases. Nevertheless, the code has been there.

Allowing VM launches does not require ANY changes to existing PLM components. All that was required was to have orterun launch the daemons as a separate call to orte_plm.spawn -prior- to launching the applications. The rest of the VM support code resides in the rmaps framework:

(a) a check when asked to map a job to see if it is the daemon job, and

(b) a separate "setup_virtual_machine" mapper in the rmaps base that creates the required map so the PLM's will do the right thing.

In order to support those users who have no RM allocation but like to give the allocation in the form of a -host or -hostfile argument to their application, there is a little more code in orterun and the setup_virtual_machine mapper to capture information passed in that manner.

This has been tested with rsh and slurm environments, and, since there is nothing environment-specific in the implementation, should work in others as well - but needs to be proven.

This commit was SVN r24524.
2011-03-12 22:50:53 +00:00
Ralph Castain
80265b472e Avoid direct reference of pointer_array elements
This commit was SVN r24523.
2011-03-12 20:18:51 +00:00
Ralph Castain
df82e4cd36 Plug a memory leak
This commit was SVN r24521.
2011-03-12 15:37:33 +00:00
Ralph Castain
1297acde13 George raised some valid concerns about the extensibility of the revised rmaps framework. Address those by:
1. removing the enum of mapper values

2. changing the req_mapper and last_mapper fields to char* so they can hold the component name instead of a mapper flag

3. revising the selection logic in the mapper components to reflect the change. Components now look for their name in the req_mapper field, or check whether other criteria (e.g., npernode) are set that mandate their doing the mapping

Several MCA params resided in the rmaps base for historical reasons - they have been in the base since at least the original 1.2 release (and perhaps earlier). However, George correctly pointed out that they really should reside in their respective components. Accordingly, move them to the components, but register synonyms to the old names to avoid breaking backward compatibility.

These revisions retain the current functionality of allowing comm_spawn'd jobs to use different mappers than the original job, and for the errmgr to utilize the resilient mapper to recover processes regardless of how they were originally mapped.

Given the large number of possible combinations, I am sure that someone will find a corner-case combination of values and selection criteria that causes either no mapper to be selected, or one other than the intended to be used. No one can test all the ways people will use this system, so I expect debugging to continue for a while.

The ability of comm_spawn'd jobs to exploit this functionality relies on changes to the orte_dpm component - this will be committed separately.

This commit was SVN r24520.
2011-03-12 05:30:09 +00:00
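The name-based selection described in 1297acde13 might look roughly like the sketch below. It is an illustration under assumptions, not the rmaps code: the `job_map` and `mapper` structs, the component names `"npernode"` and `"round_robin"`, and all function names are hypothetical. The commit message does not say how a conflict between a named request and other criteria is resolved, so this sketch simply accepts on either condition and takes the first component that accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of a job map: mapper names are strings, not enum flags. */
struct job_map {
    const char *req_mapper;    /* name of the requested component, or NULL */
    const char *last_mapper;   /* name of the component that last mapped the job */
    int npernode;              /* 0 when not set */
};

/* Hypothetical mapper component: a name plus an optional "other criteria" test. */
struct mapper {
    const char *name;
    bool (*criteria)(const struct job_map *map);   /* may be NULL */
};

/* Example criterion: npernode was set, so the matching component must do the mapping. */
bool npernode_set(const struct job_map *map)
{
    return map->npernode > 0;
}

/* A component maps the job if it was requested by name or its criteria apply. */
bool mapper_wants_job(const struct mapper *m, const struct job_map *map)
{
    if (map->req_mapper != NULL && strcmp(map->req_mapper, m->name) == 0) {
        return true;
    }
    return m->criteria != NULL && m->criteria(map);
}

/* Try components in order; record the one that did the map without losing the request. */
const char *select_mapper(struct job_map *map, const struct mapper *mappers, size_t n)
{
    assert(map != NULL);
    for (size_t i = 0; i < n; i++) {
        if (mapper_wants_job(&mappers[i], map)) {
            map->last_mapper = mappers[i].name;
            return mappers[i].name;
        }
    }
    return NULL;   /* the corner case the commit warns about: no mapper selected */
}
```

Storing names as strings means a new component can take part in selection without any change to a shared enum, which is the extensibility concern the commit addresses.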
Samuel Gutierrez
830c7c66dc fixes CID #1667
This commit was SVN r24518.
2011-03-12 03:09:01 +00:00
Ralph Castain
e6a76cc923 Fixes CID #1954
This commit was SVN r24516.
2011-03-11 23:00:27 +00:00
Ralph Castain
2ccd514b9a Add version string to app
This commit was SVN r24514.
2011-03-11 20:38:37 +00:00
Samuel Gutierrez
2a2319d23a when orte_timing is enabled, always record daemon launch start time before starting the real work.
This commit was SVN r24513.
2011-03-11 00:09:23 +00:00
Ralph Castain
f9a9fac76b Minor typo
This commit was SVN r24506.
2011-03-10 16:09:31 +00:00
George Bosilca
80fe617cd2 If we don't release the OPAL utils explicitly there will be a memory leak.
This commit was SVN r24505.
2011-03-10 00:42:28 +00:00
George Bosilca
7f34a28c8f Correct a comment.
This commit was SVN r24504.
2011-03-10 00:41:41 +00:00
George Bosilca
d2502b14f9 Destruct the OOB TCP internal objects.
This commit was SVN r24503.
2011-03-10 00:40:54 +00:00
Ralph Castain
3b4421d8e3 Separately track requested and last-used mapper so we don't lose that info
This commit was SVN r24502.
2011-03-09 18:51:36 +00:00
Jeff Squyres
06d5c59115 Fix a few valgrind-reported memory leaks
This commit was SVN r24498.
2011-03-08 17:37:28 +00:00
Jeff Squyres
0586612bd5 Fix another minor memory leak
This commit was SVN r24495.
2011-03-08 15:46:13 +00:00
Jeff Squyres
79cf382ff3 Fix a few issues with error messages:
* If something goes wrong during ompi_mpi_init, don't erroneously
  report that it is illegal to invoke MPI_INIT* before MPI_INIT
* Aggregate help messages when possible when something goes wrong
  during ompi_mpi_init

This commit was SVN r24492.
2011-03-07 16:45:45 +00:00
Ralph Castain
63f38e38bb Fix ompi-server: remove extra command flag in buffer being sent to mpirun, ensure that tools route messages thru a remote HNP
This commit was SVN r24491.
2011-03-05 17:12:46 +00:00
Ralph Castain
d764e7a398 We want uid/gid support at the individual application level. Ensure the values get initialized and packed/unpacked for transfer.
This commit was SVN r24489.
2011-03-04 18:46:43 +00:00
George Bosilca
9bbe00bdc3 Set the return code from the processes upstream.
This commit was SVN r24483.
2011-03-03 00:02:21 +00:00
George Bosilca
c6a5f9706a Thomas's patch: Assume we won't fail unless notified by a child.
This commit was SVN r24482.
2011-03-02 23:50:01 +00:00
Josh Hursey
62bba1bf12 Name the enum so that it shows up as an actual symbol in gdb, instead of just a number.
This commit was SVN r24472.
2011-03-01 21:00:03 +00:00
Nysal Jan
4030111478 Add missing copyright and fix the year
This commit was SVN r24446.
2011-02-23 15:52:06 +00:00
Nysal Jan
42a73bb887 POE is supported on both AIX and Linux. Build POE PLM only if we find the poe binary. Fix hostfile creation and POE command line arguments.
This commit was SVN r24444.
2011-02-23 15:38:41 +00:00
Ralph Castain
f014284f91 Update resilient recovery mapping algorithm to be a bit more sophisticated. Track the prior node a proc was on so we avoid ricochet effect. Also avoid putting recovering proc onto node that is already occupied by a peer as this degrades fault tolerance.
This commit was SVN r24417.
2011-02-20 18:46:21 +00:00
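The node choice described in f014284f91 can be sketched as a simple filter. This is a hypothetical illustration, not the resilient mapper itself: the `node` struct, `pick_recovery_node`, and the first-match policy are assumptions, and the commit message does not say what happens when no node qualifies (the sketch returns -1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a node as seen by the recovery mapper. */
struct node {
    int id;
    bool up;           /* node is alive and usable */
    bool hosts_peer;   /* a peer of the recovering proc already runs here */
};

/*
 * Pick a node for a recovering proc:
 *  - skip nodes that are down,
 *  - skip the node the proc just failed on,
 *  - skip the node it was on before that (avoids the ricochet effect),
 *  - skip nodes that already host a peer (keeps fault tolerance from degrading).
 * Returns the chosen node id, or -1 if no node qualifies.
 */
int pick_recovery_node(const struct node *nodes, size_t n,
                       int failed_node, int prior_node)
{
    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!nodes[i].up) {
            continue;
        }
        if (nodes[i].id == failed_node || nodes[i].id == prior_node) {
            continue;
        }
        if (nodes[i].hosts_peer) {
            continue;
        }
        return nodes[i].id;
    }
    return -1;
}
```

Pass -1 as `prior_node` for a proc that has not been moved before.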
Ralph Castain
a8cf19a7bc Ensure heartbeat only started once and only for daemon job
This commit was SVN r24416.
2011-02-18 20:33:54 +00:00
Ralph Castain
ef56e6d78b Helps to move the pointer
This commit was SVN r24414.
2011-02-18 14:01:25 +00:00
Ralph Castain
7b35ada7fc Fix ricochet effect - move failed procs to next on list instead of loadbalancing
This commit was SVN r24413.
2011-02-18 13:11:55 +00:00
Ralph Castain
b98a2917ff Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it.
This commit was SVN r24412.
2011-02-18 02:48:12 +00:00
Ralph Castain
a0f6e153c7 Add missing fields to copy of app_context object
This commit was SVN r24411.
2011-02-17 23:55:05 +00:00
Ralph Castain
51cf0a16c3 Some minor cleanups to support VM and CM operations
This commit was SVN r24408.
2011-02-16 23:03:08 +00:00
Ralph Castain
9b48c07599 CM daemons handle their own output
This commit was SVN r24407.
2011-02-16 23:02:23 +00:00
Ralph Castain
65ba6af44d Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM.
Have each mapper flag it did the map so we can see who did it later.

Ensure procs are flagged as "ready to launch".

This commit was SVN r24406.
2011-02-16 23:01:57 +00:00