Iain Bason
28f03a2d86
Suspend/resume enhancements:
...
Have orte call setpgrp after forking (but before exec) when
orte_forward_job_control is set. Then have it send signals to the
child's process group. This allows suspending jobs that fork.
If a SIGTSTP arrives before the processes have been launched, then
record it and suspend them right after launching.
This commit was SVN r22557.
2010-02-04 15:47:20 +00:00
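The process-group approach described in r22557 above can be sketched with plain POSIX calls. This is a minimal illustration, not ORTE's code: `launcher_t`, `launcher_spawn`, `launcher_suspend`, and `launcher_resume` are hypothetical names, `setpgid(0, 0)` stands in for `setpgrp`, and delivering `SIGSTOP` to the group is an assumption about how a recorded stop request might be applied.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical launcher state; not ORTE's actual data structure. */
typedef struct {
    pid_t pgid;         /* process group of the launched child, 0 if not launched */
    int   pending_stop; /* nonzero if a stop request arrived before launch */
} launcher_t;

/* Suspend the job, or record the request if nothing has been launched yet. */
static int launcher_suspend(launcher_t *l) {
    if (l->pgid == 0) {
        l->pending_stop = 1;
        return 0;
    }
    /* A negative pid addresses every process in the group, including
     * any grandchildren the job has forked. */
    return kill(-l->pgid, SIGSTOP);
}

/* Resume the job and drop any recorded stop request. */
static int launcher_resume(launcher_t *l) {
    l->pending_stop = 0;
    if (l->pgid == 0) {
        return 0;
    }
    return kill(-l->pgid, SIGCONT);
}

/* Fork, move the child into its own process group, then exec. */
static pid_t launcher_spawn(launcher_t *l, char *const argv[]) {
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        setpgid(0, 0);          /* after fork, before exec */
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }
    /* Set the group from the parent too, so it exists before we signal it.
     * This may fail harmlessly if the child has already exec'd. */
    setpgid(pid, pid);
    l->pgid = pid;
    if (l->pending_stop) {
        l->pending_stop = 0;
        kill(-l->pgid, SIGSTOP); /* apply the stop recorded before launch */
    }
    return pid;
}
```

Signalling the group rather than the single child pid is what lets the stop reach processes the job forks later, since they inherit the child's process group.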
Shiqing Fan
bbcf1f71c4
Remove an incorrect callback that was based on the old source base without WMI. It does no harm on 32-bit Windows, but it sometimes seems to cause exceptions on 64-bit Windows. All this callback does is wait on the given pid, which is actually a remote pid, so it won't work as expected.
...
cmr:v1.4.2
cmr:v1.5
This commit was SVN r22549.
2010-02-04 10:47:28 +00:00
Ralph Castain
16b7bc7a82
Sigh...get the order right to match unpack
...
This commit was SVN r22539.
2010-02-03 15:50:43 +00:00
Ralph Castain
e88627a7ca
Ensure we don't go through rml open/select more than once.
...
Open the rml to get the uri when bootstrapping daemons
This commit was SVN r22538.
2010-02-03 15:38:32 +00:00
Ralph Castain
cb1007b5a9
Pass back the number of daemons in the system
...
This commit was SVN r22537.
2010-02-03 14:31:16 +00:00
Shiqing Fan
bdc13dacb1
A type cast.
...
This commit was SVN r22520.
2010-01-31 20:22:22 +00:00
Ralph Castain
7badff9d2d
Okay to return no available nodes for mapping when launching daemons - just means there is nothing to do
...
This commit was SVN r22509.
2010-01-28 22:58:28 +00:00
Ralph Castain
86dd1d41af
Handle zero-length iovecs in multicast messages
...
This commit was SVN r22507.
2010-01-28 15:29:43 +00:00
Ralph Castain
f66b6cae23
Enable the boot of an orted "virtual machine". Modify the mapper framework to allow mapping of only daemons. Remove the cm ras module as no longer required. Modify the orted code to always send back node arch info. Remove the "--enable-bootstrap" configure option as this feature will now always be available.
...
This commit was SVN r22480.
2010-01-25 22:25:13 +00:00
Josh Hursey
b749ecbab8
This commit fixes trac:2190.
...
Originally the patch was to improve the error message, but when digging into the code I found a subtle bug. If the daemon does not tell the HNP what CRS component it used, then the HNP tries to figure it out from the metadata (this is an uncommon case). The path the HNP used was not complete, so it was unable to find the metadata information. This patch fixes this by adding the 'snapshot_reference' to the 'snapshot_location' which completes the path for this search.
cmr:v1.4 (needs a custom patch)
cmr:v1.5
This commit was SVN r22479.
The following Trac tickets were found above:
Ticket 2190 --> https://svn.open-mpi.org/trac/ompi/ticket/2190
2010-01-25 20:28:38 +00:00
Ralph Castain
e4bf33dcab
Just a slight efficiency improvement - why check a flag twice?
...
This commit was SVN r22472.
2010-01-23 03:57:56 +00:00
Ralph Castain
3fe5e3e142
Propagate the user's callback data during non-blocking sends
...
This commit was SVN r22432.
2010-01-15 20:02:47 +00:00
Shiqing Fan
ad763c327d
Restore several linked libraries that were deleted by mistake in r22405.
...
This commit was SVN r22415.
The following SVN revision numbers were found above:
r22405 --> open-mpi/ompi@872a4047ba
2010-01-14 21:50:42 +00:00
Shiqing Fan
872a4047ba
Fix a bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions.
...
In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically. In CMake 2.8 this behavior changed: it only adds the dependencies and does not link, which causes linking errors during the build.
This commit was SVN r22405.
2010-01-14 18:10:20 +00:00
Ralph Castain
cec840f6b9
The ability to add procs to a running job was unfortunately borked when we added the detection of a proc exiting before calling init. Re-enable it here, ensuring that procs that are being restarted and/or added to a job do -not- call barrier during orte_init.
...
This commit was SVN r22404.
2010-01-14 17:59:42 +00:00
Shiqing Fan
0259fa0b9c
Correct a few variable names.
...
This commit was SVN r22401.
2010-01-14 10:55:15 +00:00
Ralph Castain
adb2430e24
Missed one place, of course
...
This commit was SVN r22400.
2010-01-13 23:11:44 +00:00
Ralph Castain
c782c98433
Rename the "basic" rmcast component "udp" to more accurately reflect its operation
...
This commit was SVN r22399.
2010-01-13 23:01:25 +00:00
Ralph Castain
237eb4e8df
For some strange reason, every so often it appears possible for the event library to trip the read event on a socket, yet have the read itself yield an error. If/when that happens, report the error and continue on.
...
This happens rarely, but it does seem to happen.
This commit was SVN r22398.
2010-01-13 19:23:28 +00:00
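The "report the error and continue" handling described in r22398 above might look roughly like the following. This is a sketch under stated assumptions, not the actual ORTE callback: `on_readable` and `read_result_t` are hypothetical names, and the event loop that would invoke the handler is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Outcome of one read attempt; the names are illustrative only. */
typedef enum { READ_OK, READ_EOF, READ_SPURIOUS } read_result_t;

/* Called when the event loop reports that fd is readable. */
static read_result_t on_readable(int fd, char *buf, size_t cap, ssize_t *nread) {
    assert(buf != NULL && nread != NULL);
    ssize_t n = read(fd, buf, cap);
    if (n > 0) {
        *nread = n;
        return READ_OK;
    }
    *nread = 0;
    if (n == 0) {
        return READ_EOF;
    }
    /* The read event fired, yet the read itself failed.
     * Report the error and let the caller keep running. */
    fprintf(stderr, "read on fd %d failed: %s (continuing)\n",
            fd, strerror(errno));
    return READ_SPURIOUS;
}
```

Returning a distinct result for the failed read lets the caller stay in its event loop instead of treating the condition as fatal.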
Ralph Castain
ae1719306b
Fix a bug in non-blocking sends
...
This commit was SVN r22395.
2010-01-13 05:37:36 +00:00
Ralph Castain
b35486d945
The CM ess module needs to open the sysinfo framework and select modules before others need it. Thus, set up a flag to avoid multiple open/select within that framework.
...
This commit was SVN r22393.
2010-01-12 22:03:49 +00:00
Ralph Castain
48486df4fe
Cleanup some diagnostics
...
This commit was SVN r22389.
2010-01-12 01:25:19 +00:00
Ralph Castain
9f3ccebeaa
We need to barrier for orte apps when the job is initially started, but we must not do the barrier when a proc is restarted as the other procs in the job won't know to participate.
...
This commit was SVN r22388.
2010-01-10 02:21:30 +00:00
Ralph Castain
16b16c5cb8
Fix a silly typo
...
This commit was SVN r22387.
2010-01-09 15:34:49 +00:00
Ralph Castain
add84178ef
Fix a silly typo that prevented tcp multicast messages from being delivered
...
This commit was SVN r22384.
2010-01-08 20:30:27 +00:00
Brian Barrett
86d8356b13
Updates to allow OMPI to build on Cray XT platforms running Catamount
...
This commit was SVN r22381.
2010-01-07 18:14:03 +00:00
Ralph Castain
09763ec711
Since we modified ORTE to declare that any process that terminates after calling "init" while at least one other process has not yet called "init" is an error, we have to ensure that non-MPI ORTE apps (i.e., apps that call orte_init but not mpi_init) include a barrier in orte_init. Otherwise, fast ORTE apps almost always wind up triggering the "abnormal termination" condition.
...
The barrier is protected with a test to ensure that MPI apps don't execute it and wind up doing two barriers during their init.
This commit was SVN r22378.
2010-01-07 06:58:01 +00:00
Ralph Castain
ef1bfaa823
Add the ability to track how many times a process has been restarted, and to communicate that value to the process on restart in case it needs to behave differently than when it is started for the first time.
...
This commit was SVN r22377.
2010-01-07 01:19:44 +00:00
Ralph Castain
a12de9d1e8
Oh, the pain one little word can make...sigh.
...
This commit was SVN r22364.
2010-01-05 23:29:56 +00:00
Ralph Castain
5faf857840
Add a new tag for pnp/multicast send of direct messages
...
This commit was SVN r22352.
2009-12-31 20:34:58 +00:00
Ralph Castain
b3a58f8b83
Pass the correct address when packing iovec bytes for multicast.
...
Thanks to Rick Payne for the correction.
This commit was SVN r22351.
2009-12-30 20:59:31 +00:00
Ralph Castain
89a6131032
Check the return status code on all dss operations within the rmcast modules
...
This commit was SVN r22349.
2009-12-30 01:45:31 +00:00
Ralph Castain
50074f0770
Remove unused (and uninitialized) variable
...
This commit was SVN r22340.
2009-12-24 01:36:47 +00:00
Ralph Castain
aaf1119f40
Garrr...ensure we accurately know when to update the contact info so we don't do it incorrectly as procs terminate, thus causing the system to think that perfectly good apps are incorrectly terminating.
...
Thanks to George for pointing out the problem
This commit was SVN r22332.
2009-12-17 20:40:21 +00:00
Ralph Castain
db2cbd3166
Okay, okay - do it at destruct time too.
...
This commit was SVN r22331.
2009-12-17 20:08:49 +00:00
Ralph Castain
a56e09c874
Per suggestion from Josh, init the sender field of the msg_packet object to INVALID
...
This commit was SVN r22330.
2009-12-17 20:03:35 +00:00
Ralph Castain
8ab962411c
Detect the scenario where one or more procs fail to call orte/ompi_init while others in the job do. This scenario can cause the job to hang as MPI_Init contains a barrier operation that will not complete. Although ORTE does not contain such a barrier, it still will be considered as an error scenario so that we can detect the MPI case - otherwise, ORTE has no knowledge of OMPI and wouldn't know how to differentiate the use-cases.
...
Take advantage of the changes to update the routed_base_receive code to avoid message overlap.
This commit was SVN r22329.
2009-12-17 19:39:53 +00:00
Josh Hursey
313acba4ce
Move the mca_base_is_component_required() functionality to mca/base per suggestion so that it can be reused in other components.
...
This commit was SVN r22327.
2009-12-17 15:12:26 +00:00
Josh Hursey
a418a7dc43
Make sure to look in not only the env var, but also {{{orte_routed_base_components}}} to confirm that this is the only component available, and intended for selection.
...
This commit was SVN r22323.
2009-12-16 20:17:26 +00:00
Josh Hursey
646f90a90a
Small fix for an edge case
...
This commit was SVN r22322.
2009-12-16 18:06:05 +00:00
George Bosilca
a2310808f1
Santa's back! Fix all warnings about the deprecated usage of stringWithCString, as well as the casting issue between NSInteger and %d.
...
The first is solved by using stringWithUTF8String, which apparently will always give the right answer (sic). The second is fixed as suggested by Apple: cast the NSInteger (hint: which by definition is large enough to hold a pointer) to a long and use %ld in the printf.
This commit was SVN r22317.
2009-12-16 00:06:37 +00:00
Ralph Castain
9acec283af
Add a new TCP module to the reliable multicast framework. This module uses ORTE's grpcomm.xcast functionality to "fake" multicasts for environments where regular multicast isn't reliable.
...
Modify the startup logic to allow for this use-case.
This commit was SVN r22310.
2009-12-15 01:18:27 +00:00
Ralph Castain
0ffa4f2f0c
Ensure we cancel the lingering recv in the allgather code to avoid having incorrect counters.
...
Thanks to Damien for spotting the problem.
This commit was SVN r22301.
2009-12-14 13:21:56 +00:00
George Bosilca
501d1cc4ad
Set default values to avoid using these variables uninitialized.
...
This commit was SVN r22279.
2009-12-08 18:42:22 +00:00
Ralph Castain
e3a2e66ec2
Add limits on rmcast seq numbers
...
This commit was SVN r22269.
2009-12-05 01:20:14 +00:00
Ralph Castain
4a82dd9a45
Add message sequence numbers to multicast messages, tracked by channel
...
This commit was SVN r22262.
2009-12-04 04:17:44 +00:00
Ralph Castain
4ec9c4b532
Do a better job of ensuring session directories are removed when procs abnormally terminate and/or we order "kill local procs"
...
This commit was SVN r22258.
2009-12-03 04:46:17 +00:00
Ralph Castain
93ebed48b1
Update the multicast test. Some cleanups to the basic rmcast module
...
This commit was SVN r22257.
2009-12-03 04:30:58 +00:00
Ralph Castain
66efa05a53
Don't cancel the recv unless it was issued or else we generate an error whenever we launch an app without having to launch daemons (e.g., a completely local launch to mpirun)
...
This commit was SVN r22256.
2009-12-03 04:28:43 +00:00
George Bosilca
7bf1d7a1c4
A more asynchronous startup over rsh/ssh.
...
This commit was SVN r22253.
2009-12-02 20:29:32 +00:00