Ralph Castain
e4bf33dcab
Just a slight efficiency improvement - why check a flag twice?
...
This commit was SVN r22472.
2010-01-23 03:57:56 +00:00
Shiqing Fan
c29a668e37
Remove flex.exe and its license file from the tarball.
...
cmr:v1.4
cmr:v1.5
This commit was SVN r22469.
2010-01-22 16:40:13 +00:00
Ralph Castain
00b493e10d
Make the sigpipe message be a verbose output
...
This commit was SVN r22464.
2010-01-22 03:46:11 +00:00
Ralph Castain
2517799102
Report, but ignore, SIGPIPE events. The odls already resets this signal handler when spawning local procs.
...
This commit was SVN r22463.
2010-01-21 05:01:06 +00:00
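Note on r22463: a minimal C sketch of what "report, but ignore" can look like for SIGPIPE. This is not the ORTE code; the names install_sigpipe_reporter, sigpipe_was_seen, and write_detect_closed are hypothetical. It assumes a POSIX system, where installing a handler means a write to a closed pipe fails with EPIPE instead of terminating the process. The handler only records that the signal arrived so the caller can report it later.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; only async-signal-safe work happens there. */
static volatile sig_atomic_t sigpipe_seen = 0;

static void on_sigpipe(int signo)
{
    (void)signo;
    sigpipe_seen = 1;
}

/* Hypothetical helper: replace the default SIGPIPE action (terminate)
 * with a handler that merely records the event. Returns 0 on success. */
int install_sigpipe_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigpipe;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/* Returns 1 if a SIGPIPE arrived since the last call, then clears the flag.
 * A caller could print a verbose message when this returns 1. */
int sigpipe_was_seen(void)
{
    int seen = sigpipe_seen;
    sigpipe_seen = 0;
    return seen;
}

/* Returns 0 if the write succeeded, 1 if the peer has gone away (EPIPE),
 * and -1 on any other error. */
int write_detect_closed(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n >= 0) {
        return 0;
    }
    return errno == EPIPE ? 1 : -1;
}
```

A launcher that forks local processes would typically restore the default SIGPIPE action in the child before exec; that step is not shown here.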
Shiqing Fan
4836e8878a
Update a few more CMake scripts.
...
This commit was SVN r22454.
2010-01-19 17:34:55 +00:00
Ralph Castain
3fe5e3e142
Propagate the user's callback data during non-blocking sends
...
This commit was SVN r22432.
2010-01-15 20:02:47 +00:00
Shiqing Fan
ad763c327d
Restore several linked libraries that were deleted by mistake in r22405.
...
This commit was SVN r22415.
The following SVN revision numbers were found above:
r22405 --> open-mpi/ompi@872a4047ba
2010-01-14 21:50:42 +00:00
Shiqing Fan
872a4047ba
Fix the bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions.
...
In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically. In CMake 2.8 this behavior changed: it only adds the dependencies and does not link, which causes linking errors at build time.
This commit was SVN r22405.
2010-01-14 18:10:20 +00:00
Ralph Castain
cec840f6b9
The ability to add procs to a running job was unfortunately borked when we added the detection of a proc exiting before calling init. Re-enable it here, ensuring that procs that are being restarted and/or added to a job do -not- call barrier during orte_init.
...
This commit was SVN r22404.
2010-01-14 17:59:42 +00:00
Shiqing Fan
0259fa0b9c
Correct a few variable names.
...
This commit was SVN r22401.
2010-01-14 10:55:15 +00:00
Ralph Castain
adb2430e24
Missed one place, of course
...
This commit was SVN r22400.
2010-01-13 23:11:44 +00:00
Ralph Castain
c782c98433
Rename the "basic" rmcast component "udp" to more accurately reflect its operation
...
This commit was SVN r22399.
2010-01-13 23:01:25 +00:00
Ralph Castain
237eb4e8df
For some strange reason, every so often it appears possible for the event library to trip the read event on a socket, yet have the read itself yield an error. If/when that happens, report the error and continue on.
...
This happens rarely, but it does seem to happen.
This commit was SVN r22398.
2010-01-13 19:23:28 +00:00
Ralph Castain
ae1719306b
Fix a bug in non-blocking sends
...
This commit was SVN r22395.
2010-01-13 05:37:36 +00:00
Ralph Castain
b35486d945
The CM ess module needs to open the sysinfo framework and select modules before others need it. Thus, set up a flag to avoid multiple open/select within that framework.
...
This commit was SVN r22393.
2010-01-12 22:03:49 +00:00
Ralph Castain
48486df4fe
Clean up some diagnostics
...
This commit was SVN r22389.
2010-01-12 01:25:19 +00:00
Ralph Castain
9f3ccebeaa
We need to barrier for orte apps when the job is initially started, but we must not do the barrier when a proc is restarted as the other procs in the job won't know to participate.
...
This commit was SVN r22388.
2010-01-10 02:21:30 +00:00
Ralph Castain
16b16c5cb8
Fix a silly typo
...
This commit was SVN r22387.
2010-01-09 15:34:49 +00:00
Ralph Castain
e0afc30708
Since the decision on whether or not to use script wrapper compilers is a configure-time option, we need the wrapper compiler script to be in the tarball
...
This commit was SVN r22386.
2010-01-09 01:08:10 +00:00
Ralph Castain
add84178ef
Fix a silly typo that prevented tcp multicast messages from being delivered
...
This commit was SVN r22384.
2010-01-08 20:30:27 +00:00
Brian Barrett
86d8356b13
Updates to allow OMPI to build on Cray XT platforms running Catamount
...
This commit was SVN r22381.
2010-01-07 18:14:03 +00:00
Ralph Castain
09763ec711
Since we modified ORTE to declare that any process that terminates after calling "init" while at least one other process has not yet called "init" is an error, we have to ensure that non-MPI ORTE apps (i.e., apps that call orte_init but not mpi_init) include a barrier in orte_init. Otherwise, fast ORTE apps almost always wind up triggering the "abnormal termination" condition.
...
The barrier is protected with a test to ensure that MPI apps don't execute it and wind up doing two barriers during their init.
This commit was SVN r22378.
2010-01-07 06:58:01 +00:00
Ralph Castain
ef1bfaa823
Add the ability to track how many times a process has been restarted, and to communicate that value to a process when it is restarted in case it needs to take action when it is restarted as opposed to being started for the first time.
...
This commit was SVN r22377.
2010-01-07 01:19:44 +00:00
Ralph Castain
a12de9d1e8
Oh, the pain one little word can make...sigh.
...
This commit was SVN r22364.
2010-01-05 23:29:56 +00:00
Ralph Castain
5faf857840
Add a new tag for pnp/multicast send of direct messages
...
This commit was SVN r22352.
2009-12-31 20:34:58 +00:00
Ralph Castain
b3a58f8b83
Pass the correct address when packing iovec bytes for multicast.
...
Thanks to Rick Payne for the correction.
This commit was SVN r22351.
2009-12-30 20:59:31 +00:00
Ralph Castain
bb7aa9797f
Not sure why the nightly tarball had a problem as this wasn't changed, but comment out the orte wrapper compiler man pages for now
...
This commit was SVN r22350.
2009-12-30 02:52:29 +00:00
Ralph Castain
89a6131032
Check the return status code on all dss operations within the rmcast modules
...
This commit was SVN r22349.
2009-12-30 01:45:31 +00:00
Ralph Castain
fad1ba15b0
Move the test for case-sensitive file system from ompi to opal so that all layers can have that knowledge.
...
Use that for the orte wrapper compilers
This commit was SVN r22348.
2009-12-29 23:26:45 +00:00
Ralph Castain
d6003e3369
Update the mpirun man page to include info on the "!" option to the xterm cmd line option that holds the xterm window open after process termination so the user can see the process' output.
...
This commit was SVN r22341.
2009-12-26 16:01:03 +00:00
Ralph Castain
50074f0770
Remove unused (and uninitialized) variable
...
This commit was SVN r22340.
2009-12-24 01:36:47 +00:00
Ralph Castain
aaf1119f40
Garrr...ensure we accurately know when to update the contact info so we don't do it incorrectly as procs terminate, thus causing the system to think that perfectly good apps are incorrectly terminating.
...
Thanks to George for pointing out the problem
This commit was SVN r22332.
2009-12-17 20:40:21 +00:00
Ralph Castain
db2cbd3166
Okay, okay - do it at destruct time too.
...
This commit was SVN r22331.
2009-12-17 20:08:49 +00:00
Ralph Castain
a56e09c874
Per suggestion from Josh, init the sender field of the msg_packet object to INVALID
...
This commit was SVN r22330.
2009-12-17 20:03:35 +00:00
Ralph Castain
8ab962411c
Detect the scenario where one or more procs fail to call orte/ompi_init while others in the job do. This scenario can cause the job to hang as MPI_Init contains a barrier operation that will not complete. Although ORTE does not contain such a barrier, it still will be considered as an error scenario so that we can detect the MPI case - otherwise, ORTE has no knowledge of OMPI and wouldn't know how to differentiate the use-cases.
...
Take advantage of the changes to update the routed_base_receive code to avoid message overlap.
This commit was SVN r22329.
2009-12-17 19:39:53 +00:00
Ralph Castain
06d1f2cfe2
Add some new tests to the ORTE collection
...
This commit was SVN r22328.
2009-12-17 19:30:57 +00:00
Josh Hursey
313acba4ce
Move the mca_base_is_component_required() functionality to mca/base per suggestion so that it can be reused in other components.
...
This commit was SVN r22327.
2009-12-17 15:12:26 +00:00
Josh Hursey
a418a7dc43
Make sure to look in not only the env var, but also {{{orte_routed_base_components}}} to confirm that this is the only component available, and intended for selection.
...
This commit was SVN r22323.
2009-12-16 20:17:26 +00:00
Josh Hursey
646f90a90a
Small fix for an edge case
...
This commit was SVN r22322.
2009-12-16 18:06:05 +00:00
George Bosilca
a2310808f1
Santa's back! Fix all warnings about the deprecated usage of
...
stringWithCString as well as the casting issue between NSInteger and
%d. The first is solved by using stringWithUTF8String, which apparently
will always give the right answer (sic). The second is fixed as suggested
by Apple by casting the NSInteger (hint: which by definition is large
enough to hold a pointer) to a long and using %ld in the printf.
This commit was SVN r22317.
2009-12-16 00:06:37 +00:00
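Note on r22317: the cast-then-%ld pattern described above, sketched in plain C. NSInteger itself is an Objective-C typedef, so this example uses a hypothetical stand-in type, example_integer_t, defined as intptr_t; the function name is likewise made up. The sketch assumes a platform where long is pointer-sized (as on Apple's 32- and 64-bit ABIs); on a platform where long is narrower than a pointer, the cast would truncate.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: a stand-in for NSInteger, i.e. a pointer-sized signed integer. */
typedef intptr_t example_integer_t;

/* Format the value with %ld. The explicit cast makes the argument type
 * match the conversion specifier regardless of how the typedef is defined,
 * which is what silences the format warning. Returns snprintf's result. */
int format_example_integer(char *buf, size_t len, example_integer_t value)
{
    return snprintf(buf, len, "%ld", (long)value);
}
```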
Ralph Castain
9acec283af
Add a new TCP module to the reliable multicast framework. This module uses ORTE's grpcomm.xcast functionality to "fake" multicasts for environments where regular multicast isn't reliable.
...
Modify the startup logic to allow for this use-case.
This commit was SVN r22310.
2009-12-15 01:18:27 +00:00
Ralph Castain
0ffa4f2f0c
Ensure we cancel the lingering recv in the allgather code to avoid having incorrect counters.
...
Thanks to Damien for spotting the problem.
This commit was SVN r22301.
2009-12-14 13:21:56 +00:00
Josh Hursey
4357159ac9
Make sure to check for the NO_CKPT state while waiting. This means that the target was not able to checkpoint [ever | at this time]. So {{{ompi-checkpoint}}} should exit after printing the error message, instead of hanging and waiting.
...
Will need to be moved to v1.5 and v1.4. v1.4 will require a custom patch, but should apply cleanly to v1.5. CMRs to follow.
This commit was SVN r22289.
2009-12-09 16:01:33 +00:00
George Bosilca
501d1cc4ad
Set default values to avoid using these variables uninitialized.
...
This commit was SVN r22279.
2009-12-08 18:42:22 +00:00
Ralph Castain
e3a2e66ec2
Add limits on rmcast seq numbers
...
This commit was SVN r22269.
2009-12-05 01:20:14 +00:00
Jeff Squyres
9fffd30660
Fix a typo in the man page. Thanks to Jeremiah Willcock for pointing
...
it out.
This commit was SVN r22268.
2009-12-04 21:10:50 +00:00
Ralph Castain
4026a9c873
Update all the tests to the new orte_init API
...
This commit was SVN r22263.
2009-12-04 04:31:06 +00:00
Ralph Castain
4a82dd9a45
Add message sequence numbers to multicast messages, tracked by channel
...
This commit was SVN r22262.
2009-12-04 04:17:44 +00:00
Jeff Squyres
16b100219d
A patch from UTK to allow orte_init(), opal_init(), and associated
...
friends also receive &argc and &argv (George asked Jeff to Ralph to
review before committing). The thought is that passing argv and argc
to opal/orte_init would be useful to other projects outside of OMPI that are
using OPAL and/or ORTE (especially in conjunction with some other
bootstrapping code where it is helpful to modify argv). It's such a
small thing that it's easy to apply here to make others' lives a
little easier.
Ask George for more details; I'm just the messenger. :-)
Judging by the copyrights on this patch, it's been around for a
while. :-)
This commit was SVN r22260.
2009-12-04 00:51:15 +00:00
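Note on r22260: a small C sketch of why an init routine might take &argc and &argv rather than copies. The function example_init and the option "--example-bootstrap" are hypothetical and are not part of OPAL or ORTE; the sketch only shows how a callee that receives pointers can strip arguments it consumes, so the caller sees the modified command line afterwards.

```c
#include <string.h>

/* Hypothetical init routine. Removes every "--example-bootstrap" entry
 * (other than argv[0]) from the caller's argument vector in place and
 * returns how many entries were removed. */
int example_init(int *argc, char ***argv)
{
    char **av = *argv;
    int kept = 0;
    int removed = 0;

    for (int i = 0; i < *argc; i++) {
        if (i > 0 && strcmp(av[i], "--example-bootstrap") == 0) {
            removed++;              /* consumed here, hidden from the caller */
        } else {
            av[kept++] = av[i];     /* compact the remaining arguments */
        }
    }
    av[kept] = NULL;                /* keep the argv[argc] == NULL invariant */
    *argc = kept;
    return removed;
}
```

Called as example_init(&argc, &argv) at the top of main, the rest of the program would then parse only the arguments that were not consumed.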
Ralph Castain
ae3e9f2aee
Update the spin.c test
...
This commit was SVN r22259.
2009-12-03 04:46:31 +00:00
Ralph Castain
4ec9c4b532
Do a better job of ensuring session directories are removed when procs abnormally terminate and/or we order "kill local procs"
...
This commit was SVN r22258.
2009-12-03 04:46:17 +00:00
Ralph Castain
93ebed48b1
Update the multicast test. Some cleanups to the basic rmcast module
...
This commit was SVN r22257.
2009-12-03 04:30:58 +00:00
Ralph Castain
66efa05a53
Don't cancel the recv unless it was issued or else we generate an error whenever we launch an app without having to launch daemons (e.g., a completely local launch to mpirun)
...
This commit was SVN r22256.
2009-12-03 04:28:43 +00:00
Ralph Castain
3a72ee9dca
Fix a bug reported by Rainer whereby we could free and reuse an address if the user specified the tmp dir base. After discussing with Josh, we also removed the code that had us retry creation of the session dir (using default values) if the user-specified value didn't work for some reason. Adhering to OMPI standard practices, we abort if the user-specified value doesn't work.
...
This commit was SVN r22255.
2009-12-03 01:57:35 +00:00
George Bosilca
7bf1d7a1c4
A more asynchronous startup over rsh/ssh.
...
This commit was SVN r22253.
2009-12-02 20:29:32 +00:00
Ralph Castain
a0d5c80ce0
Add a new framework for discovering local resource information such as cpu type/model, #cpus, available physical memory, etc. Two initial components (darwin and linux) are provided. This is needed to support bootstrap operations where daemons are started at node boot, and applications where initial knowledge of cpu identification is needed to guide framework component selection.
...
Add orte configuration option to control the use of the framework in the system. Although the code will build, it will not be active unless configured with --enable-bootstrap.
If bootstrap is enabled and the new opal_sysinfo framework can successfully determine the cpu model, pass that info to the application as an MCA param to support some work at Sun.
Also, have daemons report back the resources they find to guide process mapping in bootstrap operations (i.e., where the daemon starts at node boot as opposed to being launched at application start).
Adjust some platform files to enable these capabilities.
This commit was SVN r22244.
2009-11-30 23:11:25 +00:00
Ralph Castain
e38a0eab9f
Remove the fddp and sensor frameworks - relocated to new cluster mgr project
...
This commit was SVN r22240.
2009-11-27 22:14:47 +00:00
Rainer Keller
70a69e796f
- Get rid of a small nuisance: after installation of the
...
alps-resid script, set it to exec, to allow:
export OMPI_ALPS_RESID=`$OMPI/share/openmpi/ras-alps-command.sh`
This commit was SVN r22234.
2009-11-25 19:01:33 +00:00
Ralph Castain
9a6d5697a8
Protect against NULL input - I'm -sure- no one will do it, but...well, actually, they did. :-/
...
This commit was SVN r22232.
2009-11-25 15:13:21 +00:00
Ralph Castain
c1206139dd
Ensure the thread-safe data buffers are initialized prior to use
...
This commit was SVN r22231.
2009-11-25 15:12:45 +00:00
Ralph Castain
92733b13d9
Add a couple of new tests to the orte system.
...
Modify the job_complete check so we don't kill jobs when a single proc was terminated by ORTE command via plm.terminate_procs
Still dies gracefully with a ctrl-c, and behaves as before when using plm.terminate_job
This commit was SVN r22227.
2009-11-20 01:47:49 +00:00
Ralph Castain
5e031d9ded
Let a restarted process have access to all known nodes instead of only those already in its prior job map
...
This commit was SVN r22225.
2009-11-19 19:45:11 +00:00
Ralph Castain
852e5d9ee0
Add some diag output
...
This commit was SVN r22224.
2009-11-19 19:43:36 +00:00
Ralph Castain
a401f05ea3
Add some diagnostics to chase down forced termination of procs. Ensure that procs are removed from the local data list upon termination
...
This commit was SVN r22223.
2009-11-19 19:43:10 +00:00
Ralph Castain
3921069230
Ensure we completely clean out the old nidmap info
...
This commit was SVN r22222.
2009-11-19 19:42:15 +00:00
Ralph Castain
8dc08e304f
No longer require name passed separately
...
This commit was SVN r22221.
2009-11-19 19:41:41 +00:00
Ralph Castain
1a44b84b25
If a process is in certain states (e.g., polling for messages in the event lib), then it can blissfully ignore SIGTERM when we try to order it to die. Unfortunately, the OS thinks the process actually did die, leading us to leave orphaned procs around.
...
The only sure way to kill the thing is with SIGKILL. After hours spent trying to debug this bizarre situation with a reliable reproducer, I finally tracked it down and fixed it.
Go figure...I sure can't.
This commit was SVN r22220.
2009-11-19 17:25:15 +00:00
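Note on r22220: a minimal C sketch of escalating from SIGTERM to SIGKILL when a child does not exit. It is an illustration under stated assumptions, not the ORTE implementation: the function names, the polling interval, and the grace period are all hypothetical. The demo helper sets SIGTERM to "ignore" before forking so that the child inherits that disposition, which models a process that never reacts to SIGTERM.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: ask the child to exit with SIGTERM, poll for up to
 * grace_ms milliseconds, then fall back to SIGKILL, which cannot be caught
 * or ignored. On success returns 0 and stores the wait status in *status. */
int kill_child_escalating(pid_t pid, int grace_ms, int *status)
{
    const struct timespec step = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int waited_ms = 0;

    if (kill(pid, SIGTERM) != 0) {
        return -1;
    }
    while (waited_ms < grace_ms) {
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid) {
            return 0;               /* child exited within the grace period */
        }
        if (r < 0) {
            return -1;
        }
        nanosleep(&step, NULL);
        waited_ms += 10;
    }
    if (kill(pid, SIGKILL) != 0) {
        return -1;
    }
    /* Reap the child so no zombie is left behind. */
    return waitpid(pid, status, 0) == pid ? 0 : -1;
}

/* Demo helper: fork a child that ignores SIGTERM and sleeps forever.
 * The disposition is set before fork() so the child inherits it with no
 * race, and the parent's previous disposition is restored afterwards. */
pid_t spawn_stubborn_child(void)
{
    struct sigaction ign, old;
    pid_t pid;

    memset(&ign, 0, sizeof(ign));
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    sigaction(SIGTERM, &ign, &old);

    pid = fork();
    if (pid == 0) {
        for (;;) {
            pause();
        }
    }
    sigaction(SIGTERM, &old, NULL);
    return pid;
}
```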
Shiqing Fan
11ad25fa77
A few Windows fixes:
...
Add a missing value for the configure file.
Fix the bug that generated a wrong svn version number.
Correct the wrong string length of the headnode name.
cmr:v1.5
cmr:v1.3.4
This commit was SVN r22219.
2009-11-18 09:43:47 +00:00
Ralph Castain
840766a894
Update the rmcast APIs to include tag params and reorder them to look like their rml cousins
...
This commit was SVN r22218.
2009-11-17 15:58:59 +00:00
Ralph Castain
aea1ab3bd6
Remove diagnostic
...
This commit was SVN r22216.
2009-11-11 22:16:15 +00:00
Ralph Castain
a2f3a47b92
Update the orte_mcast test
...
This commit was SVN r22214.
2009-11-11 22:11:19 +00:00
Ralph Castain
6496ce7212
Expand the reliable multicast APIs to support sending/recving of iovecs
...
This commit was SVN r22213.
2009-11-11 22:10:35 +00:00
Rainer Keller
366bd96c88
- Allow to work without xt-catamount module on Jaguar,
...
reducing the number of components that up to now needed to be
deselected.
This commit was SVN r22205.
2009-11-09 14:26:24 +00:00
Shiqing Fan
6f8d0a1ab8
Update a few CMake scripts.
...
Add Program Database (pdb) files for installation for debug build.
This commit was SVN r22188.
2009-11-03 10:40:58 +00:00
Rainer Keller
f121e46db1
- Finalize ornl_configure
...
This commit was SVN r22178.
2009-11-01 03:25:57 +00:00
Rainer Keller
7dfe709ac1
- Initialize n before usage.
...
This commit was SVN r22169.
2009-10-29 15:52:53 +00:00
Terry Dontje
c6ebc7c341
rename macros ompi_check_optflags and ompi_make_stripped_flags based on comments in #2072
...
This commit was SVN r22151.
2009-10-28 10:51:59 +00:00
Terry Dontje
6df802424d
remove duplicate setting of CFLAGS_WITHOUT_OPTFLAGS and special case DEBUGGER_FLAGS for intel compiler
...
This commit was SVN r22143.
2009-10-26 18:41:53 +00:00
Ralph Castain
13d86e100b
Courtesy of Ralph and Jeff:
...
Continue the reorganization of the configure system. Move files from the main config directory to their appropriate level-specific config directories. Modify the configure system to correctly handle compiler detection, test, and setup so that all things pertaining to opal and orte are done at the lower level, with the ompi configure system only looking at mpi-specific options.
Ensure the wrapper compilers for orte and ompi only get built when appropriate. Add support for c++ to the orte wrapper compilers, both script and non-script versions.
This commit was SVN r22138.
2009-10-24 01:04:35 +00:00
Ralph Castain
7afd65d631
Add a couple of test programs
...
This commit was SVN r22137.
2009-10-24 01:00:38 +00:00
Jeff Squyres
02db4f5146
Terry pointed out that ORTE also needs the "totalview" flags, and
...
therefore the m4 test really belongs in orte/config. Thanks, Terry!
Additionally, I took the opportunity to rename the variable so that
"TOTALVIEW" is not in the name anymore (because it applies to all
variables, not just Totalview).
This commit was SVN r22134.
2009-10-23 13:00:59 +00:00
Tim Mattox
4acfbe6554
Unfortunately, the typos that r22129 tried to fix were not
...
as simple as I or Ralph had hoped. This should be the real fix,
or very close to it. I can now see both the sensor and rmcast
information from ompi_info when configured
with --enable-monitoring --enable_multicast
This commit was SVN r22131.
The following SVN revision numbers were found above:
r22129 --> open-mpi/ompi@02ff00dfb5
2009-10-23 02:38:51 +00:00
Jeff Squyres
e0e20870e1
Generated files should not be in SVN.
...
This commit was SVN r22126.
2009-10-22 17:57:05 +00:00
Pavel Shamis
7425255be5
Fixing compilation failure. Adding missing include.
...
This commit was SVN r22119.
2009-10-21 16:28:40 +00:00
Ralph Castain
c33866f0df
Per Tim Mattox (again, via my branch):
...
Add a script wrapper compiler version of ortecc for use when in cross-compile scenarios
This commit was SVN r22115.
2009-10-20 23:46:46 +00:00
Ralph Castain
ee82d42a1c
Add a new sensor component that pulls data via an external shared memory interface
...
Only builds when the appropriate library is present
This commit was SVN r22114.
2009-10-20 23:45:35 +00:00
Ralph Castain
214e26b539
Per Jeff (this work was done on a branch of mine, so I will do the commit):
...
Re-enable "./autogen.sh -no-ompi" again. If you -no-ompi, the entire OMPI
configury is skipped and the entire ompi/ subtree is not built. There's
some simple m4-isms that prune out the relevant parts.
I added ompi/config/, orte/config/, and opal/config/ directories. I moved a
bunch of m4 files from the top-level config/ dir into ompi/config/, and a few
into orte/config/.
Note that all 3 <project>/config directories have a config_files.m4 file. This
file contains the AC_CONFIG_FILES list for that project. The AC_CONFIG_FILES
call cannot be in an AC_DEFUN macro and conditionally called -- if it is
included at all, Autoconf will process it. Hence, these config_files.m4 files
don't AC_DEFUN -- they just have AC_CONFIG_FILES. m4_ifdef() is used to
conditionally include the files or not.
I moved a bunch of obvious OMPI-only m4 files from config/ to ompi/config/,
but I'm sure that there's more that could go. A ticket will be filed with
thoughts on future work in this area.
This commit was SVN r22113.
2009-10-20 23:44:20 +00:00
Ralph Castain
f1f156d57b
Make rmaps base open function play nicely with ompi_info
...
This commit was SVN r22111.
2009-10-20 07:28:23 +00:00
Ralph Castain
ff9d72b3ab
Add a new multicast tag for collecting ps data
...
This commit was SVN r22107.
2009-10-16 04:21:22 +00:00
Terry Dontje
13907781b2
missed adding report-uri option
...
This commit was SVN r22106.
2009-10-15 18:05:24 +00:00
Ralph Castain
49ce2b4342
Add a new interface to the rmcast framework to query the output channel for the proc
...
This commit was SVN r22105.
2009-10-15 17:47:42 +00:00
Terry Dontje
c96af5654c
correct options and wording that were dropped in the last change due to committing v1.3 manpage to the trunk
...
This commit was SVN r22104.
2009-10-15 15:03:21 +00:00
Ralph Castain
99c67183d2
Minor cleanups, mainly to ensure we correctly block on blocking sends
...
This commit was SVN r22102.
2009-10-15 02:39:15 +00:00
Ralph Castain
2f91a4833b
Have the trigger event return the event itself in the callback function so it can be reset, if desired
...
This commit was SVN r22101.
2009-10-15 02:35:53 +00:00
Ralph Castain
2665825693
Correct an error that causes the system to "bounce" when we order a job killed. Previously, we did not discriminate between a process being ordered to die and a process that was aborted by an external signal. Unfortunately, that means the error mgr gets called and told a process abnormally aborted when we order termination, thus causing the errmgr to send out a "kill procs" command again.
...
Wouldn't be so bad, except...the errmgr orders the termination of ALL procs, which kills any other job that should have been left alone.
Add a new proc and job state indicating "killed_by_cmd" so we can tell the difference between a proc/job that was deliberately terminated by us vs one that is killed by external signal.
This change was tested to ensure it didn't interfere with ctrl-c operation (it doesn't - we order termination of all jobs when we get a ctrl-c).
This commit was SVN r22100.
2009-10-14 22:49:56 +00:00
Ralph Castain
18960a9c5a
Refactor the multicast support so the data type objects can be accessed beyond just the one component
...
Ensure that the local node is included in the allocation prior to bootstrap discovery
This commit was SVN r22099.
2009-10-14 17:43:40 +00:00
Terry Dontje
0a8645a411
This commit fixes trac:2017
...
This commit was SVN r22098.
The following Trac tickets were found above:
Ticket 2017 --> https://svn.open-mpi.org/trac/ompi/ticket/2017
2009-10-14 11:40:47 +00:00
Ralph Castain
bc869636be
Reset the verbosity levels to suppress debug output
...
This commit was SVN r22095.
2009-10-13 15:29:38 +00:00
Ralph Castain
e501589b3b
Clean up the bootstrap procedure for multiple daemons starting up
...
This commit was SVN r22094.
2009-10-13 15:14:54 +00:00
Ralph Castain
c25dd14440
Correctly set the multicast interface, clean up a comment
...
This commit was SVN r22093.
2009-10-13 15:14:28 +00:00