Josh Hursey
313acba4ce
Move the mca_base_is_component_required() functionality to mca/base per suggestion so that it can be reused in other components.
This commit was SVN r22327.
2009-12-17 15:12:26 +00:00
Josh Hursey
a418a7dc43
Make sure to look in not only the env var, but also {{{orte_routed_base_components}}} to confirm that this is the only component available, and intended for selection.
This commit was SVN r22323.
2009-12-16 20:17:26 +00:00
Josh Hursey
646f90a90a
Small fix for an edge case
This commit was SVN r22322.
2009-12-16 18:06:05 +00:00
George Bosilca
a2310808f1
Santa's back! Fix all warnings about the deprecated usage of
stringWithCString as well as the casting issue between NSInteger and
%d. The first is solved by using stringWithUTF8String, which apparently
will always give the right answer (sic). The second is fixed as suggested
by Apple by casting the NSInteger (hint: which by definition is large
enough to hold a pointer) to a long and using %ld in the printf.
This commit was SVN r22317.
2009-12-16 00:06:37 +00:00
Ralph Castain
9acec283af
Add a new TCP module to the reliable multicast framework. This module uses ORTE's grpcomm.xcast functionality to "fake" multicasts for environments where regular multicast isn't reliable.
Modify the startup logic to allow for this use-case.
This commit was SVN r22310.
2009-12-15 01:18:27 +00:00
Ralph Castain
0ffa4f2f0c
Ensure we cancel the lingering recv in the allgather code to avoid having incorrect counters.
Thanks to Damien for spotting the problem.
This commit was SVN r22301.
2009-12-14 13:21:56 +00:00
George Bosilca
501d1cc4ad
Set default values to avoid using these variables uninitialized.
This commit was SVN r22279.
2009-12-08 18:42:22 +00:00
Ralph Castain
e3a2e66ec2
Add limits on rmcast seq numbers
This commit was SVN r22269.
2009-12-05 01:20:14 +00:00
Ralph Castain
4a82dd9a45
Add message sequence numbers to multicast messages, tracked by channel
This commit was SVN r22262.
2009-12-04 04:17:44 +00:00
Ralph Castain
4ec9c4b532
Do a better job of ensuring session directories are removed when procs abnormally terminate and/or we order "kill local procs"
This commit was SVN r22258.
2009-12-03 04:46:17 +00:00
Ralph Castain
93ebed48b1
Update the multicast test. Some cleanups to the basic rmcast module
This commit was SVN r22257.
2009-12-03 04:30:58 +00:00
Ralph Castain
66efa05a53
Don't cancel the recv unless it was issued or else we generate an error whenever we launch an app without having to launch daemons (e.g., a completely local launch to mpirun)
This commit was SVN r22256.
2009-12-03 04:28:43 +00:00
George Bosilca
7bf1d7a1c4
A more asynchronous startup over rsh/ssh.
This commit was SVN r22253.
2009-12-02 20:29:32 +00:00
Ralph Castain
a0d5c80ce0
Add a new framework for discovering local resource information such as cpu type/model, #cpus, available physical memory, etc. Two initial components (darwin and linux) are provided. This is needed to support bootstrap operations where daemons are started at node boot, and applications where initial knowledge of cpu identification is needed to guide framework component selection.
Add orte configuration option to control the use of the framework in the system. Although the code will build, it will not be active unless configured with --enable-bootstrap.
If bootstrap is enabled and the new opal_sysinfo framework can successfully determine the cpu model, pass that info to the application as an MCA param to support some work at Sun.
Also, have daemons report back the resources they find to guide process mapping in bootstrap operations (i.e., where the daemon starts at node boot as opposed to being launched at application start).
Adjust some platform files to enable these capabilities.
This commit was SVN r22244.
2009-11-30 23:11:25 +00:00
Ralph Castain
e38a0eab9f
Remove the fddp and sensor frameworks - relocated to new cluster mgr project
This commit was SVN r22240.
2009-11-27 22:14:47 +00:00
Rainer Keller
70a69e796f
- Get rid of a small nuisance: after installation of the
alps-resid script, set it to exec, to allow:
export OMPI_ALPS_RESID=`$OMPI/share/openmpi/ras-alps-command.sh`
This commit was SVN r22234.
2009-11-25 19:01:33 +00:00
Ralph Castain
92733b13d9
Add a couple of new tests to the orte system.
Modify the job_complete check so we don't kill jobs when a single proc was terminated by ORTE command via plm.terminate_procs
Still dies gracefully with a ctrl-c, and behaves as before when using plm.terminate_job
This commit was SVN r22227.
2009-11-20 01:47:49 +00:00
Ralph Castain
5e031d9ded
Let a restarted process have access to all known nodes instead of only those already in its prior job map
This commit was SVN r22225.
2009-11-19 19:45:11 +00:00
Ralph Castain
852e5d9ee0
Add some diag output
This commit was SVN r22224.
2009-11-19 19:43:36 +00:00
Ralph Castain
a401f05ea3
Add some diagnostics to chase down forced termination of procs. Ensure that procs are removed from the local data list upon termination
This commit was SVN r22223.
2009-11-19 19:43:10 +00:00
Ralph Castain
8dc08e304f
No longer require name passed separately
This commit was SVN r22221.
2009-11-19 19:41:41 +00:00
Ralph Castain
1a44b84b25
If a process is in certain states (e.g., polling for messages in the event lib), then it can blissfully ignore SIGTERM when we try to order it to die. Unfortunately, the OS thinks the process actually did die, leading us to leave orphaned procs around.
The only sure way to kill the thing is with SIGKILL. After hours spent trying to debug this bizarre situation with a reliable reproducer, I finally tracked it down and fixed it.
Go figure...I sure can't.
This commit was SVN r22220.
2009-11-19 17:25:15 +00:00
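A common way to express the idea in r22220 is to send SIGTERM, verify that the process has really gone, and escalate to SIGKILL if it has not. The sketch below uses only standard POSIX calls; the helper name `terminate_child`, the polling interval, and the grace period are assumptions, and this is not the code that was committed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: ask a child to exit with SIGTERM, poll for up to
 * grace_ms milliseconds, then fall back to SIGKILL, which cannot be
 * caught or ignored. Returns the signal that ended the child, 0 if it
 * exited on its own, or -1 on error. */
static int terminate_child(pid_t pid, int grace_ms)
{
    int status;
    assert(pid > 0 && grace_ms >= 0);

    if (kill(pid, SIGTERM) != 0 && errno != ESRCH) {
        return -1;
    }

    /* Poll in 10 ms steps rather than assuming SIGTERM was honored. */
    for (int waited = 0; waited <= grace_ms; waited += 10) {
        pid_t rc = waitpid(pid, &status, WNOHANG);
        if (rc == pid) {
            return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
        }
        if (rc < 0) {
            return -1;
        }
        struct timespec ts = { 0, 10 * 1000 * 1000 };
        nanosleep(&ts, NULL);
    }

    /* Still alive after the grace period: escalate. */
    if (kill(pid, SIGKILL) != 0) {
        return -1;
    }
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Reaping the child with `waitpid` is what confirms the process is gone; relying on the successful return of `kill` alone only says the signal was delivered.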
Shiqing Fan
11ad25fa77
A few windows fixes:
Add a missing value for the configure file.
Fix the bug that generated a wrong svn version number.
Correct the wrong string length of the headnode name.
cmr:v1.5
cmr:v1.3.4
This commit was SVN r22219.
2009-11-18 09:43:47 +00:00
Ralph Castain
840766a894
Update the rmcast APIs to include tag params and reorder them to look like their rml cousins
This commit was SVN r22218.
2009-11-17 15:58:59 +00:00
Ralph Castain
aea1ab3bd6
Remove diagnostic
This commit was SVN r22216.
2009-11-11 22:16:15 +00:00
Ralph Castain
6496ce7212
Expand the reliable multicast APIs to support sending/recving of iovecs
This commit was SVN r22213.
2009-11-11 22:10:35 +00:00
Rainer Keller
366bd96c88
- Allow to work without xt-catamount module on Jaguar,
reducing the number of components that, up to now, needed to be
deselected.
This commit was SVN r22205.
2009-11-09 14:26:24 +00:00
Rainer Keller
f121e46db1
- Finalize ornl_configure
This commit was SVN r22178.
2009-11-01 03:25:57 +00:00
Rainer Keller
7dfe709ac1
- Initialize n before usage.
This commit was SVN r22169.
2009-10-29 15:52:53 +00:00
Ralph Castain
13d86e100b
Courtesy of Ralph and Jeff:
Continue the reorganization of the configure system. Move files from the main config directory to their appropriate level-specific config directories. Modify the configure system to correctly handle compiler detection, test, and setup so that all things pertaining to opal and orte are done at the lower level, with the ompi configure system only looking at mpi-specific options.
Ensure the wrapper compilers for orte and ompi only get built when appropriate. Add support for c++ to the orte wrapper compilers, both script and non-script versions.
This commit was SVN r22138.
2009-10-24 01:04:35 +00:00
Tim Mattox
4acfbe6554
Unfortunately, the typos that r22129 tried to fix were not
as simple as I or Ralph had hoped. This should be the real fix,
or very close to it. I can now see both the sensor and rmcast
information from ompi_info when configured
with --enable-monitoring --enable_multicast
This commit was SVN r22131.
The following SVN revision numbers were found above:
r22129 --> open-mpi/ompi@02ff00dfb5
2009-10-23 02:38:51 +00:00
Pavel Shamis
7425255be5
Fixing compilation failure. Adding missing include.
This commit was SVN r22119.
2009-10-21 16:28:40 +00:00
Ralph Castain
ee82d42a1c
Add a new sensor component that pulls data via an external shared memory interface
Only builds when the appropriate library is present
This commit was SVN r22114.
2009-10-20 23:45:35 +00:00
Ralph Castain
f1f156d57b
Make rmaps base open function play nicely with ompi_info
This commit was SVN r22111.
2009-10-20 07:28:23 +00:00
Ralph Castain
ff9d72b3ab
Add a new multicast tag for collecting ps data
This commit was SVN r22107.
2009-10-16 04:21:22 +00:00
Ralph Castain
49ce2b4342
Add a new interface to the rmcast framework to query the output channel for the proc
This commit was SVN r22105.
2009-10-15 17:47:42 +00:00
Ralph Castain
99c67183d2
Minor cleanups, mainly to ensure we correctly block on blocking sends
This commit was SVN r22102.
2009-10-15 02:39:15 +00:00
Ralph Castain
2665825693
Correct an error that causes the system to "bounce" when we order a job killed. Previously, we did not discriminate between a process being ordered to die and a process that was aborted by an external signal. Unfortunately, that means the error mgr gets called and told a process abnormally aborted when we order termination, thus causing the errmgr to send out a "kill procs" command again.
Wouldn't be so bad, except...the errmgr orders the termination of ALL procs, which kills any other job that should have been left alone.
Add a new proc and job state indicating "killed_by_cmd" so we can tell the difference between a proc/job that was deliberately terminated by us vs one that is killed by external signal.
This change was tested to ensure it didn't interfere with ctrl-c operation (it doesn't - we order termination of all jobs when we get a ctrl-c).
This commit was SVN r22100.
2009-10-14 22:49:56 +00:00
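The distinction introduced in r22100 can be sketched as a separate state that the error manager ignores. Only the idea of a "killed_by_cmd" state comes from the commit message; the enum values and function names below are hypothetical and do not mirror the ORTE definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical process states. */
typedef enum {
    PROC_RUNNING,
    PROC_TERMINATED_NORMALLY,
    PROC_ABORTED_BY_SIGNAL,   /* killed by an external signal */
    PROC_KILLED_BY_CMD        /* terminated deliberately by the runtime */
} proc_state_t;

/* Classify a signalled exit, given whether the runtime ordered the kill. */
static proc_state_t classify_signalled_exit(bool kill_was_ordered)
{
    return kill_was_ordered ? PROC_KILLED_BY_CMD : PROC_ABORTED_BY_SIGNAL;
}

/* Decide whether the error manager should react to a state change.
 * Only an unexpected abort should trigger it; a kill we ordered
 * ourselves must not, or the "kill procs" command is sent again. */
static bool should_notify_errmgr(proc_state_t state)
{
    switch (state) {
    case PROC_ABORTED_BY_SIGNAL:
        return true;
    case PROC_KILLED_BY_CMD:
    case PROC_TERMINATED_NORMALLY:
    case PROC_RUNNING:
        return false;
    }
    assert(!"unknown process state");
    return false;
}
```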
Ralph Castain
18960a9c5a
Refactor the multicast support so the data type objects can be accessed beyond just the one component
Ensure that the local node is included in the allocation prior to bootstrap discovery
This commit was SVN r22099.
2009-10-14 17:43:40 +00:00
Ralph Castain
bc869636be
Reset the verbosity levels to suppress debug output
This commit was SVN r22095.
2009-10-13 15:29:38 +00:00
Ralph Castain
e501589b3b
Cleanup the bootstrap procedure for multiple daemons starting up
This commit was SVN r22094.
2009-10-13 15:14:54 +00:00
Ralph Castain
c25dd14440
Correctly set the multicast interface, cleanup a comment
This commit was SVN r22093.
2009-10-13 15:14:28 +00:00
Ralph Castain
d8d80d6f1a
Closes trac:2054. Check if a user specifies more cpus-per-rank than there are cpus in a socket - if so, politely tell them "you are stupid" and abort.
This commit was SVN r22091.
The following Trac tickets were found above:
Ticket 2054 --> https://svn.open-mpi.org/trac/ompi/ticket/2054
2009-10-13 04:19:07 +00:00
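The check from r22091 reduces to a single comparison made before mapping starts. A sketch under assumed names (`check_cpus_per_rank` and its message text are illustrative, not the actual rmaps code or show_help output):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical validation: returns 0 if the request fits within one
 * socket, or -1 after printing a message if it does not. */
static int check_cpus_per_rank(int cpus_per_rank, int cpus_per_socket)
{
    assert(cpus_per_rank > 0 && cpus_per_socket > 0);
    if (cpus_per_rank > cpus_per_socket) {
        fprintf(stderr,
                "Requested %d cpus per rank, but a socket has only %d cpus.\n",
                cpus_per_rank, cpus_per_socket);
        return -1;  /* the caller is expected to abort the launch */
    }
    return 0;
}
```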
Ralph Castain
1475d34c13
Ensure we default to byslot mapping
This commit was SVN r22090.
2009-10-11 23:50:42 +00:00
Ralph Castain
84cc847be8
Next phase of auto-wireup using multicast. Enable use of multicast groups to separate comm from different application groups. Have the orted bootstrap message go to a different rml tag so the node can be added to the pool.
This commit was SVN r22083.
2009-10-10 01:19:56 +00:00
Ralph Castain
40e2299fa7
Test to ensure that num_procs was provided for the resilient mapper - it cannot be used with options like npernode.
Cleanup the show_help text file
This commit was SVN r22082.
2009-10-09 15:26:23 +00:00
Shiqing Fan
7dff65cbc9
Clean up a little bit.
Add an option for setting up the job name.
This commit was SVN r22053.
2009-10-06 07:52:43 +00:00
Ralph Castain
dcab61ad83
Restore the prior default rank assignment scheme for round-robin mappers. Ensure that each app_context has sequential vpids.
This commit was SVN r22048.
2009-10-02 03:16:18 +00:00
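"Sequential vpids per app_context" in r22048 can be read as giving each app context a contiguous block of ranks. The sketch below shows that reading only; the function name and the array-based interface are assumptions, and the real mapper works on ORTE job and proc objects.

```c
#include <assert.h>
#include <stddef.h>

/* Give each app context a contiguous block of vpids. first_vpid[i]
 * receives the first vpid of app i; app i then owns the range
 * first_vpid[i] .. first_vpid[i] + procs_per_app[i] - 1.
 * Returns the total number of processes. */
static int assign_sequential_vpids(const int *procs_per_app,
                                   size_t num_apps,
                                   int *first_vpid)
{
    int next = 0;
    assert(procs_per_app != NULL && first_vpid != NULL);
    for (size_t i = 0; i < num_apps; i++) {
        assert(procs_per_app[i] >= 0);
        first_vpid[i] = next;
        next += procs_per_app[i];
    }
    return next;
}
```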
Ralph Castain
a15c58c583
Fix the proc assignment into the job data object during assignment of vpids as comm_spawned procs were being overwritten by their parents with the same vpid.
Add a little debug output when updating proc state
This commit was SVN r22042.
2009-10-01 13:44:34 +00:00
Ralph Castain
51f64aaf96
Add a new ras module to support bootstrap operations. Additional functionality may eventually be required in the component, but for now all it does is provide a mechanism for ensuring that other allocations don't confuse the system.
Only active if specifically directed to use it
This commit was SVN r22040.
2009-09-30 23:30:24 +00:00