1955 Commits

Author SHA1 Message Date
Ralph Castain
aea1ab3bd6 Remove diagnostic
This commit was SVN r22216.
2009-11-11 22:16:15 +00:00
Ralph Castain
6496ce7212 Expand the reliable multicast APIs to support sending/recving of iovecs
This commit was SVN r22213.
2009-11-11 22:10:35 +00:00
Rainer Keller
366bd96c88 - Allow working without the xt-catamount module on Jaguar,
reducing the number of components that until now had to be
   deselected.

This commit was SVN r22205.
2009-11-09 14:26:24 +00:00
Rainer Keller
f121e46db1 - Finalize ornl_configure
This commit was SVN r22178.
2009-11-01 03:25:57 +00:00
Rainer Keller
7dfe709ac1 - Initialize n before usage.
This commit was SVN r22169.
2009-10-29 15:52:53 +00:00
Ralph Castain
13d86e100b Courtesy of Ralph and Jeff:
Continue the reorganization of the configure system. Move files from the main config directory to their appropriate level-specific config directories. Modify the configure system to correctly handle compiler detection, test, and setup so that all things pertaining to opal and orte are done at the lower level, with the ompi configure system only looking at mpi-specific options.

Ensure the wrapper compilers for orte and ompi only get built when appropriate. Add support for c++ to the orte wrapper compilers, both script and non-script versions.

This commit was SVN r22138.
2009-10-24 01:04:35 +00:00
Tim Mattox
4acfbe6554 Unfortunately, the typos that r22129 tried to fix were not
as simple as I or Ralph had hoped.  This should be the real fix,
or very close to it.  I can now see both the sensor and rmcast
information from ompi_info when configured
with --enable-monitoring --enable-multicast

This commit was SVN r22131.

The following SVN revision numbers were found above:
  r22129 --> open-mpi/ompi@02ff00dfb5
2009-10-23 02:38:51 +00:00
Pavel Shamis
7425255be5 Fixing compilation failure. Adding missing include.
This commit was SVN r22119.
2009-10-21 16:28:40 +00:00
Ralph Castain
ee82d42a1c Add a new sensor component that pulls data via an external shared memory interface
Only builds when the appropriate library is present

This commit was SVN r22114.
2009-10-20 23:45:35 +00:00
Ralph Castain
f1f156d57b Make rmaps base open function play nicely with ompi_info
This commit was SVN r22111.
2009-10-20 07:28:23 +00:00
Ralph Castain
ff9d72b3ab Add a new multicast tag for collecting ps data
This commit was SVN r22107.
2009-10-16 04:21:22 +00:00
Ralph Castain
49ce2b4342 Add a new interface to the rmcast framework to query the output channel for the proc
This commit was SVN r22105.
2009-10-15 17:47:42 +00:00
Ralph Castain
99c67183d2 Minor cleanups, mainly to ensure we correctly block on blocking sends
This commit was SVN r22102.
2009-10-15 02:39:15 +00:00
Ralph Castain
2665825693 Correct an error that causes the system to "bounce" when we order a job killed. Previously we did not distinguish between a process being ordered to die and a process that was aborted by an external signal. Unfortunately, that means the error mgr gets called and told a process abnormally aborted when we order termination, thus causing the errmgr to send out a "kill procs" command again.
Wouldn't be so bad, except...the errmgr orders the termination of ALL procs, which kills any other job that should have been left alone.

Add a new proc and job state indicating "killed_by_cmd" so we can tell the difference between a proc/job that was deliberately terminated by us vs one that is killed by external signal.

This change was tested to ensure it didn't interfere with ctrl-c operation (it doesn't - we order termination of all jobs when we get a ctrl-c).

This commit was SVN r22100.
2009-10-14 22:49:56 +00:00
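
The distinction introduced in 2665825693 can be illustrated with a small Python sketch. This is not the ORTE implementation (which is written in C): the enum members and the functions `classify_exit` and `errmgr_should_kill_all` are hypothetical names chosen for illustration, and only the "killed_by_cmd" idea comes from the commit message.

```python
from enum import Enum, auto


class ProcState(Enum):
    """Hypothetical proc states; only KILLED_BY_CMD mirrors the commit message."""
    RUNNING = auto()
    ABORTED_BY_SIGNAL = auto()   # killed by an external signal
    KILLED_BY_CMD = auto()       # deliberately terminated by the runtime


def classify_exit(ordered_to_die: bool) -> ProcState:
    """Pick the state for a proc that exited on a signal."""
    if ordered_to_die:
        return ProcState.KILLED_BY_CMD
    return ProcState.ABORTED_BY_SIGNAL


def errmgr_should_kill_all(state: ProcState) -> bool:
    """Only an abnormal abort should trigger a further "kill procs" command."""
    return state is ProcState.ABORTED_BY_SIGNAL
```

With a separate state for deliberate termination, the error manager in this sketch has no reason to issue a second kill command, so other jobs are left alone.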
Ralph Castain
18960a9c5a Refactor the multicast support so the data type objects can be accessed beyond just the one component
Ensure that the local node is included in the allocation prior to bootstrap discovery

This commit was SVN r22099.
2009-10-14 17:43:40 +00:00
Ralph Castain
bc869636be Reset the verbosity levels to suppress debug output
This commit was SVN r22095.
2009-10-13 15:29:38 +00:00
Ralph Castain
e501589b3b Cleanup the bootstrap procedure for multiple daemons starting up
This commit was SVN r22094.
2009-10-13 15:14:54 +00:00
Ralph Castain
c25dd14440 Correctly set the multicast interface, cleanup a comment
This commit was SVN r22093.
2009-10-13 15:14:28 +00:00
Ralph Castain
d8d80d6f1a Closes trac:2054. Check if a user specifies more cpus-per-rank than there are cpus in a socket - if so, politely tell them "you are stupid" and abort.
This commit was SVN r22091.

The following Trac tickets were found above:
  Ticket 2054 --> https://svn.open-mpi.org/trac/ompi/ticket/2054
2009-10-13 04:19:07 +00:00
Ralph Castain
1475d34c13 Ensure we default to byslot mapping
This commit was SVN r22090.
2009-10-11 23:50:42 +00:00
Ralph Castain
84cc847be8 Next phase of auto-wireup using multicast. Enable use of multicast groups to separate comm from different application groups. Have the orted bootstrap message go to a different rml tag so the node can be added to the pool.
This commit was SVN r22083.
2009-10-10 01:19:56 +00:00
Ralph Castain
40e2299fa7 Test to ensure that num_procs was provided for the resilient mapper - it cannot be used with options like npernode.
Cleanup the show_help text file

This commit was SVN r22082.
2009-10-09 15:26:23 +00:00
Shiqing Fan
7dff65cbc9 Clean up a little bit.
Add an option for setting up the job name.

This commit was SVN r22053.
2009-10-06 07:52:43 +00:00
Ralph Castain
dcab61ad83 Restore the prior default rank assignment scheme for round-robin mappers. Ensure that each app_context has sequential vpids.
This commit was SVN r22048.
2009-10-02 03:16:18 +00:00
Ralph Castain
a15c58c583 Fix the proc assignment into the job data object during assignment of vpids as comm_spawned procs were being overwritten by their parents with the same vpid.
Add a little debug output when updating proc state

This commit was SVN r22042.
2009-10-01 13:44:34 +00:00
Ralph Castain
51f64aaf96 Add a new ras module to support bootstrap operations. Additional functionality may eventually be required in the component, but for now all it does is provide a mechanism for ensuring that other allocations don't confuse the system.
Only active if specifically directed to use it

This commit was SVN r22040.
2009-09-30 23:30:24 +00:00
Ralph Castain
1d7ab97c84 Update the multicast framework to allow specification of different message scopes per various RFCs. Redefine the API a little to utilize channel numbers without worrying about the specifics of their addressing
This commit was SVN r22037.
2009-09-30 14:40:43 +00:00
Ralph Castain
5a24d6f60e Remove an option that the orteds don't actually support...
This commit was SVN r22027.
2009-09-29 02:08:27 +00:00
Ralph Castain
c749fefbd0 Instead of an odls-base mca param, make report_bindings a global param so that we can (a) detect it was set in the plm, and then (b) ensure it gets passed along to remote orteds so they will comply with the request.
This commit was SVN r22021.
2009-09-28 03:17:15 +00:00
Ralph Castain
47c9a5409e Ensure that tools init the multicast channel correctly
This commit was SVN r22020.
2009-09-28 03:15:51 +00:00
Ralph Castain
ef0fd8b8d1 Return an error code if the job failed to start
This commit was SVN r22019.
2009-09-26 03:34:58 +00:00
Ralph Castain
e337fa686e Correct handling of pointer array indexing
This commit was SVN r22018.
2009-09-26 03:33:55 +00:00
Ralph Castain
709b36efb4 Cleanup auto-wireup and enable tools to "discover" the HNP via multicast
This commit was SVN r22012.
2009-09-25 01:00:09 +00:00
Abhishek Kulkarni
2af7657db1 A few changes to the FTB notifier interface:
- add an orte ftb notifier help file for more verbose error messages
- check if we can connect to the FTB during component->query and close
  the component, if we cannot.
- make the ftb component interface methods static.
- add mca parameters to override the default subscription style and
  priority.

This commit was SVN r22011.
2009-09-24 23:56:41 +00:00
Ralph Castain
3167f0a0a0 Complete the next round of the multicast framework development. Needs further polish, upgrade to handle message fragmentation - but good enough for auto-bootstrap of orteds.
Teach the ess cm module to bootstrap orted launch

This commit was SVN r22006.
2009-09-23 20:57:49 +00:00
Josh Hursey
c9bd045cff move {{{ess_env_ft_event_update_process_info}}} into SnapC {{{snapc_full_app_ft_event_update_process_info}}} where it should have been all along.
This commit was SVN r22004.
2009-09-23 18:29:13 +00:00
Josh Hursey
a6ee73156c Add a verbose debug option, and add some error prints in the ESS' ft_event code.
This commit was SVN r22003.
2009-09-23 17:05:49 +00:00
Josh Hursey
2769091261 Fix for the stalled scenario in which 'options' might be reset to NULL inadvertently.
Thanks to MTT for picking this up.

This commit was SVN r22002.
2009-09-23 13:26:48 +00:00
Ralph Castain
dff0d01673 Yet another paffinity cleanup...sigh.
1. ensure that orte_rmaps_base_schedule_policy does not override cmd line settings

2. when you try to bind to more cores than we have, generate a not-enough-processors error message

3. allow npersocket -bind-to-core combination - because, yes, somebody actually wants to do it.

This commit was SVN r21996.
2009-09-22 18:44:53 +00:00
Josh Hursey
5406fdfb80 Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option).
This commit looks larger than it really is since it includes a fair amount of code cleanup.

The SIGSTOP/SIGCONT+checkpointing work uses some of the functionality in r20391. Basic use case below (note that the checkpoint generated is useable as usual if the stopped application is terminated).
{{{
shell 1) mpirun -np 2 -am ft-enable-cr my-app
... running ...

shell 2) ompi-checkpoint --stop -v MPIRUN_PID
[localhost:001300] [  0.00 /   0.20]                 Requested - ...
[localhost:001300] [  0.00 /   0.20]                   Pending - ...
[localhost:001300] [  0.01 /   0.21]                   Running - ...
[localhost:001300] [  1.01 /   1.22]                   Stopped - ompi_global_snapshot_1234.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt

shell 2) killall -CONT mpirun

... Application Continues execution in shell 1 ...
}}}

Other items in this commit are mostly cleanup that has been sitting off-trunk for too long:
 * Add a new {{{opal_crs_base_ckpt_options_t}}} type that encapsulates the various options that could be passed to the CRS. Currently only TERM and STOP, but this makes adding others ''much'' easier.
 * Eliminate ORTE_SNAPC_CKPT_STATE_PENDING_TERM, since it served a redundant purpose with the new options type.
 * Lay some basic ground work for some future features.

This commit was SVN r21995.

The following SVN revision numbers were found above:
  r20391 --> open-mpi/ompi@0704b98668
2009-09-22 18:26:12 +00:00
Ralph Castain
8da3aa8d5c Some (hopefully final!) adjustments and corrections to the paffinity support:
1. default -npersocket to force -bind-to-socket

2. if we cannot get a value for cores/socket, try using the number of logical cpus; otherwise, default to 1 core

3. add missing error message for not-enough-processors

4. since we no longer loop through orte_register_params twice, put the auto-detect of
   topology info in the rte_init for hnp and std_orted

5. fix bind-to-core, bysocket combination

This commit was SVN r21992.
2009-09-22 15:41:03 +00:00
Ralph Castain
12613352eb Add missing header file
This commit was SVN r21990.
2009-09-22 13:07:57 +00:00
Ralph Castain
2210989e2d Update the cm ess module to support orted bootstrap. Continue work towards bootstrap capability.
This commit was SVN r21989.
2009-09-22 02:16:40 +00:00
Ralph Castain
c3f9096fd9 Add a reliable multicast framework, with an initial basic module. This is configured out unless specifically requested via --enable-multicast.
This commit was SVN r21988.
2009-09-22 00:58:29 +00:00
Terry Dontje
0ccf2d87b6 rename do-not-bind to bind-to-none and clean up an error message
This commit was SVN r21980.
2009-09-21 17:00:02 +00:00
Terry Dontje
13be2d2a00 Correct a typo in a call to orte_show_help: "odle" should be "odls"
This commit was SVN r21979.
2009-09-21 13:22:37 +00:00
Ralph Castain
7138fd131f Final cleanup on new paffinity "if-avail" messages, plus fix one bug reported by Terry
This commit was SVN r21978.
2009-09-19 17:43:21 +00:00
Ralph Castain
2028017554 Modify the paffinity system to handle binding directives that are "soft" - i.e., when someone directs that we bind if the system supports it. This allows community members to distribute OMPI with default MCA param files that direct general binding policies, without having the distributed software fail if the system cannot support those policies.
The new options work by adding an ":if-avail" qualifier to the "bind-to-socket" and "bind-to-core" MCA params. If the system does not support this capability, the job will launch anyway. Without the qualifier, the job will abort with an error message indicating that the required functionality is not supported on this system.

This commit was SVN r21975.
2009-09-18 19:48:42 +00:00
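
The "soft" binding behavior described in 2028017554 can be sketched as follows. This is an illustrative Python model, not Open MPI code: the directive strings (for example "socket:if-avail"), the `BindDirective` type, and both function names are assumptions made for the example. Only the ":if-avail" qualifier and the launch-versus-abort outcome are taken from the commit message.

```python
from typing import NamedTuple


class BindDirective(NamedTuple):
    target: str      # e.g. "core" or "socket" (hypothetical spelling)
    required: bool   # False when the ":if-avail" qualifier is present


def parse_bind_directive(value: str) -> BindDirective:
    """Split a hypothetical directive string such as "socket:if-avail"."""
    target, _, qualifier = value.partition(":")
    if qualifier not in ("", "if-avail"):
        raise ValueError(f"unknown qualifier: {qualifier!r}")
    return BindDirective(target=target, required=(qualifier == ""))


def launch_decision(directive: BindDirective, system_supports_binding: bool) -> str:
    """Return "bind", "launch-unbound", or "abort"."""
    if system_supports_binding:
        return "bind"
    # Without the qualifier the job aborts; with it, the job launches anyway.
    return "abort" if directive.required else "launch-unbound"
```

For instance, `launch_decision(parse_bind_directive("core:if-avail"), False)` models a default parameter file being used on a system that cannot bind: the job still launches, just without binding.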
Ralph Castain
98a4450df6 Fix the seq mapper by initializing the proc object to NULL before claiming a slot for it
This commit was SVN r21969.
2009-09-17 05:18:37 +00:00
Ralph Castain
ae31af7dec Enable monitoring if configured to do so. Update the sensor framework
This commit was SVN r21964.
2009-09-09 21:00:27 +00:00