Ralph Castain
c749fefbd0
Instead of an odls-base mca param, make report_bindings a global param so that we can (a) detect it was set in the plm, and then (b) ensure it gets passed along to remote orteds so they will comply with the request.
...
This commit was SVN r22021.
2009-09-28 03:17:15 +00:00
Ralph Castain
47c9a5409e
Ensure that tools init the multicast channel correctly
...
This commit was SVN r22020.
2009-09-28 03:15:51 +00:00
Ralph Castain
ef0fd8b8d1
Return an error code if the job failed to start
...
This commit was SVN r22019.
2009-09-26 03:34:58 +00:00
Ralph Castain
e337fa686e
Correct handling of pointer array indexing
...
This commit was SVN r22018.
2009-09-26 03:33:55 +00:00
Ralph Castain
6fa2e81491
Correct handling of pointer array indexing
...
This commit was SVN r22017.
2009-09-26 03:33:26 +00:00
Ralph Castain
709b36efb4
Cleanup auto-wireup and enable tools to "discover" the HNP via multicast
...
This commit was SVN r22012.
2009-09-25 01:00:09 +00:00
Abhishek Kulkarni
2af7657db1
A few changes to the FTB notifier interface:
...
- add an orte ftb notifier help file for more verbose error messages
- check if we can connect to the FTB during component->query and close
the component, if we cannot.
- make the ftb component interface methods static.
- add mca parameters to override the default subscription style and
priority.
This commit was SVN r22011.
2009-09-24 23:56:41 +00:00
Ralph Castain
3167f0a0a0
Complete the next round of the multicast framework development. Needs further polish and an upgrade to handle message fragmentation, but is good enough for auto-bootstrap of orteds.
...
Teach the ess cm module to bootstrap orted launch
This commit was SVN r22006.
2009-09-23 20:57:49 +00:00
Josh Hursey
c9bd045cff
move {{{ess_env_ft_event_update_process_info}}} into SnapC {{{snapc_full_app_ft_event_update_process_info}}} where it should have been all along.
...
This commit was SVN r22004.
2009-09-23 18:29:13 +00:00
Josh Hursey
a6ee73156c
Add a verbose debug option, and add some error prints in the ESS ft_event code.
...
This commit was SVN r22003.
2009-09-23 17:05:49 +00:00
Josh Hursey
2769091261
Fix for the stalled scenario in which 'options' might be reset to NULL inadvertently.
...
Thanks to MTT for picking this up.
This commit was SVN r22002.
2009-09-23 13:26:48 +00:00
Ralph Castain
26bb6e8f79
Add a couple of non-orte multicast tests
...
This commit was SVN r22001.
2009-09-23 05:24:22 +00:00
Ralph Castain
dff0d01673
Yet another paffinity cleanup...sigh.
...
1. ensure that orte_rmaps_base_schedule_policy does not override cmd line settings
2. when you try to bind to more cores than we have, generate a not-enough-processors error message
3. allow npersocket -bind-to-core combination - because, yes, somebody actually wants to do it.
This commit was SVN r21996.
2009-09-22 18:44:53 +00:00
Josh Hursey
5406fdfb80
Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option).
...
This commit looks larger than it really is since it includes a fair amount of code cleanup.
The SIGSTOP/SIGCONT+checkpointing work uses some of the functionality in r20391. Basic use case below (note that the checkpoint generated is useable as usual if the stopped application is terminated).
{{{
shell 1) mpirun -np 2 -am ft-enable-cr my-app
... running ...
shell 2) ompi-checkpoint --stop -v MPIRUN_PID
[localhost:001300] [ 0.00 / 0.20] Requested - ...
[localhost:001300] [ 0.00 / 0.20] Pending - ...
[localhost:001300] [ 0.01 / 0.21] Running - ...
[localhost:001300] [ 1.01 / 1.22] Stopped - ompi_global_snapshot_1234.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt
shell 2) killall -CONT mpirun
... Application Continues execution in shell 1 ...
}}}
Other items in this commit are mostly cleanup that has been sitting off-trunk for too long:
* Add a new {{{opal_crs_base_ckpt_options_t}}} type that encapsulates the various options that could be passed to the CRS. Currently only TERM and STOP, but this makes adding others ''much'' easier.
* Eliminate ORTE_SNAPC_CKPT_STATE_PENDING_TERM, since the new options type makes it redundant.
* Lay some basic groundwork for some future features.
This commit was SVN r21995.
The following SVN revision numbers were found above:
r20391 --> open-mpi/ompi@0704b98668
2009-09-22 18:26:12 +00:00
Ralph Castain
8da3aa8d5c
Some (hopefully final!) adjustments and corrections to the paffinity support:
...
1. default -npersocket to force -bind-to-socket
2. if we cannot get a value for cores/socket, try using #logical cpus; otherwise, default to 1 core
3. add missing error message for not-enough-processors
4. since we no longer loop through orte_register_params twice, put the auto-detect of
topology info in the rte_init for hnp and std_orted
5. fix bind-to-core, bysocket combination
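The fallback order described in item 2 can be sketched as follows. This is only an illustration of the stated order, not the ORTE code: the function name `cores_per_socket` and its two arguments are hypothetical, and `None` or a non-positive value stands in for "could not be detected".

```python
from typing import Optional


def cores_per_socket(detected_cores: Optional[int],
                     logical_cpus: Optional[int]) -> int:
    """Pick a cores-per-socket value using the fallback order from item 2.

    Hypothetical helper: both arguments are assumed to be None (or <= 0)
    when the corresponding value could not be obtained from the system.
    """
    # First choice: the detected cores/socket value, if we have one.
    if detected_cores is not None and detected_cores > 0:
        return detected_cores
    # Second choice: fall back to the number of logical cpus.
    if logical_cpus is not None and logical_cpus > 0:
        return logical_cpus
    # Last resort: assume a single core.
    return 1
```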
This commit was SVN r21992.
2009-09-22 15:41:03 +00:00
Ralph Castain
12613352eb
Add missing header file
...
This commit was SVN r21990.
2009-09-22 13:07:57 +00:00
Ralph Castain
2210989e2d
Update the cm ess module to support orted bootstrap. Continue work towards bootstrap capability.
...
This commit was SVN r21989.
2009-09-22 02:16:40 +00:00
Ralph Castain
c3f9096fd9
Add a reliable multicast framework, with an initial basic module. This is configured out unless specifically requested via --enable-multicast.
...
This commit was SVN r21988.
2009-09-22 00:58:29 +00:00
Ralph Castain
82af6ee940
Update test
...
This commit was SVN r21987.
2009-09-22 00:55:02 +00:00
Terry Dontje
0ccf2d87b6
rename do-not-bind to bind-to-none and clean up an error message
...
This commit was SVN r21980.
2009-09-21 17:00:02 +00:00
Terry Dontje
13be2d2a00
Correct a typo: "odle" should be "odls" in the call to orte_show_help
...
This commit was SVN r21979.
2009-09-21 13:22:37 +00:00
Ralph Castain
7138fd131f
Final cleanup on new paffinity "if-avail" messages, plus fix one bug reported by Terry
...
This commit was SVN r21978.
2009-09-19 17:43:21 +00:00
Ralph Castain
2028017554
Modify the paffinity system to handle binding directives that are "soft" - i.e., when someone directs that we bind if the system supports it. This allows community members to distribute OMPI with default MCA param files that direct general binding policies, without having the distributed software fail if the system cannot support those policies.
...
The new options work by adding an ":if-avail" qualifier to the "bind-to-socket" and "bind-to-core" MCA params. If the system does not support this capability, the job will launch anyway. Without the qualifier, the job will abort with an error message indicating that the required functionality is not supported on this system.
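The hard versus soft behavior described above can be sketched as follows. This is a minimal illustration, not the paffinity implementation: the names `BindingDirective`, `parse_binding`, and `launch_allowed` are hypothetical, and the value strings in the usage note are assumed forms of a directive carrying the ":if-avail" qualifier.

```python
from typing import NamedTuple


class BindingDirective(NamedTuple):
    target: str     # the binding request itself, e.g. "bind-to-core"
    required: bool  # False when the ":if-avail" qualifier was given


def parse_binding(value: str) -> BindingDirective:
    """Split a directive into its target and optional ':if-avail' qualifier."""
    target, _, qualifier = value.partition(":")
    if qualifier not in ("", "if-avail"):
        raise ValueError(f"unknown binding qualifier: {qualifier!r}")
    return BindingDirective(target, required=(qualifier != "if-avail"))


def launch_allowed(directive: BindingDirective,
                   system_supports_binding: bool) -> bool:
    """A soft directive never blocks the launch; a hard one needs support."""
    return system_supports_binding or not directive.required
```

Under these assumptions, `launch_allowed(parse_binding("bind-to-core:if-avail"), False)` returns `True` (the job launches unbound), while `launch_allowed(parse_binding("bind-to-core"), False)` returns `False`, which corresponds to the abort-with-error case.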
This commit was SVN r21975.
2009-09-18 19:48:42 +00:00
Ralph Castain
98a4450df6
Fix the seq mapper by initializing the proc object to NULL before claiming a slot for it
...
This commit was SVN r21969.
2009-09-17 05:18:37 +00:00
Ralph Castain
ae31af7dec
Enable monitoring if configured to do so. Update the sensor framework
...
This commit was SVN r21964.
2009-09-09 21:00:27 +00:00
Ralph Castain
5fb3d13c24
Cleanup some pointer array addressing
...
This commit was SVN r21963.
2009-09-09 20:59:17 +00:00
Ralph Castain
e554fc282d
Add some diagnostic output when daemons die
...
This commit was SVN r21960.
2009-09-09 18:16:50 +00:00
Ralph Castain
c20d977a30
Report the allocate event, if requested
...
This commit was SVN r21959.
2009-09-09 17:47:58 +00:00
Ralph Castain
2688ad2c9f
Ensure the odls_types are included when referencing the APIs
...
This commit was SVN r21958.
2009-09-09 17:47:13 +00:00
Ralph Castain
cb7f608006
Remove debug output
...
This commit was SVN r21957.
2009-09-09 17:46:28 +00:00
Ralph Castain
51b13b3d5c
A few minor cleanups in where threads are unlocked.
...
Reset mpirun's exit code when we restart failed procs
This commit was SVN r21955.
2009-09-09 05:31:06 +00:00
Ralph Castain
8ae4b55d16
Enable a new command line option, --report-events, that instructs mpirun to RML-report specific events during the job's life to the requestor.
...
This commit was SVN r21954.
2009-09-09 05:28:45 +00:00
Ralph Castain
c877b1a5f8
Silence a compiler warning about no format
...
This commit was SVN r21951.
2009-09-08 15:03:14 +00:00
Ralph Castain
81b8bc5b54
Silence a compiler warning about no format
...
This commit was SVN r21950.
2009-09-08 15:02:48 +00:00
Ralph Castain
142036f2c0
Issue an error message and abort if the user requests a number of processes that conflicts with nperxxx directives when evaluated against available resources
...
This commit was SVN r21949.
2009-09-07 03:36:10 +00:00
Ralph Castain
ca09e8f604
Minor modification required to allow opal_paffinity_alone to default to bind-to-core
...
This commit was SVN r21948.
2009-09-05 15:24:26 +00:00
Ralph Castain
17444243f7
Correct the bit mask to properly set the binding policy
...
This commit was SVN r21934.
2009-09-03 17:58:23 +00:00
Jeff Squyres
e1fe03ad44
Minor grammar fixes, and use "#" for separating lines, not blank lines.
...
This commit was SVN r21931.
2009-09-03 07:02:21 +00:00
Ralph Castain
0421a49844
Update the xml support to allow -xml-file foo whereby we redirect all xml formatted output (and ONLY xml formatted output) to a specified file
...
This commit was SVN r21930.
2009-09-02 18:03:10 +00:00
Ralph Castain
d3d34f8f15
Correct a bug in the assignment of node index value. Ensure we set the app number so that MPI attributes get set correctly.
...
This commit was SVN r21927.
2009-09-02 01:15:44 +00:00
Ralph Castain
50ca27c1c8
Ensure that procs launched natively by slurm do not mistakenly identify themselves as daemons to the system
...
This commit was SVN r21926.
2009-09-01 17:57:15 +00:00
Lenny Verkhovsky
2a594fec6c
Added a help message to the rankfile mapper for failures caused by using an alias instead of the full hostname
...
This commit was SVN r21919.
2009-09-01 11:17:32 +00:00
Ralph Castain
59645c5c8e
Per direction from the slurm team, change the envar we look at to get our allocation
...
This commit was SVN r21915.
2009-08-30 15:57:27 +00:00
Ralph Castain
0394a4884d
Set up cpus-per-proc and cpus-per-rank as synonyms, both in mca params and on mpirun cmd line
...
This commit was SVN r21914.
2009-08-30 14:30:36 +00:00
Ralph Castain
ef4cdeeb69
Fix round-robin mapping when bind-to-socket in cases where #procs > #sockets and #cores
...
This commit was SVN r21913.
2009-08-29 03:36:21 +00:00
Ralph Castain
433673c64f
Report bindings in all cases, including external bindings and slot lists
...
This commit was SVN r21911.
2009-08-28 13:58:46 +00:00
Ralph Castain
3ef028ca23
Trap mpirun error messages in xml format
...
This commit was SVN r21910.
2009-08-28 02:46:15 +00:00
Ralph Castain
59f08dd2ff
Support the combination of npersocket and bind-to-core
...
This commit was SVN r21909.
2009-08-28 02:31:26 +00:00
Shiqing Fan
fb777134cf
Adjust the command string length.
...
This commit was SVN r21905.
2009-08-27 13:42:55 +00:00
Ralph Castain
2d27bc9824
Default npersocket to bind-to-socket unless otherwise directed
...
This commit was SVN r21904.
2009-08-27 13:21:14 +00:00