Ralph Castain
69fe5ca69b
Correctly compute bynode mapping, even in the presence of a $#$%#@^$ rankfile
...
This commit was SVN r22748.
2010-03-02 05:21:42 +00:00
Ralph Castain
bef06d52bc
Silence compiler warning
...
This commit was SVN r22747.
2010-03-01 21:04:26 +00:00
Ralph Castain
5514d9c673
Fix the stupid rankfile mapper again, hopefully not breaking everything else to accommodate it. Looks like the round-robin mappers still work, at least...
...
This commit was SVN r22746.
2010-03-01 20:40:47 +00:00
Ralph Castain
96590b9fad
Filter multicast messages to avoid cross-job confusion
...
This commit was SVN r22729.
2010-02-28 18:22:56 +00:00
Ralph Castain
359dc5cad3
Complete the app_idx change by cleaning up warnings in mappers
...
This commit was SVN r22728.
2010-02-27 18:14:27 +00:00
Ralph Castain
2541aa98ab
Change the app_idx type to uint32_t to support users who use large numbers of app_contexts. Set it up as a new typedef so we can change it later without as much effort.
...
This commit was SVN r22727.
2010-02-27 17:37:34 +00:00
Ralph Castain
6c0d7940c7
Add a new MCA param (and corresponding mpirun cmd line option) to output the debugger proctable info after launch. The output is just the job map with the process pid included, so you get a node-by-node list of the process ranks on that node and their pids.
...
Works for initial launch and comm_spawn. Both xml and non-xml output are available.
This commit was SVN r22725.
2010-02-27 08:32:25 +00:00
Shiqing Fan
4a3f42d159
Correctly initialize the CCP command line buffer.
...
This commit was SVN r22721.
2010-02-26 15:53:00 +00:00
Ralph Castain
c6448587fe
It is okay to not select an rmcast module
...
This commit was SVN r22719.
2010-02-26 02:39:04 +00:00
Ralph Castain
b89a21f0fa
Grrr....cleanup the new module
...
This commit was SVN r22711.
2010-02-25 06:08:04 +00:00
Ralph Castain
8954700845
No, we don't have a .windows file...
...
This commit was SVN r22710.
2010-02-25 02:18:54 +00:00
Ralph Castain
18c7aaff08
Update the grpcomm framework to be more thread-friendly.
...
Modify the orte configure options to specify --enable-multicast such that it directs components to build or not instead of littering the code base with #if's. Remove those #if's where they used to occur.
Add a new grpcomm "mcast" module to support multicast operations. Still some work required to properly perform daemon collectives for comm_spawn operations. New module only builds when --enable-multicast is provided, and when specifically selected.
This commit was SVN r22709.
2010-02-25 01:11:29 +00:00
Shiqing Fan
7a5a5ce024
Add a global option for inputting the head node name for Windows CCP through the command line.
...
This commit was SVN r22688.
2010-02-23 19:42:51 +00:00
Jeff Squyres
f65eebf53d
More changes for NetBSD. Thanks to Aleksej Saushev for this patch.
...
This commit was SVN r22680.
2010-02-22 15:05:09 +00:00
Ralph Castain
65a8ab4267
Cleanup the kill_procs command. Send a SIGTERM initially to allow C/R operations, and to be polite. Correctly update proc state if there is a problem so we don't hang.
...
The change to just using SIGKILL was originally done due to problems whereby waitpid thought a proc had died, but it hadn't. We'll continue debugging that problem separately, but SIGTERM is required for C/R to work properly.
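A minimal sketch of the "SIGTERM first, SIGKILL as a fallback" pattern described above. This is not the ORTE kill_procs code: the helper name `terminate_child`, the grace period, and the 10 ms polling interval are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not an ORTE function): send SIGTERM to a child,
 * poll for up to grace_ms for it to be reaped, then fall back to SIGKILL.
 * Returns SIGTERM if the child was reaped during the grace period,
 * SIGKILL if it had to be forced, 0 if it was already gone, -1 on error. */
static int terminate_child(pid_t pid, int grace_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int status;

    assert(pid > 0 && grace_ms >= 0);

    /* Be polite first: SIGTERM can be caught, so the process gets a
     * chance to clean up (or to take part in a checkpoint). */
    if (kill(pid, SIGTERM) != 0) {
        return (errno == ESRCH) ? 0 : -1;
    }
    for (int waited = 0; waited < grace_ms; waited += 10) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            return SIGTERM;
        }
        if (r < 0) {
            return (errno == ECHILD) ? 0 : -1;
        }
        nanosleep(&tick, NULL);
    }

    /* The child ignored or blocked SIGTERM: force termination. */
    if (kill(pid, SIGKILL) != 0) {
        return (errno == ESRCH) ? 0 : -1;
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return SIGKILL;
}
```

The return value lets a caller record whether the proc went away on its own or had to be forced, which is the kind of state update the commit refers to.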
This commit was SVN r22674.
2010-02-21 19:35:32 +00:00
Ralph Castain
2be03b4fb6
Cleanup a few bugs in the rmcast subsystem
...
This commit was SVN r22650.
2010-02-18 01:54:45 +00:00
Ralph Castain
9a5fdbb622
Continue development of reliable multicast
...
This commit was SVN r22616.
2010-02-14 19:20:56 +00:00
Josh Hursey
a3583b8f57
Fix --bynode option to remember for subsequent jobs where it left off last time.
...
Add a ''map_bynode'' info key to determine if the job to be started by comm_spawn* should be mapped by node or by slot. Default is to map according to the default policy set when the parent job was started.
cmr:v1.5.1
This commit was SVN r22564.
2010-02-05 15:37:49 +00:00
Iain Bason
28f03a2d86
Suspend/resume enhancements:
...
Have orte call setpgrp after forking (but before exec) when
orte_forward_job_control is set. Then have it send signals to the
child's process group. This allows suspending jobs that fork.
If a SIGTSTP arrives before the processes have been launched, then
record it and suspend them right after launching.
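The approach above can be sketched roughly as follows. The function and variable names are hypothetical, the sketch uses setpgid() rather than setpgrp(), and it uses SIGSTOP for the deferred suspend; none of this is taken from the ORTE sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical launcher state: set when a suspend request arrived
 * before any process had been launched. */
static int suspend_pending = 0;

/* Fork a child, move it into its own process group before exec, and
 * return its pid (which is also the group id), or -1 on failure. */
static pid_t launch_in_own_group(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        setpgid(0, 0);            /* child becomes leader of a new group */
        execvp(argv[0], argv);
        _exit(127);               /* only reached if exec failed */
    }
    /* Repeat the call from the parent so the group exists before we
     * signal it; whichever call runs second is harmless. */
    setpgid(pid, pid);

    if (suspend_pending) {
        /* Assumption: SIGSTOP is used here for the deferred suspend. */
        kill(-pid, SIGSTOP);
        suspend_pending = 0;
    }
    return pid;
}

/* Signal the whole group, so anything the child forked is included. */
static int signal_job(pid_t pgid, int sig)
{
    assert(pgid > 1);
    return kill(-pgid, sig);
}
```

Passing a negative pid to kill() addresses the process group, which is what allows a job that forks to be suspended as a unit.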
This commit was SVN r22557.
2010-02-04 15:47:20 +00:00
Shiqing Fan
bbcf1f71c4
Remove an incorrect callback, which was based on the old source base without WMI. It does no harm on 32-bit Windows, but it sometimes seems to cause exceptions on 64-bit Windows. All this callback does is wait on the given pid, which is actually a remote pid, so it won't work as expected.
...
cmr:v1.4.2
cmr:v1.5
This commit was SVN r22549.
2010-02-04 10:47:28 +00:00
Ralph Castain
16b7bc7a82
Sigh...get the order right to match unpack
...
This commit was SVN r22539.
2010-02-03 15:50:43 +00:00
Ralph Castain
e88627a7ca
Ensure we don't go through rml open/select more than once.
...
Open the rml to get the uri when bootstrapping daemons
This commit was SVN r22538.
2010-02-03 15:38:32 +00:00
Ralph Castain
cb1007b5a9
Pass back the number of daemons in the system
...
This commit was SVN r22537.
2010-02-03 14:31:16 +00:00
Shiqing Fan
bdc13dacb1
A type cast.
...
This commit was SVN r22520.
2010-01-31 20:22:22 +00:00
Ralph Castain
7badff9d2d
Okay to return no available nodes for mapping when launching daemons - just means there is nothing to do
...
This commit was SVN r22509.
2010-01-28 22:58:28 +00:00
Ralph Castain
86dd1d41af
Handle zero-length iovecs in multicast messages
...
This commit was SVN r22507.
2010-01-28 15:29:43 +00:00
Ralph Castain
f66b6cae23
Enable the boot of an orted "virtual machine". Modify the mapper framework to allow mapping of only daemons. Remove the cm ras module as no longer required. Modify the orted code to always send back node arch info. Remove the "--enable-bootstrap" configure option as this feature will now always be available.
...
This commit was SVN r22480.
2010-01-25 22:25:13 +00:00
Josh Hursey
b749ecbab8
This commit fixes trac:2190.
...
Originally the patch was to improve the error message, but when digging into the code I found a subtle bug. If the daemon does not tell the HNP what CRS component it used, then the HNP tries to figure it out from the metadata (this is an uncommon case). The path the HNP used was not complete, so it was unable to find the metadata information. This patch fixes this by adding the 'snapshot_reference' to the 'snapshot_location' which completes the path for this search.
cmr:v1.4 (needs a custom patch)
cmr:v1.5
This commit was SVN r22479.
The following Trac tickets were found above:
Ticket 2190 --> https://svn.open-mpi.org/trac/ompi/ticket/2190
2010-01-25 20:28:38 +00:00
Ralph Castain
e4bf33dcab
Just a slight efficiency improvement - why check a flag twice?
...
This commit was SVN r22472.
2010-01-23 03:57:56 +00:00
Ralph Castain
3fe5e3e142
Propagate the user's callback data during non-blocking sends
...
This commit was SVN r22432.
2010-01-15 20:02:47 +00:00
Shiqing Fan
ad763c327d
Restore several linked libraries that were deleted by mistake in r22405.
...
This commit was SVN r22415.
The following SVN revision numbers were found above:
r22405 --> open-mpi/ompi@872a4047ba
2010-01-14 21:50:42 +00:00
Shiqing Fan
872a4047ba
Fix the bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions.
...
In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically, but in CMake 2.8 this behavior has changed: it only adds the dependencies and does no linking, which causes link errors at build time.
This commit was SVN r22405.
2010-01-14 18:10:20 +00:00
Ralph Castain
cec840f6b9
The ability to add procs to a running job was unfortunately borked when we added the detection of a proc exiting before calling init. Re-enable it here, ensuring that procs that are being restarted and/or added to a job do -not- call barrier during orte_init.
...
This commit was SVN r22404.
2010-01-14 17:59:42 +00:00
Shiqing Fan
0259fa0b9c
Correct a few variable names.
...
This commit was SVN r22401.
2010-01-14 10:55:15 +00:00
Ralph Castain
adb2430e24
Missed one place, of course
...
This commit was SVN r22400.
2010-01-13 23:11:44 +00:00
Ralph Castain
c782c98433
Rename the "basic" rmcast component "udp" to more accurately reflect its operation
...
This commit was SVN r22399.
2010-01-13 23:01:25 +00:00
Ralph Castain
237eb4e8df
For some strange reason, every so often it appears possible for the event library to trip the read event on a socket, yet have the read itself yield an error. If/when that happens, report the error and continue on.
...
This happens rarely, but it does seem to happen.
This commit was SVN r22398.
2010-01-13 19:23:28 +00:00
Ralph Castain
ae1719306b
Fix a bug in non-blocking sends
...
This commit was SVN r22395.
2010-01-13 05:37:36 +00:00
Ralph Castain
b35486d945
The CM ess module needs to open the sysinfo framework and select modules prior to when others need it. Thus, setup a flag to avoid multiple open/select within that framework.
...
This commit was SVN r22393.
2010-01-12 22:03:49 +00:00
Ralph Castain
48486df4fe
Cleanup some diagnostics
...
This commit was SVN r22389.
2010-01-12 01:25:19 +00:00
Ralph Castain
9f3ccebeaa
We need to barrier for orte apps when the job is initially started, but we must not do the barrier when a proc is restarted as the other procs in the job won't know to participate.
...
This commit was SVN r22388.
2010-01-10 02:21:30 +00:00
Ralph Castain
16b16c5cb8
Fix a silly typo
...
This commit was SVN r22387.
2010-01-09 15:34:49 +00:00
Ralph Castain
add84178ef
Fix a silly typo that prevented tcp multicast messages from being delivered
...
This commit was SVN r22384.
2010-01-08 20:30:27 +00:00
Brian Barrett
86d8356b13
Updates to allow OMPI to build on Cray XT platforms running Catamount
...
This commit was SVN r22381.
2010-01-07 18:14:03 +00:00
Ralph Castain
09763ec711
Since we modified ORTE to declare that any process that terminates after calling "init" while at least one other process has not yet called "init" is an error, we have to ensure that non-MPI ORTE apps (i.e., apps that call orte_init but not mpi_init) include a barrier in orte_init. Otherwise, fast ORTE apps almost always wind up triggering the "abnormal termination" condition.
...
The barrier is protected with a test to ensure that MPI apps don't execute it and wind up doing two barriers during their init.
This commit was SVN r22378.
2010-01-07 06:58:01 +00:00
Ralph Castain
ef1bfaa823
Add the ability to track how many times a process has been restarted, and to communicate that value to a process when it is restarted in case it needs to take action when it is restarted as opposed to being started for the first time.
...
This commit was SVN r22377.
2010-01-07 01:19:44 +00:00
Ralph Castain
a12de9d1e8
Oh, the pain one little word can make...sigh.
...
This commit was SVN r22364.
2010-01-05 23:29:56 +00:00
Ralph Castain
5faf857840
Add a new tag for pnp/multicast send of direct messages
...
This commit was SVN r22352.
2009-12-31 20:34:58 +00:00
Ralph Castain
b3a58f8b83
Pass the correct address when packing iovec bytes for multicast.
...
Thanks to Rick Payne for the correction.
This commit was SVN r22351.
2009-12-30 20:59:31 +00:00
Ralph Castain
89a6131032
Check the return status code on all dss operations within the rmcast modules
...
This commit was SVN r22349.
2009-12-30 01:45:31 +00:00
Ralph Castain
50074f0770
Remove unused (and uninitialized) variable
...
This commit was SVN r22340.
2009-12-24 01:36:47 +00:00
Ralph Castain
aaf1119f40
Garrr...ensure we accurately know when to update the contact info so we don't do it incorrectly as procs terminate, thus causing the system to think that perfectly good apps are incorrectly terminating.
...
Thanks to George for pointing out the problem
This commit was SVN r22332.
2009-12-17 20:40:21 +00:00
Ralph Castain
db2cbd3166
Okay, okay - do it at destruct time too.
...
This commit was SVN r22331.
2009-12-17 20:08:49 +00:00
Ralph Castain
a56e09c874
Per suggestion from Josh, init the sender field of the msg_packet object to INVALID
...
This commit was SVN r22330.
2009-12-17 20:03:35 +00:00
Ralph Castain
8ab962411c
Detect the scenario where one or more procs fail to call orte/ompi_init while others in the job do. This scenario can cause the job to hang as MPI_Init contains a barrier operation that will not complete. Although ORTE does not contain such a barrier, it still will be considered as an error scenario so that we can detect the MPI case - otherwise, ORTE has no knowledge of OMPI and wouldn't know how to differentiate the use-cases.
...
Take advantage of the changes to update the routed_base_receive code to avoid message overlap.
This commit was SVN r22329.
2009-12-17 19:39:53 +00:00
Josh Hursey
313acba4ce
Move the mca_base_is_component_required() functionality to mca/base per suggestion so that it can be reused in other components.
...
This commit was SVN r22327.
2009-12-17 15:12:26 +00:00
Josh Hursey
a418a7dc43
Make sure to look in not only the env var, but also {{{orte_routed_base_components}}} to confirm that this is the only component available, and intended for selection.
...
This commit was SVN r22323.
2009-12-16 20:17:26 +00:00
Josh Hursey
646f90a90a
Small fix for an edge case
...
This commit was SVN r22322.
2009-12-16 18:06:05 +00:00
George Bosilca
a2310808f1
Santa's back! Fix all warnings about the deprecated usage of
...
stringWithCString as well as the casting issue between NSInteger and
%d. The first is solved by using stringWithUTF8String, which apparently
will always give the right answer (sic). The second is fixed as suggested
by Apple by casting the NSInteger (hint: which by definition is large
enough to hold a pointer) to a long and use %ld in the printf.
This commit was SVN r22317.
2009-12-16 00:06:37 +00:00
Ralph Castain
9acec283af
Add a new TCP module to the reliable multicast framework. This module uses ORTE's grpcomm.xcast functionality to "fake" multicasts for environments where regular multicast isn't reliable.
...
Modify the startup logic to allow for this use-case.
This commit was SVN r22310.
2009-12-15 01:18:27 +00:00
Ralph Castain
0ffa4f2f0c
Ensure we cancel the lingering recv in the allgather code to avoid having incorrect counters.
...
Thanks to Damien for spotting the problem.
This commit was SVN r22301.
2009-12-14 13:21:56 +00:00
George Bosilca
501d1cc4ad
Set default values to avoid using these variables uninitialized.
...
This commit was SVN r22279.
2009-12-08 18:42:22 +00:00
Ralph Castain
e3a2e66ec2
Add limits on rmcast seq numbers
...
This commit was SVN r22269.
2009-12-05 01:20:14 +00:00
Ralph Castain
4a82dd9a45
Add message sequence numbers to multicast messages, tracked by channel
...
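A small sketch of per-channel sequence numbering, assuming a fixed-size channel table and 32-bit counters that wrap naturally. The type and function names are hypothetical and are not the rmcast API; the gap check on the receive side is an assumed use of the numbers, not something the commit states.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CHANNELS 8   /* hypothetical table size */

/* Hypothetical tracker; zero-initialise before use. */
typedef struct {
    uint32_t next_send[MAX_CHANNELS];  /* next number to stamp, per channel */
    uint32_t last_recv[MAX_CHANNELS];  /* last number received, per channel */
    int      seen[MAX_CHANNELS];       /* nonzero once a message has arrived */
} seq_tracker_t;

/* Return the sequence number for the next outgoing message on `channel`.
 * The counter wraps to 0 after UINT32_MAX (unsigned overflow). */
static uint32_t seq_next(seq_tracker_t *t, unsigned channel)
{
    assert(channel < MAX_CHANNELS);
    return t->next_send[channel]++;
}

/* Record an incoming sequence number and return how many messages were
 * skipped since the previous one on the same channel (0 if in order). */
static uint32_t seq_check(seq_tracker_t *t, unsigned channel, uint32_t seq)
{
    uint32_t missed = 0;

    assert(channel < MAX_CHANNELS);
    if (t->seen[channel]) {
        /* Unsigned subtraction stays correct across the wrap point. */
        missed = seq - (t->last_recv[channel] + 1u);
    }
    t->seen[channel] = 1;
    t->last_recv[channel] = seq;
    return missed;
}
```

Keeping one counter per channel means traffic on one channel never creates apparent gaps on another.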
This commit was SVN r22262.
2009-12-04 04:17:44 +00:00
Ralph Castain
4ec9c4b532
Do a better job of ensuring session directories are removed when procs abnormally terminate and/or we order "kill local procs"
...
This commit was SVN r22258.
2009-12-03 04:46:17 +00:00
Ralph Castain
93ebed48b1
Update the multicast test. Some cleanups to the basic rmcast module
...
This commit was SVN r22257.
2009-12-03 04:30:58 +00:00
Ralph Castain
66efa05a53
Don't cancel the recv unless it was issued or else we generate an error whenever we launch an app without having to launch daemons (e.g., a completely local launch to mpirun)
...
This commit was SVN r22256.
2009-12-03 04:28:43 +00:00
George Bosilca
7bf1d7a1c4
A more asynchronous startup over rsh/ssh.
...
This commit was SVN r22253.
2009-12-02 20:29:32 +00:00
Ralph Castain
a0d5c80ce0
Add a new framework for discovering local resource information such as cpu type/model, #cpus, available physical memory, etc. Two initial components (darwin and linux) are provided. This is needed to support bootstrap operations where daemons are started at node boot, and applications where initial knowledge of cpu identification is needed to guide framework component selection.
...
Add orte configuration option to control the use of the framework in the system. Although the code will build, it will not be active unless configured with --enable-bootstrap.
If bootstrap is enabled and the new opal_sysinfo framework can successfully determine the cpu model, pass that info to the application as an MCA param to support some work at Sun.
Also, have daemons report back the resources they find to guide process mapping in bootstrap operations (i.e., where the daemon starts at node boot as opposed to being launched at application start).
Adjust some platform files to enable these capabilities.
This commit was SVN r22244.
2009-11-30 23:11:25 +00:00
Ralph Castain
e38a0eab9f
Remove the fddp and sensor frameworks - relocated to new cluster mgr project
...
This commit was SVN r22240.
2009-11-27 22:14:47 +00:00
Rainer Keller
70a69e796f
- Get rid of a small nuisance: after installation of the
...
alps-resid script, set it to exec, to allow:
export OMPI_ALPS_RESID=`$OMPI/share/openmpi/ras-alps-command.sh`
This commit was SVN r22234.
2009-11-25 19:01:33 +00:00
Ralph Castain
92733b13d9
Add a couple of new tests to the orte system.
...
Modify the job_complete check so we don't kill jobs when a single proc was terminated by ORTE command via plm.terminate_procs
Still dies gracefully with a ctrl-c, and behaves as before when using plm.terminate_job
This commit was SVN r22227.
2009-11-20 01:47:49 +00:00
Ralph Castain
5e031d9ded
Let a restarted process have access to all known nodes instead of only those already in its prior job map
...
This commit was SVN r22225.
2009-11-19 19:45:11 +00:00
Ralph Castain
852e5d9ee0
Add some diag output
...
This commit was SVN r22224.
2009-11-19 19:43:36 +00:00
Ralph Castain
a401f05ea3
Add some diagnostics to chase down forced termination of procs. Ensure that procs are removed from the local data list upon termination
...
This commit was SVN r22223.
2009-11-19 19:43:10 +00:00
Ralph Castain
8dc08e304f
No longer require name passed separately
...
This commit was SVN r22221.
2009-11-19 19:41:41 +00:00
Ralph Castain
1a44b84b25
If a process is in certain states (e.g., polling for messages in the event lib), then it can blissfully ignore SIGTERM when we try to order it to die. Unfortunately, the OS thinks the process actually did die, leading us to leave orphaned procs around.
...
The only sure way to kill the thing is with SIGKILL. After hours spent trying to debug this bizarre situation with a reliable reproducer, I finally tracked it down and fixed it.
Go figure...I sure can't.
This commit was SVN r22220.
2009-11-19 17:25:15 +00:00
Shiqing Fan
11ad25fa77
A few windows fixes:
...
Add a missing value for the configure file.
Fix the bug that generated a wrong svn version number.
Correct the wrong string length of the headnode name.
cmr:v1.5
cmr:v1.3.4
This commit was SVN r22219.
2009-11-18 09:43:47 +00:00
Ralph Castain
840766a894
Update the rmcast APIs to include tag params and reorder them to look like their rml cousins
...
This commit was SVN r22218.
2009-11-17 15:58:59 +00:00
Ralph Castain
aea1ab3bd6
Remove diagnostic
...
This commit was SVN r22216.
2009-11-11 22:16:15 +00:00
Ralph Castain
6496ce7212
Expand the reliable multicast APIs to support sending/recving of iovecs
...
This commit was SVN r22213.
2009-11-11 22:10:35 +00:00
Rainer Keller
366bd96c88
- Allow to work without xt-catamount module on Jaguar,
...
reducing the number of components that up to now needed to be
deselected.
This commit was SVN r22205.
2009-11-09 14:26:24 +00:00
Rainer Keller
f121e46db1
- Finalize ornl_configure
...
This commit was SVN r22178.
2009-11-01 03:25:57 +00:00
Rainer Keller
7dfe709ac1
- Initialize n before usage.
...
This commit was SVN r22169.
2009-10-29 15:52:53 +00:00
Ralph Castain
13d86e100b
Courtesy of Ralph and Jeff:
...
Continue the reorganization of the configure system. Move files from the main config directory to their appropriate level-specific config directories. Modify the configure system to correctly handle compiler detection, test, and setup so that all things pertaining to opal and orte are done at the lower level, with the ompi configure system only looking at mpi-specific options.
Ensure the wrapper compilers for orte and ompi only get built when appropriate. Add support for c++ to the orte wrapper compilers, both script and non-script versions.
This commit was SVN r22138.
2009-10-24 01:04:35 +00:00
Tim Mattox
4acfbe6554
Unfortunately, the typos that r22129 tried to fix were not
...
as simple as I or Ralph had hoped. This should be the real fix,
or very close to it. I can now see both the sensor and rmcast
information from ompi_info when configured
with --enable-monitoring --enable-multicast
This commit was SVN r22131.
The following SVN revision numbers were found above:
r22129 --> open-mpi/ompi@02ff00dfb5
2009-10-23 02:38:51 +00:00
Pavel Shamis
7425255be5
Fixing compilation failure. Adding missing include.
...
This commit was SVN r22119.
2009-10-21 16:28:40 +00:00
Ralph Castain
ee82d42a1c
Add a new sensor component that pulls data via an external shared memory interface
...
Only builds when the appropriate library is present
This commit was SVN r22114.
2009-10-20 23:45:35 +00:00
Ralph Castain
f1f156d57b
Make rmaps base open function play nicely with ompi_info
...
This commit was SVN r22111.
2009-10-20 07:28:23 +00:00
Ralph Castain
ff9d72b3ab
Add a new multicast tag for collecting ps data
...
This commit was SVN r22107.
2009-10-16 04:21:22 +00:00
Ralph Castain
49ce2b4342
Add a new interface to the rmcast framework to query the output channel for the proc
...
This commit was SVN r22105.
2009-10-15 17:47:42 +00:00
Ralph Castain
99c67183d2
Minor cleanups, mainly to ensure we correctly block on blocking sends
...
This commit was SVN r22102.
2009-10-15 02:39:15 +00:00
Ralph Castain
2665825693
Correct an error that causes the system to "bounce" when we order a job killed. We didn't used to discriminate between a process being ordered to die, and a process that was aborted by an external signal. Unfortunately, that means the error mgr gets called and told a process abnormally aborted when we order termination, thus causing the errmgr to send out a "kill procs" command again.
...
Wouldn't be so bad, except...the errmgr orders the termination of ALL procs, which kills any other job that should have been left alone.
Add a new proc and job state indicating "killed_by_cmd" so we can tell the difference between a proc/job that was deliberately terminated by us vs one that is killed by external signal.
This change was tested to ensure it didn't interfere with ctrl-c operation (it doesn't - we order termination of all jobs when we get a ctrl-c).
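The distinction described above can be illustrated with a simplified state model. The enum values and function names below are hypothetical stand-ins, not ORTE's actual state codes or errmgr interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified state set; ORTE's real state codes differ. */
typedef enum {
    PROC_RUNNING,
    PROC_TERMINATED,       /* normal exit */
    PROC_ABORTED_BY_SIG,   /* killed by an external signal */
    PROC_KILLED_BY_CMD     /* terminated on our own order */
} proc_state_t;

/* Choose the state to record for a proc that died from a signal,
 * depending on whether we ordered the kill ourselves. */
static proc_state_t classify_signalled_exit(bool kill_was_ordered)
{
    return kill_was_ordered ? PROC_KILLED_BY_CMD : PROC_ABORTED_BY_SIG;
}

/* Only unexpected deaths are reported to the error manager; a proc we
 * killed on purpose must not trigger another "kill procs" round. */
static bool should_notify_errmgr(proc_state_t state)
{
    assert(state <= PROC_KILLED_BY_CMD);
    return state == PROC_ABORTED_BY_SIG;
}
```

With a separate state for deliberate kills, an ordered termination no longer looks like an abnormal abort, so the "bounce" cannot start.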
This commit was SVN r22100.
2009-10-14 22:49:56 +00:00
Ralph Castain
18960a9c5a
Refactor the multicast support so the data type objects can be accessed beyond just the one component
...
Ensure that the local node is included in the allocation prior to bootstrap discovery
This commit was SVN r22099.
2009-10-14 17:43:40 +00:00
Ralph Castain
bc869636be
Reset the verbosity levels to suppress debug output
...
This commit was SVN r22095.
2009-10-13 15:29:38 +00:00
Ralph Castain
e501589b3b
Cleanup the bootstrap procedure for multiple daemons starting up
...
This commit was SVN r22094.
2009-10-13 15:14:54 +00:00
Ralph Castain
c25dd14440
Correctly set the multicast interface, cleanup a comment
...
This commit was SVN r22093.
2009-10-13 15:14:28 +00:00
Ralph Castain
d8d80d6f1a
Closes trac:2054. Check if a user specifies more cpus-per-rank than there are cpus in a socket - if so, politely tell them "you are stupid" and abort.
...
This commit was SVN r22091.
The following Trac tickets were found above:
Ticket 2054 --> https://svn.open-mpi.org/trac/ompi/ticket/2054
2009-10-13 04:19:07 +00:00
Ralph Castain
1475d34c13
Ensure we default to byslot mapping
...
This commit was SVN r22090.
2009-10-11 23:50:42 +00:00
Ralph Castain
84cc847be8
Next phase of auto-wireup using multicast. Enable use of multicast groups to separate comm from different application groups. Have the orted bootstrap message go to a different rml tag so the node can be added to the pool.
...
This commit was SVN r22083.
2009-10-10 01:19:56 +00:00
Ralph Castain
40e2299fa7
Test to ensure that num_procs was provided for the resilient mapper - it cannot be used with options like npernode.
...
Cleanup the show_help text file
This commit was SVN r22082.
2009-10-09 15:26:23 +00:00
Shiqing Fan
7dff65cbc9
Clean up a little bit.
...
Add an option for setting up the job name.
This commit was SVN r22053.
2009-10-06 07:52:43 +00:00
Ralph Castain
dcab61ad83
Restore the prior default rank assignment scheme for round-robin mappers. Ensure that each app_context has sequential vpids.
...
This commit was SVN r22048.
2009-10-02 03:16:18 +00:00
Ralph Castain
a15c58c583
Fix the proc assignment into the job data object during assignment of vpids as comm_spawned procs were being overwritten by their parents with the same vpid.
...
Add a little debug output when updating proc state
This commit was SVN r22042.
2009-10-01 13:44:34 +00:00
Ralph Castain
51f64aaf96
Add a new ras module to support bootstrap operations. Additional functionality may eventually be required in the component, but for now all it does is provide a mechanism for ensuring that other allocations don't confuse the system.
...
Only active if specifically directed to use it
This commit was SVN r22040.
2009-09-30 23:30:24 +00:00
Ralph Castain
1d7ab97c84
Update the multicast framework to allow specification of different message scopes per various RFCs. Redefine the API a little to utilize channel numbers without worrying about the specifics of their addressing
...
This commit was SVN r22037.
2009-09-30 14:40:43 +00:00
Ralph Castain
5a24d6f60e
Remove an option that the orteds don't actually support...
...
This commit was SVN r22027.
2009-09-29 02:08:27 +00:00
Ralph Castain
c749fefbd0
Instead of an odls-base mca param, make report_bindings a global param so that we can (a) detect it was set in the plm, and then (b) ensure it gets passed along to remote orteds so they will comply with the request.
...
This commit was SVN r22021.
2009-09-28 03:17:15 +00:00
Ralph Castain
47c9a5409e
Ensure that tools init the multicast channel correctly
...
This commit was SVN r22020.
2009-09-28 03:15:51 +00:00
Ralph Castain
ef0fd8b8d1
Return an error code if the job failed to start
...
This commit was SVN r22019.
2009-09-26 03:34:58 +00:00
Ralph Castain
e337fa686e
Correct handling of pointer array indexing
...
This commit was SVN r22018.
2009-09-26 03:33:55 +00:00
Ralph Castain
709b36efb4
Cleanup auto-wireup and enable tools to "discover" the HNP via multicast
...
This commit was SVN r22012.
2009-09-25 01:00:09 +00:00
Abhishek Kulkarni
2af7657db1
A few changes to the FTB notifier interface:
...
- add an orte ftb notifier help file for more verbose error messages
- check if we can connect to the FTB during component->query and close
the component, if we cannot.
- make the ftb component interface methods static.
- add mca parameters to override the default subscription style and
priority.
This commit was SVN r22011.
2009-09-24 23:56:41 +00:00
Ralph Castain
3167f0a0a0
Complete the next round of the multicast framework development. Needs further polish, upgrade to handle message fragmentation - but good enough for auto-bootstrap of orteds.
...
Teach the ess cm module to bootstrap orted launch
This commit was SVN r22006.
2009-09-23 20:57:49 +00:00
Josh Hursey
c9bd045cff
move {{{ess_env_ft_event_update_process_info}}} into SnapC {{{snapc_full_app_ft_event_update_process_info}}} where it should have been all along.
...
This commit was SVN r22004.
2009-09-23 18:29:13 +00:00
Josh Hursey
a6ee73156c
Add a verbose debug option, and add some error prints in the ESS's ft_event code.
...
This commit was SVN r22003.
2009-09-23 17:05:49 +00:00
Josh Hursey
2769091261
Fix for the stalled scenario in which 'options' might be reset to NULL inadvertently.
...
Thanks to MTT for picking this up.
This commit was SVN r22002.
2009-09-23 13:26:48 +00:00
Ralph Castain
dff0d01673
Yet another paffinity cleanup...sigh.
...
1. ensure that orte_rmaps_base_schedule_policy does not override cmd line settings
2. when you try to bind to more cores than we have, generate a not-enough-processors error message
3. allow npersocket -bind-to-core combination - because, yes, somebody actually wants to do it.
This commit was SVN r21996.
2009-09-22 18:44:53 +00:00
Josh Hursey
5406fdfb80
Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option).
...
This commit looks larger than it really is since it includes a fair amount of code cleanup.
The SIGSTOP/SIGCONT+checkpointing work uses some of the functionality in r20391. Basic use case below (note that the checkpoint generated is useable as usual if the stopped application is terminated).
{{{
shell 1) mpirun -np 2 -am ft-enable-cr my-app
... running ...
shell 2) ompi-checkpoint --stop -v MPIRUN_PID
[localhost:001300] [ 0.00 / 0.20] Requested - ...
[localhost:001300] [ 0.00 / 0.20] Pending - ...
[localhost:001300] [ 0.01 / 0.21] Running - ...
[localhost:001300] [ 1.01 / 1.22] Stopped - ompi_global_snapshot_1234.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt
shell 2) killall -CONT mpirun
... Application Continues execution in shell 1 ...
}}}
Other items in this commit are mostly cleanup that has been sitting off-trunk for too long:
* Add a new {{{opal_crs_base_ckpt_options_t}}} type that encapsulates the various options that could be passed to the CRS. Currently only TERM and STOP, but this makes adding others ''much'' easier.
* Eliminate ORTE_SNAPC_CKPT_STATE_PENDING_TERM, since it served a redundant purpose with the new options type.
* Lay some basic ground work for some future features.
This commit was SVN r21995.
The following SVN revision numbers were found above:
r20391 --> open-mpi/ompi@0704b98668
2009-09-22 18:26:12 +00:00
Ralph Castain
8da3aa8d5c
Some (hopefully final!) adjustments and corrections to the paffinity support:
...
1. default -npersocket to force -bind-to-socket
2. if we cannot get a value for cores/socket, try using #logical cpus. otherwise, default to 1 core
3. add missing error message for not-enough-processors
4. since we no longer loop through orte_register_params twice, put the auto-detect of
topology info in the rte_init for hnp and std_orted
5. fix bind-to-core, bysocket combination
This commit was SVN r21992.
2009-09-22 15:41:03 +00:00
Ralph Castain
12613352eb
Add missing header file
...
This commit was SVN r21990.
2009-09-22 13:07:57 +00:00
Ralph Castain
2210989e2d
Update the cm ess module to support orted bootstrap. Continue work towards bootstrap capability.
...
This commit was SVN r21989.
2009-09-22 02:16:40 +00:00
Ralph Castain
c3f9096fd9
Add a reliable multicast framework, with an initial basic module. This is configured out unless specifically requested via --enable-multicast.
...
This commit was SVN r21988.
2009-09-22 00:58:29 +00:00
Terry Dontje
0ccf2d87b6
rename do-not-bind to bind-to-none and clean up an error message
...
This commit was SVN r21980.
2009-09-21 17:00:02 +00:00
Terry Dontje
13be2d2a00
correct a typo in the call to orte_show_help: "odle" should be "odls"
...
This commit was SVN r21979.
2009-09-21 13:22:37 +00:00
Ralph Castain
7138fd131f
Final cleanup on new paffinity "if-avail" messages, plus fix one bug reported by Terry
...
This commit was SVN r21978.
2009-09-19 17:43:21 +00:00
Ralph Castain
2028017554
Modify the paffinity system to handle binding directives that are "soft" - i.e., when someone directs that we bind if the system supports it. This allows community members to distribute OMPI with default MCA param files that direct general binding policies, without having the distributed software fail if the system cannot support those policies.
...
The new options work by adding an ":if-avail" qualifier to the "bind-to-socket" and "bind-to-core" MCA params. If the system does not support this capability, the job will launch anyway. Without the qualifier, the job will abort with an error message indicating that the required functionality is not supported on this system.
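As a rough illustration of how a "soft" qualifier like this can be handled, the sketch below detects a trailing ":if-avail" on a parameter value and decides whether an unsupported binding request should abort the job. The function names and the example value strings are assumptions, not the actual paffinity code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define IF_AVAIL ":if-avail"

/* Hypothetical parser: returns true when `value` ends with ":if-avail".
 * In both cases *base_len receives the length of the part that names
 * the binding target (the value without the qualifier). */
static bool has_if_avail(const char *value, size_t *base_len)
{
    size_t len, qlen;

    assert(value != NULL && base_len != NULL);
    len = strlen(value);
    qlen = strlen(IF_AVAIL);
    if (len >= qlen && strcmp(value + len - qlen, IF_AVAIL) == 0) {
        *base_len = len - qlen;
        return true;
    }
    *base_len = len;
    return false;
}

/* A binding request aborts the job only when the system cannot honor
 * it and the request was not marked as soft. */
static bool must_abort(bool system_supports_binding, bool soft)
{
    return !system_supports_binding && !soft;
}
```

For example, a value such as "socket:if-avail" (an illustrative string) would be treated as soft, so on a system without binding support the job would still launch.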
This commit was SVN r21975.
2009-09-18 19:48:42 +00:00
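The ":if-avail" behavior described in r21975 can be illustrated with a minimal sketch. This is not the Open MPI implementation: the function names, the `BindDirective` type, the returned action strings, and the exact "name:qualifier" string layout are all hypothetical and chosen only to show the soft-versus-hard decision the commit message describes.

```python
from dataclasses import dataclass

# Qualifier name taken from the commit message; everything else is illustrative.
IF_AVAIL = "if-avail"


@dataclass
class BindDirective:
    target: str  # e.g. "bind-to-core" or "bind-to-socket"
    soft: bool   # True when the ":if-avail" qualifier was given


def parse_bind_directive(value: str) -> BindDirective:
    """Split a hypothetical 'name[:qualifier]' string into its parts."""
    target, _, qualifier = value.partition(":")
    if qualifier not in ("", IF_AVAIL):
        raise ValueError(f"unknown qualifier: {qualifier!r}")
    return BindDirective(target=target, soft=(qualifier == IF_AVAIL))


def resolve_binding(directive: BindDirective, system_supports_binding: bool) -> str:
    """Return the action to take: 'bind', 'launch-unbound', or 'abort'."""
    if system_supports_binding:
        return "bind"
    # Soft directives let the job launch anyway; hard directives abort.
    return "launch-unbound" if directive.soft else "abort"
```

Under these assumptions, a soft directive on a system without binding support resolves to "launch-unbound", while the same directive without the qualifier resolves to "abort".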
Ralph Castain
98a4450df6
Fix the seq mapper by initializing the proc object to NULL before claiming a slot for it
...
This commit was SVN r21969.
2009-09-17 05:18:37 +00:00
Ralph Castain
ae31af7dec
Enable monitoring if configured to do so. Update the sensor framework
...
This commit was SVN r21964.
2009-09-09 21:00:27 +00:00
Ralph Castain
5fb3d13c24
Cleanup some pointer array addressing
...
This commit was SVN r21963.
2009-09-09 20:59:17 +00:00
Ralph Castain
e554fc282d
Add some diagnostic output when daemons die
...
This commit was SVN r21960.
2009-09-09 18:16:50 +00:00
Ralph Castain
c20d977a30
Report the allocate event, if requested
...
This commit was SVN r21959.
2009-09-09 17:47:58 +00:00
Ralph Castain
2688ad2c9f
Ensure the odls_types are included when referencing the APIs
...
This commit was SVN r21958.
2009-09-09 17:47:13 +00:00
Ralph Castain
51b13b3d5c
A few minor cleanups in where threads are unlocked.
...
Reset mpirun's exit code when we restart failed procs
This commit was SVN r21955.
2009-09-09 05:31:06 +00:00
Ralph Castain
c877b1a5f8
Silence a compiler warning about no format
...
This commit was SVN r21951.
2009-09-08 15:03:14 +00:00
Ralph Castain
81b8bc5b54
Silence a compiler warning about no format
...
This commit was SVN r21950.
2009-09-08 15:02:48 +00:00
Ralph Castain
142036f2c0
Issue an error message and abort if the user requests a number of processes that conflicts with nperxxx directives when evaluated against available resources
...
This commit was SVN r21949.
2009-09-07 03:36:10 +00:00
Jeff Squyres
e1fe03ad44
Minor grammar fixes, and use "#" for separating lines, not blank lines.
...
This commit was SVN r21931.
2009-09-03 07:02:21 +00:00
Ralph Castain
0421a49844
Update the xml support to allow -xml-file foo whereby we redirect all xml formatted output (and ONLY xml formatted output) to a specified file
...
This commit was SVN r21930.
2009-09-02 18:03:10 +00:00
Ralph Castain
d3d34f8f15
Correct a bug in the assignment of node index value. Ensure we set the app number so that MPI attributes get set correctly.
...
This commit was SVN r21927.
2009-09-02 01:15:44 +00:00
Ralph Castain
50ca27c1c8
Ensure that procs launched natively by slurm do not mistakenly identify themselves as daemons to the system
...
This commit was SVN r21926.
2009-09-01 17:57:15 +00:00
Lenny Verkhovsky
2a594fec6c
Added a help message to the rankfile mapper for failures caused by using an alias instead of the full hostname
...
This commit was SVN r21919.
2009-09-01 11:17:32 +00:00
Ralph Castain
59645c5c8e
Per direction from the slurm team, change the envar we look at to get our allocation
...
This commit was SVN r21915.
2009-08-30 15:57:27 +00:00
Ralph Castain
0394a4884d
Set up cpus-per-proc and cpus-per-rank as synonyms, both in mca params and on mpirun cmd line
...
This commit was SVN r21914.
2009-08-30 14:30:36 +00:00
Ralph Castain
ef4cdeeb69
Fix round-robin mapping when bind-to-socket in cases where #procs > #sockets and #cores
...
This commit was SVN r21913.
2009-08-29 03:36:21 +00:00
Ralph Castain
433673c64f
Report bindings in all cases, including external bindings and slot lists
...
This commit was SVN r21911.
2009-08-28 13:58:46 +00:00
Ralph Castain
59f08dd2ff
Support the combination of npersocket and bind-to-core
...
This commit was SVN r21909.
2009-08-28 02:31:26 +00:00
Shiqing Fan
fb777134cf
Adjust the command string length.
...
This commit was SVN r21905.
2009-08-27 13:42:55 +00:00
Ralph Castain
2d27bc9824
Default npersocket to bind-to-socket unless otherwise directed
...
This commit was SVN r21904.
2009-08-27 13:21:14 +00:00
Shiqing Fan
ffd55631bc
Deal with the case when the prefix is NULL.
...
This commit was SVN r21902.
2009-08-27 13:11:18 +00:00