Ralph Castain
4ce07ace61
Allow the user to set the send/recv buf size for udp. Don't declare existing nb recvs to be an error.
...
This commit was SVN r23210.
2010-05-26 14:29:36 +00:00
Ralph Castain
ab6e06f5b3
Reorganize the rmcast code to capture common code elements. Increase max msg size for spread and udp transports. Cleanup the spread configuration doc.
...
This commit was SVN r23207.
2010-05-25 22:36:57 +00:00
Ralph Castain
02cc0cde83
Only activate this module if specifically requested
...
This commit was SVN r23203.
2010-05-24 18:42:32 +00:00
Abhishek Kulkarni
f04dcffecd
Wrap the connection failed check with a SOS macro to extract the native error code.
...
This commit was SVN r23202.
2010-05-23 16:42:08 +00:00
Ralph Castain
73ebb748bb
Ignore comm failures when shutting down orteds
...
This commit was SVN r23201.
2010-05-23 02:57:03 +00:00
Ralph Castain
e8f98661bb
Fix a couple of plm modules that were calling a stale function
...
This commit was SVN r23200.
2010-05-23 02:55:47 +00:00
Ralph Castain
7c43d6c0f5
Don't drop a core file when we abort due to a lost connection
...
This commit was SVN r23199.
2010-05-22 18:09:40 +00:00
Jeff Squyres
fec7918eea
Some paffinity functions had their return status overloaded:
...
* If < 0, it's an OPAL_ERR_* value
* If >= 0, it's the actual output value of the function
This is problematic for the OPAL_SOS stuff. This commit changes those
functions to always return OPAL_* statuses and send the output value
back through output parameters (like 95% of the rest of the code
base). This avoids the confusion with OPAL_SOS stuff and makes
paffinity work again (e.g., mpirun --bind-to-core ...).
I updated all paffinity modules for the new function signatures, and
bumped the paffinity API version up to 2.0.1. I don't think the
version change will matter, though, because we'll be introducing
support for hardware threads soon, which will either bump the
paffinity version again or we'll replace paffinity with
a new framework.
This commit was SVN r23197.
2010-05-21 16:55:28 +00:00
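The signature change described in r23197 can be sketched as follows. This is only an illustration: `demo_get_core_old`, `demo_get_core_new`, the `DEMO_*` status codes, and the lookup table are hypothetical stand-ins, not the actual paffinity API.

```c
#include <stddef.h>

/* Hypothetical status codes standing in for OPAL_SUCCESS / OPAL_ERR_*. */
#define DEMO_SUCCESS         0
#define DEMO_ERR_BAD_PARAM  (-5)
#define DEMO_ERR_NOT_FOUND  (-13)

/* Hypothetical logical-to-physical core table used by both variants. */
static const int demo_core_table[] = { 4, 5, 6, 7 };
#define DEMO_NUM_CORES \
    ((int)(sizeof(demo_core_table) / sizeof(demo_core_table[0])))

/* Old style: the return value is overloaded.
 * A value < 0 is an error code, a value >= 0 is the physical core id. */
static int demo_get_core_old(int logical_id)
{
    if (logical_id < 0 || logical_id >= DEMO_NUM_CORES) {
        return DEMO_ERR_NOT_FOUND;
    }
    return demo_core_table[logical_id];
}

/* New style: the return value is always a status, and the result
 * travels back through an output parameter. */
static int demo_get_core_new(int logical_id, int *physical_id)
{
    if (NULL == physical_id) {
        return DEMO_ERR_BAD_PARAM;
    }
    if (logical_id < 0 || logical_id >= DEMO_NUM_CORES) {
        return DEMO_ERR_NOT_FOUND;
    }
    *physical_id = demo_core_table[logical_id];
    return DEMO_SUCCESS;
}
```

With the second form, the returned status can be wrapped or decorated by an error-reporting layer without any risk of it being mistaken for a result value.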
Ethan Mallove
57eee4d75c
* Can't put var declarations in the middle of code
...
* Use OBJ_RELEASE on data that was OBJ_NEW'd
* Limit single-line char width
* Use ORTE_ERR_BAD_PARAM on a rankfile typo, not ORTE_ERR_SILENT
* Add copyright
This commit was SVN r23196.
2010-05-21 15:30:38 +00:00
Shiqing Fan
857f1669e2
Solve a few compilation problems on Windows.
...
This commit was SVN r23193.
2010-05-21 14:30:15 +00:00
Ralph Castain
aaaeea6f17
Once again, fix the blasted rank_file mapper. I can't guarantee that I fixed it correctly, but at least now it compiles!
...
This commit was SVN r23190.
2010-05-21 09:46:42 +00:00
Ethan Mallove
e751f3c21c
Add a check for a duplicate rank assignment in the rankfile parser (Fixes trac:2414)
...
This commit was SVN r23186.
The following Trac tickets were found above:
Ticket 2414 --> https://svn.open-mpi.org/trac/ompi/ticket/2414
2010-05-20 18:38:03 +00:00
Ralph Castain
ef3c88cbd2
If we have ordered jobs to terminate, then we should ignore comm_failed reports from daemons as they may be dropping out
...
This commit was SVN r23185.
2010-05-20 12:37:09 +00:00
Ralph Castain
05e05089b8
Ignore failed comm connections if it is our connection that failed
...
This commit was SVN r23184.
2010-05-20 03:13:09 +00:00
Abhishek Kulkarni
abe13d802c
Silence warnings by commenting out unused functions in the "hnp" notifier component.
...
This commit was SVN r23181.
2010-05-19 22:46:05 +00:00
Abhishek Kulkarni
118ce0e166
OMPI FTB component updates
...
* register FTB events from an event schema file
* define more FTB events
* minor fixes
This commit was SVN r23180.
2010-05-19 22:05:06 +00:00
Ralph Castain
c7d7a18318
Little more cleanup from SOS
...
This commit was SVN r23175.
2010-05-19 16:28:58 +00:00
Josh Hursey
f57e73d4e5
add a few more missing SOS includes
...
This commit was SVN r23168.
2010-05-18 15:00:07 +00:00
Rolf vandeVaart
cdd2d09c69
Fix broken compile.
...
This commit was SVN r23167.
2010-05-18 12:43:21 +00:00
Jeff Squyres
76d1bb3c96
Per some off-list discussion with Ralph and George, MPIR_being_debugged
...
really should be volatile.
This commit was SVN r23166.
2010-05-18 11:42:41 +00:00
Abhishek Kulkarni
260eaa24dd
Add a missing header. Bleh.
...
This commit was SVN r23165.
2010-05-17 23:55:53 +00:00
Abhishek Kulkarni
8335ff3893
Fixing a branch merge issue. Looks like we picked the wrong branch when resolving
...
a conflict.
This commit was SVN r23164.
2010-05-17 23:49:46 +00:00
Abhishek Kulkarni
afbe3e99c6
* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with
...
(OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
back the native error code.
* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
(OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
decode 'ret' to get the native error code.
This commit was SVN r23162.
2010-05-17 23:08:56 +00:00
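The two check patterns from r23162 can be illustrated with a small sketch. The encoding below is invented for the example: `demo_sos_encode`, `DEMO_SOS_GET_ERR_CODE`, and the `DEMO_*` codes are hypothetical, and the real OPAL SOS bit layout is not shown in this log.

```c
/* Hypothetical native codes standing in for OPAL_SUCCESS / OPAL_ERR_*. */
#define DEMO_SUCCESS              0
#define DEMO_ERR_OUT_OF_RESOURCE (-2)

/* Hypothetical encoding: the low 8 bits of the magnitude hold the native
 * error code, and the bits above carry extra information.
 * Success is passed through unchanged. */
static int demo_sos_encode(int native, int info)
{
    if (DEMO_SUCCESS == native) {
        return DEMO_SUCCESS;
    }
    return -((-native) | (info << 8));
}

/* Recover the native code from either an encoded or a plain error value. */
#define DEMO_SOS_GET_ERR_CODE(ret) (-((-(ret)) & 0xFF))

/* Pattern 1: decode before comparing against a specific error code. */
static int demo_is_out_of_resource(int ret)
{
    return DEMO_ERR_OUT_OF_RESOURCE == DEMO_SOS_GET_ERR_CODE(ret);
}

/* Pattern 2: compare against success, which needs no decoding. */
static int demo_failed(int ret)
{
    return DEMO_SUCCESS != ret;
}
```

A direct comparison such as `DEMO_ERR_OUT_OF_RESOURCE == ret` would miss the encoded value, which is the failure mode the commit guards against.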
Abhishek Kulkarni
f7f4dd87ab
Change ORTE_ERR_LOG macro to make it SOS-aware.
...
If the logged error is an SOS-encoded error, use the SOS log function otherwise
use the existing errmgr log function.
This commit was SVN r23161.
2010-05-17 23:02:13 +00:00
Abhishek Kulkarni
9c5860706f
Merge improvements to the "notifier" framework from the OPAL SOS and the ORTE WDC mercurial branches into the SVN trunk.
...
A brief description of the improvements can be found at
https://svn.open-mpi.org/trac/ompi/wiki/ORTEWDC#ChangesdonetotheORTEnotifier
This commit was SVN r23157.
2010-05-17 22:48:05 +00:00
Abhishek Kulkarni
f5b9bc4ff1
Add a new "HNP" component to the notifier framework.
...
This component proxies notification messages up to the HNP. This
component runs in both the HNP and non-HNP processes for ease of
selection (e.g., so you can use "--mca notifier hnp" vs. "--mca
notifier hnp,non_hnp"). It auto-detects where it is running and
does the Right Thing -- if it's in the HNP process, it sets up to
receive incoming proxied messages. If it's not in the HNP, then it
proxies all messages to the HNP.
This commit was SVN r23156.
2010-05-17 22:43:43 +00:00
Abhishek Kulkarni
197ec7586d
Add a new "file" component to the notifier framework.
...
When this component is selected, the notification messages are sent to a file.
The file can be a plain file or stdout or stderr.
The MCA parameter "notifier_file_name" can be used to specify the suffix of the file the notification messages should be sent to.
The default suffix is "wdc" and the full file name is "output-wdc".
This commit was SVN r23155.
2010-05-17 22:39:52 +00:00
Ralph Castain
7e6985edbf
Cleanup warnings
...
This commit was SVN r23150.
2010-05-16 20:23:26 +00:00
Ralph Castain
de85049477
Cleanup warnings
...
This commit was SVN r23149.
2010-05-16 20:22:18 +00:00
Ralph Castain
88f5217a12
Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired).
...
Add new mca params to test:
orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch
orte_debugger_test_attach: Test debugger colaunch after debugger attachment
To test co-launch at job start, just set the orte_debugger_test_daemon param.
To test co-launch upon attach:
set orte_debugger_test_daemon
set orte_debugger_test_attach=1
set orte_enable_debug_cospawn_while_running=1
set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching
Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon.
This commit was SVN r23144.
2010-05-14 18:44:49 +00:00
Ralph Castain
b9f0615727
Correct some logic for tracking launch progress
...
This commit was SVN r23122.
2010-05-12 18:39:10 +00:00
Ralph Castain
7ce34223f1
Per off-list discussion, implement the new OMPI exit status policy (soon to be on wiki) and further cleanup error reporting to cover new cases.
...
Implement process migration when failed nodes are detected. Some testing still required
This commit was SVN r23121.
2010-05-12 18:11:58 +00:00
Ralph Castain
306533fdb8
Restore a missing line that shuts down a peer whose comm failed.
...
This commit was SVN r23120.
2010-05-12 18:09:35 +00:00
Ralph Castain
871f445848
Ignore nodes that are "down" when generating maps
...
This commit was SVN r23119.
2010-05-12 18:08:40 +00:00
Ralph Castain
4bd25f587c
Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework.
...
Add an "app" component to the errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so.
Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort.
This commit was SVN r23112.
2010-05-11 00:34:12 +00:00
Ralph Castain
86033074ce
Grr...it's a job state, not a proc state
...
This commit was SVN r23110.
2010-05-07 16:50:10 +00:00
Ralph Castain
70bef5ca95
It is possible for the jdata param to be NULL, so protect the verbose output in that case
...
This commit was SVN r23109.
2010-05-07 16:48:44 +00:00
Shiqing Fan
8ea8563462
Include the new component config file for Windows into tarball.
...
This commit was SVN r23108.
2010-05-07 14:30:12 +00:00
Ralph Castain
d6a1d7a082
Little more cleanup on paffinity. Provide a specific error code for affinity not supported so we can better report the problem. Move the error reporting to orterun so we only get one error message. Update the darwin paffinity module to return the correct new error codes.
...
This commit was SVN r23107.
2010-05-07 14:04:55 +00:00
Ralph Castain
d4f56cff61
More cleanup on paffinity....groan
...
It is okay to not have a paffinity module IF you aren't using paffinity anyway. So don't error out of MPI_Init because a paffinity module wasn't selected.
Cleanup error reporting in the odls default module to (once and for all!) eliminate messages originating in the fork'd process. Create some new error codes to allow us to pass enough info back to the parent process to provide useful error messages.
This commit was SVN r23106.
2010-05-06 20:57:17 +00:00
Jeff Squyres
477201e161
Fix "make dist" breakage
...
This commit was SVN r23105.
2010-05-06 18:47:20 +00:00
Shiqing Fan
76726f6094
Update a few module configurations for Windows.
...
This commit was SVN r23104.
2010-05-05 12:22:04 +00:00
Ralph Castain
2ff1ae13e1
Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set.
...
This commit was SVN r23102.
2010-05-05 00:48:43 +00:00
Ralph Castain
99f223210d
Add some contributed examples of how to start and configure the spread library. Do a little more cleanup on the spread module, and ensure that it isn't selected if spread isn't running.
...
This commit was SVN r23101.
2010-05-04 23:44:00 +00:00
Ralph Castain
43005b88e0
Catch one more spot...thanks to Sam for reporting it
...
This commit was SVN r23094.
2010-05-04 17:21:15 +00:00
Ralph Castain
f4ae2885e2
Add new error constant
...
This commit was SVN r23090.
2010-05-04 13:44:33 +00:00
Ralph Castain
cd569f8a79
Restore the global restart capability
...
This commit was SVN r23089.
2010-05-04 02:40:29 +00:00
Ralph Castain
3ca0b4138b
Let the nidmap functions update a new orte_process_info field as to the number of daemons in the system
...
This commit was SVN r23088.
2010-05-04 02:40:09 +00:00
Ralph Castain
8c5f442ee0
Fix some bugs in the spread rmcast component
...
This commit was SVN r23086.
2010-05-04 02:38:37 +00:00
Ralph Castain
8e7faf9119
Add a new test for the db framework, fix some minor bugs in the daemon module
...
This commit was SVN r23085.
2010-05-04 02:38:11 +00:00
Ralph Castain
9dfb5c7c62
Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set
...
This commit was SVN r23082.
2010-05-03 04:11:03 +00:00
Ralph Castain
5103ffead6
Continue work on local recovery for orteds
...
This commit was SVN r23080.
2010-05-03 04:08:13 +00:00
Ralph Castain
bcff0d6301
Some minor cleanup in the rmcast framework, ensure that a default multicast group is always defined for each app
...
This commit was SVN r23079.
2010-05-03 04:07:14 +00:00
Ralph Castain
f994a7edf4
Add recovery data to the jobdat object
...
This commit was SVN r23078.
2010-05-03 04:06:13 +00:00
Ralph Castain
323224b84b
Allow the restart data to be retrieved from a job's app->env, or default to that given to orterun
...
This commit was SVN r23077.
2010-05-03 04:05:19 +00:00
Ralph Castain
3f262bf0b6
Add a new reliable multicast component based on the "spread" library
...
Thanks to Srini Nariangadu (Cisco) for the contribution!
This commit was SVN r23076.
2010-05-02 17:29:41 +00:00
Ralph Castain
c93af95351
Since there is no defined behavior for the case where all application procs exit normally, but some or all have non-zero returns, just output a warning telling the user how many procs meet that criterion. Let the return code of mpirun in that scenario reflect any errors in OMPI/ORTE itself.
...
Clearly a temporary solution until a defined behavior can be established.
This commit was SVN r23075.
2010-04-30 19:01:10 +00:00
Ralph Castain
c62418d76d
Add a new test that checks behavior when we call exit with a non-zero return code after calling finalize - don't ask why.
...
Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status.
This commit was SVN r23071.
2010-04-29 19:58:44 +00:00
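The exit-status selection described in r23071 (take the non-zero exit code from the lowest rank) might look roughly like the sketch below. `demo_proc_t` and `demo_job_exit_status` are hypothetical names; the actual check_complete code works on ORTE job and proc objects that are not shown here.

```c
#include <stddef.h>

/* Hypothetical per-process record; the real ORTE proc object is richer. */
typedef struct {
    int rank;
    int exit_code;
} demo_proc_t;

/* Return the exit code of the lowest-ranked process that exited with a
 * non-zero status, or 0 if every process returned 0. The array does not
 * need to be sorted by rank. */
static int demo_job_exit_status(const demo_proc_t *procs, size_t nprocs)
{
    int found = 0;
    int best_rank = 0;
    int status = 0;
    size_t i;

    for (i = 0; i < nprocs; ++i) {
        if (0 == procs[i].exit_code) {
            continue;
        }
        if (!found || procs[i].rank < best_rank) {
            found = 1;
            best_rank = procs[i].rank;
            status = procs[i].exit_code;
        }
    }
    return status;
}
```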
Ralph Castain
5689bd6d30
Remove debug
...
This commit was SVN r23062.
2010-04-28 22:14:23 +00:00
Ralph Castain
319758e3e0
Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params:
...
1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis
2. orte_max_local_restarts - default max number of local restarts, can be overridden
3. orte_max_global_restarts - default max number of relocates, can be overridden
Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code.
This commit was SVN r23057.
2010-04-28 04:06:57 +00:00
Ralph Castain
67568e7dec
Fix a warning
...
This commit was SVN r23049.
2010-04-27 15:59:24 +00:00
Ralph Castain
401fb22847
Fix the pack/unpack of process state values by getting the actual packing size to match the new value size.
...
Improve the debug output by printing the string
This commit was SVN r23048.
2010-04-27 15:58:04 +00:00
Ralph Castain
efc7d3e2f6
Update the LSF PLM module to ensure that the path is correctly passed.
...
Many thanks to Teng Lin for the error report - and the patch!
This commit was SVN r23047.
2010-04-27 03:46:04 +00:00
Ralph Castain
3434296836
Ensure we don't have a trailing separator on the end of our tmpdir as (a) it really looks weird, and (b) some exotic systems interpret that as indicating the rest of the path is to be treated as absolute. Makes for very strange and interesting behavior...
...
This commit was SVN r23046.
2010-04-27 03:40:44 +00:00
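A minimal sketch of the trailing-separator cleanup from r23046, assuming a POSIX-style '/' separator. `demo_strip_trailing_sep` is a hypothetical helper, not the function the commit actually touched.

```c
#include <string.h>

/* Remove trailing '/' characters in place.
 * A lone "/" is left intact so the root directory stays valid. */
static void demo_strip_trailing_sep(char *path)
{
    size_t len;

    if (NULL == path) {
        return;
    }
    len = strlen(path);
    while (len > 1 && '/' == path[len - 1]) {
        path[--len] = '\0';
    }
}
```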
Ralph Castain
55889934d8
After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon.
...
Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly.
This commit was SVN r23045.
2010-04-27 03:39:32 +00:00
Ralph Castain
b9893aacc5
Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here:
...
1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified.
2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set.
This support must be enabled by configuring --enable-sensors.
ompi_info and orte-info have been updated to include the new framework.
Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized.
Implementation continues.
This commit was SVN r23043.
2010-04-26 22:15:57 +00:00
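The file-activity check from r23043 can be sketched as a counter of consecutive unchanged samples. Everything here is illustrative: `demo_file_watch_t` and its functions are hypothetical, and the sketch assumes that a change in any one of the three attributes counts as activity, which the commit message does not spell out.

```c
#include <sys/types.h>
#include <time.h>

/* Hypothetical tracking record for one monitored file. */
typedef struct {
    off_t  size;
    time_t atime;
    time_t mtime;
    int    unchanged;  /* consecutive samples with no change */
    int    limit;      /* unchanged samples tolerated before reporting */
} demo_file_watch_t;

static void demo_file_watch_init(demo_file_watch_t *w, int limit)
{
    w->size = 0;
    w->atime = 0;
    w->mtime = 0;
    w->unchanged = 0;
    w->limit = limit;
}

/* Feed one sample. Returns 1 when the file has shown no activity for
 * `limit` consecutive samples (the point at which an error manager
 * would be notified), otherwise 0. */
static int demo_file_watch_sample(demo_file_watch_t *w,
                                  off_t size, time_t atime, time_t mtime)
{
    if (size == w->size && atime == w->atime && mtime == w->mtime) {
        ++w->unchanged;
    } else {
        w->size = size;
        w->atime = atime;
        w->mtime = mtime;
        w->unchanged = 0;
    }
    return w->unchanged >= w->limit;
}
```

In a real sensor the sample values would come from something like `stat()` on the monitored path at each sampling interval.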
Ralph Castain
43a89bbace
Extend the process and job states by adding values for exceeding sensor bounds. This changes the job state field to 32-bit to also provide room for future expansion.
...
This commit was SVN r23036.
2010-04-26 12:36:40 +00:00
Ralph Castain
e3164d2ac1
For the autogen-challenged (i.e., Jeff), create a new ORTE constant that tells the user that a required module was not found. Update the errmgr select function to output the error if no module is found.
...
This commit was SVN r23032.
2010-04-24 01:39:26 +00:00
Ralph Castain
b62ebb8b94
Enable these to build so others can more easily begin implementation
...
This commit was SVN r23031.
2010-04-23 22:46:33 +00:00
Ralph Castain
9c69900bbc
Ensure we get the jdata object when updating proc state so we don't segv if report-progress is set. Thanks to Sam for reporting it!
...
This commit was SVN r23029.
2010-04-23 15:14:21 +00:00
George Bosilca
95394456ca
No Windows EOF.
...
This commit was SVN r23024.
2010-04-23 05:51:29 +00:00
Ralph Castain
efbb5c9b7c
Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base:
...
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.
* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.
* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.
* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons
* update the routed modules to reflect that process state is updated in the errmgr
* ensure that the orted's open the errmgr and select their appropriate module
* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place
* define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired.
* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios
This commit was SVN r23023.
2010-04-23 04:44:41 +00:00
Shiqing Fan
d1e66bdd01
Use variables instead of hard-coded compiler flags, in order to support various C/C++ compilers on Windows.
...
This commit was SVN r23016.
2010-04-21 12:45:00 +00:00
Ralph Castain
86228aee38
Provide two new opal paffinity utilities for printing a hex representation of the cpu set and parsing that string back into a cpu set on the other end. Also add a new MCA param for passing the cpu set applied to a process during launch down to that process so it can know what we attempted to do.
...
All to be used in some new MPI extensions provided by Jeff so that users can easily query their binding situation.
This commit was SVN r22998.
2010-04-19 22:16:35 +00:00
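The print/parse pair described in r22998 could be sketched as below. This assumes a CPU set that fits in a single 64-bit mask, which is a simplification; `demo_cpuset_to_str` and `demo_cpuset_from_str` are hypothetical names, not the opal paffinity utilities themselves.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

/* Write the mask as a fixed-width hex string such as 0x0000000000000005.
 * Returns 0 on success, -1 if the buffer is too small. */
static int demo_cpuset_to_str(uint64_t mask, char *buf, size_t len)
{
    int n = snprintf(buf, len, "0x%016" PRIx64, mask);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Parse a hex string (with or without the 0x prefix) back into a mask.
 * Returns 0 on success, -1 on malformed or out-of-range input. */
static int demo_cpuset_from_str(const char *str, uint64_t *mask)
{
    char *end = NULL;
    unsigned long long val;

    if (NULL == str || NULL == mask || '\0' == *str) {
        return -1;
    }
    errno = 0;
    val = strtoull(str, &end, 16);
    if (0 != errno || end == str || '\0' != *end) {
        return -1;
    }
    *mask = (uint64_t)val;
    return 0;
}
```

A string form like this is easy to pass down to a launched process, for example through an environment variable or parameter value, and decode on the other end.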
Ralph Castain
4d06125a33
Establish a method by which a process knows if it has been bound by mpirun. This helps resolve a problem where a process gets "bound" to all available resources, which looks to the opal paffinity system as "not bound". This can cause mpi_init to attempt to "bind" the process itself, causing unintended behavior.
...
This commit was SVN r22985.
2010-04-17 01:58:26 +00:00
Ralph Castain
41428e6b61
Issue a warning if a requested binding operation results in processes being bound to all available processors, which is the equivalent of not being bound at all.
...
See the following email thread for further details:
http://www.open-mpi.org/community/lists/devel/2010/04/7745.php
This commit was SVN r22984.
2010-04-17 01:02:41 +00:00
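The condition that triggers the warning in r22984 can be expressed as a simple mask test. This sketch again assumes a single 64-bit mask per CPU set, and `demo_binding_is_noop` is a hypothetical name.

```c
#include <stdint.h>

/* Return 1 if the binding covers every available processor, which is
 * equivalent to not being bound at all. An empty `available` mask is
 * treated as "nothing to compare against" and yields 0. */
static int demo_binding_is_noop(uint64_t bound, uint64_t available)
{
    return 0 != available && (bound & available) == available;
}
```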
Ralph Castain
6c379fed2e
IOF components should not assume they will be selected when queried - thus, they should not perform init functions until after selection. Create init/finalize entry points for that purpose, and have select init the module after it has been selected.
...
This commit was SVN r22982.
2010-04-16 18:51:27 +00:00
Ralph Castain
2ecc9fc2b3
Additional diag output
...
This commit was SVN r22981.
2010-04-16 14:48:37 +00:00
Ralph Castain
4308922f59
Ensure that any application-specific selection of ess module doesn't get overridden by what is given to the orted or orterun
...
Cleanup tool name determination for CM
This commit was SVN r22980.
2010-04-15 18:10:50 +00:00
Ralph Castain
c3e4c40cdf
Add another multicast tag for updating state
...
This commit was SVN r22979.
2010-04-15 18:08:53 +00:00
Ralph Castain
eeccf2f15c
Some minor changes to support vm's
...
This commit was SVN r22975.
2010-04-14 01:20:43 +00:00
Ralph Castain
854dc12fc0
Add a tag for multicasting IOF messages
...
This commit was SVN r22972.
2010-04-13 22:51:26 +00:00
Ralph Castain
81329c637e
Indentation corrections
...
This commit was SVN r22971.
2010-04-13 17:47:34 +00:00
Ralph Castain
8da781af84
Continue developing support for distributed virtual machines - minor changes to ensure correct jobid gets used and that dvm's can communicate with tools
...
This commit was SVN r22958.
2010-04-12 22:33:09 +00:00
Shiqing Fan
96b20a29b5
An easy solution to make singleton work on Windows.
...
This commit was SVN r22952.
2010-04-10 16:30:59 +00:00
Ralph Castain
d3ed4e68b7
Utilize an unused mapping policy bit to define a policy that uses only existing alive daemons to support virtual machines and restarting processes on already-active nodes
...
This commit was SVN r22951.
2010-04-10 05:02:47 +00:00
Ralph Castain
4f8279df3d
Enable substitution of the communication calls in the orted when sending messages back to the HNP by creating a function for this purpose and saving the pointer to it in orte_odls_base. Higher level libraries can then override the default function to use their own method.
...
This commit was SVN r22950.
2010-04-09 18:50:10 +00:00
Ralph Castain
c32f046d7c
Tiny cleanup - when the user kills us with a ctrl-c, there really isn't a need to tell him "your procs died and we don't know why". Just shaddup and die.
...
This commit was SVN r22949.
2010-04-09 18:47:35 +00:00
Terry Dontje
282a537cf7
This commit fixes 2370, by having the solaris paffinity module return error codes for get_physical_processor_id and having odls_default_fork_local_proc check get_physical_processor_id for OPAL_ERROR
...
This commit was SVN r22948.
2010-04-09 15:10:46 +00:00
Brad Benton
101b896f2e
IBM has approved the release of the LoadLeveler sample code under the
...
BSD license. Consequently, a more restrictive licensing clause that was
originally associated with the LoadLeveler sample code documentation and
replicated in a comment block in this file has been removed.
This commit was SVN r22947.
2010-04-08 19:41:44 +00:00
Terry Dontje
929c58e38d
This commit fixes trac:2073
...
This commit was SVN r22946.
The following Trac tickets were found above:
Ticket 2073 --> https://svn.open-mpi.org/trac/ompi/ticket/2073
2010-04-08 18:17:44 +00:00
Ralph Castain
75e99e6118
Do a better job of selecting cm ess component, handle tool and daemon issues
...
This commit was SVN r22942.
2010-04-07 18:59:21 +00:00
Ralph Castain
f1fc344336
Add some diagnostics
...
This commit was SVN r22941.
2010-04-07 18:58:17 +00:00
Ralph Castain
8e29a6858a
Properly handle the case when a daemon is given both parts of its name
...
This commit was SVN r22935.
2010-04-06 22:41:18 +00:00
Ralph Castain
a1e82e9d05
Per discussion with Josh, cleanup the errmgr API by creating separate modules for the public vs internal APIs. This mirrors the architecture used in other frameworks that had similar requirements.
...
Remove the orcm errmgr module - moving to the orcm code base so it can utilize orcm communications and not interfere with ompi-related operations.
This commit was SVN r22931.
2010-04-05 22:59:21 +00:00
Ralph Castain
1caba7af2f
Fix a bunch of compiler warnings reported by Jeff
...
This commit was SVN r22930.
2010-04-03 00:20:19 +00:00
Ralph Castain
84c7973df8
Update the #procs in the job prior to assigning vpids for each app_context.
...
This commit was SVN r22929.
2010-04-03 00:03:35 +00:00
Ralph Castain
6b43b76f9d
Some updates required for generating a LAM-style virtual machine. Retain the local node if requested. Properly setup the daemon job map for a VM launch.
...
This commit was SVN r22928.
2010-04-03 00:03:01 +00:00
Ralph Castain
de6679dbd3
Truly respect the -quiet option. Make it an mca param so someone doesn't have to put it solely on the cmd line. Tell show_help to shaddup as well.
...
This commit was SVN r22926.
2010-04-02 14:19:38 +00:00
Ralph Castain
25a9a195b0
When the user requests "quiet", mpirun isn't supposed to output "helpful error messages" - so shaddup!
...
This commit was SVN r22925.
2010-04-02 07:13:11 +00:00
Ralph Castain
ed0f42fa49
Fix a bug courtesy of Jeff - since check_job_complete removes the child object and releases it, preserve the pointer to the next item on the list prior to working with it
...
This commit was SVN r22924.
2010-04-02 07:08:34 +00:00
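The bug class fixed in r22924 (an item being released while the loop still needs its next pointer) can be illustrated with a plain singly linked list. The types and functions below are hypothetical and much simpler than the list class used in the code base.

```c
#include <stdlib.h>

/* Hypothetical list node standing in for a child-process object. */
typedef struct demo_item {
    struct demo_item *next;
    int done;
} demo_item_t;

/* Push a new node on the front of the list and return the new head. */
static demo_item_t *demo_push(demo_item_t *head, int done)
{
    demo_item_t *item = malloc(sizeof(*item));
    if (NULL == item) {
        return head;
    }
    item->next = head;
    item->done = done;
    return item;
}

/* Remove and free every node flagged as done. Returns the number removed. */
static int demo_reap_done(demo_item_t **head)
{
    int removed = 0;
    demo_item_t **link = head;
    demo_item_t *item = *head;

    while (NULL != item) {
        /* Save the successor first: item may be freed below, after
         * which reading item->next would be a use-after-free. */
        demo_item_t *next = item->next;
        if (item->done) {
            *link = next;
            free(item);
            ++removed;
        } else {
            link = &item->next;
        }
        item = next;
    }
    return removed;
}
```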
Jeff Squyres
c57a8fba5a
Improve the help message when mpirun cannot find an executable.
...
Refs trac:2035
This commit was SVN r22922.
The following Trac tickets were found above:
Ticket 2035 --> https://svn.open-mpi.org/trac/ompi/ticket/2035
2010-04-01 13:26:29 +00:00
Ralph Castain
871a9e0df4
Track process heartbeats with time_t, be a little less restrictive on who can retrieve an orte_job_t object
...
This commit was SVN r22921.
2010-03-31 19:20:06 +00:00
Josh Hursey
62f8d3c471
r22885 missed a few symbol updates when it changed ompi_want_ft to opal_want_ft
...
This commit was SVN r22916.
The following SVN revision numbers were found above:
r22885 --> open-mpi/ompi@522a23d6a3
2010-03-30 16:47:39 +00:00
Ralph Castain
f6bfaa76ba
Add some debug output to job_complete. If no session dirs were created, then cannot check for abort file - which wouldn't be created anyway
...
This commit was SVN r22903.
2010-03-29 23:21:03 +00:00
Ralph Castain
24c3b4f849
Add the sysinfo framework to the "info" tools, especially since the odls_base_open function calls it!
...
This commit was SVN r22901.
2010-03-29 20:47:29 +00:00
Ralph Castain
2603bd8a47
Eliminate a race condition (first reported by Josh) when deliberately killing procs. Need to cancel the waitpid callback for the proc, then properly flag it as dead (both not-alive and waitpid-fired) so that the system cleans up properly.
...
This commit was SVN r22900.
2010-03-28 16:08:05 +00:00
Ralph Castain
4f9db20d94
Couple of minor cleanups
...
This commit was SVN r22899.
2010-03-28 15:41:27 +00:00
Ralph Castain
1bf9684ebb
Don't include jobs in the nidmap if they aren't mapped jobs
...
This commit was SVN r22886.
2010-03-25 22:54:57 +00:00
Ralph Castain
522a23d6a3
A few changes to the FT-related configure options:
...
1. fix a bug that caused an infinite loop in configure when specifying want-ft but not want-ft-thread by removing a stale reference to the opal-progress-thread option
2. add want-ft=orcm so we can build the orcm errmgr component
3. cleanup the use of "ompi_want_ft_xxx" and replace it with "opal_want_ft_xxx" so that naming conventions are preserved
This commit was SVN r22885.
2010-03-25 22:53:48 +00:00
Josh Hursey
e4f2d03d28
ErrMgr Framework redesign to better support fault tolerance development activities.
...
Explained in more detail in the following RFC:
http://www.open-mpi.org/community/lists/devel/2010/03/7589.php
This commit was SVN r22872.
2010-03-23 21:28:02 +00:00
Ralph Castain
0b9552cd4e
Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly.
...
This commit was SVN r22870.
2010-03-23 20:47:41 +00:00
Josh Hursey
9e967a3a9b
Revert this change, since even in the CR case we want to reset this var to NULL.
...
Thanks to Jeff for the catch.
This commit was SVN r22868.
2010-03-23 19:55:21 +00:00
Shiqing Fan
9591680ec0
One of the binaries was generated from a wrong source.
...
This commit was SVN r22865.
2010-03-23 09:56:11 +00:00
Ralph Castain
62e751a95c
Add a tag
...
This commit was SVN r22862.
2010-03-22 15:46:00 +00:00
Ralph Castain
d49f93b743
Cleanup the initialization handshake for multicast apps
...
This commit was SVN r22855.
2010-03-19 20:15:01 +00:00
Ralph Castain
74bd4adc6b
Add some diagnostics, correctly check for existing channel
...
This commit was SVN r22854.
2010-03-19 08:21:01 +00:00
Ralph Castain
abbdc2b527
Pass the job family to tools that need to connect to specific HNPs
...
This commit was SVN r22853.
2010-03-19 04:01:33 +00:00
Ralph Castain
a479e6c320
Provide the sender's name for blocking recv's
...
This commit was SVN r22852.
2010-03-19 04:00:34 +00:00
Ralph Castain
8fb71c0fe6
Add some helpful defined values
...
This commit was SVN r22850.
2010-03-19 03:59:29 +00:00
Ralph Castain
e291fc2c69
With Jeff's help, get the libraries to link as required.
...
Update ompi_info and orte-info to include the new framework.
Fix some selection logic and a typo'd variable name
Still remains ompi_ignored until we complete testing
This commit was SVN r22848.
2010-03-18 02:12:59 +00:00
Ralph Castain
3cd96928a9
Use the OMPI_CHECK_PACKAGE macro to check both header file and library existence before building the component.
...
Still haven't gotten the right libraries linked in...so add ompi_ignore/unignore until we get it all fully integrated.
This commit was SVN r22843.
2010-03-17 00:46:12 +00:00
Ralph Castain
b400b84162
Merge in the modified thread configure option branch per today's telecon.
...
Remove the --enable-progress-threads option as this is no longer functional, and hardcode OPAL_ENABLE_PROGRESS_THREADS to 0.
Replace the --enable-mpi-threads option with --enable-mpi-thread-multiple as this is clearer as to meaning. This option automatically turns "on" opal thread support if it wasn't already so specified. If the user specifies --disable-opal-multi-threads --enable-mpi-thread-multiple, we will error out with a message
Add a new --enable-opal-multi-threads option that turns "on" opal thread support without doing anything wrt mpi-thread-multiple
This commit was SVN r22841.
2010-03-16 23:10:50 +00:00
Ralph Castain
ffd5be6aa1
Add a new framework to ORTE for saving and recovering state information. Two components are included that use the db or dbm library for storing the data, with a distributed hash table component coming later.
...
Note that each of these components will only be selected if specifically requested - otherwise, a "NULL" component will be used. The framework is only opened by the HNP and orteds, though neither is currently coded to save/restore state
This commit was SVN r22839.
2010-03-16 20:59:48 +00:00
Rainer Keller
814fb9399f
- Further patches for support on NetBSD (and DragonFly) by
...
Aleksej Saushev.
Don't use bash or bashisms in shell scripts.
We should use POSIX setpgid(0,0), which is equivalent to setpgrp().
This commit was SVN r22829.
2010-03-15 05:33:42 +00:00
Josh Hursey
e9b5162d79
Fix the configure logic for --with-ft so that it properly takes a comma-separated list.
...
Many uses of OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.
The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.
This commit was SVN r22824.
2010-03-12 23:57:50 +00:00
Josh Hursey
b43d621f30
Remove an errant '$' in the configure.m4 files. Was causing problems with configure.
...
This commit was SVN r22821.
2010-03-12 20:08:22 +00:00
Ralph Castain
c16cd10bb2
Save the username, if specified, for each node
...
This commit was SVN r22817.
2010-03-11 15:24:18 +00:00
Ralph Castain
7105207b1c
If we only have one app participating in an all_gather (which lies under a modex as well), then we need to ensure that the returned buffer has the proper packing order so it can be unpacked correctly.
...
This commit was SVN r22815.
2010-03-10 19:22:06 +00:00
Ralph Castain
7ebf72b4aa
Trivial cleanup
...
This commit was SVN r22813.
2010-03-10 18:24:38 +00:00
Ralph Castain
7fd7b7a8cc
Fix the load_balance mapper so that it sets the #procs in the job before attempting to compute vpids
...
This commit was SVN r22812.
2010-03-10 17:52:19 +00:00
Ralph Castain
17936e6e5f
Ensure we cleanly terminate if an executable cannot be found
...
This commit was SVN r22805.
2010-03-10 16:45:08 +00:00
Josh Hursey
b73237c92a
Identify the process sending the update in the verbose message (helps debugging of process control).
...
This commit was SVN r22804.
2010-03-10 00:23:24 +00:00
Shiqing Fan
49502af2ba
Fix the type cast.
...
This commit was SVN r22800.
2010-03-09 10:02:50 +00:00
Ralph Castain
4355134991
Let the vm launcher specify the mapping policy
...
This commit was SVN r22797.
2010-03-08 19:13:21 +00:00
Ralph Castain
bfa39d7f7e
Update the seq mapper to support lists from -host. Reorg the dash_host code to provide an ordered list as required by the seq mapper
...
This commit was SVN r22795.
2010-03-08 09:54:49 +00:00
Ralph Castain
9e7f621a98
Port Brad's paffinity change to the 1.4 branch over to the trunk so we don't lose it going forward.
...
This commit was SVN r22794.
2010-03-07 18:44:22 +00:00
Ralph Castain
2a0f7e95ee
Don't double account for the killed local proc - only adjust num_local_procs when the proc actually dies.
...
This commit was SVN r22787.
2010-03-05 13:53:18 +00:00
Ralph Castain
b2e24693c4
Check the return status when we forward stdin and remove the recipient when they are no longer alive
...
This commit was SVN r22786.
2010-03-05 13:41:28 +00:00
Ralph Castain
577eef1491
Pretty-print the recvd command for debug purposes
...
This commit was SVN r22785.
2010-03-05 13:38:20 +00:00
Ralph Castain
cdae19cf7b
Add a convenience macro to make a job family
...
This commit was SVN r22784.
2010-03-05 13:35:09 +00:00
Ralph Castain
f2c65dc70f
Ensure that the errmgr does not take action if the process was terminated by a "kill_procs" command as this can lead to circular logic.
...
Cleanup the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits.
This commit was SVN r22783.
2010-03-05 13:22:12 +00:00
Ralph Castain
ef6c432e22
Fix a nasty bug where we would hang if an application trapped signals such as SIGTERM - a permissible thing to do. In such cases, we removed the process from the waitpid system and then sent it a SIGTERM. If the application trapped that and attempted to cleanly terminate, it would send us a sync message - and the daemon would then add it back to its local child list, causing both the daemon and the process to hang.
...
In this revision, we let the process terminate/exit however it can, and then pick it up via the usual waitpid.
This commit was SVN r22781.
2010-03-05 04:14:56 +00:00
Shiqing Fan
db747e4390
Remove the old timing parameter and use orte_timing instead. Thanks to Rainer.
...
This commit was SVN r22775.
2010-03-04 15:00:03 +00:00
Ralph Castain
c88fe1ea54
Create a new mca parameter to control creation of session directories. Defaults to true so that the current behavior of always creating them is preserved. If set to false (0), then don't create session directories. Helps in those environments where session directories are a problem.
...
Tell the sm btl that it cannot run if no session directories were created.
This commit was SVN r22756.
2010-03-02 15:18:33 +00:00
Ralph Castain
cd1efbb41e
Try to do a better job of cleanup in abnormal termination. Ensure the daemons whack session directories prior to disabling signal traps. Ensure that the HNP and daemons all clean up when they are doing an internal abort.
...
This commit was SVN r22755.
2010-03-02 14:51:23 +00:00
Ralph Castain
b692645772
Remote daemons should -always- whack any lingering session directories when exiting
...
This commit was SVN r22749.
2010-03-02 05:28:53 +00:00
Ralph Castain
69fe5ca69b
Correctly compute bynode mapping, even in the presence of a $#$%#@^$ rankfile
...
This commit was SVN r22748.
2010-03-02 05:21:42 +00:00
Ralph Castain
bef06d52bc
Silence compiler warning
...
This commit was SVN r22747.
2010-03-01 21:04:26 +00:00
Ralph Castain
5514d9c673
Fix the stupid rankfile mapper again, hopefully not breaking everything else to accommodate it. Looks like the round-robin mappers still work, at least...
...
This commit was SVN r22746.
2010-03-01 20:40:47 +00:00