Jeff Squyres
5bebdb97fa
Need these header files for NetBSD. Thanks for the heads-up from
Aleksej Saushev.
This commit was SVN r23343.
2010-07-02 17:38:57 +00:00
Ralph Castain
ee4564c13b
Add some useful debug
...
This commit was SVN r23339.
2010-07-02 03:35:47 +00:00
Ralph Castain
f3d90dfb8d
Fully restore fault recovery, both at the individual process and daemon level.
...
NOTE: MPI fault recovery remains unavailable pending merge from Josh. This only covers ORTE-level processes.
This commit was SVN r23335.
2010-07-01 19:45:43 +00:00
Ralph Castain
7190415977
Fix JEFF's mistake - we cannot use orte_show_help if execv fails because we already closed all the file descriptors!
...
This commit was SVN r23334.
2010-07-01 19:41:26 +00:00
Ralph Castain
510ade9503
Do not use nodes that are flagged as down or do-not-use for this map. Modify error output to reflect possible reasons no nodes would be available
...
This commit was SVN r23333.
2010-07-01 19:39:31 +00:00
Ralph Castain
81a65f2c67
Define a new node state
...
This commit was SVN r23332.
2010-07-01 19:38:23 +00:00
Ralph Castain
f5548b8e0f
remove a potential locking conflict, and let emacs go ahead and reformat the function (sigh)
...
This commit was SVN r23331.
2010-07-01 19:37:53 +00:00
Ralph Castain
d463aec2f6
Don't try to send to dead daemons, keep accounting straight so we don't hang
...
This commit was SVN r23330.
2010-07-01 19:37:02 +00:00
Ralph Castain
dd85689560
Cleanup pointer array addressing
...
This commit was SVN r23329.
2010-07-01 19:33:10 +00:00
Ralph Castain
26fbae447e
Don't try to forward input when we already ordered shutdown. Check return codes on sends
...
This commit was SVN r23328.
2010-07-01 19:32:08 +00:00
Ralph Castain
3237b9ec87
Print a nice error message when a daemon fails, and exit with a non-zero status
...
This commit was SVN r23314.
2010-06-28 16:38:54 +00:00
Ralph Castain
a1ea6bc130
Ignore debugger daemon termination status - we don't care how they died.
...
This commit was SVN r23306.
2010-06-26 03:08:50 +00:00
Ralph Castain
099c3aad97
Fix a major faux pas that broke debugger attach. With the revisions to proc state updating, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0.
...
This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all.
Thanks to Dong Ahn (LLNL) for catching this problem!
Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm.
This commit was SVN r23300.
2010-06-24 05:13:53 +00:00
Shiqing Fan
681df0089b
Add a few new files into the tarball.
...
This commit was SVN r23297.
2010-06-22 16:45:56 +00:00
Ralph Castain
8b2a682fba
Return a silent error when -do-not-launch is given
...
This commit was SVN r23291.
2010-06-22 01:06:10 +00:00
Shiqing Fan
2e5e9f0a03
Fix a wrong Windows path in hnp_contact, which caused problems when looking up the session directories. Add two more ess modules for Windows.
...
This commit was SVN r23286.
2010-06-21 09:47:33 +00:00
Ethan Mallove
fc37e408c2
Avoid SEGV in case rsh/ssh is not in PATH
(refs trac:1490)
...
This commit was SVN r23278.
The following Trac tickets were found above:
Ticket 1490 --> https://svn.open-mpi.org/trac/ompi/ticket/1490
2010-06-17 14:58:09 +00:00
Ralph Castain
1e90b91b84
Unset envars we set during initialization so we leave environ intact after orte_finalize.
...
Thanks to Damien Gunter for pointing it out.
This commit was SVN r23277.
2010-06-17 13:42:21 +00:00
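The environment cleanup described in r23277 can be illustrated with a minimal sketch. It assumes a POSIX system; the names `env_demo_set` and `env_demo_finalize` are hypothetical and are not ORTE functions. The idea is simply to remember every variable set during initialization and unset it again at finalize.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv/unsetenv under strict -std modes */
#endif
#include <stddef.h>
#include <stdlib.h>

#define ENV_DEMO_MAX 8

/* Names of the variables this (hypothetical) library set during init.
 * The pointers are stored as-is, so callers must pass strings that
 * outlive the finalize call (string literals in this sketch). */
static const char *env_demo_names[ENV_DEMO_MAX];
static size_t env_demo_count = 0;

/* Set a variable and remember that we were the ones who set it. */
int env_demo_set(const char *name, const char *value)
{
    if (NULL == name || NULL == value || env_demo_count >= ENV_DEMO_MAX) {
        return -1;
    }
    if (0 != setenv(name, value, 1)) {
        return -1;
    }
    env_demo_names[env_demo_count++] = name;
    return 0;
}

/* Unset everything recorded by env_demo_set(), newest first. */
void env_demo_finalize(void)
{
    while (env_demo_count > 0) {
        unsetenv(env_demo_names[--env_demo_count]);
    }
}
```

This sketch only unsets; it does not restore a value that the variable may have held before initialization, which a fuller implementation might want to save.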
Ralph Castain
628ffd1d6e
Make the mcast channel assignments unsigned ints so they can be used as array indices. Assign input/output channels for apps. Cleanup some bugs in open_channel
...
This commit was SVN r23275.
2010-06-16 19:40:59 +00:00
Ralph Castain
6cbe947810
Modify the multicast scheme so that applications have separate input and output channels to avoid cross-talk. Update the multicast test to conform.
...
This commit was SVN r23271.
2010-06-15 03:50:31 +00:00
Ralph Castain
da43547983
Don't define the active_jobid until -after- the job has been set up.
...
Clean up references to pointer_array objects
This commit was SVN r23250.
2010-06-09 02:16:05 +00:00
Jeff Squyres
f1a7b5cc33
Make "processor affinity not supported" error message a little better:
* Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic
  OPAL_ERR_NOT_SUPPORTED case.
* When odls_default detects that processor affinity is not supported,
  it prints a specific message about it, and then it suppresses a
  generic HNP help message that would normally follow it (i.e., it's
  easier to have the "processor affinity is not supported" show_help
  message last).
* Use some symbolic names in odls_default instead of fixed ints,
  just for slight readability improvements in the code.
* Introduce orte_show_help_suppress(), which gives the ability to
  suppress any future showings of any arbitrary show_help() message.
  This is useful if you display message X and want to suppress
  message Y. This suppression *only* works in environments where
  orte_show_help() does coalescing.
This commit was SVN r23249.
2010-06-08 20:16:07 +00:00
Ralph Castain
e52a54183f
Let max restarts be associated with an app_context instead of a job so that individual apps can have different values. Default to a single job-level value
...
This commit was SVN r23248.
2010-06-07 14:21:08 +00:00
Ralph Castain
799a77a187
Some updates to the routed-cm module so it properly supports the tcp rmcast module
...
This commit was SVN r23247.
2010-06-07 14:19:32 +00:00
Ralph Castain
bd045468e5
Let apps use the ess cm module too...
...
This commit was SVN r23246.
2010-06-07 14:16:34 +00:00
Ralph Castain
ec7b5dae2b
Add missing include file
...
This commit was SVN r23245.
2010-06-07 14:15:25 +00:00
George Bosilca
f453265de2
Only call gettimeofday once.
...
This commit was SVN r23235.
2010-06-02 09:44:37 +00:00
Ralph Castain
69410f2a87
Ensure that we report the state on debugger daemon co-launch so that the spawn properly releases
...
This commit was SVN r23233.
2010-06-01 23:23:00 +00:00
Ralph Castain
b60c369489
Add missing rml tag
...
This commit was SVN r23232.
2010-06-01 22:58:23 +00:00
Shiqing Fan
2697a37363
Use the correct type for IO vector base.
...
This commit was SVN r23229.
2010-06-01 15:40:11 +00:00
Ralph Castain
36e6c11c5e
Little cleanup
...
This commit was SVN r23211.
2010-05-27 02:49:09 +00:00
Ralph Castain
4ce07ace61
Allow the user to set the send/recv buf size for udp. Don't declare existing nb recvs to be an error.
...
This commit was SVN r23210.
2010-05-26 14:29:36 +00:00
Ralph Castain
ab6e06f5b3
Reorganize the rmcast code to capture common code elements. Increase max msg size for spread and udp transports. Cleanup the spread configuration doc.
...
This commit was SVN r23207.
2010-05-25 22:36:57 +00:00
Ralph Castain
02cc0cde83
Only activate this module if specifically requested
...
This commit was SVN r23203.
2010-05-24 18:42:32 +00:00
Abhishek Kulkarni
f04dcffecd
Wrap the connection failed check with a SOS macro to extract the native error code.
...
This commit was SVN r23202.
2010-05-23 16:42:08 +00:00
Ralph Castain
73ebb748bb
Ignore comm failures when shutting down orteds
...
This commit was SVN r23201.
2010-05-23 02:57:03 +00:00
Ralph Castain
e8f98661bb
Fix a couple of plm modules that were calling a stale function
...
This commit was SVN r23200.
2010-05-23 02:55:47 +00:00
Ralph Castain
7c43d6c0f5
Don't drop a core file when we abort due to a lost connection
...
This commit was SVN r23199.
2010-05-22 18:09:40 +00:00
Jeff Squyres
fec7918eea
Some paffinity functions had their return status overloaded:
* If < 0, it's an OPAL_ERR_* value
* If >= 0, it's the actual output value of the function
This is problematic for the OPAL_SOS stuff. This commit changes those
functions to always return OPAL_* statuses and send the output value
back through output parameters (like 95% of the rest of the code
base). This avoids the confusion with OPAL_SOS stuff and makes
paffinity work again (e.g., mpirun --bind-to-core ...).
I updated all paffinity modules for the new function signatures, and
bumped the paffinity API version up to 2.0.1. I don't think the
version change will matter, though, because we'll be introducing
support for hardware threads soon, which will either bump the
paffinity version again or we'll replace paffinity with
a new framework.
This commit was SVN r23197.
2010-05-21 16:55:28 +00:00
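The signature change described in r23197 can be illustrated with a small sketch. Everything here is hypothetical: the `pa_demo_*` functions and `PA_DEMO_*` codes merely stand in for the paffinity functions and OPAL_* statuses, and the fixed core count is invented for illustration. The first function overloads its return value (negative means error, non-negative means result); the second always returns a status and passes the value back through an output parameter.

```c
#include <stddef.h>

/* Hypothetical status codes standing in for OPAL_SUCCESS / OPAL_ERR_*. */
enum { PA_DEMO_SUCCESS = 0, PA_DEMO_ERR_BAD_PARAM = -1 };

/* Old style: the return value is either an error code (< 0)
 * or the actual output of the function (>= 0). */
int pa_demo_core_count_overloaded(int socket)
{
    if (socket < 0) {
        return PA_DEMO_ERR_BAD_PARAM;
    }
    return 4;  /* made-up core count for this sketch */
}

/* New style: the return value is always a status; the output
 * travels through a separate parameter. */
int pa_demo_core_count(int socket, int *num_cores)
{
    if (NULL == num_cores || socket < 0) {
        return PA_DEMO_ERR_BAD_PARAM;
    }
    *num_cores = 4;  /* made-up core count for this sketch */
    return PA_DEMO_SUCCESS;
}
```

With the second form, a caller compares the return value only against status codes and never has to decide whether a given integer is a result or an error.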
Ethan Mallove
57eee4d75c
* Can't put var declarations in the middle of code
* Use OBJ_RELEASE on data that was OBJ_NEW'd
* Limit single-line char width
* Use ORTE_ERR_BAD_PARAM on a rankfile typo, not ORTE_ERR_SILENT
* Add copyright
This commit was SVN r23196.
2010-05-21 15:30:38 +00:00
Shiqing Fan
857f1669e2
Solve a few compilation problems on Windows.
...
This commit was SVN r23193.
2010-05-21 14:30:15 +00:00
Ralph Castain
aaaeea6f17
Once again, fix the blasted rank_file mapper. I can't guarantee that I fixed it correctly, but at least now it compiles!
...
This commit was SVN r23190.
2010-05-21 09:46:42 +00:00
Ethan Mallove
e751f3c21c
Add a check for a duplicate rank assignment in the rankfile parser (Fixes trac:2414)
...
This commit was SVN r23186.
The following Trac tickets were found above:
Ticket 2414 --> https://svn.open-mpi.org/trac/ompi/ticket/2414
2010-05-20 18:38:03 +00:00
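A duplicate-rank check of the kind mentioned in r23186 might look roughly like the sketch below. This is not the rankfile parser's code: the `rf_demo_*` names, the fixed table size, and the error codes are all assumptions made for illustration. Each rank is recorded as it is assigned, and a second assignment of the same rank is rejected.

```c
#include <stdbool.h>
#include <stddef.h>

#define RF_DEMO_MAX_RANKS 64  /* arbitrary limit for this sketch */

/* Hypothetical status codes; the real parser uses ORTE_* values. */
enum { RF_DEMO_OK = 0, RF_DEMO_ERR_BAD_PARAM = -1, RF_DEMO_ERR_DUPLICATE = -2 };

/* Tracks which ranks have already appeared in the file. */
typedef struct {
    bool assigned[RF_DEMO_MAX_RANKS];
} rf_demo_table_t;

void rf_demo_table_init(rf_demo_table_t *t)
{
    for (size_t i = 0; i < RF_DEMO_MAX_RANKS; ++i) {
        t->assigned[i] = false;
    }
}

/* Record a rank; fail if it is out of range or was already assigned. */
int rf_demo_assign_rank(rf_demo_table_t *t, int rank)
{
    if (NULL == t || rank < 0 || rank >= RF_DEMO_MAX_RANKS) {
        return RF_DEMO_ERR_BAD_PARAM;
    }
    if (t->assigned[rank]) {
        return RF_DEMO_ERR_DUPLICATE;
    }
    t->assigned[rank] = true;
    return RF_DEMO_OK;
}
```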
Ralph Castain
ef3c88cbd2
If we have ordered jobs to terminate, then we should ignore comm_failed reports from daemons as they may be dropping out
...
This commit was SVN r23185.
2010-05-20 12:37:09 +00:00
Ralph Castain
05e05089b8
Ignore failed comm connections if it is our connection that failed
...
This commit was SVN r23184.
2010-05-20 03:13:09 +00:00
Abhishek Kulkarni
abe13d802c
Silence warnings by commenting out unused functions in the "hnp" notifier component.
...
This commit was SVN r23181.
2010-05-19 22:46:05 +00:00
Abhishek Kulkarni
118ce0e166
OMPI FTB component updates
...
* register FTB events from an event schema file
* define more FTB events
* minor fixes
This commit was SVN r23180.
2010-05-19 22:05:06 +00:00
Ralph Castain
c7d7a18318
Little more cleanup from SOS
...
This commit was SVN r23175.
2010-05-19 16:28:58 +00:00
Josh Hursey
f57e73d4e5
add a few more missing SOS includes
...
This commit was SVN r23168.
2010-05-18 15:00:07 +00:00
Rolf vandeVaart
cdd2d09c69
Fix broken compile.
...
This commit was SVN r23167.
2010-05-18 12:43:21 +00:00