Ralph Castain
8db7a9f9a7
Add regular expression generator to encode complete nid/pid maps - decoder to come.
...
This commit was SVN r21455.
2009-06-17 02:54:20 +00:00
Ralph Castain
85e55c5087
Make the IOF macros match for debug vs optimized builds
...
This commit was SVN r21453.
2009-06-16 22:30:53 +00:00
Rolf vandeVaart
633b996a0f
Add sys/wait.h so we can compile on Solaris.
...
This commit was SVN r21451.
2009-06-16 19:48:43 +00:00
Ralph Castain
e9fc0a74fb
Silence compiler warnings
...
This commit was SVN r21445.
2009-06-16 13:34:31 +00:00
Ralph Castain
74bd80afd9
Do not preload binaries or files if the app isn't being executed on this node
...
This commit was SVN r21444.
2009-06-16 03:12:30 +00:00
Ralph Castain
d1dd8c2653
Ensure we accurately count the number of new daemons to be launched, especially if we are restarting processes.
...
Have the resilient mapper also setup for new daemons in case the PLM needs them.
This commit was SVN r21437.
2009-06-15 13:55:01 +00:00
Abhishek Kulkarni
63845591a0
This fix adds a missing topic (cant-open-logfile) to help-orte-top.txt
and changes another incorrectly titled topic from pid-required
to no-contact-given.
This commit was SVN r21431.
2009-06-13 23:05:44 +00:00
Ralph Castain
c0c56e30c9
Add a missing function to the resilient mapper so it defines daemons in case they are needed
...
This commit was SVN r21428.
2009-06-12 19:48:13 +00:00
Ralph Castain
44bb265a52
Add a new MPI_Info key to preposition OMPI libraries - implementation underway, but this just defines and passes the new key
...
This commit was SVN r21425.
2009-06-12 17:53:13 +00:00
Ralph Castain
170327e575
Reorg the rmaps components to collect shared code for byslot and bynode mapping in the base so we quit duplicating it in every mapper
...
This commit was SVN r21424.
2009-06-12 17:52:17 +00:00
Ralph Castain
1ee4acb247
Cleanup how we handle pointer arrays in the odls base fns to avoid potential segfaults
...
This commit was SVN r21423.
2009-06-12 17:51:23 +00:00
Ralph Castain
70b8c89b44
Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it".
...
Also complete implementation to printout app_context objects so we see all the fields.
This commit was SVN r21408.
2009-06-10 19:01:08 +00:00
Ralph Castain
f24cefe3d2
Shift the check for adequate file descriptors to after we check if this proc is part of the job to be launched - no point in doing the check more often than absolutely required
...
This commit was SVN r21406.
2009-06-10 15:18:48 +00:00
Ralph Castain
87d7d693f0
Add a notifier call when the oob retries are exceeded so sys admins are aware of the problem
...
This commit was SVN r21405.
2009-06-10 15:17:16 +00:00
George Bosilca
77a6f27d44
Update the call to orte_plm_base_create_jobid based on the new interface.
...
This commit was SVN r21393.
2009-06-08 23:53:53 +00:00
Ralph Castain
86d55d7ebf
Fix tight loops over comm_spawn by checking to see if the system has enough child procs and file descriptors available before attempting to launch. If not, introduce a 1sec delay and then test again. This provides a chance for the orted to complete processing of proc terminations from other children, hopefully creating room for the new proc(s).
...
Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn.
This commit was SVN r21390.
2009-06-08 18:28:26 +00:00
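The check-then-wait behavior described in r21390 can be sketched roughly as follows. This is a minimal illustration, not the actual orted code: the `launch_resources_t` struct, its fields, and the `have_room`, `wait_for_room`, and `simulate_one_exit` functions are all hypothetical names introduced here, and the refresh callback stands in for whatever the daemon does to process child terminations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical snapshot of local limits; not an ORTE data structure. */
typedef struct {
    int children_in_use;
    int children_limit;
    int fds_in_use;
    int fds_limit;
} launch_resources_t;

/* Callback that updates the snapshot, e.g. after reaping exited children. */
typedef void (*refresh_fn)(launch_resources_t *r);

/* True when nprocs more children, each needing fds_per_proc descriptors, fit. */
bool have_room(const launch_resources_t *r, int nprocs, int fds_per_proc)
{
    assert(r != NULL);
    return r->children_in_use + nprocs <= r->children_limit &&
           r->fds_in_use + nprocs * fds_per_proc <= r->fds_limit;
}

/* Refresh and test up to max_tries times, sleeping delay_sec after each
 * failed test. Returns the number of delays taken, or -1 if room never
 * appeared. */
int wait_for_room(launch_resources_t *r, refresh_fn refresh,
                  int nprocs, int fds_per_proc,
                  int max_tries, unsigned delay_sec)
{
    assert(r != NULL && refresh != NULL);
    for (int waits = 0; waits < max_tries; waits++) {
        refresh(r);
        if (have_room(r, nprocs, fds_per_proc)) {
            return waits;
        }
        sleep(delay_sec);
    }
    return -1;
}

/* Stand-in refresh for experimentation: pretend one child exited. */
void simulate_one_exit(launch_resources_t *r)
{
    assert(r != NULL);
    if (r->children_in_use > 0) {
        r->children_in_use--;
    }
}
```

The commit message specifies a 1 second delay; the delay is a parameter here only so the sketch can be exercised without waiting.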
Shiqing Fan
5a90b3068e
Two type casts.
...
This commit was SVN r21388.
2009-06-07 12:51:46 +00:00
Ralph Castain
0a67bcb653
Minor cleanups
...
This commit was SVN r21387.
2009-06-06 15:44:00 +00:00
Ralph Castain
ccf6b2cb8c
Implement the ability to register callbacks when specified error states occur
...
This commit was SVN r21386.
2009-06-06 01:15:31 +00:00
Ralph Castain
0336460b0a
Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes.
...
This commit was SVN r21385.
2009-06-06 01:08:47 +00:00
Ralph Castain
3815bfbba6
Provide a better error message when the oob cannot send a message after exhausting retries, and then have the proc abort so the job doesn't just hang forever.
...
Since it could be a daemon that needs to abort, cleanup the abort sequence so the daemon can exit as cleanly as possible.
This commit was SVN r21361.
2009-06-02 23:57:12 +00:00
Ralph Castain
882b40182b
Add --show-progress option to mpirun
...
This commit was SVN r21360.
2009-06-02 23:52:59 +00:00
Ralph Castain
30a357bd8d
Provide a "progress meter" for launch that outputs progress as we are launching, especially on large jobs. Also, provide a timeout mechanism so that we cleanly abort if we don't get a response from the next daemon in a specified time.
...
This commit was SVN r21359.
2009-06-02 23:52:02 +00:00
Ralph Castain
303e3a1d39
Add a resilient mapping capability - currently maps by fault groups (if provided), still need to add the remapping capability for failed procs.
...
This commit was SVN r21350.
2009-06-02 03:23:20 +00:00
Jeff Squyres
df7c387155
I'm (temporarily?) removing this entry because there's no .window file
in this directory and it's causing "make dist" to break.
Shiqing -- is there a missing file in this directory? If so, please
add it and restore the EXTRA_DIST line I just removed. Thanks!
This commit was SVN r21340.
2009-06-01 14:07:08 +00:00
Ralph Castain
137104b786
Initial support for CMs - needs to be pruned as CM support develops
...
This commit was SVN r21335.
2009-05-30 20:57:23 +00:00
Ralph Castain
4a31c65126
Add initial support for CMs
...
This commit was SVN r21334.
2009-05-30 20:55:55 +00:00
Ralph Castain
a1005a716f
Allow CM's to select the default errmgr component. Add support for error function callbacks
...
This commit was SVN r21333.
2009-05-30 20:43:42 +00:00
Ralph Castain
a95731fc68
Minor update to let apps set their own component selections if desired, while preserving slave behavior
...
This commit was SVN r21332.
2009-05-30 20:42:23 +00:00
Ralph Castain
4e0223a638
Add the ability to directly launch procs via rsh/ssh. Collect common functions in plm/base. Create a new global param to set assume_same_shell, alias'd back to plm_rsh_assume_same_shell (not deprecated).
...
This commit was SVN r21328.
2009-05-30 01:10:25 +00:00
Jeff Squyres
5ea1b776f7
Remove a compiler warning about an empty format string. The proper
way to have no abort message is to pass NULL (the errmanager is smart
enough to handle this case and not emit any extra message).
This commit was SVN r21311.
2009-05-28 13:32:37 +00:00
Jeff Squyres
1834fc4ac6
Nysal noticed some repeated header files; removed.
...
This commit was SVN r21310.
2009-05-28 12:05:42 +00:00
Ralph Castain
17485ca604
Minor mod per Greg Watson, plus some cleanups to make George smile...or at least grimace a little less! :-)
...
This commit was SVN r21309.
2009-05-28 00:55:01 +00:00
Ralph Castain
cf3a5c44db
Modify the xml output per devel-list discussion with Greg Watson
...
This commit was SVN r21285.
2009-05-27 00:43:54 +00:00
Jeff Squyres
e6a32f13bb
Add missing header file
...
This commit was SVN r21273.
2009-05-26 20:57:44 +00:00
Iain Bason
e7ff2368d6
This fixes trac:1930.
...
Emit a more informative error message when the file descriptor limit is
reached during an accept() call. Also, abort when the accept fails to
avoid an infinite loop.
Emit a more informative error message when the help file can't be opened.
This commit was SVN r21271.
The following Trac tickets were found above:
Ticket 1930 --> https://svn.open-mpi.org/trac/ompi/ticket/1930
2009-05-26 20:03:21 +00:00
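The reasoning in r21271 can be illustrated with a small error classifier. This is a sketch under assumptions, not the code from the commit: `accept_action_t`, `classify_accept_error`, and `accept_error_hint` are hypothetical names, and the choice of which errors count as transient is an assumption made here. The point is that a descriptor-limit failure leaves the pending connection queued, so a loop that simply retries can spin forever.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical outcome of inspecting errno after a failed accept(). */
typedef enum { ACCEPT_RETRY, ACCEPT_ABORT } accept_action_t;

/* Transient conditions are retried; everything else, including the
 * descriptor-limit errors EMFILE and ENFILE, stops the accept loop. */
accept_action_t classify_accept_error(int err)
{
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK ||
        err == ECONNABORTED) {
        return ACCEPT_RETRY;
    }
    return ACCEPT_ABORT;
}

/* Returns a more informative message for the descriptor-limit case and
 * falls back to the system error text otherwise. */
const char *accept_error_hint(int err)
{
    assert(err != 0);
    if (err == EMFILE || err == ENFILE) {
        return "file descriptor limit reached; raise the limit or close descriptors";
    }
    return strerror(err);
}
```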
Rolf vandeVaart
db04b6ca71
This change does two things. First, do not emit error
messages when delivering a signal (like STOP or CONT)
to a nonexistent process. This fixes trac:1929.
Also, only print one error message in the other cases.
This commit was SVN r21263.
The following Trac tickets were found above:
Ticket 1929 --> https://svn.open-mpi.org/trac/ompi/ticket/1929
2009-05-22 14:59:27 +00:00
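One way to get the behavior described in r21263 is to treat `ESRCH` from `kill()` as a non-error and to print a single message for anything else. The sketch below assumes a POSIX system; `signal_status` and `deliver_signal` are hypothetical names, not functions from the commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Maps a kill() result to a status: success and "no such process" both
 * count as 0, any other failure as -1. */
int signal_status(int rc, int err)
{
    if (rc == 0 || err == ESRCH) {
        return 0;
    }
    return -1;
}

/* Sends sig to pid. Stays silent if the process is already gone and
 * prints exactly one message for any other failure. */
int deliver_signal(pid_t pid, int sig)
{
    assert(pid > 0);
    int rc = kill(pid, sig);
    int err = (rc == 0) ? 0 : errno;
    if (signal_status(rc, err) != 0) {
        fprintf(stderr, "could not send signal %d to pid %ld: %s\n",
                sig, (long)pid, strerror(err));
        return -1;
    }
    return 0;
}
```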
Ralph Castain
cc7620c210
Fix orte-ps so it properly ignores/reports stale HNPs, but continues to provide output on running ones. Add a timeout on the send side of the comm so we don't hang while trying to send the info request to the non-existent HNP.
...
This commit was SVN r21257.
2009-05-21 02:42:21 +00:00
Jeff Squyres
154dc846b5
Remove extra ";", which caused a bazillion warnings in MTT.
...
This commit was SVN r21255.
2009-05-20 13:16:31 +00:00
Rainer Keller
5c80033aa2
- Eliminate icc warning w/ regard to __attribute__((__format__)) on
function pointers... Needed checking in opal_check_attributes.m4
This commit was SVN r21254.
2009-05-20 00:39:22 +00:00
Ralph Castain
fc88f04bdd
Initialize var before use - thanks to Ashley for pointing out the problem
...
This commit was SVN r21249.
2009-05-18 14:21:29 +00:00
Ralph Castain
f139cfd28a
Fully enable the use of static ports to minimize connections on mpirun. When static ports are provided, daemons will automatically use routes defined by the selected routed module to callback to mpirun during startup, thus eliminating the dedicated daemon-to-mpirun connection. Therefore, the total number of connections on mpirun will equal the fanout of the routed module (instead of #nodes in job).
...
Add a new tm ess module that exploits this capability.
Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function.
Additional fixes included:
1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with!
2. fix a duplicate free when attempting to execute a non-existent app
3. clean up a typo in the comm utilities
4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps
Tested on slurm and tm machines, using all tests in orte/test/mpi. May run into an issue with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with the addition of a regexp generator to compress that info.
This commit was SVN r21248.
2009-05-16 04:15:55 +00:00
Ralph Castain
484a6f58f2
Repair orte-ps by updating some of the interface code. Add ability to recover from attempting to contact non-responsive HNPs due to stale session directories. Implement the -j option. Turn "off" the -p option as it doesn't work and will take a little while to actually implement it (if anyone really cares).
...
This commit was SVN r21245.
2009-05-15 13:21:18 +00:00
Rainer Keller
b98a095d22
- Similar to r21229, check for return code from
orte_rml_base_update_contact_info
This commit was SVN r21233.
The following SVN revision numbers were found above:
r21229 --> open-mpi/ompi@9ad9b20847
2009-05-14 00:36:51 +00:00
Rainer Keller
9b7c4c2354
- Update comment
...
This commit was SVN r21232.
2009-05-14 00:32:43 +00:00
Rainer Keller
7950e37eed
- Fix Coverity CID 890;
...
Use snprintf instead of sprintf
This commit was SVN r21231.
2009-05-14 00:30:32 +00:00
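The kind of change r21231 describes, replacing `sprintf` with `snprintf`, typically looks like the sketch below. The helper `format_name` and its "prefix.rank" output format are hypothetical and only illustrate the bounded write and the truncation check; they are not taken from the commit.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes "<prefix>.<rank>" into buf without ever
 * writing past buflen bytes. Returns 0 on success, -1 on an encoding
 * error or if the result had to be truncated. */
int format_name(char *buf, size_t buflen, const char *prefix, int rank)
{
    assert(buf != NULL && prefix != NULL && buflen > 0);
    /* snprintf returns the length the full string would have had, so a
     * value >= buflen means the output was cut short. */
    int n = snprintf(buf, buflen, "%s.%d", prefix, rank);
    if (n < 0 || (size_t)n >= buflen) {
        return -1;
    }
    return 0;
}
```

With `sprintf` the same call would write past the end of a too-small buffer; here the buffer stays NUL-terminated and the caller learns about the truncation from the return value.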
Rainer Keller
b3daf7184a
- Fix Coverity CID 1272:
...
Need to release src in case of error
- Fix Coverity CID 1271:
basename is allocated by opal_basename and later overwritten
This commit was SVN r21230.
2009-05-14 00:28:35 +00:00
Rainer Keller
9ad9b20847
- Fix Coverity CID #15.
...
This commit was SVN r21229.
2009-05-14 00:25:02 +00:00
Rainer Keller
73fd329cbd
- Add the proper __opal_attribute_format__(__printf__...) to
declarations.
This commit was SVN r21226.
2009-05-14 00:10:59 +00:00
Ralph Castain
fd5dd9c4cb
Ensure we correctly cycle through map_by_slot when mapping leftover procs in rankfile mapper
...
This commit was SVN r21219.
2009-05-12 15:41:55 +00:00