1
1

2288 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
8db7a9f9a7 Add regular expression generator to encode complete nid/pid maps - decoder to come.
This commit was SVN r21455.
2009-06-17 02:54:20 +00:00
Ralph Castain
85e55c5087 Make the IOF macros match for debug vs optimized builds
This commit was SVN r21453.
2009-06-16 22:30:53 +00:00
Rolf vandeVaart
633b996a0f Add sys/wait.h so we can compile on Solaris.
This commit was SVN r21451.
2009-06-16 19:48:43 +00:00
Ralph Castain
e9fc0a74fb Silence compiler warnings
This commit was SVN r21445.
2009-06-16 13:34:31 +00:00
Ralph Castain
74bd80afd9 Do not preload binaries or files if the app isn't being executed on this node
This commit was SVN r21444.
2009-06-16 03:12:30 +00:00
Ralph Castain
d1dd8c2653 Ensure we accurately count the number of new daemons to be launched, especially if we are restarting processes.
Have the resilient mapper also setup for new daemons in case the PLM needs them.

This commit was SVN r21437.
2009-06-15 13:55:01 +00:00
Abhishek Kulkarni
63845591a0 This fix adds a missing topic (cant-open-logfile) to help-orte-top.txt
and changes another incorrectly titled topic from pid-required
to no-contact-given.

This commit was SVN r21431.
2009-06-13 23:05:44 +00:00
Ralph Castain
c0c56e30c9 Add a missing function to the resilient mapper so it defines daemons in case they are needed
This commit was SVN r21428.
2009-06-12 19:48:13 +00:00
Ralph Castain
44bb265a52 Add a new MPI_Info key to preposition OMPI libraries - implementation underway, but this just defines and passes the new key
This commit was SVN r21425.
2009-06-12 17:53:13 +00:00
Ralph Castain
170327e575 Reorg the rmaps components to collect shared code for byslot and bynode mapping in the base so we quit duplicating it in every mapper
This commit was SVN r21424.
2009-06-12 17:52:17 +00:00
Ralph Castain
1ee4acb247 Cleanup how we handle pointer arrays in the odls base fns to avoid potential segfaults
This commit was SVN r21423.
2009-06-12 17:51:23 +00:00
Ralph Castain
70b8c89b44 Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it".
Also complete implementation to printout app_context objects so we see all the fields.

This commit was SVN r21408.
2009-06-10 19:01:08 +00:00
Ralph Castain
f24cefe3d2 Shift the check for adequate file descriptors to after we check if this proc is part of the job to be launched - no point in doing the check more often than absolutely required
This commit was SVN r21406.
2009-06-10 15:18:48 +00:00
Ralph Castain
87d7d693f0 Add a notifier call when the oob retries are exceeded so sys admins are aware of the problem
This commit was SVN r21405.
2009-06-10 15:17:16 +00:00
George Bosilca
77a6f27d44 Update the call to orte_plm_base_create_jobid based on the new interface.
This commit was SVN r21393.
2009-06-08 23:53:53 +00:00
Ralph Castain
86d55d7ebf Fix tight loops over comm_spawn by checking to see if the system has enough child procs and file descriptors available before attempting to launch. If not, introduce a 1sec delay and then test again. This provides a chance for the orted to complete processing of proc terminations from other children, hopefully creating room for the new proc(s).
Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn.

This commit was SVN r21390.
2009-06-08 18:28:26 +00:00
Shiqing Fan
5a90b3068e Two type casts.
This commit was SVN r21388.
2009-06-07 12:51:46 +00:00
Ralph Castain
0a67bcb653 Minor cleanups
This commit was SVN r21387.
2009-06-06 15:44:00 +00:00
Ralph Castain
ccf6b2cb8c Implement the ability to register callbacks when specified error states occur
This commit was SVN r21386.
2009-06-06 01:15:31 +00:00
Ralph Castain
0336460b0a Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes.
This commit was SVN r21385.
2009-06-06 01:08:47 +00:00
Ralph Castain
3815bfbba6 Provide a better error message when the oob cannot send a message after exhausting retries, and then have the proc abort so the job doesn't just hang forever.
Since it could be a daemon that needs to abort, cleanup the abort sequence so the daemon can exit as cleanly as possible.

This commit was SVN r21361.
2009-06-02 23:57:12 +00:00
Ralph Castain
882b40182b Add --show-progress option to mpirun
This commit was SVN r21360.
2009-06-02 23:52:59 +00:00
Ralph Castain
30a357bd8d Provide a "progress meter" for launch that outputs progress as we are launching, especially on large jobs. Also, provide a timeout mechanism so that we cleanly abort if we don't get a response from the next daemon in a specified time.
This commit was SVN r21359.
2009-06-02 23:52:02 +00:00
Ralph Castain
303e3a1d39 Add a resilient mapping capability - currently maps by fault groups (if provided), still need to add the remapping capability for failed procs.
This commit was SVN r21350.
2009-06-02 03:23:20 +00:00
Jeff Squyres
df7c387155 I'm (temporarily?) removing this entry because there's no .window file
in this directory and it's causing "make dist" to break.

Shiqing -- is there a missing file in this directory?  If so, please
add it and restore the EXTRA_DIST line I just removed.  Thanks!

This commit was SVN r21340.
2009-06-01 14:07:08 +00:00
Ralph Castain
137104b786 Initial support for CMs - needs to be pruned as CM support develops
This commit was SVN r21335.
2009-05-30 20:57:23 +00:00
Ralph Castain
4a31c65126 Add initial support for CMs
This commit was SVN r21334.
2009-05-30 20:55:55 +00:00
Ralph Castain
a1005a716f Allow CM's to select the default errmgr component. Add support for error function callbacks
This commit was SVN r21333.
2009-05-30 20:43:42 +00:00
Ralph Castain
a95731fc68 Minor update to let apps set their own component selections if desired, while preserving slave behavior
This commit was SVN r21332.
2009-05-30 20:42:23 +00:00
Ralph Castain
4e0223a638 Add the ability to directly launch procs via rsh/ssh. Collect common functions in plm/base. Create a new global param to set assume_same_shell, alias'd back to plm_rsh_assume_same_shell (not deprecated).
This commit was SVN r21328.
2009-05-30 01:10:25 +00:00
Jeff Squyres
5ea1b776f7 Remove a compiler warning about an empty format string. The proper
way to have no abort message is to pass NULL (the errmanager is smart
enough to handle this case and not emit any extra message).

This commit was SVN r21311.
2009-05-28 13:32:37 +00:00
Jeff Squyres
1834fc4ac6 Nysal noticed some repeated header files; removed.
This commit was SVN r21310.
2009-05-28 12:05:42 +00:00
Ralph Castain
17485ca604 Minor mod per Greg Watson, plus some cleanups to make George smile...or at least grimace a little less! :-)
This commit was SVN r21309.
2009-05-28 00:55:01 +00:00
Ralph Castain
cf3a5c44db Modify the xml output per devel-list discussion with Greg Watson
This commit was SVN r21285.
2009-05-27 00:43:54 +00:00
Jeff Squyres
e6a32f13bb Add missing header file
This commit was SVN r21273.
2009-05-26 20:57:44 +00:00
Iain Bason
e7ff2368d6 This fixes trac:1930.
Emit a more informative error message when the file descriptor limit is
reached during an accept() call.  Also, abort when the accept fails to
avoid an infinite loop.

Emit a more informative error message when the help file can't be opened.

This commit was SVN r21271.

The following Trac tickets were found above:
  Ticket 1930 --> https://svn.open-mpi.org/trac/ompi/ticket/1930
2009-05-26 20:03:21 +00:00
Rolf vandeVaart
db04b6ca71 This change does two things. First, do not emit error
messages when delivering a signal (like STOP or CONT)
to a non-existant process.  This fixes trac:1929.
Also, only print one error message in the other cases.

This commit was SVN r21263.

The following Trac tickets were found above:
  Ticket 1929 --> https://svn.open-mpi.org/trac/ompi/ticket/1929
2009-05-22 14:59:27 +00:00
Ralph Castain
cc7620c210 Fix orte-ps so it properly ignores/reports stale HNPs, but continues to provide output on running ones. Add a timeout on the send side of the comm so we don't hang while trying to send the info request to the non-existent HNP.
This commit was SVN r21257.
2009-05-21 02:42:21 +00:00
Jeff Squyres
154dc846b5 Remove extra ";", which caused a bazillion warnings in MTT.
This commit was SVN r21255.
2009-05-20 13:16:31 +00:00
Rainer Keller
5c80033aa2 - Eliminate icc warning w/ regard to __attribute__((__format__)) on
function pointers... Needed checking in opal_check_attributes.m4

This commit was SVN r21254.
2009-05-20 00:39:22 +00:00
Ralph Castain
fc88f04bdd Initialize var before use - thanks to Ashley for pointing out the problem
This commit was SVN r21249.
2009-05-18 14:21:29 +00:00
Ralph Castain
f139cfd28a Fully enable the use of static ports to minimize connections on mpirun. When static ports are provided, daemons will automatically use routes defined by the selected routed module to callback to mpirun during startup, thus elimating the dedicated daemon-to-mpirun connection. Therefore, the total number of connections on mpirun will equal the fanout of the routed module (instead of #nodes in job).
Add a new tm ess module that exploits this capability.

Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function.

Additional fixes included:

1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with!

2. fix a duplicate free when attempting to execute a non-existent app

3. cleanup an typo in the comm utilities 

4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps

Been tested on slurm and tm machines, using all tests in orte/test/mpi. May run into issue with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with addition of regexp generator to compress that info.

This commit was SVN r21248.
2009-05-16 04:15:55 +00:00
Ralph Castain
484a6f58f2 Repair orte-ps by updating some of the interface code. Add ability to recover from attempting to contact non-responsive HNPs due to stale session directories. Implement the -j option. Turn "off" the -p option as it doesn't work and will take a little while to actually implement it (if anyone really cares).
This commit was SVN r21245.
2009-05-15 13:21:18 +00:00
Rainer Keller
b98a095d22 - Similar to r21229, check for return code from
orte_rml_base_update_contact_info

This commit was SVN r21233.

The following SVN revision numbers were found above:
  r21229 --> open-mpi/ompi@9ad9b20847
2009-05-14 00:36:51 +00:00
Rainer Keller
9b7c4c2354 - Update comment
This commit was SVN r21232.
2009-05-14 00:32:43 +00:00
Rainer Keller
7950e37eed - Fix Coverity CID 890;
Use snprintf instead sprintf

This commit was SVN r21231.
2009-05-14 00:30:32 +00:00
Rainer Keller
b3daf7184a - Fix Coverity CID 1272:
Need to release src in case of error

 - Fix Coverity CID 1271:
   basename is allocated by opal_basename and later overwritten

This commit was SVN r21230.
2009-05-14 00:28:35 +00:00
Rainer Keller
9ad9b20847 - Fix Coverity CID #15.
This commit was SVN r21229.
2009-05-14 00:25:02 +00:00
Rainer Keller
73fd329cbd - Add the proper __opal_attribute_format__(__printf__...) to
declarations.

This commit was SVN r21226.
2009-05-14 00:10:59 +00:00
Ralph Castain
fd5dd9c4cb Ensure we correctly cycle through map_by_slot when mapping leftover procs in rankfile mapper
This commit was SVN r21219.
2009-05-12 15:41:55 +00:00