1840 Commits

Author SHA1 Message Date
Ralph Castain
7a802a9d3a Move the plm designation to the argv from the env to support those systems not set up to pass env via rsh.
This commit was SVN r21495.
2009-06-22 18:08:45 +00:00
Ralph Castain
c199dbb241 Revert r21480 - we already did open/select the PLM on non-HNP daemons. This commit broke slave launches on Torque and SLURM as it caused the PLM to be open/selected twice.
The open/select of the PLM is done in orte/mca/ess/base/ess_base_std_orted.c. It is done only when the PLM MCA param is set to direct that a specific PLM be selected. The function

orte_plm_base_orted_append_basic_args

clears the params passed to the daemon of any PLM selection passed to the HNP. Each PLM then adds a PLM directive if-and-only-if backend PLM support is desired. At present, Torque, SLURM, and rsh all specify this support and direct that the backend orted open the "rsh" PLM.

This commit was SVN r21488.

The following SVN revision numbers were found above:
  r21480 --> open-mpi/ompi@ed585bce8a
2009-06-20 03:58:00 +00:00
Camille Coti
ed585bce8a Initialize the PLM if we are a non-HNP daemon.
If we do not initialize the PLM, non-HNP daemons will not be able to use its functions. For example, RSH needs it when the tree_spawn mode is enabled: daemons call the orte_plm.remote_spawn() function to spawn their children in the deployment tree.

This commit was SVN r21480.
2009-06-19 18:50:06 +00:00
Ralph Castain
771ce035a5 Complete implementation of regular expression generator and parser - now handles leading zeros and suffixes in node names.
This commit was SVN r21468.
2009-06-18 04:36:00 +00:00
Jeff Squyres
2a5813ac2d Silence a compiler warning.
This commit was SVN r21459.
2009-06-17 12:26:38 +00:00
Ralph Castain
85e55c5087 Make the IOF macros match for debug vs optimized builds
This commit was SVN r21453.
2009-06-16 22:30:53 +00:00
Rolf vandeVaart
633b996a0f Add sys/wait.h so we can compile on Solaris.
This commit was SVN r21451.
2009-06-16 19:48:43 +00:00
Ralph Castain
e9fc0a74fb Silence compiler warnings
This commit was SVN r21445.
2009-06-16 13:34:31 +00:00
Ralph Castain
74bd80afd9 Do not preload binaries or files if the app isn't being executed on this node
This commit was SVN r21444.
2009-06-16 03:12:30 +00:00
Ralph Castain
d1dd8c2653 Ensure we accurately count the number of new daemons to be launched, especially if we are restarting processes.
Have the resilient mapper also setup for new daemons in case the PLM needs them.

This commit was SVN r21437.
2009-06-15 13:55:01 +00:00
Ralph Castain
c0c56e30c9 Add a missing function to the resilient mapper so it defines daemons in case they are needed
This commit was SVN r21428.
2009-06-12 19:48:13 +00:00
Ralph Castain
170327e575 Reorg the rmaps components to collect shared code for byslot and bynode mapping in the base so we quit duplicating it in every mapper
This commit was SVN r21424.
2009-06-12 17:52:17 +00:00
Ralph Castain
1ee4acb247 Cleanup how we handle pointer arrays in the odls base fns to avoid potential segfaults
This commit was SVN r21423.
2009-06-12 17:51:23 +00:00
Ralph Castain
70b8c89b44 Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it".
Also complete implementation to printout app_context objects so we see all the fields.

This commit was SVN r21408.
2009-06-10 19:01:08 +00:00
Ralph Castain
f24cefe3d2 Shift the check for adequate file descriptors to after we check if this proc is part of the job to be launched - no point in doing the check more often than absolutely required
This commit was SVN r21406.
2009-06-10 15:18:48 +00:00
Ralph Castain
87d7d693f0 Add a notifier call when the oob retries are exceeded so sys admins are aware of the problem
This commit was SVN r21405.
2009-06-10 15:17:16 +00:00
George Bosilca
77a6f27d44 Update the call to orte_plm_base_create_jobid based on the new interface.
This commit was SVN r21393.
2009-06-08 23:53:53 +00:00
Ralph Castain
86d55d7ebf Fix tight loops over comm_spawn by checking to see if the system has enough child procs and file descriptors available before attempting to launch. If not, introduce a 1-second delay and then test again. This provides a chance for the orted to complete processing of proc terminations from other children, hopefully creating room for the new proc(s).
Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn.

This commit was SVN r21390.
2009-06-08 18:28:26 +00:00
Shiqing Fan
5a90b3068e Two type casts.
This commit was SVN r21388.
2009-06-07 12:51:46 +00:00
Ralph Castain
0a67bcb653 Minor cleanups
This commit was SVN r21387.
2009-06-06 15:44:00 +00:00
Ralph Castain
ccf6b2cb8c Implement the ability to register callbacks when specified error states occur
This commit was SVN r21386.
2009-06-06 01:15:31 +00:00
Ralph Castain
0336460b0a Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes.
This commit was SVN r21385.
2009-06-06 01:08:47 +00:00
Ralph Castain
3815bfbba6 Provide a better error message when the oob cannot send a message after exhausting retries, and then have the proc abort so the job doesn't just hang forever.
Since it could be a daemon that needs to abort, cleanup the abort sequence so the daemon can exit as cleanly as possible.

This commit was SVN r21361.
2009-06-02 23:57:12 +00:00
Ralph Castain
30a357bd8d Provide a "progress meter" for launch that outputs progress as we are launching, especially on large jobs. Also, provide a timeout mechanism so that we cleanly abort if we don't get a response from the next daemon in a specified time.
This commit was SVN r21359.
2009-06-02 23:52:02 +00:00
Ralph Castain
303e3a1d39 Add a resilient mapping capability - currently maps by fault groups (if provided), still need to add the remapping capability for failed procs.
This commit was SVN r21350.
2009-06-02 03:23:20 +00:00
Jeff Squyres
df7c387155 I'm (temporarily?) removing this entry because there's no .window file
in this directory and it's causing "make dist" to break.

Shiqing -- is there a missing file in this directory?  If so, please
add it and restore the EXTRA_DIST line I just removed.  Thanks!

This commit was SVN r21340.
2009-06-01 14:07:08 +00:00
Ralph Castain
137104b786 Initial support for CMs - needs to be pruned as CM support develops
This commit was SVN r21335.
2009-05-30 20:57:23 +00:00
Ralph Castain
4a31c65126 Add initial support for CMs
This commit was SVN r21334.
2009-05-30 20:55:55 +00:00
Ralph Castain
a1005a716f Allow CMs to select the default errmgr component. Add support for error function callbacks.
This commit was SVN r21333.
2009-05-30 20:43:42 +00:00
Ralph Castain
a95731fc68 Minor update to let apps set their own component selections if desired, while preserving slave behavior
This commit was SVN r21332.
2009-05-30 20:42:23 +00:00
Ralph Castain
4e0223a638 Add the ability to directly launch procs via rsh/ssh. Collect common functions in plm/base. Create a new global param to set assume_same_shell, alias'd back to plm_rsh_assume_same_shell (not deprecated).
This commit was SVN r21328.
2009-05-30 01:10:25 +00:00
Jeff Squyres
5ea1b776f7 Remove a compiler warning about an empty format string. The proper
way to have no abort message is to pass NULL (the errmanager is smart
enough to handle this case and not emit any extra message).

This commit was SVN r21311.
2009-05-28 13:32:37 +00:00
Ralph Castain
17485ca604 Minor mod per Greg Watson, plus some cleanups to make George smile...or at least grimace a little less! :-)
This commit was SVN r21309.
2009-05-28 00:55:01 +00:00
Ralph Castain
cf3a5c44db Modify the xml output per devel-list discussion with Greg Watson
This commit was SVN r21285.
2009-05-27 00:43:54 +00:00
Jeff Squyres
e6a32f13bb Add missing header file
This commit was SVN r21273.
2009-05-26 20:57:44 +00:00
Iain Bason
e7ff2368d6 This fixes trac:1930.
Emit a more informative error message when the file descriptor limit is
reached during an accept() call.  Also, abort when the accept fails to
avoid an infinite loop.

Emit a more informative error message when the help file can't be opened.

This commit was SVN r21271.

The following Trac tickets were found above:
  Ticket 1930 --> https://svn.open-mpi.org/trac/ompi/ticket/1930
2009-05-26 20:03:21 +00:00
Rolf vandeVaart
db04b6ca71 This change does two things. First, do not emit error
messages when delivering a signal (like STOP or CONT)
to a nonexistent process.  This fixes trac:1929.
Also, only print one error message in the other cases.

This commit was SVN r21263.

The following Trac tickets were found above:
  Ticket 1929 --> https://svn.open-mpi.org/trac/ompi/ticket/1929
2009-05-22 14:59:27 +00:00
Ralph Castain
cc7620c210 Fix orte-ps so it properly ignores/reports stale HNPs, but continues to provide output on running ones. Add a timeout on the send side of the comm so we don't hang while trying to send the info request to the non-existent HNP.
This commit was SVN r21257.
2009-05-21 02:42:21 +00:00
Jeff Squyres
154dc846b5 Remove extra ";", which caused a bazillion warnings in MTT.
This commit was SVN r21255.
2009-05-20 13:16:31 +00:00
Rainer Keller
5c80033aa2 - Eliminate icc warning w/ regard to __attribute__((__format__)) on
function pointers... Needed checking in opal_check_attributes.m4

This commit was SVN r21254.
2009-05-20 00:39:22 +00:00
Ralph Castain
f139cfd28a Fully enable the use of static ports to minimize connections on mpirun. When static ports are provided, daemons will automatically use routes defined by the selected routed module to call back to mpirun during startup, thus eliminating the dedicated daemon-to-mpirun connection. Therefore, the total number of connections on mpirun will equal the fanout of the routed module (instead of #nodes in job).
Add a new tm ess module that exploits this capability.

Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function.

Additional fixes included:

1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with!

2. fix a duplicate free when attempting to execute a non-existent app

3. clean up a typo in the comm utilities

4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps

Tested on slurm and tm machines, using all tests in orte/test/mpi. May run into issues with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with addition of regexp generator to compress that info.

This commit was SVN r21248.
2009-05-16 04:15:55 +00:00
Rainer Keller
b98a095d22 - Similar to r21229, check for return code from
orte_rml_base_update_contact_info

This commit was SVN r21233.

The following SVN revision numbers were found above:
  r21229 --> open-mpi/ompi@9ad9b20847
2009-05-14 00:36:51 +00:00
Rainer Keller
9b7c4c2354 - Update comment
This commit was SVN r21232.
2009-05-14 00:32:43 +00:00
Rainer Keller
b3daf7184a - Fix Coverity CID 1272:
Need to release src in case of error

 - Fix Coverity CID 1271:
   basename is allocated by opal_basename and later overwritten

This commit was SVN r21230.
2009-05-14 00:28:35 +00:00
Rainer Keller
73fd329cbd - Add the proper __opal_attribute_format__(__printf__...) to
declarations.

This commit was SVN r21226.
2009-05-14 00:10:59 +00:00
Ralph Castain
fd5dd9c4cb Ensure we correctly cycle through map_by_slot when mapping leftover procs in rankfile mapper
This commit was SVN r21219.
2009-05-12 15:41:55 +00:00
Shiqing Fan
9f624cd9a2 A small fix, the right one to use in orte.
This commit was SVN r21213.
2009-05-12 09:53:34 +00:00
Ralph Castain
d396f0a6fc Per the discussion on the devel list, move the binding of processes to processors from MPI_Init to process start. This involves:
1. replacing mpi_paffinity_alone with opal_paffinity_alone - for back-compatibility, I have aliased mpi_paffinity_alone to the new param name. This causes a mild abstraction break in the opal/mca/paffinity framework - per the devel discussion...live with it. :-) I also moved the ompi_xxx global variable that tracked maffinity setup so it could be properly closed in MPI_Finalize to the opal/mca/maffinity framework to avoid an abstraction break.

2. Added code to the odls/default module to perform paffinity binding and maffinity init between process fork and exec. This has been tested on IU's odin cluster and works for both MPI and non-MPI apps.

3. Revise MPI_Init to detect if affinity has already been set, and to attempt to set it if not already done. I have *not* tested this as I haven't yet figured out a way to do so - I couldn't get slurm to perform cpu bindings, even though it supposedly does do so.

This has only been lightly tested and would definitely benefit from a wider range of evaluation...

This commit was SVN r21209.
2009-05-12 02:18:35 +00:00
Ralph Castain
fa839f4a30 Fix a bug in the rankfile mapper when nooversubscribe is set
This commit was SVN r21208.
2009-05-11 23:44:59 +00:00
Ralph Castain
388292aed5 Fix my old "friend" singleton comm_spawn
This commit was SVN r21207.
2009-05-11 14:53:02 +00:00